Fixes#3331
In #2389, we've implemented support for partitioned tables with rep > 1.
The implementation is limiting the use of modification queries on the
partitions. In fact, we error out when any partition is modified via
EnsurePartitionTableNotReplicated().
However, we seem to forgot an important case, where the parent table's
partition is marked as INVALID. In that case, at least one of the partition
becomes INVALID. However, we do not mark partitions as INVALID ever.
If the user queries the partition table directly, Citus could happily send
the query to INVALID placements -- which are not marked as INVALID.
This PR fixes it by marking the placements of the partitions as INVALID
as well.
The shard placement repair logic already re-creates all the partitions,
so should be fine in that front.
Different versions of reindent tool reformatted citus_custom_scan.c
and citus_copyfuncs.c differently. So some developers spent some
extra attention not to commit these two files after reindent.
This PR tries to address this.
* WIP
* wip
* add basic logic to run a single job with repartioning joins with adaptive executor
* fix some warnings and return in ExecuteDependedTasks if there is none
* Add the logic to run depended jobs in adaptive executor
The execution of depended tasks logic is changed. With the current
logic:
- All tasks are created from the top level task list.
- At one iteration:
- CurTasks whose dependencies are executed are found.
- CurTasks are executed in parallel with adapter executor main
logic.
- The iteration is repeated until all tasks are completed.
* Separate adaptive executor repartioning logic
* Remove duplicate parts
* cleanup directories and schemas
* add basic repartion tests for adaptive executor
* Use the first placement to fetch data
In task tracker, when there are replicas, we try to fetch from a replica
for which a map task is succeeded. TaskExecution is used for this,
however TaskExecution is not used in adaptive executor. So we cannot use
the same thing as task tracker.
Since adaptive executor fails when a map task fails (There is no retry
logic yet). We know that if we try to execute a fetch task, all of its
map tasks already succeeded, so we can just use the first one to fetch
from.
* fix clean directories logic
* do not change the search path while creating a udf
* Enable repartition joins with adaptive executor with only enable_reparitition_joins guc
* Add comments to adaptive_executor_repartition
* dont run adaptive executor repartition test in paralle with other tests
* execute cleanup only in the top level execution
* do cleanup only in the top level ezecution
* not begin a transaction if repartition query is used
* use new connections for repartititon specific queries
New connections are opened to send repartition specific queries. The
opened connections will be closed at the FinishDistributedExecution.
While sending repartition queries no transaction is begun so that
we can see all changes.
* error if a modification was done prior to repartition execution
* not start a transaction if a repartition query and sql task, and clean temporary files and schemas at each subplan level
* fix cleanup logic
* update tests
* add missing function comments
* add test for transaction with DDL before repartition query
* do not close repartition connections in adaptive executor
* rollback instead of commit in repartition join test
* use close connection instead of shutdown connection
* remove unnecesary connection list, ensure schema owner before removing directory
* rename ExecuteTaskListRepartition
* put fetch query string in planner not executor as we currently support only replication factor = 1 with adaptive executor and repartition query and we know the query string in the planner phase in that case
* split adaptive executor repartition to DAG execution logic and repartition logic
* apply review items
* apply review items
* use an enum for remote transaction state and fix cleanup for repartition
* add outside transaction flag to find connections that are unclaimed instead of always opening a new transaction
* fix style
* wip
* rename removejobdir to partition cleanup
* do not close connections at the end of repartition queries
* do repartition cleanup in pg catch
* apply review items
* decide whether to use transaction or not at execution creation
* rename isOutsideTransaction and add missing comment
* not error in pg catch while doing cleanup
* use replication factor of the creation time, not current time to decide if task tracker should be chosen
* apply review items
* apply review items
* apply review item
DESCRIPTION: Fix counter that keeps track of internal depth in executor
While reviewing #3302 I ran into the `ExecutorLevel` variable which used a variable to keep the original value to restore on successful exit. I haven't explored the full space and if it is possible to get into an inconsistent state. However using `PG_TRY`/`PG_CATCH` seems generally more correct.
Given very bad things will happen if this level is not reset, I kept the failsafe of setting the variiable back to 0 on the `XactCallback` but I did add an assert to treat it as a developer bug.
Currently in mx isolation tests the setup is the same except the creation of tables. Isolation framework lets us define multiple `setup` stages, therefore I thought that we can put the `mx_setup` to one file and prepend this prior to running tests.
How the structure works:
- cpp is used before running isolation tests to preprocess spec files. This way we can include any file we want to. Currently this is used to include mx common part.
- spec files are put to `/build/specs` for clear separation between generated files and template files
- a symbolic link is created for `/expected` in `build/expected/`.
- when running isolation tests, as the `inputdir`, `build` is passed so it runs the spec files from `build/specs` and checks the expected output from `build/expected`.
`/specs` is renamed as `/spec` because postgres first look at the `specs` file under current directory, so this is renamed to avoid that since we are running the isolation tests from `build/specs` now.
Note: now we use `//` instead of `#` in comments in spec files, because cpp interprets `#` as a directive and it ignores `//`.
Postgres keeps track of recursive CTEs in the queryTree in two ways:
- queryTree->hasRecursive is set to true, whenever a RECURSIVE CTE
is used in the SQL. Citus checks for it
- If the CTE is actually a recursive one (a.k.a., references itself)
Postgres marks CommonTableExpr->cterecursive as true as well
The tests that are changed in the PR doesn't cover (b), and this becomes
an issue with CTE inlining (#3161). In that case, Citus/Postgres can inline
such CTEs, and the queries works with Citus.
However, this tests intend to check if there is any recursive CTE in the queryTree.
So, we're actually making the CTEs recursive CTEs by referring itself.
We'll add cases where a recursive CTE works by inlining in #3161.
Use partition column's collation for range distributed tables
Don't allow non deterministic collations for hash distributed tables
CoPartitionedTables: don't compare unequal types
Test ALTER ROLE doesn't deadlock when coordinator added, or propagate from mx workers
Consolidate wait_until_metadata_sync & verify_metadata to multi_test_helpers
Previously,
- we'd push down ORDER BY, but this doesn't order intermediate results between workers
- we'd keep FILTER on master aggregate, which would raise an error about unexpected cstrings
DESCRIPTION: add gitref to the output of citus_version
During debugging of custom builds it is hard to know the exact version of the citus build you are using. This patch will add a human readable/understandable git reference to the build of citus which can be retrieved by calling `citus_version();`.
Support for ARRAY[] expressions is limited to having a consistent shape,
eg ARRAY[(int,text),(int,text)] as opposed to ARRAY[(int,text),(float,text)] or ARRAY[(int,text),(int,text,float)]
Initialization of queryWindowClause and queryOrderByLimit "memset" underflow these variables.
It's possible due to the invalid usage sizeof this part of the program cause buffer overflow and function return data corruption in future changes.
* Improve extension command propagation tests
* patch for hardcoded citus extension name
(cherry picked from commit 0bb3dbac0afabda10e8928f9c17eda048dc4361a)
In plain words, each distributed plan pulls the necessary intermediate
results to the worker nodes that the plan hits. This is primarily useful
in three ways.
(i) If the distributed plan that uses intermediate
result(s) is a router query, then the intermediate results are only
broadcasted to a single node.
(ii) If a distributed plan consists of only intermediate results, which
is not uncommon, the intermediate results are broadcasted to a single
node only.
(iii) If a distributed query hits a sub-set of the shards in multiple
workers, the intermediate results will be broadcasted to the relevant
node(s).
The final item (iii) becomes crucial for append/range distributed
tables where typically the distributed queries hit a small subset of
shards/workers.
To do this, for each query that Citus creates a distributed plan, we keep
track of the subPlans used in the queryTree, and save it in the distributed
plan. Just before Citus executes each subPlan, Citus first keeps track of
every worker node that the distributed plan hits, and marks every subPlan
should be broadcasted to these nodes. Later, for each subPlan which is a
distributed plan, Citus does this operation recursively since these
distributed plans may access to different subPlans, and those have to be
recorded as well.
Prevent Citus extension being distributed
Because that could prevent doing rolling upgrades, where users may
prefer to upgrade the version on the coordinator but not the workers.
There could be some other edge cases, so I'd prefer to keep Citus
extension outside the picture for now.
DESCRIPTION: Expression in reference join
Fixed: #2582
This patch allows arbitrary expressions in the join clause when joining to a reference table. An example of such joins could be found in CHbenCHmark queries 7, 8, 9 and 11; `mod((s_w_id * s_i_id),10000) = su_suppkey` and `ascii(substr(c_state,1,1)) = n2.n_nationkey`. Since the join is on a reference table these queries are able to be pushed down to the workers.
To implement these queries we will widen the `IsJoinClause` predicate to not check if the expressions are a type `Var` after stripping the implicit coerciens. Instead we define a join clause when the `Var`'s in a clause come from more than 1 table.
This allows more clauses to pass into the logical planner's `MultiNodeTree(...)` planning function. To compensate for this we tighten down the `LocalJoin`, `SinglePartitionJoin` and `DualPartitionJoin` to check for direct column references when planning. This allows the planner to work with arbitrary join expressions on reference tables.
With this commit, we're slightly changing the dependency traversal
logic to enable extension propagation.
The main idea is to "follow" the extension dependencies, but do not
"apply" them.
Since some extension dependencies are base types, and base types
could have circular dependencies, we implement a logic to prevent
revisiting an already visited object.
When the user picks "round-robin" policy, the aim is that the load
is distributed across nodes. However, for reference tables on the
coordinator, since local execution kicks in immediately, round-robin
is ignored.
With this change, we're excluding the placement on the coordinator.
Although the approach seems a little bit invasive because of
modifications in the placement list, that sounds acceptable.
We could have done this in some other ways such as:
1) Add a field to "Task->roundRobinPlacement" (or such), which is
updated as the first element after RoundRobinPolicy is applied.
During the execution, if that placement is local to the coordinator,
skip it and try the other remote placements.
2) On TaskAccessesLocalNode()@local_execution.c, check
task_assignment_policy, if round-robin selected and there is local
placement on the coordinator, skip it. However, task assignment is done
on planning, but this decision is happening on the execution, which
could create weird edge cases.
This change was actually already intended in #3124. However, the
postgres Makefile manually enables this warning too. This way we undo
that.
To confirm that it works two functions were changed to make use of not
having the warning anymore.
Phase 1 seeks to implement minimal infrastructure, so does not include:
- dynamic generation of support aggregates to handle multiple arguments
- configuration methods to direct aggregation strategy,
or mark an aggregate's serialize/deserialize as safe to operate across nodes
Aggregates can be distributed when:
- they have a single argument
- they have a combinefunc
- their transition type is not a pseudotype
This is necassery to support Q20 of the CHbenCHmark: #2582.
To summarize the fix: The subquery is converted into an INNER JOIN on a
table. This fixes the issue, since an INNER JOIN on a table is already
supported by the repartion planner.
The way this replacement is happening.:
1. Postgres replaces `col in (subquery)` with a SEMI JOIN (subquery) on col = subquery_result
2. If this subquery is simple enough Postgres will replace it with a
regular read from a table
3. If the subquery returns unique results (e.g. a primary key) Postgres
will convert the SEMI JOIN into an INNER JOIN during the planning. It
will not change this in the rewritten query though.
4. We check if Postgres sends us any SEMI JOINs during its join order
planning, if it doesn't we replace all SEMI JOINs in the rewritten
query with INNER JOIN (which we already support).
Since we've removed the executor, we don't need the specific tests.
Since the tests are already using adaptive executor, they were passing.
But, we've plenty of extra tests for adaptive executor, so seems safe
to remove.
Postgres doesn't require you to add all columns that are in the target list to
the GROUP BY when you group by a unique column (or columns). It even actively
removes these group by clauses when you do.
This is normally fine, but for repartition joins it is not. The reason for this
is that the temporary tables don't have these primary key columns. So when the
worker executes the query it will complain that it is missing columns in the
group by.
This PR fixes that by adding an ANY_VALUE aggregate around each variable in
the target list that does is not contained in the group by or in an aggregate.
This is done only for repartition joins.
The ANY_VALUE aggregate chooses the value from an undefined row in the
group.
It looks like the logic to prevent RETURNING in reference tables to
have duplicate entries that comes from local and remote executions
leads to missing some tuples for distributed tables.
With this PR, we're ensuring to kick in the logic for reference tables
only.
* Remove unused executor codes
All of the codes of real-time executor. Some functions
in router executor still remains there because there
are common functions. We'll move them to accurate places
in the follow-up commits.
* Move GUCs to transaction mngnt and remove unused struct
* Update test output
* Get rid of references of real-time executor from code
* Warn if real-time executor is picked
* Remove lots of unused connection codes
* Removed unused code for connection restrictions
Real-time and router executors cannot handle re-using of the existing
connections within a transaction block.
Adaptive executor and COPY can re-use the connections. So, there is no
reason to keep the code around for applying the restrictions in the
placement connection logic.
We've changed the logic for pulling RTE_RELATIONs in #3109 and
non-colocated subquery joins and partitioned tables.
@onurctirtir found this steps where I traced back and found the issues.
While looking into it in more detail, we decided to expand the list in a
way that the callers get all the relevant RTE_RELATIONs RELKIND_RELATION,
RELKIND_PARTITIONED_TABLE, RELKIND_FOREIGN_TABLE and RELKIND_MATVIEW.
These are all relation kinds that Citus planner is aware of.
When citus.enable_repartition_joins guc is set to on, and we have
adaptive executor, there was a typo in the debug message, which was
saying realtime executor no adaptive executor.
See #3125 for details on each item.
* Remove real-time/router executor tests-1
These are the ones which doesn't have '_%d' in the test
output files.
* Remove real-time/router executor tests-2
These are the ones which has in the test
output files.
* Move the tests outputs to correct place
* Make sure that single shard commits use 2PC on adaptive executor
It looks like we've messed the tests in #2891. Fixing back.
* Use adaptive executor for all router queries
This becomes important because when task-tracker is picked, we
used to pick router executor, which doesn't make sense.
* Remove explicit references to real-time/router executors in the tests
* JobExecutorType never picks real-time/router executors
* Make sure to go incremental in test output numbers
* Even users cannot pick real-time anymore
* Do not use real-time/router custom scans
* Get rid of unnecessary normalizations
* Reflect unneeded normalizations
* Get rid of unnecessary test output file
This completely hides `ListCell` to the user of the loop
Example usage:
```c
WorkerNode *workerNode = NULL;
foreach_ptr(workerNode, workerNodeList) {
// Do stuff with workerNode
}
```
Instead of:
```c
ListCell *workerNodeCell = NULL;
foreach(cell, workerNodeList) {
WorkerNode *workerNode = lfirst(workerNodeCell);
// Do stuff with workerNode
}
```
It turns out that TupleDescGetAttInMetadata() allocates quite a lot
of memory. And, if the target list is long and there are too many rows
returning, the leak becomes appereant.
You can reproduce the issue wout the fix with the following commands:
```SQL
CREATE TABLE users_table (user_id int, time timestamp, value_1 int, value_2 int, value_3 float, value_4 bigint);
SELECT create_distributed_table('users_table', 'user_id');
insert into users_table SELECT i, now(), i, i, i, i FROM generate_series(0,99999)i;
-- load faster
-- 200,000
INSERT INTO users_table SELECT * FROM users_table;
-- 400,000
INSERT INTO users_table SELECT * FROM users_table;
-- 800,000
INSERT INTO users_table SELECT * FROM users_table;
-- 1,600,000
INSERT INTO users_table SELECT * FROM users_table;
-- 3,200,000
INSERT INTO users_table SELECT * FROM users_table;
-- 6,400,000
INSERT INTO users_table SELECT * FROM users_table;
-- 12,800,000
INSERT INTO users_table SELECT * FROM users_table;
-- making the target list entry wider speeds up the leak to show up
select *,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,* FROM users_table ;
```
This is an improvement over #2512.
This adds the boolean shouldhaveshards column to pg_dist_node. When it's false, create_distributed_table for new collocation groups will not create shards on that node. Reference tables will still be created on nodes where it is false.
Areas for further optimization:
- Don't save subquery results to a local file on the coordinator when the subquery is not in the having clause
- Push the the HAVING with subquery to the workers if there's a group by on the distribution column
- Don't push down the results to the workers when we don't push down the HAVING clause, only the coordinator needs it
Fixes#520Fixes#756Closes#2047
* add support to run citus upgrade tests locally
* dont build tars if they already exist
* use current code instead of master for upgrade
* always build the current code
* copy the current citus code to have isolated citus upgrade tests
* fix configure and simplify copy
DESCRIPTION: Fix order for enum values and correctly support pg12
PG 12 introduces `ALTER TYPE ... ADD VALUE ...` during transactions. Earlier versions would error out when called in a transaction, hence we connect to workers outside of the transaction which could cause inconsistencies on pg12 now that postgres doesn't error with this syntax anymore.
During the implementation of this fix it became apparent there was an error with the ordering of enum labels when the type was recreated. A patch and test have been included.
Objectives:
(a) both super user and regular user should have the correct owner for the function on the worker
(b) The transactional semantics would work fine for both super user and regular user
(c) non-super-user and non-function owner would get a reasonable error message if tries to distribute the function
Co-authored-by: @serprex
* Add initial citus upgrade test
* Add restart databases and run tests in all nodes
* Add output for citus versions 8.0 8.1 8.2 and 8.3
* Add verify step for citus upgrade
* Add target for citus upgrade test in makefile
* Add check citus upgrade job
* Fix installation file path and add missing tar
* Run citus upgrade for v8.0 v8.1 v8.2 and v8.3
* Create upgrade_common file and rename upgrade check
* Add pg version to citus upgrade test
* Test with postgres 10 and 11 in citus upgrade tests
* Add readme for citus upgrade test
* Add some basic tests to citus upgrade tests
* Add citus upgrade mixed mode test
* Remove citus artifacts before installing another one
* Refactor citus upgrade test according to reviews
* quick and dirty rewrite of citus upgrade tests to support local execution.
I think we need to change the makefile in such a way that the tar files can be injected from the circle ci config file.
Also I removed some of the citus version checks you had to not have the requirement to pass that in separately from the pre tar file. I am not super happy with it, but two flags that need to be kept in sync is also not desirable. Instead I print out the citus version that is installed per node. This will not cause a failure if they are not what one would expect but it lets us verify we are running the expected version.
* use latest citusupgradetester in circleci
* update readme and use common alias for upgrade_common import
* Add PG12 test outputs
* Add jobs to run tests with pg 12
* use POSIX collate for compatibility between pg10/pg11/pg12
* do not override the new default value when running vanilla tests
* fix 2 problems with pg12 tests
* update pg12 images with pg12 rc1
* remove pg10 jobs
* Revert "Add PG12 test outputs"
This reverts commit f3545b92ef.
* change images to use latest instead of dev
* add missing coverage flags
DESCRIPTION: Disallow distributed functions for functions depending on an extension
Functions depending on an extension cannot (yet) be distributed by citus. If we would allow this it would cause issues with our dependency following mechanism as we stop following objects depending on an extension.
By not allowing functions to be distributed when they depend on an extension as well as not allowing to make distributed functions depend on an extension we won't break the ability to add new nodes. Allowing functions depending on extensions to be distributed at the moment could cause problems in that area.
DESCRIPTION: Propagate CREATE OR REPLACE FUNCTION
Distributed functions could be replaced, which should be propagated to the workers to keep the function in sync between all nodes.
Due to the complexity of deparsing the `CreateFunctionStmt` we actually produce the plan during the processing phase of our utilityhook. Since the changes have already been made in the catalog tables we can reuse `pg_get_functiondef` to get us the generated `CREATE OR REPLACE` sql.
DESCRIPTION: Propagate ALTER FUNCTION statements for distributed functions
Using the implemented deparser for function statements to propagate changes to both functions and procedures that are previously distributed.
This PR aims to add all the necessary logic to qualify and deparse all possible `{ALTER|DROP} .. {FUNCTION|PROCEDURE}` queries.
As Procedures are introduced in PG11, the code contains many PG version checks. I tried my best to make it easy to clean up once we drop PG10 support.
Here are some caveats:
- I assumed that the parse tree is a valid one. There are some queries that are not allowed, but still are parsed successfully by postgres planner. Such queries will result in errors in execution time. (e.g. `ALTER PROCEDURE p STRICT` -> `STRICT` action is valid for functions but not procedures. Postgres decides to parse them nevertheless.)
When a function is marked as colocated with a distributed table,
we try delegating queries of kind "SELECT func(...)" to workers.
We currently only support this simple form, and don't delegate
forms like "SELECT f1(...), f2(...)", "SELECT f1(...) FROM ...",
or function calls inside transactions.
As a side effect, we also fix the transactional semantics of DO blocks.
Previously we didn't consider a DO block a multi-statement transaction.
Now we do.
Co-authored-by: Marco Slot <marco@citusdata.com>
Co-authored-by: serprex <serprex@users.noreply.github.com>
Co-authored-by: pykello <hadi.moshayedi@microsoft.com>
In this PR the default `threshold` of `rebalance_table_shards` was set to 0: https://github.com/citusdata/shard_rebalancer/pull/73
However, the default for get_rebalance_table_shards_plan was not updated. This
can cause the confusing situation where the actual steps run by
`rebalance_table_shards` are not the same as the ones returned by
`get_rebalance_table_shards_plan`.
We started copying parse trees by default further on in `multi_ProcessUtility`. That's not a problem for maintenance command, but might register for things like `PREPARE` and `EXECUTE`, which might happen thousands of times per second. Add a few common commands to the check at the start.
Since the distributed functions are useful when the workers have
metadata, we automatically sync it.
Also, after master_add_node(). We do it lazily and let the deamon
sync it. That's mainly because the metadata syncing cannot be done
in transaction blocks, and we don't want to add lots of transactional
limitations to master_add_node() and create_distributed_function().
* Enhance pg upgrade tests
* Add a specific upgrade test for pg_dist_partition
We store the index of distribution column, and when a column with an
index that is smaller than distribution column index is dropped before
an upgrade, the index should still match the distribution column after
an upgrade
With this commit, we're changing the API for create_distributed_function()
such that users can provide the distribution argument and the colocation
information.
We've recently merged two commits, db5d03931d
and eccba1d4c3, which actually operates
on the very similar places.
It turns out that we've an integration issue, where master_add_node()
fails to replicate the functions to newly added node.
DESCRIPTION: Provide a GUC to turn of the new dependency propagation functionality
In the case the dependency propagation functionality introduced in 9.0 causes issues to a cluster of a user they can turn it off almost completely. The only dependency that will still be propagated and kept track of is the schema to emulate the old behaviour.
GUC to change is `citus.enable_object_propagation`. When set to `false` the functionality will be mostly turned off. Be aware that objects marked as distributed in `pg_dist_object` will still be kept in the catalog as a distributed object. Alter statements to these objects will not be propagated to workers and may cause desynchronisation.
DESCRIPTION: Rename remote types during type propagation
To prevent data to be destructed when a remote type differs from the type on the coordinator during type propagation we wanted to rename the type instead of `DROP CASCADE`.
This patch removes the `DROP` logic and adds the creation of a rename statement to a free name.
DESCRIPTION: Add feature flag to turn off create type propagation
When `citus.enable_create_type_propagation` is set to `false` citus will not propagate `CREATE TYPE` statements to the workers. Types are still distributed when tables that depend on these types are distributed.
This PR simply adds the columns to pg_dist_object and
implements the necessary metadata changes to keep track of
distribution argument of the functions/procedures.
A better fix for #2975. Apparently for OSX cpp -MF and -MT shouldn't have a
space in between the flag and their value. Without the space it still works for
gcc as well.
This PR aims to add the minimal set of changes required to start
distributing functions. You can use create_distributed_function(regproc)
UDF to distribute a function.
SELECT create_distributed_function('add(int,int)');
The function definition should include the param types to properly
identify the correct function that we wish to distribute
@thanodnl told me it was a bit of a problem that it's impossible to see
the history of a UDF in git. The only way to do so is by reading all the
sql migration files from new to old. Another problem is that it's also
hard to review the changed UDF during code review, because to find out
what changed you have to do the same. I thought of a IMHO better (but
not perfect) way to handle this.
We keep the definition of a UDF in sql/udfs/{name_of_udf}/latest.sql.
That file we change whenever we need to make a change to the the UDF. On
top of that you also make a snapshot of the file in
sql/udfs/{name_of_udf}/{migration-version}.sql (e.g. 9.0-1.sql) by
copying the contents. This way you can easily view what the actual
changes were by looking at the latest.sql file.
There's still the question on how to use these files then. Sadly
postgres doesn't allow inclusion of other sql files in the migration sql
file (it does in psql using \i). So instead I used the C preprocessor+
make to compile a sql/xxx.sql to a build/sql/xxx.sql file. This final
build/sql/xxx.sql file has every occurence of #include "somefile.sql" in
sql/xxx.sql replaced by the contents of somefile.sql.
DESCRIPTION: Distribute Types to worker nodes
When to propagate
==============
There are two logical moments that types could be distributed to the worker nodes
- When they get used ( just in time distribution )
- When they get created ( proactive distribution )
The just in time distribution follows the model used by how schema's get created right before we are going to create a table in that schema, for types this would be when the table uses a type as its column.
The proactive distribution is suitable for situations where it is benificial to have the type on the worker nodes directly. They can later on be used in queries where an intermediate result gets created with a cast to this type.
Just in time creation is always the last resort, you cannot create a distributed table before the type gets created. A good example use case is; you have an existing postgres server that needs to scale out. By adding the citus extension, add some nodes to the cluster, and distribute the table. The type got created before citus existed. There was no moment where citus could have propagated the creation of a type.
Proactive is almost always a good option. Types are not resource intensive objects, there is no performance overhead of having 100's of types. If you want to use them in a query to represent an intermediate result (which happens in our test suite) they just work.
There is however a moment when proactive type distribution is not beneficial; in transactions where the type is used in a distributed table.
Lets assume the following transaction:
```sql
BEGIN;
CREATE TYPE tt1 AS (a int, b int);
CREATE TABLE t1 AS (a int PRIMARY KEY, b tt1);
SELECT create_distributed_table('t1', 'a');
\copy t1 FROM bigdata.csv
```
Types are node scoped objects; meaning the type exists once per worker. Shards however have best performance when they are created over their own connection. For the type to be visible on all connections it needs to be created and committed before we try to create the shards. Here the just in time situation is most beneficial and follows how we create schema's on the workers. Outside of a transaction block we will just use 1 connection to propagate the creation.
How propagation works
=================
Just in time
-----------
Just in time propagation hooks into the infrastructure introduced in #2882. It adds types as a supported object in `SupportedDependencyByCitus`. This will make sure that any object being distributed by citus that depends on types will now cascade into types. When types are depending them self on other objects they will get created first.
Creation later works by getting the ddl commands to create the object by its `ObjectAddress` in `GetDependencyCreateDDLCommands` which will dispatch types to `CreateTypeDDLCommandsIdempotent`.
For the correct walking of the graph we follow array types, when later asked for the ddl commands for array types we return `NIL` (empty list) which makes that the object will not be recorded as distributed, (its an internal type, dependant on the user type).
Proactive distribution
---------------------
When the user creates a type (composite or enum) we will have a hook running in `multi_ProcessUtility` after the command has been applied locally. Running after running locally makes that we already have an `ObjectAddress` for the type. This is required to mark the type as being distributed.
Keeping the type up to date
====================
For types that are recorded in `pg_dist_object` (eg. `IsObjectDistributed` returns true for the `ObjectAddress`) we will intercept the utility commands that alter the type.
- `AlterTableStmt` with `relkind` set to `OBJECT_TYPE` encapsulate changes to the fields of a composite type.
- `DropStmt` with removeType set to `OBJECT_TYPE` encapsulate `DROP TYPE`.
- `AlterEnumStmt` encapsulates changes to enum values.
Enum types can not be changed transactionally. When the execution on a worker fails a warning will be shown to the user the propagation was incomplete due to worker communication failure. An idempotent command is shown for the user to re-execute when the worker communication is fixed.
Keeping types up to date is done via the executor. Before the statement is executed locally we create a plan on how to apply it on the workers. This plan is executed after we have applied the statement locally.
All changes to types need to be done in the same transaction for types that have already been distributed and will fail with an error if parallel queries have already been executed in the same transaction. Much like foreign keys to reference tables.
For another PR I needed to add another column which would require to add
another argument to an already 9 argument function signature. In this
case it would be a boolean flag and there were already two boolean flags
in there. In my experience it becomes really easy to mess up the order
of these flags at that point. Especially because the type system doesn't
distinguish between the 3 different booleans with completely different
meanings.
So I refactored these signatures to receive a struct containing most of
these arguments. Like that you don't mess up orderening, because the
meaning of the boolean is not order dependent but fieldname dependent.
It also makes it possible to set good shared defaults for this struct.
DESCRIPTION: Fix schema leak on CREATE INDEX statement
When a CREATE INDEX is cached between execution we might leak the schema name onto the cached statement of an earlier execution preventing the right index to be created.
Even though the cache is cleared when the search_path changes we can trigger this behaviour by having the schema already on the search path before a colliding table is created in a schema earlier on the `search_path`. When calling an unqualified create index via a function (used to trigger the caching behaviour) we see that the index is created on the wrong table after the schema leaked onto the statement.
By copying the complete `PlannedStmt` and `utilityStmt` during our planning phase for distributed ddls we make sure we are not leaking the schema name onto a cached data structure.
Caveat; COPY statements already have a lot of parsestree copying ongoing without directly putting it back on the `pstmt`. We should verify that copies modify the statement and potentially copy the complete `pstmt` there already.
/*
* local_executor.c
*
* The scope of the local execution is locally executing the queries on the
* shards. In other words, local execution does not deal with any local tables
* that are not shards on the node that the query is being executed. In that sense,
* the local executor is only triggered if the node has both the metadata and the
* shards (e.g., only Citus MX worker nodes).
*
* The goal of the local execution is to skip the unnecessary network round-trip
* happening on the node itself. Instead, identify the locally executable tasks and
* simply call PostgreSQL's planner and executor.
*
* The local executor is an extension of the adaptive executor. So, the executor uses
* adaptive executor's custom scan nodes.
*
* One thing to note that Citus MX is only supported with replication factor = 1, so
* keep that in mind while continuing the comments below.
*
* On the high level, there are 3 slightly different ways of utilizing local execution:
*
* (1) Execution of local single shard queries of a distributed table
*
* This is the simplest case. The executor kicks at the start of the adaptive
* executor, and since the query is only a single task the execution finishes
* without going to the network at all.
*
* Even if there is a transaction block (or recursively planned CTEs), as long
* as the queries hit the shards on the same, the local execution will kick in.
*
* (2) Execution of local single queries and remote multi-shard queries
*
* The rule is simple. If a transaction block starts with a local query execution,
* all the other queries in the same transaction block that touch any local shard
* have to use the local execution. Although this sounds restrictive, we prefer to
* implement in this way, otherwise we'd end-up with as complex scenarious as we
* have in the connection managements due to foreign keys.
*
* See the following example:
* BEGIN;
* -- assume that the query is executed locally
* SELECT count(*) FROM test WHERE key = 1;
*
* -- at this point, all the shards that reside on the
* -- node is executed locally one-by-one. After those finishes
* -- the remaining tasks are handled by adaptive executor
* SELECT count(*) FROM test;
*
*
* (3) Modifications of reference tables
*
* Modifications to reference tables have to be executed on all nodes. So, after the
* local execution, the adaptive executor keeps continuing the execution on the other
* nodes.
*
* Note that for read-only queries, after the local execution, there is no need to
* kick in adaptive executor.
*
* There are also few limitations/trade-offs that is worth mentioning. First, the
* local execution on multiple shards might be slow because the execution has to
* happen one task at a time (e.g., no parallelism). Second, if a transaction
* block/CTE starts with a multi-shard command, we do not use local query execution
* since local execution is sequential. Basically, we do not want to lose parallelism
* across local tasks by switching to local execution. Third, the local execution
* currently only supports queries. In other words, any utility commands like TRUNCATE,
* fails if the command is executed after a local execution inside a transaction block.
* Forth, the local execution cannot be mixed with the executors other than adaptive,
* namely task-tracker, real-time and router executors. Finally, related with the
* previous item, COPY command cannot be mixed with local execution in a transaction.
* The implication of that any part of INSERT..SELECT via coordinator cannot happen
* via the local execution.
*/
Before this patch, when a connection is lost, we'd have the following
situation:
- Pop a task execution from readyQueue
- Lost connection
- Fail the session/pool. -> This step was not acting properly
because we've popped the task, but not set to session->currentTask
yet
After the patch:
- Pop a task execution from readyQueue
- Immediately set it to session->currentTask
- Lost connection
- Fail the session/pool. -> At this step, failing the
session would trigger query failures (or failovers)
properly.
* Add creating a citus cluster script
Creating a citus cluster is automated.
Before running this script:
- Citus should be installed and its control file should be added to postgres. (make install)
- Postgres should be installed.
* Initialize upgrade test table and fill
* Finalize the layout of upgrade tests
Postgres upgrade function is added.
The newly added UDFs(citus_prepare_pg_upgrade, citus_finish_pg_upgrade) are used to
perform upgrade.
* Refactor upgrade test and add config file
* Add schedules for upgrade testing
* Use pg_regress for upgrade tests
pg_regress is used for creating a simple distributed table in
upgrade tests. After upgrading another schedule is used to verify
that the distributed table exists. Router and realtime queries are
used for verifying.
* Run upgrade tests as a postgres user in a temp dir
postgres user is used for psql to be consistent at running tests.
A temp dir is created and the temp dir's permissions are changed so
that postgres user can access it. All psql commands are now run with
postgres user.
"Select * from t" query is changed as "Select * from t order by a"
so that the result is always in the same order.
* Add docopt and arguments for the upgrade script
Docopt dependency is added to parse flags in script.
Some refactoring in variable names is done.
* Add readme for upgrade tests
* Refactor upgrade tests
Use relative data path instead of absolute assuming that this script will
always be run from 'src/test/regress'
Remove 'citus-path' flag
Use specific version for docopt instead of *
Use named args in string formatting
* Resolve a security problem
Instead of using string formatting in subprocess.call, arguments
list is used. Otherwise users could do shell injection.
Shell = True is removed from subprocess call as it is not recommended
to use this.
* Add how the test works to readme
* Refactor some variables to be consistent
* Update upgrade script based on the reviews
It was possible that postgres server would stay running even when the script
crashes, atexit library is used to ensure that we always do a teardown where we stop
the databases.
Some formatting is done in the code for better readability.
Config class is used instead of a dictonary.
A target for upgrade test is added to makefile.
Unused flags/functions/variables are removed.
* Format commands and remove unnecessary flag from readme
This is a bug that got in when we inlined the body of a function into this loop. Earlier revisions had two loops, hence a function that would be reused.
With a return instead of a continue the list of dependencies being walked is dependent on the order in which we find them in pg_depend. This became apparent during pg12 compatibility. The order of entries in pg12 was luckily different causing a random test to fail due to this return.
By changing it to a continue we only skip the entries that we don’t want to follow instead of skipping all entries that happen to be found later.
sidefix for more stable isolation tests around ensure dependency
DESCRIPTION: Refactor ensure schema exists to dependency exists
Historically we only supported schema's as table dependencies to be created on the workers before a table gets distributed. This PR puts infrastructure in place to walk pg_depend to figure out which dependencies to create on the workers. Currently only schema's are supported as objects to create before creating a table.
We also keep track of dependencies that have been created in the cluster. When we add a new node to the cluster we use this catalog to know which objects need to be created on the worker.
Side effect of knowing which objects are already distributed is that we don't have debug messages anymore when creating schema's that are already created on the workers.
* Add tuplestore helpers
* More detailed error messages in tuplestore
* Add CreateTupleDescCopy to SetupTuplestore
* Use new SetupTuplestore helper function
* Remove unnecessary copy
* Remove comment about undefined behaviour
See a9c35cf85c
clang raises a warning due to FunctionCall2InfoData technically being variable sized
This is fine, as the struct is the size we want it to be. So silence the warning
master_deactivate_node is updated to decrement the replication factor
Otherwise deactivation could have create_reference_table produce a second record
UpdateColocationGroupReplicationFactor is renamed UpdateColocationGroupReplicationFactorForReferenceTables
& the implementation looks up the record based on distributioncolumntype == InvalidOid, rather than by id
Otherwise the record's replication factor fails to be maintained when there are no reference tables
DESCRIPTION: Add functions to help with postgres upgrades
Currently there is [a list of manual steps](https://docs.citusdata.com/en/v8.2/admin_guide/upgrading_citus.html?highlight=upgrade#upgrading-postgresql-version-from-10-to-11) to perform during a postgres upgrade. These steps guarantee our catalog tables are kept and counter values are maintained across upgrades.
Having more than 1 command in our docs for users to manually execute during upgrades is error prone for both the user, and our docs. There are already 2 catalog tables that have been introduced to citus that have not been added to our docs for backing up during upgrades (`pg_authinfo` and `pg_dist_poolinfo`).
As we add more functionality to citus we run into situations where there are more steps required either before or after the upgrade. At the same time, when we move catalog tables to a place where the contents will be maintained automatically during upgrades we could have less steps in our docs. This will come to a hard to maintain matrix of citus versions and steps to be performed.
Instead we could take ownership of these steps within the extension itself. This PR introduces two new functions for the user to use instead of long lists of error prone instructions to follow.
- `citus_prepare_pg_upgrade`
This function should be called by the user right before shutting down the cluster. This will ensure all citus catalog tables are backed up in a location where the information will be retained during an upgrade.
- `citus_finish_pg_upgrade`
This function should be called right after a pg_upgrade of the cluster. This will restore the catalog tables to the state before the upgrade happend.
Both functions need to be executed both on the coordinator and on all the workers, in the same fashion our current documentation instructs to do.
There are two known problems with this function in its current form, which is also a problem with our docs. We should schedule time in the future to improve on this, but having it automated now is better as we are about to add extra steps to take after upgrades.
- When you install citus in a clean cluster we do enable ssl for communication between the coordinator and the workers. If an upgrade to a clean cluster is performed we do not setup ssl on the new cluster causing the communication to fail.
- There are no automated tests added in this PR to execute an upgrade test durning every build.
Our current test infrastructure does not allow for 2 versions of postgres to exist in the same environment. We will need to invest time to create a new testing harness that could run the following scenario:
1. Create cluster
2. Run extensible scripts to execute arbitrary statements on this cluster
3. Perform an upgrade by preparing, upgrading and finishing
4. Run extensible scripts to verify all objects created by earlier scripts exists in correct form in the upgraded cluster
Given the non trivial amount of work involved for such a suite I'd like to land this before we have
automated testing.
On a side note; As the reviewer noticed, the tables created in the public namespace are not visible in `psql` with `\d`. The backup catalog tables have the same name as the tables in `pg_catalog`. Due to postgres internals `pg_catalog` is first in the search path and therefore the non-qualified name would alwasy resolve to `pg_catalog.pg_dist_*`. Internally this is called a non-visible table as it would resolve to a different table without a qualified name. Only visible tables are shown with `\d`.
Before this commit, we've recorded the relation accesses in 3 different
places
- FindPlacementListConnection -- applies all executor in tx block
- StartPlacementExecutionOnSession() -- adaptive executor only
- StartPlacementListConnection() -- router/real-time only
This is different than Citus 8.2, and could lead to query execution times
increase considerably on multi-shard commands in transaction block
that are on partitioned tables.
Benchmarks:
```
1+8 c5.4xlarge cluster
Empty distributed partitioned table with 365 partitions: https://gist.github.com/onderkalaci/1edace4ed6bd6f061c8a15594865bb51#file-partitions_365-sql
./pgbench -f /tmp/multi_shard.sql -c10 -j10 -P 1 -T 120 postgres://citus:w3r6KLJpv3mxe9E-NIUeJw@c.fy5fkjcv45vcepaogqcaskmmkee.db.citusdata.com:5432/citus?sslmode=require
cat /tmp/multi_shard.sql
BEGIN;
DELETE FROM collections_list;
DELETE FROM collections_list;
DELETE FROM collections_list;
COMMIT;
cat /tmp/single_shard.sql
BEGIN;
DELETE FROM collections_list WHERE key = :aid;
DELETE FROM collections_list WHERE key = :aid;
DELETE FROM collections_list WHERE key = :aid;
COMMIT;
cat /tmp/mix.sql
BEGIN;
DELETE FROM collections_list WHERE key = :aid;
DELETE FROM collections_list WHERE key = :aid;
DELETE FROM collections_list WHERE key = :aid;
DELETE FROM collections_list;
DELETE FROM collections_list;
DELETE FROM collections_list;
COMMIT;
```
The table shows `latency average` of pgbench runs explained above, so we have a pretty solid improvement even over 8.2.2.
| Test | Citus 8.2.2 | Citus 8.3.1 | Citus 8.3.2 (this branch) | Citus 8.3.1 (FKEYs disabled via GUC) |
| ------------- | ------------- | ------------- |------------- | ------------- |
|multi_shard | 2370.083 ms |3605.040 ms |1324.094 ms |1247.255 ms |
| single_shard | 85.338 ms |120.934 ms |73.216 ms | 78.765 ms |
| mix | 2434.459 ms | 3727.080 ms |1306.456 ms | 1280.326 ms |
This causes no behaviorial changes, only organizes better to implement modifying CTEs
Also rename ExtactInsertRangeTableEntry to ExtractResultRelationRTE,
as the source of this function didn't match the documentation
Remove Task's upsertQuery in favor of ROW_MODIFY_NONCOMMUTATIVE
Split up AcquireExecutorShardLock into more internal functions
Tests: Normalize multi_reference_table multi_create_table_constraints
Also automated all manual tests around multi user isolation for internal citus udf's
automate upgrade_to_reference_table tests
add negative tests for lock_relation_if_exists
add tests for permissions on worker_cleanup_job_schema_cache
add tests for worker_fetch_partition_file
add tests for worker_merge_files_into_table
fix problem with worker_merge_files_and_run_query when run as non-super user and add tests for behaviour
With this commit, we're introducing the Adaptive Executor.
The commit message consists of two distinct sections. The first part explains
how the executor works. The second part consists of the commit messages of
the individual smaller commits that resulted in this commit. The readers
can search for the each of the smaller commit messages on
https://github.com/citusdata/citus and can learn more about the history
of the change.
/*-------------------------------------------------------------------------
*
* adaptive_executor.c
*
* The adaptive executor executes a list of tasks (queries on shards) over
* a connection pool per worker node. The results of the queries, if any,
* are written to a tuple store.
*
* The concepts in the executor are modelled in a set of structs:
*
* - DistributedExecution:
* Execution of a Task list over a set of WorkerPools.
* - WorkerPool
* Pool of WorkerSessions for the same worker which opportunistically
* executes "unassigned" tasks from a queue.
* - WorkerSession:
* Connection to a worker that is used to execute "assigned" tasks
* from a queue and may execute unasssigned tasks from the WorkerPool.
* - ShardCommandExecution:
* Execution of a Task across a list of placements.
* - TaskPlacementExecution:
* Execution of a Task on a specific placement.
* Used in the WorkerPool and WorkerSession queues.
*
* Every connection pool (WorkerPool) and every connection (WorkerSession)
* have a queue of tasks that are ready to execute (readyTaskQueue) and a
* queue/set of pending tasks that may become ready later in the execution
* (pendingTaskQueue). The tasks are wrapped in a ShardCommandExecution,
* which keeps track of the state of execution and is referenced from a
* TaskPlacementExecution, which is the data structure that is actually
* added to the queues and describes the state of the execution of a task
* on a particular worker node.
*
* When the task list is part of a bigger distributed transaction, the
* shards that are accessed or modified by the task may have already been
* accessed earlier in the transaction. We need to make sure we use the
* same connection since it may hold relevant locks or have uncommitted
* writes. In that case we "assign" the task to a connection by adding
* it to the task queue of specific connection (in
* AssignTasksToConnections). Otherwise we consider the task unassigned
* and add it to the task queue of a worker pool, which means that it
* can be executed over any connection in the pool.
*
* A task may be executed on multiple placements in case of a reference
* table or a replicated distributed table. Depending on the type of
* task, it may not be ready to be executed on a worker node immediately.
* For instance, INSERTs on a reference table are executed serially across
* placements to avoid deadlocks when concurrent INSERTs take conflicting
* locks. At the beginning, only the "first" placement is ready to execute
* and therefore added to the readyTaskQueue in the pool or connection.
* The remaining placements are added to the pendingTaskQueue. Once
* execution on the first placement is done the second placement moves
* from pendingTaskQueue to readyTaskQueue. The same approach is used to
* fail over read-only tasks to another placement.
*
* Once all the tasks are added to a queue, the main loop in
* RunDistributedExecution repeatedly does the following:
*
* For each pool:
* - ManageWorkPool evaluates whether to open additional connections
* based on the number unassigned tasks that are ready to execute
* and the targetPoolSize of the execution.
*
* Poll all connections:
* - We use a WaitEventSet that contains all (non-failed) connections
* and is rebuilt whenever the set of active connections or any of
* their wait flags change.
*
* We almost always check for WL_SOCKET_READABLE because a session
* can emit notices at any time during execution, but it will only
* wake up WaitEventSetWait when there are actual bytes to read.
*
* We check for WL_SOCKET_WRITEABLE just after sending bytes in case
* there is not enough space in the TCP buffer. Since a socket is
* almost always writable we also use WL_SOCKET_WRITEABLE as a
* mechanism to wake up WaitEventSetWait for non-I/O events, e.g.
* when a task moves from pending to ready.
*
* For each connection that is ready:
* - ConnectionStateMachine handles connection establishment and failure
* as well as command execution via TransactionStateMachine.
*
* When a connection is ready to execute a new task, it first checks its
* own readyTaskQueue and otherwise takes a task from the worker pool's
* readyTaskQueue (on a first-come-first-serve basis).
*
* In cases where the tasks finish quickly (e.g. <1ms), a single
* connection will often be sufficient to finish all tasks. It is
* therefore not necessary that all connections are established
* successfully or open a transaction (which may be blocked by an
* intermediate pgbouncer in transaction pooling mode). It is therefore
* essential that we take a task from the queue only after opening a
* transaction block.
*
* When a command on a worker finishes or the connection is lost, we call
* PlacementExecutionDone, which then updates the state of the task
* based on whether we need to run it on other placements. When a
* connection fails or all connections to a worker fail, we also call
* PlacementExecutionDone for all queued tasks to try the next placement
* and, if necessary, mark shard placements as inactive. If a task fails
* to execute on all placements, the execution fails and the distributed
* transaction rolls back.
*
* For multi-row INSERTs, tasks are executed sequentially by
* SequentialRunDistributedExecution instead of in parallel, which allows
* a high degree of concurrency without high risk of deadlocks.
* Conversely, multi-row UPDATE/DELETE/DDL commands take aggressive locks
* which forbids concurrency, but allows parallelism without high risk
* of deadlocks. Note that this is unrelated to SEQUENTIAL_CONNECTION,
* which indicates that we should use at most one connection per node, but
* can run tasks in parallel across nodes. This is used when there are
* writes to a reference table that has foreign keys from a distributed
* table.
*
* Execution finishes when all tasks are done, the query errors out, or
* the user cancels the query.
*
*-------------------------------------------------------------------------
*/
All the commits involved here:
* Initial unified executor prototype
* Latest changes
* Fix rebase conflicts to master branch
* Add missing variable for assertion
* Ensure that master_modify_multiple_shards() returns the affectedTupleCount
* Adjust intermediate result sizes
The real-time executor uses COPY command to get the results
from the worker nodes. Unified executor avoids that which
results in less data transfer. Simply adjust the tests to lower
sizes.
* Force one connection per placement (or co-located placements) when requested
The existing executors (real-time and router) always open 1 connection per
placement when parallel execution is requested.
That might be useful under certain circumstances:
(a) User wants to utilize as much as CPUs on the workers per
distributed query
(b) User has a transaction block which involves COPY command
Also, lots of regression tests rely on this execution semantics.
So, we'd enable few of the tests with this change as well.
* For parameters to be resolved before using them
For the details, see PostgreSQL's copyParamList()
* Unified executor sorts the returning output
* Ensure that unified executor doesn't ignore sequential execution of DDLJob's
Certain DDL commands, mainly creating foreign keys to reference tables,
should be executed sequentially. Otherwise, we'd end up with a self
distributed deadlock.
To overcome this situaiton, we set a flag `DDLJob->executeSequentially`
and execute it sequentially. Note that we have to do this because
the command might not be called within a transaction block, and
we cannot call `SetLocalMultiShardModifyModeToSequential()`.
This fixes at least two test: multi_insert_select_on_conflit.sql and
multi_foreign_key.sql
Also, I wouldn't mind scattering local `targetPoolSize` variables within
the code. The reason is that we'll soon have a GUC (or a global
variable based on a GUC) that'd set the pool size. In that case, we'd
simply replace `targetPoolSize` with the global variables.
* Fix 2PC conditions for DDL tasks
* Improve closing connections that are not fully established in unified execution
* Support foreign keys to reference tables in unified executor
The idea for supporting foreign keys to reference tables is simple:
Keep track of the relation accesses within a transaction block.
- If a parallel access happens on a distributed table which
has a foreign key to a reference table, one cannot modify
the reference table in the same transaction. Otherwise,
we're very likely to end-up with a self-distributed deadlock.
- If an access to a reference table happens, and then a parallel
access to a distributed table (which has a fkey to the reference
table) happens, we switch to sequential mode.
Unified executor misses the function calls that marks the relation
accesses during the execution. Thus, simply add the necessary calls
and let the logic kick in.
* Make sure to close the failed connections after the execution
* Improve comments
* Fix savepoints in unified executor.
* Rebuild the WaitEventSet only when necessary
* Unclaim connections on all errors.
* Improve failure handling for unified executor
- Implement the notion of errorOnAnyFailure. This is similar to
Critical Connections that the connection managament APIs provide
- If the nodes inside a modifying transaction expand, activate 2PC
- Fix few bugs related to wait event sets
- Mark placement INACTIVE during the execution as much as possible
as opposed to we do in the COMMIT handler
- Fix few bugs related to scheduling next placement executions
- Improve decision on when to use 2PC
Improve the logic to start a transaction block for distributed transactions
- Make sure that only reference table modifications are always
executed with distributed transactions
- Make sure that stored procedures and functions are executed
with distributed transactions
* Move waitEventSet to DistributedExecution
This could also be local to RunDistributedExecution(), but in that case
we had to mark it as "volatile" to avoid PG_TRY()/PG_CATCH() issues, and
cast it to non-volatile when doing WaitEventSetFree(). We thought that
would make code a bit harder to read than making this non-local, so we
move it here. See comments for PG_TRY() in postgres/src/include/elog.h
and "man 3 siglongjmp" for more context.
* Fix multi_insert_select test outputs
Two things:
1) One complex transaction block is now supported. Simply update
the test output
2) Due to dynamic nature of the unified executor, the orders of
the errors coming from the shards might change (e.g., all of
the queries on the shards would fail, but which one appears
on the error message?). To fix that, we simply added it to
our shardId normalization tool which happens just before diff.
* Fix subeury_and_cte test
The error message is updated from:
failed to execute task
To:
more than one row returned by a subquery or an expression
which is a lot clearer to the user.
* Fix intermediate_results test outputs
Simply update the error message from:
could not receive query results
to
result "squares" does not exist
which makes a lot more sense.
* Fix multi_function_in_join test
The error messages update from:
Failed to execute task XXX
To:
function f(..) does not exist
* Fix multi_query_directory_cleanup test
The unified executor does not create any intermediate files.
* Fix with_transactions test
A test case that just started to work fine
* Fix multi_router_planner test outputs
The error message is update from:
Could not receive query results
To:
Relation does not exists
which is a lot more clearer for the users
* Fix multi_router_planner_fast_path test
The error message is update from:
Could not receive query results
To:
Relation does not exists
which is a lot more clearer for the users
* Fix isolation_copy_placement_vs_modification by disabling select_opens_transaction_block
* Fix ordering in isolation_multi_shard_modify_vs_all
* Add executor locks to unified executor
* Make sure to allocate enought WaitEvents
The previous code was missing the waitEvents for the latch and
postmaster death.
* Fix rebase conflicts for master rebase
* Make sure that TRUNCATE relies on unified executor
* Implement true sequential execution for multi-row INSERTS
Execute the individual tasks executed one by one. Note that this is different than
MultiShardConnectionType == SEQUENTIAL_CONNECTION case (e.g., sequential execution
mode). In that case, running the tasks across the nodes in parallel is acceptable
and implemented in that way.
However, the executions that are qualified here would perform poorly if the
tasks across the workers are executed in parallel. We currently qualify only
one class of distributed queries here, multi-row INSERTs. If we do not enforce
true sequential execution, concurrent multi-row upserts could easily form
a distributed deadlock when the upserts touch the same rows.
* Remove SESSION_LIFESPAN flag in unified_executor
* Apply failure test updates
We've changed the failure behaviour a bit, and also the error messages
that show up to the user. This PR covers majority of the updates.
* Unified executor honors citus.node_connection_timeout
With this commit, unified executor errors out if even
a single connection cannot be established within
citus.node_connection_timeout.
And, as a side effect this fixes failure_connection_establishment
test.
* Properly increment/decrement pool size variables
Before this commit, the idle and active connection
counts were not properly calculated.
* insert_select_executor goes through unified executor.
* Add missing file for task tracker
* Modify ExecuteTaskListExtended()'s signature
* Sort output of INSERT ... SELECT ... RETURNING
* Take partition locks correctly in unified executor
* Alternative implementation for force_max_query_parallelization
* Fix compile warnings in unified executor
* Fix style issues
* Decrement idleConnectionCount when idle connection is lost
* Always rebuild the wait event sets
In the previous implementation, on waitFlag changes, we were only
modifying the wait events. However, we've realized that it might
be an over optimization since (a) we couldn't see any performance
benefits (b) we see some errors on failures and because of (a)
we prefer to disable it now.
* Make sure to allocate enough sized waitEventSet
With multi-row INSERTs, we might have more sessions than
task*workerCount after few calls of RunDistributedExecution()
because the previous sessions would also be alive.
Instead, re-allocate events when the connectino set changes.
* Implement SELECT FOR UPDATE on reference tables
On master branch, we do two extra things on SELECT FOR UPDATE
queries on reference tables:
- Acquire executor locks
- Execute the query on all replicas
With this commit, we're implementing the same logic on the
new executor.
* SELECT FOR UPDATE opens transaction block even if SelectOpensTransactionBlock disabled
Otherwise, users would be very confused and their logic is very likely
to break.
* Fix build error
* Fix the newConnectionCount calculation in ManageWorkerPool
* Fix rebase conflicts
* Fix minor test output differences
* Fix citus indent
* Remove duplicate sorts that is added with rebase
* Create distributed table via executor
* Fix wait flags in CheckConnectionReady
* failure_savepoints output for unified executor.
* failure_vacuum output (pg 10) for unified executor.
* Fix WaitEventSetWait timeout in unified executor
* Stabilize failure_truncate test output
* Add an ORDER BY to multi_upsert
* Fix regression test outputs after rebase to master
* Add executor.c comment
* Rename executor.c to adaptive_executor.c
* Do not schedule tasks if the failed placement is not ready to execute
Before the commit, we were blindly scheduling the next placement executions
even if the failed placement is not on the ready queue. Now, we're ensuring
that if failed placement execution is on a failed pool or session where the
execution is on the pendingQueue, we do not schedule the next task. Because
the other placement execution should be already running.
* Implement a proper custom scan node for adaptive executor
- Switch between the executors, add GUC to set the pool size
- Add non-adaptive regression test suites
- Enable CIRCLE CI for non-adaptive tests
- Adjust test output files
* Add slow start interval to the executor
* Expose max_cached_connection_per_worker to user
* Do not start slow when there are cached connections
* Consider ExecutorSlowStartInterval in NextEventTimeout
* Fix memory issues with ReceiveResults().
* Disable executor via TaskExecutorType
* Make sure to execute the tests with the other executor
* Use task_executor_type to enable-disable adaptive executor
* Remove useless code
* Adjust the regression tests
* Add slow start regression test
* Rebase to master
* Fix test failures in adaptive executor.
* Rebase to master - 2
* Improve comments & debug messages
* Set force_max_query_parallelization in isolation_citus_dist_activity
* Force max parallelization for creating shards when asked to use exclusive connection.
* Adjust the default pool size
* Expand description of max_adaptive_executor_pool_size GUC
* Update warnings in FinishRemoteTransactionCommit()
* Improve session clean up at the end of execution
Explicitly list all the states that the execution might end,
otherwise warn.
* Remove MULTI_CONNECTION_WAIT_RETRY which is not used at all
* Add more ORDER BYs to multi_mx_partitioning
- All the schema creations on the workers will now be via superuser connections
- If a shard is being repaired or a shard is replicated, we will create the
schema only in the relevant worker; and in all the other cases where a schema
creation is needed, we will block operations until we ensure the schema exists
in all the workers
It has been reported a null pointer dereference could be triggered in FreeConnParamsHashEntryFields. Likely cause is an error in GetConnParams which will leave the cached ConnParamsHashEntry in a state that would cause the null pointer dereference in a subsequent connection establishment to the same server. This has been simulated by inserting ereport(ERROR, ...) at certain places in the code.
Not only would ConnParamsHashEntry be in a state that would cause a crash, it was also leaking memory in the ConnectionContext due to the loss of pointers as they are only stored on the ConnParamsHashEntry at the end of the function.
This patch rewrites both the GetConnParams to store pointers 'durably' at every point in the code so that an error would not lose the pointer as well as FreeConnParamsHashEntryFields in a way that it can clear half initialised ConnParamsHashEntry's in a safer manner.
GRANT queries are propagated on Enterprise. If a user attempts to
create a user and run a GRANT query before creating it on workers, we
fail. This issue does not happen in community as the user needs to run
the GRANTs on the workers manually.
When `master_update_node` is called to update a node's location it waits for appropriate locks to become available. This is useful during normal operation as new operations will be blocked till after the metadata update while running operations have time to finish.
When `master_update_node` is called after a node failure it is less useful to wait for running operations to finish as they can't. The lock being held indicates an operation that once attempted to commit will fail as the machine already failed. Now the downside is the failover is postponed till the termination point of the operation. This has been observed by users to take a significant amount of time causing the rest of the system to be observed unavailable.
With this patch it is possible in such situations to invoke `master_update_node` with 2 optional arguments:
- `force` (bool defaults to `false`): When called with true the update of the metadata will be forced to proceed by terminating conflicting backends. A cancel is not enough as the backend might be in idle time (eg. an interactive session, or going back and forth between an appliaction), therefore a more intrusive solution of termination is used here.
- `lock_cooldown` (int defaults to `10000`): This is the time in milliseconds before conflicting backends are terminated. This is to allow the backends to finish cleanly before terminating them. This allows the user to set an upperbound to the expected time to complete the metadata update, eg. performing the failover.
The functionality is implemented by spawning a background worker that has the task of helping a certain backend in acquiring its locks. The backend is either terminated on successful execution of the metadata update, or once the memory context of the expression gets reset, eg. on a cancel of the statement.
Adds support for propagation of SET LOCAL commands to all workers
involved in a query. For now, SET SESSION (i.e. plain SET) is not
supported whatsoever, though this code is intended as somewhat of a
base for implementing such support in the future.
As SET LOCAL modifications are scoped to the body of a BEGIN/END xact
block, queries wishing to use SET LOCAL propagation must be within such
a block. In addition, subsequent modifications after e.g. any SAVEPOINT
or ROLLBACK statements will correspondingly push or pop variable mod-
ifications onto an internal stack such that the behavior of changed
values across the cluster will be identical to such behavior on e.g.
single-node PostgreSQL (or equivalently, what values are visible to
the end user by running SHOW on such variables on the coordinator).
If nodes enter the set of participants at some point after SET LOCAL
modifications (or SAVEPOINT, ROLLBACK, etc.) have occurred, the SET
variable state is eagerly propagated to them upon their entrance (this
is identical to, and indeed just augments, the existing logic for the
propagation of the SAVEPOINT "stack").
A new GUC (citus.propagate_set_commands) has been added to control this
behavior. Though the code suggests the valid settings are 'none', 'local',
'session', and 'all', only 'none' (the default) and 'local' are presently
implemented: attempting to use other values will result in an error.
This is a preperation for the new executor, where creating shards
would go through the executor. So, explicitly generate the commands
for further processing.
If a query is router executable, it hits a single shard and therefore has a
single task associated with it. Therefore there is no need to sort the task list
that has a single element.
Also we already have a list of active shard placements, sending it in param
and reuse it.
If replication factor eqauls to 2 and there are two worker nodes,
even if two modifications hit different shards, Citus doesn't use
2PC. The reason is that it doesn't fit into the definition of
"expanding participating worker nodes".
Thus, we're simply fixing the test to fit in the comment
on top of it.
InitializeCaches() method may prematurely set
performedInitialization without actually creating
DistShardCacheHash.
Fix makes sure flag is set only if DistShardCacheHash is created successfully.
Also introduced a new memory context to allocate aforementioned hash tables.
If allocation/initialization fails for any reason we make sure
memory is reclaimed by deleting the memory context.
Instead of scattering the code around, we move all the
logic into a single function.
This will help supporting foreign keys to reference tables
in the unified executor with a single line of change, just
calling this function.
The feature is only intended for getting consistent outputs for the regression tests.
RETURNING does not have any ordering gurantees and with unified executor, the ordering
of query executions on the shards are also becoming unpredictable. Thus, we're enforcing
ordering when a GUC is set.
We implicitly add an `ORDER BY` something equivalent of
`
RETURNING expr1, expr2, .. ,exprN
ORDER BY expr1, expr2, .. ,exprN
`
As described in the code comments as well, this is probably not the most
performant approach we could implement. However, since we're only
targeting regression tests, I don't see any issues with that. If we
decide to expand this to a feature to users, we should revisit the
implementation and improve the performance.
This commit has two goals:
(a) Ensure to access both edges of the allocated stack
(b) Ensure that any compiler optimizations to prevent the
function optimized away.
Stack size after the patch:
sudo grep -A 1 stack /proc/2119/smaps
7ffe305a6000-7ffe307a9000 rw-p 00000000 00:00 0 [stack]
Size: 2060 kB
Stack size before the patch:
sudo grep -A 1 stack /proc/3610/smaps
7fff09957000-7fff09978000 rw-p 00000000 00:00 0 [stack]
Size: 132 kB
We used to rely on PG function flatten_join_alias_vars
to resolve actual columns referenced in target entry list.
The function goes deep and finds the actual relation. This logic
usually works fine. However, when joins are given an alias, inner
relation names are not visible to target entry entry. Thus relation
resolving should stop when we the target entry column refers an
rte of an aliased join.
We stopped using PG function and provided our own flatten function.
Our assumption that strip_implicit_coercions would leave us with a bi-
nary-compatible type to that of the partition key was wrong. Instead,
we should ensure the RHS of the comparison we perform is proactively
coerced into a compatible type (at least binary compatible).