Fixes#3331
In #2389, we've implemented support for partitioned tables with rep > 1.
The implementation is limiting the use of modification queries on the
partitions. In fact, we error out when any partition is modified via
EnsurePartitionTableNotReplicated().
However, we seem to forgot an important case, where the parent table's
partition is marked as INVALID. In that case, at least one of the partition
becomes INVALID. However, we do not mark partitions as INVALID ever.
If the user queries the partition table directly, Citus could happily send
the query to INVALID placements -- which are not marked as INVALID.
This PR fixes it by marking the placements of the partitions as INVALID
as well.
The shard placement repair logic already re-creates all the partitions,
so should be fine in that front.
Different versions of reindent tool reformatted citus_custom_scan.c
and citus_copyfuncs.c differently. So some developers spent some
extra attention not to commit these two files after reindent.
This PR tries to address this.
* WIP
* wip
* add basic logic to run a single job with repartioning joins with adaptive executor
* fix some warnings and return in ExecuteDependedTasks if there is none
* Add the logic to run depended jobs in adaptive executor
The execution of depended tasks logic is changed. With the current
logic:
- All tasks are created from the top level task list.
- At one iteration:
- CurTasks whose dependencies are executed are found.
- CurTasks are executed in parallel with adapter executor main
logic.
- The iteration is repeated until all tasks are completed.
* Separate adaptive executor repartioning logic
* Remove duplicate parts
* cleanup directories and schemas
* add basic repartion tests for adaptive executor
* Use the first placement to fetch data
In task tracker, when there are replicas, we try to fetch from a replica
for which a map task is succeeded. TaskExecution is used for this,
however TaskExecution is not used in adaptive executor. So we cannot use
the same thing as task tracker.
Since adaptive executor fails when a map task fails (There is no retry
logic yet). We know that if we try to execute a fetch task, all of its
map tasks already succeeded, so we can just use the first one to fetch
from.
* fix clean directories logic
* do not change the search path while creating a udf
* Enable repartition joins with adaptive executor with only enable_reparitition_joins guc
* Add comments to adaptive_executor_repartition
* dont run adaptive executor repartition test in paralle with other tests
* execute cleanup only in the top level execution
* do cleanup only in the top level ezecution
* not begin a transaction if repartition query is used
* use new connections for repartititon specific queries
New connections are opened to send repartition specific queries. The
opened connections will be closed at the FinishDistributedExecution.
While sending repartition queries no transaction is begun so that
we can see all changes.
* error if a modification was done prior to repartition execution
* not start a transaction if a repartition query and sql task, and clean temporary files and schemas at each subplan level
* fix cleanup logic
* update tests
* add missing function comments
* add test for transaction with DDL before repartition query
* do not close repartition connections in adaptive executor
* rollback instead of commit in repartition join test
* use close connection instead of shutdown connection
* remove unnecesary connection list, ensure schema owner before removing directory
* rename ExecuteTaskListRepartition
* put fetch query string in planner not executor as we currently support only replication factor = 1 with adaptive executor and repartition query and we know the query string in the planner phase in that case
* split adaptive executor repartition to DAG execution logic and repartition logic
* apply review items
* apply review items
* use an enum for remote transaction state and fix cleanup for repartition
* add outside transaction flag to find connections that are unclaimed instead of always opening a new transaction
* fix style
* wip
* rename removejobdir to partition cleanup
* do not close connections at the end of repartition queries
* do repartition cleanup in pg catch
* apply review items
* decide whether to use transaction or not at execution creation
* rename isOutsideTransaction and add missing comment
* not error in pg catch while doing cleanup
* use replication factor of the creation time, not current time to decide if task tracker should be chosen
* apply review items
* apply review items
* apply review item
DESCRIPTION: Fix counter that keeps track of internal depth in executor
While reviewing #3302 I ran into the `ExecutorLevel` variable which used a variable to keep the original value to restore on successful exit. I haven't explored the full space and if it is possible to get into an inconsistent state. However using `PG_TRY`/`PG_CATCH` seems generally more correct.
Given very bad things will happen if this level is not reset, I kept the failsafe of setting the variiable back to 0 on the `XactCallback` but I did add an assert to treat it as a developer bug.
Currently in mx isolation tests the setup is the same except the creation of tables. Isolation framework lets us define multiple `setup` stages, therefore I thought that we can put the `mx_setup` to one file and prepend this prior to running tests.
How the structure works:
- cpp is used before running isolation tests to preprocess spec files. This way we can include any file we want to. Currently this is used to include mx common part.
- spec files are put to `/build/specs` for clear separation between generated files and template files
- a symbolic link is created for `/expected` in `build/expected/`.
- when running isolation tests, as the `inputdir`, `build` is passed so it runs the spec files from `build/specs` and checks the expected output from `build/expected`.
`/specs` is renamed as `/spec` because postgres first look at the `specs` file under current directory, so this is renamed to avoid that since we are running the isolation tests from `build/specs` now.
Note: now we use `//` instead of `#` in comments in spec files, because cpp interprets `#` as a directive and it ignores `//`.
Postgres keeps track of recursive CTEs in the queryTree in two ways:
- queryTree->hasRecursive is set to true, whenever a RECURSIVE CTE
is used in the SQL. Citus checks for it
- If the CTE is actually a recursive one (a.k.a., references itself)
Postgres marks CommonTableExpr->cterecursive as true as well
The tests that are changed in the PR doesn't cover (b), and this becomes
an issue with CTE inlining (#3161). In that case, Citus/Postgres can inline
such CTEs, and the queries works with Citus.
However, this tests intend to check if there is any recursive CTE in the queryTree.
So, we're actually making the CTEs recursive CTEs by referring itself.
We'll add cases where a recursive CTE works by inlining in #3161.
Use partition column's collation for range distributed tables
Don't allow non deterministic collations for hash distributed tables
CoPartitionedTables: don't compare unequal types
Test ALTER ROLE doesn't deadlock when coordinator added, or propagate from mx workers
Consolidate wait_until_metadata_sync & verify_metadata to multi_test_helpers
Previously,
- we'd push down ORDER BY, but this doesn't order intermediate results between workers
- we'd keep FILTER on master aggregate, which would raise an error about unexpected cstrings
DESCRIPTION: add gitref to the output of citus_version
During debugging of custom builds it is hard to know the exact version of the citus build you are using. This patch will add a human readable/understandable git reference to the build of citus which can be retrieved by calling `citus_version();`.
Support for ARRAY[] expressions is limited to having a consistent shape,
eg ARRAY[(int,text),(int,text)] as opposed to ARRAY[(int,text),(float,text)] or ARRAY[(int,text),(int,text,float)]
Initialization of queryWindowClause and queryOrderByLimit "memset" underflow these variables.
It's possible due to the invalid usage sizeof this part of the program cause buffer overflow and function return data corruption in future changes.
* Improve extension command propagation tests
* patch for hardcoded citus extension name
(cherry picked from commit 0bb3dbac0afabda10e8928f9c17eda048dc4361a)
In plain words, each distributed plan pulls the necessary intermediate
results to the worker nodes that the plan hits. This is primarily useful
in three ways.
(i) If the distributed plan that uses intermediate
result(s) is a router query, then the intermediate results are only
broadcasted to a single node.
(ii) If a distributed plan consists of only intermediate results, which
is not uncommon, the intermediate results are broadcasted to a single
node only.
(iii) If a distributed query hits a sub-set of the shards in multiple
workers, the intermediate results will be broadcasted to the relevant
node(s).
The final item (iii) becomes crucial for append/range distributed
tables where typically the distributed queries hit a small subset of
shards/workers.
To do this, for each query that Citus creates a distributed plan, we keep
track of the subPlans used in the queryTree, and save it in the distributed
plan. Just before Citus executes each subPlan, Citus first keeps track of
every worker node that the distributed plan hits, and marks every subPlan
should be broadcasted to these nodes. Later, for each subPlan which is a
distributed plan, Citus does this operation recursively since these
distributed plans may access to different subPlans, and those have to be
recorded as well.
Prevent Citus extension being distributed
Because that could prevent doing rolling upgrades, where users may
prefer to upgrade the version on the coordinator but not the workers.
There could be some other edge cases, so I'd prefer to keep Citus
extension outside the picture for now.
DESCRIPTION: Expression in reference join
Fixed: #2582
This patch allows arbitrary expressions in the join clause when joining to a reference table. An example of such joins could be found in CHbenCHmark queries 7, 8, 9 and 11; `mod((s_w_id * s_i_id),10000) = su_suppkey` and `ascii(substr(c_state,1,1)) = n2.n_nationkey`. Since the join is on a reference table these queries are able to be pushed down to the workers.
To implement these queries we will widen the `IsJoinClause` predicate to not check if the expressions are a type `Var` after stripping the implicit coerciens. Instead we define a join clause when the `Var`'s in a clause come from more than 1 table.
This allows more clauses to pass into the logical planner's `MultiNodeTree(...)` planning function. To compensate for this we tighten down the `LocalJoin`, `SinglePartitionJoin` and `DualPartitionJoin` to check for direct column references when planning. This allows the planner to work with arbitrary join expressions on reference tables.
With this commit, we're slightly changing the dependency traversal
logic to enable extension propagation.
The main idea is to "follow" the extension dependencies, but do not
"apply" them.
Since some extension dependencies are base types, and base types
could have circular dependencies, we implement a logic to prevent
revisiting an already visited object.
When the user picks "round-robin" policy, the aim is that the load
is distributed across nodes. However, for reference tables on the
coordinator, since local execution kicks in immediately, round-robin
is ignored.
With this change, we're excluding the placement on the coordinator.
Although the approach seems a little bit invasive because of
modifications in the placement list, that sounds acceptable.
We could have done this in some other ways such as:
1) Add a field to "Task->roundRobinPlacement" (or such), which is
updated as the first element after RoundRobinPolicy is applied.
During the execution, if that placement is local to the coordinator,
skip it and try the other remote placements.
2) On TaskAccessesLocalNode()@local_execution.c, check
task_assignment_policy, if round-robin selected and there is local
placement on the coordinator, skip it. However, task assignment is done
on planning, but this decision is happening on the execution, which
could create weird edge cases.
This change was actually already intended in #3124. However, the
postgres Makefile manually enables this warning too. This way we undo
that.
To confirm that it works two functions were changed to make use of not
having the warning anymore.
Phase 1 seeks to implement minimal infrastructure, so does not include:
- dynamic generation of support aggregates to handle multiple arguments
- configuration methods to direct aggregation strategy,
or mark an aggregate's serialize/deserialize as safe to operate across nodes
Aggregates can be distributed when:
- they have a single argument
- they have a combinefunc
- their transition type is not a pseudotype
This is necassery to support Q20 of the CHbenCHmark: #2582.
To summarize the fix: The subquery is converted into an INNER JOIN on a
table. This fixes the issue, since an INNER JOIN on a table is already
supported by the repartion planner.
The way this replacement is happening.:
1. Postgres replaces `col in (subquery)` with a SEMI JOIN (subquery) on col = subquery_result
2. If this subquery is simple enough Postgres will replace it with a
regular read from a table
3. If the subquery returns unique results (e.g. a primary key) Postgres
will convert the SEMI JOIN into an INNER JOIN during the planning. It
will not change this in the rewritten query though.
4. We check if Postgres sends us any SEMI JOINs during its join order
planning, if it doesn't we replace all SEMI JOINs in the rewritten
query with INNER JOIN (which we already support).
Since we've removed the executor, we don't need the specific tests.
Since the tests are already using adaptive executor, they were passing.
But, we've plenty of extra tests for adaptive executor, so seems safe
to remove.
Postgres doesn't require you to add all columns that are in the target list to
the GROUP BY when you group by a unique column (or columns). It even actively
removes these group by clauses when you do.
This is normally fine, but for repartition joins it is not. The reason for this
is that the temporary tables don't have these primary key columns. So when the
worker executes the query it will complain that it is missing columns in the
group by.
This PR fixes that by adding an ANY_VALUE aggregate around each variable in
the target list that does is not contained in the group by or in an aggregate.
This is done only for repartition joins.
The ANY_VALUE aggregate chooses the value from an undefined row in the
group.
It looks like the logic to prevent RETURNING in reference tables to
have duplicate entries that comes from local and remote executions
leads to missing some tuples for distributed tables.
With this PR, we're ensuring to kick in the logic for reference tables
only.
* Remove unused executor codes
All of the codes of real-time executor. Some functions
in router executor still remains there because there
are common functions. We'll move them to accurate places
in the follow-up commits.
* Move GUCs to transaction mngnt and remove unused struct
* Update test output
* Get rid of references of real-time executor from code
* Warn if real-time executor is picked
* Remove lots of unused connection codes
* Removed unused code for connection restrictions
Real-time and router executors cannot handle re-using of the existing
connections within a transaction block.
Adaptive executor and COPY can re-use the connections. So, there is no
reason to keep the code around for applying the restrictions in the
placement connection logic.
We've changed the logic for pulling RTE_RELATIONs in #3109 and
non-colocated subquery joins and partitioned tables.
@onurctirtir found this steps where I traced back and found the issues.
While looking into it in more detail, we decided to expand the list in a
way that the callers get all the relevant RTE_RELATIONs RELKIND_RELATION,
RELKIND_PARTITIONED_TABLE, RELKIND_FOREIGN_TABLE and RELKIND_MATVIEW.
These are all relation kinds that Citus planner is aware of.
When citus.enable_repartition_joins guc is set to on, and we have
adaptive executor, there was a typo in the debug message, which was
saying realtime executor no adaptive executor.
See #3125 for details on each item.
* Remove real-time/router executor tests-1
These are the ones which doesn't have '_%d' in the test
output files.
* Remove real-time/router executor tests-2
These are the ones which has in the test
output files.
* Move the tests outputs to correct place
* Make sure that single shard commits use 2PC on adaptive executor
It looks like we've messed the tests in #2891. Fixing back.
* Use adaptive executor for all router queries
This becomes important because when task-tracker is picked, we
used to pick router executor, which doesn't make sense.
* Remove explicit references to real-time/router executors in the tests
* JobExecutorType never picks real-time/router executors
* Make sure to go incremental in test output numbers
* Even users cannot pick real-time anymore
* Do not use real-time/router custom scans
* Get rid of unnecessary normalizations
* Reflect unneeded normalizations
* Get rid of unnecessary test output file
This completely hides `ListCell` to the user of the loop
Example usage:
```c
WorkerNode *workerNode = NULL;
foreach_ptr(workerNode, workerNodeList) {
// Do stuff with workerNode
}
```
Instead of:
```c
ListCell *workerNodeCell = NULL;
foreach(cell, workerNodeList) {
WorkerNode *workerNode = lfirst(workerNodeCell);
// Do stuff with workerNode
}
```
It turns out that TupleDescGetAttInMetadata() allocates quite a lot
of memory. And, if the target list is long and there are too many rows
returning, the leak becomes appereant.
You can reproduce the issue wout the fix with the following commands:
```SQL
CREATE TABLE users_table (user_id int, time timestamp, value_1 int, value_2 int, value_3 float, value_4 bigint);
SELECT create_distributed_table('users_table', 'user_id');
insert into users_table SELECT i, now(), i, i, i, i FROM generate_series(0,99999)i;
-- load faster
-- 200,000
INSERT INTO users_table SELECT * FROM users_table;
-- 400,000
INSERT INTO users_table SELECT * FROM users_table;
-- 800,000
INSERT INTO users_table SELECT * FROM users_table;
-- 1,600,000
INSERT INTO users_table SELECT * FROM users_table;
-- 3,200,000
INSERT INTO users_table SELECT * FROM users_table;
-- 6,400,000
INSERT INTO users_table SELECT * FROM users_table;
-- 12,800,000
INSERT INTO users_table SELECT * FROM users_table;
-- making the target list entry wider speeds up the leak to show up
select *,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,* FROM users_table ;
```
This is an improvement over #2512.
This adds the boolean shouldhaveshards column to pg_dist_node. When it's false, create_distributed_table for new collocation groups will not create shards on that node. Reference tables will still be created on nodes where it is false.
Areas for further optimization:
- Don't save subquery results to a local file on the coordinator when the subquery is not in the having clause
- Push the the HAVING with subquery to the workers if there's a group by on the distribution column
- Don't push down the results to the workers when we don't push down the HAVING clause, only the coordinator needs it
Fixes#520Fixes#756Closes#2047
* add support to run citus upgrade tests locally
* dont build tars if they already exist
* use current code instead of master for upgrade
* always build the current code
* copy the current citus code to have isolated citus upgrade tests
* fix configure and simplify copy
DESCRIPTION: Fix order for enum values and correctly support pg12
PG 12 introduces `ALTER TYPE ... ADD VALUE ...` during transactions. Earlier versions would error out when called in a transaction, hence we connect to workers outside of the transaction which could cause inconsistencies on pg12 now that postgres doesn't error with this syntax anymore.
During the implementation of this fix it became apparent there was an error with the ordering of enum labels when the type was recreated. A patch and test have been included.
Objectives:
(a) both super user and regular user should have the correct owner for the function on the worker
(b) The transactional semantics would work fine for both super user and regular user
(c) non-super-user and non-function owner would get a reasonable error message if tries to distribute the function
Co-authored-by: @serprex
* Add initial citus upgrade test
* Add restart databases and run tests in all nodes
* Add output for citus versions 8.0 8.1 8.2 and 8.3
* Add verify step for citus upgrade
* Add target for citus upgrade test in makefile
* Add check citus upgrade job
* Fix installation file path and add missing tar
* Run citus upgrade for v8.0 v8.1 v8.2 and v8.3
* Create upgrade_common file and rename upgrade check
* Add pg version to citus upgrade test
* Test with postgres 10 and 11 in citus upgrade tests
* Add readme for citus upgrade test
* Add some basic tests to citus upgrade tests
* Add citus upgrade mixed mode test
* Remove citus artifacts before installing another one
* Refactor citus upgrade test according to reviews
* quick and dirty rewrite of citus upgrade tests to support local execution.
I think we need to change the makefile in such a way that the tar files can be injected from the circle ci config file.
Also I removed some of the citus version checks you had to not have the requirement to pass that in separately from the pre tar file. I am not super happy with it, but two flags that need to be kept in sync is also not desirable. Instead I print out the citus version that is installed per node. This will not cause a failure if they are not what one would expect but it lets us verify we are running the expected version.
* use latest citusupgradetester in circleci
* update readme and use common alias for upgrade_common import
* Add PG12 test outputs
* Add jobs to run tests with pg 12
* use POSIX collate for compatibility between pg10/pg11/pg12
* do not override the new default value when running vanilla tests
* fix 2 problems with pg12 tests
* update pg12 images with pg12 rc1
* remove pg10 jobs
* Revert "Add PG12 test outputs"
This reverts commit f3545b92ef.
* change images to use latest instead of dev
* add missing coverage flags
DESCRIPTION: Disallow distributed functions for functions depending on an extension
Functions depending on an extension cannot (yet) be distributed by citus. If we would allow this it would cause issues with our dependency following mechanism as we stop following objects depending on an extension.
By not allowing functions to be distributed when they depend on an extension as well as not allowing to make distributed functions depend on an extension we won't break the ability to add new nodes. Allowing functions depending on extensions to be distributed at the moment could cause problems in that area.
DESCRIPTION: Propagate CREATE OR REPLACE FUNCTION
Distributed functions could be replaced, which should be propagated to the workers to keep the function in sync between all nodes.
Due to the complexity of deparsing the `CreateFunctionStmt` we actually produce the plan during the processing phase of our utilityhook. Since the changes have already been made in the catalog tables we can reuse `pg_get_functiondef` to get us the generated `CREATE OR REPLACE` sql.
DESCRIPTION: Propagate ALTER FUNCTION statements for distributed functions
Using the implemented deparser for function statements to propagate changes to both functions and procedures that are previously distributed.
This PR aims to add all the necessary logic to qualify and deparse all possible `{ALTER|DROP} .. {FUNCTION|PROCEDURE}` queries.
As Procedures are introduced in PG11, the code contains many PG version checks. I tried my best to make it easy to clean up once we drop PG10 support.
Here are some caveats:
- I assumed that the parse tree is a valid one. There are some queries that are not allowed, but still are parsed successfully by postgres planner. Such queries will result in errors in execution time. (e.g. `ALTER PROCEDURE p STRICT` -> `STRICT` action is valid for functions but not procedures. Postgres decides to parse them nevertheless.)
When a function is marked as colocated with a distributed table,
we try delegating queries of kind "SELECT func(...)" to workers.
We currently only support this simple form, and don't delegate
forms like "SELECT f1(...), f2(...)", "SELECT f1(...) FROM ...",
or function calls inside transactions.
As a side effect, we also fix the transactional semantics of DO blocks.
Previously we didn't consider a DO block a multi-statement transaction.
Now we do.
Co-authored-by: Marco Slot <marco@citusdata.com>
Co-authored-by: serprex <serprex@users.noreply.github.com>
Co-authored-by: pykello <hadi.moshayedi@microsoft.com>
In this PR the default `threshold` of `rebalance_table_shards` was set to 0: https://github.com/citusdata/shard_rebalancer/pull/73
However, the default for get_rebalance_table_shards_plan was not updated. This
can cause the confusing situation where the actual steps run by
`rebalance_table_shards` are not the same as the ones returned by
`get_rebalance_table_shards_plan`.
We started copying parse trees by default further on in `multi_ProcessUtility`. That's not a problem for maintenance command, but might register for things like `PREPARE` and `EXECUTE`, which might happen thousands of times per second. Add a few common commands to the check at the start.
Since the distributed functions are useful when the workers have
metadata, we automatically sync it.
Also, after master_add_node(). We do it lazily and let the deamon
sync it. That's mainly because the metadata syncing cannot be done
in transaction blocks, and we don't want to add lots of transactional
limitations to master_add_node() and create_distributed_function().
* Enhance pg upgrade tests
* Add a specific upgrade test for pg_dist_partition
We store the index of distribution column, and when a column with an
index that is smaller than distribution column index is dropped before
an upgrade, the index should still match the distribution column after
an upgrade
With this commit, we're changing the API for create_distributed_function()
such that users can provide the distribution argument and the colocation
information.
We've recently merged two commits, db5d03931d
and eccba1d4c3, which actually operates
on the very similar places.
It turns out that we've an integration issue, where master_add_node()
fails to replicate the functions to newly added node.
DESCRIPTION: Provide a GUC to turn of the new dependency propagation functionality
In the case the dependency propagation functionality introduced in 9.0 causes issues to a cluster of a user they can turn it off almost completely. The only dependency that will still be propagated and kept track of is the schema to emulate the old behaviour.
GUC to change is `citus.enable_object_propagation`. When set to `false` the functionality will be mostly turned off. Be aware that objects marked as distributed in `pg_dist_object` will still be kept in the catalog as a distributed object. Alter statements to these objects will not be propagated to workers and may cause desynchronisation.
DESCRIPTION: Rename remote types during type propagation
To prevent data to be destructed when a remote type differs from the type on the coordinator during type propagation we wanted to rename the type instead of `DROP CASCADE`.
This patch removes the `DROP` logic and adds the creation of a rename statement to a free name.
DESCRIPTION: Add feature flag to turn off create type propagation
When `citus.enable_create_type_propagation` is set to `false` citus will not propagate `CREATE TYPE` statements to the workers. Types are still distributed when tables that depend on these types are distributed.
This PR simply adds the columns to pg_dist_object and
implements the necessary metadata changes to keep track of
distribution argument of the functions/procedures.
A better fix for #2975. Apparently for OSX cpp -MF and -MT shouldn't have a
space in between the flag and their value. Without the space it still works for
gcc as well.
This PR aims to add the minimal set of changes required to start
distributing functions. You can use create_distributed_function(regproc)
UDF to distribute a function.
SELECT create_distributed_function('add(int,int)');
The function definition should include the param types to properly
identify the correct function that we wish to distribute
@thanodnl told me it was a bit of a problem that it's impossible to see
the history of a UDF in git. The only way to do so is by reading all the
sql migration files from new to old. Another problem is that it's also
hard to review the changed UDF during code review, because to find out
what changed you have to do the same. I thought of a IMHO better (but
not perfect) way to handle this.
We keep the definition of a UDF in sql/udfs/{name_of_udf}/latest.sql.
That file we change whenever we need to make a change to the the UDF. On
top of that you also make a snapshot of the file in
sql/udfs/{name_of_udf}/{migration-version}.sql (e.g. 9.0-1.sql) by
copying the contents. This way you can easily view what the actual
changes were by looking at the latest.sql file.
There's still the question on how to use these files then. Sadly
postgres doesn't allow inclusion of other sql files in the migration sql
file (it does in psql using \i). So instead I used the C preprocessor+
make to compile a sql/xxx.sql to a build/sql/xxx.sql file. This final
build/sql/xxx.sql file has every occurence of #include "somefile.sql" in
sql/xxx.sql replaced by the contents of somefile.sql.
DESCRIPTION: Distribute Types to worker nodes
When to propagate
==============
There are two logical moments that types could be distributed to the worker nodes
- When they get used ( just in time distribution )
- When they get created ( proactive distribution )
The just in time distribution follows the model used by how schema's get created right before we are going to create a table in that schema, for types this would be when the table uses a type as its column.
The proactive distribution is suitable for situations where it is benificial to have the type on the worker nodes directly. They can later on be used in queries where an intermediate result gets created with a cast to this type.
Just in time creation is always the last resort, you cannot create a distributed table before the type gets created. A good example use case is; you have an existing postgres server that needs to scale out. By adding the citus extension, add some nodes to the cluster, and distribute the table. The type got created before citus existed. There was no moment where citus could have propagated the creation of a type.
Proactive is almost always a good option. Types are not resource intensive objects, there is no performance overhead of having 100's of types. If you want to use them in a query to represent an intermediate result (which happens in our test suite) they just work.
There is however a moment when proactive type distribution is not beneficial; in transactions where the type is used in a distributed table.
Lets assume the following transaction:
```sql
BEGIN;
CREATE TYPE tt1 AS (a int, b int);
CREATE TABLE t1 AS (a int PRIMARY KEY, b tt1);
SELECT create_distributed_table('t1', 'a');
\copy t1 FROM bigdata.csv
```
Types are node scoped objects; meaning the type exists once per worker. Shards however have best performance when they are created over their own connection. For the type to be visible on all connections it needs to be created and committed before we try to create the shards. Here the just in time situation is most beneficial and follows how we create schema's on the workers. Outside of a transaction block we will just use 1 connection to propagate the creation.
How propagation works
=================
Just in time
-----------
Just in time propagation hooks into the infrastructure introduced in #2882. It adds types as a supported object in `SupportedDependencyByCitus`. This will make sure that any object being distributed by citus that depends on types will now cascade into types. When types are depending them self on other objects they will get created first.
Creation later works by getting the ddl commands to create the object by its `ObjectAddress` in `GetDependencyCreateDDLCommands` which will dispatch types to `CreateTypeDDLCommandsIdempotent`.
For the correct walking of the graph we follow array types, when later asked for the ddl commands for array types we return `NIL` (empty list) which makes that the object will not be recorded as distributed, (its an internal type, dependant on the user type).
Proactive distribution
---------------------
When the user creates a type (composite or enum) we will have a hook running in `multi_ProcessUtility` after the command has been applied locally. Running after running locally makes that we already have an `ObjectAddress` for the type. This is required to mark the type as being distributed.
Keeping the type up to date
====================
For types that are recorded in `pg_dist_object` (eg. `IsObjectDistributed` returns true for the `ObjectAddress`) we will intercept the utility commands that alter the type.
- `AlterTableStmt` with `relkind` set to `OBJECT_TYPE` encapsulate changes to the fields of a composite type.
- `DropStmt` with removeType set to `OBJECT_TYPE` encapsulate `DROP TYPE`.
- `AlterEnumStmt` encapsulates changes to enum values.
Enum types can not be changed transactionally. When the execution on a worker fails a warning will be shown to the user the propagation was incomplete due to worker communication failure. An idempotent command is shown for the user to re-execute when the worker communication is fixed.
Keeping types up to date is done via the executor. Before the statement is executed locally we create a plan on how to apply it on the workers. This plan is executed after we have applied the statement locally.
All changes to types need to be done in the same transaction for types that have already been distributed and will fail with an error if parallel queries have already been executed in the same transaction. Much like foreign keys to reference tables.
For another PR I needed to add another column which would require to add
another argument to an already 9 argument function signature. In this
case it would be a boolean flag and there were already two boolean flags
in there. In my experience it becomes really easy to mess up the order
of these flags at that point. Especially because the type system doesn't
distinguish between the 3 different booleans with completely different
meanings.
So I refactored these signatures to receive a struct containing most of
these arguments. Like that you don't mess up orderening, because the
meaning of the boolean is not order dependent but fieldname dependent.
It also makes it possible to set good shared defaults for this struct.
DESCRIPTION: Fix schema leak on CREATE INDEX statement
When a CREATE INDEX is cached between execution we might leak the schema name onto the cached statement of an earlier execution preventing the right index to be created.
Even though the cache is cleared when the search_path changes we can trigger this behaviour by having the schema already on the search path before a colliding table is created in a schema earlier on the `search_path`. When calling an unqualified create index via a function (used to trigger the caching behaviour) we see that the index is created on the wrong table after the schema leaked onto the statement.
By copying the complete `PlannedStmt` and `utilityStmt` during our planning phase for distributed ddls we make sure we are not leaking the schema name onto a cached data structure.
Caveat; COPY statements already have a lot of parsestree copying ongoing without directly putting it back on the `pstmt`. We should verify that copies modify the statement and potentially copy the complete `pstmt` there already.
/*
* local_executor.c
*
* The scope of the local execution is locally executing the queries on the
* shards. In other words, local execution does not deal with any local tables
* that are not shards on the node that the query is being executed. In that sense,
* the local executor is only triggered if the node has both the metadata and the
* shards (e.g., only Citus MX worker nodes).
*
* The goal of the local execution is to skip the unnecessary network round-trip
* happening on the node itself. Instead, identify the locally executable tasks and
* simply call PostgreSQL's planner and executor.
*
* The local executor is an extension of the adaptive executor. So, the executor uses
* adaptive executor's custom scan nodes.
*
* One thing to note that Citus MX is only supported with replication factor = 1, so
* keep that in mind while continuing the comments below.
*
* On the high level, there are 3 slightly different ways of utilizing local execution:
*
* (1) Execution of local single shard queries of a distributed table
*
* This is the simplest case. The executor kicks at the start of the adaptive
* executor, and since the query is only a single task the execution finishes
* without going to the network at all.
*
* Even if there is a transaction block (or recursively planned CTEs), as long
* as the queries hit the shards on the same, the local execution will kick in.
*
* (2) Execution of local single queries and remote multi-shard queries
*
* The rule is simple. If a transaction block starts with a local query execution,
* all the other queries in the same transaction block that touch any local shard
* have to use the local execution. Although this sounds restrictive, we prefer to
* implement in this way, otherwise we'd end-up with as complex scenarious as we
* have in the connection managements due to foreign keys.
*
* See the following example:
* BEGIN;
* -- assume that the query is executed locally
* SELECT count(*) FROM test WHERE key = 1;
*
* -- at this point, all the shards that reside on the
* -- node is executed locally one-by-one. After those finishes
* -- the remaining tasks are handled by adaptive executor
* SELECT count(*) FROM test;
*
*
* (3) Modifications of reference tables
*
* Modifications to reference tables have to be executed on all nodes. So, after the
* local execution, the adaptive executor keeps continuing the execution on the other
* nodes.
*
* Note that for read-only queries, after the local execution, there is no need to
* kick in adaptive executor.
*
* There are also few limitations/trade-offs that is worth mentioning. First, the
* local execution on multiple shards might be slow because the execution has to
* happen one task at a time (e.g., no parallelism). Second, if a transaction
* block/CTE starts with a multi-shard command, we do not use local query execution
* since local execution is sequential. Basically, we do not want to lose parallelism
* across local tasks by switching to local execution. Third, the local execution
* currently only supports queries. In other words, any utility commands like TRUNCATE,
* fails if the command is executed after a local execution inside a transaction block.
* Forth, the local execution cannot be mixed with the executors other than adaptive,
* namely task-tracker, real-time and router executors. Finally, related with the
* previous item, COPY command cannot be mixed with local execution in a transaction.
* The implication of that any part of INSERT..SELECT via coordinator cannot happen
* via the local execution.
*/
Before this patch, when a connection is lost, we'd have the following
situation:
- Pop a task execution from readyQueue
- Lost connection
- Fail the session/pool. -> This step was not acting properly
because we've popped the task, but not set to session->currentTask
yet
After the patch:
- Pop a task execution from readyQueue
- Immediately set it to session->currentTask
- Lost connection
- Fail the session/pool. -> At this step, failing the
session would trigger query failures (or failovers)
properly.
* Add creating a citus cluster script
Creating a citus cluster is automated.
Before running this script:
- Citus should be installed and its control file should be added to postgres. (make install)
- Postgres should be installed.
* Initialize upgrade test table and fill
* Finalize the layout of upgrade tests
Postgres upgrade function is added.
The newly added UDFs(citus_prepare_pg_upgrade, citus_finish_pg_upgrade) are used to
perform upgrade.
* Refactor upgrade test and add config file
* Add schedules for upgrade testing
* Use pg_regress for upgrade tests
pg_regress is used for creating a simple distributed table in
upgrade tests. After upgrading another schedule is used to verify
that the distributed table exists. Router and realtime queries are
used for verifying.
* Run upgrade tests as a postgres user in a temp dir
postgres user is used for psql to be consistent at running tests.
A temp dir is created and the temp dir's permissions are changed so
that postgres user can access it. All psql commands are now run with
postgres user.
"Select * from t" query is changed as "Select * from t order by a"
so that the result is always in the same order.
* Add docopt and arguments for the upgrade script
Docopt dependency is added to parse flags in script.
Some refactoring in variable names is done.
* Add readme for upgrade tests
* Refactor upgrade tests
Use relative data path instead of absolute assuming that this script will
always be run from 'src/test/regress'
Remove 'citus-path' flag
Use specific version for docopt instead of *
Use named args in string formatting
* Resolve a security problem
Instead of using string formatting in subprocess.call, arguments
list is used. Otherwise users could do shell injection.
Shell = True is removed from subprocess call as it is not recommended
to use this.
* Add how the test works to readme
* Refactor some variables to be consistent
* Update upgrade script based on the reviews
It was possible that postgres server would stay running even when the script
crashes, atexit library is used to ensure that we always do a teardown where we stop
the databases.
Some formatting is done in the code for better readability.
Config class is used instead of a dictonary.
A target for upgrade test is added to makefile.
Unused flags/functions/variables are removed.
* Format commands and remove unnecessary flag from readme
This is a bug that got in when we inlined the body of a function into this loop. Earlier revisions had two loops, hence a function that would be reused.
With a return instead of a continue the list of dependencies being walked is dependent on the order in which we find them in pg_depend. This became apparent during pg12 compatibility. The order of entries in pg12 was luckily different causing a random test to fail due to this return.
By changing it to a continue we only skip the entries that we don’t want to follow instead of skipping all entries that happen to be found later.
sidefix for more stable isolation tests around ensure dependency
DESCRIPTION: Refactor ensure schema exists to dependency exists
Historically we only supported schema's as table dependencies to be created on the workers before a table gets distributed. This PR puts infrastructure in place to walk pg_depend to figure out which dependencies to create on the workers. Currently only schema's are supported as objects to create before creating a table.
We also keep track of dependencies that have been created in the cluster. When we add a new node to the cluster we use this catalog to know which objects need to be created on the worker.
Side effect of knowing which objects are already distributed is that we don't have debug messages anymore when creating schema's that are already created on the workers.
* Add tuplestore helpers
* More detailed error messages in tuplestore
* Add CreateTupleDescCopy to SetupTuplestore
* Use new SetupTuplestore helper function
* Remove unnecessary copy
* Remove comment about undefined behaviour
See a9c35cf85c
clang raises a warning due to FunctionCall2InfoData technically being variable sized
This is fine, as the struct is the size we want it to be. So silence the warning
master_deactivate_node is updated to decrement the replication factor
Otherwise deactivation could have create_reference_table produce a second record
UpdateColocationGroupReplicationFactor is renamed UpdateColocationGroupReplicationFactorForReferenceTables
& the implementation looks up the record based on distributioncolumntype == InvalidOid, rather than by id
Otherwise the record's replication factor fails to be maintained when there are no reference tables
DESCRIPTION: Add functions to help with postgres upgrades
Currently there is [a list of manual steps](https://docs.citusdata.com/en/v8.2/admin_guide/upgrading_citus.html?highlight=upgrade#upgrading-postgresql-version-from-10-to-11) to perform during a postgres upgrade. These steps guarantee our catalog tables are kept and counter values are maintained across upgrades.
Having more than 1 command in our docs for users to manually execute during upgrades is error prone for both the user, and our docs. There are already 2 catalog tables that have been introduced to citus that have not been added to our docs for backing up during upgrades (`pg_authinfo` and `pg_dist_poolinfo`).
As we add more functionality to citus we run into situations where there are more steps required either before or after the upgrade. At the same time, when we move catalog tables to a place where the contents will be maintained automatically during upgrades we could have less steps in our docs. This will come to a hard to maintain matrix of citus versions and steps to be performed.
Instead we could take ownership of these steps within the extension itself. This PR introduces two new functions for the user to use instead of long lists of error prone instructions to follow.
- `citus_prepare_pg_upgrade`
This function should be called by the user right before shutting down the cluster. This will ensure all citus catalog tables are backed up in a location where the information will be retained during an upgrade.
- `citus_finish_pg_upgrade`
This function should be called right after a pg_upgrade of the cluster. This will restore the catalog tables to the state before the upgrade happend.
Both functions need to be executed both on the coordinator and on all the workers, in the same fashion our current documentation instructs to do.
There are two known problems with this function in its current form, which is also a problem with our docs. We should schedule time in the future to improve on this, but having it automated now is better as we are about to add extra steps to take after upgrades.
- When you install citus in a clean cluster we do enable ssl for communication between the coordinator and the workers. If an upgrade to a clean cluster is performed we do not setup ssl on the new cluster causing the communication to fail.
- There are no automated tests added in this PR to execute an upgrade test durning every build.
Our current test infrastructure does not allow for 2 versions of postgres to exist in the same environment. We will need to invest time to create a new testing harness that could run the following scenario:
1. Create cluster
2. Run extensible scripts to execute arbitrary statements on this cluster
3. Perform an upgrade by preparing, upgrading and finishing
4. Run extensible scripts to verify all objects created by earlier scripts exists in correct form in the upgraded cluster
Given the non trivial amount of work involved for such a suite I'd like to land this before we have
automated testing.
On a side note; As the reviewer noticed, the tables created in the public namespace are not visible in `psql` with `\d`. The backup catalog tables have the same name as the tables in `pg_catalog`. Due to postgres internals `pg_catalog` is first in the search path and therefore the non-qualified name would alwasy resolve to `pg_catalog.pg_dist_*`. Internally this is called a non-visible table as it would resolve to a different table without a qualified name. Only visible tables are shown with `\d`.
Before this commit, we've recorded the relation accesses in 3 different
places
- FindPlacementListConnection -- applies all executor in tx block
- StartPlacementExecutionOnSession() -- adaptive executor only
- StartPlacementListConnection() -- router/real-time only
This is different than Citus 8.2, and could lead to query execution times
increase considerably on multi-shard commands in transaction block
that are on partitioned tables.
Benchmarks:
```
1+8 c5.4xlarge cluster
Empty distributed partitioned table with 365 partitions: https://gist.github.com/onderkalaci/1edace4ed6bd6f061c8a15594865bb51#file-partitions_365-sql
./pgbench -f /tmp/multi_shard.sql -c10 -j10 -P 1 -T 120 postgres://citus:w3r6KLJpv3mxe9E-NIUeJw@c.fy5fkjcv45vcepaogqcaskmmkee.db.citusdata.com:5432/citus?sslmode=require
cat /tmp/multi_shard.sql
BEGIN;
DELETE FROM collections_list;
DELETE FROM collections_list;
DELETE FROM collections_list;
COMMIT;
cat /tmp/single_shard.sql
BEGIN;
DELETE FROM collections_list WHERE key = :aid;
DELETE FROM collections_list WHERE key = :aid;
DELETE FROM collections_list WHERE key = :aid;
COMMIT;
cat /tmp/mix.sql
BEGIN;
DELETE FROM collections_list WHERE key = :aid;
DELETE FROM collections_list WHERE key = :aid;
DELETE FROM collections_list WHERE key = :aid;
DELETE FROM collections_list;
DELETE FROM collections_list;
DELETE FROM collections_list;
COMMIT;
```
The table shows `latency average` of pgbench runs explained above, so we have a pretty solid improvement even over 8.2.2.
| Test | Citus 8.2.2 | Citus 8.3.1 | Citus 8.3.2 (this branch) | Citus 8.3.1 (FKEYs disabled via GUC) |
| ------------- | ------------- | ------------- |------------- | ------------- |
|multi_shard | 2370.083 ms |3605.040 ms |1324.094 ms |1247.255 ms |
| single_shard | 85.338 ms |120.934 ms |73.216 ms | 78.765 ms |
| mix | 2434.459 ms | 3727.080 ms |1306.456 ms | 1280.326 ms |
This causes no behaviorial changes, only organizes better to implement modifying CTEs
Also rename ExtactInsertRangeTableEntry to ExtractResultRelationRTE,
as the source of this function didn't match the documentation
Remove Task's upsertQuery in favor of ROW_MODIFY_NONCOMMUTATIVE
Split up AcquireExecutorShardLock into more internal functions
Tests: Normalize multi_reference_table multi_create_table_constraints
Also automated all manual tests around multi user isolation for internal citus udf's
automate upgrade_to_reference_table tests
add negative tests for lock_relation_if_exists
add tests for permissions on worker_cleanup_job_schema_cache
add tests for worker_fetch_partition_file
add tests for worker_merge_files_into_table
fix problem with worker_merge_files_and_run_query when run as non-super user and add tests for behaviour
With this commit, we're introducing the Adaptive Executor.
The commit message consists of two distinct sections. The first part explains
how the executor works. The second part consists of the commit messages of
the individual smaller commits that resulted in this commit. The readers
can search for the each of the smaller commit messages on
https://github.com/citusdata/citus and can learn more about the history
of the change.
/*-------------------------------------------------------------------------
*
* adaptive_executor.c
*
* The adaptive executor executes a list of tasks (queries on shards) over
* a connection pool per worker node. The results of the queries, if any,
* are written to a tuple store.
*
* The concepts in the executor are modelled in a set of structs:
*
* - DistributedExecution:
* Execution of a Task list over a set of WorkerPools.
* - WorkerPool
* Pool of WorkerSessions for the same worker which opportunistically
* executes "unassigned" tasks from a queue.
* - WorkerSession:
* Connection to a worker that is used to execute "assigned" tasks
* from a queue and may execute unasssigned tasks from the WorkerPool.
* - ShardCommandExecution:
* Execution of a Task across a list of placements.
* - TaskPlacementExecution:
* Execution of a Task on a specific placement.
* Used in the WorkerPool and WorkerSession queues.
*
* Every connection pool (WorkerPool) and every connection (WorkerSession)
* have a queue of tasks that are ready to execute (readyTaskQueue) and a
* queue/set of pending tasks that may become ready later in the execution
* (pendingTaskQueue). The tasks are wrapped in a ShardCommandExecution,
* which keeps track of the state of execution and is referenced from a
* TaskPlacementExecution, which is the data structure that is actually
* added to the queues and describes the state of the execution of a task
* on a particular worker node.
*
* When the task list is part of a bigger distributed transaction, the
* shards that are accessed or modified by the task may have already been
* accessed earlier in the transaction. We need to make sure we use the
* same connection since it may hold relevant locks or have uncommitted
* writes. In that case we "assign" the task to a connection by adding
* it to the task queue of specific connection (in
* AssignTasksToConnections). Otherwise we consider the task unassigned
* and add it to the task queue of a worker pool, which means that it
* can be executed over any connection in the pool.
*
* A task may be executed on multiple placements in case of a reference
* table or a replicated distributed table. Depending on the type of
* task, it may not be ready to be executed on a worker node immediately.
* For instance, INSERTs on a reference table are executed serially across
* placements to avoid deadlocks when concurrent INSERTs take conflicting
* locks. At the beginning, only the "first" placement is ready to execute
* and therefore added to the readyTaskQueue in the pool or connection.
* The remaining placements are added to the pendingTaskQueue. Once
* execution on the first placement is done the second placement moves
* from pendingTaskQueue to readyTaskQueue. The same approach is used to
* fail over read-only tasks to another placement.
*
* Once all the tasks are added to a queue, the main loop in
* RunDistributedExecution repeatedly does the following:
*
* For each pool:
* - ManageWorkPool evaluates whether to open additional connections
* based on the number unassigned tasks that are ready to execute
* and the targetPoolSize of the execution.
*
* Poll all connections:
* - We use a WaitEventSet that contains all (non-failed) connections
* and is rebuilt whenever the set of active connections or any of
* their wait flags change.
*
* We almost always check for WL_SOCKET_READABLE because a session
* can emit notices at any time during execution, but it will only
* wake up WaitEventSetWait when there are actual bytes to read.
*
* We check for WL_SOCKET_WRITEABLE just after sending bytes in case
* there is not enough space in the TCP buffer. Since a socket is
* almost always writable we also use WL_SOCKET_WRITEABLE as a
* mechanism to wake up WaitEventSetWait for non-I/O events, e.g.
* when a task moves from pending to ready.
*
* For each connection that is ready:
* - ConnectionStateMachine handles connection establishment and failure
* as well as command execution via TransactionStateMachine.
*
* When a connection is ready to execute a new task, it first checks its
* own readyTaskQueue and otherwise takes a task from the worker pool's
* readyTaskQueue (on a first-come-first-serve basis).
*
* In cases where the tasks finish quickly (e.g. <1ms), a single
* connection will often be sufficient to finish all tasks. It is
* therefore not necessary that all connections are established
* successfully or open a transaction (which may be blocked by an
* intermediate pgbouncer in transaction pooling mode). It is therefore
* essential that we take a task from the queue only after opening a
* transaction block.
*
* When a command on a worker finishes or the connection is lost, we call
* PlacementExecutionDone, which then updates the state of the task
* based on whether we need to run it on other placements. When a
* connection fails or all connections to a worker fail, we also call
* PlacementExecutionDone for all queued tasks to try the next placement
* and, if necessary, mark shard placements as inactive. If a task fails
* to execute on all placements, the execution fails and the distributed
* transaction rolls back.
*
* For multi-row INSERTs, tasks are executed sequentially by
* SequentialRunDistributedExecution instead of in parallel, which allows
* a high degree of concurrency without high risk of deadlocks.
* Conversely, multi-row UPDATE/DELETE/DDL commands take aggressive locks
* which forbids concurrency, but allows parallelism without high risk
* of deadlocks. Note that this is unrelated to SEQUENTIAL_CONNECTION,
* which indicates that we should use at most one connection per node, but
* can run tasks in parallel across nodes. This is used when there are
* writes to a reference table that has foreign keys from a distributed
* table.
*
* Execution finishes when all tasks are done, the query errors out, or
* the user cancels the query.
*
*-------------------------------------------------------------------------
*/
All the commits involved here:
* Initial unified executor prototype
* Latest changes
* Fix rebase conflicts to master branch
* Add missing variable for assertion
* Ensure that master_modify_multiple_shards() returns the affectedTupleCount
* Adjust intermediate result sizes
The real-time executor uses COPY command to get the results
from the worker nodes. Unified executor avoids that which
results in less data transfer. Simply adjust the tests to lower
sizes.
* Force one connection per placement (or co-located placements) when requested
The existing executors (real-time and router) always open 1 connection per
placement when parallel execution is requested.
That might be useful under certain circumstances:
(a) User wants to utilize as much as CPUs on the workers per
distributed query
(b) User has a transaction block which involves COPY command
Also, lots of regression tests rely on this execution semantics.
So, we'd enable few of the tests with this change as well.
* For parameters to be resolved before using them
For the details, see PostgreSQL's copyParamList()
* Unified executor sorts the returning output
* Ensure that unified executor doesn't ignore sequential execution of DDLJob's
Certain DDL commands, mainly creating foreign keys to reference tables,
should be executed sequentially. Otherwise, we'd end up with a self
distributed deadlock.
To overcome this situaiton, we set a flag `DDLJob->executeSequentially`
and execute it sequentially. Note that we have to do this because
the command might not be called within a transaction block, and
we cannot call `SetLocalMultiShardModifyModeToSequential()`.
This fixes at least two test: multi_insert_select_on_conflit.sql and
multi_foreign_key.sql
Also, I wouldn't mind scattering local `targetPoolSize` variables within
the code. The reason is that we'll soon have a GUC (or a global
variable based on a GUC) that'd set the pool size. In that case, we'd
simply replace `targetPoolSize` with the global variables.
* Fix 2PC conditions for DDL tasks
* Improve closing connections that are not fully established in unified execution
* Support foreign keys to reference tables in unified executor
The idea for supporting foreign keys to reference tables is simple:
Keep track of the relation accesses within a transaction block.
- If a parallel access happens on a distributed table which
has a foreign key to a reference table, one cannot modify
the reference table in the same transaction. Otherwise,
we're very likely to end-up with a self-distributed deadlock.
- If an access to a reference table happens, and then a parallel
access to a distributed table (which has a fkey to the reference
table) happens, we switch to sequential mode.
Unified executor misses the function calls that marks the relation
accesses during the execution. Thus, simply add the necessary calls
and let the logic kick in.
* Make sure to close the failed connections after the execution
* Improve comments
* Fix savepoints in unified executor.
* Rebuild the WaitEventSet only when necessary
* Unclaim connections on all errors.
* Improve failure handling for unified executor
- Implement the notion of errorOnAnyFailure. This is similar to
Critical Connections that the connection managament APIs provide
- If the nodes inside a modifying transaction expand, activate 2PC
- Fix few bugs related to wait event sets
- Mark placement INACTIVE during the execution as much as possible
as opposed to we do in the COMMIT handler
- Fix few bugs related to scheduling next placement executions
- Improve decision on when to use 2PC
Improve the logic to start a transaction block for distributed transactions
- Make sure that only reference table modifications are always
executed with distributed transactions
- Make sure that stored procedures and functions are executed
with distributed transactions
* Move waitEventSet to DistributedExecution
This could also be local to RunDistributedExecution(), but in that case
we had to mark it as "volatile" to avoid PG_TRY()/PG_CATCH() issues, and
cast it to non-volatile when doing WaitEventSetFree(). We thought that
would make code a bit harder to read than making this non-local, so we
move it here. See comments for PG_TRY() in postgres/src/include/elog.h
and "man 3 siglongjmp" for more context.
* Fix multi_insert_select test outputs
Two things:
1) One complex transaction block is now supported. Simply update
the test output
2) Due to dynamic nature of the unified executor, the orders of
the errors coming from the shards might change (e.g., all of
the queries on the shards would fail, but which one appears
on the error message?). To fix that, we simply added it to
our shardId normalization tool which happens just before diff.
* Fix subeury_and_cte test
The error message is updated from:
failed to execute task
To:
more than one row returned by a subquery or an expression
which is a lot clearer to the user.
* Fix intermediate_results test outputs
Simply update the error message from:
could not receive query results
to
result "squares" does not exist
which makes a lot more sense.
* Fix multi_function_in_join test
The error messages update from:
Failed to execute task XXX
To:
function f(..) does not exist
* Fix multi_query_directory_cleanup test
The unified executor does not create any intermediate files.
* Fix with_transactions test
A test case that just started to work fine
* Fix multi_router_planner test outputs
The error message is update from:
Could not receive query results
To:
Relation does not exists
which is a lot more clearer for the users
* Fix multi_router_planner_fast_path test
The error message is update from:
Could not receive query results
To:
Relation does not exists
which is a lot more clearer for the users
* Fix isolation_copy_placement_vs_modification by disabling select_opens_transaction_block
* Fix ordering in isolation_multi_shard_modify_vs_all
* Add executor locks to unified executor
* Make sure to allocate enought WaitEvents
The previous code was missing the waitEvents for the latch and
postmaster death.
* Fix rebase conflicts for master rebase
* Make sure that TRUNCATE relies on unified executor
* Implement true sequential execution for multi-row INSERTS
Execute the individual tasks executed one by one. Note that this is different than
MultiShardConnectionType == SEQUENTIAL_CONNECTION case (e.g., sequential execution
mode). In that case, running the tasks across the nodes in parallel is acceptable
and implemented in that way.
However, the executions that are qualified here would perform poorly if the
tasks across the workers are executed in parallel. We currently qualify only
one class of distributed queries here, multi-row INSERTs. If we do not enforce
true sequential execution, concurrent multi-row upserts could easily form
a distributed deadlock when the upserts touch the same rows.
* Remove SESSION_LIFESPAN flag in unified_executor
* Apply failure test updates
We've changed the failure behaviour a bit, and also the error messages
that show up to the user. This PR covers majority of the updates.
* Unified executor honors citus.node_connection_timeout
With this commit, unified executor errors out if even
a single connection cannot be established within
citus.node_connection_timeout.
And, as a side effect this fixes failure_connection_establishment
test.
* Properly increment/decrement pool size variables
Before this commit, the idle and active connection
counts were not properly calculated.
* insert_select_executor goes through unified executor.
* Add missing file for task tracker
* Modify ExecuteTaskListExtended()'s signature
* Sort output of INSERT ... SELECT ... RETURNING
* Take partition locks correctly in unified executor
* Alternative implementation for force_max_query_parallelization
* Fix compile warnings in unified executor
* Fix style issues
* Decrement idleConnectionCount when idle connection is lost
* Always rebuild the wait event sets
In the previous implementation, on waitFlag changes, we were only
modifying the wait events. However, we've realized that it might
be an over optimization since (a) we couldn't see any performance
benefits (b) we see some errors on failures and because of (a)
we prefer to disable it now.
* Make sure to allocate enough sized waitEventSet
With multi-row INSERTs, we might have more sessions than
task*workerCount after few calls of RunDistributedExecution()
because the previous sessions would also be alive.
Instead, re-allocate events when the connectino set changes.
* Implement SELECT FOR UPDATE on reference tables
On master branch, we do two extra things on SELECT FOR UPDATE
queries on reference tables:
- Acquire executor locks
- Execute the query on all replicas
With this commit, we're implementing the same logic on the
new executor.
* SELECT FOR UPDATE opens transaction block even if SelectOpensTransactionBlock disabled
Otherwise, users would be very confused and their logic is very likely
to break.
* Fix build error
* Fix the newConnectionCount calculation in ManageWorkerPool
* Fix rebase conflicts
* Fix minor test output differences
* Fix citus indent
* Remove duplicate sorts that is added with rebase
* Create distributed table via executor
* Fix wait flags in CheckConnectionReady
* failure_savepoints output for unified executor.
* failure_vacuum output (pg 10) for unified executor.
* Fix WaitEventSetWait timeout in unified executor
* Stabilize failure_truncate test output
* Add an ORDER BY to multi_upsert
* Fix regression test outputs after rebase to master
* Add executor.c comment
* Rename executor.c to adaptive_executor.c
* Do not schedule tasks if the failed placement is not ready to execute
Before the commit, we were blindly scheduling the next placement executions
even if the failed placement is not on the ready queue. Now, we're ensuring
that if failed placement execution is on a failed pool or session where the
execution is on the pendingQueue, we do not schedule the next task. Because
the other placement execution should be already running.
* Implement a proper custom scan node for adaptive executor
- Switch between the executors, add GUC to set the pool size
- Add non-adaptive regression test suites
- Enable CIRCLE CI for non-adaptive tests
- Adjust test output files
* Add slow start interval to the executor
* Expose max_cached_connection_per_worker to user
* Do not start slow when there are cached connections
* Consider ExecutorSlowStartInterval in NextEventTimeout
* Fix memory issues with ReceiveResults().
* Disable executor via TaskExecutorType
* Make sure to execute the tests with the other executor
* Use task_executor_type to enable-disable adaptive executor
* Remove useless code
* Adjust the regression tests
* Add slow start regression test
* Rebase to master
* Fix test failures in adaptive executor.
* Rebase to master - 2
* Improve comments & debug messages
* Set force_max_query_parallelization in isolation_citus_dist_activity
* Force max parallelization for creating shards when asked to use exclusive connection.
* Adjust the default pool size
* Expand description of max_adaptive_executor_pool_size GUC
* Update warnings in FinishRemoteTransactionCommit()
* Improve session clean up at the end of execution
Explicitly list all the states that the execution might end,
otherwise warn.
* Remove MULTI_CONNECTION_WAIT_RETRY which is not used at all
* Add more ORDER BYs to multi_mx_partitioning
- All the schema creations on the workers will now be via superuser connections
- If a shard is being repaired or a shard is replicated, we will create the
schema only in the relevant worker; and in all the other cases where a schema
creation is needed, we will block operations until we ensure the schema exists
in all the workers
It has been reported a null pointer dereference could be triggered in FreeConnParamsHashEntryFields. Likely cause is an error in GetConnParams which will leave the cached ConnParamsHashEntry in a state that would cause the null pointer dereference in a subsequent connection establishment to the same server. This has been simulated by inserting ereport(ERROR, ...) at certain places in the code.
Not only would ConnParamsHashEntry be in a state that would cause a crash, it was also leaking memory in the ConnectionContext due to the loss of pointers as they are only stored on the ConnParamsHashEntry at the end of the function.
This patch rewrites both the GetConnParams to store pointers 'durably' at every point in the code so that an error would not lose the pointer as well as FreeConnParamsHashEntryFields in a way that it can clear half initialised ConnParamsHashEntry's in a safer manner.
GRANT queries are propagated on Enterprise. If a user attempts to
create a user and run a GRANT query before creating it on workers, we
fail. This issue does not happen in community as the user needs to run
the GRANTs on the workers manually.
When `master_update_node` is called to update a node's location it waits for appropriate locks to become available. This is useful during normal operation as new operations will be blocked till after the metadata update while running operations have time to finish.
When `master_update_node` is called after a node failure it is less useful to wait for running operations to finish as they can't. The lock being held indicates an operation that once attempted to commit will fail as the machine already failed. Now the downside is the failover is postponed till the termination point of the operation. This has been observed by users to take a significant amount of time causing the rest of the system to be observed unavailable.
With this patch it is possible in such situations to invoke `master_update_node` with 2 optional arguments:
- `force` (bool defaults to `false`): When called with true the update of the metadata will be forced to proceed by terminating conflicting backends. A cancel is not enough as the backend might be in idle time (eg. an interactive session, or going back and forth between an appliaction), therefore a more intrusive solution of termination is used here.
- `lock_cooldown` (int defaults to `10000`): This is the time in milliseconds before conflicting backends are terminated. This is to allow the backends to finish cleanly before terminating them. This allows the user to set an upperbound to the expected time to complete the metadata update, eg. performing the failover.
The functionality is implemented by spawning a background worker that has the task of helping a certain backend in acquiring its locks. The backend is either terminated on successful execution of the metadata update, or once the memory context of the expression gets reset, eg. on a cancel of the statement.
Adds support for propagation of SET LOCAL commands to all workers
involved in a query. For now, SET SESSION (i.e. plain SET) is not
supported whatsoever, though this code is intended as somewhat of a
base for implementing such support in the future.
As SET LOCAL modifications are scoped to the body of a BEGIN/END xact
block, queries wishing to use SET LOCAL propagation must be within such
a block. In addition, subsequent modifications after e.g. any SAVEPOINT
or ROLLBACK statements will correspondingly push or pop variable mod-
ifications onto an internal stack such that the behavior of changed
values across the cluster will be identical to such behavior on e.g.
single-node PostgreSQL (or equivalently, what values are visible to
the end user by running SHOW on such variables on the coordinator).
If nodes enter the set of participants at some point after SET LOCAL
modifications (or SAVEPOINT, ROLLBACK, etc.) have occurred, the SET
variable state is eagerly propagated to them upon their entrance (this
is identical to, and indeed just augments, the existing logic for the
propagation of the SAVEPOINT "stack").
A new GUC (citus.propagate_set_commands) has been added to control this
behavior. Though the code suggests the valid settings are 'none', 'local',
'session', and 'all', only 'none' (the default) and 'local' are presently
implemented: attempting to use other values will result in an error.
This is a preperation for the new executor, where creating shards
would go through the executor. So, explicitly generate the commands
for further processing.
If a query is router executable, it hits a single shard and therefore has a
single task associated with it. Therefore there is no need to sort the task list
that has a single element.
Also we already have a list of active shard placements, sending it in param
and reuse it.
If replication factor eqauls to 2 and there are two worker nodes,
even if two modifications hit different shards, Citus doesn't use
2PC. The reason is that it doesn't fit into the definition of
"expanding participating worker nodes".
Thus, we're simply fixing the test to fit in the comment
on top of it.
InitializeCaches() method may prematurely set
performedInitialization without actually creating
DistShardCacheHash.
Fix makes sure flag is set only if DistShardCacheHash is created successfully.
Also introduced a new memory context to allocate aforementioned hash tables.
If allocation/initialization fails for any reason we make sure
memory is reclaimed by deleting the memory context.
Instead of scattering the code around, we move all the
logic into a single function.
This will help supporting foreign keys to reference tables
in the unified executor with a single line of change, just
calling this function.
The feature is only intended for getting consistent outputs for the regression tests.
RETURNING does not have any ordering gurantees and with unified executor, the ordering
of query executions on the shards are also becoming unpredictable. Thus, we're enforcing
ordering when a GUC is set.
We implicitly add an `ORDER BY` something equivalent of
`
RETURNING expr1, expr2, .. ,exprN
ORDER BY expr1, expr2, .. ,exprN
`
As described in the code comments as well, this is probably not the most
performant approach we could implement. However, since we're only
targeting regression tests, I don't see any issues with that. If we
decide to expand this to a feature to users, we should revisit the
implementation and improve the performance.
This commit has two goals:
(a) Ensure to access both edges of the allocated stack
(b) Ensure that any compiler optimizations to prevent the
function optimized away.
Stack size after the patch:
sudo grep -A 1 stack /proc/2119/smaps
7ffe305a6000-7ffe307a9000 rw-p 00000000 00:00 0 [stack]
Size: 2060 kB
Stack size before the patch:
sudo grep -A 1 stack /proc/3610/smaps
7fff09957000-7fff09978000 rw-p 00000000 00:00 0 [stack]
Size: 132 kB
We used to rely on PG function flatten_join_alias_vars
to resolve actual columns referenced in target entry list.
The function goes deep and finds the actual relation. This logic
usually works fine. However, when joins are given an alias, inner
relation names are not visible to target entry entry. Thus relation
resolving should stop when we the target entry column refers an
rte of an aliased join.
We stopped using PG function and provided our own flatten function.
Our assumption that strip_implicit_coercions would leave us with a bi-
nary-compatible type to that of the partition key was wrong. Instead,
we should ensure the RHS of the comparison we perform is proactively
coerced into a compatible type (at least binary compatible).
At configuration reload, we free all "global" (i.e. GUC-set) connection
parameters, but these may still have live references in the connection
parameters hash. By marking the entries as invalid, we can ensure they
will not be used after free.
Having DATA-segment string literals made blindly freeing the keywords/
values difficult, so I've switched to allocating all in the provided
context; because of this (and with the knowledge of the end point of
the global parameters), we can safely pfree non-global parameters when
we come across an invalid connection parameter entry.
Do it in two ways (a) re-use the rte list as much as possible instead of
re-calculating over and over again (b) Limit the recursion to the relevant
parts of the query tree
Before this commit, shardPlacements were identified with shardId, nodeName
and nodeport. Instead of using nodeName and nodePort, we now use nodeId
since it apparently has performance benefits in several places in the
code.
The rule for infinite recursion is the following:
- If the query contains a subquery which is recursively planned, and
no other subqueries can be recursively planned due to correlation
(e.g., LATERAL joins), the planner keeps recursing again and again.
One interesting thing here is that even if a subquery contains only intermediate
result(s), we re-recursively plan that. In the end, the logic in the code does the following:
- Try recursive planning any of the subqueries in the query tree
- If any subquery is recursively planned, call the planner again
where the subquery is replaced with the intermediate result.
- Try recursively planning any of the queries
- If any subquery is recursively planned, call the planner again
where the subquery (in this case it is already intermediate result)
is replaced with the intermediate result.
- Try recursively planning any of the queries
- If any subquery is recursively planned, call the planner again
where the subquery (in this case it is already intermediate result)
is replaced with the intermediate result.
- Try recursively planning any of the queries
- If any subquery is recursively planned, call the planner again
where the subquery (in this case it is already intermediate result)
is replaced with the intermediate result.
......
Following scenario resulted in distributed deadlock before this commit:
CREATE TABLE partitioning_test(id int, time date) PARTITION BY RANGE (time);
CREATE TABLE partitioning_test_2009 (LIKE partitioning_test);
CREATE TABLE partitioning_test_reference(id int PRIMARY KEY, subid int);
SELECT create_distributed_table('partitioning_test_2009', 'id'),
create_distributed_table('partitioning_test', 'id'),
create_reference_table('partitioning_test_reference');
ALTER TABLE partitioning_test ADD CONSTRAINT partitioning_reference_fkey FOREIGN KEY (id) REFERENCES partitioning_test_reference(id) ON DELETE CASCADE;
ALTER TABLE partitioning_test_2009 ADD CONSTRAINT partitioning_reference_fkey_2009 FOREIGN KEY (id) REFERENCES partitioning_test_reference(id) ON DELETE CASCADE;
ALTER TABLE partitioning_test ATTACH PARTITION partitioning_test_2009 FOR VALUES FROM ('2009-01-01') TO ('2010-01-01');
Since flattening query may flatten outer joins' columns into coalesce expr that is
in the USING part, and that was not expected before this commit, these queries were
erroring out. It is fixed by this commit with considering coalesce expression as well.
We'd been ignoring updating uncrustify for some time now because I'd
thought these were misclassifications that would require an update in
our rules to address. Turns out they're legit, so I'm checking them in.
The configuration for the build is in the YAML file; the changes to the
regression runner are backward-compatible with Travis and just add the
logic to detect whether our custom (isolation- and vanilla-enabled) pkg
is present.
Before this commit, round-robin task assignment policy was relying
on the taskId. Thus, even inside a transaction, the tasks were
assigned to different nodes. This was especially problematic
while reading from reference tables within transaction blocks.
Because, we had to expand the distributed transaction to many
nodes that are not necessarily already in the distributed transaction.
In this context, we define "Fast Path Planning for SELECT" as trivial
queries where Citus can skip relying on the standard_planner() and
handle all the planning.
For router planner, standard_planner() is mostly important to generate
the necessary restriction information. Later, the restriction information
generated by the standard_planner is used to decide whether all the shards
that a distributed query touches reside on a single worker node. However,
standard_planner() does a lot of extra things such as cost estimation and
execution path generations which are completely unnecessary in the context
of distributed planning.
There are certain types of queries where Citus could skip relying on
standard_planner() to generate the restriction information. For queries
in the following format, Citus does not need any information that the
standard_planner() generates:
SELECT ... FROM single_table WHERE distribution_key = X; or
DELETE FROM single_table WHERE distribution_key = X; or
UPDATE single_table SET value_1 = value_2 + 1 WHERE distribution_key = X;
Note that the queries might not be as simple as the above such that
GROUP BY, WINDOW FUNCIONS, ORDER BY or HAVING etc. are all acceptable. The
only rule is that the query is on a single distributed (or reference) table
and there is a "distribution_key = X;" in the WHERE clause. With that, we
could use to decide the shard that a distributed query touches reside on
a worker node.
Failure&Cancellation tests for initial start_metadata_sync() calls
to worker and DDL queries that send metadata syncing messages to an MX node
Also adds message type definitions for messages that are exchanged
during metadata syncing
-
We used to error out if there is a reference table
in the query participating a union. This has caused
pushdownable queries to be evaluated in coordinator.
Now we let reference tables inside union queries as long
as there is a distributed table in from clause.
Existing join checks (reference table on the outer part)
sufficient enought that we do not need check the join relation
of reference tables.
Previously we allowed task assignment policy to have affect on router queries
with only intermediate results. However, that is erroneous since the code-path
that assigns placements relies on shardIds and placements, which doesn't exists
for intermediate results.
With this commit, we do not apply task assignment policies when a router query
hits only intermediate results.
PG recently started propagating foreign key constraints
to partition tables. This came with a select query
to validate the the constaint.
We are already setting sequential mode execution for this
command. In order for validation select query to respect
this setting we need to explicitly set the GUC.
This commit also handles detach partition part.
We update column attributes of various clauses for a query
inluding target columns, select clauses when we introduce
new range table entries in the query.
It seems having clause column attributes were not updated.
This fix resolves the issue
We had recently fixed a spinlock issue due to functions
failing, but spinlock is not being released.
This is the continuation of that work to eliminate possible
regression of the issue. Function calls that are moved out of
spinlock scope are macros and plain type casting. However,
depending on the configuration they have an alternate implementation
in PG source that performs memory allocation.
This commit moves last bit of codes to out of spinlock for completion purposes.
We were establishing connections synchronously. Establishing
connections asynchronously results in some parallelization, saving
hundreds of milliseconds.
In a test I did, this decreased the query time from 150ms to 40ms.
A spinlock is not released when an exception is thrown after
spinlock is acquired. This has caused infinite wait and eventual
crash in maintenance daemon.
This work moves the code than can fail to the outside of spinlock
scope so that in the case of failure spinlock is not left locked
since it was not locked in the first place.
Postgresql loads shared libraries before calculating MaxBackends.
However, Citus relies on MaxBackends being set. Thus, with this
commit we use the same steps to calculate MaxBackends while
Citus is being loaded (e.g., PG_Init is called).
Note that this is safe since all the elements that are used to
calculate MaxBackends are PGC_POSTMASTER gucs and a constant
value.
We disable bunch of planning options on the workers. This might be
risky if any concurrent test relies on EXPLAIN OUTPUT as well. Still,
we want to keep this test, so we should try to not parallelize this
test with such test.
Before this commit, Citus supported INSERT...SELECT queries with
ON CONFLICT or RETURNING clauses only for pushdownable ones, since
queries supported via coordinator were utilizing COPY infrastructure
of PG to send selected tuples to the target worker nodes.
After this PR, INSERT...SELECT queries with ON CONFLICT or RETURNING
clauses will be performed in two phases via coordinator. In the first
phase selected tuples will be saved to the intermediate table which
is colocated with target table of the INSERT...SELECT query. Note that,
a utility function to save results to the colocated intermediate result
also implemented as a part of this commit. In the second phase, INSERT..
SELECT query is directly run on the worker node using the intermediate
table as the source table.
When initializing a Citus formation automatically from an external piece of
software such as Citus-HA, the following process process may be used:
- decide on the groupId in the external software
- SELECT * FROM master_add_inactive_node('localhost', 9701, groupid => X)
When Citus checks for maxGroupId, it forbids other software to pick their
own group Ids to ues with the master_add_inactive_node() API.
This patch removes the extra testing around maxGroupId.
Description: Support round-robin `task_assignment_policy` for queries to reference tables.
This PR allows users to query multiple placements of shards in a round robin fashion. When `citus.task_assignment_policy` is set to `'round-robin'` the planner will use a round robin scheduling feature when multiple shard placements are available.
The primary use-case is spreading the load of reference table queries to all the nodes in the cluster instead of hammering only the first placement of the reference table. Since reference tables share the same path for selecting the shards with single shard queries that have multiple placements (`citus.shard_replication_factor > 1`) this setting also allows users to spread the query load on these shards.
For modifying queries we do not apply a round-robin strategy. This would be negated by an extra reordering step in the executor for such queries where a `first-replica` strategy is enforced.
The file handling the utility functions (DDL) for citus organically grew over time and became unreasonably large. This refactor takes that file and refactored the functionality into separate files per command. Initially modeled after the directory and file layout that can be found in postgres.
Although the size of the change is quite big there are barely any code changes. Only one two functions have been added for readability purposes:
- PostProcessIndexStmt which is extracted from PostProcessUtility
- PostProcessAlterTableStmt which is extracted from multi_ProcessUtility
A README.md has been added to `src/backend/distributed/commands` describing the contents of the module and every file in the module.
We need more documentation around the overloading of the COPY command, for now the boilerplate has been added for people with better knowledge to fill out.
Each PostgreSQL backend starts with a predefined amount of stack and this stack
size can be increased if there is a need. However, stack size increase during
high memory load may cause unexpected crashes, because if there is not enough
memory for stack size increase, there is nothing to do for process apart from
crashing. An interesting thing is; the process would get OOM error instead of
crash, if the process had an explicit memory request (with palloc) for example.
However, in the case of stack size increase, there is no system call to get OOM
error, so the process simply crashes.
With this change, we are increasing the stack size explicitly by requesting extra
memory from the stack, so that, even if there is not memory, we can at least get
an OOM instead of a crash.
In recent postgres builds you cannot set client_min_messages to
values higher then ERROR, if will silently set it to ERROR if so.
During some tests we would set it to fatal to hide random values
(eg. pid's of processes) from the test output. This patch will use
different tactics for hiding these values.
After Fast ALTER TABLE ADD COLUMN with a non-NULL default in PG11, physical heaps might not contain all attributes after a ALTER TABLE ADD COLUMN happens. heap_getattr() returns NULL when the physical tuple doesn't contain an attribute. So we should use heap_deform_tuple() in these cases, which fills in the missing attributes.
Our catalog tables evolve over time, and an upgrade might involve some ALTER TABLE ADD COLUMN commands.
Note that we don't need to worry about postgres catalog tables and we can use heap_getattr() for them, because they only change between major versions.
This also fixes#2453.
PG 11 has change the way that PARAM_EXTERN is processed.
This commit ensures that Citus follows the same pattern.
For details see the related Postgres commit:
6719b238e8
Assign the distributed transaction id before trying to acquire the
executor advisory locks. This is useful to show this backend in citus
lock graphs (e.g., dump_global_wait_edges() and citus_lock_waits).
I'm pretty sure a lot of this test functionality may be covered in some
of our existing regression tests, but I've included them to ensure we
put all failure-based tests under our new testing method for that kind
of test.
Didn't include lower replication factor, as (for a single-shard mod.),
it's indistinguishable from modifying a reference table. So these all
test modifications which hit a single, replicated shard.
We made PG11 builds optional when we had an issue
with mx isolation test that we could not solve back then.
This commit solves the issue with a workaround by running
start_metadata_sync_to_node outside the transaction block.
Both of these are a bit of a shot in the dark. In one case, we noticed
a stack trace where a caller received a null pointer and attempted to
dereference the memory context field (at 0x010). In the other, I saw
that any error thrown from within AdjustParseTree could keep the stack
from being cleaned up (presumably if we push we should always pop).
Both stack traces were collected during times of high memory pressure
and locally reproducing the problem locally or otherwise has been very
tricky (i.e. it hasn't been reproduced reliably at all).
* Keep track of cached entries in case of interruption.
Previously we set DistTableCacheEntry->sortedShardIntervalArray
and DistTableCacheEntry->shardIntervalArrayLength after we entered
all related shard entries into DistShardCacheHash. The drawback was
that if populating DistShardCacheHash was interrupted,
ResetDistTableCacheEntry() didn't see the shard hash entries created,
so was unable to clean them up.
This patch fixes that by setting sortedShardIntervalArray earlier,
and incrementing shardIntervalArrayLength as we enter shards into
the cache.
Fairly straightforward; verified that modifications fail atomically if
a worker is down or fails mid-transaction (i.e. all workers need to ack
modifications to reference tables in order to persist changes).
Including several examples from #1926. I couldn't understand why the
recover_prepared_transactions "should be an error", and EXPLAIN has
changed since the original bug (so that it runs EXPLAINs in txns, I
think for EXPLAIN ANALYZE to not have side effects); other than that,
most of the reported bugs now error out rather than crash or return
an empty result set.
VACUUM runs outside of a transaction, so the failure modes for it are
somewhat straightforward, though ANALYZE runs in a 1pc transaction and
multi-table VACUUM can fail between statements (PG 11 and higher).
Tests various failure points during a multi-shard modification within
a transaction with multiple statements. Verifies three cases:
* Reference tables (single shard, many placements)
* Normal table with replication factor two
* Multi-shard table with no replication
In the replication-factor case, we expect shard health to be affected
in some transactions; most others fail the transaction entirely and
all we need verify is that no effects of the transaction are visible.
Had trouble testing the final PREPARE/COMMIT/ROLLBACK phase of the 2pc,
in particular because the error message produced includes the PID of
the backend, which is unpredictable.
Drop schema command fails in mx mode if there
is a partitioned table with active partitions.
This is due to fact that sql drop trigger receives
all the dropped objects including partitions. When
we call drop table on parent partition, it also drops
the partitions on the mx node. This causes the drop
table command on partitions to fail on mx node because
they are already dropped when the partition parent was
dropped.
With this work we did not require the table to exist on
worker_drop_distributed_table.
PG now allows foreign keys on partitioned tables.
Each foreign key constraint on partitioned table
is propagated down to partitions.
We used to create all constraints on shards when we are creating
a new shard, or when just simply moving a shard from one worker
to another. We also used the same logic when creating a copy of
coordinator table in mx node.
With this change we create the constraint on worker node only if
it is not an inherited constraint.
We used to set the execution mode in the truncate trigger. However,
when multiple tables are truncated with a single command, we could
set the execution mode very late. Instead, now set the execution mode
on the utility hook.
By setting the CPU tuple cost so high, we were triggering JIT. Instead,
we should use parallel_tuple_cost.
See: rhaas.blogspot.com/2018/06/using-forceparallelmode-correctly.html
This reverts commit a2fb5a84f1.
JIT wasn't actually interfering with the operation of Citus, a test was
just written in a way which caused JIT to run for a function on every
row in a 150k-row table.
With this commit, we all partitioned distributed tables with
replication factor > 1. However, we also have many restrictions.
In summary, we disallow all kinds of modifications (including DDLs)
on the partition tables. Instead, the user is allowed to run the
modifications over the parent table.
The necessity for such a restriction have two aspects:
- We need to acquire shard resource locks appropriately
- We need to handle marking partitions INVALID in case
of any failures. Note that, in theory, the parent table
should also become INVALID, which is too aggressive.
We acquire distributed lock on all mx nodes for truncated
tables before actually doing truncate operation.
This is needed for distributed serialization of the truncate
command without causing a deadlock.
Reason for the failure is that PG11 introduced a new relation kind
RELKIND_PARTITIONED_INDEX to be used for partitioned indices.
We expanded our check to cover that case.
This commit uses *_walker instead of *_mutator for performance reasons.
Given that we're only updating a functionId in the tree, the approach
seems fine.
In case a failure happens when a transaction is failed on PREPARE,
we used to hit an assertion for ensuring there is no pending
activity on the connection. However, that's not true after the
changes in #2031. Thus, we've replaced the assertion with a more
generic function call to consume any pending activity, if exists.
In the distributed deadlock detection design, we concluded that prepared transactions
cannot be part of a distributed deadlock. The idea is that (a) when the transaction
is prepared it already acquires all the locks, so cannot be part of a deadlock
(b) even if some other processes blocked on the prepared transaction, prepared transactions
would eventually be committed (or rollbacked) and the system will continue operating.
With the above in mind, we probably had a mistake in terms of memory allocations. For each
backend initialized, we keep a `BackendData` struct. The bug we've introduced is that, we
assumed there would only be `MaxBackend` number of backends. However, `MaxBackends` doesn't
include the prepared transactions and axuliary processes. When you check Postgres' InitProcGlobal`
you'd see that `TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;`
This commit aligns with total procs processed with that.
PG11 introduced PROCEDURE concept similar to FUNCTION
Procedure's allow committing/rolling back behavior.
This commmit adds regression tests for procedure calls.
With this commit, we implement two views that are very similar
to pg_stat_activity, but showing queries that are involved in
distributed queries:
- citus_dist_stat_activity: Shows all the distributed queries
- citus_worker_stat_activity: Shows all the queries on the shards
that are initiated by distributed queries.
Both views have the same columns in the outputs. In very basic terms, both of the views
are meant to provide some useful insights about the distributed
transactions within the cluster. As the names reveal, both views are similar to pg_stat_activity.
Also note that these views can be pretty useful on Citus MX clusters.
Note that when the views are queried from the worker nodes, they'd not show the distributed
transactions that are initiated from the coordinator node. The reason is that the worker
nodes do not know the host/port of the coordinator. Thus, it is advisable to query the
views from the coordinator.
If we bucket the columns that the views returns, we'd end up with the following:
- Hostnames and ports:
- query_hostname, query_hostport: The node that the query is running
- master_query_host_name, master_query_host_port: The node in the cluster
initiated the query.
Note that for citus_dist_stat_activity view, the query_hostname-query_hostport
is always the same with master_query_host_name-master_query_host_port. The
distinction is mostly relevant for citus_worker_stat_activity. For example,
on Citus MX, a users starts a transaction on Node-A, which starts worker
transactions on Node-B and Node-C. In that case, the query hostnames would be
Node-B and Node-C whereas the master_query_host_name would Node-A.
- Distributed transaction related things:
This is mostly the process_id, distributed transactionId and distributed transaction
number.
- pg_stat_activity columns:
These two views get all the columns from pg_stat_activity. We're basically joining
pg_stat_activity with get_all_active_transactions on process_id.
This test's output changes depending on which worker is
picked for explain (e.g., worker port in the output changes).
Given that the test is only aiming to ensure that CTEs inside
CTEs work fine in DML queries, it should be fine to get rid of
the EXPLAIN. The output is verified to be correct as well.
We previously implemented OTHER_WORKERS_WITH_METADATA tag. However,
that was wrong. See the related discussion:
https://github.com/citusdata/citus/issues/2320
Instead, we switched using OTHER_WORKER_NODES and make the command
that we're running optional such that even if the node is not a
metadata node, we won't be in trouble.
This commit fixes a bug where a concurrent DROP TABLE deadlocks
with SELECT (or DML) when the SELECT is executed from the workers.
The problem was that Citus used to remove the metadata before
droping the table on the workers. That creates a time window
where the SELECT starts running on some of the nodes and DROP
table on some of the other nodes.
This commit enables support for TRUNCATE on both
distributed table and reference tables.
The basic idea is to acquire lock on the relation by sending
the TRUNCATE command to all metedata worker nodes. We only
skip sending the TRUNCATE command to the node that actually
executus the command to prevent a self-distributed-deadlock.
This commit should be reverted once a new PostgreSQL 11 beta is
available: it's due to a bug in the partitioning code which has been
fixed in REL_11_STABLE but (not yet) a released tag.
Make sure that the coordinator sends the commands when the search
path synchronised with the coordinator's search_path. This is only
important when Citus sends the commands that are directly relayed
to the worker nodes. For example, the deparsed DLL commands or
queries always adds schema qualifications to the queries. So, they
do not require this change.
This commit by default enables hiding shard names on MX workers
by simple replacing `pg_table_is_visible()` calls with
`citus_table_is_visible()` calls on the MX worker nodes. The latter
function filters out tables that are known to be shards.
The main motivation of this change is a better UX. The functionality
can be opted out via a GUC.
We also added two views, namely citus_shards_on_worker and
citus_shard_indexes_on_worker such that users can query
them to see the shards and their corresponding indexes.
We also added debug messages such that the filtered tables can
be interactively seen by setting the level to DEBUG1.
Add ability to understand whether a table is a
known shard on MX workers. Note that this is only useful
and applicable for hiding shards on MX worker nodes given
that we can have metadata only there.
- mitmdump now listens on port 9060
- Add some logging to fluent.py, making issues like this easier to debug in the future
- Fail the tests if something is already running on the port mitmProxy tries to use
- check-failure now works with VPATH builds
This commit adds an extensive failure testing, which covers quite
a bit of things and their combinations:
- 1PC vs 2PC
- Replication factor 1 and Replication factor 2
- Network failures and query cancellations
- Sequential vs Parallel query execution mode
- Lots of detail is in src/test/regress/mitmscripts/README
- Create a new target, make check-failure, which runs tests
- Tells travis how to install everything and run the tests
We can now support more complex count distinct operations by
pulling necessary columns to coordinator and evalutating the
aggreage at coordinator.
It supports broad range of expression with the restriction that
the expression must contain a column.
When a hash distributed table have a foreign key to a reference
table, there are few restrictions we have to apply in order to
prevent distributed deadlocks or reading wrong results.
The necessity to apply the restrictions arise from cascading
nature of foreign keys. When a foreign key on a reference table
cascades to a distributed table, a single operation over a single
connection can acquire locks on multiple shards of the distributed
table. Thus, any parallel operation on that distributed table, in the
same transaction should not open parallel connections to the shards.
Otherwise, we'd either end-up with a self-distributed deadlock or
read wrong results.
As briefly described above, the restrictions that we apply is done
by tracking the distributed/reference relation accesses inside
transaction blocks, and act accordingly when necessary.
The two main rules are as follows:
- Whenever a parallel distributed relation access conflicts
with a consecutive reference relation access, Citus errors
out
- Whenever a reference relation access is followed by a
conflicting parallel relation access, the execution mode
is switched to sequential mode.
There are also some other notes to mention:
- If the user does SET LOCAL citus.multi_shard_modify_mode
TO 'sequential';, all the queries should simply work with
using one connection per worker and sequentially executing
the commands. That's obviously a slower approach than Citus'
usual parallel execution. However, we've at least have a way
to run all commands successfully.
- If an unrelated parallel query executed on any distributed
table, we cannot switch to sequential mode. Because, the essense
of sequential mode is using one connection per worker. However,
in the presence of a parallel connection, the connection manager
picks those connections to execute the commands. That contradicts
with our purpose, thus we error out.
- COPY to a distributed table cannot be executed in sequential mode.
Thus, if we switch to sequential mode and COPY is executed, the
operation fails and there is currently no way of implementing that.
Note that, when the local table is not empty and create_distributed_table
is used, citus uses COPY internally. Thus, in those cases,
create_distributed_table() will also fail.
- There is a GUC called citus.enforce_foreign_key_restrictions
to disable all the checks. We added that GUC since the restrictions
we apply is sometimes a bit more restrictive than its necessary.
The user might want to relax those. Similarly, if you don't have
CASCADEing reference tables, you might consider disabling all the
checks.
-[x] drop constraint
-[x] drop column
-[x] alter column type
-[x] truncate
are sequentialized if there is a foreign constraint from
a distributed table to a reference table on the affected relations
by the above commands.
Make sure that intermediate results use a connection that is
not associated with any placement. That is useful in two ways:
- More complex queries can be executed with CTEs
- Safely use the same connections when there is a foreign key
to reference table from a distributed table, which needs to
use the same connection for modifications since the reference
table might cascade to the distributed table.
This table will be used by Citus Enterprise to populate authentication-
related fields in outbound connections; Citus Community lacks support
for this functionality.
To support more flexible (i.e. not at compile-time) specification of
libpq connection parameters, this change adds a new GUC, node_conninfo,
which must be a space-separated string of key-value pairs suitable for
parsing by libpq's connection establishment methods.
To avoid rebuilding and parsing these values at connection time, this
change also adds a cache in front of the configuration params to permit
immediate use of any previously-calculated parameters.
We're relying on multi_shard_modify_mode GUC for real-time SELECTs.
The name of the GUC is unfortunate, but, adding one more GUC
(or renaming the GUC) would make the UX even worse. Given that this
mode is mostly important for transaction blocks that involve modification
/DDL queries along with real-time SELECTs, we can live with the confusion.
After this commit DDL commands honour `citus.multi_shard_modify_mode`.
We preferred using the code-path that executes single task router
queries (e.g., ExecuteSingleModifyTask()) in order not to invent
a new executor that is only applicable for DDL commands that require
sequential execution.
Errors thrown in the COMMIT handler will cause Postgres to segfault,
there's nothing it can do it abort the transaction by the time that
handler is called!
RemoveIntermediateResultsDirectory is problematic for two reasons:
- It has calls to ereport(ERROR which have been known to trigger
- It makes memory allocations which raise ERRORs when they fail
Once the COMMIT process has begun we don't use the intermediate results,
so it's safe to remove them a little earlier in the process. A failure
here will abort the transaction. That's pretty unnecessary, it's not
that important that we remove the results, but it's still better than a
crash.
Previously we checked if an operator is in pg_catalog, and if it wasn't we prefixed it with namespace in worker queries. This can have a huge impact on performance of physical planner when using custom data types.
This happened regardless of current search_path config, because Citus overrides the search path in get_query_def_extended(). When we do so, the check for existence of the operator in current search path in generate_operator_name() fails for any operators outside pg_catalog. This means that nothing gets cached, and in the following calls we will again recheck the system tables for existence of the operators, which took an additional 40-50ms for some of the usecases we were seeing.
In this change we skip the pg_catalog check, and always prefix the operator with its namespace.
* Change worker_hash_partition_table() such that the
divergence between Citus planner's hashing and
worker_hash_partition_table() becomes the same.
* Rename single partitioning to single range partitioning.
* Add single hash repartitioning. Basically, logical planner
treats single hash and range partitioning almost equally.
Physical planner, on the other hand, treats single hash and
dual hash repartitioning almost equally (except for JoinPruning).
* Add a new GUC to enable this feature
utilityStmt sometimes (such as when it's inside of a plpgsql function)
comes from a cached plan, which is kept in a child of the
CacheMemoryContext. When we naively call copyObject we're copying it into
a statement-local context, which corrupts the cached plan when it's
thrown away.
This commit doesn't change any of the logic at all.
Instead, the goal is to:
* Get rid of any code duplication
* Incremental changes to the optimizer made it slightly hard
to follow the code, improve that and make it easier to
implement new features
* Simplify the code by moving each part of query processing (e.g.,
DISTINCT, LIMIT etc) into its own function
* Make the interaction between each part of the query more
obvious (e.g., How DISTINCT affects LIMIT etc)
In case a failure happens when a transaction is rollbacked,
we used to hit an assertion for ensuring there is no pending
activity on the connection. However, that's not true after the
changes in #2031. Thus, we've replaced the assertion with a more
generic function call to consume any pending activity, if exists.
- changes in ruleutils_11.c is reflected
- vacuum statement api change is handled. We now allow
multi-table vacuum commands.
- some other function header changes are reflected
- api conflicts between PG11 and earlier versions
are handled by adding shims in version_compat.h
- various regression tests are fixed due output and
functionality in PG1
- no change is made to support new features in PG11
they need to be handled by new commit
- Add install.pl to instal .sql files on Windows
- Remove a hack to PGDLLIMPORT some variables
- Add citus_version.o to the Makefile
- Fix pg_regress_multi's PATH generation on Windows
- Output regression.diffs when the tests fail
- Fix permissions in data directory, make sure postgres can play with it
Before this commit, we had code duplication in the
WorkerExtendedOpNode(). The duplication was
noticeable and any change is prone to bugs.
The PR consists of 4 commits. Each commit incrementally
fixes the problem by moving certain parts of the duplicated
code into smaller, better-documented functions.
Before this commit, we had a divergence among
the creation of master/worker extended op nodes.
This commit moves the related parts into a single place
and allows the creation of master/extended op nodes to
share a common data structure.
PostgreSQL might remove some of the subqueries when they do not
contribute to the query result at all. Citus should not try to
access such subqueries during planning.
Without this change we crash on Windows with COPYing into a table with
62 shards, and we ERROR when COPYing into a table with >62 shards:
ERROR: WaitForMutipleObjects() failed: error code 87
Without this change multi_real_time_transaction blocks forever (on
Windows) in the block where it repeatedly calls pg_advisory_lock(15).
This happens because the deadlock detector tries to cancel the backend
but the backend never processes that signal.
This PR adds support for multiple AND expressions in Having
for pushdown planner. We simply make a call to make_ands_explicit
from MultiLogicalPlanOptimize for the having qual in
workerExtendedOpNode.
After this commit large_table_shard_count wont be used to
check whether broadcast join, which is renamed as reference
join, can be applied. Reference join can only be applied over
reference tables.
We recently added partitionin support to Citus MX. We should not execute
DROP table commands from MX workers but at the moment we try to execute
such commands for partitioned tables. This PR fixes that problem by
adding check.
Previously, we prevented creation of partitioned tables on Citus MX.
We decided to not focus on this feature until there is a need. Since
now there are requests for this feature, we are implementing support
for partitioned tables on Citus MX.
After this change all the logic related to shard data fetch logic
will be removed. Planner won't plan any ShardFetchTask anymore.
Shard fetch related steps in real time executor and task-tracker
executor have been removed.
- Force all platforms to use the same collation
- Force all platforms to use the same locale
- Use /dev/null or NUL, depending on platform
- Use /tmp or %TEMP%, dpeending on platform
- don't hardcode path names
- replace system calls for rm/mkdir/rm -rf with perl equivalents
- force utf-8 encoding
- the Windows shell uses different quoting and escape rules
Pushing down limit and order by into workers may produce
wrong output when distinct on() clause has expressions,
aggregates, or window functions.
This checking allows pushing down of limits only if
distinct clause is a superset of group by clause. i.e. it contains all clauses in group by.
This commit checks the connection status right after any IO happens
on the socket.
This is necessary since before this commit we didn't pass any information
to the higher level functions whether we're done with the connection
(e.g., no IO required anymore) or an errors happened during the IO.
This is the first of series of window function work.
We can now support window functions that can be pushed down to workers.
Window function must have distribution column in the partition clause
to be pushed down.
We push down order by to worker query when limit is specified
(with some other additional checks). If the query has an expression
on an aggregate or avg aggregate by itself, and there is an order
by on this particular target we may send wrong order by to worker
query with potential to affect query result.
The fix creates a auxilary target entry in the worker query and
uses that target entry for sorting.
Before this PR, we were trusting on the columns of group by about
guaranteeing the uniqueness of the results. However, this assumption
is correct only if the columns in the group by is subset of columns
in the distinct clause. It can be wrong if we have part of group by
columns and some aggregation columns in the distinct clause. With
this PR, we add distinct plan on top of aggregate plan when necessary.
With #1804 (and related PRs), Citus gained the ability to
plan subqueries that are not safe to pushdown.
There are two high-level requirements for pushing down subqueries:
* Individual subqueries that require a merge step (i.e., GROUP BY
on non-distribution key, or LIMIT in the subquery etc). We've
handled such subqueries via #1876.
* Combination of subqueries that are not joined on distribution keys.
This commit aims to recursively plan some of such subqueries to make
the whole query safe to pushdown.
The main logic behind non colocated subquery joins is that we pick
an anchor range table entry and check for distribution key equality
of any other subqueries in the given query. If for a given subquery,
we cannot find distribution key equality with the anchor rte, we
recursively plan that subquery.
We also used a hacky solution for picking relations as the anchor range
table entries. The hack is that we wrap them into a subquery. This is only
necessary since some of the attribute equivalance checks are based on
queries rather than range table entries.
We used to only support pushdownable set operations inside a
subquery, however, we could easily expand the restriction
checks to cover top level set operations as well.
We use PostgreSQL hooks to accumulate the join restrictions
and PostgreSQL gives us all the join paths it tries while
deciding on the join order. Thus, for queries that have many
joins, this function is likely to remove lots of duplicate join
restrictions. This becomes relevant for Citus on query pushdown
check peformance.
We were allowing count distict queries even if they were
not directly on columns if the query is grouped on
distribution column.
When performing these checks we were skipping subqueries
because they also perform this check in a more concise manner.
We relied on oid SUBQUERY_RELATION_ID (10000) to decide if
a given RTE relation id denotes a subquery, however, we also
use SUBQUERY_PUSHDOWN_RELATION_ID (10001) for some subqueries.
We skip both type of subqueries with this change.
VLAs aren't supported by Visual Studio.
- Remove all existing instances of VLAs.
- Add a flag, -Werror=vla, which makes gcc refuse to compile if we add
VLAs in the future.
- variable length arrays (VLAs) do not work with Visual Studio
- fix an off-by-one error. We incorrectly assumed there would always at
least as many edges as there were nodes.
- refactor: reduce scope of transactionNodeStack by moving it into the
function which uses it.
- refactor: break up the distinct uses of currentStackDepth into
separate variables.
It's against our coding convention to call functions inside parameter
lists; when single-stepping with a debugger it's difficult to determine
what the function returned.
That wouldn't be good enough reason to change this code but while
porting Citus to Windows I ran into this line of code.
assign_distributed_transaction_id was called with a weird timestamp and
I wasn't able to find the problem without first making this change.
* Don't use expressions inside compound statements
* Don't depend on __builtin_constant_p
* Remove reliance on S_ISLNK
* Replace use of __func__: older mcvs doesn't support this builtin
With this fix, we traverse the graph with DFS which was originally
intended. Note that, before the fix, we traverse the graph with BFS
which might lead to killing some unrelated backend that is not
involved in the distributed deadlock.
By sharing the implementation of the function AppendOptionListToString on
three call sites, we would expand an extra OPTIONS keyword in a create index
statement, and omit other bits of the specific syntax here.
This patch introduces an AppendStorageParametersToString() function that is
very similar to AppendOptionListToString() but handles WITH(a="foo",...)
syntax that is used in reloptions (aka Storage Parameters).
Fixes#1747.
PostgreSQL implements support for several relation kinds in a single
statement, such as in the AlterTableStmt case, which supports both tables
and indexes and more (see ATExecSetRelOptions in PostgreSQL source code file
src/backend/commands/tablecmds.c for an example of that).
As a consequence, this patch implements support for setting and resetting
storage parameters on both relation kinds.
The command is now distributed among the shards when the table is
distributed. To that effect, we fill in the DDLJob's targetRelationId with
the OID of the table for which the index is defined, rather than the OID of
the index itself.
Citus sometimes have regressions around non-default schema support, meaning
not public and not in the search_path, per @marcocitus. This patch changes
some regression tests to use a non-default schema in order to cover more
cases.
The implementation was already mostly in place, but the code was protected
by a principled check against the operation. Turns out there's a nasty
concurrency bug though with long identifier names, much as in #1664.
To prevent deadlocks from happening, we could either review the DDL
transaction management in shards and placements, or we can simply reject
names with (NAMEDATALEN - 1) chars or more — that's because of the
PostgreSQL array types being created with a one-char prefix: '_'.
We shouldn't return in middle of a PG_TRY() block because if we do, we won't reset PG_exception_stack, and later when a re-throw tries to jump to the jump-point which was active in this PG_TRY() block, it seg-faults.
We used to return in middle of PG_TRY() block in WaitForConnections() where we checked for cancellations. Whenever cancellations were caught here, Citus crashed. And example was reported by @onderkalaci at #1903.
clause is not supported
This change allows unsupported clauses to go through query pushdown
planner instead of erroring out as we already do for non-outer joins.
We used to error out if the join clause includes filters like
t1.a < t2.a even if other filter like t1.key = t2.key exists.
Recently we lifted that restriction in subquery planning by
not lifting that restriction and focusing on equivalance classes
provided by postgres.
This checkin forwards previously erroring out real-time queries
due to join clauses to subquery planner and let it handle the
join even if the query does not have a subquery.
We are now pushing down queries that do not have any
subqueries in it. Error message looked misleading, changed to a more descriptive one.
We were creating intermediate query result's target
names from subquery target list. Now we also check
if cte re-defines its column name aliases, and create
intermediate result query accordingly.
The macro we were using to detect strtoull isn't set on Windows, and
just in case there are differences use a portable function from PG
instead of calling strtoull directly.
This commit introduces a new GUC to limit the intermediate
result size which we handle when we use read_intermediate_result
function for CTEs and complex subqueries.
With this commit, Citus recursively plans subqueries that
are not safe to pushdown, in other words, requires a merge
step.
The algorithm is simple: Recursively traverse the query from bottom
up (i.e., bottom meaning the leaf queries). On each level, check
whether the query is safe to pushdown (or a single repartition
subquery). If the answer is yes, do not touch that subquery. If the
answer is no, plan the subquery seperately (i.e., create a subPlan
for it) and replace the subquery with a call to
`read_intermediate_results(planId, subPlanId)`. During the the
execution, run the subPlans first, and make them avaliable to the
next query executions.
Some of the queries hat this change allows us:
* Subqueries with LIMIT
* Subqueries with GROUP BY/DISTINCT on non-partition keys
* Subqueries involving re-partition joins, router queries
* Mixed usage of subqueries and CTEs (i.e., use CTEs in
subqueries as well). Nested subqueries as long as we
support the subquery inside the nested subquery.
* Subqueries with local tables (i.e., those subqueries
has the limitation that they have to be leaf subqueries)
* VIEWs on the distributed tables just works (i.e., the
limitations mentioned below still applies to views)
Some of the queries that is still NOT supported:
* Corrolated subqueries that are not safe to pushdown
* Window function on non-partition keys
* Recursively planned subqueries or CTEs on the outer
side of an outer join
* Only recursively planned subqueries and CTEs in the FROM
(i.e., not any distributed tables in the FROM) and subqueries
in WHERE clause
* Subquery joins that are not on the partition columns (i.e., each
subquery is individually joined on partition keys but not the upper
level subquery.)
* Any limitation that logical planner applies such as aggregate
distincts (except for count) when GROUP BY is on non-partition key,
or array_agg with ORDER BY
Postgres provides OS agnosting formatting macros for
formatting 64 bit numbers. Replaced %ld %lu with
INT64_FORMAT and UINT64_FORMAT respectively.
Also found some incorrect usages of formatting
flags and fixed them.
We added the ability to filter out the planner restriction information
for specific parts of the query. This might lead to situations where
the common restriction includes some other relations that we're searching
for. The reason is that while filtering for join restrictions, we add the
restriction as soon as we find the relation.
With this commit we make sure that the common attribute
equivalance class always includes the input relations.
In subquery pushdown, we first ensure that each relation is joined with at least
on another relation on the partition keys. That's fine given that the decision
is binary: pushdown the query at all or not.
With recursive planning, we'd want to check whether any specific part
of the query can be pushded down or not. Thus, we need the ability to
understand which part(s) of the subquery is safe to pushdown. This commit
adds the infrastructure for doing that.
Note that we used to iterate over the RTEs once for performance reasons.
However, keeping an extra copy of original query seems more costly and
hard to maintain/explain.
Subquery pushdown planning is based on relation restriction
equivalnce. This brings us the opportuneatly to allow any
other joins as long as there is an already equi join between
the distributed tables.
We already allow that for joins with reference tables and
this commit allows that for joins among distributed tables.
With this commit, we allow pushing down subqueries with only
reference tables where GROUP BY or DISTINCT clause or Window
functions include only columns from reference tables.
Store pointers to shared hashes in process-local variables. Previously
pointers to shared hashes were put into shared memory. This causes
problems on EXEC_BACKEND because everybody calls execve and receives a
brand new address space; the shared hash will be in a different place
for every backend. (normally we call fork, which gives you a copy of the
address space, so these pointers remain constant)
Autovacuum process cancels itself if any modification starts
on the table in order to avoid blocking your regular Postgres
sessions. That's normal and expected. Thus, any locks held by
autovacuum process cannot involve in a distributed deadlock
since it'll be released if needed.
These locks are held for a very short duration time and cannot
contribute to a deadlock. Speculative locks are used by Postgres
for internal notification mechanism among transactions.
While attaching a partition to a distributed table in schema, we mistakenly
used unqualified name to find partitioned table's oid. This caused problems
while using partitioned tables with schemas. We are fixing this issue in
this PR.
Short-term share/exclusive page-level locks are used for
read/write access. Locks are released immediately after
each index row is fetched or inserted.
Since those locks may not lead to any deadlocks, it's safe
to ignore them in the distributed deadlock detection.
In DistributedTablesSize() we didn't close the relations that had
replication factor > 2. This caused relcache reference leaks, and
warning messages like following in logs:
WARNING: relcache reference leak: relation "researchers" not closed
It's possible to build INSERT SELECT queries which include implicit
casts, currently we attempt to support these by adding explicit casts to
the SELECT query, but this sometimes crashes because we don't update all
nodes with the new types. (SortClauses, for instance)
This commit removes those explicit casts and passes an unmodified SELECT
query to the COPY executor (how we implement INSERT SELECT under the
scenes). In lieu of those cases, COPY has been given some extra logic to
inspect queries, notice that the types don't line up with the table it's
supposed to be inserting into, and "manually" casting every tuple before
sending them to workers.
ShardPlacementList's implementation can return NIL. In previous implementation
we got a segmentation fault in this case. The relation can be dropped after
getting distributed table list but before calling SingleReplicatedTable().
If we don't propagate the errors we are catching in PG_CATCH(), database's
internal state might not be clean. So we do PG_TRY() inside a subtransaction
so we can rollback to it after catching errors.
This patch adds --with-reports-host configure option, which sets the
REPORTS_BASE_URL constant. The default is reports.citusdata.com.
It also enables stats collection in tests.
Curl writes the received response to stdout if we don't specify a response
callback or an output file. This can pollute the PostgreSQL log. In this change
we add a callback function so the response messages aren't added to the log file.
Sends a request to /v1/releases/latest?flavor=$CITUS_EDITION once a day,
which returns a response similar to {"version": "7.1.0", "major": 7,
"minor": 1, "patch": 0}. Then compares it with current Citus version,
and if the latest release is newer, logs a LOG message.
This includes:
(1) Wrap everything inside a StartTransactionCommand()/CommitTransactionCommand().
This is so we can access the database. This also switches to a new memory context
and releases it, so we don't have to do our own memory management.
(2) LockCitusExtension() so the extension cannot be dropped or created concurrently.
(3) Check CitusHasBeenLoaded() && CheckCitusVersion() before doing any work.
(4) Do not PG_TRY() inside a loop.
This commit makes a change in relay_event_utility.c to check if the
Alter Table command adds a constraint using index. If this is the
case, it appends the shard id to the index name.
By this commit, citus minds the replica identity of the table when
we distribute the table. So the shards of the distributed table
have the same replica identity with the local table.
Expands count distinct coverage by allowing more cases. We used to support
count distinct only if we can push down distinct aggregate to worker query
i.e. the count distinct clause was on the partition column of the table,
or there was a grouping on the partition column.
Now we can support
- non-partition columns, with or without grouping on partition column
- partition, and non partition column in the same query
- having clause
- single table subqueries
- insert into select queries
- join queries where count distinct is on partition, or non-partition column
- filters on count distinct clauses (extends existing support)
We first try to push down aggregate to worker query (original case), if we
can't then we modify worker query to return distinct columns to coordinator
node. We do that by adding distinct column targets to group by clauses. Then
we perform count distinct operation on the coordinator node.
This work should reduce the cases where HLL is used as it can address anything
that HLL can. However, if we start having performance issues due to very large
number rows, then we can recommend hll use.
This change introduces the `pg_dist_node_metadata` which has a single jsonb value. When creating
the extension, a random server id is generated and stored in there. Everything in the metadata table
is added as a nested objected to the json payload that is sent to the reports server.
The following scenario can cause an Assert() crash if we don't do this:
- Install Citus v7.0-15
- Restart server & run a query to start maintenanced.
- Install Citus v7.1
- Restart server & run a query. This will tell user to upgrade.
- Type "UPDATE EXTENSION c" & press tab. maintenanced will start and crash
with Assert(CitusHasBeenLoaded() && CheckCitusVersion(WARNING));
This change checks Citus version before calling metadata functions so the
crash doesn't happen.
This will provide the full project name (i.e. Citus/Citus Enterprise),
and the host system, compiler, and architecture word size.
I wanted to limit the number of copied files in 'config', so I added
only config.guess and call it manually, rather than using the macro
AC_CANONICAL_HOST, which requires several other files.
Eclipse apparently doesn't scan build output looking for -D flags, so
having the value actually appear in a header is nicer for those of us
using IDEs.
Previously <curl/curl.h> was included even if compiled --without-libcurl.
This can fail when libcurl headers are not there. This commit guards this
include by checks for HAVE_LIBCURL.
Adds ```citus.enable_statistics_collection``` GUC variable, which ```true``` by default, unless built without libcurl. If statistics collection is enabled, sends basic usage data to Citus servers every 24 hours.
The data that is collected consists of:
- Citus version
- OS name & release
- Hardware Id
- Number of tables, rounded to next power of 2
- Size of data, rounded to next power of 2
- Number of workers
This commit provides the support for window functions in subquery and insert
into select queries. Note that our support for window functions is still limited
because it must have a partition by clause on the distribution key. This commit
makes changes in the files insert_select_planner and multi_logical_planner. The
required tests are also added with files multi_subquery_window_functions.out
and multi_insert_select_window.out.
We should skip if the process blocked on the relation
extension since those locks are hold for a short duration
while the relation is actually extended on the disk and
released as soon as the extension is done. Thus, recording
such waits on our lock graphs could yield detecting wrong
distributed deadlocks.
We sent multiple commands to worker when starting a transaction.
Previously we only checked the result of the first command that
is transaction 'BEGIN' which always succeeds. Any failure on
following commands were not checked.
With this commit, we make sure all command results are checked.
If there is any error we report the first error found.
Basically we just care whether the running version is before or after
PostgreSQL 10, so testing the major version against 9 and printing a
boolean is sufficient.
When a table and it's shards are dropped, and afterwards the same
shard identifiers are reused, e.g. due to a DROP & CREATE EXTENSION,
the old entry in the shard cache and the required entry in the shard
cache might be for different tables.
Force invalidation for both old and new table to fix.
Citus can handle INSERT INTO ... SELECT queries if the query inserts
into local table by reading data from distributed table. The opposite
way is not correct. With this commit we warn the user if the latter
option is used.
When a NULL connection is provided to PQerrorMessage(), the
returned error message is a static text. Modifying that static
text, which doesn't necessarly be in a writeable memory, is
dangreous and might cause a segfault.
With this commit, we relax the restrictions put on the reference
tables with subquery pushdown.
We did three notable improvements:
1) Relax equi-join restrictions
Previously, we always expected that the non-reference tables are
equi joined with reference tables on the partition key of the
non-reference table.
With this commit, we allow any column of non-reference tables
joined using non-equi joins as well.
2) Relax OUTER JOIN restrictions
Previously Citus errored out if any reference table exists at
any point of the outer part of an outer join. For instance,
See the below sketch where (h) denotes a hash distributed relation,
(r) denotes a reference table, (L) denotes LEFT JOIN and
(I) denotes INNER JOIN.
(L)
/ \
(I) h
/ \
r h
Before this commit Citus would error out since a reference table
appears on the left most part of an left join. However, that was
too restrictive so that we only error out if the reference table
is directly below and in the outer part of an outer join.
3) Bug fixes
We've done some minor bugfixes in the existing implementation.
With this PR we add isolation tests for
COPY to reference table vs. other operations
COPY to partitioned table vs. other operations
Multi row INSERTs vs other operations
INSERT/SELECT vs. other operations
UPSERT vs. other operations
DELETE vs. other operations
TRUNCATE vs. other operations
DROP vs. other operations
DDL vs. other operations
other operations consist of basic SQL operations (like SELECT,
INSERT, DELETE, UPSERT, COPY TRUNCATE, CREATE INDEX) as well
as some Citus functionalities (like master_modify_multiple_shards,
master_apply_delete_command, citus_total_relation_size etc.)
If after the distributed deadlock detection decides to cancel
a backend, the backend has been terminated/killed/cancelled
externally, we might be accessing to a NULL pointer. This commit
prevents that case by ignoring the current distributed deadlock.
This is necessary for multi-row INSERTs for the same reasons we use it
in e.g. UPSERTs: if the range table list has more than one entry, then
PostgreSQL's deparse logic requires that vars be prefixed by the name
of their corresponding range table entry. This of course doesn't affect
single-row INSERTs, but since multi-row INSERTs have a VALUE RTE, they
were affected.
The piece of ruleutils which builds range table names wasn't modified
to handle shard extension; instead UPSERT/INSERT INTO ... SELECT added
an alias to the RTE. When present, this alias is favored. Doing the
same in the multi-row INSERT case fixes RETURNING for such commands.
This change adds support for SAVEPOINT, ROLLBACK TO SAVEPOINT, and RELEASE SAVEPOINT.
When transaction connections are not established yet, savepoints are kept in a stack and sent to the worker when the connection is later established. After establishing connections, savepoint commands are sent as they arrive.
This change fixes#1493 .
We should prevent running the deadlock detection if
there is a major version change. Otherwise, the daemon
may access to obsolete metadata catalog tables.
This change fixes a use-after-free bug while renaming obsolete
`pg_worker_list.conf` file, which causes Citus to crash during upgrade
(or even extension creation) if `pg_worker_list.conf` exists.
We added a new field to the transaction id that is set to true only
for the transactions initialized on the coordinator. This is only
useful for MX in order to distinguish the transaction that started
the distributed transaction on the coordinator where we could
have the same transactions' worker queries on the same node.
We added a new GUC citus.log_distributed_deadlock_detection
which is off by default. When set to on, we log some debug messages
related to the distributed deadlock to the server logs.
With this commit, the maintenance deamon starts to check for
distributed deadlocks.
We also introduced a GUC variable (distributed_deadlock_detection_factor)
whose value is multiplied with Postgres' deadlock_timeout. Setting
it to -1 disables the distributed deadlock detection.
We send SIGINT to a backend that is cancelled due to a deadlock. That
approach ends up being a very confusing error message.
With this commit we intercept the error messages and show a more
meaningful error message to the user.
Now that we already have the necessary infrastructure for detecting
distributed deadlocks. Thus, we don't need enable_deadlock_prevention
which is purely intended for preventing some forms of distributed
deadlocks.
This commit adds all the necessary pieces to do the distributed
deadlock detection.
Each distributed transaction is already assigned with distributed
transaction ids introduced with
3369f3486f. The dependency among the
distributed transactions are gathered with
80ea233ec1.
With this commit, we implement a DFS (depth first seach) on the
dependency graph and search for cycles. Finding a cycle reveals
a distributed deadlock.
Once we find the deadlock, we examine the path that the cycle exists
and cancel the youngest distributed transaction.
Note that, we're not yet enabling the deadlock detection by default
with this commit.
This GUC has two settings, 'always' and 'never'. When it's set to
'never' all behavior stays exactly as it was prior to this commit. When
it's set to 'always' only SELECT queries are allowed to run, and only
secondary nodes are used when processing those queries.
Add some helper functions:
- WorkerNodeIsSecondary(), checks the noderole of the worker node
- WorkerNodeIsReadable(), returns whether we're currently allowed to
read from this node
- ActiveReadableNodeList(), some functions (namely, the ones on the
SELECT path) don't require working with Primary Nodes. They should call
this function instead of ActivePrimaryNodeList(), because the latter
will error out in contexts where we're not allowed to write to nodes.
- ActiveReadableNodeCount(), like the above, replaces
ActivePrimaryNodeCount().
- EnsureModificationsCanRun(), error out if we're not currently allowed
to run queries which modify data. (Either we're in read-only mode or
use_secondary_nodes is set)
Some parts of the code were switched over to use readable nodes instead
of primary nodes:
- Deadlock detection
- DistributedTableSize,
- the router, real-time, and task tracker executors
- ShardPlacement resolution
This change declares two new functions:
`master_update_table_statistics` updates the statistics of shards belong
to the given table as well as its colocated tables.
`get_colocated_shard_array` returns the ids of colocated shards of a
given shard.
This is a pretty substantial refactoring of the existing modify path
within the router executor and planner. In particular, we now hunt for
all VALUES range table entries in INSERT statements and group the rows
contained therein by shard identifier. These rows are stashed away for
later in "ModifyRoute" elements. During deparse, the appropriate RTE
is extracted from the Query and its values list is replaced by these
rows before any SQL is generated.
In this way, we can create multiple Tasks, but only one per shard, to
piecemeal execute a multi-row INSERT. The execution of jobs containing
such tasks now exclusively go through the "multi-router executor" which
was previously used for e.g. INSERT INTO ... SELECT.
By piggybacking onto that executor, we participate in ongoing trans-
actions, get rollback-ability, etc. In short order, the only remaining
use of the "single modify" router executor will be for bare single-
row INSERT statements (i.e. those not in a transaction).
This change appropriately handles deferred pruning as well as master-
evaluated functions.
We use the backend shared memory lock for preventing
new backends to be part of a new distributed transaction
or an existing backend to leave a distributed transaction
while we're reading the all backends' data.
The primary goal is to provide consistent view of the
current distributed transactions while doing the
deadlock detection.
For partitioned tables, PostgreSQL opens partition and its partitions
in BeginCopyFrom and it expects its caller to close those relations.
However, we do not have quick access to opened relations and performing
special operations for partitioned tables isn't necessary in coordinator
node. Therefore before calling BeginCopyFrom, we change relkind of those
partitioned tables to RELKIND_RELATION. This prevents PostgreSQL to open
its partitions as well.
In standart_planner, PostgreSQL expands partitioned tables to their
partitions and call our restriction hook for each partition. It also,
for some queries, skips the partitioned table itself completely. This
behaviour makes it difficult to prune shards and decide whether query
is router plannable or not. To prevent this behaviour, we change inh
flag of partitioned tables to false in the query tree. In this case,
PostgreSQL treats those partitioned tables as regular relations and
does not expand them.
This behaviour is inline with our expectations, because we do not want
to treat partitioned tables differently on coordinator. Although we are
not entirely comfortable with modifying query tree, other solutions to
this problem is overly complicated.
With this PR, Citus starts to support all possible ways to create
distributed partitioned tables. These are;
- Distributing already created partitioning hierarchy
- CREATE TABLE ... PARTITION OF a distributed_table
- ALTER TABLE distributed_table ATTACH PARTITION non_distributed_table
- ALTER TABLE distributed_table ATTACH PARTITION distributed_table
We also support DETACHing partitions from partitioned tables and propogating
TRUNCATE and DDL commands to distributed partitioned tables.
This PR also refactors some parts of distributed table creation logic.
- master_activate_node and master_disable_node correctly toggle
isActive, without crashing
- master_add_node rejects duplicate nodes, even if they're in different
clusters
- master_remove_node allows removing nodes in different clusters
This change removes distributed tables' dependency on distribution key columns. We already check that we cannot drop distribution key columns in ErrorIfUnsupportedAlterTableStmt() at multi_utility.c, so we don't need to have distributed table to distribution key column dependency to avoid dropping of distribution key column.
Furthermore, having this dependency causes some warnings in pg_dump --schema-only (See #866), which are not desirable.
This change also adds check to disallow drop of distribution keys when citus.enable_ddl_propagation is set to false. Regression tests are updated accordingly.
We try to run our isolation tests paralles as much as possible. In
some of those isolation tests we used same table name which causes
problem while running them in paralles. This commit changes table
names in those tests to ensure tests can run in parallel.
This commit is preperation for introducing distributed partitioned
table support. We want to clean and refactor some code in distributed
table creation logic so that we can handle partitioned tables in more
robust way.
maxTaskStringSize determines the size of worker query string.
It was originally hard coded to a specific value. This has caused
issues at some users. Since it determines initial shared memory
allocation, we did not want to set it to an arbitrary higher number.
Instead made it configurable.
This commit introduces a new GUC variable max_task_string_size
Changes in this variable requires restart to be in effect.
In this commit, we add ability to convert global wait edges
into adjacency list with the following format:
[transactionId] = [transactionNode->waitsFor {list of waiting transaction nodes}]
This change adds a general purpose infrastructure to log and monitor
process about long running progresses. It uses
`pg_stat_get_progress_info` infrastructure, introduced with PostgreSQL
9.6 and used for tracking `VACUUM` commands.
This patch only handles the creation of a memory space in dynamic shared
memory, putting its info in `pg_stat_get_progress_info`, fetching the
progress monitors on demand and finalizing the progress tracking.
- Never release locks
- AddNodeMetadata takes ShareRowExclusiveLock so it'll conflict with the
trigger which prevents multiple primary nodes.
- ActivateNode and SetNodeState used to take AccessShareLock, but they
modify the table so they should take RowExclusiveLock.
- DeleteNodeRow and InsertNodeRow used to take AccessExclusiveLock but
only need RowExclusiveLock.
- master_add_node enforces that there is only one primary per group
- there's also a trigger on pg_dist_node to prevent multiple primaries
per group
- functions in metadata cache only return primary nodes
- Rename ActiveWorkerNodeList -> ActivePrimaryNodeList
- Rename WorkerGetLive{Node->Group}Count()
- Refactor WorkerGetRandomCandidateNode
- master_remove_node only complains about active shard placements if the
node being removed is a primary.
- master_remove_node only deletes all reference table placements in the
group if the node being removed is the primary.
- Rename {Node->NodeGroup}HasShardPlacements, this reflects the behavior it
already had.
- Rename DeleteAllReferenceTablePlacementsFrom{Node->NodeGroup}. This also
reflects the behavior it already had, but the new signature forces the
caller to pass in a groupId
- Rename {WorkerGetLiveGroup->ActivePrimaryNode}Count
GCC 7 added `-Wimplicit-fallthrough` to warn for not explicitly specified switch/case fall-throughs.
According to https://gcc.gnu.org/gcc-7/changes.html, to suppress that warning we could either use `__attribute__(fallthrough)`, which didn't seem to work for earlier GCC versions, or a `/* fallthrough */` comment just before the following `case`.
Previously Citus code had the fall-through comments inside the brackets, which didn't seem to suppress the warning. Putting a `/* fallthrough */` comment outside the brackets and right before the `case` fixes the problem.
This commit adds distributed transaction id infrastructure in
the scope of distributed deadlock detection.
In general, the distributed transaction id consists of a tuple
in the form of: `(databaseId, initiatorNodeIdentifier, transactionId,
timestamp)`.
Briefly, we add a shared memory block on each node, which holds some
information per backend (i.e., an array `BackendData backends[MaxBackends]`).
Later, on each coordinated transaction, Citus sends
`SELECT assign_distributed_transaction_id()` right after `BEGIN`.
For that backend on the worker, the distributed transaction id is set to
the values assigned via the function call.
The aim of the above is to correlate the transactions on the coordinator
to the transactions on the worker nodes.
Comes with a few changes:
- Change the signature of some functions to accept groupid
- InsertShardPlacementRow
- DeleteShardPlacementRow
- UpdateShardPlacementState
- NodeHasActiveShardPlacements returns true if the group the node is a
part of has any active shard placements
- TupleToShardPlacement now returns ShardPlacements which have NULL
nodeName and nodePort.
- Populate (nodeName, nodePort) when creating ShardPlacements
- Disallow removing a node if it contains any shard placements
- DeleteAllReferenceTablePlacementsFromNode matches based on group. This
doesn't change behavior for now (while there is only one node per
group), but means in the future callers should be careful about
calling it on a secondary node, it'll delete placements on the primary.
- Create concept of a GroupShardPlacement, which represents an actual
tuple in pg_dist_placement and is distinct from a ShardPlacement,
which has been resolved to a specific node. In the future
ShardPlacement should be renamed to NodeShardPlacement.
- Create some triggers which allow existing code to continue to insert
into and update pg_dist_shard_placement as if it still existed.
These functions are holdovers from pg_shard and were created for unit
testing c-level functions (like InsertShardPlacementRow) which our
regression tests already test quite effectively. Removing because it
makes refactoring the signatures of those c-level functions
unnecessarily difficult.
- create_healthy_local_shard_placement_row
- update_shard_placement_row_state
- delete_shard_placement_row
Uncrustify 0.65 appears to have changed some defaults, resulting in
breakages for those of us who have already upgraded; Travis still uses
Uncrustify 0.64, but these changes work with both versions (assuming
appropriately updated config), so this should permit use of either
version for the time being.
Before this change, we used ShareLock to acquire lock on distributed tables while
running VACUUM. This makes VACUUM and INSERT block each other. With this change we
changed lock mode from ShareLock to ShareUpdateExclusiveLock, which does not conflict
with the locks INSERT acquire.
MasterIrreducibleExpressionWalker has a copied code from
function check_functions_in_node() which was available with
PG 9.6+. Now PG 9.5 support is dropped we can remove
duplicate code and directly call check_functions_in_node().
Previously we used ForgetResults() in StartRemoteTransactionAbort() -
that's problematic because there might still be an ongoing statement,
and this causes us to wait for its completion. That e.g. happens when
a statement running on the coordinator is cancelled.
That's important because the currently running statement on a worker
might continue to hold locks and consume resources, even after the
connection is closed. Unfortunately postgres will only notice closed
connections when reading from / writing to the network. That might
only happen much later.
Now that there's no blocking libpq callers left, default to using
non-blocking mode in connection_management.c. This has two
advantages:
1) Blockiness doesn't have to frequently be reset, simplifying code
2) Prevents accidental use of blocking libpq functions, since they'll
frequently return 'need IO'
This commit is intended to be a base for supporting declarative partitioning
on distributed tables. Here we add the following utility functions and their
unit tests:
* Very basic functions including differnentiating partitioned tables and
partitions, listing the partitions
* Generating the PARTITION BY (expr) and adding this to the DDL events
of partitioned tables
* Ability to generate text representations of the ranges for partitions
* Ability to generate the `ALTER TABLE parent_table ATTACH PARTITION
partition_table FOR VALUES value_range`
* Ability to apply add shard ids to the above command using
`worker_apply_inter_shard_ddl_command()`
* Ability to generate `ALTER TABLE parent_table DETACH PARTITION`
Adds support for PostgreSQL 10 by copying in the requisite ruleutils
and updating all API usages to conform with changes in PostgreSQL 10.
Most changes are fairly minor but they are numerous. One particular
obstacle was the change in \d behavior in PostgreSQL 10's psql; I had
to add SQL implementations (views, mostly) to mimic the pre-10 output.
Add a second implementation of INSERT INTO distributed_table SELECT ... that is used if
the query cannot be pushed down. The basic idea is to execute the SELECT query separately
and pass the results into the distributed table using a CopyDestReceiver, which is also
used for COPY and create_distributed_table. When planning the SELECT, we go through
planner hooks again, which means the SELECT can also be a distributed query.
EXPLAIN is supported, but EXPLAIN ANALYZE is not because preventing double execution was
a lot more complicated in this case.
- Use native postgres function for composite key btree functions
- Move explain tests to multi_explain.sql (get rid of .out _0.out files)
- Get rid of input/output files for multi_subquery.sql by moving table creations
- Update some comments
With this commit we start to register InvalidateDistRelationCacheCallback
function as cache invalidation callback function before version checks
because during version checks we use cache to look up relation ids of some
relations like pg_dist_relation or pg_dist_partition_logical_relid_index
and we want to know about cache invalidation before accessing them.
During version update, we indirectly calld CheckInstalledVersion via
ChackCitusVersions. This obviously fails because during version update it is
expected to have version mismatch between installed version and binary version.
Thus, we remove that ChackCitusVersions. We now only call ChackAvailableVersion.
Before this commit, we were erroring out at almost all queries if there is a
version mismatch. With this commit, we started to error out only requested
operation touches distributed tables.
Normally we would need to use distributed cache to understand whether a table
is distributed or not. However, it is not safe to read our metadata tables when
there is a version mismatch, thus it is not safe to create distributed cache.
Therefore for this specific occasion, we directly read from pg_dist_partition
table. However; reading from catalog is costly and we should not use this
method in other places as much as possible.
This commit fixes the problem where we incorrectly try to reach distributed table
cache when the extension is not loaded completely. We tried to reach the cache
because we wanted to get reference table information to activate the node. However
it is actually not necessary to explicitly activate the nodes which come from
master_initialize_node_metadata. Because it only runs during extension creation and
at that time there are no reference tables and all nodes are considered as active.
When we propogate the schema creation command to data nodes we add schema's
owner name too. Before this patch, we did not quote the owner's name which
causes problems with the names containing characters like '-'.
We incorrectly try to use relation cache to find particular schema's owner and
when we cannot find the schema in the relation cache(i.e always), we automatically
used current user as the schema's owner. This means we always created schemas in
the data nodes with current user. With this patch we started to use namespace
cache to find schemas.
With this commit, we start to use custom compiled PostgreSQL builds in
Travis for merge commits. This allows us to run isolation tests and
PostgreSQL's own regression tests along with our regression tests in
Travis.
Since manually compiling PostgreSQL takes more time and we also add new
tests, we only enable running these tests on merge commits.
* Accept invalidation messages before accessing the metadata cache
This commit is crucial to prevent stale metadata reads from the
cache. Without this commit, some of the operations may use stale
metadata which could end up with various bugs such as crashes,
inconsistent/lost data etc.
As an example, consider that a COPY operation is blocked on shard
metadata lock. Another concurrent session updates the metadata and
invalidates the cache. However, since Citus doesn't accept invalidations,
COPY continues with the stale metadata once it acquires the lock.
With this commit, we make sure that invalidation messages are accepted
just before accessing the metadata cache and preventing any operation to
use stale metadata.
* Add isolation tests for placement changes and conccurrent operations
- add node with reference table vs COPY/insert/update/DDL
- repair shard vs COPY/insert/update/DDL
- repair shard vs repair shard
Distributed query planning for subquery pushdown is done on the original
query. This prevents the usage of external parameters on the execution.
To overcome this, we manually replace the parameters on the original
query.
* Support for subqueries in WHERE clause
This commit enables subqueries in WHERE clause to be pushed down
by the subquery pushdown logic.
The support covers:
- Correlated subqueries with IN, NOT IN, EXISTS, NOT EXISTS,
operator expressions such as (>, <, =, ALL, ANY etc.)
- Non-correlated subqueries with (partition_key) IN (SELECT partition_key ..)
(partition_key) =ANY (SELECT partition_key ...)
Note that this commit heavily utilizes the attribute equivalence logic introduced
in the 1cb6a34ba8. In general, this commit mostly
adjusts the logical planner not to error out on the subqueries in WHERE clause.
* Improve error checks for subquery pushdown and INSERT ... SELECT
Since we allow subqueries in WHERE clause with the previous commit,
we should apply the same limitations to those subqueries.
With this commit, we do not iterate on each subquery one by one.
Instead, we extract all the subqueries and apply the checks directly
on those subqueries. The aim of this change is to (i) Simplify the
code (ii) Make it close to the checks on INSERT .. SELECT code base.
* Extend checks for unresolved paramaters to include SubLinks
With the presence of subqueries in where clause (i.e., SubPlans on the
query) the existing way for checking unresolved parameters fail. The
reason is that the parameters for SubPlans are kept on the parent plan not
on the query itself (see primnodes.h for the details).
With this commit, instead of checking SubPlans on the modified plans
we start to use originalQuery, where SubLinks represent the subqueries
in where clause. The unresolved parameters can be found on the SubLinks.
* Apply code-review feedback
* Remove unnecessary copying of shard interval list
This commit removes unnecessary copying of shard interval list. Note
that there are no copyObject function implemented for shard intervals.
- There was a crash when the table a shardid belonged to changed during
a session. Instead of crashing (a failed assert) we now throw an error
- Update the isolation test which was crashing to no longer exercise
that code path
- Add a regression test to check that the error is thrown
* Enabling physical planner for subquery pushdown changes
This commit applies the logic that exists in INSERT .. SELECT
planning to the subquery pushdown changes.
The main algorithm is followed as :
- pick an anchor relation (i.e., target relation)
- per each target shard interval
- add the target shard interval's shard range
as a restriction to the relations (if all relations
joined on the partition keys)
- Check whether the query is router plannable per
target shard interval.
- If router plannable, create a task
* Add union support within the JOINS
This commit adds support for UNION/UNION ALL subqueries that are
in the following form:
.... (Q1 UNION Q2 UNION ...) as union_query JOIN (QN) ...
In other words, we currently do NOT support the queries that are
in the following form where union query is not JOINed with
other relations/subqueries :
.... (Q1 UNION Q2 UNION ...) as union_query ....
* Subquery pushdown planner uses original query
With this commit, we change the input to the logical planner for
subquery pushdown. Before this commit, the planner was relying
on the query tree that is transformed by the postgresql planner.
After this commit, the planner uses the original query. The main
motivation behind this change is the simplify deparsing of
subqueries.
* Enable top level subquery join queries
This work enables
- Top level subquery joins
- Joins between subqueries and relations
- Joins involving more than 2 range table entries
A new regression test file is added to reflect enabled test cases
* Add top level union support
This commit adds support for UNION/UNION ALL subqueries that are
in the following form:
.... (Q1 UNION Q2 UNION ...) as union_query ....
In other words, Citus supports allow top level
unions being wrapped into aggregations queries
and/or simple projection queries that only selects
some fields from the lower level queries.
* Disallow subqueries without a relation in the range table list for subquery pushdown
This commit disallows subqueries without relation in the range table
list. This commit is only applied for subquery pushdown. In other words,
we do not add this limitation for single table re-partition subqueries.
The reasoning behind this limitation is that if we allow pushing down
such queries, the result would include (shardCount * expectedResults)
where in a non distributed world the result would be (expectedResult)
only.
* Disallow subqueries without a relation in the range table list for INSERT .. SELECT
This commit disallows subqueries without relation in the range table
list. This commit is only applied for INSERT.. SELECT queries.
The reasoning behind this limitation is that if we allow pushing down
such queries, the result would include (shardCount * expectedResults)
where in a non distributed world the result would be (expectedResult)
only.
* Change behaviour of subquery pushdown flag (#1315)
This commit changes the behaviour of the citus.subquery_pushdown flag.
Before this commit, the flag is used to enable subquery pushdown logic. But,
with this commit, that behaviour is enabled by default. In other words, the
flag is now useless. We prefer to keep the flag since we don't want to break
the backward compatibility. Also, we may consider using that flag for other
purposes in the next commits.
* Require subquery_pushdown when limit is used in subquery
Using limit in subqueries may cause returning incorrect
results. Therefore we allow limits in subqueries only
if user explicitly set subquery_pushdown flag.
* Evaluate expressions on the LIMIT clause (#1333)
Subquery pushdown uses orignal query, the LIMIT and OFFSET clauses
are not evaluated. However, logical optimizer expects these expressions
are already evaluated by the standard planner. This commit manually
evaluates the functions on the logical planner for subquery pushdown.
* Better format subquery regression tests (#1340)
* Style fix for subquery pushdown regression tests
With this commit we intented a more consistent style for the
regression tests we've added in the
- multi_subquery_union.sql
- multi_subquery_complex_queries.sql
- multi_subquery_behavioral_analytics.sql
* Enable the tests that are temporarily commented
This commit enables some of the regression tests that were commented
out until all the development is done.
* Fix merge conflicts (#1347)
- Update regression tests to meet the changes in the regression
test output.
- Replace Ifs with Asserts given that the check is already done
- Update shard pruning outputs
* Add view regression tests for increased subquery coverage (#1348)
- joins between views and tables
- joins between views
- union/union all queries involving views
- views with limit
- explain queries with view
* Improve btree operators for the subquery tests
This commit adds the missing comprasion for subquery composite key
btree comparator.
We previously dismissed this as unimportant, but it turns out to be
very useful for the upcoming subquery pushdown, where a user might
specify an equality constraint in a subquery, and the subquery
pushdown machinery adds >= and <= restrictions on the shard boundary.
Previously the latter restriction was ignored.
It semms that GEQO optimizations, when it is set to on, create their own memory context
and free it after when it is no longer necessary. In join multi_join_restriction_hook
we allocate our variables in the CurrentMemoryContext, which is GEQO's memory context
if it is active. To prevent deallocation of our variables when GEQO's memory context is
freed, we started to allocate memory fo these variables in separate MemoryContext.
So far citus used postgres' predicate proofing logic for shard
pruning, except for INSERT and COPY which were already optimized for
speed. That turns out to be too slow:
* Shard pruning for SELECTs is currently O(#shards), because
PruneShardList calls predicate_refuted_by() for every
shard. Obviously using an O(N) type algorithm for general pruning
isn't good.
* predicate_refuted_by() is quite expensive on its own right. That's
primarily because it's optimized for doing a single refutation
proof, rather than performing the same proof over and over.
* predicate_refuted_by() does not keep persistent state (see 2.) for
function calls, which means that a lot of syscache lookups will be
performed. That's particularly bad if the partitioning key is a
composite key, because without a persistent FunctionCallInfo
record_cmp() has to repeatedly look-up the type definition of the
composite key. That's quite expensive.
Thus replace this with custom-code that works in two phases:
1) Search restrictions for constraints that can be pruned upon
2) Use those restrictions to search for matching shards in the most
efficient manner available:
a) Binary search / Hash Lookup in case of hash partitioned tables
b) Binary search for equal clauses in case of range or append
tables without overlapping shards.
c) Binary search for inequality clauses, searching for both lower
and upper boundaries, again in case of range or append
tables without overlapping shards.
d) exhaustive search testing each ShardInterval
My measurements suggest that we are considerably, often orders of
magnitude, faster than the previous solution, even if we have to fall
back to exhaustive pruning.
This determines whether it's possible to perform binary search on
sortedShardIntervalArray or not. If e.g. two shards have overlapping
ranges, that'd be prohibitive.
That'll be useful in later commit introducing faster shard pruning.
That's useful when comparing values a hash-partitioned table is
filtered by. The existing shardIntervalCompareFunction is about
comparing hashed values, not unhashed ones.
The added btree opclass function is so we can get a comparator
back. This should be changed much more widely, but is not necessary so
far.
Previously we, unnecessarily, used a the first shard's type
information to to look up the comparison function. But that
information is already available, so use it. That's helpful because
we sometimes want to access the comparator function even if there's no
shards.
With this commit, we started to send explain queries within a savepoint. After
running explain query, we rollback to savepoint. This saves us from side effects
of EXPLAIN ANALYZE on DML queries.
All callers fetch a cache entry and extract/compute arguments for the
eventual FindShardInterval call, so it makes more sense to refactor
into that function itself; this solves the use-after-free bug, too.
Soon shard pruning will be optimized not to generally work linearly
anymore. Thus we can't print the pruned shard intervals as currently
done anymore.
The current printing of shard ids also prevents us from running tests
in parallel, as otherwise shard ids aren't linearly numbered.
Pretty straightforward. Had some concerns about locking, but due to the
fact that all distributed operations use either some level of deparsing
or need to enumerate column names, they all block during any concurrent
column renames (due to the AccessExclusive lock).
In addition, I had some misgivings about permitting renames of the dis-
tribution column, but nothing bad comes from just allowing them.
Finally, I tried to trigger any sort of error using prepared statements
and could not trigger any errors not also exhibited by plain PostgreSQL
tables.
With this change, we set to default value of isactive column to true so that
upgrading users all nodes will be marked as active to not break their environment.
With this change we add an option to add a node without replicating all reference
tables to that node. If a node is added with this option, we mark the node as
inactive and no queries will sent to that node.
We also added two new UDFs;
- master_activate_node(host, port):
- marks node as active and replicates all reference tables to that node
- master_add_inactive_node(host, port):
- only adds node to pg_dist_node
Before this commit, we were erroring out for queries containing parameterized SQL functions
like 'SELECT parameterized_sql_query(value)' as we should, however we were returning wrong
results for queries like 'SELECT * FROM parameterized_sql_query(value)'. With this commit
we started to error out on such queries too.
In this PR, we aim to deduce whether each of the RTE_RELATION
is joined with at least on another RTE_RELATION on their partition keys. If each
RTE_RELATION follows the above rule, we can conclude that all RTE_RELATIONs are
joined on their partition keys.
In order to do that, we invented a new equivalence class namely:
AttributeEquivalenceClass. In very simple words, a AttributeEquivalenceClass is
identified by an unique id and consists of a list of AttributeEquivalenceMembers.
Each AttributeEquivalenceMember is designed to identify attributes uniquely within the
whole query. The necessity of this arise since varno attributes are defined within
a single level of a query. Instead, here we want to identify each RTE_RELATION uniquely
and try to find equality among each RTE_RELATION's partition key.
Whenever we find an equality clause A = B, where both A and B originates from
relation attributes (i.e., not random expressions), we create an
AttributeEquivalenceClass to record this knowledge. If we later find another
equivalence B = C, we create another AttributeEquivalenceClass. Finally, we can
apply transitity rules and generate a new AttributeEquivalenceClass which includes
A, B and C.
Note that equality among the members are identified by the varattno and rteIdentity.
Each equality among RTE_RELATION is saved using an AttributeEquivalenceClass where
each member attribute is identified by a AttributeEquivalenceMember. In the final
step, we try generate a common attribute equivalence class that holds as much as
AttributeEquivalenceMembers whose attributes are a partition keys.
This was getting pretty long and complex in the context of the main
utility hook. Moved out the checks for what should skip Citus process-
ing and what should have version checks performed.
The use of a bare src/ rather than $srcdir caused configure to fail
during VPATH builds. With our additional dependency upon AWK, we need
to call AC_PROG_AWK, otherwise environments may not have $AWK set.
Finally, citus_version.h should be in .gitignore.
With this change, we start to error out if loaded citus binaries does not match
the available major version or installed citus extension version. In this case
we force user to restart the server or run ALTER EXTENSION depending on the
situation
Thought this looked slightly nicer than the default behavior.
Changed preventTransaction to concurrent to be clearer that this code
path presently affects CONCURRENTLY code only.
Coordinator code marks index as invalid as a base, set it as valid in a
transactional layer atop that base, then proceeds with worker commands.
If a worker command has problems, the rollback results in an index with
isvalid = false. If everything succeeds, the user sees a valid index.
Before this commit, in certain cases router planner allowed pushing
down JOINs that are not on the partition keys.
With @anarazel's suggestion, we change the logic to use uninstantiated
parameter. Previously, the planner was traversing on the restriction
information and once it finds the parameter, it was replacing it with
the shard range. With this commit, instead of traversing the restrict
infos, the planner explicitly checks for the equivalence of the relation
partition key with the uninstantiated parameter. If finds an equivalence,
it adds the restrictions. In this way, we have more control over the
queries that are pushed down.
Some tests relied on worker errors though local commands were invalid.
Fixed those by ensuring preconditions were met to have command work
correctly. Otherwise most test changes are related to slight changes
in local/remote error ordering.
When running under Enterprise, some of the GRANT commands and whatnot
are propagated. Guarding that section with a call to disable DDL prop.
fixes everything.
With this commit, we add the range table list of the original query to our
custom plan. Therefore, PostgreSQL can check relations in the original query
for access permissions and error out if the proper access is not granted.
Custom Scan is a node in the planned statement which helps external providers
to abstract data scan not just for foreign data wrappers but also for regular
relations so you can benefit your version of caching or hardware optimizations.
This sounds like only an abstraction on the data scan layer, but we can use it
as an abstraction for our distributed queries. The only thing we need to do is
to find distributable parts of the query, plan for them and replace them with
a Citus Custom Scan. Then, whenever PostgreSQL hits this custom scan node in
its Vulcano style execution, it will call our callback functions which run
distributed plan and provides tuples to the upper node as it scans a regular
relation. This means fewer code changes, fewer bugs and more supported features
for us!
First, in the distributed query planner phase, we create a Custom Scan which
wraps the distributed plan. For real-time and task-tracker executors, we add
this custom plan under the master query plan. For router executor, we directly
pass the custom plan because there is not any master query. Then, we simply let
the PostgreSQL executor run this plan. When it hits the custom scan node, we
call the related executor parts for distributed plan, fill the tuple store in
the custom scan and return results to PostgreSQL executor in Vulcano style,
a tuple per XXX_ExecScan() call.
* Modify planner to utilize Custom Scan node.
* Create different scan methods for different executors.
* Use native PostgreSQL Explain for master part of queries.
Previously we'd segfault in PQisnonblocking() which, contrary to other
libpq calls, doesn't handle a NULL PQconn (because there'd be no
appropriate return value for that).
cr: @jasonmp85
Delete operation is blocked for any table distributed by hash using master_apply_delete_command. Suggested master_modify_multiple_shards command as a hint.
During later work the transaction debug output will change (as it will
in postgres 10), which makes it hard to see actual changes in the
INSERT ... SELECT ... test. Reduce to DEBUG2 after changing a debug
message to that log level.
This change ignores `citus.replication_model` setting and uses the
statement based replication in
- Tables distributed via the old `master_create_distributed_table` function
- Append and range partitioned tables, even if created via
`create_distributed_table` function
This seems like the easiest solution to #1191, without changing the existing
behavior and harming existing users with custom scripts.
This change also prevents RF>1 on streaming replicated tables on `master_create_worker_shards`
Prior to this change, `master_create_worker_shards` command was not checking
the replication model of the target table, thus allowing RF>1 with streaming
replicated tables. With this change, `master_create_worker_shards` errors
out on the case.
PostgreSQL 9.5.6 and 9.6.2 were released today and broke several tests
by adding TABLESPACE pg_default output to some DDL commands. Fixed all
occurrences.
cr: @anarazel
Add a call to RemoteTransactionBeginIfNecessary so that BEGIN is
actually sent to the remote connections. This means that ROLLBACK and
Ctrl-C are respected and don't leave the table in a partial state.
This change fixes the random failures on Travis, which is a bug introduced
with citus/#1124. Before this fix, travis was failing randomly on `check_multi_mx`
test schedule, specifically in the parallel group of `multi_mx_metadata`,
'multi_mx_modifications` and `multi_mx_modifying_xacts` tests. This change fixes this
by serializing these three test cases.
This change allows users to drop sequences on MX workers. Previously, Citus didn't allow dropping
sequences on MX workers because it could cause shards to be dropped if `DROP SEQUENCE ... CASCADE`
is used. We now allow that since allowing sequence creation but not dropping hurts user experience
and also may cause problems with custom Citus solutions.
- Break CheckShardPlacements into multiple functions (The most important
is MarkFailedShardPlacements), so that we can get rid of the global
CoordinatedTransactionUses2PC.
- Call MarkFailedShardPlacements in the router executor, so we mark
shards as invalid and stop using them while inside transaction blocks.
With this change DropShards function started to use new connection API. DropShards
function is used by DROP TABLE, master_drop_all_shards and master_apply_delete_command,
therefore all of these functions now support transactional operations. In DropShards
function, if we cannot reach a node, we mark shard state of related placements as
FILE_TO_DELETE and continue to drop remaining shards; however if any error occurs after
establishing the connection, we ROLLBACK whole operation.
Before this change, when a transaction failed, we update related placements shard states
to FILE_INACTIVE during XACT_EVENT_PRE_COMMIT. However that means if another code block
changed shard state to something else (e.g. FILE_TO_DELETE) before XACT_EVENT_PRE_COMMIT
we overwrite that. To prevent that problem, in case of failure we started to change
shard state, only if its current shard state is FILE_FINALIZED.
This UDF returns a shard placement from cache given shard id and placement id. At the
moment it iterates over all shard placements of given shard by ShardPlacementList and
searches given placement id in that list, which is not a good solution performance-wise.
However, currently, this function will be used only when there is a failed transaction.
If a need arises we can optimize this function in the future.
All router, real-time, task-tracker plannable queries should now have
full prepared statement support (and even use router when possible),
unless they don't go through the custom plan interface (which
basically just affects LANGUAGE SQL (not plpgsql) functions).
This is achieved by forcing postgres' planner to always choose a
custom plan, by assigning very low costs to plans with bound
parameters (i.e. ones were the postgres planner replanned the query
upon EXECUTE with all parameter values provided), instead of the
generic one.
This requires some trickery, because for custom plans to work the
costs for a non-custom plan have to be known, which means we can't
error out when planning the generic plan. Instead we have to return a
"faux" plan, that'd trigger an error message if executed. But due to
the custom plan logic that plan will likely (unless called by an SQL
function, or because we can't support that query for some reason) not
be executed; instead the custom plan will be chosen.
So far router planner had encapsulated different functionality in
MultiRouterPlanCreate. Modifications always go through router, selects
sometimes. Modifications always error out if the query is unsupported,
selects return NULL. Especially the error handling is a problem for
the upcoming extension of prepared statement support.
Split MultiRouterPlanCreate into CreateRouterPlan and
CreateModifyPlan, and change them to not throw errors.
Instead errors are now reported by setting the new
MultiPlan->plannigError.
Callers of router planner functionality now have to throw errors
themselves if desired, but also can skip doing so.
This is a pre-requisite for expanding prepared statement support.
While touching all those lines, improve a number of error messages by
getting them closer to the postgres error message guidelines.
The name CreatePhysicalPlan() hasn't been accurate for a while, and
the split of work between multi_planner() and CreatePhysicalPlan()
doesn't seem perfect. So rename to CreateDistributedPlan() and move a
bit more logic in there.
It can be useful, e.g. in the upcoming prepared statement support, to
be able to return an error from a function that is not raised
immediately, but can later be thrown. That allows e.g. to attempt to
plan a statment using different methods and to create good error
messages in each planner, but to only error out after all planners
have been run.
To enable that create support for deferred error messages that can be
created (supporting errorcode, message, detail, hint) in one function,
and then thrown in different place.
This adds a replication_model GUC which is used as the replication
model for any new distributed table that is not a reference table.
With this change, tables with replication factor 1 are no longer
implicitly MX tables.
The GUC is similarly respected during empty shard creation for e.g.
existing append-partitioned tables. If the model is set to streaming
while replication factor is greater than one, table and shard creation
routines will error until this invalid combination is corrected.
Changing this parameter requires superuser permissions.
If any placements fail it doesn't update shard statistics on those placements.
A minor enabling refactor: Make CoordinatedTransactionUses2PC public (it used to be CoordinatedTransactionUse2PC but that symbol already existed, so renamed it as well)
We changed error message which appears when user tries to execute outer join command and
that command requires repartitioning. Old error message mentioned about 1-to-1 shard
partitioning which may not be clear to user.
This enables proper transactional behaviour for copy and relaxes some
restrictions like combining COPY with single-row modifications. It
also provides the basis for relaxing restrictions further, and for
optionally allowing connection caching.
Instead use pg_plan_query() like the normal explain does, and use that
to explain the query. That's important because it allows to remove
the duplicated planner logic from multi_explain - and that logic is
about to get more complicated.
They make fixing explain for prepared statement harder, and they don't
really fit into EXPLAIN in the first place. Additionally they're
currently not exercised in any tests.
This change adds support for serial columns to be used with MX tables.
Prior to this change, sequences of serial columns were created in all
workers (for being able to create shards) but never used. With MX, we
need to set the sequences so that sequences in each worker create
unique values. This is done by setting the MINVALUE, MAXVALUE and
START values of the sequence.
A small refactor which pulls some code out of `RecoverWorkerTransactions`
and into `remote_commands.c`. This code block currently only occurs in
`RecoverWorkerTransactions` but will be useful to other functions
shortly.
Unfortunately we couldn't call it `ExecuteRemoteCommand`, that name was
already taken.
Because of foreign keys and similar concerns there should only be a
single modifying/DDL connection for a set of colocated placements to a
node. To enforce placement_connection.c now has an additional
hash-table keeping track of the connections to a set of colocated
placements. In addition to enforcing per placement restrictions on
connections, there's now very similar restrictions for sets of
colocated placements.
This commit is intended to improve the error messages while planning
INSERT INTO .. SELECT queries. The main motivation for this change is
that we used to map multiple cases into a single message. With this change,
we added explicit error messages for many cases.
With this change, we start to delete placement of reference tables at given worker node
after master_remove_node UDF call. We remove placement metadata at master node but we do
not drop actual shard from the worker node. There are two reasons for that decision,
first, it is not critical to DROP the shards in the workers because Citus will ignore them
as long as node is removed from cluster and if we add that node back to cluster we will
DROP and recreate all reference tables. Second, if node is unreachable, it becomes
complicated to cover failure cases and have a transaction support.
Enables use views within distributed queries.
User can create and use a view on distributed tables/queries
as he/she would use with regular queries.
After this change router queries will have full support for views,
insert into select queries will support reading from views, not
writing into. Outer joins would have a limited support, and would
error out at certain cases such as when a view is in the inner side
of the outer join.
Although PostgreSQL supports writing into views under certain circumstances.
We disallowed that for distributed views.
In tests related to automatic reference table creation and deletion, there were some
tests whose output may change order thus creating inconsistent test results. With this
change we add ORDER BY clause to related tests to have consistent output.
This change fixes a small bug about quoting of workerrack column in
NodeListInsertCommand:
Previous: `"..., '%s'", workerRack`
Now: `"..., %s", quote_literal_cstr(workerRack)`
So far we've reloaded them frequently. Besides avoiding that cost -
noticeable for some workloads with large shard counts - it makes it
easier to add information to ShardPlacements that help us make
placement_connection.c colocation aware.
Doing so requires adding a mapping from shardId to the cache
entries. For that metadata_cache.c now maintains an additional
hashtable. That hashtable only references shard intervals in the
dist table cache.
Previously the function was getting too large. Thus this splits the
function into separate parts for looking up the cache entry and
building the cache contents.
CloseNodeConnections() is supposed to close connections to a given node.
However, before this commit it lacks to actually call PQFinish() on the
connections. Using CloseConnection() handles closing and all other necessary
actions.
With this change we start to error out on router planner queries where a common table
expression with data-modifying statement is present. We already do not support if
there is a data-modifying statement using result of the CTE, now we also error out
if CTE itself is data-modifying statement.
Remove the router specific transaction and shard management, and
replace it with the new placement connection API. This mostly leaves
behaviour alone, except that it is now, inside a transaction, legal to
select from a shard to which no pre-existing connection exists.
To simplify code the code handling task executions for select and
modify has been split into two - the previous coding was starting to
get confusing due to the amount of only conditionally applicable code.
Modification connections & transactions are now always established in
parallel, not just for reference tables.
Currently there are several places in citus that map placements to
connections and that manage placement health. Centralize this
knowledge. Because of the centralized knowledge about which
connection has previously been used for which shard/placement, this
also provides the basis for relaxing restrictions around combining
various forms of DDL/DML.
Connections for a placement can now be acquired using
GetPlacementConnection(). If the connection is used for DML or DDL the
FOR_DDL/DML flags should be used respectively. If an individual
remote transaction fails (but the transaction on the master succeeds)
and FOR_DDL/DML have been specified, the placement is marked as
invalid, unless that'd mark all placements for a shard as invalid.
With this change, we start to replicate all reference tables to the new node when new node
is added to the cluster with master_add_node command. We also update replication factor
of reference table's colocation group.
Since we will now replicate reference tables each time we add node, we need to ensure
that test space is clean in terms of reference tables before any add node operation.
For this purpose we had to change order of multi_drop_extension test which caused
change of some of the colocation ids.
With this change we introduce new UDF, upgrade_to_reference_table, which can be used to
upgrade existing broadcast tables reference tables. For upgrading, we require that given
table contains only one shard.
Renamed FindShardIntervalIndex() to ShardIndex() and added binary search
capability. It used to assume that hash partition tables are always
uniformly distributed which is not true if upcoming tenant isolation
feature is applied. This commit also reduces code duplication.
Router planner already handles cases when all shards
are pruned out. This is about missing test cases. Notice that
"column is null" and "column = null" have different shard
pruning behavior.
We have one replication of reference table for each node. Therefore all problems with
replication factor > 1 also applies to reference table. As a solution we will not allow
foreign keys on reference tables. It is not possible to define foreign key from, to or
between reference tables.
Previously, we errored out if non-user tries to SELECT query for some metadata tables. It
seems that we already GRANT SELECT access to some metadata tables but not others. With
this change, we GRANT SELECT access to all existing Citus metadata tables.
* Add get_distribution_value_shardid UDF
With this UDF users can now map given distribution value to shard id. We mostly hide
shardids from users to prevent unnecessary complexity but some power users might need
to know about which entry/value is stored in which shard for maintanence purposes.
Signature of this UDF is as follows;
bigint get_distribution_value_shardid(table_name regclass, distribution_value anyelement)
With this commit, we implemented some basic features of reference tables.
To start with, a reference table is
* a distributed table whithout a distribution column defined on it
* the distributed table is single sharded
* and the shard is replicated to all nodes
Reference tables follows the same code-path with a single sharded
tables. Thus, broadcast JOINs are applicable to reference tables.
But, since the table is replicated to all nodes, table fetching is
not required any more.
Reference tables support the uniqueness constraints for any column.
Reference tables can be used in INSERT INTO .. SELECT queries with
the following rules:
* If a reference table is in the SELECT part of the query, it is
safe join with another reference table and/or hash partitioned
tables.
* If a reference table is in the INSERT part of the query, all
other participating tables should be reference tables.
Reference tables follow the regular co-location structure. Since
all reference tables are single sharded and replicated to all nodes,
they are always co-located with each other.
Queries involving only reference tables always follows router planner
and executor.
Reference tables can have composite typed columns and there is no need
to create/define the necessary support functions.
All modification queries, master_* UDFs, EXPLAIN, DDLs, TRUNCATE,
sequences, transactions, COPY, schema support works on reference
tables as expected. Plus, all the pre-requisites associated with
distribution columns are dismissed.
We used to disable router planner and executor
when task executor is set to task-tracker.
This change enables router planning and execution
at all times regardless of task execution mode.
We are introducing a hidden flag enable_router_execution
to enable/disable router execution. Its default value is
true. User may disable router planning by setting it to false.
Adds support for VACUUM and ANALYZE commands which target a specific
distributed table. After grabbing the appropriate locks, this imple-
mentation sends VACUUM commands to each placement (using one connec-
tion per placement). These commands are sent in parallel, so users
with large tables will benefit from sharding. Except for VERBOSE, all
VACUUM and ANALYZE options are supported, including the explicit
column list used by ANALYZE.
As with many of our utility commands, the local command also runs. In
the VACUUM/ANALYZE case, the local command is executed before any re-
mote propagation. Because error handling is managed after local proc-
essing, this can result in a VACUUM completing locally but erroring
out when distributed processing commences: a minor technicality in all
cases, as there isn't really much reason to ever roll back a VACUUM (an
impossibility in any case, as VACUUM cannot run within a transaction).
Remote propagation of targeted VACUUM/ANALYZE is controlled by the
enable_ddl_propagation setting; warnings are emitted if such a command
is attempted when DDL propagation is disabled. Unqualified VACUUM or
ANALYZE is not handled, but a warning message informs the user of this.
Implementation note: this commit adds a "BARE" value to MultiShard-
CommitProtocol. When active, no BEGIN command is ever sent to remote
nodes, useful for commands such as VACUUM/ANALYZE which must not run in
a transaction block. This value is not user-facing and is reset at
transaction end.
This change adds `start_metadata_sync_to_node` UDF which copies the metadata about nodes and MX tables
from master to the specified worker, sets its local group ID and marks its hasmetadata to true to
allow it receive future DDL changes.
One less place managing remote transactions. It also makes it fairly
easy to use 2PC for certain modifications (e.g. reference tables). Just
issue a CoordinatedTransactionUse2PC(). If every placement failure
should cause the whole transaction to abort, additionally mark the
relevant transactions as critical.
Not that this is not with the citus *extension* loaded, just the shared
library. The former runs (by adding --use-existing to the flags) but
has a bunch of trivial test differences.
With this PR, we add foreign key support to ALTER TABLE commands. For now,
we only support foreign constraint creation via ALTER TABLE query, if it
is only subcommand in ALTER TABLE subcommand list.
We also only allow foreign key creation if replication factor is 1.
That way connections can be automatically closed after errors and such,
and the connection management infrastructure gets wider testing. It
also fixes a few issues around connection string building.
This includes basic infrastructure for logging of commands sent to
remote/worker nodes. Note that this has no effect as of yet, since no
callers are converted to the new infrastructure.
Connections are tracked and released by integrating into postgres'
transaction handling. That allows to to use connections without having
to resort to having to disable interrupts or using PG_TRY/CATCH blocks
to avoid leaking connections.
This is intended to eventually replace multi_client_executor.c and
connection_cache.c, and to provide the basis of a centralized
transaction management.
The newly introduced transaction hook should, in the future, be the only
one in citus, to allow for proper ordering between operations. For now
this central handler is responsible for releasing connections and
resetting XactModificationLevel after a transaction.
On some systems a new libpq is available than what we're compiling
against, but until now we used psql in the version we're compiling
against. That' a problem, because (quoting Jason):
With 9.6, libpq's default handling of CONTEXT changed: it is hidden
unless the level is ERROR or higher. We addressed this ourselves using
the SHOW_CONTEXT variable (by setting "always" in pg_regress_multi): in
9.5, this is ignored (and unneeded), in 9.6, it ensures old behavior is
preserved.
For 9.6 we'd already worked around the problem by specifying that
context should always be shown, but < 9.6 psql doesn't know how to do
that.
As there's no csql anymore, which strictly tied us to a specific version
of psql/csql, we can now just use the system's psql if available. We
still fall back to the psql of the installation we're compiling against,
if there's no other psql in PATH.
This commit fixes a bug when the SELECT target list includes a constant
value.
Previous behaviour of target list re-ordering:
* Iterate over the INSERT target list
* If it includes a Var, find the corresponding SELECT entry
and update its resno accordingly
* If it does not include a Var (which we only considered to be
DEFAULTs), generate a new SELECT target entry
* If the processed target entry count in SELECT target list is less
than the original SELECT target list (GROUP BY elements not included in
the SELECT target entry), add them in the SELECT target list and
update the resnos accordingly.
* However, this step was leading to add the CONST SELECT target entries
twice. The reason is that when CONST target list entries appear in the
SELECT target list, the INSERT target list doesn't include a Var. Instead,
it includes CONST as it does for DEFAULTs.
New behaviour of target list re-ordering:
* Iterate over the INSERT target list
* If it includes a Var, find the corresponding SELECT entry
and update its resno accordingly
* If it does not include a Var (which we consider to be
DEFAULTs and CONSTs on the SELECT), generate a new SELECT
target entry
* If any target entry remains on the SELECT target list which are resjunk,
(GROUP BY elements not included in the SELECT target entry), keep them
in the SELECT target list by updating the resnos.
While creating a colocated table, we don't want the source table to be dropped.
However, using a ShareLock blocks DML statements on the source table, and
using AccessShareLock is enough to prevent DROP. Therefore, we just loosened
the lock to AccessShareLock.
This change allows seeing the names of columns of `master_add_node`,
using `SELECT * FROM master_add_node(...)` by specifying output
columns in UDF definition.
Previously, we threw an error when we ran CREATE INDEX IF NOT EXISTS
with an already existing index. This change enables expected behavior by
checking if the statement has IF NOT EXISTS before throwing the error.
We also ensure that we don't execute the command on the workers, if an
index already exists on the master.
At the moment, we do not support foreign constraints if replication factor is greater
than 1. However foreign constraints can be used in cloud with high availability option.
Therefore we do not want to create an impression such that foreign constraints with
high availability is not supported at all. We call users to action with this error
message.
In ErrorIfShardPlacementsNotColocated(), while checking if shards are colocated,
error out if matching shard intervals have different number of shard placements.
Added a new UDF, mark_tables_colocated(), to colocate tables with the same
configuration (shard count, shard replication count and distribution column type).
Fixcitusdata/citus#886
The way postgres' explain hook is designed means that our hook is never
called during EXPLAIN EXECUTE. So, we special-case EXPLAIN EXECUTE by
catching it in the utility hook. We then replace the EXECUTE with the
original query and pass it back to Citus.
Previously, when a repair is requested on a shard, we also repair all co-located shards
of given shard, which may cause repairing already healthy shards. With this change, we
only repair given shard.
This forces prepared statements to be re-planned after changes of the
placement metadata. There's some locking issues remaining, but that's a
a separate task.
Also add regression tests verifying that invalidations take effect on
prepared statements.
This commit adds INSERT INTO ... SELECT feature for distributed tables.
We implement INSERT INTO ... SELECT by pushing down the SELECT to
each shard. To compute that we use the router planner, by adding
an "uninstantiated" constraint that the partition column be equal to a
certain value. standard_planner() distributes that constraint to all
the tables where it knows how to push the restriction safely. An example
is that the tables that are connected via equi joins.
The router planner then iterates over the target table's shards,
for each we replace the "uninstantiated" restriction, with one that
PruneShardList() handles. Do so by replacing the partitioning qual
parameter added in multi_planner() with the current shard's
actual boundary values. Also, add the current shard's boundary values to the
top level subquery to ensure that even if the partitioning qual is
not distributed to all the tables, we never run the queries on the shards
that don't match with the current shard boundaries. Finally, perform the
normal shard pruning to decide on whether to push the query to the
current shard or not.
We do not support certain SQLs on the subquery, which are described/commented
on ErrorIfInsertSelectQueryNotSupported().
We also added some locking on the router executor. When an INSERT/SELECT command
runs on a distributed table with replication factor >1, we need to ensure that
it sees the same result on each placement of a shard. So we added the ability
such that router executor takes exclusive locks on shards from which the SELECT
in an INSERT/SELECT reads in order to prevent concurrent changes. This is not a
very optimal solution, but it's simple and correct. The
citus.all_modifications_commutative can be used to avoid aggressive locking.
An INSERT/SELECT whose filters are known to exclude any ongoing writes can be
marked as commutative. See RequiresConsistentSnapshot() for the details.
We also moved the decison of whether the multiPlan should be executed on
the router executor or not to the planning phase. This allowed us to
integrate multi task router executor tasks to the router executor smoothly.
The necessity for this functionality comes from the fact that ruleutils.c is not supposed to be
used on "rewritten" queries (i.e. ones that have been passed through QueryRewrite()).
Query rewriting is the process in which views and such are expanded,
and, INSERT/UPDATE targetlists are reordered to match the physical order,
defaults etc. For the details of reordeing, see transformInsertRow().
We'd been relying on a single SET search_path command in an earlier
script, but a subsequent script RESET search_path, causing any further
bare functions to be created in the first schema on the search path.
However, starting with an older extension version and executing ALTER
scripts one at a time DOES avoid putting any functions in the public
namespace, so I wrote an upgrade script resilient to that, especially
because PostgreSQL 9.5 will error out if a function is already in the
schema it's being moved to.
Between restart (running the new code) and ALTER EXTENSION citus
UPGRADE there was an inconsistency where we assumed that
pg_dist_partition had the repmodel column set. Now we give it a default
value if the column doesn't exist yet.
With this change, we now push down foreign key constraints created during CREATE TABLE
statements. We also start to send foreign constraints during shard move along with
other DDL statements
create_reference_table() creates a hash distributed table with shard count
equals to 1 and replication factor equals to shard_replication_factor
configuration value.
Adds support for PostgreSQL 9.6 by copying in the requisite ruleutils
file and refactoring the out/readfuncs code to flexibly support the
old-style copy/pasted out/readfuncs (prior to 9.6) or use extensible
node APIs (in 9.6 and higher).
Most version-specific code within this change is only needed to set new
fields in the AggRef nodes we build for aggregations. Version-specific
test output files were added in certain cases, though in most they were
not necessary. Each such file begins by e.g. printing the major version
in order to clarify its purpose.
The comment atop citus_nodes.h details how to add support for new nodes
for when that becomes necessary.
This change adds the pg_dist_local_group metadata table, which indicates
the group id of the current node. It is expected that this table contains
one and only one row, which only contains the group id of the node as an
integer.
With this change, master_copy_shard_placement and master_move_shard_placement functions
start to copy/move given shard along with its co-located shards.
This commit completes having support in Citus by adding having support for
real-time and task-tracker executors. Multiple tests are added to regression
tests to cover new supported queries with having support.
This change adds the required infrastructure about metadata snapshot from MX
codebase into Citus, mainly metadata_sync.c file and master_metadata_snapshot UDF.
Two sets of tests are fixed by this change:
* multi_agg_approximate_distinct
* those in multi_task_tracker_extra_schedule
The first broke when we renamed stage to load in many files and was
never being run because the HyperLogLog extension wasn't easily
available in Debian. Now it's in our repo, so we install it and run
the test. I removed the distinct HLL target in favor of just always
running it and providing an output variant to handle when the extension
is absent. Basically, if PostgreSQL thinks HLL is available, the test
installs it and runs normally, otherwise the absent variant is used.
The second broke when I removed a test variant, erroneously believing
it to be related to an older Citus version. I've added a line in that
test to clarify why the variant is necessary (a practice we should
widely adopt).
So far placements were assigned an Oid, but that was just used to track
insertion order. It also did so incompletely, as it was not preserved
across changes of the shard state. The behaviour around oid wraparound
was also not entirely as intended.
The newly introduced, explicitly assigned, IDs are preserved across
shard-state changes.
The prime goal of this change is not to improve ordering of task
assignment policies, but to make it easier to reference shards. The
newly introduced UpdateShardPlacementState() makes use of that, and so
will the in-progress connection and transaction management changes.
Related to #786
This change adds the `pg_dist_node` table that contains the information
about the workers in the cluster, replacing the previously used
`pg_worker_list.conf` file (or the one specified with `citus.worker_list_file`).
Upon update, `pg_worker_list.conf` file is read and `pg_dist_node` table is
populated with the file's content. After that, `pg_worker_list.conf` file
is renamed to `pg_worker_list.conf.obsolete`
For adding and removing nodes, the change also includes two new UDFs:
`master_add_node` and `master_remove_node`, which require superuser
permissions.
'citus.worker_list_file' guc is kept for update purposes but not used after the
update is finished.
related to a table that might be distributed, allowing any name
that is within regular PostgreSQL length limits to be extended
with a shard ID for use in shards on workers. Handles multi-byte
character boundaries in identifiers when making prefixes for
shard-extended names. Includes tests.
Uses hash_any from PostgreSQL's access/hashfunc.c.
Removes AppendShardIdToStringInfo() as it's used only once
and arguably is best replaced there with a call to AppendShardIdToName().
Adds UDF shard_name(object_name, shard_id) to expose the shard-extended
name logic to other PL/PGSQL, UDFs and scripts.
Bumps version to 6.0-2 to allow for UDF to be created in migration script.
Fixescitusdata/citus#781 and citusdata/citus#179.
hash_create(), called by TaskHashCreate(), doesn't work correctly for a
zero sized hash table. This triggers valgrind errors, and could
potentially cause crashes even without valgring.
This currently happens for Jobs with 0 tasks. These probably should be
optimized away before reaching TaskHashCreate(), but that's a bigger
change.
count_agg_clause *adds* the cost of the aggregates to the state
variable, it doesn't reinitialize it. That is intentional, as it is used
to incrementally add costs in some places.
is now a `::regtype` using the qualified name of the column type,
not the column type OID which may differ between master/worker nodes.
Test coverage of a hash reparitition using a UDT as the join column.
Note that the UDFs `worker_hash_partition_table` and `worker_range_partition_table`
are unchanged, and rightly expect an OID for the column type; but the
planner code building the commands now allows for `::regtype` casting
to do its magic.
Fixescitusdata/citus#111.
Fixescitusdata/citus#714
On `InsertShardRow`, we previously called `CommandCounterIncrement()` before
`CitusInvalidateRelcacheByRelid(relationId);`. This might prevent to skip
invalidation of the distributed table in the next access within the same session.
This commit enables to create different worker and master temporary folders.
This change is important for citus-mx on task-tracker execution. In simple words,
on citus-mx, the worker could actually be reponsible for the master tasks as well.
Prior to this change, both master and worker logic on task-tracker executor was
accessing and using the same files for different purposes which was dangerous on
certain cases (i.e., when task_tracker_delay is low).
Before this change, count on a distributed returned NULL if all shards
were pruned away, because on the master we replace with count(..) call
with a sum(..) call to sum the counts from the shards. However, sum
returns NULL when there are no rows, whereas count is expected to return
0.
I had changed these callbacks to use the same method I chose for the
router executor (for consistency), but as that method is flawed, we now
want to ensure we directly register them from PG_init as well.
Not entirely sure why we went with the shared memory hook approach, but
it causes problems (multiple registration) during crashes. Changing to
a simple direct registration call from PG_init.
An interaction between ReraiseRemoteError and DML transaction support
causes segfaults:
* ReraiseRemoteError calls PurgeConnection, freeing a connection...
* That connection is still in the xactParticipantHash
At transaction end, the memory in the freed connection might happen to
pass the "is this connection OK?" check, causing us to try to send an
ABORT over that connection. By removing it from the transaction hash
before calling ReraiseRemoteError, we avoid this possibility.
UNIQUE or PRIMARY KEY constraints. Also, properly propagate valid
EXCLUDE constraints to worker shard tables.
If an EXCLUDE constraint includes the distribution column,
the operator must be an equality operator.
Tests in regression suite for exclusion constraints that include
the partition column, omit it, and include it but with non-equality
operator. Regression tests also verify that valid exclusion constraints
are propagated to the shard tables. And the tests work in different
timezones now.
Fixescitusdata/citus#748 and citusdata/citus#778.
pg_toast_* oids are constantly changing, and this causes regression tests to
fail time to time. With this commit, we remove all of the pg_toast_* references
from regression test outputs.
Three changes here to get to true multi-statement, multi-relation DDL
transactions (same functionality pre-5.2, with benefits of atomicity):
1. Changed the multi-shard utility hook to always run (consistency
with router executor hook, removes ad-hoc "installed" boolean)
2. Change the global connection list in multi_shard_transaction to
instead be a hash; update related functions to operate on global
hash instead of local hash/global list
3. Remove check within DDL code to prevent subsequent DDL commands;
place unset/reset guard around call to ConnectToNode to permit
connecting to additional nodes after DDL transaction has begun
In addition, code has been added to raise an error if a ROLLBACK TO
SAVEPOINT is attempted (similar to router executor), and comprehensive
tests execute all multi-DDL scenarios (full success, user ROLLBACK, any
actual errors (say, duplicate index), partial failure (duplicate index
on one node but not others), partial COMMIT (one node fails), and 2PC
partial PREPARE (one node fails)). Interleavings with other commands
(DML, \copy) are similarly all covered.
To permit use with ZomboDB (https://github.com/zombodb/zombodb), two
changes were necessary:
1. Permit use of `tableoid` system column in queries
2. Extend relation names appearing in index expressions
The first is accomplished by simply changing the deparse logic to allow
system columns in queries destined for distributed tables. The latter
was slightly more complex, given that DDL extension currently occurs on
workers. But since indexes cannot reference tables other than the one
being indexed, it is safe to look for any relation reference ending in
a '*' character and extend their penultimate segments with a shard id.
This change also adds an error to prevent users from distributing any
relations using the WITH (OIDS) feature, which is unsupported.
The call to hash_create specified HASH_CONTEXT without actually setting
one using the provided HASHCTL. The hashes returned by this function
are used locally, so simply using CurrentMemoryContext is sufficient.
In subquery pushdown, we allow outer joins if the join condition is on the
partition columns. WhereClauseList() used to return all join conditions including
outer joins. However, this has been changed with a commit related to outer join
support on regular queries. With this commit, we refactored ExtractFromExpressionWalker()
to return two lists of qualifiers. The first list is for inner join and filter
clauses and the second list is for outer join clauses. Therefore, we can also
use outer join clauses to check subquery pushdown prerequisites.
Before this change, we do not check whether given table which already contains any data
in master_create_distributed_table command. If that table contains any data, making it
it distributed, makes that data hidden to user. With this change, we now gave error to
user if the table contains data.
Recent changes to DDL and transaction logic resulted in a "regression"
from the viewpoint of users. Previously, DDL commands were allowed in
multi-command transaction blocks, though they were not processed in any
actual transactional manner. We improved the atomicity of our DDL code,
but added a restriction that DDL commands themselves must not occur in
any BEGIN/END transaction block.
To give users back the original functionality (and improved atomicity)
we now keep track of whether a multi-command transaction has modified
data (DML) or schema (DDL). Interleaving the two modification types in
a single transaction is disallowed.
This first step simply permits a single DDL command in such a block,
admittedly an incomplete solution, but one which will permit us to add
full multi-DDL command support in a subsequent commit.
"Staging table" will be the only valid use of 'stage' from now on, we
will now say "load" when talking about data ingestion. If creation of
shards is its own step, we'll just say "shard creation".
Fixes#547
This change removes all references to \stage in the regression tests
and puts \COPY instead. Doing so changed shard counts, min/max
values on some test tables (lineitem, orders, etc.).
I've been seeing warnings on OS X/clang for a while about these lines
and finally got tired of it. The main problem is that PRIu64 expects a
uint64_t but we were passing a uint64 (a PostgreSQL-defined type). In
PostgreSQL 9.5, we now have INT64_MODIFIER, so can build our own zero-
padded unsigned 64-bit int format modifier that expects a PostgreSQL-
provided uint64 type.
This simplifies the code slightly (no more ifdefs) and gets rid of the
warning that's been annoying me since April (my TODO creation time).
A recent change to the image used in Travis causes some problems for
the code we use here to ensure the local replica is first. Since this
code is essentially dead in a post-stage world anyhow, we're OK with
ripping out the tests to placate Travis.
PostgreSQL 9.5.4 stopped calling planner for materialized view create
command when NO DATA option is provided.
This causes our test to behave differently between pre-9.5.4 and 9.5.4.
When an unreferenced prepared statement parameter does not explicitly
have a type assigned, we cannot deserialize it, to send to the remote
side. That commonly happens inside plpgsql functions, where local
variables are passed in as unused prepared statement parameters.
A recent change generates a "dummy" shard placement with its identifier
set to INVALID_SHARD_ID for SELECT queries against distributed tables
with no shards. Normally, no lock is acquired for SELECT statements,
but if all_modifications_commutative is set to true, we will acquire a
shared lock, triggering an assertion failure within LockShardResource
in the above case.
The "dummy" shard placement is actually necessary to ensure such empty
queries have somewhere to execute, and INVALID_SHARD_ID seems the most
appropriate value for the dummy's shard identifier field, so the most
straightforward fix is to just avoid locking invalid shard identifiers.
Text datums can't be directly accessed via the struct equivalence trick
used to access catalogs. That's because, as an optimization, they're
sometimes aligned to 1 byte ("text"'s alignment), and sometimes to 4
bytes. That depends on it being a short
varlena (cf. VARATT_NOT_PAD_BYTE) or not.
In the case at hand here, partkey became longer than 127 characters -
the boundary for short varlenas (cf. VARATT_CAN_MAKE_SHORT()). Thus it
became 4 byte/int aligned. Which lead to the direct struct access
accessing the wrong data.
The fix is simply to never access partkey that way - to enforce that,
hide partkey ehind the usual ifdef.
Fixes: #674
Fixes#679
This change sets the default commit protocol for distributed DDL
commands to '1pc'. If the user issues a distributed DDL command with
this default setting, then once in a session, a NOTICE message is
shown about using '2pc' being extra safe.
This adds support for SERIAL/BIGSERIAL column types. Because we now can
evaluate functions on the master (during execution), adding this is a
matter of ensuring the table creation step works properly.
To accomplish this, I've added some logic to detect sequences owned by
a table (i.e. those related to its columns). Simply creating a sequence
and using it in a default value is insufficient; users who do so must
ensure the sequence is owned by the column using it.
Fortunately, this is exactly what SERIAL and BIGSERIAL do, which is the
use case we're targeting with this feature. While testing this, I found
that worker_apply_shard_ddl_command actually adds shard identifiers to
sequence names, though I found no places that use or test this path. I
removed that code so that sequence names are not mutated and will match
those used by a SERIAL default value expression.
Our use of the new-to-9.5 CREATE SEQUENCE IF NOT EXISTS syntax means we
are dropping support for 9.4 (which is being done regardless, but makes
this change simpler). I've removed 9.4 from the Travis build matrix.
Some edge cases are possible in ALTER SEQUENCE, COPY FROM (on workers),
and CREATE SEQUENCE OWNED BY. I've added errors for each so that users
understand when and why certain operations are prohibited.
We remove schema name parameter from worker_fetch_foreign_file and
worker_fetch_regular_table functions. We now send schema name
concatanated with table name.
Fixes#676
We added old versions (i.e. without schema name) of worker_apply_shard_ddl_command,
worker_fetch_foreign_file and worker_fetch_regular_table back. During function call
of one of these functions, we set schema name as public schema and call the newer
version of the functions.
This change removes AllFinalizedPlacementsAccessible function since,
we open connections to all shard placements before any command is
sent so we immediately error out if a shard placement is not accessible.
This change allows users to interrupt long running DDL commands.
Interrupt requests are handled after each DDL command being propagated
to a shard placement, which means that generally the cancel request will
be processed right after the execution of the DDL is finished in the
current placement.
We can now support richer set of queries in router planner.
This allow us to support CTEs, joins, window function, subqueries
if they are known to be executed at a single worker with a single
task (all tables are filtered down to a single shard and a single
worker contains all table shards referenced in the query).
Fixes : #501
Fixes#132
We hook into ALTER ... SET SCHEMA and warn out if user tries to change schema of a
distributed table.
We also hook into ALTER TABLE ALL IN TABLE SPACE statements and warn out if citus has
been loaded.
Allows the use of modification commands (INSERT/UPDATE/DELETE) within
transaction blocks (delimited by BEGIN and ROLLBACK/COMMIT), so long as
all modifications hit a subset of nodes involved in the first such com-
mand in the transaction. This does not circumvent the requirement that
each individual modification command must still target a single shard.
For instance, after sending BEGIN, a user might INSERT some rows to a
shard replicated on two nodes. Subsequent modifications can hit other
shards, so long as they are on one or both of these nodes.
SAVEPOINTs are supported, though if the user actually attempts to send
a ROLLBACK command that specifies a SAVEPOINT they will receive an
ERROR at the end of the topmost transaction.
Placements are only marked inactive if at least one replica succeeds
in a transaction where others fail. Non-atomic behavior is possible if
the shard targeted by the initial modification within a transaction has
a higher replication factor than another shard within the same block
and a node with the latter shard has a failure during the COMMIT phase.
Other methods of denoting transaction blocks (multi-statement commands
sent all at once and functions written in e.g. PL/pgSQL or other such
languages) are not presently supported; their treatment remains the
same as before.
Fixes#555
Before this change, we were resolving HLL function and type Oid without qualified name.
Now we find the schema name where HLL objects are stored and generate qualified names for
each objects.
Similar fix is also applied for cstore_table_size function call.
Fixes#565Fixes#626
To add schema support to citus, we need to schema-prefix all table names, object names etc.
in the queries sent to worker nodes. However; query deparsing is not available for most of
DDL commands, therefore it is not easy to generate worker query in the master node.
As a solution we are sending schema names along with shard id and query to run to worker
nodes with worker_apply_shard_ddl_command.
To not break \STAGE command we pass public schema as paramater while calling
worker_apply_shard_ddl_command from there. This will not cause problem if user uses \STAGE
in different schema because passes schema name is used only if there is no schema name is
given in the query.
Fixes#215Fixes#267Fixes#502Fixes#556Fixes#557Fixes#560Fixes#568Fixes#623Fixes#624
With this change we schema-prefix table names, operator names and composite types.
This change fixes the unused variable problem in
`ExecuteDistributedDDLCommand` function (multi_utility.c). The
parameter is meant to be used in PreventTransactionChain call.
Fixes#513
This change modifies the DDL Propagation logic so that DDL queries
are propagated via 2-Phase Commit protocol. This way, failures during
the execution of distributed DDL commands will not leave the table in
an intermediate state and the pending prepared transactions can be
commited manually.
DDL commands are not allowed inside other transaction blocks or functions.
DDL commands are performed with 2PC regardless of the value of
`citus.multi_shard_commit_protocol` parameter.
The workflow of the successful case is this:
1. Open individual connections to all shard placements and send `BEGIN`
2. Send `SELECT worker_apply_shard_ddl_command(<shardId>, <DDL Command>)`
to all connections, one by one, in a serial manner.
3. Send `PREPARE TRANSCATION <transaction_id>` to all connections.
4. Sedn `COMMIT` to all connections.
Failure cases:
- If a worker problem occurs before sending of all DDL commands is finished, then
all changes are rolled back.
- If a worker problem occurs after all DDL commands are sent but not after
`PREPARE TRANSACTION` commands are finished, then all changes are rolled back.
However, if a worker node is failed, then the prepared transactions in that worker
should be rolled back manually.
- If a worker problem occurs during `COMMIT PREPARED` statements are being sent,
then the prepared transactions on the failed workers should be commited manually.
- If master fails before the first 'PREPARE TRANSACTION' is sent, then nothing is
changed on workers.
- If master fails during `PREPARE TRANSACTION` commands are being sent, then the
prepared transactions on workers should be rolled back manually.
- If master fails during `COMMIT PREPARED` or `ROLLBACK PREPARED` commands are being
sent, then the remaining prepared transactions on the workers should be handled manually.
This change also helps with #480, since failed DDL changes no longer mark
failed placements as inactive.
Fixes#394
This change adds LIMIT/OFFSET support for non router-plannable
distributed queries.
In cases that we can push the LIMIT down, we add the OFFSET value to
that LIMIT in the worker queries. When a query with LIMIT x OFFSET y is issued,
the query is propagated to the workers as LIMIT (x+y) OFFSET 0, and on the
master table, the original LIMIT and OFFSET values are used. With this change,
we can use OFFSET wherever we can use LIMIT.
- Enables using VOLATILE functions (like nextval()) in INSERT queries
- Enables using STABLE functions (like now()) targetLists and joinTrees
UPDATE and INSERT can now contain non-immutable functions. INSERT can contain any kind of
expression, while UPDATE can contain any STABLE function, so long as a Var is not passed
into the STABLE function, even indirectly. UPDATE TagetEntry's can now also include Vars.
There's an exception, CASE/COALESCE statements may not contain mutable functions.
Functions calls in master_modify_multiple_shards are also evaluated.
Fixes#463
OID of user-defined types may be different in master and worker nodes. This causes errors
while sending data between nodes with binary nodes. Because binary copy format adds OID
of the element if it is in an array. The code adding OID is in PostgreSQL code, therefore
we cannot change it. Instead we decided to use text format if we try to send array of
user-defined type.
It turns out some tests exercised this behavior, but removing it should
have no ill effects. Besides, both copy and INSERT disallow NULLs in a
table's partition column.
Fixes a bug where anti-joins on hash-partitioned distributed tables
would incorrectly prune shards early, result in incorrect results (test
included).
The upcoming RETURNING support would otherwise require too much
duplication. This contains most of the pieces required for RETURNING
support, except removing the planner checks and adjusting regression
test output.
The old targetlist wasn't used so far, but the upcoming RETURNING
support relies on it.
This also allows to get rid of some crufty code in
multi_executor.c:multi_ExecutorStart(), which used the worker query's
targetlist instead of the main statement's (which didn't have one up to
now).
The targetlist contains TargetEntrys containing expressions, not
expressions directly. That didn't matter so far, but with the upcoming
RETURNING support, the targetlist is inspected to build a TupleDesc.
ExecCleanTypeFromTL hits an assert when looking at something that's not
a TargetEntry.
Mark the entry as resjunk, so it's not actually used.
As postgres's generic .l -> .c Makefile rule uses ifdef - which is
evaluated early, not during rule evaluation - we have to override the
rule, in addition to the detection of FLEX in the previous commit.
Fixes: #439
The only way we re-raise an error is if the raiseError flag is true, so
might as well purge connection in that block rather than independently
checking errorLevel.
Fixes#78
With this change, it is possible to append a table in any schema to shard. The function
master_append_table_to_shard now supports schema names.
For CITUS_RTE_RELATION type fragments, reloading shardIntervals from the
database is rather expensive. So store a pointer to the full shard
interval, instead of just the shard id. There's no new memory lifetime
hazards here, because we already passed a pointer to the shardInterval's
->shardId field around.
The plan time for the query in issue #607 goes from 2889 ms to 106 ms.
with this change.
By far the most expensive part of ShardIntervalsOverlap() is computing
the function to use to determine overlap. Luckily we already have that
computed and cached.
The plan time for the query in issue #607 goes from 8764 ms to 2889 ms
with this change.
-Added 2 more schedules for task-tracker and multi-binary
instead of running multi_schedule 3 times
-set task-tracker-delay for each long running schedule
Fixes#550, fixes#545
If table name contains special characters, it needs to be escaped. However in some cases,
we escape table name before appending shardId, which causes syntax error in the queries
sent to worker nodes. With this change we now append shardId before escaping table names.
This checkin removes variant files we needed
due to differences in outputs of pg94 and pg95 runs.
However, variant file for test multi_upsert stays
since this file tests for a feature that does not
exist in pg94, and outputs are drastically different.
now copies all column references in count distinct aggreagete
to worker target list and group by. Master target list is
also updated to reflect changes in attribute order.
Fixes 569
Fixes#496
Previously we do not check whether table is foreign or not while creating empty
shards, and set storage type to 't'(Standard table) or 'c'(Columnar table). Now
if the table is foreign table(but not CStore foreign table) we set storage
type to 'f'(Foreign table). If it is CStore foreign table, we set its storage
type to 'c', i.e. columnar table have priority over foreign table.
Please note that 'c' is only used for CStore tables not for other possible
columnar stores at the moment. Possible improvement could be checking for other
columnar stores, though I am not sure if there is a way to check it for all
other columnar stores.
Since we now short-circuit on certain remote errors, we want to ensure
we preserve the old behavior of not modifying any placement states if
a non-short-circuiting error occurs on all placements.
There's not a ton of documentation about what CONTEXT lines should look
like, but this seems like the most dominant pattern. Similarly, users
should expect lowercase, non-period strings.
Fixes#271
This change sets ShardIds and JobIds for each test case. Before this change,
when a new test that somehow increments Job or Shard IDs is added, then
the tests after the new test should be updated.
ShardID and JobID sequences are set at the beginning of each file with the
following commands:
```
ALTER SEQUENCE pg_catalog.pg_dist_shardid_seq RESTART 290000;
ALTER SEQUENCE pg_catalog.pg_dist_jobid_seq RESTART 290000;
```
ShardIds and JobIds are multiples of 10000. Exceptions are:
- multi_large_shardid: shardid and jobid sequences are set to much larger values
- multi_fdw_large_shardid: same as above
- multi_join_pruning: Causes a race condition with multi_hash_pruning since
they are run in parallel.
Fixes#302
Since our previous syntax did not allow creating hash partitioned tables,
some of the previous tests manually changed partition method to hash to
be able to test it. With this change we remove unnecessary workaround and
create hash distributed tables instead. Also in some tests metadata was
created manually. With this change we also fixed this issue.