With this PR, Citus starts to support all possible ways to create
distributed partitioned tables. These are;
- Distributing already created partitioning hierarchy
- CREATE TABLE ... PARTITION OF a distributed_table
- ALTER TABLE distributed_table ATTACH PARTITION non_distributed_table
- ALTER TABLE distributed_table ATTACH PARTITION distributed_table
We also support DETACHing partitions from partitioned tables and propogating
TRUNCATE and DDL commands to distributed partitioned tables.
This PR also refactors some parts of distributed table creation logic.
- master_activate_node and master_disable_node correctly toggle
isActive, without crashing
- master_add_node rejects duplicate nodes, even if they're in different
clusters
- master_remove_node allows removing nodes in different clusters
This change removes distributed tables' dependency on distribution key columns. We already check that we cannot drop distribution key columns in ErrorIfUnsupportedAlterTableStmt() at multi_utility.c, so we don't need to have distributed table to distribution key column dependency to avoid dropping of distribution key column.
Furthermore, having this dependency causes some warnings in pg_dump --schema-only (See #866), which are not desirable.
This change also adds check to disallow drop of distribution keys when citus.enable_ddl_propagation is set to false. Regression tests are updated accordingly.
We try to run our isolation tests paralles as much as possible. In
some of those isolation tests we used same table name which causes
problem while running them in paralles. This commit changes table
names in those tests to ensure tests can run in parallel.
This commit is preperation for introducing distributed partitioned
table support. We want to clean and refactor some code in distributed
table creation logic so that we can handle partitioned tables in more
robust way.
In this commit, we add ability to convert global wait edges
into adjacency list with the following format:
[transactionId] = [transactionNode->waitsFor {list of waiting transaction nodes}]
This change adds a general purpose infrastructure to log and monitor
process about long running progresses. It uses
`pg_stat_get_progress_info` infrastructure, introduced with PostgreSQL
9.6 and used for tracking `VACUUM` commands.
This patch only handles the creation of a memory space in dynamic shared
memory, putting its info in `pg_stat_get_progress_info`, fetching the
progress monitors on demand and finalizing the progress tracking.
- Never release locks
- AddNodeMetadata takes ShareRowExclusiveLock so it'll conflict with the
trigger which prevents multiple primary nodes.
- ActivateNode and SetNodeState used to take AccessShareLock, but they
modify the table so they should take RowExclusiveLock.
- DeleteNodeRow and InsertNodeRow used to take AccessExclusiveLock but
only need RowExclusiveLock.
- master_add_node enforces that there is only one primary per group
- there's also a trigger on pg_dist_node to prevent multiple primaries
per group
- functions in metadata cache only return primary nodes
- Rename ActiveWorkerNodeList -> ActivePrimaryNodeList
- Rename WorkerGetLive{Node->Group}Count()
- Refactor WorkerGetRandomCandidateNode
- master_remove_node only complains about active shard placements if the
node being removed is a primary.
- master_remove_node only deletes all reference table placements in the
group if the node being removed is the primary.
- Rename {Node->NodeGroup}HasShardPlacements, this reflects the behavior it
already had.
- Rename DeleteAllReferenceTablePlacementsFrom{Node->NodeGroup}. This also
reflects the behavior it already had, but the new signature forces the
caller to pass in a groupId
- Rename {WorkerGetLiveGroup->ActivePrimaryNode}Count
This commit adds distributed transaction id infrastructure in
the scope of distributed deadlock detection.
In general, the distributed transaction id consists of a tuple
in the form of: `(databaseId, initiatorNodeIdentifier, transactionId,
timestamp)`.
Briefly, we add a shared memory block on each node, which holds some
information per backend (i.e., an array `BackendData backends[MaxBackends]`).
Later, on each coordinated transaction, Citus sends
`SELECT assign_distributed_transaction_id()` right after `BEGIN`.
For that backend on the worker, the distributed transaction id is set to
the values assigned via the function call.
The aim of the above is to correlate the transactions on the coordinator
to the transactions on the worker nodes.
Comes with a few changes:
- Change the signature of some functions to accept groupid
- InsertShardPlacementRow
- DeleteShardPlacementRow
- UpdateShardPlacementState
- NodeHasActiveShardPlacements returns true if the group the node is a
part of has any active shard placements
- TupleToShardPlacement now returns ShardPlacements which have NULL
nodeName and nodePort.
- Populate (nodeName, nodePort) when creating ShardPlacements
- Disallow removing a node if it contains any shard placements
- DeleteAllReferenceTablePlacementsFromNode matches based on group. This
doesn't change behavior for now (while there is only one node per
group), but means in the future callers should be careful about
calling it on a secondary node, it'll delete placements on the primary.
- Create concept of a GroupShardPlacement, which represents an actual
tuple in pg_dist_placement and is distinct from a ShardPlacement,
which has been resolved to a specific node. In the future
ShardPlacement should be renamed to NodeShardPlacement.
- Create some triggers which allow existing code to continue to insert
into and update pg_dist_shard_placement as if it still existed.
These functions are holdovers from pg_shard and were created for unit
testing c-level functions (like InsertShardPlacementRow) which our
regression tests already test quite effectively. Removing because it
makes refactoring the signatures of those c-level functions
unnecessarily difficult.
- create_healthy_local_shard_placement_row
- update_shard_placement_row_state
- delete_shard_placement_row
This commit is intended to be a base for supporting declarative partitioning
on distributed tables. Here we add the following utility functions and their
unit tests:
* Very basic functions including differnentiating partitioned tables and
partitions, listing the partitions
* Generating the PARTITION BY (expr) and adding this to the DDL events
of partitioned tables
* Ability to generate text representations of the ranges for partitions
* Ability to generate the `ALTER TABLE parent_table ATTACH PARTITION
partition_table FOR VALUES value_range`
* Ability to apply add shard ids to the above command using
`worker_apply_inter_shard_ddl_command()`
* Ability to generate `ALTER TABLE parent_table DETACH PARTITION`
Adds support for PostgreSQL 10 by copying in the requisite ruleutils
and updating all API usages to conform with changes in PostgreSQL 10.
Most changes are fairly minor but they are numerous. One particular
obstacle was the change in \d behavior in PostgreSQL 10's psql; I had
to add SQL implementations (views, mostly) to mimic the pre-10 output.
Add a second implementation of INSERT INTO distributed_table SELECT ... that is used if
the query cannot be pushed down. The basic idea is to execute the SELECT query separately
and pass the results into the distributed table using a CopyDestReceiver, which is also
used for COPY and create_distributed_table. When planning the SELECT, we go through
planner hooks again, which means the SELECT can also be a distributed query.
EXPLAIN is supported, but EXPLAIN ANALYZE is not because preventing double execution was
a lot more complicated in this case.
- Use native postgres function for composite key btree functions
- Move explain tests to multi_explain.sql (get rid of .out _0.out files)
- Get rid of input/output files for multi_subquery.sql by moving table creations
- Update some comments
With this commit, we start to use custom compiled PostgreSQL builds in
Travis for merge commits. This allows us to run isolation tests and
PostgreSQL's own regression tests along with our regression tests in
Travis.
Since manually compiling PostgreSQL takes more time and we also add new
tests, we only enable running these tests on merge commits.
* Accept invalidation messages before accessing the metadata cache
This commit is crucial to prevent stale metadata reads from the
cache. Without this commit, some of the operations may use stale
metadata which could end up with various bugs such as crashes,
inconsistent/lost data etc.
As an example, consider that a COPY operation is blocked on shard
metadata lock. Another concurrent session updates the metadata and
invalidates the cache. However, since Citus doesn't accept invalidations,
COPY continues with the stale metadata once it acquires the lock.
With this commit, we make sure that invalidation messages are accepted
just before accessing the metadata cache and preventing any operation to
use stale metadata.
* Add isolation tests for placement changes and conccurrent operations
- add node with reference table vs COPY/insert/update/DDL
- repair shard vs COPY/insert/update/DDL
- repair shard vs repair shard
Distributed query planning for subquery pushdown is done on the original
query. This prevents the usage of external parameters on the execution.
To overcome this, we manually replace the parameters on the original
query.
* Support for subqueries in WHERE clause
This commit enables subqueries in WHERE clause to be pushed down
by the subquery pushdown logic.
The support covers:
- Correlated subqueries with IN, NOT IN, EXISTS, NOT EXISTS,
operator expressions such as (>, <, =, ALL, ANY etc.)
- Non-correlated subqueries with (partition_key) IN (SELECT partition_key ..)
(partition_key) =ANY (SELECT partition_key ...)
Note that this commit heavily utilizes the attribute equivalence logic introduced
in the 1cb6a34ba8. In general, this commit mostly
adjusts the logical planner not to error out on the subqueries in WHERE clause.
* Improve error checks for subquery pushdown and INSERT ... SELECT
Since we allow subqueries in WHERE clause with the previous commit,
we should apply the same limitations to those subqueries.
With this commit, we do not iterate on each subquery one by one.
Instead, we extract all the subqueries and apply the checks directly
on those subqueries. The aim of this change is to (i) Simplify the
code (ii) Make it close to the checks on INSERT .. SELECT code base.
* Extend checks for unresolved paramaters to include SubLinks
With the presence of subqueries in where clause (i.e., SubPlans on the
query) the existing way for checking unresolved parameters fail. The
reason is that the parameters for SubPlans are kept on the parent plan not
on the query itself (see primnodes.h for the details).
With this commit, instead of checking SubPlans on the modified plans
we start to use originalQuery, where SubLinks represent the subqueries
in where clause. The unresolved parameters can be found on the SubLinks.
* Apply code-review feedback
* Remove unnecessary copying of shard interval list
This commit removes unnecessary copying of shard interval list. Note
that there are no copyObject function implemented for shard intervals.
- There was a crash when the table a shardid belonged to changed during
a session. Instead of crashing (a failed assert) we now throw an error
- Update the isolation test which was crashing to no longer exercise
that code path
- Add a regression test to check that the error is thrown
* Enabling physical planner for subquery pushdown changes
This commit applies the logic that exists in INSERT .. SELECT
planning to the subquery pushdown changes.
The main algorithm is followed as :
- pick an anchor relation (i.e., target relation)
- per each target shard interval
- add the target shard interval's shard range
as a restriction to the relations (if all relations
joined on the partition keys)
- Check whether the query is router plannable per
target shard interval.
- If router plannable, create a task
* Add union support within the JOINS
This commit adds support for UNION/UNION ALL subqueries that are
in the following form:
.... (Q1 UNION Q2 UNION ...) as union_query JOIN (QN) ...
In other words, we currently do NOT support the queries that are
in the following form where union query is not JOINed with
other relations/subqueries :
.... (Q1 UNION Q2 UNION ...) as union_query ....
* Subquery pushdown planner uses original query
With this commit, we change the input to the logical planner for
subquery pushdown. Before this commit, the planner was relying
on the query tree that is transformed by the postgresql planner.
After this commit, the planner uses the original query. The main
motivation behind this change is the simplify deparsing of
subqueries.
* Enable top level subquery join queries
This work enables
- Top level subquery joins
- Joins between subqueries and relations
- Joins involving more than 2 range table entries
A new regression test file is added to reflect enabled test cases
* Add top level union support
This commit adds support for UNION/UNION ALL subqueries that are
in the following form:
.... (Q1 UNION Q2 UNION ...) as union_query ....
In other words, Citus supports allow top level
unions being wrapped into aggregations queries
and/or simple projection queries that only selects
some fields from the lower level queries.
* Disallow subqueries without a relation in the range table list for subquery pushdown
This commit disallows subqueries without relation in the range table
list. This commit is only applied for subquery pushdown. In other words,
we do not add this limitation for single table re-partition subqueries.
The reasoning behind this limitation is that if we allow pushing down
such queries, the result would include (shardCount * expectedResults)
where in a non distributed world the result would be (expectedResult)
only.
* Disallow subqueries without a relation in the range table list for INSERT .. SELECT
This commit disallows subqueries without relation in the range table
list. This commit is only applied for INSERT.. SELECT queries.
The reasoning behind this limitation is that if we allow pushing down
such queries, the result would include (shardCount * expectedResults)
where in a non distributed world the result would be (expectedResult)
only.
* Change behaviour of subquery pushdown flag (#1315)
This commit changes the behaviour of the citus.subquery_pushdown flag.
Before this commit, the flag is used to enable subquery pushdown logic. But,
with this commit, that behaviour is enabled by default. In other words, the
flag is now useless. We prefer to keep the flag since we don't want to break
the backward compatibility. Also, we may consider using that flag for other
purposes in the next commits.
* Require subquery_pushdown when limit is used in subquery
Using limit in subqueries may cause returning incorrect
results. Therefore we allow limits in subqueries only
if user explicitly set subquery_pushdown flag.
* Evaluate expressions on the LIMIT clause (#1333)
Subquery pushdown uses orignal query, the LIMIT and OFFSET clauses
are not evaluated. However, logical optimizer expects these expressions
are already evaluated by the standard planner. This commit manually
evaluates the functions on the logical planner for subquery pushdown.
* Better format subquery regression tests (#1340)
* Style fix for subquery pushdown regression tests
With this commit we intented a more consistent style for the
regression tests we've added in the
- multi_subquery_union.sql
- multi_subquery_complex_queries.sql
- multi_subquery_behavioral_analytics.sql
* Enable the tests that are temporarily commented
This commit enables some of the regression tests that were commented
out until all the development is done.
* Fix merge conflicts (#1347)
- Update regression tests to meet the changes in the regression
test output.
- Replace Ifs with Asserts given that the check is already done
- Update shard pruning outputs
* Add view regression tests for increased subquery coverage (#1348)
- joins between views and tables
- joins between views
- union/union all queries involving views
- views with limit
- explain queries with view
* Improve btree operators for the subquery tests
This commit adds the missing comprasion for subquery composite key
btree comparator.
So far citus used postgres' predicate proofing logic for shard
pruning, except for INSERT and COPY which were already optimized for
speed. That turns out to be too slow:
* Shard pruning for SELECTs is currently O(#shards), because
PruneShardList calls predicate_refuted_by() for every
shard. Obviously using an O(N) type algorithm for general pruning
isn't good.
* predicate_refuted_by() is quite expensive on its own right. That's
primarily because it's optimized for doing a single refutation
proof, rather than performing the same proof over and over.
* predicate_refuted_by() does not keep persistent state (see 2.) for
function calls, which means that a lot of syscache lookups will be
performed. That's particularly bad if the partitioning key is a
composite key, because without a persistent FunctionCallInfo
record_cmp() has to repeatedly look-up the type definition of the
composite key. That's quite expensive.
Thus replace this with custom-code that works in two phases:
1) Search restrictions for constraints that can be pruned upon
2) Use those restrictions to search for matching shards in the most
efficient manner available:
a) Binary search / Hash Lookup in case of hash partitioned tables
b) Binary search for equal clauses in case of range or append
tables without overlapping shards.
c) Binary search for inequality clauses, searching for both lower
and upper boundaries, again in case of range or append
tables without overlapping shards.
d) exhaustive search testing each ShardInterval
My measurements suggest that we are considerably, often orders of
magnitude, faster than the previous solution, even if we have to fall
back to exhaustive pruning.
This determines whether it's possible to perform binary search on
sortedShardIntervalArray or not. If e.g. two shards have overlapping
ranges, that'd be prohibitive.
That'll be useful in later commit introducing faster shard pruning.
That's useful when comparing values a hash-partitioned table is
filtered by. The existing shardIntervalCompareFunction is about
comparing hashed values, not unhashed ones.
The added btree opclass function is so we can get a comparator
back. This should be changed much more widely, but is not necessary so
far.