The following scenario resulted in a distributed deadlock before this commit:
CREATE TABLE partitioning_test(id int, time date) PARTITION BY RANGE (time);
CREATE TABLE partitioning_test_2009 (LIKE partitioning_test);
CREATE TABLE partitioning_test_reference(id int PRIMARY KEY, subid int);
SELECT create_distributed_table('partitioning_test_2009', 'id'),
create_distributed_table('partitioning_test', 'id'),
create_reference_table('partitioning_test_reference');
ALTER TABLE partitioning_test ADD CONSTRAINT partitioning_reference_fkey FOREIGN KEY (id) REFERENCES partitioning_test_reference(id) ON DELETE CASCADE;
ALTER TABLE partitioning_test_2009 ADD CONSTRAINT partitioning_reference_fkey_2009 FOREIGN KEY (id) REFERENCES partitioning_test_reference(id) ON DELETE CASCADE;
ALTER TABLE partitioning_test ATTACH PARTITION partitioning_test_2009 FOR VALUES FROM ('2009-01-01') TO ('2010-01-01');
Since query flattening may fold an outer join's columns into a COALESCE
expression for the columns in the USING clause, which was not expected
before this commit, these queries were erroring out. This commit fixes
the issue by handling COALESCE expressions as well.
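As a rough sketch (the table and column names here are hypothetical),
one shape of query that previously errored out is a FULL JOIN whose
USING column gets flattened into COALESCE(a.id, b.id):
-- hypothetical tables; the merged USING column becomes a COALESCE expr
SELECT id, count(*)
FROM dist_table_a FULL JOIN dist_table_b USING (id)
GROUP BY id;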
Before this commit, the round-robin task assignment policy relied on
the taskId. Thus, even inside a transaction, tasks were assigned to
different nodes. This was especially problematic while reading from
reference tables within transaction blocks, because we had to expand
the distributed transaction to many nodes that were not necessarily
already part of the distributed transaction.
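For illustration (ref_table is a hypothetical reference table), the
problematic pattern was a transaction block such as:
BEGIN;
-- before this commit, each SELECT could be assigned to a different
-- node, pulling extra nodes into the distributed transaction
SELECT count(*) FROM ref_table;
SELECT count(*) FROM ref_table;
COMMIT;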
In this context, we define "Fast Path Planning for SELECT" as trivial
queries where Citus can skip relying on the standard_planner() and
handle all the planning.
For the router planner, standard_planner() is mostly important for
generating the necessary restriction information. Later, the
restriction information generated by standard_planner() is used to
decide whether all the shards that a distributed query touches reside
on a single worker node. However, standard_planner() does a lot of
extra work, such as cost estimation and execution path generation,
which is completely unnecessary in the context of distributed planning.
There are certain types of queries where Citus could skip relying on
standard_planner() to generate the restriction information. For queries
in the following format, Citus does not need any information that the
standard_planner() generates:
SELECT ... FROM single_table WHERE distribution_key = X; or
DELETE FROM single_table WHERE distribution_key = X; or
UPDATE single_table SET value_1 = value_2 + 1 WHERE distribution_key = X;
Note that the queries might not be as simple as the above: GROUP BY,
window functions, ORDER BY, HAVING, etc. are all acceptable. The only
rule is that the query is on a single distributed (or reference) table
and there is a "distribution_key = X;" filter in the WHERE clause. With
that alone, we can decide which shard a distributed query touches and,
therefore, on which worker node it resides.
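As a sketch reusing the names above (single_table, distribution_key,
value_1 are illustrative), a query like the following still qualifies
for fast path planning:
-- single distributed table, equality filter on the distribution key;
-- GROUP BY / HAVING / ORDER BY do not disqualify the query
SELECT value_1, count(*)
FROM single_table
WHERE distribution_key = 10
GROUP BY value_1
HAVING count(*) > 1
ORDER BY value_1;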
Failure & cancellation tests for the initial start_metadata_sync()
calls to the workers and for DDL queries that send metadata syncing
messages to an MX node.
Also adds message type definitions for the messages that are exchanged
during metadata syncing.
We used to error out if a reference table in the query participated
in a union. This caused pushdownable queries to be evaluated on the
coordinator. Now we allow reference tables inside union queries as
long as there is a distributed table in the FROM clause.
The existing join checks (reference table on the outer part) are
sufficient, so we do not need to check the join relations of
reference tables.
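A hedged example of a now-allowed shape (dist_table and ref_table are
hypothetical names):
-- each union branch has a distributed table in its FROM clause,
-- so the reference table join no longer forces coordinator execution
SELECT d.id FROM dist_table d JOIN ref_table r ON (d.id = r.id)
UNION
SELECT d.id FROM dist_table d;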
Previously we allowed the task assignment policy to have an effect on
router queries with only intermediate results. However, that is
erroneous since the code path that assigns placements relies on
shardIds and placements, which do not exist for intermediate results.
With this commit, we do not apply task assignment policies when a router query
hits only intermediate results.
We disable a bunch of planning options on the workers. This might be
risky if any concurrent test relies on EXPLAIN output as well. Still,
we want to keep this test, so we should try not to parallelize it
with such tests.
Before this commit, Citus supported INSERT...SELECT queries with
ON CONFLICT or RETURNING clauses only for pushdownable ones, since
queries executed via the coordinator were using PG's COPY
infrastructure to send the selected tuples to the target worker nodes.
After this PR, INSERT...SELECT queries with ON CONFLICT or RETURNING
clauses are performed in two phases via the coordinator. In the first
phase, the selected tuples are saved to an intermediate table that is
colocated with the target table of the INSERT...SELECT query. Note
that a utility function to save results to the colocated intermediate
result is also implemented as part of this commit. In the second
phase, the INSERT...SELECT query is run directly on the worker nodes
using the intermediate table as the source table.
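For illustration only (target_table and source_table are hypothetical),
the kind of query that now goes through the two-phase coordinator path:
-- the SELECT result is first written to a colocated intermediate
-- result, then the INSERT runs on the workers from that result
INSERT INTO target_table (id, value)
SELECT id, max(value) FROM source_table GROUP BY id
ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value;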
Description: Support round-robin `task_assignment_policy` for queries to reference tables.
This PR allows users to query multiple placements of shards in a round robin fashion. When `citus.task_assignment_policy` is set to `'round-robin'` the planner will use a round robin scheduling feature when multiple shard placements are available.
The primary use-case is spreading the load of reference table queries to all the nodes in the cluster instead of hammering only the first placement of the reference table. Since reference tables share the same path for selecting the shards with single shard queries that have multiple placements (`citus.shard_replication_factor > 1`) this setting also allows users to spread the query load on these shards.
For modifying queries we do not apply a round-robin strategy, as it would be negated by an extra reordering step in the executor for such queries, where a `first-replica` strategy is enforced.
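A minimal usage sketch (ref_table is a hypothetical reference table;
the GUC name is the one introduced above):
SET citus.task_assignment_policy TO 'round-robin';
-- consecutive executions are routed to different placements of the
-- reference table instead of always hitting the first placement
SELECT count(*) FROM ref_table;
SELECT count(*) FROM ref_table;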
In recent postgres builds you cannot set client_min_messages to
values higher than ERROR; it will silently be set to ERROR if you try.
During some tests we would set it to FATAL to hide random values
(e.g. PIDs of processes) from the test output. This patch uses
different tactics for hiding these values.
With fast ALTER TABLE ADD COLUMN with a non-NULL default in PG11, physical heap tuples might not contain all attributes after an ALTER TABLE ADD COLUMN happens. heap_getattr() returns NULL when the physical tuple doesn't contain an attribute, so we should use heap_deform_tuple() in these cases, which fills in the missing attributes.
Our catalog tables evolve over time, and an upgrade might involve some ALTER TABLE ADD COLUMN commands.
Note that we don't need to worry about postgres catalog tables and we can use heap_getattr() for them, because they only change between major versions.
This also fixes #2453.
I'm pretty sure a lot of this test functionality may be covered in some
of our existing regression tests, but I've included them to ensure we
put all failure-based tests under our new testing method for that kind
of test.
Didn't include lower replication factor, as (for a single-shard mod.),
it's indistinguishable from modifying a reference table. So these all
test modifications which hit a single, replicated shard.
Fairly straightforward; verified that modifications fail atomically if
a worker is down or fails mid-transaction (i.e. all workers need to ack
modifications to reference tables in order to persist changes).
Including several examples from #1926. I couldn't understand why the
recover_prepared_transactions "should be an error", and EXPLAIN has
changed since the original bug (so that it runs EXPLAINs in txns, I
think for EXPLAIN ANALYZE to not have side effects); other than that,
most of the reported bugs now error out rather than crash or return
an empty result set.
VACUUM runs outside of a transaction, so the failure modes for it are
somewhat straightforward, though ANALYZE runs in a 1pc transaction and
multi-table VACUUM can fail between statements (PG 11 and higher).
Tests various failure points during a multi-shard modification within
a transaction with multiple statements. Verifies three cases:
* Reference tables (single shard, many placements)
* Normal table with replication factor two
* Multi-shard table with no replication
In the replication-factor case, we expect shard health to be affected
in some transactions; most others fail the transaction entirely, and
all we need to verify is that no effects of the transaction are visible.
Had trouble testing the final PREPARE/COMMIT/ROLLBACK phase of the 2pc,
in particular because the error message produced includes the PID of
the backend, which is unpredictable.
The DROP SCHEMA command fails in MX mode if there
is a partitioned table with active partitions.
This is due to the fact that the sql_drop trigger receives
all the dropped objects, including partitions. When
we call DROP TABLE on the partition parent, it also drops
the partitions on the MX node. This causes the DROP
TABLE command on the partitions to fail on the MX node
because they were already dropped when the partition
parent was dropped.
With this work we no longer require the table to exist in
worker_drop_distributed_table.
PG now allows foreign keys on partitioned tables.
Each foreign key constraint on a partitioned table
is propagated down to its partitions.
We used to create all constraints on shards when creating a new
shard, or when simply moving a shard from one worker to another.
We also used the same logic when creating a copy of a coordinator
table on an MX node.
With this change we create a constraint on the worker node only if
it is not an inherited constraint.
We used to set the execution mode in the truncate trigger. However,
when multiple tables are truncated with a single command, we could
end up setting the execution mode very late. Instead, we now set the
execution mode in the utility hook.
By setting the CPU tuple cost so high, we were triggering JIT. Instead,
we should use parallel_tuple_cost.
See: rhaas.blogspot.com/2018/06/using-forceparallelmode-correctly.html
With this commit, we allow partitioned distributed tables with
replication factor > 1. However, we also impose many restrictions.
In summary, we disallow all kinds of modifications (including DDL)
on the partition tables. Instead, the user is allowed to run the
modifications over the parent table (see the sketch after this list).
The necessity for such a restriction has two aspects:
- We need to acquire shard resource locks appropriately
- We need to handle marking partitions INVALID in case
  of any failures. Note that, in theory, the parent table
  should also become INVALID, which is too aggressive.
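As a sketch of the resulting behavior (events is a hypothetical
partitioned distributed table with partition events_2018):
-- allowed: run the modification over the parent table
DELETE FROM events WHERE event_id = 1;
-- disallowed while replication factor > 1: modifying a partition directly
DELETE FROM events_2018 WHERE event_id = 1;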
The reason for the failure is that PG11 introduced a new relation
kind, RELKIND_PARTITIONED_INDEX, to be used for partitioned indexes.
We expanded our check to cover that case.
This commit uses *_walker instead of *_mutator for performance reasons.
Given that we're only updating a functionId in the tree, the approach
seems fine.
PG11 introduced the PROCEDURE concept, similar to FUNCTION.
Procedures allow commit/rollback behavior inside their body.
This commit adds regression tests for procedure calls.
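A minimal sketch of the kind of call such tests exercise (the
procedure name, body, and test_table are hypothetical):
CREATE PROCEDURE insert_data(val integer)
LANGUAGE plpgsql
AS $$
BEGIN
    -- test_table is a hypothetical target table
    INSERT INTO test_table VALUES (val);
    COMMIT;  -- PG11 procedures may commit/rollback inside the body
END;
$$;
CALL insert_data(1);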
With this commit, we implement two views that are very similar
to pg_stat_activity, but showing queries that are involved in
distributed queries:
- citus_dist_stat_activity: Shows all the distributed queries
- citus_worker_stat_activity: Shows all the queries on the shards
that are initiated by distributed queries.
Both views have the same output columns. In very basic terms, both
views are meant to provide useful insights into the distributed
transactions within the cluster. As the names reveal, both views are similar to pg_stat_activity.
Also note that these views can be pretty useful on Citus MX clusters.
Note that when the views are queried from the worker nodes, they'd not show the distributed
transactions that are initiated from the coordinator node. The reason is that the worker
nodes do not know the host/port of the coordinator. Thus, it is advisable to query the
views from the coordinator.
If we bucket the columns that the views return, we'd end up with the following:
- Hostnames and ports:
- query_hostname, query_hostport: The node on which the query is running
- master_query_host_name, master_query_host_port: The node in the cluster
that initiated the query.
Note that for the citus_dist_stat_activity view, query_hostname-query_hostport
is always the same as master_query_host_name-master_query_host_port. The
distinction is mostly relevant for citus_worker_stat_activity. For example,
on Citus MX, a user starts a transaction on Node-A, which starts worker
transactions on Node-B and Node-C. In that case, the query hostnames would be
Node-B and Node-C, whereas the master_query_host_name would be Node-A.
- Distributed transaction related things:
This is mostly the process_id, distributed transactionId and distributed transaction
number.
- pg_stat_activity columns:
These two views get all the columns from pg_stat_activity. We're basically joining
pg_stat_activity with get_all_active_transactions on process_id.
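For example, a plausible way to inspect worker-level activity (the
host/port columns are as listed above; query comes from the joined
pg_stat_activity columns):
-- queries running on shards, together with the node that initiated them
SELECT query_hostname, query_hostport,
       master_query_host_name, master_query_host_port, query
FROM citus_worker_stat_activity;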
This test's output changes depending on which worker is
picked for explain (e.g., worker port in the output changes).
Given that the test is only aiming to ensure that CTEs inside
CTEs work fine in DML queries, it should be fine to get rid of
the EXPLAIN. The output is verified to be correct as well.
This commit fixes a bug where a concurrent DROP TABLE deadlocks
with SELECT (or DML) when the SELECT is executed from the workers.
The problem was that Citus used to remove the metadata before
dropping the table on the workers. That created a time window
where the SELECT starts running on some of the nodes while the
DROP TABLE runs on some of the other nodes.
This commit enables support for TRUNCATE on both
distributed tables and reference tables.
The basic idea is to acquire a lock on the relation by sending
the TRUNCATE command to all metadata worker nodes. We only
skip sending the TRUNCATE command to the node that actually
executes the command, to prevent a self-distributed deadlock.
Make sure that the commands the coordinator sends are executed with a
search_path that is synchronised with the coordinator's search_path.
This is only important when Citus sends commands that are directly
relayed to the worker nodes. For example, deparsed DDL commands or
queries always add schema qualifications to the queries, so they do
not require this change.
This commit enables hiding shard names on MX workers by default,
by simply replacing `pg_table_is_visible()` calls with
`citus_table_is_visible()` calls on the MX worker nodes. The latter
function filters out tables that are known to be shards.
The main motivation of this change is a better UX. The functionality
can be disabled via a GUC.
We also added two views, namely citus_shards_on_worker and
citus_shard_indexes_on_worker such that users can query
them to see the shards and their corresponding indexes.
We also added debug messages such that the filtered tables can
be interactively seen by setting the level to DEBUG1.
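A quick usage sketch on an MX worker node:
-- shards are hidden from pg_table_is_visible()-based tooling,
-- but can still be listed explicitly through the new views
SELECT * FROM citus_shards_on_worker;
SELECT * FROM citus_shard_indexes_on_worker;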
- mitmdump now listens on port 9060
- Add some logging to fluent.py, making issues like this easier to debug in the future
- Fail the tests if something is already running on the port mitmProxy tries to use
- check-failure now works with VPATH builds
This commit adds extensive failure testing, which covers quite
a few things and their combinations:
- 1PC vs 2PC
- Replication factor 1 and Replication factor 2
- Network failures and query cancellations
- Sequential vs Parallel query execution mode
- Lots of detail is in src/test/regress/mitmscripts/README
- Creates a new target, make check-failure, which runs the tests
- Tells Travis how to install everything and run the tests
When a hash distributed table has a foreign key to a reference
table, there are a few restrictions we have to apply in order to
prevent distributed deadlocks or reading wrong results.
The necessity to apply the restrictions arises from the cascading
nature of foreign keys. When a foreign key on a reference table
cascades to a distributed table, a single operation over a single
connection can acquire locks on multiple shards of the distributed
table. Thus, any parallel operation on that distributed table in the
same transaction should not open parallel connections to the shards.
Otherwise, we'd either end up with a self-distributed deadlock or
read wrong results.
As briefly described above, the restrictions that we apply are
implemented by tracking the distributed/reference relation accesses
inside transaction blocks and acting accordingly when necessary.
The two main rules are as follows:
- Whenever a parallel distributed relation access conflicts
with a consecutive reference relation access, Citus errors
out
- Whenever a reference relation access is followed by a
conflicting parallel relation access, the execution mode
is switched to sequential mode.
There are also some other notes to mention:
- If the user does SET LOCAL citus.multi_shard_modify_mode
  TO 'sequential';, all the queries should simply work, using
  one connection per worker and executing the commands
  sequentially (see the sketch after this list). That's obviously
  a slower approach than Citus' usual parallel execution, but we
  at least have a way to run all commands successfully.
- If an unrelated parallel query has already executed on any
  distributed table, we cannot switch to sequential mode, because
  the essence of sequential mode is using one connection per worker.
  In the presence of parallel connections, the connection manager
  picks those connections to execute the commands. That contradicts
  our purpose, so we error out.
- COPY to a distributed table cannot be executed in sequential mode.
  Thus, if we switch to sequential mode and a COPY is executed, the
  operation fails; there is currently no way of implementing that.
  Note that, when the local table is not empty and
  create_distributed_table is used, Citus uses COPY internally.
  Thus, in those cases, create_distributed_table() will also fail.
- There is a GUC called citus.enforce_foreign_key_restrictions
  to disable all the checks. We added that GUC since the restrictions
  we apply are sometimes a bit more restrictive than necessary.
  The user might want to relax those. Similarly, if you don't have
  CASCADEing reference tables, you might consider disabling all the
  checks.
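As referenced above, a minimal sketch of forcing sequential mode
(dist_table and ref_table are hypothetical; the GUC names are the
ones mentioned in this commit):
BEGIN;
SET LOCAL citus.multi_shard_modify_mode TO 'sequential';
-- runs with one connection per worker, so a later reference table
-- access that cascades to dist_table does not conflict
UPDATE dist_table SET value = value + 1;
DELETE FROM ref_table WHERE id = 1;
COMMIT;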
-[x] drop constraint
-[x] drop column
-[x] alter column type
-[x] truncate
are sequentialized if there is a foreign key constraint from
a distributed table to a reference table on the relations affected
by the above commands.
Make sure that intermediate results use a connection that is
not associated with any placement. That is useful in two ways:
- More complex queries can be executed with CTEs
- Safely use the same connections when there is a foreign key
  from a distributed table to a reference table, which requires
  using the same connection for modifications since the reference
  table might cascade to the distributed table.
This table will be used by Citus Enterprise to populate authentication-
related fields in outbound connections; Citus Community lacks support
for this functionality.
We're relying on the multi_shard_modify_mode GUC for real-time
SELECTs. The name of the GUC is unfortunate, but adding one more GUC
(or renaming this one) would make the UX even worse. Given that this
mode is mostly important for transaction blocks that involve
modification/DDL queries along with real-time SELECTs, we can live
with the confusion.
After this commit, DDL commands honour `citus.multi_shard_modify_mode`.
We preferred using the code path that executes single-task router
queries (e.g., ExecuteSingleModifyTask()) in order not to invent
a new executor that is only applicable to DDL commands requiring
sequential execution.
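A hedged usage example (dist_table is a hypothetical distributed table):
SET citus.multi_shard_modify_mode TO 'sequential';
-- the DDL below is now sent to the shards one by one over a single
-- connection per worker instead of in parallel
ALTER TABLE dist_table ADD COLUMN note text;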