PG 11 has changed the way that PARAM_EXTERN is processed.
This commit ensures that Citus follows the same pattern.
For details see the related Postgres commit:
6719b238e8
Assign the distributed transaction id before trying to acquire the
executor advisory locks. This is useful to show this backend in citus
lock graphs (e.g., dump_global_wait_edges() and citus_lock_waits).
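To see the effect, the lock graph views named above can be inspected while a
backend is blocked; a minimal sketch (output columns elided):

```sql
-- run on the coordinator; shows blocked and blocking backends across the cluster
SELECT * FROM citus_lock_waits;
-- lower-level wait edges used to build the lock graph
SELECT * FROM dump_global_wait_edges();
```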
I'm pretty sure a lot of this test functionality may already be covered in some
of our existing regression tests, but I've included these to ensure we
put all failure-based tests under our new testing method for that kind
of test.
Didn't include a lower replication factor, as (for a single-shard modification)
it's indistinguishable from modifying a reference table. So these all
test modifications which hit a single, replicated shard.
We made PG11 builds optional when we had an issue
with an MX isolation test that we could not solve back then.
This commit solves the issue with a workaround by running
start_metadata_sync_to_node outside the transaction block.
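A minimal sketch of the workaround (the worker host and port are placeholders);
the important part is that the call is a standalone statement, not wrapped in
BEGIN/COMMIT:

```sql
SELECT start_metadata_sync_to_node('worker-1', 5432);
```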
Both of these are a bit of a shot in the dark. In one case, we noticed
a stack trace where a caller received a null pointer and attempted to
dereference the memory context field (at 0x010). In the other, I saw
that any error thrown from within AdjustParseTree could keep the stack
from being cleaned up (presumably if we push we should always pop).
Both stack traces were collected during times of high memory pressure
and reproducing the problem locally or otherwise has been very
tricky (i.e. it hasn't been reproduced reliably at all).
* Keep track of cached entries in case of interruption.
Previously we set DistTableCacheEntry->sortedShardIntervalArray
and DistTableCacheEntry->shardIntervalArrayLength after we entered
all related shard entries into DistShardCacheHash. The drawback was
that if populating DistShardCacheHash was interrupted,
ResetDistTableCacheEntry() didn't see the shard hash entries that had been
created, so it was unable to clean them up.
This patch fixes that by setting sortedShardIntervalArray earlier,
and incrementing shardIntervalArrayLength as we enter shards into
the cache.
Fairly straightforward; verified that modifications fail atomically if
a worker is down or fails mid-transaction (i.e. all workers need to ack
modifications to reference tables in order to persist changes).
Including several examples from #1926. I couldn't understand why
recover_prepared_transactions "should be an error", and EXPLAIN has
changed since the original bug (so that it runs EXPLAINs in txns, I
think for EXPLAIN ANALYZE to not have side effects); other than that,
most of the reported bugs now error out rather than crash or return
an empty result set.
VACUUM runs outside of a transaction, so the failure modes for it are
somewhat straightforward, though ANALYZE runs in a 1pc transaction and
multi-table VACUUM can fail between statements (PG 11 and higher).
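For illustration (table names are hypothetical), the shapes being exercised are:

```sql
-- VACUUM runs outside a transaction; tables are processed one by one, so a
-- failure can leave earlier tables vacuumed and later ones untouched
VACUUM dist_table_a, dist_table_b;
-- ANALYZE, by contrast, runs in a 1pc transaction
ANALYZE dist_table_a;
```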
Tests various failure points during a multi-shard modification within
a transaction with multiple statements. Verifies three cases:
* Reference tables (single shard, many placements)
* Normal table with replication factor two
* Multi-shard table with no replication
In the replication-factor case, we expect shard health to be affected
in some transactions; most others fail the transaction entirely and
all we need to verify is that no effects of the transaction are visible.
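A rough sketch of the kind of transaction being exercised (table names are made up):

```sql
BEGIN;
UPDATE reference_table SET value = value + 1;          -- single shard, many placements
UPDATE replicated_table SET value = 0 WHERE key = 42;  -- replication factor two
UPDATE multi_shard_table SET value = 0;                -- multi-shard, no replication
-- depending on where the failure hits, either a placement is marked unhealthy
-- (replicated case) or the whole transaction rolls back with no visible effects
COMMIT;
```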
Had trouble testing the final PREPARE/COMMIT/ROLLBACK phase of the 2pc,
in particular because the error message produced includes the PID of
the backend, which is unpredictable.
The DROP SCHEMA command fails in MX mode if there
is a partitioned table with active partitions.
This is due to the fact that the sql_drop trigger receives
all the dropped objects, including partitions. When
we call DROP TABLE on the partition parent, it also drops
the partitions on the MX node. This causes the DROP
TABLE command on the partitions to fail on the MX node
because they were already dropped when the partition
parent was dropped.
With this change, worker_drop_distributed_table no longer
requires the table to exist.
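A sketch of the failing scenario (schema and table names are hypothetical):

```sql
CREATE SCHEMA app;
CREATE TABLE app.events (event_id bigint, created_at date)
    PARTITION BY RANGE (created_at);
CREATE TABLE app.events_2018 PARTITION OF app.events
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
SELECT create_distributed_table('app.events', 'event_id');

-- the sql_drop trigger reports both the parent and the partition; on the MX
-- node the partition is already gone once its parent is dropped, which is why
-- worker_drop_distributed_table now tolerates a missing table
DROP SCHEMA app CASCADE;
```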
PG now allows foreign keys on partitioned tables.
Each foreign key constraint on a partitioned table
is propagated down to its partitions.
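A hedged sketch of the shape this enables (table names and colocation setup are
illustrative; the foreign key is on the distribution column of colocated tables):

```sql
CREATE TABLE customers (customer_id bigint PRIMARY KEY);
CREATE TABLE orders (customer_id bigint, order_date date)
    PARTITION BY RANGE (order_date);
CREATE TABLE orders_2018 PARTITION OF orders
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

SELECT create_distributed_table('customers', 'customer_id');
SELECT create_distributed_table('orders', 'customer_id');

-- PG 11 allows this on the partitioned parent; the constraint is propagated
-- down to orders_2018 and, in Citus, to the corresponding shards
ALTER TABLE orders ADD FOREIGN KEY (customer_id) REFERENCES customers (customer_id);
```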
We used to create all constraints on shards when creating
a new shard, or when simply moving a shard from one worker
to another. We also used the same logic when creating a copy of
a coordinator table on an MX node.
With this change, we create the constraint on the worker node only if
it is not an inherited constraint.
We used to set the execution mode in the truncate trigger. However,
when multiple tables are truncated with a single command, we could
set the execution mode very late. Instead, now set the execution mode
on the utility hook.
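The multi-table case that motivated the change, with made-up table names:

```sql
-- one command, several distributed tables; the execution mode is now chosen
-- in the utility hook before any per-table truncate trigger fires
TRUNCATE orders, order_items;
```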
By setting the CPU tuple cost so high, we were triggering JIT. Instead,
we should use parallel_tuple_cost.
See: rhaas.blogspot.com/2018/06/using-forceparallelmode-correctly.html
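A sketch of the adjusted test setup, assuming the goal is to make parallel plans
attractive without inflating the overall plan cost (which is what pushed the
test over the JIT cost thresholds):

```sql
-- make parallel plans cheap rather than making every plan expensive
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
```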
This reverts commit a2fb5a84f1.
JIT wasn't actually interfering with the operation of Citus, a test was
just written in a way which caused JIT to run for a function on every
row in a 150k-row table.
With this commit, we allow partitioned distributed tables with
replication factor > 1. However, we also have many restrictions.
In summary, we disallow all kinds of modifications (including DDLs)
on the partition tables. Instead, the user is allowed to run the
modifications on the parent table (see the sketch after the list below).
The necessity for such a restriction has two aspects:
- We need to acquire shard resource locks appropriately
- We need to handle marking partitions INVALID in case
of any failures. Note that, in theory, the parent table
should also become INVALID, which is too aggressive.
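A rough illustration of the restriction (names are invented):

```sql
-- allowed: route the modification through the parent table
UPDATE metrics SET payload = '{}' WHERE metric_id = 1;
-- rejected: a direct modification (or DDL) on a partition of a distributed
-- table with replication factor > 1
UPDATE metrics_2018 SET payload = '{}' WHERE metric_id = 1;
```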
We acquire a distributed lock on all MX nodes for the truncated
tables before actually doing the truncate operation.
This is needed for distributed serialization of the truncate
command without causing a deadlock.
The reason for the failure is that PG11 introduced a new relation kind
RELKIND_PARTITIONED_INDEX to be used for partitioned indices.
We expanded our check to cover that case.
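For context (table and index names are arbitrary), PG11 stores the parent index
of a partitioned table under this new relkind:

```sql
CREATE TABLE metrics (metric_id bigint, created_at date)
    PARTITION BY RANGE (created_at);
CREATE INDEX metrics_id_idx ON metrics (metric_id);
-- on PG 11 the parent index has relkind 'I' (RELKIND_PARTITIONED_INDEX)
SELECT relkind FROM pg_class WHERE relname = 'metrics_id_idx';
```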
This commit uses *_walker instead of *_mutator for performance reasons.
Given that we're only updating a functionId in the tree, the approach
seems fine.
When a transaction fails on PREPARE,
we used to hit an assertion ensuring there is no pending
activity on the connection. However, that's no longer true after the
changes in #2031. Thus, we've replaced the assertion with a more
generic function call that consumes any pending activity, if any exists.
In the distributed deadlock detection design, we concluded that prepared transactions
cannot be part of a distributed deadlock. The idea is that (a) when the transaction
is prepared, it has already acquired all of its locks, so it cannot be part of a deadlock, and
(b) even if some other processes are blocked on the prepared transaction, prepared transactions
would eventually be committed (or rolled back) and the system would continue operating.
With the above in mind, we probably had a mistake in terms of memory allocations. For each
backend initialized, we keep a `BackendData` struct. The bug we introduced is that we
assumed there would only be `MaxBackends` backends. However, `MaxBackends` doesn't
include prepared transactions and auxiliary processes. When you check Postgres'
`InitProcGlobal`, you'd see that `TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;`
This commit aligns the total number of procs we track with that.
PG11 introduced the PROCEDURE concept, similar to FUNCTION.
Procedures allow committing/rolling back inside the procedure body.
This commit adds regression tests for procedure calls.
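A minimal PG11 procedure with transaction control (table and procedure names are
placeholders):

```sql
CREATE TABLE proc_test (a int);

CREATE PROCEDURE insert_two()
LANGUAGE plpgsql
AS $$
BEGIN
    INSERT INTO proc_test VALUES (1);
    COMMIT;    -- keeps the first row
    INSERT INTO proc_test VALUES (2);
    ROLLBACK;  -- discards the second row
END;
$$;

CALL insert_two();
```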
With this commit, we implement two views that are very similar
to pg_stat_activity, but showing queries that are involved in
distributed queries:
- citus_dist_stat_activity: Shows all the distributed queries
- citus_worker_stat_activity: Shows all the queries on the shards
that are initiated by distributed queries.
Both views have the same output columns. In very basic terms, both views
are meant to provide some useful insights about the distributed
transactions within the cluster. As the names reveal, both views are similar to pg_stat_activity.
Also note that these views can be pretty useful on Citus MX clusters.
Note that when the views are queried from the worker nodes, they'd not show the distributed
transactions that are initiated from the coordinator node. The reason is that the worker
nodes do not know the host/port of the coordinator. Thus, it is advisable to query the
views from the coordinator.
If we bucket the columns that the views return, we'd end up with the following:
- Hostnames and ports:
- query_hostname, query_hostport: The node on which the query is running
- master_query_host_name, master_query_host_port: The node in the cluster
that initiated the query.
Note that for the citus_dist_stat_activity view, query_hostname-query_hostport
is always the same as master_query_host_name-master_query_host_port. The
distinction is mostly relevant for citus_worker_stat_activity. For example,
on Citus MX, a user starts a transaction on Node-A, which starts worker
transactions on Node-B and Node-C. In that case, the query hostnames would be
Node-B and Node-C whereas the master_query_host_name would be Node-A.
- Distributed transaction related things:
This is mostly the process_id, the distributed transaction id, and the distributed
transaction number.
- pg_stat_activity columns:
These two views get all the columns from pg_stat_activity. We're basically joining
pg_stat_activity with get_all_active_transactions on process_id.
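For example (run from the coordinator; only a handful of the columns are shown):

```sql
SELECT master_query_host_name, master_query_host_port,
       query_hostname, query_hostport, query
FROM citus_worker_stat_activity;
```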
This test's output changes depending on which worker is
picked for explain (e.g., worker port in the output changes).
Given that the test is only aiming to ensure that CTEs inside
CTEs work fine in DML queries, it should be fine to get rid of
the EXPLAIN. The output is verified to be correct as well.
We previously implemented the OTHER_WORKERS_WITH_METADATA tag. However,
that was wrong. See the related discussion:
https://github.com/citusdata/citus/issues/2320
Instead, we switched to using OTHER_WORKER_NODES and made the command
that we're running optional, so that even if the node is not a
metadata node, we won't be in trouble.