DESCRIPTION: Fixes deadlock with transaction recovery that is possible
during Citus upgrades.
Fixes#7875.
This commit addresses two interrelated deadlock issues uncovered during Citus
upgrades:
1. Local Deadlock:
- **Problem:**
In `RecoverWorkerTransactions()`, a new connection is created for each worker
node to perform transaction recovery by locking the
`pg_dist_transaction` catalog table until the end of the transaction. When
`RecoverTwoPhaseCommits()` calls this function for each worker node, the order
of acquiring locks on `pg_dist_authinfo` and `pg_dist_transaction` can alternate.
This reversal can lead to a deadlock if any concurrent process requires locks on
these tables.
- **Fix:**
Pre-establish all worker node connections upfront so that
`RecoverWorkerTransactions()` operates with a single, consistent connection.
This ensures that locks on `pg_dist_authinfo` and `pg_dist_transaction` are always
acquired in the correct order, thereby preventing the local deadlock.
2. Distributed Deadlock:
- **Problem:**
After resolving the local deadlock, a distributed deadlock issue emerges. The
maintenance daemon calls `RecoverWorkerTransactions()` on each worker node—
including the local node—which leads to a complex locking sequence:
- A RowExclusiveLock is taken on the `pg_dist_transaction` table in
`RecoverWorkerTransactions()`.
- An update extension then tries to acquire an AccessExclusiveLock on the same
table, getting blocked by the RowExclusiveLock.
- A subsequent query (e.g., a SELECT on `pg_prepared_xacts`) issued using a
separate connection on the local node gets blocked due to locks held during a
call to `BuildCitusTableCacheEntry()`.
- The maintenance daemon waits for this query, resulting in a circular wait and
stalling the entire cluster.
- **Fix:**
Avoid cache lookups for internal PostgreSQL tables by implementing an early bailout
for relation IDs below `FirstNormalObjectId` (system objects). This eliminates
unnecessary calls to `BuildCitusTableCache`, reducing lock contention and mitigating
the distributed deadlock.
Furthermore, this optimization improves performance in fast
connect→query_catalog→disconnect cycles by eliminating redundant
cache creation and lookups.
3. Also reverts the commit that disabled the relevant test cases.
DESCRIPTION: Drops PG14 support
1. Remove "$version_num" != 'xx' from configure file
2. delete all PG_VERSION_NUM = PG_VERSION_XX references in the code
3. Look at pg_version_compat.h file, remove all _compat functions etc
defined specifically for PGXX differences
4. delete all PG_VERSION_NUM >= PG_VERSION_(XX+1), PG_VERSION_NUM <
PG_VERSION_(XX+1) ifs in the codebase
5. delete ruleutils_xx.c file
6. cleanup normalize.sed file from pg14 specific lines
7. delete all alternative output files for that particular PG version,
server_version_ge variable helps here
As of this commit, after recovering the remote transactions, now we release the lock
on pg_dist_transaction while closing it to avoid deadlocks that might occur because
of trying to acquire a lock on pg_dist_authinfo while holding a lock on
pg_dist_transaction. Such a scenario can only cause a deadlock if another transaction
is trying to acquire a strong lock on pg_dist_transaction while holding a lock on
pg_dist_authinfo. As of today, we (implicitly) acquire a strong lock on
pg_dist_transaction only when upgrading Citus to 11.3-1 and this happens when creating
a REPLICA IDENTITY on pg_dist_transaction.
And regardless of the code-path we are in, it should be okay to release the lock there
because all we do after that point is to abort the prepared transactions that are not
part of an in-progress distributed transaction and releasing the lock before doing so
should be just fine.
This also changes the blocking behavior between citus_create_restore_point and the
transaction recovery code-path in the sense that now citus_create_restore_point doesn't
until transaction recovery completes aborting the prepared transactions that are not
part of an in-progress distributed transaction. However, this should be fine because
even before this was possible, e.g., if transaction recovery fails to open a remote
connection to a node.
This PR provides successful compilation against PG17.0.
- Remove ExecFreeExprContext call
Relevant PG commit
d060e921ea5aa47b6265174c32e1128cebdbc3df
d060e921ea
- PG17 uses streaming IO in analyze, fix scan_analyze_next_block function
Relevant PG commit
041b96802efa33d2bc9456f2ad946976b92b5ae1
041b96802e
- Define ObjectClass for PG17+ only since it's removed
Relevant PG commit:
89e5ef7e21812916c9cf9fcf56e45f0f74034656
89e5ef7e21
- Remove ReorderBufferTupleBuf structure.
Relevant PG commit:
08e6344fd6423210b339e92c069bb979ba4e7cd6
08e6344fd6
- Define colliculocale and daticulocale since they have been renamed
Relevant PG commit:
f696c0cd5f299f1b51e214efc55a22a782cc175d
f696c0cd5f
- makeStringConst defined in PG17
Relevant PG commit:
de3600452b61d1bc3967e9e37e86db8956c8f577
de3600452b
- RangeVarCallbackOwnsTable was replaced by RangeVarCallbackMaintainsTable
Relevant PG commit:
ecb0fd33720fab91df1207e85704f382f55e1eb7
ecb0fd3372
- attstattarget is nullable, define pg compatible functions for it
Relevant PG commit:
4f622503d6de975ac87448aea5cea7de4bc140d5
4f622503d6
- stxstattarget is nullable in PG17, write compat functions for it
Relevant PG commit:
012460ee93c304fbc7220e5b55d9d0577fc766ab
012460ee93
- Use ResourceOwner to track WaitEventSet in PG17
Relevant PG commit:
50c67c2019ab9ade8aa8768bfe604cd802fe8591
50c67c2019
- getIdentitySequence now uses Relation instead of relation_id
Relevant PG commit:
509199587df73f06eda898ae13284292f4ae573a
509199587d
- Remove no-op tuplestore_donestoring function
Relevant PG commit:
75680c3d805e2323cd437ac567f0677fdfc7b680
75680c3d80
- MergeAction can have 3 merge kinds (now enum) in PG17, write compat
Relevant PG commit:
0294df2f1f842dfb0eed79007b21016f486a3c6c
0294df2f1f
- EXPLAIN (MEMORY) is added, make changes to ExplainOnePlan
Relevant PG commit:
5de890e3610d5a12cdaea36413d967cf5c544e20
5de890e361
- LIMIT_OPTION_DEFAULT has been removed as it's useless, use LIMIT_OPTION_COUNT
Relevant PG commit:
a6be0600ac3b71dda8277ab0fcbe59ee101ac1ce
a6be0600ac
- write compat for create_foreignscan_path bcs of more arguments in PG17
Relevant PG commit:
9e9931d2bf40e2fea447d779c2e133c2c1256ef3
9e9931d2bf
- pgprocno and lxid have been combined into a struct in PGPROC
Relevant PG commits:
28f3915b73f75bd1b50ba070f56b34241fe53fd1
28f3915b73
ab355e3a88de745607f6dd4c21f0119b5c68f2ad
ab355e3a88
024c521117579a6d356050ad3d78fdc95e44eefa
024c521117
- Simplify CitusNewNode (#7434)
postgres refactored newNode() in PG 17, the main point for doing this is
the original tricks is no longer neccessary for modern compilers[1].
This does the same for Citus.
This should have no backward compatibility issues since it just replaces
palloc0fast with palloc0.
This is good for forward compatibility since palloc0fast no longer
exists in PG 17.
[1]
https://www.postgresql.org/message-id/b51f1fa7-7e6a-4ecc-936d-90a8a1659e7c@iki.fi
(cherry picked from commit 4b295cc)
This is prep work for successful compilation with PG17
PG17added foreach_ptr, foreach_int and foreach_oid macros
Relevant PG commit
14dd0f27d7cd56ffae9ecdbe324965073d01a9ff
14dd0f27d7
We already have these macros, but they are different with the
PG17 ones because our macros take a DECLARED variable, whereas
the PG16 macros declare a locally-scoped loop variable themselves.
Hence I am renaming our macros to foreach_declared_
I am separating this into its own PR since it touches many files. The
main compilation PR is https://github.com/citusdata/citus/pull/7699
DESCRIPTION: Fixes a crash that happens because of unsafe catalog access
when re-assigning the global pid after application_name changes.
When application_name changes, we don't actually need to
try re-assigning the global pid for external client backends because
application_name doesn't affect the global pid for such backends. Plus,
trying to re-assign the global pid for external client backends would
unnecessarily cause performing a catalog access when the cached local
node id is invalidated. However, accessing to the catalog tables is
dangerous in certain situations like when we're not in a transaction
block. And for the other types of backends, i.e., the Citus internal
backends, we need to re-assign the global pid when the application_name
changes because for such backends we simply extract the global pid
inherited from the originating backend from the application_name -that's
specified by originating backend when openning that connection- and this
doesn't require catalog access.
Fixes https://github.com/citusdata/citus/issues/7536.
Note to reviewer:
Before this commit, the following results in an assertion failure when
executed locally and this won't be the case anymore:
```console
make -C src/test/regress/ check-citus-upgrade-local citus-old-version=v10.2.0
```
Note that this doesn't happen on CI as we don't enable assertions there.
---------
Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
DESCRIPTION: Adds support for distributed `CREATE/DROP DATABASE `
commands from the databases where Citus is not installed
---------
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
Moves the following functions to the Citus internal schema:
citus_internal_local_blocked_processes
citus_internal_global_blocked_processes
citus_internal_mark_node_not_synced
citus_internal_unregister_tenant_schema_globally
citus_internal_update_none_dist_table_metadata
citus_internal_update_placement_metadata
citus_internal_update_relation_colocation
citus_internal_start_replication_origin_tracking
citus_internal_stop_replication_origin_tracking
citus_internal_is_replication_origin_tracking_active
#7405
---------
Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
This PR makes the connections to other nodes for
`mark_object_distributed` use the same user as
`execute_command_on_remote_nodes_as_user` so they'll use the same
connection.
DESCRIPTION: Adds REASSIGN OWNED BY propagation
This pull request introduces the propagation of the "Reassign owned by"
statement. It accommodates both local and distributed roles for both the
old and new assignments. However, when the old role is a local role, it
undergoes filtering and is not propagated. On the other hand, if the new
role is a local role, the process involves first creating the role on
worker nodes before propagating the "Reassign owned" statement.
DESCRIPTION: Adds support for 2PC from non-Citus main databases
This PR only adds support for `CREATE USER` queries, other queries need
to be added. But it should be simple because this PR creates the
underlying structure.
Citus main database is the database where the Citus extension is
created. A non-main database is all the other databases that are in the
same node with a Citus main database.
When a `CREATE USER` query is run on a non-main database we:
1. Run `start_management_transaction` on the main database. This
function saves the outer transaction's xid (the non-main database
query's transaction id) and marks the current query as main db command.
2. Run `execute_command_on_remote_nodes_as_user("CREATE USER
<username>", <username to run the command>)` on the main database. This
function creates the users in the rest of the cluster by running the
query on the other nodes. The user on the current node is created by the
query on the outer, non-main db, query to make sure consequent commands
in the same transaction can see this user.
3. Run `mark_object_distributed` on the main database. This function
adds the user to `pg_dist_object` in all of the nodes, including the
current one.
This PR also implements transaction recovery for the queries from
non-main databases.
This change adds a script to programatically group all includes in a
specific order. The script was used as a one time invocation to group
and sort all includes throught our formatted code. The grouping is as
follows:
- System includes (eg. `#include<...>`)
- Postgres.h (eg. `#include "postgres.h"`)
- Toplevel imports from postgres, not contained in a directory (eg.
`#include "miscadmin.h"`)
- General postgres includes (eg . `#include "nodes/..."`)
- Toplevel citus includes, not contained in a directory (eg. `#include
"citus_verion.h"`)
- Columnar includes (eg. `#include "columnar/..."`)
- Distributed includes (eg. `#include "distributed/..."`)
Because it is quite hard to understand the difference between toplevel
citus includes and toplevel postgres includes it hardcodes the list of
toplevel citus includes. In the same manner it assumes anything not
prefixed with `columnar/` or `distributed/` as a postgres include.
The sorting/grouping is enforced by CI. Since we do so with our own
script there are not changes required in our uncrustify configuration.
DESCRIPTION: Adds support from issuing role management commands from worker nodes
It's unlikely to get into a distributed deadlock with role commands, we
don't care much about them at the moment.
There were several attempts to reduce the chances of a deadlock but we
didn't any of them merged into main branch yet, see:
#7325#7016#7009
**Problem:**
Previously we always used an outside superuser connection to overcome
permission issues for the current user while propagating dependencies.
That has mainly 2 problems:
1. Visibility issues during dependency propagation, (metadata connection
propagates some objects like a schema, and outside transaction does not
see it and tries to create it again)
2. Security issues (it is preferrable to use current user's connection
instead of extension superuser)
**Solution (high level):**
Now, we try to make a smarter decision on whether should we use an
outside superuser connection or current user's metadata connection. We
prefer using current user's connection if any of the objects, which is
already propagated in the current transaction, is a dependency for a
target object. We do that since we assume if current user has
permissions to create the dependency, then it can most probably
propagate the target as well.
Our assumption is expected to hold most of the times but it can still be
wrong. In those cases, transaction would fail and user should set the
GUC `citus.create_object_propagation` to `deferred` to work around it.
**Solution:**
1. We track all objects propagated in the current transaction (we can
handle subtransactions),
2. We propagate dependencies via the current user's metadata connection
if any dependency is created in the current transaction to address
issues listed above. Otherwise, we still use an outside superuser
connection.
DESCRIPTION: Fixes some object propagation errors seen with transaction
blocks.
Fixes https://github.com/citusdata/citus/issues/6614
---------
Co-authored-by: Nils Dijk <nils@citusdata.com>
This PR provides successful compilation against PG16Beta2. It does some
necessary refactoring to prepare for full support of version 16, in
https://github.com/citusdata/citus/pull/6952 .
Change RelFileNode to RelFileNumber or RelFileLocator
Relevant PG commit
b0a55e43299c4ea2a9a8c757f9c26352407d0ccc
new header for varatt.h
Relevant PG commit:
d952373a987bad331c0e499463159dd142ced1ef
drop support for Abs, use fabs
Relevant PG commit
357cfefb09115292cfb98d504199e6df8201c957
tuplesort PGcommit: d37aa3d35832afde94e100c4d2a9618b3eb76472
Relevant PG commit:
d37aa3d35832afde94e100c4d2a9618b3eb76472
Fix vacuum in columnar
Relevant PG commit:
4ce3afb82ecfbf64d4f6247e725004e1da30f47c
older one:
b6074846cebc33d752f1d9a66e5a9932f21ad177
Add alloc_flags to pg_clean_ascii
Relevant PG commit:
45b1a67a0fcb3f1588df596431871de4c93cb76f
Merge GetNumConfigOptions() into get_guc_variables()
Relevant PG commit:
3057465acfbea2f3dd7a914a1478064022c6eecd
Minor PG refactor PG_FUNCNAME_MACRO __func__
Relevant PG commit
320f92b744b44f961e5d56f5f21de003e8027a7f
Pass NULL context to stringToQualifiedNameList, typeStringToTypeName
The pre-PG16 error behaviour for the following
stringToQualifiedNameList & typeStringToTypeName
was ereport(ERROR, ...)
Now with PG16 we have this context input. We preserve the same behaviour
by passing a NULL context, because of the following:
(copy paste comment from PG16)
If "context" isn't an ErrorSaveContext node, this behaves as
errstart(ERROR, domain), and the errsave() macro ends up acting
exactly like ereport(ERROR, ...).
Relevant PG commit
858e776c84f48841e7e16fba7b690b76e54f3675
Use RangeVarCallbackMaintainsTable instead of RangeVarCallbackOwnsTable
Relevant PG commit:
60684dd834a222fefedd49b19d1f0a6189c1632e
FIX THIS: Not implemented grant-level control of role inheritance
see PG commit
e3ce2de09d814f8770b2e3b3c152b7671bcdb83f
Make Scan node abstract
PG commit:
8c73c11a0d39049de2c1f400d8765a0eb21f5228
Change in Var representations, get_relids_in_jointree
PG commit
2489d76c4906f4461a364ca8ad7e0751ead8aa0d
Deadlock detection changes because SHM_QUEUE is removed
Relevant PG Commit:
d137cb52cb7fd44a3f24f3c750fbf7924a4e9532
TU_UpdateIndexes
Relevant PG commit
19d8e2308bc51ec4ab993ce90077342c915dd116
Use object_ownercheck and object_aclcheck functions
Relevant PG commits:
afbfc02983f86c4d71825efa6befd547fe81a926
c727f511bd7bf3c58063737bcf7a8f331346f253
Rework Permission Info for successful compilation
Relevant PG commits:
postgres/postgres@a61b1f7postgres/postgres@b803b7d
---------
Co-authored-by: onderkalaci <onderkalaci@gmail.com>
This commit is the second and last phase of dropping PG13 support.
It consists of the following:
- Removes all PG_VERSION_13 & PG_VERSION_14 from codepaths
- Removes pg_version_compat entries and columnar_version_compat entries
specific for PG13
- Removes alternative pg13 test outputs
- Removes PG13 normalize lines and fix the test outputs based on that
It is a continuation of 5bf163a27d
With this PR, we allow creating distributed tables with without
specifying a shard key via create_distributed_table(). Here are the
the important details about those tables:
* Specifying `shard_count` is not allowed because it is assumed to be 1.
* We mostly call such tables as "null shard-key" table in code /
comments.
* To avoid doing a breaking layout change in create_distributed_table();
instead of throwing an error, it will inform the user that
`distribution_type`
param is ignored unless it's explicitly set to NULL or 'h'.
* `colocate_with` param allows colocating such null shard-key tables to
each other.
* We define this table type, i.e., NULL_SHARD_KEY_TABLE, as a subclass
of
DISTRIBUTED_TABLE because we mostly want to treat them as distributed
tables in terms of SQL / DDL / operation support.
* Metadata for such tables look like:
- distribution method => DISTRIBUTE_BY_NONE
- replication model => REPLICATION_MODEL_STREAMING
- colocation id => **!=** INVALID_COLOCATION_ID (distinguishes from
Citus local tables)
* We assign colocation groups for such tables to different nodes in a
round-robin fashion based on the modulo of "colocation id".
Note that this PR doesn't care about DDL (except CREATE TABLE) / SQL /
operation (i.e., Citus UDFs) support for such tables but adds a
preliminary
API.
Description:
Implementing CDC changes using Logical Replication to avoid
re-publishing events multiple times by setting up replication origin
session, which will add "DoNotReplicateId" to every WAL entry.
- shard splits
- shard moves
- create distributed table
- undistribute table
- alter distributed tables (for some cases)
- reference table operations
The citus decoder which will be decoding WAL events for CDC clients,
ignores any WAL entry with replication origin that is not zero.
It also maps the shard names to distributed table names.
Now that we will soon add another table type having DISTRIBUTE_BY_NONE
as distribution method and that we want the code to interpret such
tables mostly as distributed tables, let's make the definition of those
other two table types more strict by removing
CITUS_TABLE_WITH_NO_DIST_KEY
macro.
And instead, use HasDistributionKey() check in the places where the
logic applies to all table types that have / don't have a distribution
key. In future PRs, we might want to convert some of those
HasDistributionKey() checks if logic only applies to Citus local /
reference tables, not the others.
And adding HasDistributionKey() also allows us to consider having
DISTRIBUTE_BY_NONE as the distribution method as a "table attribute"
that can apply to distributed tables too, rather something that
determines the table type.
In #6314 I refactored the connection cleanup to be simpler to
understand and use. However, by doing so I introduced a use-after-free
possibility (that valgrind luckily picked up):
In the `ShouldShutdownConnection` path of
`AfterXactHostConnectionHandling`
we free connections without removing the `transactionNode` from the
dlist that it might be part of. Before the refactoring this wasn't a
problem, because the dlist would be completely reset quickly after in
`ResetGlobalVariables` (without reading or writing the dlist entries).
The refactoring changed this by moving the `dlist_delete` call to
`ResetRemoteTransaction`, which in turn was called in the
`!ShouldShutdownConnection` path of `AfterXactHostConnectionHandling`.
Thus this `!ShouldShutdownConnection` path would now delete from the
`dlist`, but the `ShouldShutdownConnection` path would not. Thus to
remove itself the deleting path would sometimes update nodes in the list
that were freed right before.
There's two ways of fixing this:
1. Call `dlist_delete` from **both** of paths.
2. Call `dlist_delete` from **neither** of the paths.
This commit implements the second approach, and #6684 implements the
first. We need to choose which approach we prefer.
To make calling `dlist_delete` from both paths actually work, we also need
to use a slightly different check to determine if we need to call dlist_delete.
Various regression tests showed that there can be cases where the
`transactionState` is something else than `REMOTE_TRANS_NOT_STARTED`
but the connection was not added to the `InProgressTransactions` list
One example of such a case is when running `TransactionStateMachine`
without calling `StartRemoteTransactionBegin` beforehand. In those
cases the connection won't be added to `InProgressTransactions`, but
the `transactionState` is changed to `REMOTE_TRANS_SENT_COMMAND`.
Sidenote: This bug already existed in 11.1, but valgrind didn't catch it
back then. My guess is that this happened because #6314 was merged after
the initial release branch was cut.
Fixes#6638
The original implementation of GPIDs didn't work correctly when using
`pg_dist_poolinfo` together with PgBouncer. The reason is that it
assumed that once a connection was made to a worker, the originating
GPID should stay the same for ever. But when pg_dist_poolinfo is used
this isn't the case, because the same connection on the worker might be
used by different backends of the coordinator.
This fixes that issue by updating the GPID whenever a new application
name is set on a connection. This is the only thing that's needed,
because PgBouncer already sets the application name correctly on the
server connection whenever a client is updated.
In #6412 I made a change to not re-assign the global PID if it was
already set. This inadvertently introduced a regression where `userId`
and `databaseId` would not be set on the backend data when the global
PID was assigned in the authentication hook.
This fixes it by doing two things:
1. Removing `userId` from `BackendData`, since it's not used anywhere
anyway.
2. Move assignment of `databaseId` to dedicated
`SetBackendDataDatabaseId` function, that isn't a no-op when global
pid is already set.
Since #6412 is not released yet this does not need a description.
When debugging issues it's quite useful to see the originating gpid in
the application_name of a query on a worker. This already happens for
most queries, but not for queries created by the rebalancer or by
run_command_on_worker. This adds a gpid to those two application_names
too.
Note, that if the GPID of the new application_names is different than
the current GPID of the backend the backend will continue to keep
the old gpid as its actual GPID. This PR is just meant to make sure
that the application_name is as useful as it can be for users to
look at. Updating of gpids will be done in a follow-up PR, and
adding gpids to all internal connections will make this easier.
(Hopefully) Fixes#5000.
If memory allocation done for `SubXactContext *state` in `PushSubXact()`
fails, then `PopSubXact()` might segfault, for example, when grabbing
the
topmost `SubXactContext` from `activeSubXactContexts` if this is the
first
ever subxact within the current xact, with the following stack trace:
```c
citus.so!list_nth_cell(const List * list, int n) (\opt\pgenv\pgsql-14.3\include\server\nodes\pg_list.h:260)
citus.so!PopSubXact(SubTransactionId subId) (\home\onurctirtir\citus\src\backend\distributed\transaction\transaction_management.c:761)
citus.so!CoordinatedSubTransactionCallback(SubXactEvent event, SubTransactionId subId, SubTransactionId parentSubid, void * arg) (\home\onurctirtir\citus\src\backend\distributed\transaction\transaction_management.c:673)
CallSubXactCallbacks(SubXactEvent event, SubTransactionId mySubid, SubTransactionId parentSubid) (\opt\pgenv\src\postgresql-14.3\src\backend\access\transam\xact.c:3644)
AbortSubTransaction() (\opt\pgenv\src\postgresql-14.3\src\backend\access\transam\xact.c:5058)
AbortCurrentTransaction() (\opt\pgenv\src\postgresql-14.3\src\backend\access\transam\xact.c:3366)
PostgresMain(int argc, char ** argv, const char * dbname, const char * username) (\opt\pgenv\src\postgresql-14.3\src\backend\tcop\postgres.c:4250)
BackendRun(Port * port) (\opt\pgenv\src\postgresql-14.3\src\backend\postmaster\postmaster.c:4530)
BackendStartup(Port * port) (\opt\pgenv\src\postgresql-14.3\src\backend\postmaster\postmaster.c:4252)
ServerLoop() (\opt\pgenv\src\postgresql-14.3\src\backend\postmaster\postmaster.c:1745)
PostmasterMain(int argc, char ** argv) (\opt\pgenv\src\postgresql-14.3\src\backend\postmaster\postmaster.c:1417)
main(int argc, char ** argv) (\opt\pgenv\src\postgresql-14.3\src\backend\main\main.c:209)
```
For this reason, to be more defensive against memory-allocation errors
that could happen at `PushSubXact()`, now we use our pre-allocated
memory
context for the objects created in `PushSubXact()`.
This commit also attempts reducing the memory allocations done under
CommitContext to reduce the chances of consuming all the memory
available
to CommitContext.
Note that it's problematic to encounter with such a memory-allocation
error for other objects created in `PushSubXact()` as well, so above is
an **example** scenario that might result in a segfault.
DESCRIPTION: Fixes a bug that might cause segfaults when handling deeply
nested subtransactions
DESCRIPTION: Fix bug in global PID assignment for rebalancer
sub-connections
In CI our isolation_shard_rebalancer_progress test would sometimes fail
like this:
```diff
+isolationtester: canceling step s1-rebalance-c1-block-writes after 60 seconds
step s1-rebalance-c1-block-writes:
SELECT rebalance_table_shards('colocated1', shard_transfer_mode:='block_writes');
- <waiting ...>
+
+ERROR: canceling statement due to user request
step s7-get-progress:
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/27855/workflows/2a7e335a-f3e8-46ed-b6bd-6920d42f7214/jobs/831710
It turned out this was an actual bug in the way our assigning of global
PIDs interacts with the way we connect to ourselves as the shard
rebalancer. The first command the shard rebalancer sends is a SET
ommand to change the application_name to `citus_rebalancer`. If
`StartupCitusBackend` is called after this command is processed, then it
overwrites the global PID that was extracted from the previous
application_name. This makes sure that we don't do that, and continue to
use the original global PID. While it might seem that we only call
`StartupCitusBackend` once for each query backend, this isn't actually
the case. Whenever pg_dist_partition gets ANALYZEd by autovacuum
we indirectly call `StartupCitusBackend` again, because we invalidate
the cache then.
In passing this fixes two other things as well:
1. It sets `distributedCommandOriginator` correctly in
`AssignGlobalPID`, by using IsExternalClientBackend(). This doesn't
matter much anymore, since AssignGlobalPID effectively becomes a
no-op in this PR for any non-external client backends.
2. It passes the application_name to InitializeBackendData in
StartupCitusBackend, instead of INVALID_CITUS_INTERNAL_BACKEND_GPID
(which effectively got casted to NULL). In practice this doesn't
change the behaviour of the call, since the call is a no-op for every
backend except the maintenance daemon. And the behaviour of the call
is the same for NULL as for the application_name of the maintenance
daemon.
In Split, Logical replication logic and ShardCleaner we call
`SendCommandListToWorkerOutsideTransaction` and
`SendOptionalCommandListToWorkerOutsideTransaction` frequently. This
opens new connection for each of those calls, even though we already
have a perfectly good connection lying around.
This PR adds two new APIs
`SendCommandListToWorkerOutsideTransactionWithConnection` and
`SendOptionalCommandListToWorkerOutsideTransactionWithConnection` that
allow sending a list of queries in a transaction over an existing
connection. We also update the callers (Split, ShardCleaner, Logical
Replication) to use these new APIs instead.
Co-authored-by: Nitish Upreti <niupre@microsoft.com>
Co-authored-by: Onder Kalaci <onderkalaci@gmail.com>
DESCRIPTION:
This PR adds support for 'Deferred Drop' and robust 'Shard Cleanup' for Splits.
Common Infrastructure
This PR introduces new common infrastructure so as any operation that wants robust cleanup of resources can register with the cleaner and have the resources cleaned appropriately based on a specified policy. 'Shard Split' is the first consumer using this new infrastructure.
Note : We only support adding 'shards' as resources to be cleaned-up right now but the framework will be extended to support other resources in future.
Deferred Drop for Split
Deferred Drop Support ensures that shards undergoing split are not dropped inline as part of operation but dropped later when no active read queries are running on shard. This helps with :
Avoids any potential deadlock scenarios that can cause long running Split operation to rollback.
Avoids Split operation blocking writes and then getting blocked (due to running queries on the shard) when trying to drop shards.
Deferred drop is the new default behavior going forward.
Shard Cleaner Extension
Shard Cleaner is a background task responsible for deferred drops in case of 'Move' operations.
The cleaner has been extended to ensure robust cleanup of shards (dummy shards and split children) in case of a failure based on the new infrastructure mentioned above. The cleaner also handles deferred drop for 'Splits'.
TESTING:
New test ''citus_split_shard_by_split_points_deferred_drop' to test deferred drop support.
New test 'failure_split_cleanup' to test shard cleanup with failures in different stages.
Update 'isolation_blocking_shard_split and isolation_non_blocking_shard_split' for deferred drop.
Added non-deferred drop version of existing tests : 'citus_split_shard_no_deferred_drop' and 'citus_non_blocking_splits_no_deferred_drop'
Sometimes in CI our isolation_citus_dist_activity test fails randomly
like this:
```diff
step s2-view-dist:
SELECT query, citus_nodename_for_nodeid(citus_nodeid_for_gpid(global_pid)), citus_nodeport_for_nodeid(citus_nodeid_for_gpid(global_pid)), state, wait_event_type, wait_event, usename, datname FROM citus_dist_stat_activity WHERE query NOT ILIKE ALL(VALUES('%pg_prepared_xacts%'), ('%COMMIT%'), ('%BEGIN%'), ('%pg_catalog.pg_isolation_test_session_is_blocked%'), ('%citus_add_node%')) AND backend_type = 'client backend' ORDER BY query DESC;
query |citus_nodename_for_nodeid|citus_nodeport_for_nodeid|state |wait_event_type|wait_event|usename |datname
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+-------------------+---------------+----------+--------+----------
INSERT INTO test_table VALUES (100, 100);
|localhost | 57636|idle in transaction|Client |ClientRead|postgres|regression
-(1 row)
+
+ SELECT coalesce(to_jsonb(array_agg(csa_from_one_node.*)), '[{}]'::JSONB)
+ FROM (
+ SELECT global_pid, worker_query AS is_worker_query, pg_stat_activity.* FROM
+ pg_stat_activity LEFT JOIN get_all_active_transactions() ON process_id = pid
+ ) AS csa_from_one_node;
+ |localhost | 57636|active | | |postgres|regression
+(2 rows)
step s3-view-worker:
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26692/workflows/3406e4b4-b686-4667-bec6-8253ee0809b1/jobs/765119
I intended to fix this with #6263, but the fix turned out to be
insufficient. This PR tries to address the issue by setting
distributedCommandOriginator correctly in more situations. However, even
with this change it's still possible to reproduce the flaky test in CI.
In any case this should fix at least some instances of this issue.
In passing this changes the isolation_citus_dist_activity test to allow
running it multiple times in a row.
Sometimes in CI our isolation_citus_dist_activity test fails randomly
like this:
```diff
step s2-view-dist:
SELECT query, citus_nodename_for_nodeid(citus_nodeid_for_gpid(global_pid)), citus_nodeport_for_nodeid(citus_nodeid_for_gpid(global_pid)), state, wait_event_type, wait_event, usename, datname FROM citus_dist_stat_activity WHERE query NOT ILIKE ALL(VALUES('%pg_prepared_xacts%'), ('%COMMIT%'), ('%BEGIN%'), ('%pg_catalog.pg_isolation_test_session_is_blocked%'), ('%citus_add_node%')) AND backend_type = 'client backend' ORDER BY query DESC;
query |citus_nodename_for_nodeid|citus_nodeport_for_nodeid|state |wait_event_type|wait_event|usename |datname
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+-------------------+---------------+----------+--------+----------
INSERT INTO test_table VALUES (100, 100);
|localhost | 57636|idle in transaction|Client |ClientRead|postgres|regression
-(1 row)
+
+ SELECT coalesce(to_jsonb(array_agg(csa_from_one_node.*)), '[{}]'::JSONB)
+ FROM (
+ SELECT global_pid, worker_query AS is_worker_query, pg_stat_activity.* FROM
+ pg_stat_activity LEFT JOIN get_all_active_transactions() ON process_id = pid
+ ) AS csa_from_one_node;
+ |localhost | 57636|active | | |postgres|regression
+(2 rows)
step s3-view-worker:
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26605/workflows/56d284d2-5bb3-4e64-a0ea-7b9b1626e7cd/jobs/760633
The reason for this is that citus_dist_stat_activity sometimes shows the
query that it uses itself to get the data from pg_stat_activity. This is
actually a bug, because it's a worker query and thus shouldn't show up
there. To try and solve this bug, we remove two small opportunities for a
race condition. These race conditions could happen when the backenddata
was marked as active, but the distributedCommandOriginator was not set
correctly yet/anymore. There was an opportunity for this to happen both
during connection start and shutdown.
DESCRIPTION: Use faster custom copy logic for non-blocking shard moves
Non-blocking shard moves consist of two main phases:
1. Initial data copy
2. Catchup phase
This changes the first of these phases significantly. Previously we used the
copy logic provided by postgres subscriptions. This meant we didn't have
to implement it ourselves, but it came with the downside of little control.
When implementing shard splits we needed more control to even make it
work, so we implemented our own logic for copying data between nodes.
This PR starts using that logic for non-blocking shard moves. Doing so
has four main advantages:
1. It uses COPY in binary format when possible, which is cheaper to encode
and decode. Furthermore it very often results in less data that needs to
be sent over the network.
2. It allows us to create the primary key (or other replica identity) after doing
the initial data copy. This should give some speed up over the total run,
because creating an index is bulk is much faster than incrementally building it.
3. It doesn't require a replication slot per parallel copy. Increasing the maximum
number of replication slots uses resources in postgres, even if they are not used.
So reducing the number of replication slots that shard moves need is nice.
4. Logical replication table_sync workers are slow to start up, so if lots of shards
need to be copied that can make it quite slow. This can happen easily when
combining Postgres partitioning with Citus.