The logical replication catchup part for shard splits and shard moves is
very similar. This abstracts most of that similarity away into a single
function. This also improves the logic for non blocking shard splits a
bit, by using faster foreign key creation. It also parallelizes index creation
which shard moves were already doing, but shard splits did not.
Previously we would create foreign keys to reference table in an extra
fast way at the end of a shard move. This uses that same logic to also
do it for foreign keys between distributed tables.
Fixes#6141
DESCRIPTION:
This PR adds support for 'Deferred Drop' and robust 'Shard Cleanup' for Splits.
Common Infrastructure
This PR introduces new common infrastructure so as any operation that wants robust cleanup of resources can register with the cleaner and have the resources cleaned appropriately based on a specified policy. 'Shard Split' is the first consumer using this new infrastructure.
Note : We only support adding 'shards' as resources to be cleaned-up right now but the framework will be extended to support other resources in future.
Deferred Drop for Split
Deferred Drop Support ensures that shards undergoing split are not dropped inline as part of operation but dropped later when no active read queries are running on shard. This helps with :
Avoids any potential deadlock scenarios that can cause long running Split operation to rollback.
Avoids Split operation blocking writes and then getting blocked (due to running queries on the shard) when trying to drop shards.
Deferred drop is the new default behavior going forward.
Shard Cleaner Extension
Shard Cleaner is a background task responsible for deferred drops in case of 'Move' operations.
The cleaner has been extended to ensure robust cleanup of shards (dummy shards and split children) in case of a failure based on the new infrastructure mentioned above. The cleaner also handles deferred drop for 'Splits'.
TESTING:
New test ''citus_split_shard_by_split_points_deferred_drop' to test deferred drop support.
New test 'failure_split_cleanup' to test shard cleanup with failures in different stages.
Update 'isolation_blocking_shard_split and isolation_non_blocking_shard_split' for deferred drop.
Added non-deferred drop version of existing tests : 'citus_split_shard_no_deferred_drop' and 'citus_non_blocking_splits_no_deferred_drop'
Using binary encoding can save a lot of CPU cycles, both on the sender
and on the receiver. Since the walsender and walreceiver processes are
single threaded, this can matter a lot for the throughput if they are
bottlenecked on CPU.
This feature is only available in PG14, not PG13. It should be safe to
always enable because it's only used for types that support binary
encoding according to the PG docs:
> Even when this option is enabled, only data types that have binary
> send and receive functions will be transferred in binary.
But in case it causes problems, it can still be disabled by setting
`citus.enable_binary_protocol` to `false`.
**Intro**
This adds support to Citus to change the CPU priority values of
backends. This is created with two main usecases in mind:
1. Users might want to run the logical replication part of the shard moves
or shard splits at a higher speed than they would do by themselves.
This might cause some small loss of DB performance for their regular
queries, but this is often worth it. During high load it's very possible
that the logical replication WAL sender is not able to keep up with the
WAL that is generated. This is especially a big problem when the
machine is close to running out of disk when doing a rebalance.
2. Users might have certain long running queries that they don't impact
their regular workload too much.
**Be very careful!!!**
Using CPU priorities to control scheduling can be helpful in some cases
to control which processes are getting more CPU time than others.
However, due to an issue called "[priority inversion][1]" it's possible that
using CPU priorities together with the many locks that are used within
Postgres cause the exact opposite behavior of what you intended. This
is why this PR only allows the PG superuser to change the CPU priority
of its own processes. Currently it's not recommended to set `citus.cpu_priority`
directly. Currently the only recommended interface for users is the setting
called `citus.cpu_priority_for_logical_replication_senders`. This setting
controls CPU priority for a very limited set of processes (the logical
replication senders). So, the dangers of priority inversion are also limited
with when using it for this usecase.
**Background**
Before reading the rest it's important to understand some basic
background regarding process CPU priorities, because they are a bit
counter intuitive. A lower priority value, means that the process will
be scheduled more and whatever it's doing will thus complete faster. The
default priority for processes is 0. Valid values are from -20 to 19
inclusive. On Linux a larger difference between values of two processes
will result in a bigger difference in percentage of scheduling.
**Handling the usecases**
Usecase 1 can be achieved by setting `citus.cpu_priority_for_logical_replication_senders`
to the priority value that you want it to have. It's necessary to set
this both on the workers and the coordinator. Example:
```
citus.cpu_priority_for_logical_replication_senders = -10
```
Usecase 2 can with this PR be achieved by running the following as
superuser. Note that this is only possible as superuser currently
due to the dangers mentioned in the "Be very carefull!!!" section.
And although this is possible it's **NOT** recommended:
```sql
ALTER USER background_job_user SET citus.cpu_priority = 5;
```
**OS configuration**
To actually make these settings work well it's important to run Postgres
with more a more permissive value for the 'nice' resource limit than
Linux will do by default. By default Linux will not allow a process to
set its priority lower than it currently is, even if it was lower when
the process originally started. This capability is necessary to reset
the CPU priority to its original value after a transaction finishes.
Depending on how you run Postgres this needs to be done in one of two
ways:
If you use systemd to start Postgres all you have to do is add a line
like this to the systemd service file:
```conf
LimitNice=+0 # the + is important, otherwise its interpreted incorrectly as 20
```
If that's not the case you'll have to configure `/etc/security/limits.conf`
like so, assuming that you are running Postgres as the `postgres` OS user:
```
postgres soft nice 0
postgres hard nice 0
```
Finally you'd have add the following line to `/etc/pam.d/common-session`
```
session required pam_limits.so
```
These settings would allow to change the priority back after setting it
to a higher value.
However, to actually allow you to set priorities even lower than the
default priority value you would need to change the values in the
config to something lower than 0. So for example:
```conf
LimitNice=-10
```
or
```
postgres soft nice -10
postgres hard nice -10
```
If you use WSL2 you'll likely have to do another thing. You have to
open a new shell, because when PAM is only used during login, and
WSL2 doesn't actually log you in. You can force a login like this:
```
sudo su $USER --shell /bin/bash
```
Source: https://stackoverflow.com/a/68322992/2570866
[1]: https://en.wikipedia.org/wiki/Priority_inversion
When introducing non-blocking shard split functionality it was based
heavily on the non-blocking shard moves. However, differences between
usage was slightly to big to be able to reuse the existing functions
easily. So, most logical replication code was simply copied to dedicated
shard split functions and modified for that purpose.
This PR tries to create a more generic logical replication
infrastructure that can be used by both shard splits and shard moves.
There's probably more code sharing possible in the future, but I believe
this is at least a good start and addresses the lowest hanging fruit.
This also adds a CreateSimpleHash function that makes creating the
most common type of hashmap common.
DESCRIPTION: Use faster custom copy logic for non-blocking shard moves
Non-blocking shard moves consist of two main phases:
1. Initial data copy
2. Catchup phase
This changes the first of these phases significantly. Previously we used the
copy logic provided by postgres subscriptions. This meant we didn't have
to implement it ourselves, but it came with the downside of little control.
When implementing shard splits we needed more control to even make it
work, so we implemented our own logic for copying data between nodes.
This PR starts using that logic for non-blocking shard moves. Doing so
has four main advantages:
1. It uses COPY in binary format when possible, which is cheaper to encode
and decode. Furthermore it very often results in less data that needs to
be sent over the network.
2. It allows us to create the primary key (or other replica identity) after doing
the initial data copy. This should give some speed up over the total run,
because creating an index is bulk is much faster than incrementally building it.
3. It doesn't require a replication slot per parallel copy. Increasing the maximum
number of replication slots uses resources in postgres, even if they are not used.
So reducing the number of replication slots that shard moves need is nice.
4. Logical replication table_sync workers are slow to start up, so if lots of shards
need to be copied that can make it quite slow. This can happen easily when
combining Postgres partitioning with Citus.
* Remove if conditions with PG_VERSION_NUM < 13
* Remove server_above_twelve(&eleven) checks from tests
* Fix tests
* Remove pg12 and pg11 alternative test output files
* Remove pg12 specific normalization rules
* Some more if conditions in the code
* Change RemoteCollationIdExpression and some pg12/pg13 comments
* Remove some more normalization rules
Similar to #5897, one more step for running Citus with PG 15.
This PR at least make Citus run with PG 15. I have not tried running the tests with PG 15.
Shmem changes are based on 4f2400cb3f
Compile breaks are mostly due to #6008
This PR makes all of the features open source that were previously only
available in Citus Enterprise.
Features that this adds:
1. Non blocking shard moves/shard rebalancer
(`citus.logical_replication_timeout`)
2. Propagation of CREATE/DROP/ALTER ROLE statements
3. Propagation of GRANT statements
4. Propagation of CLUSTER statements
5. Propagation of ALTER DATABASE ... OWNER TO ...
6. Optimization for COPY when loading JSON to avoid double parsing of
the JSON object (`citus.skip_jsonb_validation_in_copy`)
7. Support for row level security
8. Support for `pg_dist_authinfo`, which allows storing different
authentication options for different users, e.g. you can store
passwords or certificates here.
9. Support for `pg_dist_poolinfo`, which allows using connection poolers
in between coordinator and workers
10. Tracking distributed query execution times using
citus_stat_statements (`citus.stat_statements_max`,
`citus.stat_statements_purge_interval`,
`citus.stat_statements_track`). This is disabled by default.
11. Blocking tenant_isolation
12. Support for `sslkey` and `sslcert` in `citus.node_conninfo`