DESCRIPTION: Fix reference table lock contention
Dropping and creating reference tables unintentionally blocked on each other due to the use of an ExclusiveLock for both the Drop and conditionally copying existing reference tables to (new) nodes.
The patch does the following:
- Lower lock lever for dropping (reference) tables to `ShareLock` so they don't self conflict
- Treat reference tables and distributed tables equally and acquire the colocation lock when dropping any table that is in a colocation group
- Perform the precondition check for copying reference tables twice, first time with a lower lock that doesn't conflict with anything. Could have been a NoLock, however, in preparation for dropping a colocation group, it is an `AccessShareLock`
During normal operation the first check will always pass and we don't have to escalate that lock. Making it that we won't be blocked on adding and remove reference tables. Only after a node addition the first `create_reference_table` will still need to acquire an `ExclusiveLock` on the colocation group to perform the copy.
This is a refactoring PR that starts using our new hash table creation
helper function. It adds a few more macros for ease of use, because C
doesn't have default arguments. It also adds a macro to check if a
struct contains automatic padding bytes. No struct that is hashed using
tag_hash should have automatic padding bytes, because those bytes are
undefined and thus using them to create a hash will result in undefined
behaviour (usually a random hash).
**Intro**
This adds support to Citus to change the CPU priority values of
backends. This is created with two main usecases in mind:
1. Users might want to run the logical replication part of the shard moves
or shard splits at a higher speed than they would do by themselves.
This might cause some small loss of DB performance for their regular
queries, but this is often worth it. During high load it's very possible
that the logical replication WAL sender is not able to keep up with the
WAL that is generated. This is especially a big problem when the
machine is close to running out of disk when doing a rebalance.
2. Users might have certain long running queries that they don't impact
their regular workload too much.
**Be very careful!!!**
Using CPU priorities to control scheduling can be helpful in some cases
to control which processes are getting more CPU time than others.
However, due to an issue called "[priority inversion][1]" it's possible that
using CPU priorities together with the many locks that are used within
Postgres cause the exact opposite behavior of what you intended. This
is why this PR only allows the PG superuser to change the CPU priority
of its own processes. Currently it's not recommended to set `citus.cpu_priority`
directly. Currently the only recommended interface for users is the setting
called `citus.cpu_priority_for_logical_replication_senders`. This setting
controls CPU priority for a very limited set of processes (the logical
replication senders). So, the dangers of priority inversion are also limited
with when using it for this usecase.
**Background**
Before reading the rest it's important to understand some basic
background regarding process CPU priorities, because they are a bit
counter intuitive. A lower priority value, means that the process will
be scheduled more and whatever it's doing will thus complete faster. The
default priority for processes is 0. Valid values are from -20 to 19
inclusive. On Linux a larger difference between values of two processes
will result in a bigger difference in percentage of scheduling.
**Handling the usecases**
Usecase 1 can be achieved by setting `citus.cpu_priority_for_logical_replication_senders`
to the priority value that you want it to have. It's necessary to set
this both on the workers and the coordinator. Example:
```
citus.cpu_priority_for_logical_replication_senders = -10
```
Usecase 2 can with this PR be achieved by running the following as
superuser. Note that this is only possible as superuser currently
due to the dangers mentioned in the "Be very carefull!!!" section.
And although this is possible it's **NOT** recommended:
```sql
ALTER USER background_job_user SET citus.cpu_priority = 5;
```
**OS configuration**
To actually make these settings work well it's important to run Postgres
with more a more permissive value for the 'nice' resource limit than
Linux will do by default. By default Linux will not allow a process to
set its priority lower than it currently is, even if it was lower when
the process originally started. This capability is necessary to reset
the CPU priority to its original value after a transaction finishes.
Depending on how you run Postgres this needs to be done in one of two
ways:
If you use systemd to start Postgres all you have to do is add a line
like this to the systemd service file:
```conf
LimitNice=+0 # the + is important, otherwise its interpreted incorrectly as 20
```
If that's not the case you'll have to configure `/etc/security/limits.conf`
like so, assuming that you are running Postgres as the `postgres` OS user:
```
postgres soft nice 0
postgres hard nice 0
```
Finally you'd have add the following line to `/etc/pam.d/common-session`
```
session required pam_limits.so
```
These settings would allow to change the priority back after setting it
to a higher value.
However, to actually allow you to set priorities even lower than the
default priority value you would need to change the values in the
config to something lower than 0. So for example:
```conf
LimitNice=-10
```
or
```
postgres soft nice -10
postgres hard nice -10
```
If you use WSL2 you'll likely have to do another thing. You have to
open a new shell, because when PAM is only used during login, and
WSL2 doesn't actually log you in. You can force a login like this:
```
sudo su $USER --shell /bin/bash
```
Source: https://stackoverflow.com/a/68322992/2570866
[1]: https://en.wikipedia.org/wiki/Priority_inversion
The long description of the `citus.distributed_deadlock_detection_factor`
setting was incorrectly stating that 1000 would disable it. Instead -1
is the value that disables distributed deadlock detection.
When introducing non-blocking shard split functionality it was based
heavily on the non-blocking shard moves. However, differences between
usage was slightly to big to be able to reuse the existing functions
easily. So, most logical replication code was simply copied to dedicated
shard split functions and modified for that purpose.
This PR tries to create a more generic logical replication
infrastructure that can be used by both shard splits and shard moves.
There's probably more code sharing possible in the future, but I believe
this is at least a good start and addresses the lowest hanging fruit.
This also adds a CreateSimpleHash function that makes creating the
most common type of hashmap common.
This creates consistent test output for isolation tests that involve
`CREATE INDEX CONCURRENTLY`. `CREATE INDEX CONCURRENTLY` is sometimes
temporarily detected as blocking, even though it will complete without any other
queries needing to be run. This change makes sure that we wait until that happens
without running any other queries in the meantime. This way we always get consistent
output. The way we do that is addressed by using an empty step in the same
session as the `CREATE INDEX CONCURRENLTY` command. Doing so forces
the isolation tester to wait until the command is finished and not continue with
steps from other sessions. This is [the recommended approach by Postgres][1].
There's two separate cases which are addressed in slightly different ways:
1. If `CREATE INDEX CONCURRENTLY` is actually blocked on another session: Add an
empty step right after the commit of blocking session.
e.g. `"s2-ddl-create-index-concurrently" "s1-commit" "s2-empty"`
2. If it's not actually blocked on another session: Add [an asterisk marker][2] to make
it look like it's blocked (because sometimes this happens randomly) and right
after that we add an empty step to trigger waiting.
e.g. `"s2-ddl-create-index-concurrently"(*) "s2-empty" "s1-commit"`
In passing this also enables isolation tests that were disabled due to a
bug that has already been fixed for a while.
Fixes#5993
Related to #5910 and #2966
[1]: 5f0adec253/src/test/isolation/README (L197-L204)
[2]: 5f0adec253/src/test/isolation/README (L174-L179)
Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
Remove .source files
PostgreSQL 15 dropped usage of .source files that are used to generate
.sql and .out files by replacing some placeholders with the actual
values before test runs. Instead, the information is passed from
pg_regress to the .sql and .out files directly via env variables. Those
variables are read via \getenv psql command in relevant test files.
PostgreSQL 15 introduced some changes to pg_regress binary that allowed
this to happen. However this change is not backported to earlier
versions, and thus we come up with a similar mechanism in
pg_regress_multi that works in all supported PG versions.
We also needed to make some changes to `copy` and `\copy` commands.
`\copy` does not support variable interpolation, and we need to store
`\copy` commands in a variable and use that variable in a consecutive
line to let interpolation do its magic.
Relevant PG commits:
- postgres/postgres@33d3eeadb2
adds `\getenv` command to psql.
- postgres/postgres@d1029bb5a2
updates all `.source` files to be supported `sql` or `out` files
without actually renaming them. `pg_regress.c` is patched to set some
environment variables that contain paths to relevant directories, and
the `.source` files use the newly introduced `\getenv` to read those
paths.
- postgres/postgres@dc9c3b0ff2
renames all `.source` files into either `.sql` or `.out` files.
This commit is inspired by a commit
dc9c3b0ff21465fa89d71eecf5e6cc956d647eca from PostgreSQL 15 that shares
the same header.
I also removed some gitignore rules so that I can add some files to git
worktree. We used to ignore the generated files, that are no longer
generated after this commit.
--------------------
Below is the commit message from PostgreSQL 15 commit
dc9c3b0ff21465fa89d71eecf5e6cc956d647eca :
"git mv" all the input/*.source and output/*.source files into
the corresponding sql/ and expected/ directories. Then remove
the pg_regress and Makefile infrastructure associated with
dynamic translation.
Discussion: https://postgr.es/m/1655733.1639871614@sss.pgh.pa.us
This commit is inspired by a commit
d1029bb5a26cb84b116b0dee4dde312291359f2a from PostgreSQL 15 that shares
the same header.
--------------------
Below is the commit message from PostgreSQL 15 commit
d1029bb5a26cb84b116b0dee4dde312291359f2a :
pg_regress has long had provisions for dynamically substituting path
names into regression test scripts and result files, but use of that
feature has always been a serious pain in the neck, mainly because
updating the result files requires tedious manual editing. Let's
get rid of that in favor of passing down the paths in environment
variables.
In addition to being easier to maintain, this way is capable of
dealing with path names that require escaping at runtime, for example
paths containing single-quote marks. (There are other stumbling
blocks in the way of actually building in a path that looks like
that, but removing this one seems like a good thing to do.) The key
coding rule that makes that possible is to concatenate pieces of a
dynamically-variable string using psql's \set command, and then use
the :'variable' notation to quote and escape the string for the next
level of interpretation.
In hopes of making this change more transparent to "git blame",
I've split it into two steps. This commit adds the necessary
pg_regress.c support and changes all the *.source files in-place
so that they no longer require any dynamic translation. The next
commit will just "git mv" them into the regular sql/ and expected/
directories.
Discussion: https://postgr.es/m/1655733.1639871614@sss.pgh.pa.us
PostgreSQL 15 dropped usage of .source files that are used to generate
.sql and .out files by replacing some placeholders with the actual
values before test runs. Instead, the information is passed from
pg_regress to the .sql and .out files directly via env variables. Those
variables are read via \getenv psql command in relevant test files.
PostgreSQL 15 commit d1029bb5a26cb84b116b0dee4dde312291359f2a introduced
some changes to pg_regress binary that allowed this to happen. However
this change is not backported to earlier versions of PG, and thus we
come up with a similar mechanism in pg_regress_multi that works in all
available PG versions.
When using `citus.replicate_reference_tables_on_activate = off`,
reference tables need to be replicated later. This can be done using the
`replicate_reference_tables()` UDF. However, this function only allowed
blocking replication. This changes the function to default to logical
replication instead, and allows choosing any of our existing shard
transfer modes.
DESCRIPTION: Use faster custom copy logic for non-blocking shard moves
Non-blocking shard moves consist of two main phases:
1. Initial data copy
2. Catchup phase
This changes the first of these phases significantly. Previously we used the
copy logic provided by postgres subscriptions. This meant we didn't have
to implement it ourselves, but it came with the downside of little control.
When implementing shard splits we needed more control to even make it
work, so we implemented our own logic for copying data between nodes.
This PR starts using that logic for non-blocking shard moves. Doing so
has four main advantages:
1. It uses COPY in binary format when possible, which is cheaper to encode
and decode. Furthermore it very often results in less data that needs to
be sent over the network.
2. It allows us to create the primary key (or other replica identity) after doing
the initial data copy. This should give some speed up over the total run,
because creating an index is bulk is much faster than incrementally building it.
3. It doesn't require a replication slot per parallel copy. Increasing the maximum
number of replication slots uses resources in postgres, even if they are not used.
So reducing the number of replication slots that shard moves need is nice.
4. Logical replication table_sync workers are slow to start up, so if lots of shards
need to be copied that can make it quite slow. This can happen easily when
combining Postgres partitioning with Citus.