Sometimes this multi_utilities would fail with the following error:
```diff
SET citus.log_remote_commands TO ON;
-- should propagate to all workers because no table is specified
ANALYZE;
NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
DETAIL: on server postgres@localhost:57637 connectionId: 1
NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
DETAIL: on server postgres@localhost:57638 connectionId: 2
NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
DETAIL: on server postgres@localhost:57637 connectionId: 1
-NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
-DETAIL: on server postgres@localhost:xxxxx connectionId: xxxxxxx
NOTICE: issuing ANALYZE
DETAIL: on server postgres@localhost:57637 connectionId: 1
+NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
+DETAIL: on server postgres@localhost:57638 connectionId: 2
NOTICE: issuing ANALYZE
DETAIL: on server postgres@localhost:57638 connectionId: 2
```
This is simply a harmless change in output due to some timing
differences. This PR makes the test output consistent by only logging
the remote ANALYZE commands, not the SET commands.
This fixes our most commonly randomly failing failure test. The failing
diff is as follows:
```diff
SELECT citus.mitmproxy('conn.onQuery(query="fetch_intermediate_results").kill()');
mitmproxy
-----------
(1 row)
INSERT INTO target_table SELECT * FROM source_table;
-ERROR: connection to the remote node localhost:xxxxx failed with the following error: connection not open
+ERROR: could not open file "base/pgsql_job_cache/10_0_40/repartitioned_results_20770193413_from_4213590_to_1.data": No such file or directory
+CONTEXT: while executing command on localhost:9060
+while executing command on localhost:57637
SELECT * FROM target_table ORDER BY a;
```
As far as I can tell this is the cause of a race condition: After killing
fetch_intermediate_results on worker 9060, the previously created data
file gets cleaned up. The fetch_intermediate_results call that's sent
to worker 57637 will be cancelled and rolled back soon because of the
failure on the other connection. But if that fetch_intermediate_results
call is able to connect to 9060 before it is cancelled, it won't find
the file it's looking for there anymore. So while it's not the error we
expect, it does indicate that we succeeded.
To avoid this issue instead of killing the fetch_intermediate_results
call directly, we kill the COPY command that it uses to do the fetch.
This results in stable output as can be seen here, where 227 runs of
failure_insert_select_repartition succeeded:
https://app.circleci.com/pipelines/github/citusdata/citus/26168/workflows/9c64a3b6-f46c-4725-9fb4-8f6a2d00a023/jobs/739389
To be clear this changes the test to affects the opposite
fetch_intermediate_results call. This kills the fetch_intermediate_results
call of worker 57637, instead of killing the fetch_intermediate_results call
on worker 9060.
Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26147/workflows/780e95ea-264a-4c9f-ad2e-cf11449a795e/jobs/738467
We're in the processes of totally changing the shard rebalancer
experience and infrastructure. Soon the shard rebalancer will include
retries, crash recovery and support for running in the background.
These improvements come at a cost though, the way the
get_rebalance_progress UDF currently works is very hard to replicate
with this new structure. This is mostly because the old behaviour
doesn't really make sense anymore with this new infrastructure. A new
and better way to track the progress will be included as part of the new
infrastructure.
This PR is in preparation of the new code rebalancer experience.
It changes the get_rebalance_progress UDF to only display the moves that
are in progress at the moment, not the ones that happened in the past or
that are planned in the future. Another option would have been to
completely remove the current get_rebalance_progress functionality and
point people to the new way of tracking progress. But old blogposts
still reference the old UDF and users might have some automation on top
of it. Showing the progress of the current moves is fairly simple to
achieve, even with the new infrastructure.
So this PR is a kind of compromise: It doesn't have complete feature
parity with the old get_rebalance_progress, but the most common use
cases will still work.
There's also an advantage of the change: You can now see progress of
shard moves that were triggered by calling citus_move_shard_placement
manually. Instead of only being able to see progress of moves that were
initiated using get_rebalance_table_shards.
We used to rely on a seperate session to add the coordinator.
However, that might prevent the existing sessions to get
assigned proper gpids, which causes flaky tests.
This removes a flaky test that I introduced in #3868 after I fixed the
issue described in #3622. This test is sometimes fails randomly in CI.
The way it fails indicates that there might be some bug: A connection
breaks after rolling back to a savepoint.
I tried reproducing this issue locally, but I wasn't able to. I don't
understand what causes the failure.
Things that I tried were:
1. Running the test with:
```sql
SET citus.force_max_query_parallelization = true;
```
2. Running the test with:
```sql
SET citus.max_adaptive_executor_pool_size = 1;
```
3. Running the test in parallel with the same tests that it is run in
parallel with in multi_schedule.
None of these allowed me to reproduce the issue locally.
So I think it's time to give on fixing this test and simply remove the
test. The regression that this test protects against seems very unlikely
to reappear, since in #3868 I also added a big comment about the need
for the newly added `UnclaimConnection` call. So, I think the need for
the test is quite small, and removing it will make our CI less flaky.
In case the cause of the bug ever gets found, I tracked the bug in #6189
Example of a failing CI run:
https://app.circleci.com/pipelines/github/citusdata/citus/26098/workflows/f84741d9-13b1-4ae7-9155-c21ed3466951/jobs/736424
For reference the unexpected diff is this (so both warnings and an error):
```diff
INSERT INTO t SELECT i FROM generate_series(1, 100) i;
+WARNING: connection to the remote node localhost:57638 failed with the following error:
+WARNING:
+CONTEXT: while executing command on localhost:57638
+ERROR: connection to the remote node localhost:57638 failed with the following error:
ROLLBACK;
```
This test is also mentioned as the most failing regression test in #5975
There are 3 different ways that a sequence can be interacting
with tables. (1) and (2) are already supported. This commit adds
support for (3).
(1) column DEFAULT nextval('seq'):
The dependency is roughly like below,
and ExpandCitusSupportedTypes() is responsible
for finding the depending sequences.
schema <--- table <--- column <---- default value
^ |
|------------------ sequence <--------|
(2) serial columns: Bigserial/small serial etc:
The dependency is roughly like below,
and ExpandCitusSupportedTypes() is responsible
for finding the depending sequences.
schema <--- table <--- column <---- default value
^ |
| |
sequence <--------|
(3) Sequence OWNED BY table.column: Added support for
this type of resolution in this commit.
The dependency is almost like the following, and
ExpandCitusSupportedTypes() is NOT responsible for finding
the dependency.
schema <--- table <--- column
^
|
sequence
Object type ids have changed in PG15 because of at least two added
objects in the list: OBJECT_PARAMETER_ACL, OBJECT_PUBLICATION_NAMESPACE
To avoid different output between pg versions, let's use the object
name in the error, and put the object id in the error detail.
Relevant PG commits:
a0ffa885e478f5eeacc4e250e35ce25a4740c487
5a2832465fd8984d089e8c44c094e6900d987fcd
DESCRIPTION: Fix reference table lock contention
Dropping and creating reference tables unintentionally blocked on each other due to the use of an ExclusiveLock for both the Drop and conditionally copying existing reference tables to (new) nodes.
The patch does the following:
- Lower lock lever for dropping (reference) tables to `ShareLock` so they don't self conflict
- Treat reference tables and distributed tables equally and acquire the colocation lock when dropping any table that is in a colocation group
- Perform the precondition check for copying reference tables twice, first time with a lower lock that doesn't conflict with anything. Could have been a NoLock, however, in preparation for dropping a colocation group, it is an `AccessShareLock`
During normal operation the first check will always pass and we don't have to escalate that lock. Making it that we won't be blocked on adding and remove reference tables. Only after a node addition the first `create_reference_table` will still need to acquire an `ExclusiveLock` on the colocation group to perform the copy.
This is a refactoring PR that starts using our new hash table creation
helper function. It adds a few more macros for ease of use, because C
doesn't have default arguments. It also adds a macro to check if a
struct contains automatic padding bytes. No struct that is hashed using
tag_hash should have automatic padding bytes, because those bytes are
undefined and thus using them to create a hash will result in undefined
behaviour (usually a random hash).
**Intro**
This adds support to Citus to change the CPU priority values of
backends. This is created with two main usecases in mind:
1. Users might want to run the logical replication part of the shard moves
or shard splits at a higher speed than they would do by themselves.
This might cause some small loss of DB performance for their regular
queries, but this is often worth it. During high load it's very possible
that the logical replication WAL sender is not able to keep up with the
WAL that is generated. This is especially a big problem when the
machine is close to running out of disk when doing a rebalance.
2. Users might have certain long running queries that they don't impact
their regular workload too much.
**Be very careful!!!**
Using CPU priorities to control scheduling can be helpful in some cases
to control which processes are getting more CPU time than others.
However, due to an issue called "[priority inversion][1]" it's possible that
using CPU priorities together with the many locks that are used within
Postgres cause the exact opposite behavior of what you intended. This
is why this PR only allows the PG superuser to change the CPU priority
of its own processes. Currently it's not recommended to set `citus.cpu_priority`
directly. Currently the only recommended interface for users is the setting
called `citus.cpu_priority_for_logical_replication_senders`. This setting
controls CPU priority for a very limited set of processes (the logical
replication senders). So, the dangers of priority inversion are also limited
with when using it for this usecase.
**Background**
Before reading the rest it's important to understand some basic
background regarding process CPU priorities, because they are a bit
counter intuitive. A lower priority value, means that the process will
be scheduled more and whatever it's doing will thus complete faster. The
default priority for processes is 0. Valid values are from -20 to 19
inclusive. On Linux a larger difference between values of two processes
will result in a bigger difference in percentage of scheduling.
**Handling the usecases**
Usecase 1 can be achieved by setting `citus.cpu_priority_for_logical_replication_senders`
to the priority value that you want it to have. It's necessary to set
this both on the workers and the coordinator. Example:
```
citus.cpu_priority_for_logical_replication_senders = -10
```
Usecase 2 can with this PR be achieved by running the following as
superuser. Note that this is only possible as superuser currently
due to the dangers mentioned in the "Be very carefull!!!" section.
And although this is possible it's **NOT** recommended:
```sql
ALTER USER background_job_user SET citus.cpu_priority = 5;
```
**OS configuration**
To actually make these settings work well it's important to run Postgres
with more a more permissive value for the 'nice' resource limit than
Linux will do by default. By default Linux will not allow a process to
set its priority lower than it currently is, even if it was lower when
the process originally started. This capability is necessary to reset
the CPU priority to its original value after a transaction finishes.
Depending on how you run Postgres this needs to be done in one of two
ways:
If you use systemd to start Postgres all you have to do is add a line
like this to the systemd service file:
```conf
LimitNice=+0 # the + is important, otherwise its interpreted incorrectly as 20
```
If that's not the case you'll have to configure `/etc/security/limits.conf`
like so, assuming that you are running Postgres as the `postgres` OS user:
```
postgres soft nice 0
postgres hard nice 0
```
Finally you'd have add the following line to `/etc/pam.d/common-session`
```
session required pam_limits.so
```
These settings would allow to change the priority back after setting it
to a higher value.
However, to actually allow you to set priorities even lower than the
default priority value you would need to change the values in the
config to something lower than 0. So for example:
```conf
LimitNice=-10
```
or
```
postgres soft nice -10
postgres hard nice -10
```
If you use WSL2 you'll likely have to do another thing. You have to
open a new shell, because when PAM is only used during login, and
WSL2 doesn't actually log you in. You can force a login like this:
```
sudo su $USER --shell /bin/bash
```
Source: https://stackoverflow.com/a/68322992/2570866
[1]: https://en.wikipedia.org/wiki/Priority_inversion
The long description of the `citus.distributed_deadlock_detection_factor`
setting was incorrectly stating that 1000 would disable it. Instead -1
is the value that disables distributed deadlock detection.
When introducing non-blocking shard split functionality it was based
heavily on the non-blocking shard moves. However, differences between
usage was slightly to big to be able to reuse the existing functions
easily. So, most logical replication code was simply copied to dedicated
shard split functions and modified for that purpose.
This PR tries to create a more generic logical replication
infrastructure that can be used by both shard splits and shard moves.
There's probably more code sharing possible in the future, but I believe
this is at least a good start and addresses the lowest hanging fruit.
This also adds a CreateSimpleHash function that makes creating the
most common type of hashmap common.
This creates consistent test output for isolation tests that involve
`CREATE INDEX CONCURRENTLY`. `CREATE INDEX CONCURRENTLY` is sometimes
temporarily detected as blocking, even though it will complete without any other
queries needing to be run. This change makes sure that we wait until that happens
without running any other queries in the meantime. This way we always get consistent
output. The way we do that is addressed by using an empty step in the same
session as the `CREATE INDEX CONCURRENLTY` command. Doing so forces
the isolation tester to wait until the command is finished and not continue with
steps from other sessions. This is [the recommended approach by Postgres][1].
There's two separate cases which are addressed in slightly different ways:
1. If `CREATE INDEX CONCURRENTLY` is actually blocked on another session: Add an
empty step right after the commit of blocking session.
e.g. `"s2-ddl-create-index-concurrently" "s1-commit" "s2-empty"`
2. If it's not actually blocked on another session: Add [an asterisk marker][2] to make
it look like it's blocked (because sometimes this happens randomly) and right
after that we add an empty step to trigger waiting.
e.g. `"s2-ddl-create-index-concurrently"(*) "s2-empty" "s1-commit"`
In passing this also enables isolation tests that were disabled due to a
bug that has already been fixed for a while.
Fixes#5993
Related to #5910 and #2966
[1]: 5f0adec253/src/test/isolation/README (L197-L204)
[2]: 5f0adec253/src/test/isolation/README (L174-L179)
Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
Remove .source files
PostgreSQL 15 dropped usage of .source files that are used to generate
.sql and .out files by replacing some placeholders with the actual
values before test runs. Instead, the information is passed from
pg_regress to the .sql and .out files directly via env variables. Those
variables are read via \getenv psql command in relevant test files.
PostgreSQL 15 introduced some changes to pg_regress binary that allowed
this to happen. However this change is not backported to earlier
versions, and thus we come up with a similar mechanism in
pg_regress_multi that works in all supported PG versions.
We also needed to make some changes to `copy` and `\copy` commands.
`\copy` does not support variable interpolation, and we need to store
`\copy` commands in a variable and use that variable in a consecutive
line to let interpolation do its magic.
Relevant PG commits:
- postgres/postgres@33d3eeadb2
adds `\getenv` command to psql.
- postgres/postgres@d1029bb5a2
updates all `.source` files to be supported `sql` or `out` files
without actually renaming them. `pg_regress.c` is patched to set some
environment variables that contain paths to relevant directories, and
the `.source` files use the newly introduced `\getenv` to read those
paths.
- postgres/postgres@dc9c3b0ff2
renames all `.source` files into either `.sql` or `.out` files.
This commit is inspired by a commit
dc9c3b0ff21465fa89d71eecf5e6cc956d647eca from PostgreSQL 15 that shares
the same header.
I also removed some gitignore rules so that I can add some files to git
worktree. We used to ignore the generated files, that are no longer
generated after this commit.
--------------------
Below is the commit message from PostgreSQL 15 commit
dc9c3b0ff21465fa89d71eecf5e6cc956d647eca :
"git mv" all the input/*.source and output/*.source files into
the corresponding sql/ and expected/ directories. Then remove
the pg_regress and Makefile infrastructure associated with
dynamic translation.
Discussion: https://postgr.es/m/1655733.1639871614@sss.pgh.pa.us