The failure_create_distributed_table_non_empty test would sometimes fail
like this:
```diff
-- in the first test, cancel the first connection we sent from the coordinator
SELECT citus.mitmproxy('conn.cancel(' || pg_backend_pid() || ')');
- mitmproxy
----------------------------------------------------------------------
-
-(1 row)
-
+ERROR: canceling statement due to user request
+CONTEXT: COPY mitmproxy_result, line 1: ""
+SQL statement "COPY mitmproxy_result FROM '/home/circleci/project/src/test/regress/tmp_check/mitmproxy.fifo'"
+PL/pgSQL function citus.mitmproxy(text) line 11 at EXECUTE
SELECT create_distributed_table('test_table', 'id');
```
Because the cancel command had no filter it would actually sometimes
cancel the mitmproxy cancel command itself. This PR addresses that by
filtering on CREATE TABLE, which is one of the command that
create_distributed_table will send to the workers.
Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26252/workflows/1b7e5464-cca4-4ec1-99b3-48ddf25c29fa/jobs/742829
Sometimes in CI the columnar_memory test was using slightly more memory
than expected.
```diff
SELECT CASE WHEN 1.0 * TopMemoryContext / :top_post BETWEEN 0.98 AND 1.02 THEN 1 ELSE 1.0 * TopMemoryContext / :top_post END AS top_growth
FROM columnar_test_helpers.columnar_store_memory_stats();
--[ RECORD 1 ]-
-top_growth | 1
+-[ RECORD 1 ]------------------
+top_growth | 1.0206132116232119
-- before this change, max mem usage while executing inserts was 28MB and
```
This PR changes the expectation to be slightly higher, such that this
random increase in memory usage doesn't cause a flaky test.
Failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26256/workflows/c0870f66-3346-4f8d-a1d3-36dfd7c98289/jobs/743028
In the logical_replication test we test that the cleanup logic at the
start of a shard move works as expected. To do so we create a
subscription and publication slot manually. This changes the test to
make that subscription actually connect to the database that the
publication is in.
Useful for #5987#6085
By running isolation tests in parallel we're just asking for flaky
tasks. The first test might temporarily block one of the commands in the
second test, which we then detect as waiting like this:
```diff
step s2-vacuum-analyze:
VACUUM ANALYZE test_insert_vacuum;
-
+ <waiting ...>
step s1-commit:
COMMIT;
+step s2-vacuum-analyze: <... completed>
```
Debugging flaky tests is also much harder when they are run in parallel.
This PR starts running all our isolation tests sequentially.
The reason for opening this PR was me seeing this failing test:
https://app.circleci.com/pipelines/github/citusdata/citus/26194/workflows/ff57e2cf-8ac4-40fe-bc0c-74a7f8fecb53/jobs/740454
As well as having fixed a similar issue recently in #6122
* Adjust some isolation test for the recent PG commits
In 3f32395612,
Postgres starts any isolation session with `set application_name`.
However, one of the tests we had expected that it is exactly the first
command in the session. The test tries to show that even if a gpid
has not been assigned, we can show it in the citus_lock_waits graph.
Now that, it is literally not possible to have such test as gpid
would be assigned after `set application_name` command. Still,
it is good to have a test where a command is blocked on the parser
Sometimes the columnar_memory test fails in CI with the following error:
```diff
SELECT 1.0 * TopMemoryContext / :top_post BETWEEN 0.98 AND 1.02 AS top_growth_ok
FROM columnar_test_helpers.columnar_store_memory_stats();
-[ RECORD 1 ]-+--
-top_growth_ok | t
+top_growth_ok | f
-- before this change, max mem usage while executing inserts was 28MB and
```
This is almost certainly a harmless failure that simply requires bumping
the margin a little bit. However, it's impossible to say with the
current output. I was unable to reproduce this on-demand on my local
machine or even in CI. So this changes the test to include the actual
value difference in the size of TopMemoryContext when it's outside the
expected range. Then next time it fails we at least have some
information about why.
Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/25966/workflows/d472a57b-419a-4f33-b8bc-2e174a98d4d6/jobs/730576
As shown in #6196 the output of s1-view-locks is sometimes not as
expected. However, because it's output is very minimal it's hard to
understand the reason for that. This adds some more columns and
aggregates less, so we can more easily see what locks are unexpectedly
held or released.
In passing this also fixes the following flaky part of this test by excluding
locks taken by the maintenance daemon. After running it with this more
detailed output for s1-view-locks it became obvious that that was the
problem here.
```diff
diff -dU10 -w /home/jelte/work/citus/src/test/regress/expected/isolation_ref2ref_foreign_keys.out /home/jelte/work/citus/src/test/regress/results/isolation_ref2ref_foreign_keys.out
--- /home/jelte/work/citus/src/test/regress/expected/isolation_ref2ref_foreign_keys.out.modified 2022-08-18 15:42:08.689525233 +0200
+++ /home/jelte/work/citus/src/test/regress/results/isolation_ref2ref_foreign_keys.out.modified 2022-08-18 15:42:08.729525233 +0200
@@ -288,21 +288,22 @@
step s1-view-locks:
SELECT mode, count(*)
FROM pg_locks
WHERE locktype='advisory'
GROUP BY mode
ORDER BY 1, 2;
mode |count
------------------------+-----
-(0 rows)
+ShareUpdateExclusiveLock| 1
+(1 row)
starting permutation: s2-begin s2-insert-table-3 s1-view-locks s2-rollback s1-view-locks
step s2-begin:
BEGIN;
step s2-insert-table-3:
INSERT INTO ref_table_3 VALUES (7, 5);
step s1-view-locks:
```
In CI sometimes failure_setup will fail with the following error:
```diff
SELECT master_add_node('localhost', :worker_2_proxy_port); -- an mitmproxy which forwards to the second worker
- master_add_node
----------------------------------------------------------------------
- 2
-(1 row)
-
+ERROR: connection to the remote node localhost:9060 failed with the following error: could not connect to server: Connection refused
+ Is the server running on host "localhost" (127.0.0.1) and accepting
+ TCP/IP connections on port 9060?
+could not connect to server: Connection refused
+ Is the server running on host "localhost" (127.0.0.1) and accepting
+ TCP/IP connections on port 9060?
+could not connect to server: Cannot assign requested address
+ Is the server running on host "localhost" (::1) and accepting
+ TCP/IP connections on port 9060?
diff -dU10 -w /home/circleci/project/src/test/regress/expected/failure_online_move_shard_placement.out /home/circleci/project/src/test/regress/results/failure_online_move_shard_placement.out
```
This then breaks all the tests run after it as well, because we're
missing one worker node.
Locally I was able to reproduce this error by sleeping for 10 seconds in
the forked process sleep before actually starting mitmproxy. So I'm
expecting what's happening in CI is that due to limited resources,
mitmproxy is not up yet when we try to add its port as a workernode.
This PR fixes this by waiting until mitmproxy is listening on its socket
before actually starting to run our tests. This fixed it locally for me
when I made the forked process sleep for 10 seconds before starting
mitmproxy.
In passing it also improves the detection and errors that we already
had for the case where something was already listening on the
mitmproxy port.
Because both @gledis69 and me were changing things in our CI images
at the same time this also includes a bump of the style checker tools.
Closes#6200
This removes some warnings that are present when building on Ubuntu 22.04.
It removes warnings on PG13 + OpenSSL 3.0. OpenSSL 3.0 has marked some
functions that we use as deprecated, but we want to continue support OpenSSL
1.0.1 for the time being too. This indicates that to OpenSSL 3.0, so it doesn't
show warnings.
Sometimes this multi_utilities would fail with the following error:
```diff
SET citus.log_remote_commands TO ON;
-- should propagate to all workers because no table is specified
ANALYZE;
NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
DETAIL: on server postgres@localhost:57637 connectionId: 1
NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
DETAIL: on server postgres@localhost:57638 connectionId: 2
NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
DETAIL: on server postgres@localhost:57637 connectionId: 1
-NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
-DETAIL: on server postgres@localhost:xxxxx connectionId: xxxxxxx
NOTICE: issuing ANALYZE
DETAIL: on server postgres@localhost:57637 connectionId: 1
+NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
+DETAIL: on server postgres@localhost:57638 connectionId: 2
NOTICE: issuing ANALYZE
DETAIL: on server postgres@localhost:57638 connectionId: 2
```
This is simply a harmless change in output due to some timing
differences. This PR makes the test output consistent by only logging
the remote ANALYZE commands, not the SET commands.
This fixes our most commonly randomly failing failure test. The failing
diff is as follows:
```diff
SELECT citus.mitmproxy('conn.onQuery(query="fetch_intermediate_results").kill()');
mitmproxy
-----------
(1 row)
INSERT INTO target_table SELECT * FROM source_table;
-ERROR: connection to the remote node localhost:xxxxx failed with the following error: connection not open
+ERROR: could not open file "base/pgsql_job_cache/10_0_40/repartitioned_results_20770193413_from_4213590_to_1.data": No such file or directory
+CONTEXT: while executing command on localhost:9060
+while executing command on localhost:57637
SELECT * FROM target_table ORDER BY a;
```
As far as I can tell this is the cause of a race condition: After killing
fetch_intermediate_results on worker 9060, the previously created data
file gets cleaned up. The fetch_intermediate_results call that's sent
to worker 57637 will be cancelled and rolled back soon because of the
failure on the other connection. But if that fetch_intermediate_results
call is able to connect to 9060 before it is cancelled, it won't find
the file it's looking for there anymore. So while it's not the error we
expect, it does indicate that we succeeded.
To avoid this issue instead of killing the fetch_intermediate_results
call directly, we kill the COPY command that it uses to do the fetch.
This results in stable output as can be seen here, where 227 runs of
failure_insert_select_repartition succeeded:
https://app.circleci.com/pipelines/github/citusdata/citus/26168/workflows/9c64a3b6-f46c-4725-9fb4-8f6a2d00a023/jobs/739389
To be clear this changes the test to affects the opposite
fetch_intermediate_results call. This kills the fetch_intermediate_results
call of worker 57637, instead of killing the fetch_intermediate_results call
on worker 9060.
Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26147/workflows/780e95ea-264a-4c9f-ad2e-cf11449a795e/jobs/738467
We're in the processes of totally changing the shard rebalancer
experience and infrastructure. Soon the shard rebalancer will include
retries, crash recovery and support for running in the background.
These improvements come at a cost though, the way the
get_rebalance_progress UDF currently works is very hard to replicate
with this new structure. This is mostly because the old behaviour
doesn't really make sense anymore with this new infrastructure. A new
and better way to track the progress will be included as part of the new
infrastructure.
This PR is in preparation of the new code rebalancer experience.
It changes the get_rebalance_progress UDF to only display the moves that
are in progress at the moment, not the ones that happened in the past or
that are planned in the future. Another option would have been to
completely remove the current get_rebalance_progress functionality and
point people to the new way of tracking progress. But old blogposts
still reference the old UDF and users might have some automation on top
of it. Showing the progress of the current moves is fairly simple to
achieve, even with the new infrastructure.
So this PR is a kind of compromise: It doesn't have complete feature
parity with the old get_rebalance_progress, but the most common use
cases will still work.
There's also an advantage of the change: You can now see progress of
shard moves that were triggered by calling citus_move_shard_placement
manually. Instead of only being able to see progress of moves that were
initiated using get_rebalance_table_shards.
We used to rely on a seperate session to add the coordinator.
However, that might prevent the existing sessions to get
assigned proper gpids, which causes flaky tests.
This removes a flaky test that I introduced in #3868 after I fixed the
issue described in #3622. This test is sometimes fails randomly in CI.
The way it fails indicates that there might be some bug: A connection
breaks after rolling back to a savepoint.
I tried reproducing this issue locally, but I wasn't able to. I don't
understand what causes the failure.
Things that I tried were:
1. Running the test with:
```sql
SET citus.force_max_query_parallelization = true;
```
2. Running the test with:
```sql
SET citus.max_adaptive_executor_pool_size = 1;
```
3. Running the test in parallel with the same tests that it is run in
parallel with in multi_schedule.
None of these allowed me to reproduce the issue locally.
So I think it's time to give on fixing this test and simply remove the
test. The regression that this test protects against seems very unlikely
to reappear, since in #3868 I also added a big comment about the need
for the newly added `UnclaimConnection` call. So, I think the need for
the test is quite small, and removing it will make our CI less flaky.
In case the cause of the bug ever gets found, I tracked the bug in #6189
Example of a failing CI run:
https://app.circleci.com/pipelines/github/citusdata/citus/26098/workflows/f84741d9-13b1-4ae7-9155-c21ed3466951/jobs/736424
For reference the unexpected diff is this (so both warnings and an error):
```diff
INSERT INTO t SELECT i FROM generate_series(1, 100) i;
+WARNING: connection to the remote node localhost:57638 failed with the following error:
+WARNING:
+CONTEXT: while executing command on localhost:57638
+ERROR: connection to the remote node localhost:57638 failed with the following error:
ROLLBACK;
```
This test is also mentioned as the most failing regression test in #5975
There are 3 different ways that a sequence can be interacting
with tables. (1) and (2) are already supported. This commit adds
support for (3).
(1) column DEFAULT nextval('seq'):
The dependency is roughly like below,
and ExpandCitusSupportedTypes() is responsible
for finding the depending sequences.
schema <--- table <--- column <---- default value
^ |
|------------------ sequence <--------|
(2) serial columns: Bigserial/small serial etc:
The dependency is roughly like below,
and ExpandCitusSupportedTypes() is responsible
for finding the depending sequences.
schema <--- table <--- column <---- default value
^ |
| |
sequence <--------|
(3) Sequence OWNED BY table.column: Added support for
this type of resolution in this commit.
The dependency is almost like the following, and
ExpandCitusSupportedTypes() is NOT responsible for finding
the dependency.
schema <--- table <--- column
^
|
sequence
Object type ids have changed in PG15 because of at least two added
objects in the list: OBJECT_PARAMETER_ACL, OBJECT_PUBLICATION_NAMESPACE
To avoid different output between pg versions, let's use the object
name in the error, and put the object id in the error detail.
Relevant PG commits:
a0ffa885e478f5eeacc4e250e35ce25a4740c487
5a2832465fd8984d089e8c44c094e6900d987fcd
DESCRIPTION: Fix reference table lock contention
Dropping and creating reference tables unintentionally blocked on each other due to the use of an ExclusiveLock for both the Drop and conditionally copying existing reference tables to (new) nodes.
The patch does the following:
- Lower lock lever for dropping (reference) tables to `ShareLock` so they don't self conflict
- Treat reference tables and distributed tables equally and acquire the colocation lock when dropping any table that is in a colocation group
- Perform the precondition check for copying reference tables twice, first time with a lower lock that doesn't conflict with anything. Could have been a NoLock, however, in preparation for dropping a colocation group, it is an `AccessShareLock`
During normal operation the first check will always pass and we don't have to escalate that lock. Making it that we won't be blocked on adding and remove reference tables. Only after a node addition the first `create_reference_table` will still need to acquire an `ExclusiveLock` on the colocation group to perform the copy.
This is a refactoring PR that starts using our new hash table creation
helper function. It adds a few more macros for ease of use, because C
doesn't have default arguments. It also adds a macro to check if a
struct contains automatic padding bytes. No struct that is hashed using
tag_hash should have automatic padding bytes, because those bytes are
undefined and thus using them to create a hash will result in undefined
behaviour (usually a random hash).
**Intro**
This adds support to Citus to change the CPU priority values of
backends. This is created with two main usecases in mind:
1. Users might want to run the logical replication part of the shard moves
or shard splits at a higher speed than they would do by themselves.
This might cause some small loss of DB performance for their regular
queries, but this is often worth it. During high load it's very possible
that the logical replication WAL sender is not able to keep up with the
WAL that is generated. This is especially a big problem when the
machine is close to running out of disk when doing a rebalance.
2. Users might have certain long running queries that they don't impact
their regular workload too much.
**Be very careful!!!**
Using CPU priorities to control scheduling can be helpful in some cases
to control which processes are getting more CPU time than others.
However, due to an issue called "[priority inversion][1]" it's possible that
using CPU priorities together with the many locks that are used within
Postgres cause the exact opposite behavior of what you intended. This
is why this PR only allows the PG superuser to change the CPU priority
of its own processes. Currently it's not recommended to set `citus.cpu_priority`
directly. Currently the only recommended interface for users is the setting
called `citus.cpu_priority_for_logical_replication_senders`. This setting
controls CPU priority for a very limited set of processes (the logical
replication senders). So, the dangers of priority inversion are also limited
with when using it for this usecase.
**Background**
Before reading the rest it's important to understand some basic
background regarding process CPU priorities, because they are a bit
counter intuitive. A lower priority value, means that the process will
be scheduled more and whatever it's doing will thus complete faster. The
default priority for processes is 0. Valid values are from -20 to 19
inclusive. On Linux a larger difference between values of two processes
will result in a bigger difference in percentage of scheduling.
**Handling the usecases**
Usecase 1 can be achieved by setting `citus.cpu_priority_for_logical_replication_senders`
to the priority value that you want it to have. It's necessary to set
this both on the workers and the coordinator. Example:
```
citus.cpu_priority_for_logical_replication_senders = -10
```
Usecase 2 can with this PR be achieved by running the following as
superuser. Note that this is only possible as superuser currently
due to the dangers mentioned in the "Be very carefull!!!" section.
And although this is possible it's **NOT** recommended:
```sql
ALTER USER background_job_user SET citus.cpu_priority = 5;
```
**OS configuration**
To actually make these settings work well it's important to run Postgres
with more a more permissive value for the 'nice' resource limit than
Linux will do by default. By default Linux will not allow a process to
set its priority lower than it currently is, even if it was lower when
the process originally started. This capability is necessary to reset
the CPU priority to its original value after a transaction finishes.
Depending on how you run Postgres this needs to be done in one of two
ways:
If you use systemd to start Postgres all you have to do is add a line
like this to the systemd service file:
```conf
LimitNice=+0 # the + is important, otherwise its interpreted incorrectly as 20
```
If that's not the case you'll have to configure `/etc/security/limits.conf`
like so, assuming that you are running Postgres as the `postgres` OS user:
```
postgres soft nice 0
postgres hard nice 0
```
Finally you'd have add the following line to `/etc/pam.d/common-session`
```
session required pam_limits.so
```
These settings would allow to change the priority back after setting it
to a higher value.
However, to actually allow you to set priorities even lower than the
default priority value you would need to change the values in the
config to something lower than 0. So for example:
```conf
LimitNice=-10
```
or
```
postgres soft nice -10
postgres hard nice -10
```
If you use WSL2 you'll likely have to do another thing. You have to
open a new shell, because when PAM is only used during login, and
WSL2 doesn't actually log you in. You can force a login like this:
```
sudo su $USER --shell /bin/bash
```
Source: https://stackoverflow.com/a/68322992/2570866
[1]: https://en.wikipedia.org/wiki/Priority_inversion
The long description of the `citus.distributed_deadlock_detection_factor`
setting was incorrectly stating that 1000 would disable it. Instead -1
is the value that disables distributed deadlock detection.
When introducing non-blocking shard split functionality it was based
heavily on the non-blocking shard moves. However, differences between
usage was slightly to big to be able to reuse the existing functions
easily. So, most logical replication code was simply copied to dedicated
shard split functions and modified for that purpose.
This PR tries to create a more generic logical replication
infrastructure that can be used by both shard splits and shard moves.
There's probably more code sharing possible in the future, but I believe
this is at least a good start and addresses the lowest hanging fruit.
This also adds a CreateSimpleHash function that makes creating the
most common type of hashmap common.
This creates consistent test output for isolation tests that involve
`CREATE INDEX CONCURRENTLY`. `CREATE INDEX CONCURRENTLY` is sometimes
temporarily detected as blocking, even though it will complete without any other
queries needing to be run. This change makes sure that we wait until that happens
without running any other queries in the meantime. This way we always get consistent
output. The way we do that is addressed by using an empty step in the same
session as the `CREATE INDEX CONCURRENLTY` command. Doing so forces
the isolation tester to wait until the command is finished and not continue with
steps from other sessions. This is [the recommended approach by Postgres][1].
There's two separate cases which are addressed in slightly different ways:
1. If `CREATE INDEX CONCURRENTLY` is actually blocked on another session: Add an
empty step right after the commit of blocking session.
e.g. `"s2-ddl-create-index-concurrently" "s1-commit" "s2-empty"`
2. If it's not actually blocked on another session: Add [an asterisk marker][2] to make
it look like it's blocked (because sometimes this happens randomly) and right
after that we add an empty step to trigger waiting.
e.g. `"s2-ddl-create-index-concurrently"(*) "s2-empty" "s1-commit"`
In passing this also enables isolation tests that were disabled due to a
bug that has already been fixed for a while.
Fixes#5993
Related to #5910 and #2966
[1]: 5f0adec253/src/test/isolation/README (L197-L204)
[2]: 5f0adec253/src/test/isolation/README (L174-L179)
Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
This commit is inspired by a commit
dc9c3b0ff21465fa89d71eecf5e6cc956d647eca from PostgreSQL 15 that shares
the same header.
I also removed some gitignore rules so that I can add some files to git
worktree. We used to ignore the generated files, that are no longer
generated after this commit.
--------------------
Below is the commit message from PostgreSQL 15 commit
dc9c3b0ff21465fa89d71eecf5e6cc956d647eca :
"git mv" all the input/*.source and output/*.source files into
the corresponding sql/ and expected/ directories. Then remove
the pg_regress and Makefile infrastructure associated with
dynamic translation.
Discussion: https://postgr.es/m/1655733.1639871614@sss.pgh.pa.us
This commit is inspired by a commit
d1029bb5a26cb84b116b0dee4dde312291359f2a from PostgreSQL 15 that shares
the same header.
--------------------
Below is the commit message from PostgreSQL 15 commit
d1029bb5a26cb84b116b0dee4dde312291359f2a :
pg_regress has long had provisions for dynamically substituting path
names into regression test scripts and result files, but use of that
feature has always been a serious pain in the neck, mainly because
updating the result files requires tedious manual editing. Let's
get rid of that in favor of passing down the paths in environment
variables.
In addition to being easier to maintain, this way is capable of
dealing with path names that require escaping at runtime, for example
paths containing single-quote marks. (There are other stumbling
blocks in the way of actually building in a path that looks like
that, but removing this one seems like a good thing to do.) The key
coding rule that makes that possible is to concatenate pieces of a
dynamically-variable string using psql's \set command, and then use
the :'variable' notation to quote and escape the string for the next
level of interpretation.
In hopes of making this change more transparent to "git blame",
I've split it into two steps. This commit adds the necessary
pg_regress.c support and changes all the *.source files in-place
so that they no longer require any dynamic translation. The next
commit will just "git mv" them into the regular sql/ and expected/
directories.
Discussion: https://postgr.es/m/1655733.1639871614@sss.pgh.pa.us
PostgreSQL 15 dropped usage of .source files that are used to generate
.sql and .out files by replacing some placeholders with the actual
values before test runs. Instead, the information is passed from
pg_regress to the .sql and .out files directly via env variables. Those
variables are read via \getenv psql command in relevant test files.
PostgreSQL 15 commit d1029bb5a26cb84b116b0dee4dde312291359f2a introduced
some changes to pg_regress binary that allowed this to happen. However
this change is not backported to earlier versions of PG, and thus we
come up with a similar mechanism in pg_regress_multi that works in all
available PG versions.
When using `citus.replicate_reference_tables_on_activate = off`,
reference tables need to be replicated later. This can be done using the
`replicate_reference_tables()` UDF. However, this function only allowed
blocking replication. This changes the function to default to logical
replication instead, and allows choosing any of our existing shard
transfer modes.
DESCRIPTION: Use faster custom copy logic for non-blocking shard moves
Non-blocking shard moves consist of two main phases:
1. Initial data copy
2. Catchup phase
This changes the first of these phases significantly. Previously we used the
copy logic provided by postgres subscriptions. This meant we didn't have
to implement it ourselves, but it came with the downside of little control.
When implementing shard splits we needed more control to even make it
work, so we implemented our own logic for copying data between nodes.
This PR starts using that logic for non-blocking shard moves. Doing so
has four main advantages:
1. It uses COPY in binary format when possible, which is cheaper to encode
and decode. Furthermore it very often results in less data that needs to
be sent over the network.
2. It allows us to create the primary key (or other replica identity) after doing
the initial data copy. This should give some speed up over the total run,
because creating an index is bulk is much faster than incrementally building it.
3. It doesn't require a replication slot per parallel copy. Increasing the maximum
number of replication slots uses resources in postgres, even if they are not used.
So reducing the number of replication slots that shard moves need is nice.
4. Logical replication table_sync workers are slow to start up, so if lots of shards
need to be copied that can make it quite slow. This can happen easily when
combining Postgres partitioning with Citus.
master_drain_node in distributed_triggers.sql test file takes too
long to execute. It is directly dependent on the shard count.
Hence I reduced shard count from 32 to 4 (default in tests),
since this doesn't affect the validity of the tests.
This change reduces the setup time of our minimal schedules in two ways:
1. Don't run `multi_cluster_managament`, but instead run a much smaller
sql file with almost the same results. `multi_cluster_management`
adds and removes lots of nodes and tests all kinds of failure
scenarios. This is not needed for the minimal schedules. The only
reason we were using it there was to get a working cluster of the
layout that the tests expected. The new `minimal_cluster_management`
test achieves this with much less work, going from ~2s to ~0.5s.
2. Parallelize a bit more of the helper tests.
We are reducing the log level here to avoid alternative test output
in PG15 because of the change in the display of SQL-standard
function's arguments in INSERT/SELECT in PG15.
The log level changes can be reverted when we drop support for PG14
Relevant PG commit:
a8d8445a7b2f80f6d0bfe97b19f90bd2cbef8759
The new shard copy code that was created for shard splits has some
advantages over the old shard copy code. The old code was using
worker_append_table_to_shard, which wrote to disk twice. And it also
didn't use binary copy when that was possible. Both of these issues
were fixed in the new copy code. This PR starts using this new copy
logic also for shard moves, not just for shard splits.
On my local machine I created a single shard table like this.
```sql
set citus.shard_count = 1;
create table t(id bigint, a bigint);
select create_distributed_table('t', 'id');
INSERT into t(id, a) SELECT i, i from generate_series(1, 100000000) i;
```
I then turned `fsync` off to make sure I wasn't bottlenecked by disk.
Finally I moved this shard between nodes with `citus_move_shard_placement`
with `block_writes`.
Before this PR a move took ~127s, after this PR it took only ~38s. So for this
small test this resulted in spending ~70% less time.
And I also tried the same test for a table that contained large strings:
```sql
set citus.shard_count = 1;
create table t(id bigint, a bigint, content text);
select create_distributed_table('t', 'id');
INSERT into t(id, a, content) SELECT i, i, 'aunethautnehoautnheaotnuhetnohueoutnehotnuhetncouhaeohuaeochgrhgd.athbetndairgexdbuhaobulrhdbaetoausnetohuracehousncaoehuesousnaceohuenacouhancoexdaseohusnaetobuetnoduhasneouhaceohusnaoetcuhmsnaetohuacoeuhebtokteaoshetouhsanetouhaoug.lcuahesonuthaseauhcoerhuaoecuh.lg;rcydabsnetabuesabhenth' from generate_series(1, 20000000) i;
```
While testing 5670dffd33, I realized
that we have a missing RecordNonDistTableAccessesForTask() for
local utility commands.
Although we don't have to record the relation access for local
only cases, we really want to keep the behaviour for scale-out
be the same with single node on all aspects. We wouldn't want
any single node complex transaction to work on single machine,
but not on multi node cluster. Hence, we apply the same restrictions.
For example, on a distributed cluster, the following errors, and
after this commit this errors locally as well
```SQL
CREATE TABLE ref(a int primary key);
INSERT INTO ref VALUES (1);
CREATE TABLE dist(a int REFERENCES ref(a));
SELECT create_reference_table('ref');
SELECT create_distributed_table('dist', 'a');
BEGIN;
SELECT * FROM dist;
TRUNCATE ref CASCADE;
ERROR: cannot execute DDL on table "ref" because there was a parallel SELECT access to distributed table "dist" in the same transaction
HINT: Try re-running the transaction with "SET LOCAL citus.multi_shard_modify_mode TO 'sequential';"
COMMIT;
```
We also add the comprehensive test suite and run the same locally.
Code snippet in Makefile was blocking Citus build when USE_PGXS flag was set. This was included for port to FSPG but is not needed for Citus engine and can be safely removed.
Reported bug #5803 shows that we are currently not sending the IN clause to our planner for columnar. This PR fixes it by checking for ScalarArrayOpExpr in ExtractPushdownClause so that we do not skip it. Also added a test case for this new addition.
It turns out that create_distributed_table
and citus_move/copy_shard_placement does not
work well concurrently.
To fix that, we need to acquire a lock, which
sounds like a good use of colocation lock.
However, the current usage of colocation lock is
limited to higher level UDFs like rebalance_table_shards
etc. Those usage of lock is still useful, but
we cannot acquire the same lock on citus_move_shard_placement
etc. because the coordinator connects to itself to acquire
the lock. Hence, the high level UDF blocks itself.
To fix that, we use one more colocation lock, with the placements
are the main objects to consider.
Before this commit, we required multiple copies of the
same stringInfo if we needed to append/prepend data to
the stringInfo. Now, we optionally get prefix/postfix.
For large string operations, this can save up to %10
memory.
Previously, CreateFixPartitionShardIndexNames() created all
the relevant query strings for all the shards, and executed
the large query string. And, in terms of the memory consumption,
this huge command (and its ExprContext generated while running
the command) is the main bottleneck/
With this change, we are reducing the total amount of memory
usage to almost 1/shard_count.
On my local machine, a distributed partitioned table with 120 partitions,
each 32 shards, the total memory consumption reduced from ~3GB
to ~0.1GB. And, the total execution time increased from ~28 seconds
to ~30 seconds. This seems like a good trade-off.
We used to only check whether the PID is valid
or not. However, Postgres does not necessarily
set the PID of the backend to 0 when it exists.
Instead, we need to be able to check it from procArray.
IsBackendPid() is what pg_stat_activity also relies
on for a similar purpose.
Historically we have been testing with the 'latest' version of libpq
when the CI images were build. This has the downside that rebuilding the
images often break our tests due to different errors returned from
libpq.
With this change we will actually test with a stable version of libpq
that is based on the postgres minor version that we test against.
This will make it easier to maintain postgres images over time, as well
as running _all_ tests locally, where we change libpq in sync with the
postgres server version.
use RecurseObjectDependencies api to find if an object is citus depended
make vanilla tests runnable to see if citus_depended function is working correctly
citus_locks combines the pg_locks views from all nodes and adds
global_pid, nodeid, and relation_name. The columns of citus_locks don't
change based on the Postgres version, however the pg_locks's columns do.
Postgres 14 added one more column to pg_locks (waitstart timestamptz).
citus_locks has the most expansive column set, including the newly added
column. If citus_locks is queried in a Postgres version where pg_locks
doesn't have some columns, the values for those columns in citus_locks
will be NULL
DESCRIPTION:
This PR extends support for Partitioned and Columnar tables in blocking 'citus_split_shard_by_split_points' workflow.
Columnar Support : No special handling required. Just removing checks that fails split for columnar table and adding test coverage.
Partitioned Table Support :
Skip copying of parent table as they are empty, The partitions contain data and are treated as co-located shards that will be copied separately.
Attach partitions to parent on destination after inserting new shard metadata and before creating foreign key constraints.
MISC:
Fix Bug #4949 where Blocking shard moves fails if there is a foreign key between partitioned distributed tables (from child to parent).
TEST:
Added new test 'citus_split_shards_columnar_partitioned' for splitting 'partitioned' and 'columnar + partitioned' table.
Added new test 'shard_move_constraints_blocking' to add coverage for shard move bug fix.
Updated test 'citus_split_shard_by_split_points_negative' to allow columnar and partitioned table.
* Remove if conditions with PG_VERSION_NUM < 13
* Remove server_above_twelve(&eleven) checks from tests
* Fix tests
* Remove pg12 and pg11 alternative test output files
* Remove pg12 specific normalization rules
* Some more if conditions in the code
* Change RemoteCollationIdExpression and some pg12/pg13 comments
* Remove some more normalization rules
When building packages on ubuntu jammy, we started to see some warnings.
autoreconf: warning: autoconf input should be named 'configure.ac', not
'configure.in'
* Blocking split setup
* Add missing type
* Missing API from Metadata Sync
* Shard Split e2e code
* Worker Split Copy DestReceiver skeleton
* Basic destreceiver code
* worker_split_copy UDF
* UDF calling
* Split points are text
* Isolate Tenant and Split Shard Unification
* Fixing executor and misc
* Reindent code
* Fixing UDF definitions
* Hello World Local Copy works
* Remote copy hello world works
* Local and Remote binary test
* Fixing text local copy and adding tests
* Hello World shard split works
* Negative tests
* Blocking Split workflow works
* Refactor
* Bug fix
* Reindent
* Cleaning up and adding comments
* Basic test for shard split workflow
* ReIndent
* Circle CI integration
* Removing include causing circle-ci build failure
* Remove SplitCopyDestReceiver and use PartitionedResultDestReceiver
* Add support for citus.enable_binary_protocol
* Reindent
* Fix build break
* Update Test
* Cleanup on catch
* Addressing open comments
* Update downgrade script and quote schema/table in COPY statement
* Fix metadata sync issue. Update regression test
* Isolation test and bug fix
* Add Isolation test, fix foreign constraint deadlock issue
* Misc code review comments
* Test name needing to be quoted
* Refactor code from review comments
* Explaining shardGroupSplitIntervalListList
* Fix upgrade & downgrade
* Fix broken test
* Test fix Round 2
* Fixing bug and modifying test appropriately
* Fully qualify copy udf name. Run Reindent
* Address PR comments
* Fix null handling when creating AuxiliaryStructures
* Ensure local copy is triggered in tests
* Limit max shards that can be created with split
* Test failure fix
* Remove split_mode and use shard_transfer_mode instead'
* Fix test failure
* Fix test failure
* Fixing permission issue when splitting non-superuser owned tables
* Fix test expected output
* Remove extra space
* Fix test
* attempt to fix test
* Addressing Marco's PR comment
* Only clean shards created by workflow
* Remove from merge
* Update test
Similar to #5897, one more step for running Citus with PG 15.
This PR at least make Citus run with PG 15. I have not tried running the tests with PG 15.
Shmem changes are based on 4f2400cb3f
Compile breaks are mostly due to #6008
This is a continuation of a refactor (with commit sha
2b7cf0c097) that aimed to use Citus helper
UDFs by default in iso tests.
PostgreSQL isolation test infrastructure uses some UDFs to detect
whether concurrent sessions block each other. Citus implements
alternatives to that UDF so that we are able to detect and report
distributed transactions that get blocked on the worker nodes as well.
We needed to explicitly replace PG helper functions with Citus
implementations in each isolation file. Now we replace them by default.
* Support upgrade and downgrade and separate columnar as citus_columnar extension
Co-authored-by: Yanwen Jin <yanwjin@microsoft.com>
Co-authored-by: Jeff Davis <jeff@j-davis.com>
Use Citus helper UDFs by default in iso tests
PostgreSQL isolation test infrastructure uses some UDFs to detect
whether concurrent sessions block each other. Citus implements
alternatives to that UDF so that we are able to detect and report
distributed transactions that get blocked on the worker nodes as well.
We needed to explicitly replace PG helper functions with Citus
implementations in each isolation file. Now we replace them by default.
* Added more regression tests for more vacuum options,
* Fixed deadlock for unqualified vacuum when there is only 1 worker,
* Supported lock_skipped for vacuum.
This PR makes all of the features open source that were previously only
available in Citus Enterprise.
Features that this adds:
1. Non blocking shard moves/shard rebalancer
(`citus.logical_replication_timeout`)
2. Propagation of CREATE/DROP/ALTER ROLE statements
3. Propagation of GRANT statements
4. Propagation of CLUSTER statements
5. Propagation of ALTER DATABASE ... OWNER TO ...
6. Optimization for COPY when loading JSON to avoid double parsing of
the JSON object (`citus.skip_jsonb_validation_in_copy`)
7. Support for row level security
8. Support for `pg_dist_authinfo`, which allows storing different
authentication options for different users, e.g. you can store
passwords or certificates here.
9. Support for `pg_dist_poolinfo`, which allows using connection poolers
in between coordinator and workers
10. Tracking distributed query execution times using
citus_stat_statements (`citus.stat_statements_max`,
`citus.stat_statements_purge_interval`,
`citus.stat_statements_track`). This is disabled by default.
11. Blocking tenant_isolation
12. Support for `sslkey` and `sslcert` in `citus.node_conninfo`
We already have tests relying on citus_finalize_upgrade_to_citus11().
Now, adjust those to rely on citus_finish_citus_upgrade() and
always call citus_finish_citus_upgrade().
The error comes due to the datum jsonb in pg_dist_metadata_node.metadata being 0 in some scenarios. This is likely due to not copying the data when receiving a datum from a tuple and pg deciding to deallocate that memory when the table that the tuple was from is closed.
Also fix another place in the code that might have been susceptible to this issue.
I tested on both multi-vg and multi-1-vg and the test were successful.
altering the distributed table.
To be able to alter view's owner without enforcing sequential mode.
Alter view process functions have been udpated to use metadata
connection.
Do not obtain AccessShareLock before acquiring the distributed locks.
Acquiring an AccessShareLock ensures that the relations which we are trying to get a distributed lock on will not be dropped in the time between when the LOCK command is issued and the LOCK commands are send to the worker. However, this also leads to distributed deadlocks in such scenarios:
```sql
-- for dist lock acquiring order coor, w1, w2
-- on w2
LOCK t1 IN ACCESS EXLUSIVE MODE;
-- acquire AccessShareLock locally on t1 to ensure it is not dropped while we get ready to distribute the lock
-- concurrently on w1
LOCK t1 IN ACCESS EXLUSIVE MODE;
-- acquire AccessShareLock locally on t1 to ensure it is not dropped while we get ready to distribute the lock
-- acquire dist lock on coor, w1, gets blocked on local AccessShareLock on w2
-- on w2 continuation of the execution above
-- starts to acquire dist locks and gets blocked on the coor by the lock acquired by w1
-- distributed deadlock
```
We opt for avoiding such deadlocks with the cost of the possibility of running into errors when the relations on which we are trying to acquire locks on get dropped.
It is often useful to be able to sync the metadata in parallel
across nodes.
Also citus_finalize_upgrade_to_citus11() uses
start_metadata_sync_to_primary_nodes() after this commit.
Note that this commit does not parallelize all pieces of node
activation or metadata syncing. Instead, it tries to parallelize
potenially large parts of metadata, which is the objects and
distributed tables (in general Citus tables).
In the future, it would be nice to sync the reference tables
in parallel across nodes.
Create ~720 distributed tables / ~23450 shards
```SQL
-- declaratively partitioned table
CREATE TABLE github_events_looooooooooooooong_name (
event_id bigint,
event_type text,
event_public boolean,
repo_id bigint,
payload jsonb,
repo jsonb,
actor jsonb,
org jsonb,
created_at timestamp
) PARTITION BY RANGE (created_at);
SELECT create_time_partitions(
table_name := 'github_events_looooooooooooooong_name',
partition_interval := '1 day',
end_at := now() + '24 months'
);
CREATE INDEX ON github_events_looooooooooooooong_name USING btree (event_id, event_type, event_public, repo_id);
SELECT create_distributed_table('github_events_looooooooooooooong_name', 'repo_id');
SET client_min_messages TO ERROR;
```
across 1 node: almost same as expected
```SQL
SELECT start_metadata_sync_to_primary_nodes();
Time: 15664.418 ms (00:15.664)
select start_metadata_sync_to_node(nodename,nodeport) from pg_dist_node;
Time: 14284.069 ms (00:14.284)
```
across 7 nodes: ~3.5x improvement
```SQL
SELECT start_metadata_sync_to_primary_nodes();
┌──────────────────────────────────────┐
│ start_metadata_sync_to_primary_nodes │
├──────────────────────────────────────┤
│ t │
└──────────────────────────────────────┘
(1 row)
Time: 25711.192 ms (00:25.711)
-- across 7 nodes
select start_metadata_sync_to_node(nodename,nodeport) from pg_dist_node;
Time: 82126.075 ms (01:22.126)
```
Move internal storage details to a separate schema with no public
access to limit the possibility for information leakage.
Create views with public access that show storage details for those
columnar tables where the user has ownership privileges. Include
mapping between relation ID and storage ID for easier interpretation.
We remove `<waiting ...>` and `<... completed>` outputs for some CREATE
INDEX CONCURRENTLY commands since they can cause flakiness in some scenarios.
Postgres calls WaitForOlderSnapshots() and this can cause CREATE INDEX
CONCURRENTLY commands for shards to get blocked by each other for brief
periods of time. The extra waits can pop-up, or they can get completed
at different lines in the output files. To remedy that, we rename those
indexes to be captured by the new normalization rule.
* Bug fix for bug #5876. Memset MetadataCacheSystem every time there is an abort
* Created an ObjectAccessHook that saves the transactionlevel of when citus was created and will clear metadatacache if that transaction level is rolled back. Added additional tests to make sure metadatacache is cleared
Columnar: support relation options with ALTER TABLE.
Use ALTER TABLE ... SET/RESET to specify relation options rather than
alter_columnar_table_set() and alter_columnar_table_reset().
Not only is this more ergonomic, but it also allows better integration
because it can be treated like DDL on a regular table. For instance,
citus can use its own ProcessUtility_hook to distribute the new
settings to the shards.
DESCRIPTION: Columnar: support relation options with ALTER TABLE.
With Citus MX enabled, when a reference table is modified, it does
some operations on the first worker node(e.g., acquire locks).
If node metadata is locked (via add node or create restore point),
the changes to the reference tables should be blocked.
In the past (pre-11), we allowed removing worker nodes
that had active placements for replicated distributed
table, without even checking if there are any other
replicas of the same placement.
However, with #5469, we prevent disabling nodes via a hard
error when there is the last active placement of shard, as we
do for reference tables. Note that otherwise, we'd allow
users to lose data.
As of today, the NOTICE is completely irrelevant.
First worker node has a special meaning for modifications on the replicated tables
It is used to acquire a remote lock, such that the modifications are serialized.
With this commit, we make sure that we do not let any distributed query to see a
different 'first worker node' while first worker node is disabled.
Note that, maybe implicitly mentioned above, when first worker node is disabled,
the first worker node changes, that's why we have to handle the situation.
Before this commit, we had:
```SQL
SELECT citus_disable_node(nodename, nodeport, force boolean DEFAULT false)
```
Where, we allow forcing to disable first worker node with
`force:=true`. However, it entails the risk for losing
data / diverging placement data etc.
With `force` flag, we control disabling the first worker node,
and with `async` flag we control whether the changes are done
via bg worker or immediately.
```SQL
SELECT citus_disable_node(nodename, nodeport, force boolean DEFAULT false, sync boolean DEFAULT false)
```
Where we can achieve all the following:
| Mode | Data loss possibility | Can run in 2PC | Handle multiple node failures | Immediately effective |
| --- |--- |--- |--- |--- |
| force:false, sync: false | false | true | true | false |
| force:false, sync: true | false | false | false | true |
| force:true, sync: false | true | true | true | false |
| force:true, sync: true | false | false | false | true |
There are two problems in this area. First, when there are expressions
on the index name, we should call `transformIndexExpression()` before
generating the index name. That is what Postgres does.
Second, because of 40c24bfef9
PG 13 and PG 14 generates different names for indexes with function calls even for local PG tables.
Assume we have:
```SQL
create table t(id int);
select create_distributed_table('t', 'id');
create index ON t (my_very_boring_function(id));
```
On PG 13, the name of the index is `t_expr_idx`
```SQL
\d t
Table "public.t"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ id │ integer │ │ │ │
└────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
"t_expr_idx" btree (my_very_boring_function(id::bigint))
```
On PG 14, the name of the index is `t_my_very_boring_function_idx`
```SQL
\d t
Table "public.t"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ id │ integer │ │ │ │
└────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
"t_my_very_boring_function_idx" btree (my_very_boring_function(id::bigint))
```
The second issue is not very critical. The important part is that
we adjust regression tests to drop all the indexes, which ensures
the index names are sane on any version.
Over time we have added significantly improved the support for objects to be propagated by Citus as to make scaling out the database more seamless. It became evident that there was a lot of code duplication that got into the codebase to implement the propagation.
This PR tries to reduce the amount of repeated code that is at most only slightly different. To make things worse, most of the differences were actually oversights instead of correct.
This Patch introduces 3 reusable sets of pre/post processing steps for respectively
- create
- alter
- drop
With the use of the common functionality we should have more coherent behaviour between different supported object by Citus.
Some steps either omit the Pre or Post processing step if they would not make sense to include.
All tests pass, only 1 test needed changing, foreign servers, as the dropping of foreign servers didn't implement support for dropping multiple foreign servers at once. Given the common approach correctly supports dropping of multiple objects, either distributed or not, the test that assumed it wouldn't work was now obsolete.
We have a mechanism which ensures that newly distributed
objects are recorded during `alter extension citus update`.
However, the logic was lacking "view"s. With this commit, we make
sure that existing views are also marked as distributed during
upgrade.
We are nearing the 100 objects being propagated in `master_copy_shard_placement` and with the extra supported objects this gets pushed over a 100 objects.
When a 100 objects are reached for propagation a notice will be shown to the user, informing them it might take a while to finish the operation.
During testing this is not important to see. Since the message contains the exact number of objects to be propagated the tests becomes very unstable when merging community into enterprsie.
This change makes that the test output stays stable.
Adds support for propagation ALTER VIEW commands to
- Change owner of view
- SET/RESET option
- Rename view and view's column name
- Change schema of the view
Since PG also supports targeting views with ALTER TABLE
commands, related code also added to direct such ALTER TABLE
commands to ALTER VIEW commands while sending them to workers.
Breaking down #5899 into smaller PR-s
This particular PR changes the way TRUNCATE acquires distributed locks on the relations it is truncating to use the LOCK command instead of lock_relation_if_exists. This has the benefit of using pg's recursive locking logic it implements for the LOCK command instead of us having to resolve relation dependencies and lock them explicitly. While this does not directly affect truncate, it will allow us to generalize this locking logic to then log different relations where the pg recursive locking will become useful (e.g. locking views).
This implementation is a bit more complex that it needs to be due to pg not supporting locking foreign tables. We can however, still lock foreign tables with lock_relation_if_exists. So for a command:
TRUNCATE dist_table_1, dist_table_2, foreign_table_1, foreign_table_2, dist_table_3;
We generate and send the following command to all the workers in metadata:
```sql
SEL citus.enable_ddl_propagation TO FALSE;
LOCK dist_table_1, dist_table_2 IN ACCESS EXCLUSIVE MODE;
SELECT lock_relation_if_exists('foreign_table_1', 'ACCESS EXCLUSIVE');
SELECT lock_relation_if_exists('foreign_table_2', 'ACCESS EXCLUSIVE');
LOCK dist_table_3 IN ACCESS EXCLUSIVE MODE;
SEL citus.enable_ddl_propagation TO TRUE;
```
Note that we need to alternate between the lock command and lock_table_if_exists in order to preserve the TRUNCATE order of relations.
When pg supports locking foreign tables, we will be able to massive simplify this logic and send a single LOCK command.
Adds support for propagating create/drop view commands and views to
worker node while scaling out the cluster. Since views are dropped while
converting the table type, metadata connection will be used while
propagating view commands to not switch to sequential mode.
First, it is not needed. Second, in the past we had issues regarding
this: https://github.com/citusdata/citus/pull/4344
When I create 10k tables, ~120K shards, this saves
40Mb of memory during ALTER EXTENSION citus UPDATE.
Before the change: MetadataCacheMemoryContext: 41943040 ~ 40MB
After the change: MetadataCacheMemoryContext: 8192