The failure_create_distributed_table_non_empty test would sometimes fail
like this:
```diff
-- in the first test, cancel the first connection we sent from the coordinator
SELECT citus.mitmproxy('conn.cancel(' || pg_backend_pid() || ')');
- mitmproxy
----------------------------------------------------------------------
-
-(1 row)
-
+ERROR: canceling statement due to user request
+CONTEXT: COPY mitmproxy_result, line 1: ""
+SQL statement "COPY mitmproxy_result FROM '/home/circleci/project/src/test/regress/tmp_check/mitmproxy.fifo'"
+PL/pgSQL function citus.mitmproxy(text) line 11 at EXECUTE
SELECT create_distributed_table('test_table', 'id');
```
Because the cancel command had no filter it would actually sometimes
cancel the mitmproxy cancel command itself. This PR addresses that by
filtering on CREATE TABLE, which is one of the command that
create_distributed_table will send to the workers.
Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26252/workflows/1b7e5464-cca4-4ec1-99b3-48ddf25c29fa/jobs/742829
Sometimes in CI the columnar_memory test was using slightly more memory
than expected.
```diff
SELECT CASE WHEN 1.0 * TopMemoryContext / :top_post BETWEEN 0.98 AND 1.02 THEN 1 ELSE 1.0 * TopMemoryContext / :top_post END AS top_growth
FROM columnar_test_helpers.columnar_store_memory_stats();
--[ RECORD 1 ]-
-top_growth | 1
+-[ RECORD 1 ]------------------
+top_growth | 1.0206132116232119
-- before this change, max mem usage while executing inserts was 28MB and
```
This PR changes the expectation to be slightly higher, such that this
random increase in memory usage doesn't cause a flaky test.
Failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26256/workflows/c0870f66-3346-4f8d-a1d3-36dfd7c98289/jobs/743028
In the logical_replication test we test that the cleanup logic at the
start of a shard move works as expected. To do so we create a
subscription and publication slot manually. This changes the test to
make that subscription actually connect to the database that the
publication is in.
Useful for #5987#6085
By running isolation tests in parallel we're just asking for flaky
tasks. The first test might temporarily block one of the commands in the
second test, which we then detect as waiting like this:
```diff
step s2-vacuum-analyze:
VACUUM ANALYZE test_insert_vacuum;
-
+ <waiting ...>
step s1-commit:
COMMIT;
+step s2-vacuum-analyze: <... completed>
```
Debugging flaky tests is also much harder when they are run in parallel.
This PR starts running all our isolation tests sequentially.
The reason for opening this PR was me seeing this failing test:
https://app.circleci.com/pipelines/github/citusdata/citus/26194/workflows/ff57e2cf-8ac4-40fe-bc0c-74a7f8fecb53/jobs/740454
As well as having fixed a similar issue recently in #6122
* Adjust some isolation test for the recent PG commits
In 3f32395612,
Postgres starts any isolation session with `set application_name`.
However, one of the tests we had expected that it is exactly the first
command in the session. The test tries to show that even if a gpid
has not been assigned, we can show it in the citus_lock_waits graph.
Now that, it is literally not possible to have such test as gpid
would be assigned after `set application_name` command. Still,
it is good to have a test where a command is blocked on the parser
Sometimes the columnar_memory test fails in CI with the following error:
```diff
SELECT 1.0 * TopMemoryContext / :top_post BETWEEN 0.98 AND 1.02 AS top_growth_ok
FROM columnar_test_helpers.columnar_store_memory_stats();
-[ RECORD 1 ]-+--
-top_growth_ok | t
+top_growth_ok | f
-- before this change, max mem usage while executing inserts was 28MB and
```
This is almost certainly a harmless failure that simply requires bumping
the margin a little bit. However, it's impossible to say with the
current output. I was unable to reproduce this on-demand on my local
machine or even in CI. So this changes the test to include the actual
value difference in the size of TopMemoryContext when it's outside the
expected range. Then next time it fails we at least have some
information about why.
Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/25966/workflows/d472a57b-419a-4f33-b8bc-2e174a98d4d6/jobs/730576
As shown in #6196 the output of s1-view-locks is sometimes not as
expected. However, because it's output is very minimal it's hard to
understand the reason for that. This adds some more columns and
aggregates less, so we can more easily see what locks are unexpectedly
held or released.
In passing this also fixes the following flaky part of this test by excluding
locks taken by the maintenance daemon. After running it with this more
detailed output for s1-view-locks it became obvious that that was the
problem here.
```diff
diff -dU10 -w /home/jelte/work/citus/src/test/regress/expected/isolation_ref2ref_foreign_keys.out /home/jelte/work/citus/src/test/regress/results/isolation_ref2ref_foreign_keys.out
--- /home/jelte/work/citus/src/test/regress/expected/isolation_ref2ref_foreign_keys.out.modified 2022-08-18 15:42:08.689525233 +0200
+++ /home/jelte/work/citus/src/test/regress/results/isolation_ref2ref_foreign_keys.out.modified 2022-08-18 15:42:08.729525233 +0200
@@ -288,21 +288,22 @@
step s1-view-locks:
SELECT mode, count(*)
FROM pg_locks
WHERE locktype='advisory'
GROUP BY mode
ORDER BY 1, 2;
mode |count
------------------------+-----
-(0 rows)
+ShareUpdateExclusiveLock| 1
+(1 row)
starting permutation: s2-begin s2-insert-table-3 s1-view-locks s2-rollback s1-view-locks
step s2-begin:
BEGIN;
step s2-insert-table-3:
INSERT INTO ref_table_3 VALUES (7, 5);
step s1-view-locks:
```
In CI sometimes failure_setup will fail with the following error:
```diff
SELECT master_add_node('localhost', :worker_2_proxy_port); -- an mitmproxy which forwards to the second worker
- master_add_node
----------------------------------------------------------------------
- 2
-(1 row)
-
+ERROR: connection to the remote node localhost:9060 failed with the following error: could not connect to server: Connection refused
+ Is the server running on host "localhost" (127.0.0.1) and accepting
+ TCP/IP connections on port 9060?
+could not connect to server: Connection refused
+ Is the server running on host "localhost" (127.0.0.1) and accepting
+ TCP/IP connections on port 9060?
+could not connect to server: Cannot assign requested address
+ Is the server running on host "localhost" (::1) and accepting
+ TCP/IP connections on port 9060?
diff -dU10 -w /home/circleci/project/src/test/regress/expected/failure_online_move_shard_placement.out /home/circleci/project/src/test/regress/results/failure_online_move_shard_placement.out
```
This then breaks all the tests run after it as well, because we're
missing one worker node.
Locally I was able to reproduce this error by sleeping for 10 seconds in
the forked process sleep before actually starting mitmproxy. So I'm
expecting what's happening in CI is that due to limited resources,
mitmproxy is not up yet when we try to add its port as a workernode.
This PR fixes this by waiting until mitmproxy is listening on its socket
before actually starting to run our tests. This fixed it locally for me
when I made the forked process sleep for 10 seconds before starting
mitmproxy.
In passing it also improves the detection and errors that we already
had for the case where something was already listening on the
mitmproxy port.
Because both @gledis69 and me were changing things in our CI images
at the same time this also includes a bump of the style checker tools.
Closes#6200
This removes some warnings that are present when building on Ubuntu 22.04.
It removes warnings on PG13 + OpenSSL 3.0. OpenSSL 3.0 has marked some
functions that we use as deprecated, but we want to continue support OpenSSL
1.0.1 for the time being too. This indicates that to OpenSSL 3.0, so it doesn't
show warnings.
Sometimes this multi_utilities would fail with the following error:
```diff
SET citus.log_remote_commands TO ON;
-- should propagate to all workers because no table is specified
ANALYZE;
NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
DETAIL: on server postgres@localhost:57637 connectionId: 1
NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
DETAIL: on server postgres@localhost:57638 connectionId: 2
NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
DETAIL: on server postgres@localhost:57637 connectionId: 1
-NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
-DETAIL: on server postgres@localhost:xxxxx connectionId: xxxxxxx
NOTICE: issuing ANALYZE
DETAIL: on server postgres@localhost:57637 connectionId: 1
+NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
+DETAIL: on server postgres@localhost:57638 connectionId: 2
NOTICE: issuing ANALYZE
DETAIL: on server postgres@localhost:57638 connectionId: 2
```
This is simply a harmless change in output due to some timing
differences. This PR makes the test output consistent by only logging
the remote ANALYZE commands, not the SET commands.
This fixes our most commonly randomly failing failure test. The failing
diff is as follows:
```diff
SELECT citus.mitmproxy('conn.onQuery(query="fetch_intermediate_results").kill()');
mitmproxy
-----------
(1 row)
INSERT INTO target_table SELECT * FROM source_table;
-ERROR: connection to the remote node localhost:xxxxx failed with the following error: connection not open
+ERROR: could not open file "base/pgsql_job_cache/10_0_40/repartitioned_results_20770193413_from_4213590_to_1.data": No such file or directory
+CONTEXT: while executing command on localhost:9060
+while executing command on localhost:57637
SELECT * FROM target_table ORDER BY a;
```
As far as I can tell this is the cause of a race condition: After killing
fetch_intermediate_results on worker 9060, the previously created data
file gets cleaned up. The fetch_intermediate_results call that's sent
to worker 57637 will be cancelled and rolled back soon because of the
failure on the other connection. But if that fetch_intermediate_results
call is able to connect to 9060 before it is cancelled, it won't find
the file it's looking for there anymore. So while it's not the error we
expect, it does indicate that we succeeded.
To avoid this issue instead of killing the fetch_intermediate_results
call directly, we kill the COPY command that it uses to do the fetch.
This results in stable output as can be seen here, where 227 runs of
failure_insert_select_repartition succeeded:
https://app.circleci.com/pipelines/github/citusdata/citus/26168/workflows/9c64a3b6-f46c-4725-9fb4-8f6a2d00a023/jobs/739389
To be clear this changes the test to affects the opposite
fetch_intermediate_results call. This kills the fetch_intermediate_results
call of worker 57637, instead of killing the fetch_intermediate_results call
on worker 9060.
Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26147/workflows/780e95ea-264a-4c9f-ad2e-cf11449a795e/jobs/738467