Our isolation_distributed_deadlock_detection test would fail randomly in
CI in three different ways.
The first type of failure looked like this:
```diff
check_distributed_deadlocks
---------------------------
t
(1 row)
-step s1-update-5: <... completed>
step s5-update-1: <... completed>
ERROR: canceling the transaction since it was involved in a distributed deadlock
+step s1-update-5: <... completed>
step s1-commit:
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26399/workflows/d213ee85-397a-467a-9ffb-39e4f44e6688/jobs/749533
This random change in output was harmless and happened because when the
deadlock detector cancelled a query, two queries would continue: The one
that was cancelled would throw an error (and thus complete), and the one
that was unblocked would now complete.
It was random which of the two the isolation tester would first detect
as completed. To resolve this PR starts using the ["marker" feature][1],
this allows us to make sure one of the steps won't be marked as
completed until the other one completed first.
The second random failure was very similar:
```diff
check_distributed_deadlocks
---------------------------
t
(1 row)
-step s2-update-2: <... completed>
-step s3-update-3: <... completed>
-ERROR: canceling the transaction since it was involved in a distributed deadlock
step s6-commit:
COMMIT;
step s5-update-6: <... completed>
+step s2-update-2: <... completed>
+step s3-update-3: <... completed>
+ERROR: canceling the transaction since it was involved in a distributed deadlock
step s5-commit:
```
Again a harmless difference in test output. In this case it's possible
that the deadlock detector would not detect the unblocked processes
right away, and would thus continue with to the next step. This step was
a commit on a session that was not blocked, and which thus could
complete without issues.
To solve this I changed the order of the commits at the end of the
permutation, to always have the first session that would commit be the
session that would be unblocked the last. This ensures that no commit
will ever be executed before completing all the queries.
The third issue was different and looked like this:
```diff
step s4-update-5: <... completed>
step s4-commit:
COMMIT;
+step s1-update-4: <... completed>
+isolationtester: canceling step s3-update-4 after 5 seconds
step s3-update-4: <... completed>
+ERROR: canceling statement due to user request
+step s2-update-2: <... completed>
step s3-commit:
COMMIT;
-step s2-update-2: <... completed>
-step s1-update-4: <... completed>
step s1-commit:
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26411/workflows/9089beec-4f0f-4027-b4ce-0e84889afc06/jobs/750143
The reason for this failure is not entirely clear to me, but I was able
to remove the flakyness without impacting the goal of the test. What was
happening was that both `s1` and `s3` were waiting for `s4` to commit
and release it's lock on the row 4. For some reason it wasn't
deterministic which of the two sessions would be granted the lock after
it was released by row 4. The test expected `s3` to be granted the lock,
but sometimes it would be granted to `s1` instead. Which would in turn
cause `s3` to still be blocked.
To solve this I simply removed `s1` completely from this test. It wasn't
actually part of the cycle that the deadlock detector should detect and
was an unrelated appendage:
```mermaid
graph TD;
s2-->s3;
s3-->s4;
s1-->s4;
s4-->s5;
s5-->s6;
s6-->s5;
```
By removing `s1` completely there was no contention for the lock and
`s3` could always acquire it.
[1]: a73d6c87f2/src/test/isolation/README (L163-L188)
In our testing infra structure, even though we use pinned versions of postgres, the auxiliary libraries might pull in newer versions. This is for example the case for libpq, which will now use the libpq libraries from 14beta3.
The changes in this PR are a lot due to the libpq changes.
We also have changed the citus version that is used as a base for the citus upgrades, from 10.0 to 10.1 . This caused columnar to enforce some extra limits on the settings, which conflicted with our upgrade tests.
The changes in failure tests are due to the libpq changes.
There are also a lot of changes on isolation tests outputs, hence we
updated all of them.
Co-authored-by: Nils Dijk <nils@citusdata.com>
It seems that one of the deadlock detection tests fails way too often in
our CI. The difference is only ordering. Currently it seems that it is a
good idea to disable this test for the sake of development.
With this fix, we traverse the graph with DFS which was originally
intended. Note that, before the fix, we traverse the graph with BFS
which might lead to killing some unrelated backend that is not
involved in the distributed deadlock.