Fix flakyness in isolation_distributed_deadlock_detection (#6240)

Our isolation_distributed_deadlock_detection test would fail randomly in
CI in three different ways.

The first type of failure looked like this:

```diff
 check_distributed_deadlocks
 ---------------------------
 t
 (1 row)

-step s1-update-5: <... completed>
 step s5-update-1: <... completed>
 ERROR:  canceling the transaction since it was involved in a distributed deadlock
+step s1-update-5: <... completed>
 step s1-commit:
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26399/workflows/d213ee85-397a-467a-9ffb-39e4f44e6688/jobs/749533

This random change in output was harmless and happened because when the
deadlock detector cancelled a query, two queries would continue: The one
that was cancelled would throw an error (and thus complete), and the one
that was unblocked would now complete.

It was random which of the two the isolation tester would first detect
as completed. To resolve this PR starts using the ["marker" feature][1],
this allows us to make sure one of the steps won't be marked as
completed until the other one completed first.

The second random failure was very similar:
```diff
 check_distributed_deadlocks
 ---------------------------
 t
 (1 row)

-step s2-update-2: <... completed>
-step s3-update-3: <... completed>
-ERROR:  canceling the transaction since it was involved in a distributed deadlock
 step s6-commit:
   COMMIT;

 step s5-update-6: <... completed>
+step s2-update-2: <... completed>
+step s3-update-3: <... completed>
+ERROR:  canceling the transaction since it was involved in a distributed deadlock
 step s5-commit:
```

Again a harmless difference in test output. In this case it's possible
that the deadlock detector would not detect the unblocked processes
right away, and would thus continue with to the next step. This step was
a commit on a session that was not blocked, and which thus could
complete without issues.

To solve this I changed the order of the commits at the end of the
permutation, to always have the first session that would commit be the
session that would be unblocked the last. This ensures that no commit
will ever be executed before completing all the queries.

The third issue was different and looked like this:
```diff
 step s4-update-5: <... completed>
 step s4-commit:
   COMMIT;

+step s1-update-4: <... completed>
+isolationtester: canceling step s3-update-4 after 5 seconds
 step s3-update-4: <... completed>
+ERROR:  canceling statement due to user request
+step s2-update-2: <... completed>
 step s3-commit:
   COMMIT;

-step s2-update-2: <... completed>
-step s1-update-4: <... completed>
 step s1-commit:
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26411/workflows/9089beec-4f0f-4027-b4ce-0e84889afc06/jobs/750143

The reason for this failure is not entirely clear to me, but I was able
to remove the flakyness without impacting the goal of the test. What was
happening was that both `s1` and `s3` were waiting for `s4` to commit
and release it's lock on the row 4. For some reason it wasn't
deterministic which of the two sessions would be granted the lock after
it was released by row 4. The test expected `s3` to be granted the lock,
but sometimes it would be granted to `s1` instead. Which would in turn
cause `s3` to still be blocked.

To solve this I simply removed `s1` completely from this test. It wasn't
actually part of the cycle that the deadlock detector should detect and
was an unrelated appendage:

```mermaid
  graph TD;
      s2-->s3;
      s3-->s4;
      s1-->s4;
      s4-->s5;
      s5-->s6;
      s6-->s5;
```

By removing `s1` completely there was no contention for the lock and
`s3` could always acquire it.

[1]: a73d6c87f2/src/test/isolation/README (L163-L188)

(cherry picked from commit 9749622399)
pull/6363/head
Jelte Fennema 2022-08-26 11:03:40 +02:00
parent 45838a0139
commit bf63788b98
2 changed files with 43 additions and 53 deletions

View File

@ -141,7 +141,7 @@ step s2-commit:
COMMIT;
starting permutation: s1-begin s2-begin s1-update-1 s2-update-2 s1-update-2 deadlock-checker-call s2-upsert-select-all deadlock-checker-call s1-commit s2-commit
starting permutation: s1-begin s2-begin s1-update-1 s2-update-2 s1-update-2 deadlock-checker-call s2-upsert-select-all deadlock-checker-call s2-commit s1-commit
step s1-begin:
BEGIN;
@ -179,14 +179,14 @@ t
step s1-update-2: <... completed>
step s2-upsert-select-all: <... completed>
ERROR: canceling the transaction since it was involved in a distributed deadlock
step s1-commit:
COMMIT;
step s2-commit:
COMMIT;
step s1-commit:
COMMIT;
starting permutation: s1-begin s2-begin s1-update-1 s2-update-2 s1-update-2 deadlock-checker-call s2-ddl deadlock-checker-call s1-commit s2-commit
starting permutation: s1-begin s2-begin s1-update-1 s2-update-2 s1-update-2 deadlock-checker-call s2-ddl deadlock-checker-call s2-commit s1-commit
step s1-begin:
BEGIN;
@ -224,10 +224,10 @@ t
step s1-update-2: <... completed>
step s2-ddl: <... completed>
ERROR: canceling the transaction since it was involved in a distributed deadlock
step s1-commit:
step s2-commit:
COMMIT;
step s2-commit:
step s1-commit:
COMMIT;
@ -268,7 +268,7 @@ step s2-commit:
COMMIT;
starting permutation: s1-begin s2-begin s2-insert-ref-10 s1-insert-ref-11 s1-insert-ref-10 s2-insert-ref-11 deadlock-checker-call s1-commit s2-commit
starting permutation: s1-begin s2-begin s2-insert-ref-10 s1-insert-ref-11 s1-insert-ref-10 s2-insert-ref-11 deadlock-checker-call s2-commit s1-commit
step s1-begin:
BEGIN;
@ -298,10 +298,10 @@ t
step s1-insert-ref-10: <... completed>
ERROR: canceling the transaction since it was involved in a distributed deadlock
step s2-insert-ref-11: <... completed>
step s1-commit:
step s2-commit:
COMMIT;
step s2-commit:
step s1-commit:
COMMIT;
@ -408,7 +408,7 @@ step s1-commit:
COMMIT;
starting permutation: s1-begin s2-begin s3-begin s2-update-1 s1-update-1 s2-update-2 s3-update-3 s3-update-2 deadlock-checker-call s2-update-3 deadlock-checker-call s3-commit s2-commit s1-commit
starting permutation: s1-begin s2-begin s3-begin s2-update-1 s1-update-1 s2-update-2 s3-update-3 s3-update-2 deadlock-checker-call s2-update-3 deadlock-checker-call s2-commit s3-commit s1-commit
step s1-begin:
BEGIN;
@ -455,18 +455,18 @@ t
step s3-update-2: <... completed>
ERROR: canceling the transaction since it was involved in a distributed deadlock
step s2-update-3: <... completed>
step s3-commit:
COMMIT;
step s2-commit:
COMMIT;
step s1-update-1: <... completed>
step s3-commit:
COMMIT;
step s1-commit:
COMMIT;
starting permutation: s1-begin s2-begin s3-begin s4-begin s1-update-1 s2-update-2 s3-update-3 s3-update-2 deadlock-checker-call s4-update-4 s2-update-3 deadlock-checker-call s3-commit s2-commit s1-commit s4-commit
starting permutation: s1-begin s2-begin s3-begin s4-begin s1-update-1 s2-update-2 s3-update-3 s3-update-2 deadlock-checker-call s4-update-4 s2-update-3 deadlock-checker-call s2-commit s3-commit s1-commit s4-commit
step s1-begin:
BEGIN;
@ -516,10 +516,10 @@ t
step s3-update-2: <... completed>
ERROR: canceling the transaction since it was involved in a distributed deadlock
step s2-update-3: <... completed>
step s3-commit:
step s2-commit:
COMMIT;
step s2-commit:
step s3-commit:
COMMIT;
step s1-commit:
@ -665,7 +665,7 @@ step s3-commit:
COMMIT;
starting permutation: s1-begin s2-begin s3-begin s4-begin s5-begin s6-begin s1-update-1 s5-update-5 s3-update-2 s2-update-3 s4-update-4 s3-update-4 deadlock-checker-call s6-update-6 s4-update-6 s1-update-5 s5-update-1 deadlock-checker-call s1-commit s5-commit s6-commit s4-commit s3-commit s2-commit
starting permutation: s1-begin s2-begin s3-begin s4-begin s5-begin s6-begin s1-update-1 s5-update-5 s3-update-2 s2-update-3 s4-update-4 s3-update-4 deadlock-checker-call s6-update-6 s4-update-6 s1-update-5 s5-update-1 deadlock-checker-call s5-commit s1-commit s6-commit s4-commit s3-commit s2-commit
step s1-begin:
BEGIN;
@ -733,10 +733,10 @@ t
step s1-update-5: <... completed>
step s5-update-1: <... completed>
ERROR: canceling the transaction since it was involved in a distributed deadlock
step s1-commit:
step s5-commit:
COMMIT;
step s5-commit:
step s1-commit:
COMMIT;
step s6-commit:
@ -754,7 +754,7 @@ step s2-commit:
COMMIT;
starting permutation: s1-begin s2-begin s3-begin s4-begin s5-begin s6-begin s6-update-6 s5-update-5 s5-update-6 s4-update-4 s1-update-4 s4-update-5 deadlock-checker-call s2-update-3 s3-update-2 s2-update-2 s3-update-3 deadlock-checker-call s6-commit s5-commit s4-commit s1-commit s3-commit s2-commit
starting permutation: s1-begin s2-begin s3-begin s4-begin s5-begin s6-begin s6-update-6 s5-update-5 s5-update-6 s4-update-4 s1-update-4 s4-update-5 deadlock-checker-call s2-update-3 s3-update-2 s2-update-2 s3-update-3 deadlock-checker-call s3-commit s6-commit s5-commit s4-commit s1-commit s2-commit
step s1-begin:
BEGIN;
@ -822,6 +822,9 @@ t
step s2-update-2: <... completed>
step s3-update-3: <... completed>
ERROR: canceling the transaction since it was involved in a distributed deadlock
step s3-commit:
COMMIT;
step s6-commit:
COMMIT;
@ -837,17 +840,11 @@ step s1-update-4: <... completed>
step s1-commit:
COMMIT;
step s3-commit:
COMMIT;
step s2-commit:
COMMIT;
starting permutation: s1-begin s2-begin s3-begin s4-begin s5-begin s6-begin s5-update-5 s3-update-2 s2-update-2 s4-update-4 s3-update-4 s4-update-5 s1-update-4 deadlock-checker-call s6-update-6 s5-update-6 s6-update-5 deadlock-checker-call s5-commit s6-commit s4-commit s3-commit s1-commit s2-commit
step s1-begin:
BEGIN;
starting permutation: s2-begin s3-begin s4-begin s5-begin s6-begin s5-update-5 s3-update-2 s2-update-2 s4-update-4 s3-update-4 s4-update-5 deadlock-checker-call s6-update-6 s5-update-6 s6-update-5 deadlock-checker-call s6-commit s5-commit s4-commit s3-commit s2-commit
step s2-begin:
BEGIN;
@ -881,9 +878,6 @@ step s3-update-4:
step s4-update-5:
UPDATE deadlock_detection_test SET some_val = 4 WHERE user_id = 5;
<waiting ...>
step s1-update-4:
UPDATE deadlock_detection_test SET some_val = 1 WHERE user_id = 4;
<waiting ...>
step deadlock-checker-call:
SELECT check_distributed_deadlocks();
@ -912,13 +906,13 @@ t
step s5-update-6: <... completed>
step s6-update-5: <... completed>
ERROR: canceling the transaction since it was involved in a distributed deadlock
step s6-commit:
COMMIT;
step s5-commit:
COMMIT;
step s4-update-5: <... completed>
step s6-commit:
COMMIT;
step s4-commit:
COMMIT;
@ -927,10 +921,6 @@ step s3-commit:
COMMIT;
step s2-update-2: <... completed>
step s1-update-4: <... completed>
step s1-commit:
COMMIT;
step s2-commit:
COMMIT;

View File

@ -316,52 +316,52 @@ step "deadlock-checker-call"
}
// simplest case, loop with two nodes
permutation "s1-begin" "s2-begin" "s1-update-1" "s2-update-2" "s2-update-1" "deadlock-checker-call" "s1-update-2" "deadlock-checker-call" "s1-commit" "s2-commit"
permutation "s1-begin" "s2-begin" "s1-update-1" "s2-update-2" "s2-update-1" "deadlock-checker-call" "s1-update-2"("s2-update-1") "deadlock-checker-call" "s1-commit" "s2-commit"
// simplest case with replication factor 2
permutation "s1-begin" "s2-begin" "s1-update-1-rep-2" "s2-update-2-rep-2" "s2-update-1-rep-2" "deadlock-checker-call" "s1-update-2-rep-2" "deadlock-checker-call" "s1-commit" "s2-commit"
permutation "s1-begin" "s2-begin" "s1-update-1-rep-2" "s2-update-2-rep-2" "s2-update-1-rep-2" "deadlock-checker-call" "s1-update-2-rep-2"("s2-update-1-rep-2") "deadlock-checker-call" "s1-commit" "s2-commit"
// simplest case with 2pc enabled
permutation "s1-begin" "s2-begin" "s1-set-2pc" "s2-set-2pc" "s1-update-1" "s2-update-2" "s2-update-1" "deadlock-checker-call" "s1-update-2" "deadlock-checker-call" "s1-commit" "s2-commit"
// simplest case with multi-shard query is cancelled
permutation "s1-begin" "s2-begin" "s1-update-1" "s2-update-2" "s1-update-2" "deadlock-checker-call" "s2-upsert-select-all" "deadlock-checker-call" "s1-commit" "s2-commit"
permutation "s1-begin" "s2-begin" "s1-update-1" "s2-update-2" "s1-update-2" "deadlock-checker-call" "s2-upsert-select-all"("s1-update-2") "deadlock-checker-call" "s2-commit" "s1-commit"
// simplest case with DDL is cancelled
permutation "s1-begin" "s2-begin" "s1-update-1" "s2-update-2" "s1-update-2" "deadlock-checker-call" "s2-ddl" "deadlock-checker-call" "s1-commit" "s2-commit"
permutation "s1-begin" "s2-begin" "s1-update-1" "s2-update-2" "s1-update-2" "deadlock-checker-call" "s2-ddl"("s1-update-2") "deadlock-checker-call" "s2-commit" "s1-commit"
// daedlock with local table
permutation "s1-begin" "s2-begin" "s1-insert-dist-10" "s2-insert-local-10" "s2-insert-dist-10" "s1-insert-local-10" "deadlock-checker-call" "s1-commit" "s2-commit"
permutation "s1-begin" "s2-begin" "s1-insert-dist-10" "s2-insert-local-10" "s2-insert-dist-10" "s1-insert-local-10"("s2-insert-dist-10") "deadlock-checker-call" "s1-commit" "s2-commit"
// daedlock with reference tables only
permutation "s1-begin" "s2-begin" "s2-insert-ref-10" "s1-insert-ref-11" "s1-insert-ref-10" "s2-insert-ref-11" "deadlock-checker-call" "s1-commit" "s2-commit"
permutation "s1-begin" "s2-begin" "s2-insert-ref-10" "s1-insert-ref-11" "s1-insert-ref-10" "s2-insert-ref-11"("s1-insert-ref-10") "deadlock-checker-call" "s2-commit" "s1-commit"
// deadlock with referecen + distributed tables
permutation "s1-begin" "s2-begin" "s2-insert-ref-10" "s1-update-1" "deadlock-checker-call" "s2-update-1" "s1-insert-ref-10" "deadlock-checker-call" "s1-commit" "s2-commit"
permutation "s1-begin" "s2-begin" "s2-insert-ref-10" "s1-update-1" "deadlock-checker-call" "s2-update-1" "s1-insert-ref-10"("s2-update-1") "deadlock-checker-call" "s1-commit" "s2-commit"
// slightly more complex case, loop with three nodes
permutation "s1-begin" "s2-begin" "s3-begin" "s1-update-1" "s2-update-2" "s3-update-3" "deadlock-checker-call" "s1-update-2" "s2-update-3" "s3-update-1" "deadlock-checker-call" "s3-commit" "s2-commit" "s1-commit"
permutation "s1-begin" "s2-begin" "s3-begin" "s1-update-1" "s2-update-2" "s3-update-3" "deadlock-checker-call" "s1-update-2" "s2-update-3" "s3-update-1"("s2-update-3") "deadlock-checker-call" "s3-commit" "s2-commit" "s1-commit"
// similar to the above (i.e., 3 nodes), but the cycle starts from the second node
permutation "s1-begin" "s2-begin" "s3-begin" "s2-update-1" "s1-update-1" "s2-update-2" "s3-update-3" "s3-update-2" "deadlock-checker-call" "s2-update-3" "deadlock-checker-call" "s3-commit" "s2-commit" "s1-commit"
permutation "s1-begin" "s2-begin" "s3-begin" "s2-update-1" "s1-update-1" "s2-update-2" "s3-update-3" "s3-update-2" "deadlock-checker-call" "s2-update-3"("s3-update-2") "deadlock-checker-call" "s2-commit" "s3-commit" "s1-commit"
// not connected graph
permutation "s1-begin" "s2-begin" "s3-begin" "s4-begin" "s1-update-1" "s2-update-2" "s3-update-3" "s3-update-2" "deadlock-checker-call" "s4-update-4" "s2-update-3" "deadlock-checker-call" "s3-commit" "s2-commit" "s1-commit" "s4-commit"
permutation "s1-begin" "s2-begin" "s3-begin" "s4-begin" "s1-update-1" "s2-update-2" "s3-update-3" "s3-update-2" "deadlock-checker-call" "s4-update-4" "s2-update-3"("s3-update-2") "deadlock-checker-call" "s2-commit" "s3-commit" "s1-commit" "s4-commit"
// still a not connected graph, but each smaller graph contains dependencies, one of which is a distributed deadlock
permutation "s1-begin" "s2-begin" "s3-begin" "s4-begin" "s4-update-1" "s1-update-1" "deadlock-checker-call" "s2-update-2" "s3-update-3" "s2-update-3" "s3-update-2" "deadlock-checker-call" "s3-commit" "s2-commit" "s4-commit" "s1-commit"
permutation "s1-begin" "s2-begin" "s3-begin" "s4-begin" "s4-update-1" "s1-update-1" "deadlock-checker-call" "s2-update-2" "s3-update-3" "s2-update-3" "s3-update-2"("s2-update-3") "deadlock-checker-call" "s3-commit" "s2-commit" "s4-commit" "s1-commit"
// multiple deadlocks on a not connected graph
permutation "s1-begin" "s2-begin" "s3-begin" "s4-begin" "s1-update-1" "s4-update-4" "s2-update-2" "s3-update-3" "s3-update-2" "s4-update-1" "s1-update-4" "deadlock-checker-call" "s1-commit" "s4-commit" "s2-update-3" "deadlock-checker-call" "s2-commit" "s3-commit"
permutation "s1-begin" "s2-begin" "s3-begin" "s4-begin" "s1-update-1" "s4-update-4" "s2-update-2" "s3-update-3" "s3-update-2" "s4-update-1" "s1-update-4"("s4-update-1") "deadlock-checker-call" "s1-commit" "s4-commit" "s2-update-3"("s3-update-2") "deadlock-checker-call" "s2-commit" "s3-commit"
// a larger graph where the first node is in the distributed deadlock
permutation "s1-begin" "s2-begin" "s3-begin" "s4-begin" "s5-begin" "s6-begin" "s1-update-1" "s5-update-5" "s3-update-2" "s2-update-3" "s4-update-4" "s3-update-4" "deadlock-checker-call" "s6-update-6" "s4-update-6" "s1-update-5" "s5-update-1" "deadlock-checker-call" "s1-commit" "s5-commit" "s6-commit" "s4-commit" "s3-commit" "s2-commit"
permutation "s1-begin" "s2-begin" "s3-begin" "s4-begin" "s5-begin" "s6-begin" "s1-update-1" "s5-update-5" "s3-update-2" "s2-update-3" "s4-update-4" "s3-update-4" "deadlock-checker-call" "s6-update-6" "s4-update-6" "s1-update-5" "s5-update-1"("s1-update-5") "deadlock-checker-call" "s5-commit" "s1-commit" "s6-commit" "s4-commit" "s3-commit" "s2-commit"
// a larger graph where the deadlock starts from a middle node
permutation "s1-begin" "s2-begin" "s3-begin" "s4-begin" "s5-begin" "s6-begin" "s6-update-6" "s5-update-5" "s5-update-6" "s4-update-4" "s1-update-4" "s4-update-5" "deadlock-checker-call" "s2-update-3" "s3-update-2" "s2-update-2" "s3-update-3" "deadlock-checker-call" "s6-commit" "s5-commit" "s4-commit" "s1-commit" "s3-commit" "s2-commit"
permutation "s1-begin" "s2-begin" "s3-begin" "s4-begin" "s5-begin" "s6-begin" "s6-update-6" "s5-update-5" "s5-update-6" "s4-update-4" "s1-update-4" "s4-update-5" "deadlock-checker-call" "s2-update-3" "s3-update-2" "s2-update-2" "s3-update-3"("s2-update-2") "deadlock-checker-call" "s3-commit" "s6-commit" "s5-commit" "s4-commit" "s1-commit" "s2-commit"
// a larger graph where the deadlock starts from the last node
permutation "s1-begin" "s2-begin" "s3-begin" "s4-begin" "s5-begin" "s6-begin" "s5-update-5" "s3-update-2" "s2-update-2" "s4-update-4" "s3-update-4" "s4-update-5" "s1-update-4" "deadlock-checker-call" "s6-update-6" "s5-update-6" "s6-update-5" "deadlock-checker-call" "s5-commit" "s6-commit" "s4-commit" "s3-commit" "s1-commit" "s2-commit"
permutation "s2-begin" "s3-begin" "s4-begin" "s5-begin" "s6-begin" "s5-update-5" "s3-update-2" "s2-update-2" "s4-update-4" "s3-update-4" "s4-update-5" "deadlock-checker-call" "s6-update-6" "s5-update-6" "s6-update-5"("s5-update-6") "deadlock-checker-call" "s6-commit" "s5-commit" "s4-commit" "s3-commit" "s2-commit"
// a backend is blocked on multiple backends
// note that session 5 is not strictly necessary to simulate the deadlock