Add docs on how to fix flaky tests (#6438)

I fixed a lot of flaky tests recently and I found some patterns in the
type of issues and type of fixes. This adds a document that lists 
these types of issues and explains how to fix them.
Jelte Fennema 2022-10-18 15:52:01 +02:00 committed by GitHub
parent e87eda6496
commit f756db39c4
3 changed files with 332 additions and 0 deletions

@@ -218,3 +218,7 @@ style `#include` statements like this:
Any other SQL you can put directly in the main sql file, e.g.
`src/backend/distributed/sql/citus--8.3-1--9.0-1.sql`.
### Running tests
See [`src/test/regress/README.md`](https://github.com/citusdata/citus/blob/master/src/test/regress/README.md)

@@ -99,3 +99,8 @@ To automatically setup a citus cluster in tests we use our
`src/test/regress/pg_regress_multi.pl` script. This sets up a citus cluster and
then starts the standard postgres test tooling. You almost never have to change
this file.
## Randomly failing tests
In CI a test sometimes fails randomly; we call such tests "flaky". To fix these
flaky tests see [`src/test/regress/flaky_tests.md`](https://github.com/citusdata/citus/blob/master/src/test/regress/flaky_tests.md)

@@ -0,0 +1,323 @@
# How to fix flaky tests
Flaky tests happen when, for some reason, our tests return non-deterministic
results.
There are three different causes of flaky tests:
1. Tests that don't make sure output is consistent, i.e. a bug in our tests
2. Bugs in our testing infrastructure
3. Bugs in Citus itself
All of these impact the happiness and productivity of our developers, because we
have to rerun tests to make them pass. But apart from developer happiness and
productivity, cause 3 also impacts our users: by ignoring flaky tests we can miss
problems that our users could run into. This reduces the effectiveness of our
tests.
## Reproducing a flaky test
Before trying to fix the flakiness, it's important that you can reproduce the
flaky test. Often it only reproduces in CI, so we have a CI job that can help
you reproduce flakiness consistently by running the same test many times.
You can configure CI to run this job by setting the `flaky_test` parameter and,
if necessary, the `flaky_test_make` parameter.
```diff
flaky_test:
type: string
- default: ''
+ default: 'isolation_shard_rebalancer_progress'
flaky_test_make:
type: string
- default: check-minimal
+ default: check-isolation-base
```
Once you get this job to fail consistently in CI, you can continue with the next
steps to make it pass consistently instead. If the failure doesn't reproduce
with this CI job, it's almost certainly caused by running it concurrently with
other tests. See the "Don't run test in parallel with others" section below on
how to fix that.
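If you can reproduce the failure locally at all, a simple retry loop is often
enough to estimate how flaky a test is before and after your fix. The helper
below is only a sketch; the commented-out `make` invocation is a hypothetical
example, and the real target and path depend on the test you are chasing.

```python
import subprocess

def repeat_until_fail(max_runs, command):
    """Re-run a command until it fails, to reproduce flaky behaviour locally.

    Returns a short summary string instead of raising, so it's easy to use
    interactively.
    """
    for i in range(1, max_runs + 1):
        result = subprocess.run(command, capture_output=True)
        if result.returncode != 0:
            return f"failed on run {i}"
    return f"passed all {max_runs} runs"

# Hypothetical invocation; the real make target depends on the test:
# print(repeat_until_fail(50, ["make", "-C", "src/test/regress", "check-isolation-base"]))

# Demo with the Unix `true` command, which always succeeds:
print(repeat_until_fail(3, ["true"]))
```

A flaky test usually needs many runs before it fails, so start with a high
`max_runs` and be patient.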
## Easy fixes
The following types of issues all fall within category 1: bugs in our tests.
### Expected records but different order
**Issue**: A query returns the right result, but they are in a different order
than expected by the output.
**Fix**: Add an extra column to the ORDER BY clause of the query to make the
output consistent.
**Example**
```diff
8970008 | colocated_dist_table | -2147483648 | 2147483647 | localhost | 57637
8970009 | colocated_partitioned_table | -2147483648 | 2147483647 | localhost | 57637
8970010 | colocated_partitioned_table_2020_01_01 | -2147483648 | 2147483647 | localhost | 57637
- 8970011 | reference_table | | | localhost | 57637
8970011 | reference_table | | | localhost | 57638
+ 8970011 | reference_table | | | localhost | 57637
(13 rows)
```
**Example fix**:
```diff
-ORDER BY logicalrelid, shardminvalue::BIGINT;
+ORDER BY logicalrelid, shardminvalue::BIGINT, nodeport;
```
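Why the extra ORDER BY column works can be seen with a small Python sketch (the
rows below are made up to mirror the example): when two rows tie on every sort
key, either order is a valid query result, so the output can differ between
runs; adding `nodeport` as a tiebreaker leaves exactly one valid order.

```python
# Shard placements as (logicalrelid, shardminvalue, nodeport); the two
# reference_table placements tie on the first two columns.
run_a = [("reference_table", None, 57638), ("reference_table", None, 57637)]
run_b = [("reference_table", None, 57637), ("reference_table", None, 57638)]

# With only the original sort keys the tie stays unresolved: Python's sort is
# stable, so each "run" keeps whatever order the executor happened to return.
partial = lambda r: (r[0], r[1] is not None)
assert sorted(run_a, key=partial) != sorted(run_b, key=partial)

# Adding nodeport as a tiebreaker makes both runs produce identical output.
full = lambda r: (r[0], r[1] is not None, r[2])
assert sorted(run_a, key=full) == sorted(run_b, key=full)
print("tiebreaker makes the order deterministic")
```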
### Expected logs but different order
**Issue**: The logs in the regress output are displayed in a different order
than what the output file shows
**Fix**: It's simple: don't log these things during the test. There are two common
ways of achieving this:
1. If you don't care about the logs for this query at all, then you can change
the log `VERBOSITY` or lower `client_min_messages`.
2. If these are logs of uninteresting commands created by
`citus.log_remote_commands`, but you care about some of the other remote
commands being as expected, then you can use `citus.grep_remote_commands` to
only display the commands that you care about.
**Example of issue 1**:
```diff
select alter_table_set_access_method('ref','heap');
NOTICE: creating a new table for alter_table_set_access_method.ref
NOTICE: moving the data of alter_table_set_access_method.ref
NOTICE: dropping the old alter_table_set_access_method.ref
NOTICE: drop cascades to 2 other objects
-DETAIL: drop cascades to materialized view m_ref
-drop cascades to view v_ref
+DETAIL: drop cascades to view v_ref
+drop cascades to materialized view m_ref
CONTEXT: SQL statement "DROP TABLE alter_table_set_access_method.ref CASCADE"
NOTICE: renaming the new table to alter_table_set_access_method.ref
alter_table_set_access_method
-------------------------------
(1 row)
```
**Example fix of issue 1**
```diff
+\set VERBOSITY terse
```
**Example of issue 2**
```diff
SET citus.log_remote_commands TO ON;
-- should propagate to all workers because no table is specified
ANALYZE;
NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
DETAIL: on server postgres@localhost:57637 connectionId: 1
NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
DETAIL: on server postgres@localhost:57638 connectionId: 2
NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
DETAIL: on server postgres@localhost:57637 connectionId: 1
-NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
-DETAIL: on server postgres@localhost:xxxxx connectionId: xxxxxxx
NOTICE: issuing ANALYZE
DETAIL: on server postgres@localhost:57637 connectionId: 1
+NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
+DETAIL: on server postgres@localhost:57638 connectionId: 2
NOTICE: issuing ANALYZE
DETAIL: on server postgres@localhost:57638 connectionId: 2
```
**Example fix of issue 2**
```diff
SET citus.log_remote_commands TO ON;
+SET citus.grep_remote_commands = '%ANALYZE%';
```
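Conceptually, `citus.grep_remote_commands` keeps only the remote commands that
match a SQL `LIKE` pattern. A rough Python equivalent of that filtering (not
Citus's actual implementation, just an illustration of the idea) looks like
this:

```python
import re

def like_to_regex(like_pattern: str) -> str:
    """Translate a SQL LIKE pattern to a regex: % -> .*, _ -> .
    (a rough sketch that ignores LIKE escaping)."""
    return "^" + re.escape(like_pattern).replace("%", ".*").replace("_", ".") + "$"

commands = [
    "BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, ...)",
    "SET citus.enable_ddl_propagation TO 'off'",
    "ANALYZE",
]

# Only commands matching the grep pattern get logged; the connection-management
# noise (BEGIN, SET, ...) is filtered out, so its ordering can't flake the test.
pattern = like_to_regex("%ANALYZE%")
kept = [c for c in commands if re.match(pattern, c)]
print(kept)
```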
### Isolation test completes in different order
**Issue**: There's no defined order in which the steps in two different sessions
complete, because they don't block each other. This can happen when two sessions
were both blocked by a third session, but when the third session releases the
lock the first two can both continue.
**Fix**: Use the isolation test ["marker" feature][marker-feature] to make sure
one step can only complete after another has completed.
[marker-feature]: https://github.com/postgres/postgres/blob/c68a1839902daeb42cf1ebc89edfdd91c00e5091/src/test/isolation/README#L163-L188
**Example**
```diff
-step s1-shard-move-c1-block-writes: <... completed>
+step s4-shard-move-sep-block-writes: <... completed>
citus_move_shard_placement
--------------------------
(1 row)
-step s4-shard-move-sep-block-writes: <... completed>
+step s1-shard-move-c1-block-writes: <... completed>
citus_move_shard_placement
--------------------------
```
**Example fix**
```diff
+permutation ... "s1-shard-move-c1-block-writes" "s4-shard-move-sep-block-writes" ...
+permutation ... "s1-shard-move-c1-block-writes" "s4-shard-move-sep-block-writes"("s1-shard-move-c1-block-writes") ...
```
### Disk size numbers are not exactly like expected
**Issue**: In some tests we show the disk size of a table, but due to various
postgres background processes such as vacuuming these sizes can change slightly.
**Fix**: Expect a certain range of disk sizes instead of a specific one.
**Example**
```diff
VACUUM (INDEX_CLEANUP ON, PARALLEL 1) local_vacuum_table;
SELECT pg_size_pretty( pg_total_relation_size('local_vacuum_table') );
pg_size_pretty
----------------
- 21 MB
+ 22 MB
(1 row)
```
**Example fix**
```diff
-SELECT pg_size_pretty( pg_total_relation_size('local_vacuum_table') );
- pg_size_pretty
+SELECT CASE WHEN s BETWEEN 20000000 AND 25000000 THEN 22500000 ELSE s END
+FROM pg_total_relation_size('local_vacuum_table') s ;
+ s
---------------------------------------------------------------------
- 21 MB
+ 22500000
```
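The same range check can be sanity-tested outside SQL. This Python sketch
mirrors the `CASE WHEN s BETWEEN ... THEN ... ELSE s END` expression, using the
bounds and placeholder from the example fix:

```python
def normalize_size(size_bytes: int) -> int:
    """Collapse any size within the expected range to a single placeholder,
    so small fluctuations from background processes don't change the output."""
    low, high, placeholder = 20_000_000, 25_000_000, 22_500_000
    return placeholder if low <= size_bytes <= high else size_bytes

# Both of the observed sizes (21 MB and 22 MB) normalize to the same value,
# while a genuinely unexpected size still shows up unchanged and fails the test.
print(normalize_size(21 * 1024 * 1024))  # 22500000
print(normalize_size(22 * 1024 * 1024))  # 22500000
print(normalize_size(50 * 1024 * 1024))  # 52428800
```

Pick the range wide enough to absorb normal variation but narrow enough that a
real size regression still falls outside it.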
## Isolation test flakiness
If the flaky test is an isolation test, first read the Postgres docs on dealing
with [race conditions in isolation tests][pg-isolation-docs]. A common example
was already listed above, but the Postgres docs list some other types too and
explain how to make their output consistent.
[pg-isolation-docs]: https://github.com/postgres/postgres/blob/c68a1839902daeb42cf1ebc89edfdd91c00e5091/src/test/isolation/README#L152
## Ruling out common sources of randomness as the cause
If it's none of the above, then probably the reason why the test is flaky is not
immediately obvious. There are a few things that can introduce randomness into
our test suite. To keep your sanity while investigating, it's good to rule these
out as the cause (or even better determine that they are the cause).
### Don't run test in parallel with others
Check in the schedule if the test is run in parallel with others. If it is,
remove it from there and check if it's still flaky.
**Example**
```diff
test: multi_partitioning_utils replicated_partitioned_table
-test: multi_partitioning partitioning_issue_3970
+test: multi_partitioning
+test: partitioning_issue_3970
test: drop_partitioned_table
```
### Use a fixed number of connections
The adaptive executor in Citus sometimes opens extra connections to parallelize
work and speed up multi-shard queries. This happens especially in CI, because
CI machines are sometimes slow. There are two ways to get a consistent number of
connections:
1. Use `citus.max_adaptive_executor_pool_size` to limit the connections
2. Use `citus.force_max_query_parallelization` to always open the maximum number
of connections.
**Example**
```diff
ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
+ERROR: canceling the transaction since it was involved in a distributed deadlock
```
**Example of fix 1**
```diff
+SET citus.max_adaptive_executor_pool_size TO 1;
ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
+RESET citus.max_adaptive_executor_pool_size;
```
**Example of fix 2**
```diff
+SET citus.force_max_query_parallelization TO 1;
ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
+RESET citus.force_max_query_parallelization;
```
IMPORTANT: If this helps, this could very well indicate a bug. Check with
senior/principal engineers if it's expected that it helps in this case.
## What to do if this all doesn't work?
If none of the advice above worked, the first thing to try is to read the failing
test in detail and try to understand how it works. Often, with a bit of thinking,
you can figure out why it's failing in the way that it's failing. If you cannot
figure it out yourself, it's good to ask senior/principal engineers; maybe they
can think of the reason. Or maybe they're certain that it's an actual bug.
### What to do when you cannot fix or find the bug?
If it turns out to be an actual bug in Citus, but fixing the bug (or finding its
cause) is hard, making the test output consistent is already an improvement over
the status quo. Be sure to create an issue for the bug, though. Even if you're
not entirely sure what's causing it, you can still create an issue describing how
to reproduce the flakiness.
## What to do if output can never be consistent?
There are still a few ways to make our test suite less flaky, even if you
figured out that the output that Postgres gives can never be made consistent.
### Normalizing random output
If for some reason you cannot make the output consistent, then our
[`normalize.sed`][normalize] might come to the rescue. It allows us to
normalize certain lines to one specific output.
**Example**
```diff
-CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
+CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s.
```
**Fix by changing inconsistent parts of line**
```sed
# ignore timing statistics for VACUUM VERBOSE
s/CPU: user: .*s, system: .*s, elapsed: .*s/CPU: user: X.XX s, system: X.XX s, elapsed: X.XX s/
```
**Fix by completely removing line**
```sed
# ignore timing statistics for VACUUM VERBOSE
/CPU: user: .*s, system: .*s, elapsed: .*s/d
```
[normalize]:
https://github.com/citusdata/citus/blob/main/src/test/regress/bin/normalize.sed
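Before committing a new normalization rule, it's worth checking that every
observed variant of the flaky line really normalizes to the same text. The
same substitution can be expressed in Python for a quick local check:

```python
import re

# The sed substitution rule, expressed as a Python regex replacement.
pattern = r"CPU: user: .*s, system: .*s, elapsed: .*s"
replacement = "CPU: user: X.XX s, system: X.XX s, elapsed: X.XX s"

# Two variants of the flaky line, as seen in different test runs.
a = re.sub(pattern, replacement, "CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.")
b = re.sub(pattern, replacement, "CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s.")

# After normalization both variants are identical, so the test output is stable.
assert a == b
print(a)
```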
### Removing the flaky test
Sometimes removing the test is the only way to make our test suite less flaky.
Of course this is a last resort, but sometimes it's what we want. If running the
test does more harm than good, removing it will be a net positive.