From f756db39c46a8accedf5f5d34b714b867220c42b Mon Sep 17 00:00:00 2001
From: Jelte Fennema
Date: Tue, 18 Oct 2022 15:52:01 +0200
Subject: [PATCH] Add docs on how to fix flaky tests (#6438)

I fixed a lot of flaky tests recently and I found some patterns in the
types of issues and types of fixes. This adds a document that lists these
types of issues and explains how to fix them.
---
 CONTRIBUTING.md                 |   4 +
 src/test/regress/README.md      |   5 +
 src/test/regress/flaky_tests.md | 323 ++++++++++++++++++++++++++++++++
 3 files changed, 332 insertions(+)
 create mode 100644 src/test/regress/flaky_tests.md

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 6c54e9d6c..cff39bf51 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -218,3 +218,7 @@ style `#include` statements like this:
 
 Any other SQL you can put directly in the main sql file, e.g.
 `src/backend/distributed/sql/citus--8.3-1--9.0-1.sql`.
+
+### Running tests
+
+See [`src/test/regress/README.md`](https://github.com/citusdata/citus/blob/master/src/test/regress/README.md)
diff --git a/src/test/regress/README.md b/src/test/regress/README.md
index 5a68df2c0..25ee88ecd 100644
--- a/src/test/regress/README.md
+++ b/src/test/regress/README.md
@@ -99,3 +99,8 @@ To automatically setup a citus cluster in tests we use our
 `src/test/regress/pg_regress_multi.pl` script. This sets up a citus cluster and
 then starts the standard postgres test tooling. You almost never have
 to change this file.
+
+## Randomly failing tests
+
+In CI a test sometimes fails randomly; we call these tests "flaky". To fix these
+flaky tests see [`src/test/regress/flaky_tests.md`](https://github.com/citusdata/citus/blob/master/src/test/regress/flaky_tests.md)
diff --git a/src/test/regress/flaky_tests.md b/src/test/regress/flaky_tests.md
new file mode 100644
index 000000000..2c065371b
--- /dev/null
+++ b/src/test/regress/flaky_tests.md
@@ -0,0 +1,323 @@
+# How to fix flaky tests
+
+Flaky tests happen when, for some reason, our tests return non-deterministic
+results.
+
+There are three different causes of flaky tests:
+1. Tests that don't make sure their output is consistent, i.e. a bug in our tests
+2. Bugs in our testing infrastructure
+3. Bugs in Citus itself
+
+All of these impact the happiness and productivity of our developers, because we
+have to rerun tests to make them pass. But apart from developer happiness and
+productivity, cause 3 also impacts our users, and by ignoring flaky tests we can
+miss problems that our users could run into. This reduces the effectiveness of
+our tests.
+
+
+## Reproducing a flaky test
+
+Before trying to fix the flakiness, it's important that you can reproduce the
+flaky test. Often it only reproduces in CI, so we have a CI job that can help
+you reproduce flakiness consistently by running the same test a lot of times.
+You can configure CI to run this job by setting the `flaky_test` parameter and,
+if necessary, the `flaky_test_make` parameter.
+
+```diff
+ flaky_test:
+   type: string
+-  default: ''
++  default: 'isolation_shard_rebalancer_progress'
+ flaky_test_make:
+   type: string
+-  default: check-minimal
++  default: check-isolation-base
+```
+
+Once you get this job to consistently fail in CI, you can continue with the next
+steps to make it instead consistently pass. If the failure doesn't reproduce
+with this CI job, it's almost certainly caused by running the test concurrently
+with other tests. See the "Don't run the test in parallel with others" section
+below on how to fix that.
+
+
+## Easy fixes
+
+The following types of issues all fall within category 1: bugs in our tests.
+
+### Expected records but different order
+
+**Issue**: A query returns the right rows, but they are in a different order
+than the one expected by the output file.
+
+**Fix**: Add an extra column to the ORDER BY clause of the query to make the
+output consistent.
+
+**Example**
+```diff
+ 8970008 | colocated_dist_table | -2147483648 | 2147483647 | localhost | 57637
+ 8970009 | colocated_partitioned_table | -2147483648 | 2147483647 | localhost | 57637
+ 8970010 | colocated_partitioned_table_2020_01_01 | -2147483648 | 2147483647 | localhost | 57637
+- 8970011 | reference_table | | | localhost | 57637
+ 8970011 | reference_table | | | localhost | 57638
++ 8970011 | reference_table | | | localhost | 57637
+ (13 rows)
+```
+
+**Example fix**:
+
+```diff
+-ORDER BY logicalrelid, shardminvalue::BIGINT;
++ORDER BY logicalrelid, shardminvalue::BIGINT, nodeport;
+```
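+
+As a minimal sketch of the same idea outside a diff (the query and the use of
+`pg_dist_shard_placement` here are illustrative, not taken from an actual
+regress test): a reference table has a placement on every worker, so a query
+that only orders by `shardid` can return those placements in either order.
+Adding `nodeport` as a tie-breaker makes the row order deterministic.
+
+```sql
+-- Flaky: placements of the same shard on different workers can swap positions.
+SELECT shardid, nodename, nodeport
+FROM pg_dist_shard_placement
+ORDER BY shardid;
+
+-- Deterministic: nodeport breaks the tie between placements of the same shard.
+SELECT shardid, nodename, nodeport
+FROM pg_dist_shard_placement
+ORDER BY shardid, nodeport;
+```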
+
+### Expected logs but different order
+
+**Issue**: The logs in the regress output are displayed in a different order
+than what the output file shows.
+
+**Fix**: It's simple: don't log these things during the test. There are two
+common ways of achieving this:
+1. If you don't care about the logs for this query at all, then you can change
+   the log `VERBOSITY` or lower `client_min_messages`.
+2. If these are logs of uninteresting commands created by
+   `citus.log_remote_commands`, but you care about some of the other remote
+   commands being as expected, then you can use `citus.grep_remote_commands` to
+   only display the commands that you care about.
+
+**Example of issue 1**:
+```diff
+select alter_table_set_access_method('ref','heap');
+ NOTICE: creating a new table for alter_table_set_access_method.ref
+ NOTICE: moving the data of alter_table_set_access_method.ref
+ NOTICE: dropping the old alter_table_set_access_method.ref
+ NOTICE: drop cascades to 2 other objects
+-DETAIL: drop cascades to materialized view m_ref
+-drop cascades to view v_ref
++DETAIL: drop cascades to view v_ref
++drop cascades to materialized view m_ref
+ CONTEXT: SQL statement "DROP TABLE alter_table_set_access_method.ref CASCADE"
+ NOTICE: renaming the new table to alter_table_set_access_method.ref
+ alter_table_set_access_method
+ -------------------------------
+
+ (1 row)
+```
+
+**Example fix of issue 1**
+```diff
++\set VERBOSITY terse
+```
+
+**Example of issue 2**
+```diff
+SET citus.log_remote_commands TO ON;
+ -- should propagate to all workers because no table is specified
+ ANALYZE;
+ NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
+ DETAIL: on server postgres@localhost:57637 connectionId: 1
+ NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
+ DETAIL: on server postgres@localhost:57638 connectionId: 2
+ NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
+ DETAIL: on server postgres@localhost:57637 connectionId: 1
+-NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
+-DETAIL: on server postgres@localhost:xxxxx connectionId: xxxxxxx
+ NOTICE: issuing ANALYZE
+ DETAIL: on server postgres@localhost:57637 connectionId: 1
++NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
++DETAIL: on server postgres@localhost:57638 connectionId: 2
+ NOTICE: issuing ANALYZE
+ DETAIL: on server postgres@localhost:57638 connectionId: 2
+```
+
+**Example fix of issue 2**
+```diff
+ SET citus.log_remote_commands TO ON;
++SET citus.grep_remote_commands = '%ANALYZE%';
+```
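+
+Put together as a runnable sketch (not an excerpt from an actual test; the
+`ANALYZE` is just the example from above, and you would normally pick only one
+of the two options):
+
+```sql
+-- Option 1: silence the noisy NOTICE/DETAIL lines entirely for this part of
+-- the test.
+\set VERBOSITY terse
+SET client_min_messages TO warning;
+
+-- Option 2: keep logging remote commands, but only show the ones the test
+-- actually cares about, so unrelated commands can't reorder the output.
+SET citus.log_remote_commands TO on;
+SET citus.grep_remote_commands TO '%ANALYZE%';
+ANALYZE;
+```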
+
+### Isolation test completes in different order
+
+**Issue**: There's no defined order in which the steps in two different sessions
+complete, because they don't block each other. This can happen when two sessions
+were both blocked by a third session, but when the third session releases the
+lock the first two can both continue.
+
+**Fix**: Use the isolation test ["marker" feature][marker-feature] to make sure
+one step can only complete after another has completed.
+
+[marker-feature]: https://github.com/postgres/postgres/blob/c68a1839902daeb42cf1ebc89edfdd91c00e5091/src/test/isolation/README#L163-L188
+
+
+**Example**
+```diff
+-step s1-shard-move-c1-block-writes: <... completed>
++step s4-shard-move-sep-block-writes: <... completed>
+ citus_move_shard_placement
+ --------------------------
+
+ (1 row)
+
+-step s4-shard-move-sep-block-writes: <... completed>
++step s1-shard-move-c1-block-writes: <... completed>
+ citus_move_shard_placement
+ --------------------------
+```
+
+**Example fix**
+```diff
+-permutation ... "s1-shard-move-c1-block-writes" "s4-shard-move-sep-block-writes" ...
++permutation ... "s1-shard-move-c1-block-writes" "s4-shard-move-sep-block-writes"("s1-shard-move-c1-block-writes") ...
+```
+
+### Disk size numbers are not exactly as expected
+
+**Issue**: In some tests we show the disk size of a table, but due to various
+Postgres background processes such as vacuuming, these sizes can change slightly.
+
+**Fix**: Expect a certain range of disk sizes instead of a specific one.
+
+**Example**
+```diff
+ VACUUM (INDEX_CLEANUP ON, PARALLEL 1) local_vacuum_table;
+ SELECT pg_size_pretty( pg_total_relation_size('local_vacuum_table') );
+  pg_size_pretty
+ ----------------
+- 21 MB
++ 22 MB
+ (1 row)
+```
+
+**Example fix**
+```diff
+-SELECT pg_size_pretty( pg_total_relation_size('local_vacuum_table') );
+- pg_size_pretty
++SELECT CASE WHEN s BETWEEN 20000000 AND 25000000 THEN 22500000 ELSE s END
++FROM pg_total_relation_size('local_vacuum_table') s ;
++ s
+ ---------------------------------------------------------------------
+- 21 MB
++ 22500000
+```
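+
+An alternative way to "expect a range" is to turn the size into a boolean check,
+so the output stays stable as long as the size is in the expected ballpark. This
+is just a sketch reusing the table from the example above, not an excerpt from
+an actual test:
+
+```sql
+-- Stable output: prints 't' for any size between 20 MB and 25 MB, so small
+-- variations caused by background processes don't change the expected file.
+SELECT pg_total_relation_size('local_vacuum_table')
+       BETWEEN 20000000 AND 25000000 AS size_in_expected_range;
+```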
+
+
+## Isolation test flakiness
+
+If the flaky test is an isolation test, first read the Postgres docs on dealing
+with [race conditions in isolation tests][pg-isolation-docs]. A common example
+was already listed above, but the Postgres docs list some other types too and
+explain how to make their output consistent.
+
+[pg-isolation-docs]: https://github.com/postgres/postgres/blob/c68a1839902daeb42cf1ebc89edfdd91c00e5091/src/test/isolation/README#L152
+
+
+## Ruling out common sources of randomness as the cause
+
+If it's none of the above, the reason why the test is flaky is probably not
+immediately obvious. There are a few things that can introduce randomness into
+our test suite. To keep your sanity while investigating, it's good to rule these
+out as the cause (or, even better, determine that they are the cause).
+
+### Don't run the test in parallel with others
+
+Check the schedule to see whether the test runs in parallel with other tests. If
+it does, remove it from that group and check if it's still flaky.
+
+**Example**
+```diff
+ test: multi_partitioning_utils replicated_partitioned_table
+-test: multi_partitioning partitioning_issue_3970
++test: multi_partitioning
++test: partitioning_issue_3970
+ test: drop_partitioned_table
+```
+
+### Use a fixed number of connections
+
+The adaptive executor of Citus sometimes opens extra connections to run work in
+parallel and speed up multi-shard queries. This happens especially in CI, because
+CI machines are sometimes slow. There are two ways to get a consistent number of
+connections:
+
+1. Use `citus.max_adaptive_executor_pool_size` to limit the number of
+   connections.
+2. Use `citus.force_max_query_parallelization` to always open the maximum number
+   of connections.
+
+**Example**
+```diff
+ ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
++ERROR: canceling the transaction since it was involved in a distributed deadlock
+```
+
+**Example of fix 1**
+```diff
++SET citus.max_adaptive_executor_pool_size TO 1;
+ ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
++RESET citus.max_adaptive_executor_pool_size;
+```
+
+**Example of fix 2**
+```diff
++SET citus.force_max_query_parallelization TO 1;
+ ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
++RESET citus.force_max_query_parallelization;
+```
+
+IMPORTANT: If this helps, it could very well indicate a bug. Check with
+senior/principal engineers whether it's expected that this helps in this case.
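+
+For reference, here are the two fixes pulled together into one self-contained
+sketch (the `ALTER TABLE` statement is just the example from above; both GUCs
+are set with plain session-level `SET`s and reset afterwards):
+
+```sql
+-- Fix 1: cap the executor at one connection per worker for the flaky statement.
+SET citus.max_adaptive_executor_pool_size TO 1;
+ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
+RESET citus.max_adaptive_executor_pool_size;
+
+-- Fix 2: or always open the maximum number of connections instead.
+SET citus.force_max_query_parallelization TO on;
+ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
+RESET citus.force_max_query_parallelization;
+```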
+
+
+## What to do if this all doesn't work?
+
+If none of the advice above worked, the first thing to try is to read the failing
+test in detail and try to understand how it works. Often, with a bit of thinking,
+you can figure out why it's failing the way it is. If you cannot figure it out
+yourself, it's good to ask senior/principal engineers; maybe they can think of
+the reason. Or maybe they're certain that it's an actual bug.
+
+### What to do when you cannot fix or find the bug?
+
+If it turns out to be an actual bug in Citus, but fixing the bug (or finding its
+cause) is hard, making the test output consistent is already an improvement over
+the status quo. Be sure to create an issue for the bug, though. Even if you're
+not entirely sure what's causing it, you can still create an issue describing
+how to reproduce the flakiness.
+
+
+## What to do if output can never be consistent?
+
+There are still a few ways to make our test suite less flaky, even if you
+figured out that the output that Postgres gives can never be made consistent.
+
+### Normalizing random output
+
+If for some reason you cannot make the output consistent, then our
+[`normalize.sed`][normalize] might come to the rescue. This allows us to
+normalize certain lines to one specific output.
+
+**Example**
+```diff
+-CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
++CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s.
+```
+
+**Fix by changing inconsistent parts of line**
+```sed
+# ignore timing statistics for VACUUM VERBOSE
+s/CPU: user: .*s, system: .*s, elapsed: .*s/CPU: user: X.XX s, system: X.XX s, elapsed: X.XX s/
+```
+
+**Fix by completely removing line**
+```sed
+# ignore timing statistics for VACUUM VERBOSE
+/CPU: user: .*s, system: .*s, elapsed: .*s/d
+```
+
+[normalize]:
+https://github.com/citusdata/citus/blob/main/src/test/regress/bin/normalize.sed
+
+### Removing the flaky test
+
+Sometimes removing the test is the only way to make our test suite less flaky.
+Of course this is a last resort, but sometimes it's what we want. If running the
+test does more harm than good, removing it will be a net positive.