mirror of https://github.com/citusdata/citus.git
Add docs on how to fix flaky tests (#6438)
I fixed a lot of flaky tests recently and I found some patterns in the type of issues and type of fixes. This adds a document that lists these types of issues and explains how to fix them.pull/6442/head
parent
e87eda6496
commit
f756db39c4
|
@ -218,3 +218,7 @@ style `#include` statements like this:
|
|||
|
||||
Any other SQL you can put directly in the main sql file, e.g.
|
||||
`src/backend/distributed/sql/citus--8.3-1--9.0-1.sql`.
|
||||
|
||||
### Running tests
|
||||
|
||||
See [`src/test/regress/README.md`](https://github.com/citusdata/citus/blob/master/src/test/regress/README.md)
|
||||
|
|
|
@ -99,3 +99,8 @@ To automatically setup a citus cluster in tests we use our
|
|||
`src/test/regress/pg_regress_multi.pl` script. This sets up a citus cluster and
|
||||
then starts the standard postgres test tooling. You almost never have to change
|
||||
this file.
|
||||
|
||||
## Randomly failing tests
|
||||
|
||||
In CI sometimes a test fails randomly, we call these tests "flaky". To fix these
|
||||
flaky tests see [`src/test/regress/flaky_tests.md`](https://github.com/citusdata/citus/blob/master/src/test/regress/mitmscripts/README.md)
|
||||
|
|
|
@ -0,0 +1,323 @@
|
|||
# How to fix flaky tests
|
||||
|
||||
Flaky tests happen when for some reason our tests return non-deterministic
|
||||
results.
|
||||
|
||||
There are three different causes of flaky tests:
|
||||
1. Tests that don't make sure output is consistent, i.e. a bug in our tests
|
||||
2. Bugs in our testing infrastructure
|
||||
3. Bugs in Citus itself
|
||||
|
||||
All of these impact the happiness and productivity of our developers, because we
|
||||
have to rerun tests to make them pass. But apart from developer happiness and
|
||||
productivity, 3 also impacts our users, and by ignoring flaky tests we can miss
|
||||
problems that our users could run into. This reduces the effectiveness of our
|
||||
tests.
|
||||
|
||||
|
||||
## Reproducing a flaky test
|
||||
|
||||
Before trying to fix the flakyness, it's important that you can reproduce the
|
||||
flaky test. Often it only reproduces in CI, so we have a CI job that can help
|
||||
you reproduce flakyness consistently by running the same test a lot of times.
|
||||
You can configure CI to run this job by setting the `flaky_test` and if
|
||||
necessary the possibly the `flaky_test_make` parameters.
|
||||
|
||||
```diff
|
||||
flaky_test:
|
||||
type: string
|
||||
- default: ''
|
||||
+ default: 'isolation_shard_rebalancer_progress'
|
||||
flaky_test_make:
|
||||
type: string
|
||||
- default: check-minimal
|
||||
+ default: check-isolation-base
|
||||
```
|
||||
|
||||
Once you get this job to consistently fail in CI, you can continue with the next
|
||||
steps to make it instead consistently pass. If the failure doesn't reproduce
|
||||
with this CI job, it's almost certainly caused by running it concurrently with
|
||||
other tests. See the "Don't run test in parallel with others" section below on
|
||||
how to fix that.
|
||||
|
||||
|
||||
## Easy fixes
|
||||
|
||||
The following types of issues all fall within the category 1: bugs in our tests.
|
||||
|
||||
### Expected records but different order
|
||||
|
||||
**Issue**: A query returns the right result, but they are in a different order
|
||||
than expected by the output.
|
||||
|
||||
**Fix**: Add an extra column to the ORDER BY clause of the query to make the
|
||||
output consistent
|
||||
|
||||
**Example**
|
||||
```diff
|
||||
8970008 | colocated_dist_table | -2147483648 | 2147483647 | localhost | 57637
|
||||
8970009 | colocated_partitioned_table | -2147483648 | 2147483647 | localhost | 57637
|
||||
8970010 | colocated_partitioned_table_2020_01_01 | -2147483648 | 2147483647 | localhost | 57637
|
||||
- 8970011 | reference_table | | | localhost | 57637
|
||||
8970011 | reference_table | | | localhost | 57638
|
||||
+ 8970011 | reference_table | | | localhost | 57637
|
||||
(13 rows)
|
||||
```
|
||||
|
||||
**Example fix**:
|
||||
|
||||
```diff
|
||||
-ORDER BY logicalrelid, shardminvalue::BIGINT;
|
||||
+ORDER BY logicalrelid, shardminvalue::BIGINT, nodeport;
|
||||
```
|
||||
|
||||
### Expected logs but different order
|
||||
|
||||
**Issue**: The logs in the regress output are displayed in a different order
|
||||
than what the output file shows
|
||||
|
||||
**Fix**: It's simple: don't log these things during the test. There are two common
|
||||
ways of achieving this:
|
||||
1. If you don't care about the logs for this query at all, then you can change
|
||||
the log `VERBOSITY` or lower `client_min_messages`.
|
||||
2. If these are logs of uninteresting commands created by
|
||||
`citus.log_remote_commands`, but you care about some of the other remote
|
||||
commands being as expected, then you can use `citus.grep_remote_commands` to
|
||||
only display the commands that you care about.
|
||||
|
||||
**Example of issue 1**:
|
||||
```diff
|
||||
select alter_table_set_access_method('ref','heap');
|
||||
NOTICE: creating a new table for alter_table_set_access_method.ref
|
||||
NOTICE: moving the data of alter_table_set_access_method.ref
|
||||
NOTICE: dropping the old alter_table_set_access_method.ref
|
||||
NOTICE: drop cascades to 2 other objects
|
||||
-DETAIL: drop cascades to materialized view m_ref
|
||||
-drop cascades to view v_ref
|
||||
+DETAIL: drop cascades to view v_ref
|
||||
+drop cascades to materialized view m_ref
|
||||
CONTEXT: SQL statement "DROP TABLE alter_table_set_access_method.ref CASCADE"
|
||||
NOTICE: renaming the new table to alter_table_set_access_method.ref
|
||||
alter_table_set_access_method
|
||||
-------------------------------
|
||||
|
||||
(1 row)
|
||||
```
|
||||
|
||||
**Example fix of issue 1**
|
||||
```diff
|
||||
+\set VERBOSITY terse
|
||||
```
|
||||
|
||||
**Example of issue 2**
|
||||
```diff
|
||||
SET citus.log_remote_commands TO ON;
|
||||
-- should propagate to all workers because no table is specified
|
||||
ANALYZE;
|
||||
NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
|
||||
DETAIL: on server postgres@localhost:57637 connectionId: 1
|
||||
NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
|
||||
DETAIL: on server postgres@localhost:57638 connectionId: 2
|
||||
NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
|
||||
DETAIL: on server postgres@localhost:57637 connectionId: 1
|
||||
-NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
|
||||
-DETAIL: on server postgres@localhost:xxxxx connectionId: xxxxxxx
|
||||
NOTICE: issuing ANALYZE
|
||||
DETAIL: on server postgres@localhost:57637 connectionId: 1
|
||||
+NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
|
||||
+DETAIL: on server postgres@localhost:57638 connectionId: 2
|
||||
NOTICE: issuing ANALYZE
|
||||
DETAIL: on server postgres@localhost:57638 connectionId: 2
|
||||
```
|
||||
|
||||
**Example fix of issue 2**
|
||||
```diff
|
||||
SET citus.log_remote_commands TO ON;
|
||||
+SET citus.grep_remote_commands = '%ANALYZE%';
|
||||
```
|
||||
|
||||
### Isolation test completes in different order
|
||||
|
||||
**Issue**: There's no defined order in which the steps in two different sessions
|
||||
complete, because they don't block each other. This can happen when two sessions
|
||||
were both blocked by a third session, but when the third session releases the
|
||||
lock the first two can both continue.
|
||||
|
||||
**Fix**: Use the isolation test ["marker" feature][marker-feature] to make sure
|
||||
one step can only complete after another has completed.
|
||||
|
||||
[marker-feature]: https://github.com/postgres/postgres/blob/c68a1839902daeb42cf1ebc89edfdd91c00e5091/src/test/isolation/README#L163-L188
|
||||
|
||||
|
||||
**Example**
|
||||
```diff
|
||||
-step s1-shard-move-c1-block-writes: <... completed>
|
||||
+step s4-shard-move-sep-block-writes: <... completed>
|
||||
citus_move_shard_placement
|
||||
--------------------------
|
||||
|
||||
(1 row)
|
||||
|
||||
-step s4-shard-move-sep-block-writes: <... completed>
|
||||
+step s1-shard-move-c1-block-writes: <... completed>
|
||||
citus_move_shard_placement
|
||||
--------------------------
|
||||
```
|
||||
|
||||
**Example fix**
|
||||
```diff
|
||||
+permutation ... "s1-shard-move-c1-block-writes" "s4-shard-move-sep-block-writes" ...
|
||||
+permutation ... "s1-shard-move-c1-block-writes" "s4-shard-move-sep-block-writes"("s1-shard-move-c1-block-writes") ...
|
||||
```
|
||||
|
||||
### Disk size numbers are not exactly like expected
|
||||
|
||||
**Issue**: In some tests we show the disk size of a table, but due to various
|
||||
postgres background processes such as vacuuming these sizes can change slightly.
|
||||
|
||||
**Fix**: Expect a certain range of disk sizes instead of a specific one.
|
||||
|
||||
**Example**
|
||||
```diff
|
||||
VACUUM (INDEX_CLEANUP ON, PARALLEL 1) local_vacuum_table;
|
||||
SELECT pg_size_pretty( pg_total_relation_size('local_vacuum_table') );
|
||||
pg_size_pretty
|
||||
----------------
|
||||
- 21 MB
|
||||
+ 22 MB
|
||||
(1 row)
|
||||
```
|
||||
|
||||
**Example fix**
|
||||
```diff
|
||||
-SELECT pg_size_pretty( pg_total_relation_size('local_vacuum_table') );
|
||||
- pg_size_pretty
|
||||
+SELECT CASE WHEN s BETWEEN 20000000 AND 25000000 THEN 22500000 ELSE s END
|
||||
+FROM pg_total_relation_size('local_vacuum_table') s ;
|
||||
+ s
|
||||
---------------------------------------------------------------------
|
||||
- 21 MB
|
||||
+ 22500000
|
||||
```
|
||||
|
||||
|
||||
## Isolation test flakyness
|
||||
|
||||
If the flaky test is an isolation test, first read the Postgres docs on dealing
|
||||
with [race conditions in isolation tests][pg-isolation-docs]. A common example
|
||||
was already listed above, but the Postgres docs list some other types too and
|
||||
explain how to make their output consistent.
|
||||
|
||||
[pg-isolation-docs]: https://github.com/postgres/postgres/blob/c68a1839902daeb42cf1ebc89edfdd91c00e5091/src/test/isolation/README#L152
|
||||
|
||||
|
||||
## Ruling out common sources of randomness as the cause
|
||||
|
||||
If it's none of the above, then probably the reason why the test is flaky is not
|
||||
immediately obvious. There are a few things that can introduce randomness into
|
||||
our test suite. To keep your sanity while investigating, it's good to rule these
|
||||
out as the cause (or even better determine that they are the cause).
|
||||
|
||||
### Don't run test in parallel with others
|
||||
|
||||
Check in the schedule if the test is run in parallel with others. If it is,
|
||||
remove it from there and check if it's still flaky.
|
||||
|
||||
**Example**
|
||||
```diff
|
||||
test: multi_partitioning_utils replicated_partitioned_table
|
||||
-test: multi_partitioning partitioning_issue_3970
|
||||
+test: multi_partitioning
|
||||
+test: partitioning_issue_3970
|
||||
test: drop_partitioned_table
|
||||
```
|
||||
|
||||
### Use a fixed number of connections
|
||||
|
||||
The adaptive executor of Citus sometimes opens extra connections to do stuff in
|
||||
parallel to speed up multi-shard queries. This happens especially in CI, because
|
||||
CI machines are sometimes slow. There are two ways to get a consistent number of
|
||||
connections:
|
||||
|
||||
1. Use `citus.max_adaptive_executor_pool_size` to limit the connections
|
||||
2. Use `citus.force_max_query_parallelization` to always open the maximum number
|
||||
of connections.
|
||||
|
||||
**Example**
|
||||
```diff
|
||||
ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
|
||||
+ERROR: canceling the transaction since it was involved in a distributed deadlock
|
||||
```
|
||||
|
||||
**Example of fix 1**
|
||||
```diff
|
||||
+SET citus.max_adaptive_executor_pool_size TO 1;
|
||||
ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
|
||||
+RESET citus.max_adaptive_executor_pool_size;
|
||||
```
|
||||
|
||||
**Example of fix 2**
|
||||
```diff
|
||||
+SET citus.force_max_query_parallelization TO 1;
|
||||
ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
|
||||
+RESET citus.force_max_query_parallelization;
|
||||
```
|
||||
|
||||
IMPORTANT: If this helps, this could very well indicate a bug. Check with
|
||||
senior/principal engineers if it's expected that it helps in this case.
|
||||
|
||||
|
||||
## What to do if this all doesn't work?
|
||||
|
||||
If none of the advice above worked, the first thing to try is read the failing
|
||||
test in detail and try to understand how it works. Often, with a bit of thinking
|
||||
you can figure out why it's failing in the way that it's failing. If you cannot
|
||||
figure it out yourself, it's good to ask senior/principal engineers, maybe they
|
||||
can think of the reason. Or maybe they're certain that it's an actual bug.
|
||||
|
||||
### What to do when you cannot fix or find the bug?
|
||||
|
||||
If it turns out to be an actual bug in Citus, but fixing the bug (or finding its
|
||||
cause) is hard, making the test output consistent is already an improvement over
|
||||
the status quo. Be sure to create an issue though for the bug. Even if you're
|
||||
not entirely sure what's causing it you can still create an issue describing how
|
||||
to reproduce the flakiness.
|
||||
|
||||
|
||||
## What to do if output can never be consistent?
|
||||
|
||||
There are still a few ways to make our test suite less flaky, even if you
|
||||
figured out that the output that Postgres gives can never be made consistent.
|
||||
|
||||
### Normalizing random output
|
||||
|
||||
If for some reason you cannot make consistent output then our
|
||||
[`normalize.sed`][normalize] might come to the rescue. This allows us to
|
||||
normalize certain lines to one specific output.
|
||||
|
||||
**Example**
|
||||
```diff
|
||||
-CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
|
||||
+CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s.
|
||||
```
|
||||
|
||||
**Fix by changing inconsistent parts of line**
|
||||
```sed
|
||||
# ignore timing statistics for VACUUM VERBOSE
|
||||
s/CPU: user: .*s, system: .*s, elapsed: .*s/CPU: user: X.XX s, system: X.XX s, elapsed: X.XX s/
|
||||
```
|
||||
|
||||
**Fix by completely removing line**
|
||||
```sed
|
||||
# ignore timing statistics for VACUUM VERBOSE
|
||||
/CPU: user: .*s, system: .*s, elapsed: .*s/d
|
||||
```
|
||||
|
||||
[normalize]:
|
||||
https://github.com/citusdata/citus/blob/main/src/test/regress/bin/normalize.sed
|
||||
|
||||
### Removing the flaky test
|
||||
|
||||
Sometimes removing the test is the only way to make our test suite less flaky.
|
||||
Of course this is a last resort, but sometimes it's what we want. If running the
|
||||
test does more bad than good, removing will be a net positive.
|
Loading…
Reference in New Issue