mirror of https://github.com/citusdata/citus.git

Add docs on how to fix flaky tests (#6438)

I fixed a lot of flaky tests recently and found some patterns in the types of
issues and the types of fixes. This adds a document that lists these types of
issues and explains how to fix them.
@ -218,3 +218,7 @@ style `#include` statements like this:

Any other SQL you can put directly in the main sql file, e.g.
`src/backend/distributed/sql/citus--8.3-1--9.0-1.sql`.

### Running tests

See [`src/test/regress/README.md`](https://github.com/citusdata/citus/blob/master/src/test/regress/README.md)
@ -99,3 +99,8 @@ To automatically setup a citus cluster in tests we use our

`src/test/regress/pg_regress_multi.pl` script. This sets up a citus cluster and
then starts the standard postgres test tooling. You almost never have to change
this file.

## Randomly failing tests

In CI sometimes a test fails randomly; we call these tests "flaky". To fix these
flaky tests see [`src/test/regress/flaky_tests.md`](https://github.com/citusdata/citus/blob/master/src/test/regress/flaky_tests.md)
@ -0,0 +1,323 @@

# How to fix flaky tests

Flaky tests happen when for some reason our tests return non-deterministic
results.

There are three different causes of flaky tests:
1. Tests that don't make sure output is consistent, i.e. a bug in our tests
2. Bugs in our testing infrastructure
3. Bugs in Citus itself

All of these impact the happiness and productivity of our developers, because we
have to rerun tests to make them pass. But apart from developer happiness and
productivity, cause 3 also impacts our users, and by ignoring flaky tests we can
miss problems that our users could run into. This reduces the effectiveness of
our tests.

## Reproducing a flaky test

Before trying to fix the flakiness, it's important that you can reproduce the
flaky test. Often it only reproduces in CI, so we have a CI job that can help
you reproduce flakiness consistently by running the same test a lot of times.
You can configure CI to run this job by setting the `flaky_test` and, if
necessary, the `flaky_test_make` parameters.

```diff
 flaky_test:
   type: string
-  default: ''
+  default: 'isolation_shard_rebalancer_progress'
 flaky_test_make:
   type: string
-  default: check-minimal
+  default: check-isolation-base
```

Once you get this job to consistently fail in CI, you can continue with the next
steps to make it consistently pass instead. If the failure doesn't reproduce
with this CI job, it's almost certainly caused by running the test concurrently
with other tests. See the "Don't run test in parallel with others" section below
on how to fix that.
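
Locally, the same rerun-until-it-fails approach can be sketched as a simple
loop. This is only a sketch: `run_test` is a stand-in name for however you
invoke the failing test (in practice you would shell out to the regress
tooling, e.g. a `check-minimal` style make target), and the simulated failure
is there only so the loop has something to catch.

```python
# Rerun a (stand-in) test until it fails, to surface flakiness locally.
# run_test is a placeholder; in practice you would shell out to the
# regress tooling, e.g. subprocess.run(["make", "check-minimal", ...]).
def run_test(attempt: int) -> bool:
    # Simulated flaky test: fails on the 3rd attempt so the loop has
    # something to catch in this sketch.
    return attempt != 3

failed_at = 0
for attempt in range(1, 21):
    if not run_test(attempt):
        failed_at = attempt
        break

print(f"failed at attempt {failed_at}")
```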

## Easy fixes

The following types of issues all fall within category 1: bugs in our tests.

### Expected records but different order

**Issue**: A query returns the right rows, but they are in a different order
than in the expected output.

**Fix**: Add an extra column to the ORDER BY clause of the query to make the
output consistent.

**Example**
```diff
  8970008 | colocated_dist_table                   | -2147483648 | 2147483647 | localhost | 57637
  8970009 | colocated_partitioned_table            | -2147483648 | 2147483647 | localhost | 57637
  8970010 | colocated_partitioned_table_2020_01_01 | -2147483648 | 2147483647 | localhost | 57637
- 8970011 | reference_table                        |             |            | localhost | 57637
  8970011 | reference_table                        |             |            | localhost | 57638
+ 8970011 | reference_table                        |             |            | localhost | 57637
 (13 rows)
```

**Example fix**:

```diff
-ORDER BY logicalrelid, shardminvalue::BIGINT;
+ORDER BY logicalrelid, shardminvalue::BIGINT, nodeport;
```
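
The underlying problem is easy to demonstrate outside SQL too. A minimal Python
sketch (the table name and ports are taken from the example above):

```python
# Rows that tie on the first sort key can legitimately come back in either
# order, which is exactly what makes the test flaky. Adding a tie-breaking
# key (nodeport here) makes the ordering fully deterministic.
rows = [
    ("reference_table", 57638),
    ("reference_table", 57637),
]

# "ORDER BY logicalrelid" alone: the tie is left in whatever order the rows
# arrived in (Python's sort is stable, so ties keep input order).
by_name = sorted(rows, key=lambda r: r[0])

# "ORDER BY logicalrelid, nodeport": deterministic regardless of input order.
by_name_port = sorted(rows, key=lambda r: (r[0], r[1]))
```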

### Expected logs but different order

**Issue**: The logs in the regress output are displayed in a different order
than what the output file shows.

**Fix**: It's simple: don't log these things during the test. There are two
common ways of achieving this:
1. If you don't care about the logs for this query at all, then you can change
   the log `VERBOSITY` or raise `client_min_messages` so these messages are
   suppressed.
2. If these are logs of uninteresting commands created by
   `citus.log_remote_commands`, but you care about some of the other remote
   commands being as expected, then you can use `citus.grep_remote_commands` to
   only display the commands that you care about.

**Example of issue 1**:
```diff
 select alter_table_set_access_method('ref','heap');
 NOTICE: creating a new table for alter_table_set_access_method.ref
 NOTICE: moving the data of alter_table_set_access_method.ref
 NOTICE: dropping the old alter_table_set_access_method.ref
 NOTICE: drop cascades to 2 other objects
-DETAIL: drop cascades to materialized view m_ref
-drop cascades to view v_ref
+DETAIL: drop cascades to view v_ref
+drop cascades to materialized view m_ref
 CONTEXT: SQL statement "DROP TABLE alter_table_set_access_method.ref CASCADE"
 NOTICE: renaming the new table to alter_table_set_access_method.ref
  alter_table_set_access_method
 -------------------------------

 (1 row)
```

**Example fix of issue 1**
```diff
+\set VERBOSITY terse
```

**Example of issue 2**
```diff
 SET citus.log_remote_commands TO ON;
 -- should propagate to all workers because no table is specified
 ANALYZE;
 NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
 DETAIL: on server postgres@localhost:57637 connectionId: 1
 NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
 DETAIL: on server postgres@localhost:57638 connectionId: 2
 NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
 DETAIL: on server postgres@localhost:57637 connectionId: 1
-NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
-DETAIL: on server postgres@localhost:xxxxx connectionId: xxxxxxx
 NOTICE: issuing ANALYZE
 DETAIL: on server postgres@localhost:57637 connectionId: 1
+NOTICE: issuing SET citus.enable_ddl_propagation TO 'off'
+DETAIL: on server postgres@localhost:57638 connectionId: 2
 NOTICE: issuing ANALYZE
 DETAIL: on server postgres@localhost:57638 connectionId: 2
```

**Example fix of issue 2**
```diff
 SET citus.log_remote_commands TO ON;
+SET citus.grep_remote_commands = '%ANALYZE%';
```

### Isolation test completes in different order

**Issue**: There's no defined order in which the steps in two different sessions
complete, because they don't block each other. This can happen when two sessions
were both blocked by a third session, but when the third session releases the
lock the first two can both continue.

**Fix**: Use the isolation test ["marker" feature][marker-feature] to make sure
one step can only complete after another has completed.

[marker-feature]: https://github.com/postgres/postgres/blob/c68a1839902daeb42cf1ebc89edfdd91c00e5091/src/test/isolation/README#L163-L188

**Example**
```diff
-step s1-shard-move-c1-block-writes: <... completed>
+step s4-shard-move-sep-block-writes: <... completed>
 citus_move_shard_placement
 --------------------------

 (1 row)

-step s4-shard-move-sep-block-writes: <... completed>
+step s1-shard-move-c1-block-writes: <... completed>
 citus_move_shard_placement
 --------------------------
```

**Example fix**
```diff
-permutation ... "s1-shard-move-c1-block-writes" "s4-shard-move-sep-block-writes" ...
+permutation ... "s1-shard-move-c1-block-writes" "s4-shard-move-sep-block-writes"("s1-shard-move-c1-block-writes") ...
```

### Disk size numbers are not exactly like expected

**Issue**: In some tests we show the disk size of a table, but due to various
Postgres background processes such as vacuuming these sizes can change slightly.

**Fix**: Expect a certain range of disk sizes instead of a specific one.

**Example**
```diff
 VACUUM (INDEX_CLEANUP ON, PARALLEL 1) local_vacuum_table;
 SELECT pg_size_pretty( pg_total_relation_size('local_vacuum_table') );
  pg_size_pretty
 ----------------
- 21 MB
+ 22 MB
 (1 row)
```

**Example fix**
```diff
-SELECT pg_size_pretty( pg_total_relation_size('local_vacuum_table') );
- pg_size_pretty
+SELECT CASE WHEN s BETWEEN 20000000 AND 25000000 THEN 22500000 ELSE s END
+FROM pg_total_relation_size('local_vacuum_table') s ;
+ s
 ---------------------------------------------------------------------
- 21 MB
+ 22500000
```
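
The CASE expression above buckets any size inside the expected range to one
representative value, so only a size far outside the range shows up as a
failure. The same idea as a small Python sketch (range bounds copied from the
example above; the helper name is ours):

```python
def normalize_size(size, low=20_000_000, high=25_000_000, bucket=22_500_000):
    """Map any size in the expected range to one representative value,
    mirroring the SQL CASE ... BETWEEN fix above."""
    return bucket if low <= size <= high else size

# Slightly different run-to-run sizes all normalize to the same value...
a = normalize_size(21 * 1024 * 1024)  # 21 MB
b = normalize_size(22 * 1024 * 1024)  # 22 MB
# ...while a size far outside the range is passed through unchanged,
# so a genuine regression would still be visible in the test output.
c = normalize_size(50_000_000)
```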

## Isolation test flakiness

If the flaky test is an isolation test, first read the Postgres docs on dealing
with [race conditions in isolation tests][pg-isolation-docs]. A common example
was already listed above, but the Postgres docs list some other types too and
explain how to make their output consistent.

[pg-isolation-docs]: https://github.com/postgres/postgres/blob/c68a1839902daeb42cf1ebc89edfdd91c00e5091/src/test/isolation/README#L152

## Ruling out common sources of randomness as the cause

If it's none of the above, then the reason why the test is flaky is probably not
immediately obvious. There are a few things that can introduce randomness into
our test suite. To keep your sanity while investigating, it's good to rule these
out as the cause (or, even better, determine that they are the cause).

### Don't run test in parallel with others

Check in the schedule whether the test is run in parallel with others. If it is,
remove it from there and check if it's still flaky.

**Example**
```diff
 test: multi_partitioning_utils replicated_partitioned_table
-test: multi_partitioning partitioning_issue_3970
+test: multi_partitioning
+test: partitioning_issue_3970
 test: drop_partitioned_table
```

### Use a fixed number of connections

The adaptive executor of Citus sometimes opens extra connections to do work in
parallel to speed up multi-shard queries. This happens especially in CI, because
CI machines are sometimes slow. There are two ways to get a consistent number of
connections:

1. Use `citus.max_adaptive_executor_pool_size` to limit the connections
2. Use `citus.force_max_query_parallelization` to always open the maximum number
   of connections.

**Example**
```diff
 ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
+ERROR: canceling the transaction since it was involved in a distributed deadlock
```

**Example of fix 1**
```diff
+SET citus.max_adaptive_executor_pool_size TO 1;
 ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
+RESET citus.max_adaptive_executor_pool_size;
```

**Example of fix 2**
```diff
+SET citus.force_max_query_parallelization TO 1;
 ALTER TABLE dist_partitioned_table ADD CONSTRAINT constraint1 UNIQUE (dist_col, partition_col);
+RESET citus.force_max_query_parallelization;
```

IMPORTANT: If this helps, it could very well indicate a bug. Check with
senior/principal engineers whether it's expected that it helps in this case.

## What to do if this all doesn't work?

If none of the advice above worked, the first thing to try is to read the
failing test in detail and try to understand how it works. Often, with a bit of
thinking, you can figure out why it's failing in the way that it's failing. If
you cannot figure it out yourself, it's good to ask senior/principal engineers;
maybe they can think of the reason, or maybe they're certain that it's an actual
bug.

### What to do when you cannot fix or find the bug?

If it turns out to be an actual bug in Citus, but fixing the bug (or finding its
cause) is hard, making the test output consistent is already an improvement over
the status quo. Be sure to create an issue for the bug, though. Even if you're
not entirely sure what's causing it, you can still create an issue describing
how to reproduce the flakiness.

## What to do if output can never be consistent?

There are still a few ways to make our test suite less flaky, even if you
figured out that the output that Postgres gives can never be made consistent.

### Normalizing random output

If for some reason you cannot make the output consistent, then our
[`normalize.sed`][normalize] might come to the rescue. This allows us to
normalize certain lines to one specific output.

**Example**
```diff
-CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
+CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s.
```

**Fix by changing inconsistent parts of the line**
```sed
# ignore timing statistics for VACUUM VERBOSE
s/CPU: user: .*s, system: .*s, elapsed: .*s/CPU: user: X.XX s, system: X.XX s, elapsed: X.XX s/
```

**Fix by completely removing the line**
```sed
# ignore timing statistics for VACUUM VERBOSE
/CPU: user: .*s, system: .*s, elapsed: .*s/d
```
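
The effect of the substitution rule is easy to check outside of sed. A quick
Python equivalent, reusing the same regex from the rule above:

```python
import re

# Mirror of the sed substitution: collapse the varying VACUUM VERBOSE timing
# statistics into one fixed line, so regress output becomes deterministic.
pattern = r"CPU: user: .*s, system: .*s, elapsed: .*s"
replacement = "CPU: user: X.XX s, system: X.XX s, elapsed: X.XX s"

# Two runs that differ only in the elapsed time they report.
line_a = "CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s."
line_b = "CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s."

norm_a = re.sub(pattern, replacement, line_a)
norm_b = re.sub(pattern, replacement, line_b)
# Both lines normalize to the same output, so the diff disappears.
```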

[normalize]: https://github.com/citusdata/citus/blob/main/src/test/regress/bin/normalize.sed

### Removing the flaky test

Sometimes removing the test is the only way to make our test suite less flaky.
Of course this is a last resort, but sometimes it's what we want. If running the
test does more harm than good, removing it will be a net positive.