citus

Fix flaky failure_distributed_results (#7307)

Sometimes in CI we run into this failure:

```diff
   SELECT resultId, nodeport, rowcount, targetShardId, targetShardIndex
   FROM partition_task_list_results('test', $$ SELECT * FROM source_table $$, 'target_table')
           NATURAL JOIN pg_dist_node;
-WARNING:  connection to the remote node localhost:xxxxx failed with the following error: connection not open
+ERROR:  connection to the remote node localhost:9060 failed with the following error: connection not open
 SELECT * FROM distributed_result_info ORDER BY resultId;
-       resultid        | nodeport | rowcount | targetshardid | targetshardindex
----------------------------------------------------------------------
- test_from_100800_to_0 |     9060 |       22 |        100805 |                0
- test_from_100801_to_0 |    57637 |        2 |        100805 |                0
- test_from_100801_to_1 |    57637 |       15 |        100806 |                1
- test_from_100802_to_1 |    57637 |       10 |        100806 |                1
- test_from_100802_to_2 |    57637 |        5 |        100807 |                2
- test_from_100803_to_2 |    57637 |       18 |        100807 |                2
- test_from_100803_to_3 |    57637 |        4 |        100808 |                3
- test_from_100804_to_3 |     9060 |       24 |        100808 |                3
-(8 rows)
-
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
 -- fetch from worker 2 should fail
 SAVEPOINT s1;
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
 SELECT fetch_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[], 'localhost', :worker_2_port) > 0 AS fetched;
-ERROR:  could not open file "base/pgsql_job_cache/xx_x_xxx/test_from_100802_to_1.data": No such file or directory
-CONTEXT:  while executing command on localhost:xxxxx
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
 ROLLBACK TO SAVEPOINT s1;
+ERROR:  savepoint "s1" does not exist
 -- fetch from worker 1 should succeed
 SELECT fetch_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[], 'localhost', :worker_1_port) > 0 AS fetched;
- fetched
----------------------------------------------------------------------
- t
-(1 row)
-
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
 -- make sure the results read are same as the previous transaction block
 SELECT count(*), sum(x) FROM
   read_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[],'binary') AS res (x int);
- count | sum
----------------------------------------------------------------------
-    15 | 863
-(1 row)
-
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
 ROLLBACk;
```

As outlined in issue #7306, which I opened, this failure is related to
having only a single connection open to the node. Finding and fixing the
root cause is not trivial, so this PR instead works around the bug by
forcing maximum parallelism. Ideally this workaround would not be
necessary, but removing it requires spending time on a proper fix. For
now, having a less flaky CI is good enough.
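As a rough illustration of the workaround (not part of this commit), the same setting can be scoped to a single transaction with `SET LOCAL` instead of the session-wide `SET` used in the test file. The transaction body here is a placeholder, and the comment about connection behavior reflects my reading of the Citus documentation rather than anything verified in this PR.

```sql
-- Sketch only: uses the same GUC this commit sets in the test file,
-- but limits it to one transaction.
BEGIN;

-- Assumption: with this enabled, Citus opens as many connections as it can
-- for a multi-shard query, which avoids the single-connection case from #7306.
SET LOCAL citus.force_max_query_parallelization TO true;

-- Confirm the value that is active inside this transaction.
SHOW citus.force_max_query_parallelization;

-- ... statements under test would go here (placeholder) ...

ROLLBACK;  -- SET LOCAL is undone when the transaction ends
```

The test in this commit uses a plain `SET` because the whole file should run with the workaround enabled; the transaction-scoped form is only useful when a single block needs it.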

Jelte Fennema-Nio

2023-11-02 13:31:56 +01:00

committed by

GitHub

parent b47c8b3fb0

commit f171ec98fc

2 changed files with 4 additions and 0 deletions

src/test/regress/expected/failure_distributed_results.out

@@ -14,6 +14,8 @@ SELECT citus.mitmproxy('conn.allow()');
 (1 row)
 SET citus.next_shard_id TO 100800;
+-- Needed because of issue #7306
+SET citus.force_max_query_parallelization TO true;
 -- always try the 1st replica before the 2nd replica.
 SET citus.task_assignment_policy TO 'first-replica';
 --

									
src/test/regress/sql/failure_distributed_results.sql

@@ -15,6 +15,8 @@ SET client_min_messages TO WARNING;
 SELECT citus.mitmproxy('conn.allow()');
 SET citus.next_shard_id TO 100800;
+-- Needed because of issue #7306
+SET citus.force_max_query_parallelization TO true;
 -- always try the 1st replica before the 2nd replica.
 SET citus.task_assignment_policy TO 'first-replica';
