From 77dd49fcf8285cb4e8aad691f341a148c201400b Mon Sep 17 00:00:00 2001 From: Jelte Fennema Date: Fri, 26 Aug 2022 12:49:45 +0200 Subject: [PATCH] Fix flakyness in failure_online_move_shard_placement (#6250) Sometimes in CI failure_online_move_shard_placement fails with the following error: ```diff SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").cancel(' || :pid || ')'); mitmproxy ----------- (1 row) SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port); -ERROR: canceling statement due to user request +ERROR: tuple concurrently updated +CONTEXT: while executing command on localhost:9060 -- failure on polling subscription state ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26441/workflows/dd6e3475-6121-47b3-aea3-4ac92be114f4/jobs/751476/steps This error is not completely harmless, because based on the logs it mean that our cleanup logic failed, which in turn means that replication slots are left around: ``` 2022-08-24 16:01:29.247 UTC [1219] ERROR: XX000: tuple concurrently updated 2022-08-24 16:01:29.247 UTC [1219] LOCATION: simple_heap_update, heapam.c:4179 2022-08-24 16:01:29.247 UTC [1219] STATEMENT: ALTER SUBSCRIPTION citus_shard_move_subscription_10 DISABLE ``` However, we have other mechanisms to clean up any leftovers in case of a failed cleanup. So it's not that big of a problem. The reason we run into this error is arguably because of a Postgres bug, so I created a patch for Postgres that fixes this. While we wait for this (or a similar) patch to be merged, this PR disables the flaky test. There's still a test that tests in case of a connection "kill" instead of a "cancel", so I don't think we lose very important coverage by disabling this test. When trying to reproduce this I only hit this issue in the cancel case, so I don't think there's a need to disable the kill case for now. --- .../expected/failure_online_move_shard_placement.out | 12 ++++-------- .../sql/failure_online_move_shard_placement.sql | 6 ++++-- 2 files changed, 8 insertions(+), 10 deletions(-) diff --git a/src/test/regress/expected/failure_online_move_shard_placement.out b/src/test/regress/expected/failure_online_move_shard_placement.out index 777a6e59e..0ea70a2eb 100644 --- a/src/test/regress/expected/failure_online_move_shard_placement.out +++ b/src/test/regress/expected/failure_online_move_shard_placement.out @@ -110,14 +110,10 @@ SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost' ERROR: connection not open CONTEXT: while executing command on localhost:xxxxx -- failure when enabling the subscriptions -SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").cancel(' || :pid || ')'); - mitmproxy ---------------------------------------------------------------------- - -(1 row) - -SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port); -ERROR: canceling statement due to user request +-- This test can be enabled again once this postgres bug is fixed: +-- https://www.postgresql.org/message-id/flat/HE1PR8303MB0075BF78AF1BE904050DA16BF7729%40HE1PR8303MB0075.EURPRD83.prod.outlook.com +-- SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").cancel(' || :pid || ')'); +-- SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port); -- failure on polling subscription state SELECT citus.mitmproxy('conn.onQuery(query="^SELECT count\(\*\) FROM pg_subscription_rel").kill()'); mitmproxy diff --git a/src/test/regress/sql/failure_online_move_shard_placement.sql b/src/test/regress/sql/failure_online_move_shard_placement.sql index eb8d68e20..282a2895c 100644 --- a/src/test/regress/sql/failure_online_move_shard_placement.sql +++ b/src/test/regress/sql/failure_online_move_shard_placement.sql @@ -60,8 +60,10 @@ SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").kill SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port); -- failure when enabling the subscriptions -SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").cancel(' || :pid || ')'); -SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port); +-- This test can be enabled again once this postgres bug is fixed: +-- https://www.postgresql.org/message-id/flat/HE1PR8303MB0075BF78AF1BE904050DA16BF7729%40HE1PR8303MB0075.EURPRD83.prod.outlook.com +-- SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").cancel(' || :pid || ')'); +-- SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port); -- failure on polling subscription state SELECT citus.mitmproxy('conn.onQuery(query="^SELECT count\(\*\) FROM pg_subscription_rel").kill()');