Fix flakyness in failure_online_move_shard_placement (#6250)

Sometimes in CI failure_online_move_shard_placement fails with the
following error:
```diff
 SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").cancel(' || :pid || ')');
  mitmproxy
 -----------

 (1 row)

 SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port);
-ERROR:  canceling statement due to user request
+ERROR:  tuple concurrently updated
+CONTEXT:  while executing command on localhost:9060
 -- failure on polling subscription state
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26441/workflows/dd6e3475-6121-47b3-aea3-4ac92be114f4/jobs/751476/steps

This error is not completely harmless, because based on the logs it mean
that our cleanup logic failed, which in turn means that replication
slots are left around:
```
2022-08-24 16:01:29.247 UTC [1219] ERROR:  XX000: tuple concurrently updated
2022-08-24 16:01:29.247 UTC [1219] LOCATION:  simple_heap_update, heapam.c:4179
2022-08-24 16:01:29.247 UTC [1219] STATEMENT:  ALTER SUBSCRIPTION citus_shard_move_subscription_10 DISABLE
```

However, we have other mechanisms to clean up any leftovers in case of a
failed cleanup. So it's not that big of a problem.

The reason we run into this error is arguably because of a Postgres bug,
so I created a patch for Postgres that fixes this.

While we wait for this (or a similar) patch to be merged, this PR
disables the flaky test. There's still a test that tests in case of a
connection "kill" instead of a "cancel", so I don't think we lose very
important coverage by disabling this test. When trying to reproduce this
I only hit this issue in the cancel case, so I don't think there's a
need to disable the kill case for now.
pull/6248/head
Jelte Fennema 2022-08-26 12:49:45 +02:00 committed by GitHub
parent 2a0c0b3ba6
commit 77dd49fcf8
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 8 additions and 10 deletions

View File

@ -110,14 +110,10 @@ SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost'
ERROR: connection not open
CONTEXT: while executing command on localhost:xxxxx
-- failure when enabling the subscriptions
SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").cancel(' || :pid || ')');
mitmproxy
---------------------------------------------------------------------
(1 row)
SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port);
ERROR: canceling statement due to user request
-- This test can be enabled again once this postgres bug is fixed:
-- https://www.postgresql.org/message-id/flat/HE1PR8303MB0075BF78AF1BE904050DA16BF7729%40HE1PR8303MB0075.EURPRD83.prod.outlook.com
-- SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").cancel(' || :pid || ')');
-- SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port);
-- failure on polling subscription state
SELECT citus.mitmproxy('conn.onQuery(query="^SELECT count\(\*\) FROM pg_subscription_rel").kill()');
mitmproxy

View File

@ -60,8 +60,10 @@ SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").kill
SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port);
-- failure when enabling the subscriptions
SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").cancel(' || :pid || ')');
SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port);
-- This test can be enabled again once this postgres bug is fixed:
-- https://www.postgresql.org/message-id/flat/HE1PR8303MB0075BF78AF1BE904050DA16BF7729%40HE1PR8303MB0075.EURPRD83.prod.outlook.com
-- SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").cancel(' || :pid || ')');
-- SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port);
-- failure on polling subscription state
SELECT citus.mitmproxy('conn.onQuery(query="^SELECT count\(\*\) FROM pg_subscription_rel").kill()');