Fix race condition in citus_set_coordinator_host when adding multiple coordinator nodes concurrently (#7682)

When multiple sessions concurrently attempt to add the same coordinator
node using `citus_set_coordinator_host`, there is a potential race
condition. Both sessions may pass the initial metadata check
(`isCoordinatorInMetadata`), but only one will succeed in adding the
node. The other session will fail with an assertion error
(`Assert(!nodeAlreadyExists)`), causing the server to crash. Even though
the `AddNodeMetadata` function takes an exclusive lock, it appears that
the lock is not preventing the race condition before the initial
metadata check.

- **Issue**: The current logic allows concurrent sessions to pass the
check for existing coordinators, leading to an attempt to insert
duplicate nodes, which triggers the assertion failure.

- **Impact**: This race condition leads to crashes during operations
that involve concurrent coordinator additions, as seen in
https://github.com/citusdata/citus/issues/7646.

**Test Plan:**

- Isolation Test Limitation: An isolation test was added to simulate
concurrent additions of the same coordinator node, but due to the
behavior of PostgreSQL locking mechanisms, the test does not trigger the
edge case. The lock applied within the function serializes the
operations, preventing the race condition from occurring in the
isolation test environment.
While the edge case is difficult to reproduce in an isolation test, the
fix addresses the core issue by ensuring concurrency control through
proper locking.

- Existing Tests: All existing tests related to node metadata and
coordinator management have been run to ensure that no regressions were
introduced.

**After the Fix:**

- Concurrent attempts to add the same coordinator node will be
serialized. One session will succeed in adding the node, while the
others will skip the operation without crashing the server.

Co-authored-by: Mehmet YILMAZ <mehmet.yilmaz@microsoft.com>
(cherry picked from commit 4775715691)
pull/7841/head
Mehmet YILMAZ 2024-09-09 17:09:56 +03:00 committed by naisila
parent a115b3043a
commit 063cff908e
1 changed files with 3 additions and 0 deletions

View File

@ -217,6 +217,9 @@ citus_set_coordinator_host(PG_FUNCTION_ARGS)
EnsureTransactionalMetadataSyncMode();
}
/* prevent concurrent modification */
LockRelationOid(DistNodeRelationId(), RowExclusiveLock);
bool isCoordinatorInMetadata = false;
WorkerNode *coordinatorNode = PrimaryNodeForGroup(COORDINATOR_GROUP_ID,
&isCoordinatorInMetadata);