Fix a segfault caused by use after free in ConnectionsPlacementHash (#5170)

DESCRIPTION: Fix a segfault caused by use after free in ConnectionsPlacementHash

Fix a segfault caused by stale data being retained in the hashmaps that make up the Placement Connection Management.

We have seen production systems segfault due to random data referenced from ConnectionPlacementHash.
On investigation we found that the backends segfaulting this way had hit OOM errors shortly before the segfault.
Investigation showed there are at least 15 places where an allocation can fail with OOM in a way that causes ConnectionPlacementHash to retain pointers into memory contexts that are subsequently freed. Forcing such a failure reproduces the segfault we observed in production.

Conditions for these allocations are:
 - allocated after first call to `AssociatePlacementWithShard`: https://github.com/citusdata/citus/blob/v10.0.3/src/backend/distributed/connection/placement_connection.c#L880-L881
 - allocated before `StartNodeUserDatabaseConnection`: https://github.com/citusdata/citus/blob/v10.0.3/src/backend/distributed/connection/connection_management.c#L291

In a primary key lookup on a reference table (the workload where we saw an OOM cause a segfault moments later), at least 15 such fallible allocations sit between these two call sites.

Instead of leaving references in ConnectionPlacementHash, ConnectionShardHash and ColocatedPlacementsHash that could retain pointers freed when the TopTransactionContext is reset, we now clear all these hashes regardless of the state of CurrentCoordinatedTransactionState.

The downside is that on any transaction abort we now iterate through 4 hashmaps and clear their contents. Given that they are either already empty (a quick iteration) or non-empty (and would cause segfaults in subsequent executions), this overhead seems reasonable.

A better solution would be to move the creation of these hashmaps so they live in the TopTransactionContext themselves, assuming their contents never outlive a transaction. That needs more investigation and an involved refactor, hence this quick fix here.
Nils Dijk 2021-08-17 17:42:35 +02:00 committed by GitHub
parent 4f213f293e
commit dfc950ce1e
1 changed file with 11 additions and 6 deletions


@@ -328,12 +328,17 @@ CoordinatedTransactionCallback(XactEvent event, void *arg)
 			CoordinatedRemoteTransactionsAbort();
 		}
-		/* close connections etc. */
-		if (CurrentCoordinatedTransactionState != COORD_TRANS_NONE)
-		{
-			ResetPlacementConnectionManagement();
-			AfterXactConnectionHandling(false);
-		}
+		/*
+		 * Close connections etc. Contrary to a successful transaction we reset
+		 * the placement connection management regardless of the state of the
+		 * state machine as recorded in CurrentCoordinatedTransactionState.
+		 * The hashmaps recording the connection management live in a memory
+		 * context higher than most of the data referenced from them. This
+		 * causes use-after-free errors when their contents are retained due to
+		 * an error raised before CurrentCoordinatedTransactionState changed.
+		 */
+		ResetPlacementConnectionManagement();
+		AfterXactConnectionHandling(false);
 		ResetGlobalVariables();