Mitigate segfault in connection statemachine (#4551)

As described in the code comment, we have observed crashes in production due to a segfault caused by dereferencing a NULL pointer in our connection state machine. As a mitigation that prevents system crashes, we now raise an error with a short explanation of the issue. Unfortunately, we cannot yet reproduce the case reliably, so no tests are added.

DESCRIPTION: Prevent segfaults when SAVEPOINT handling cannot recover from connection failures

(cherry picked from commit d127516dc8)
2021-01-25 15:55:04 +01:00 · 2021-01-25 15:55:04 +01:00 · 2efeed412a
parent 49ce36fe8b
commit 2efeed412a
1 changed file with 19 additions and 0 deletions
--- a/src/backend/distributed/executor/adaptive_executor.c
+++ b/src/backend/distributed/executor/adaptive_executor.c
@@ -3297,6 +3297,25 @@ TransactionStateMachine(WorkerSession *session)
 			case REMOTE_TRANS_SENT_COMMAND:
 			{
 				TaskPlacementExecution *placementExecution = session->currentTask;
+				if (placementExecution == NULL)
+				{
+					/*
+					 * We have seen accounts in production where the placementExecution
+					 * could inadvertently be not set. Investigation documented on
+					 * https://github.com/citusdata/citus-enterprise/issues/493
+					 * (due to sensitive data in the initial report it is not discussed
+					 * in our community repository)
+					 *
+					 * Currently we don't have a reliable way of reproducing this issue.
+					 * Erroring here seems to be a more desirable approach compared to a
+					 * SEGFAULT on the dereference of placementExecution, with a possible
+					 * crash recovery as a result.
+					 */
+					ereport(ERROR, (errmsg(
+										"unable to recover from inconsistent state in "
+										"the connection state machine on coordinator")));
+				}
+
 				ShardCommandExecution *shardCommandExecution =
 					placementExecution->shardCommandExecution;
 				Task *task = shardCommandExecution->task;
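
The guard added in this hunk can be illustrated outside of PostgreSQL with a minimal, standalone sketch. Everything here is an assumption made for illustration: the structs are stripped-down stand-ins for the real executor types, and `HandleSentCommand` and `StepResult` are hypothetical names that do not exist in the Citus source. Because `ereport` is not available outside the server, the sketch returns an error code where the patch raises an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the executor's structs. */
typedef struct Task { int taskId; } Task;
typedef struct ShardCommandExecution { Task *task; } ShardCommandExecution;
typedef struct TaskPlacementExecution
{
	ShardCommandExecution *shardCommandExecution;
} TaskPlacementExecution;
typedef struct WorkerSession { TaskPlacementExecution *currentTask; } WorkerSession;

/* Hypothetical result code; the real patch raises an error via ereport. */
typedef enum { STEP_OK = 0, STEP_INCONSISTENT_STATE = 1 } StepResult;

/*
 * Mirrors the shape of the REMOTE_TRANS_SENT_COMMAND branch: check that
 * currentTask is set before following the pointer chain to the task.
 */
static StepResult
HandleSentCommand(const WorkerSession *session, int *taskIdOut)
{
	assert(session != NULL);
	assert(taskIdOut != NULL);

	TaskPlacementExecution *placementExecution = session->currentTask;
	if (placementExecution == NULL)
	{
		/* Report the problem instead of dereferencing a NULL pointer. */
		fprintf(stderr, "unable to recover from inconsistent state in "
				"the connection state machine\n");
		return STEP_INCONSISTENT_STATE;
	}

	ShardCommandExecution *shardCommandExecution =
		placementExecution->shardCommandExecution;
	*taskIdOut = shardCommandExecution->task->taskId;
	return STEP_OK;
}
```

One difference from the patch is worth noting: in PostgreSQL, `ereport(ERROR, ...)` does not return to its caller and aborts the current transaction, so the code after the guard never runs with a NULL `placementExecution`. The sketch gets the same effect with an early return, and the caller has to check the result.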