/*-------------------------------------------------------------------------
 *
 * adaptive_executor.c
 *
 * The adaptive executor executes a list of tasks (queries on shards) over
 * a connection pool per worker node. The results of the queries, if any,
 * are written to a tuple store.
 *
 * The concepts in the executor are modelled in a set of structs:
 *
 * - DistributedExecution:
 *     Execution of a Task list over a set of WorkerPools.
 * - WorkerPool
 *     Pool of WorkerSessions for the same worker which opportunistically
 *     executes "unassigned" tasks from a queue.
 * - WorkerSession:
 *     Connection to a worker that is used to execute "assigned" tasks
 *     from a queue and may execute unassigned tasks from the WorkerPool.
 * - ShardCommandExecution:
 *     Execution of a Task across a list of placements.
 * - TaskPlacementExecution:
 *     Execution of a Task on a specific placement.
 *     Used in the WorkerPool and WorkerSession queues.
 *
 * Every connection pool (WorkerPool) and every connection (WorkerSession)
 * has a queue of tasks that are ready to execute (readyTaskQueue) and a
 * queue/set of pending tasks that may become ready later in the execution
 * (pendingTaskQueue). The tasks are wrapped in a ShardCommandExecution,
 * which keeps track of the state of execution and is referenced from a
 * TaskPlacementExecution, which is the data structure that is actually
 * added to the queues and describes the state of the execution of a task
 * on a particular worker node.
 *
 * When the task list is part of a bigger distributed transaction, the
 * shards that are accessed or modified by the task may have already been
 * accessed earlier in the transaction. We need to make sure we use the
 * same connection since it may hold relevant locks or have uncommitted
 * writes. In that case we "assign" the task to a connection by adding
 * it to the task queue of a specific connection (in
 * AssignTasksToConnectionsOrWorkerPool). Otherwise we consider the task
 * unassigned and add it to the task queue of a worker pool, which means
 * that it can be executed over any connection in the pool.
 *
 * A task may be executed on multiple placements in case of a reference
 * table or a replicated distributed table. Depending on the type of
 * task, it may not be ready to be executed on a worker node immediately.
 * For instance, INSERTs on a reference table are executed serially across
 * placements to avoid deadlocks when concurrent INSERTs take conflicting
 * locks. At the beginning, only the "first" placement is ready to execute
 * and therefore added to the readyTaskQueue in the pool or connection.
 * The remaining placements are added to the pendingTaskQueue. Once
 * execution on the first placement is done, the second placement moves
 * from pendingTaskQueue to readyTaskQueue. The same approach is used to
 * fail over read-only tasks to another placement.
 *
 * Once all the tasks are added to a queue, the main loop in
 * RunDistributedExecution repeatedly does the following:
 *
 * For each pool:
 * - ManageWorkerPool evaluates whether to open additional connections
 *   based on the number of unassigned tasks that are ready to execute
 *   and the targetPoolSize of the execution.
 *
 * Poll all connections:
 * - We use a WaitEventSet that contains all (non-failed) connections
 *   and is rebuilt whenever the set of active connections or any of
 *   their wait flags change.
 *
 *   We almost always check for WL_SOCKET_READABLE because a session
 *   can emit notices at any time during execution, but it will only
 *   wake up WaitEventSetWait when there are actual bytes to read.
 *
 *   We check for WL_SOCKET_WRITEABLE just after sending bytes in case
 *   there is not enough space in the TCP buffer. Since a socket is
 *   almost always writable, we also use WL_SOCKET_WRITEABLE as a
 *   mechanism to wake up WaitEventSetWait for non-I/O events, e.g.
 *   when a task moves from pending to ready.
 *
 * For each connection that is ready:
 * - ConnectionStateMachine handles connection establishment and failure
 *   as well as command execution via TransactionStateMachine.
 *
 * When a connection is ready to execute a new task, it first checks its
 * own readyTaskQueue and otherwise takes a task from the worker pool's
 * readyTaskQueue (on a first-come-first-serve basis).
 *
 * In cases where the tasks finish quickly (e.g. <1ms), a single
 * connection will often be sufficient to finish all tasks. It is
 * therefore not necessary that all connections are established
 * successfully or open a transaction (which may be blocked by an
 * intermediate pgbouncer in transaction pooling mode). It is therefore
 * essential that we take a task from the queue only after opening a
 * transaction block.
 *
 * When a command on a worker finishes or the connection is lost, we call
 * PlacementExecutionDone, which then updates the state of the task
 * based on whether we need to run it on other placements. When a
 * connection fails or all connections to a worker fail, we also call
 * PlacementExecutionDone for all queued tasks to try the next placement
 * and, if necessary, mark shard placements as inactive. If a task fails
 * to execute on all placements, the execution fails and the distributed
 * transaction rolls back.
 *
 * For multi-row INSERTs, tasks are executed sequentially by
 * SequentialRunDistributedExecution instead of in parallel, which allows
 * a high degree of concurrency without high risk of deadlocks.
 * Conversely, multi-row UPDATE/DELETE/DDL commands take aggressive locks,
 * which forbid concurrency but allow parallelism without high risk
 * of deadlocks. Note that this is unrelated to SEQUENTIAL_CONNECTION,
 * which indicates that we should use at most one connection per node, but
 * can run tasks in parallel across nodes. This is used when there are
 * writes to a reference table that has foreign keys from a distributed
 * table.
 *
 * Execution finishes when all tasks are done, the query errors out, or
 * the user cancels the query.
 *
 *-------------------------------------------------------------------------
 */

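/*
 * Illustrative sketch (not part of the original source) of the main loop
 * described above, with error handling, connection timeouts and slow start
 * omitted. The helper names match the static functions declared further
 * below; the actual loop in RunDistributedExecution differs in detail.
 *
 *   while (execution->unfinishedTaskCount > 0)
 *   {
 *       WorkerPool *workerPool = NULL;
 *       foreach_ptr(workerPool, execution->workerList)
 *       {
 *           ManageWorkerPool(workerPool);   // possibly open more connections
 *       }
 *
 *       // (re)build the wait event set if connections or wait flags changed,
 *       // then wait for I/O on any of the sessions
 *       int eventCount = WaitEventSetWait(execution->waitEventSet, timeout,
 *                                         events, eventSetSize, 0);
 *
 *       for (int eventIndex = 0; eventIndex < eventCount; eventIndex++)
 *       {
 *           WorkerSession *session = (WorkerSession *) events[eventIndex].user_data;
 *           ConnectionStateMachine(session);  // drives TransactionStateMachine
 *       }
 *   }
 */
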
#include "postgres.h"
#include "funcapi.h"
#include "libpq-fe.h"
#include "miscadmin.h"
#include "pgstat.h"

#include <sys/stat.h>
#include <unistd.h>

#include "access/transam.h"
#include "access/xact.h"
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
#include "commands/schemacmds.h"
#include "distributed/adaptive_executor.h"
#include "distributed/cancel_utils.h"
#include "distributed/citus_custom_scan.h"
#include "distributed/citus_safe_lib.h"
#include "distributed/connection_management.h"
#include "distributed/commands/multi_copy.h"
#include "distributed/deparse_shard_query.h"
#include "distributed/shared_connection_stats.h"
#include "distributed/distributed_execution_locks.h"
#include "distributed/listutils.h"
#include "distributed/local_executor.h"
#include "distributed/multi_client_executor.h"
#include "distributed/multi_executor.h"
#include "distributed/multi_explain.h"
#include "distributed/multi_partitioning_utils.h"
#include "distributed/multi_physical_planner.h"
#include "distributed/multi_server_executor.h"
#include "distributed/placement_access.h"
#include "distributed/placement_connection.h"
#include "distributed/relation_access_tracking.h"
#include "distributed/remote_commands.h"
#include "distributed/repartition_join_execution.h"
#include "distributed/resource_lock.h"
#include "distributed/subplan_execution.h"
#include "distributed/transaction_management.h"
#include "distributed/tuple_destination.h"
#include "distributed/version_compat.h"
#include "distributed/worker_protocol.h"
#include "lib/ilist.h"
#include "portability/instr_time.h"
#include "storage/fd.h"
#include "storage/latch.h"
#include "utils/builtins.h"
#include "utils/int8.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/syscache.h"
#include "utils/timestamp.h"

#define SLOW_START_DISABLED 0
#define WAIT_EVENT_SET_INDEX_NOT_INITIALIZED -1
#define WAIT_EVENT_SET_INDEX_FAILED -2


/*
 * DistributedExecution represents the execution of a distributed query
 * plan.
 */
typedef struct DistributedExecution
{
    /* the corresponding distributed plan's modLevel */
    RowModifyLevel modLevel;

    /*
     * remoteAndLocalTaskList contains all the tasks required to finish the
     * execution. remoteTaskList contains all the tasks required to
     * finish the remote execution. localTaskList contains all the
     * local tasks required to finish the local execution.
     *
     * remoteAndLocalTaskList is the union of remoteTaskList and localTaskList.
     */
    List *remoteAndLocalTaskList;
    List *remoteTaskList;
    List *localTaskList;

    /*
     * If a task specific destination is not provided for a task, then use
     * defaultTupleDest.
     */
    TupleDestination *defaultTupleDest;

    /* Parameters for parameterized plans. Can be NULL. */
    ParamListInfo paramListInfo;

    /* list of workers involved in the execution */
    List *workerList;

    /* list of all connections used for distributed execution */
    List *sessionList;

    /*
     * Flag to indicate that the set of connections we are interested
     * in has changed and waitEventSet needs to be rebuilt.
     */
    bool rebuildWaitEventSet;

    /*
     * Flag to indicate that the set of wait events we are interested
     * in might have changed and waitEventSet needs to be updated.
     *
     * Note that we set this flag whenever we assign a value to waitFlags,
     * but we don't check that the waitFlags is actually different from the
     * previous value. So we might have some false positives for this flag,
     * which is OK, because in this case ModifyWaitEvent() is a noop.
     */
    bool waitFlagsChanged;

    /*
     * WaitEventSet used for waiting for I/O events.
     *
     * This could also be local to RunDistributedExecution(), but in that case
     * we would have to mark it as "volatile" to avoid PG_TRY()/PG_CATCH() issues,
     * and cast it to non-volatile when doing WaitEventSetFree(). We thought that
     * would make the code a bit harder to read than making this non-local, so we
     * move it here. See comments for PG_TRY() in postgres/src/include/elog.h
     * and "man 3 siglongjmp" for more context.
     */
    WaitEventSet *waitEventSet;

    /*
     * The number of connections we aim to open per worker.
     *
     * If there are no more tasks to assign, the actual number may be lower.
     * If there are already more connections, the actual number may be higher.
     */
    int targetPoolSize;

    /* total number of tasks to execute */
    int totalTaskCount;

    /* number of tasks that still need to be executed */
    int unfinishedTaskCount;

    /*
     * Flag to indicate whether throwing errors on cancellation is
     * allowed.
     */
    bool raiseInterrupts;

    /* transactional properties of the current execution */
    TransactionProperties *transactionProperties;

    /* indicates whether distributed execution has failed */
    bool failed;

    /*
     * For SELECT commands or INSERT/UPDATE/DELETE commands with RETURNING,
     * the total number of rows received from the workers. For
     * INSERT/UPDATE/DELETE commands without RETURNING, the total number of
     * tuples modified.
     *
     * Note that for replicated tables (e.g., reference tables), we only consider
     * a single replica's rows that are processed.
     */
    uint64 rowsProcessed;

    /*
     * The following fields are used while receiving results from remote nodes.
     * We store this information here to avoid re-allocating it every time.
     *
     * The columnArray field is reset/calculated per row, so it might be useless
     * for other contexts. The benefit of keeping it here is to avoid allocating
     * the array over and over again.
     */
    uint32 allocatedColumnCount;
    void **columnArray;
    StringInfoData *stringInfoDataArray;

    /*
     * jobIdList contains all jobs in the job tree; this is used to
     * do cleanup for repartition queries.
     */
    List *jobIdList;
} DistributedExecution;


/*
 * WorkerPoolFailureState indicates the current state of the
 * pool.
 */
typedef enum WorkerPoolFailureState
{
    /* safe to continue execution */
    WORKER_POOL_NOT_FAILED,

    /* if a pool fails, the execution fails */
    WORKER_POOL_FAILED,

    /*
     * The remote execution over the pool failed, but we failed over
     * to local execution and can still finish the execution.
     */
    WORKER_POOL_FAILED_OVER_TO_LOCAL
} WorkerPoolFailureState;

/*
 * WorkerPool represents a pool of sessions on the same worker.
 *
 * A WorkerPool has two queues containing the TaskPlacementExecutions that need
 * to be executed on the worker.
 *
 * TaskPlacementExecutions that are ready to execute are in readyTaskQueue.
 * TaskPlacementExecutions that may need to be executed once execution on
 * another worker finishes or fails are in pendingTaskQueue.
 *
 * In TransactionStateMachine, the sessions opportunistically take
 * TaskPlacementExecutions from the readyQueue when they are ready and have no
 * assigned tasks.
 *
 * We track connection timeouts per WorkerPool. When the first connection is
 * established we set the poolStartTime, and if no connection can be established
 * before NodeConnectionTimeout, the WorkerPool fails. There is some specialised
 * logic in case citus.force_max_query_parallelization is enabled because we
 * may fail to establish a connection per placement after already establishing
 * some connections earlier in the execution.
 *
 * A WorkerPool fails if all connection attempts failed or all connections
 * are lost. In that case, all TaskPlacementExecutions in the queues are
 * marked as failed in PlacementExecutionDone, which typically causes the
 * task and therefore the distributed execution to fail. In case of a
 * replicated table or a SELECT on a reference table, the remaining placements
 * will be tried by moving them from a pendingTaskQueue to a readyTaskQueue.
 */
typedef struct WorkerPool
{
    /* distributed execution in which the worker participates */
    DistributedExecution *distributedExecution;

    /* worker node on which we have a pool of sessions */
    char *nodeName;
    int nodePort;

    /* all sessions on the worker that are part of the current execution */
    List *sessionList;

    /* number of connections that were established */
    int activeConnectionCount;

    /*
     * Keep track of how many connections are ready for execution, in
     * order to (efficiently) know whether more connections to the worker
     * are needed.
     */
    int idleConnectionCount;

    /* number of connections that did not send a command */
    int unusedConnectionCount;

    /* number of failed connections */
    int failedConnectionCount;

    /*
     * Placement executions destined for the worker node, but not assigned to any
     * connection and not yet ready to start (depends on other placement
     * executions).
     */
    dlist_head pendingTaskQueue;

    /*
     * Placement executions destined for the worker node, but not assigned to any
     * connection and ready to start.
     */
    dlist_head readyTaskQueue;
    int readyTaskCount;

    /*
     * We keep this for enforcing the connection timeouts. In our definition, a pool
     * starts when the first connection establishment starts.
     */
    instr_time poolStartTime;

    /* indicates whether to check for the connection timeout */
    bool checkForPoolTimeout;

    /* last time we opened a connection */
    instr_time lastConnectionOpenTime;

    /* maximum number of connections we are allowed to open at once */
    uint32 maxNewConnectionsPerCycle;

    /*
     * Set to true if the pool is to the local node. We use this value to
     * avoid re-calculating it often.
     */
    bool poolToLocalNode;

    /*
     * This is only set in the WorkerPoolFailed() function. Once a pool fails, we
     * do not use it anymore.
     */
    WorkerPoolFailureState failureState;
} WorkerPool;

struct TaskPlacementExecution;

/*
 * WorkerSession represents a session on a worker that can execute tasks
 * (sequentially) and is part of a WorkerPool.
 *
 * Each WorkerSession has two queues containing TaskPlacementExecutions that
 * need to be executed within this particular session because the session
 * accessed the same or co-located placements earlier in the transaction.
 *
 * TaskPlacementExecutions that are ready to execute are in readyTaskQueue.
 * TaskPlacementExecutions that may need to be executed once execution on
 * another worker finishes or fails are in pendingTaskQueue.
 */
typedef struct WorkerSession
{
    /* only useful for debugging */
    uint64 sessionId;

    /* worker pool of which this session is part */
    WorkerPool *workerPool;

    /* connection over which the session is established */
    MultiConnection *connection;

    /* tasks that need to be executed on this connection, but are not ready to start */
    dlist_head pendingTaskQueue;

    /* tasks that need to be executed on this connection and are ready to start */
    dlist_head readyTaskQueue;

    /* task the worker should work on or NULL */
    struct TaskPlacementExecution *currentTask;

    /*
     * The number of commands sent to the worker over the session. Excludes
     * distributed transaction related commands such as BEGIN/COMMIT etc.
     */
    uint64 commandsSent;

    /* index in the wait event set */
    int waitEventSetIndex;

    /* events reported by the latest call to WaitEventSetWait */
    int latestUnconsumedWaitEvents;
} WorkerSession;


struct TaskPlacementExecution;

/* GUC, determining whether Citus opens 1 connection per task */
bool ForceMaxQueryParallelization = false;
int MaxAdaptiveExecutorPoolSize = 16;
bool EnableBinaryProtocol = false;

/* GUC, number of ms to wait between opening connections to the same worker */
int ExecutorSlowStartInterval = 10;

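/*
 * The variables above are set through Citus GUCs: ForceMaxQueryParallelization
 * via citus.force_max_query_parallelization (referenced in the WorkerPool
 * comment), and presumably MaxAdaptiveExecutorPoolSize, EnableBinaryProtocol
 * and ExecutorSlowStartInterval via citus.max_adaptive_executor_pool_size,
 * citus.enable_binary_protocol and citus.executor_slow_start_interval. The
 * GUC registration lives outside this file, so those last names are assumed
 * here rather than confirmed by this file.
 */
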
/*
 * TaskExecutionState indicates whether or not a command on a shard
 * has finished, or whether it has failed.
 */
typedef enum TaskExecutionState
{
    TASK_EXECUTION_NOT_FINISHED,
    TASK_EXECUTION_FINISHED,
    TASK_EXECUTION_FAILED,
    TASK_EXECUTION_FAILOVER_TO_LOCAL_EXECUTION
} TaskExecutionState;

/*
 * PlacementExecutionOrder indicates whether a command should be executed
 * on any replica, on all replicas sequentially (in order), or on all
 * replicas in parallel.
 */
typedef enum PlacementExecutionOrder
{
    EXECUTION_ORDER_ANY,
    EXECUTION_ORDER_SEQUENTIAL,
    EXECUTION_ORDER_PARALLEL,
} PlacementExecutionOrder;


/*
 * ShardCommandExecution represents an execution of a command on a shard
 * that may (need to) run across multiple placements.
 */
typedef struct ShardCommandExecution
{
    /* description of the task */
    Task *task;

    /* cached AttInMetadata for task */
    AttInMetadata **attributeInputMetadata;

    /*
     * indicates whether the attributeInputMetadata has binary or text
     * encoding/decoding functions
     */
    bool binaryResults;

    /* order in which the command should be replicated on replicas */
    PlacementExecutionOrder executionOrder;

    /* executions of the command on the placements of the shard */
    struct TaskPlacementExecution **placementExecutions;
    int placementExecutionCount;

    /*
     * RETURNING results from other shard placements can be ignored
     * after we got results from the first placements.
     */
    bool gotResults;

    TaskExecutionState executionState;
} ShardCommandExecution;

/*
 * TaskPlacementExecutionState indicates whether a command is running
 * on a shard placement, or finished or failed.
 */
typedef enum TaskPlacementExecutionState
{
    PLACEMENT_EXECUTION_NOT_READY,
    PLACEMENT_EXECUTION_READY,
    PLACEMENT_EXECUTION_RUNNING,
    PLACEMENT_EXECUTION_FINISHED,
    PLACEMENT_EXECUTION_FAILOVER_TO_LOCAL_EXECUTION,
    PLACEMENT_EXECUTION_FAILED
} TaskPlacementExecutionState;

/*
 * TaskPlacementExecution represents the execution of a command
 * on a shard placement.
 */
typedef struct TaskPlacementExecution
{
    /* shard command execution of which this placement execution is part */
    ShardCommandExecution *shardCommandExecution;

    /* shard placement on which this command runs */
    ShardPlacement *shardPlacement;

    /* state of the execution of the command on the placement */
    TaskPlacementExecutionState executionState;

    /*
     * A task query can contain multiple queries. queryIndex tracks which
     * query's results we are waiting for.
     */
    uint32 queryIndex;

    /* worker pool on which the placement needs to be executed */
    WorkerPool *workerPool;

    /* the session the placement execution is assigned to or NULL */
    WorkerSession *assignedSession;

    /* membership in assigned task queue of a particular session */
    dlist_node sessionPendingQueueNode;

    /* membership in ready-to-start assigned task queue of a particular session */
    dlist_node sessionReadyQueueNode;

    /* membership in assigned task queue of worker */
    dlist_node workerPendingQueueNode;

    /* membership in ready-to-start task queue of worker */
    dlist_node workerReadyQueueNode;

    /* index in array of placement executions in a ShardCommandExecution */
    int placementExecutionIndex;
} TaskPlacementExecution;

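/*
 * Informal sketch of how the structs above relate to each other during one
 * execution (for orientation only; the fields are defined above):
 *
 *   DistributedExecution
 *     -> workerList:   WorkerPool (one per worker node)
 *           -> sessionList:  WorkerSession (one per connection)
 *     -> remoteTaskList:  Task
 *           wrapped in a ShardCommandExecution
 *             -> placementExecutions[]:  TaskPlacementExecution
 *                  (queued in the pending/ready queues of a WorkerPool
 *                   or of a specific WorkerSession)
 */
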
/* local functions */
static DistributedExecution * CreateDistributedExecution(RowModifyLevel modLevel,
    List *taskList, ParamListInfo paramListInfo, int targetPoolSize,
    TupleDestination *defaultTupleDest, TransactionProperties *xactProperties,
    List *jobIdList, bool localExecutionSupported);
static TransactionProperties DecideTransactionPropertiesForTaskList(
    RowModifyLevel modLevel, List *taskList, bool exludeFromTransaction);
static void StartDistributedExecution(DistributedExecution *execution);
static void RunLocalExecution(CitusScanState *scanState, DistributedExecution *execution);
static void RunDistributedExecution(DistributedExecution *execution);
static void SequentialRunDistributedExecution(DistributedExecution *execution);
static void FinishDistributedExecution(DistributedExecution *execution);
static void CleanUpSessions(DistributedExecution *execution);

static void LockPartitionsForDistributedPlan(DistributedPlan *distributedPlan);
static void AcquireExecutorShardLocksForExecution(DistributedExecution *execution);
static bool DistributedExecutionModifiesDatabase(DistributedExecution *execution);
static bool IsMultiShardModification(RowModifyLevel modLevel, List *taskList);
static bool TaskListModifiesDatabase(RowModifyLevel modLevel, List *taskList);
static bool DistributedExecutionRequiresRollback(List *taskList);
static bool TaskListRequires2PC(List *taskList);
static bool SelectForUpdateOnReferenceTable(List *taskList);
static void AssignTasksToConnectionsOrWorkerPool(DistributedExecution *execution);
static void UnclaimAllSessionConnections(List *sessionList);
static PlacementExecutionOrder ExecutionOrderForTask(RowModifyLevel modLevel, Task *task);
static WorkerPool * FindOrCreateWorkerPool(DistributedExecution *execution,
    char *nodeName, int nodePort);
static WorkerSession * FindOrCreateWorkerSession(WorkerPool *workerPool,
    MultiConnection *connection);
static void ManageWorkerPool(WorkerPool *workerPool);
static bool ShouldWaitForSlowStart(WorkerPool *workerPool);
static int CalculateNewConnectionCount(WorkerPool *workerPool);
static void OpenNewConnections(WorkerPool *workerPool, int newConnectionCount,
    TransactionProperties *transactionProperties);
static void CheckConnectionTimeout(WorkerPool *workerPool);
static void MarkEstablishingSessionsTimedOut(WorkerPool *workerPool);
static int UsableConnectionCount(WorkerPool *workerPool);
static long NextEventTimeout(DistributedExecution *execution);
static WaitEventSet * BuildWaitEventSet(List *sessionList);
static void RebuildWaitEventSetFlags(WaitEventSet *waitEventSet, List *sessionList);
static int CitusAddWaitEventSetToSet(WaitEventSet *set, uint32 events, pgsocket fd,
    Latch *latch, void *user_data);
static bool CitusModifyWaitEvent(WaitEventSet *set, int pos, uint32 events,
    Latch *latch);
static TaskPlacementExecution * PopPlacementExecution(WorkerSession *session);
static TaskPlacementExecution * PopAssignedPlacementExecution(WorkerSession *session);
static TaskPlacementExecution * PopUnassignedPlacementExecution(WorkerPool *workerPool);
static bool StartPlacementExecutionOnSession(TaskPlacementExecution *placementExecution,
    WorkerSession *session);
static bool SendNextQuery(TaskPlacementExecution *placementExecution,
    WorkerSession *session);
static void ConnectionStateMachine(WorkerSession *session);
static void HandleMultiConnectionSuccess(WorkerSession *session);
static bool HasAnyConnectionFailure(WorkerPool *workerPool);
static void Activate2PCIfModifyingTransactionExpandsToNewNode(WorkerSession *session);
static bool TransactionModifiedDistributedTable(DistributedExecution *execution);
static void TransactionStateMachine(WorkerSession *session);
static void UpdateConnectionWaitFlags(WorkerSession *session, int waitFlags);
static bool CheckConnectionReady(WorkerSession *session);
static bool ReceiveResults(WorkerSession *session, bool storeRows);
static void WorkerSessionFailed(WorkerSession *session);
static void WorkerPoolFailed(WorkerPool *workerPool);
static void PlacementExecutionDone(TaskPlacementExecution *placementExecution,
    bool succeeded);
static void ScheduleNextPlacementExecution(TaskPlacementExecution *placementExecution,
    bool succeeded);
static bool CanFailoverPlacementExecutionToLocalExecution(
    TaskPlacementExecution *placementExecution);
static bool ShouldMarkPlacementsInvalidOnFailure(DistributedExecution *execution);
static void PlacementExecutionReady(TaskPlacementExecution *placementExecution);
static TaskExecutionState TaskExecutionStateMachine(
    ShardCommandExecution *shardCommandExecution);
static bool HasDependentJobs(Job *mainJob);
static void ExtractParametersForRemoteExecution(ParamListInfo paramListInfo,
    Oid **parameterTypes, const char ***parameterValues);
static int GetEventSetSize(List *sessionList);
static bool ProcessSessionsWithFailedWaitEventSetOperations(
    DistributedExecution *execution);
static int RebuildWaitEventSet(DistributedExecution *execution);
static void ProcessWaitEvents(DistributedExecution *execution, WaitEvent *events,
    int eventCount, bool *cancellationReceived);
static long MillisecondsBetweenTimestamps(instr_time startTime, instr_time endTime);
static HeapTuple BuildTupleFromBytes(AttInMetadata *attinmeta, fmStringInfo *values);
static AttInMetadata * TupleDescGetAttBinaryInMetadata(TupleDesc tupdesc);
static int WorkerPoolCompare(const void *lhsKey, const void *rhsKey);
static void SetAttributeInputMetadata(DistributedExecution *execution,
    ShardCommandExecution *shardCommandExecution);

/*
 * AdaptiveExecutorPreExecutorRun gets called right before postgres starts its
 * executor run. Given that the results of our subplans are needed before the
 * first call to the exec function of our custom scan, we make sure our subplans
 * have executed beforehand.
 */
void
AdaptiveExecutorPreExecutorRun(CitusScanState *scanState)
{
    if (scanState->finishedPreScan)
    {
        /*
         * Cursors (and hence RETURN QUERY syntax in pl/pgsql functions)
         * may trigger AdaptiveExecutorPreExecutorRun() on every fetch
         * operation. However, we should only execute PreScan once.
         */
        return;
    }

    DistributedPlan *distributedPlan = scanState->distributedPlan;

    /*
     * PostgreSQL takes locks on all partitions in the executor. It's not entirely
     * clear why this is necessary (instead of locking the parent during DDL), but
     * we do the same for consistency.
     */
    LockPartitionsForDistributedPlan(distributedPlan);

    ExecuteSubPlans(distributedPlan);

    scanState->finishedPreScan = true;
}


/*
 * AdaptiveExecutor is called via CitusExecScan on the
 * first call of CitusExecScan. The function fills the tupleStore
 * of the input scanState.
 */
TupleTableSlot *
AdaptiveExecutor(CitusScanState *scanState)
{
    TupleTableSlot *resultSlot = NULL;

    DistributedPlan *distributedPlan = scanState->distributedPlan;
    EState *executorState = ScanStateGetExecutorState(scanState);
    ParamListInfo paramListInfo = executorState->es_param_list_info;
    bool randomAccess = true;
    bool interTransactions = false;
    int targetPoolSize = MaxAdaptiveExecutorPoolSize;
    List *jobIdList = NIL;

    Job *job = distributedPlan->workerJob;
    List *taskList = job->taskList;

    /* we should only call this once before the scan finished */
    Assert(!scanState->finishedRemoteScan);

    /* Reset Task fields that are only valid for a single execution */
    ResetExplainAnalyzeData(taskList);

    scanState->tuplestorestate =
        tuplestore_begin_heap(randomAccess, interTransactions, work_mem);

    TupleDesc tupleDescriptor = ScanStateGetTupleDescriptor(scanState);
    TupleDestination *defaultTupleDest =
        CreateTupleStoreTupleDest(scanState->tuplestorestate, tupleDescriptor);

    if (RequestedForExplainAnalyze(scanState))
    {
        /*
         * We use multiple queries per task in EXPLAIN ANALYZE which need to
         * be part of the same transaction.
         */
        UseCoordinatedTransaction();
        taskList = ExplainAnalyzeTaskList(taskList, defaultTupleDest, tupleDescriptor,
                                          paramListInfo);
    }

    bool hasDependentJobs = HasDependentJobs(job);
    if (hasDependentJobs)
    {
        jobIdList = ExecuteDependentTasks(taskList, job);
    }

    if (MultiShardConnectionType == SEQUENTIAL_CONNECTION)
    {
        /* defer decision after ExecuteSubPlans() */
        targetPoolSize = 1;
    }

    TransactionProperties xactProperties = DecideTransactionPropertiesForTaskList(
        distributedPlan->modLevel, taskList, hasDependentJobs);

    bool localExecutionSupported = true;
    DistributedExecution *execution = CreateDistributedExecution(
        distributedPlan->modLevel,
        taskList,
        paramListInfo,
        targetPoolSize,
        defaultTupleDest,
        &xactProperties,
        jobIdList,
        localExecutionSupported);

    /*
     * Make sure that we acquire the appropriate locks even if the local tasks
     * are going to be executed with local execution.
     */
    StartDistributedExecution(execution);

    if (ShouldRunTasksSequentially(execution->remoteTaskList))
    {
        SequentialRunDistributedExecution(execution);
    }
    else
    {
        RunDistributedExecution(execution);
    }

    /* execute tasks local to the node (if any) */
    if (list_length(execution->localTaskList) > 0)
    {
        /* now execute the local tasks */
        RunLocalExecution(scanState, execution);
    }

    CmdType commandType = job->jobQuery->commandType;
    if (commandType != CMD_SELECT)
    {
        executorState->es_processed = execution->rowsProcessed;
    }

    FinishDistributedExecution(execution);

    if (hasDependentJobs)
    {
        DoRepartitionCleanup(jobIdList);
    }

    if (SortReturning && distributedPlan->expectResults && commandType != CMD_SELECT)
    {
        SortTupleStore(scanState);
    }

    return resultSlot;
}


/*
 * HasDependentJobs returns true if there is any dependent job
 * for the given main (top level) job.
 */
static bool
HasDependentJobs(Job *mainJob)
{
    return list_length(mainJob->dependentJobList) > 0;
}


/*
 * RunLocalExecution runs the localTaskList in the execution, fills the tuplestore
 * and sets the es_processed if necessary.
 *
 * It also sorts the tuplestore if there are no remote tasks remaining.
 */
static void
RunLocalExecution(CitusScanState *scanState, DistributedExecution *execution)
{
    EState *estate = ScanStateGetExecutorState(scanState);
    bool isUtilityCommand = false;
    uint64 rowsProcessed = ExecuteLocalTaskListExtended(execution->localTaskList,
                                                        estate->es_param_list_info,
                                                        scanState->distributedPlan,
                                                        execution->defaultTupleDest,
                                                        isUtilityCommand);

    execution->rowsProcessed += rowsProcessed;
}


/*
 * ExecuteUtilityTaskList is a wrapper around executing a task
 * list for utility commands.
 */
uint64
ExecuteUtilityTaskList(List *utilityTaskList, bool localExecutionSupported)
{
    RowModifyLevel modLevel = ROW_MODIFY_NONE;
    ExecutionParams *executionParams = CreateBasicExecutionParams(
        modLevel, utilityTaskList, MaxAdaptiveExecutorPoolSize, localExecutionSupported
        );
    executionParams->xactProperties =
        DecideTransactionPropertiesForTaskList(modLevel, utilityTaskList, false);
    executionParams->isUtilityCommand = true;

    return ExecuteTaskListExtended(executionParams);
}


/*
 * ExecuteUtilityTaskListExtended is a wrapper around executing a task
 * list for utility commands.
 */
uint64
ExecuteUtilityTaskListExtended(List *utilityTaskList, int poolSize,
                               bool localExecutionSupported)
{
    RowModifyLevel modLevel = ROW_MODIFY_NONE;
    ExecutionParams *executionParams = CreateBasicExecutionParams(
        modLevel, utilityTaskList, poolSize, localExecutionSupported
        );

    bool excludeFromXact = false;
    executionParams->xactProperties =
        DecideTransactionPropertiesForTaskList(modLevel, utilityTaskList,
                                               excludeFromXact);
    executionParams->isUtilityCommand = true;

    return ExecuteTaskListExtended(executionParams);
}


/*
 * ExecuteTaskListOutsideTransaction is a proxy to ExecuteTaskListExtended
 * with defaults for some of the arguments.
 */
uint64
ExecuteTaskListOutsideTransaction(RowModifyLevel modLevel, List *taskList,
                                  int targetPoolSize, List *jobIdList)
{
    /*
     * As we are going to run the tasks outside a transaction, we shouldn't use
     * local execution. However, there is a problem with using local execution
     * related to repartition joins; once we solve that problem, we can execute
     * the tasks coming to this path with local execution. See PR:3711
     */
    bool localExecutionSupported = false;
    ExecutionParams *executionParams = CreateBasicExecutionParams(
        modLevel, taskList, targetPoolSize, localExecutionSupported
        );

    executionParams->xactProperties = DecideTransactionPropertiesForTaskList(
        modLevel, taskList, true);
    return ExecuteTaskListExtended(executionParams);
}


/*
 * ExecuteTaskListIntoTupleDest is a proxy to ExecuteTaskListExtended() with defaults
 * for some of the arguments.
 */
uint64
ExecuteTaskListIntoTupleDest(RowModifyLevel modLevel, List *taskList,
                             TupleDestination *tupleDest,
                             bool expectResults)
{
    int targetPoolSize = MaxAdaptiveExecutorPoolSize;
    bool localExecutionSupported = true;
    ExecutionParams *executionParams = CreateBasicExecutionParams(
        modLevel, taskList, targetPoolSize, localExecutionSupported
        );

    executionParams->xactProperties = DecideTransactionPropertiesForTaskList(
        modLevel, taskList, false);
    executionParams->expectResults = expectResults;
    executionParams->tupleDestination = tupleDest;

    return ExecuteTaskListExtended(executionParams);
}


/*
 * ExecuteTaskListExtended sets up the execution for the given task list and
 * runs it.
 */
uint64
ExecuteTaskListExtended(ExecutionParams *executionParams)
{
    ParamListInfo paramListInfo = NULL;
    uint64 locallyProcessedRows = 0;

    TupleDestination *defaultTupleDest = executionParams->tupleDestination;

    if (MultiShardConnectionType == SEQUENTIAL_CONNECTION)
    {
        executionParams->targetPoolSize = 1;
    }

    DistributedExecution *execution =
        CreateDistributedExecution(
            executionParams->modLevel, executionParams->taskList,
            paramListInfo, executionParams->targetPoolSize,
            defaultTupleDest, &executionParams->xactProperties,
            executionParams->jobIdList, executionParams->localExecutionSupported);

    /*
     * If the current transaction accessed local placements and the task list
     * includes tasks that should be executed locally (accessing any of the local
     * placements), then we should error out as it would cause inconsistencies
     * across the remote connection and local execution.
     */
    List *remoteTaskList = execution->remoteTaskList;
    if (GetCurrentLocalExecutionStatus() == LOCAL_EXECUTION_REQUIRED &&
        AnyTaskAccessesLocalNode(remoteTaskList))
    {
        ErrorIfTransactionAccessedPlacementsLocally();
    }

    /* run the remote execution */
    StartDistributedExecution(execution);
    RunDistributedExecution(execution);
    FinishDistributedExecution(execution);

    /* now, switch back to the local execution */
    if (executionParams->isUtilityCommand)
    {
        locallyProcessedRows += ExecuteLocalUtilityTaskList(execution->localTaskList);
    }
    else
    {
        locallyProcessedRows += ExecuteLocalTaskList(execution->localTaskList,
                                                     defaultTupleDest);
    }

    return execution->rowsProcessed + locallyProcessedRows;
}


/*
 * CreateBasicExecutionParams creates basic execution parameters with some common
 * fields.
 */
ExecutionParams *
CreateBasicExecutionParams(RowModifyLevel modLevel,
                           List *taskList,
                           int targetPoolSize,
                           bool localExecutionSupported)
{
    ExecutionParams *executionParams = palloc0(sizeof(ExecutionParams));
    executionParams->modLevel = modLevel;
    executionParams->taskList = taskList;
    executionParams->targetPoolSize = targetPoolSize;
    executionParams->localExecutionSupported = localExecutionSupported;

    executionParams->tupleDestination = CreateTupleDestNone();
    executionParams->expectResults = false;
    executionParams->isUtilityCommand = false;
    executionParams->jobIdList = NIL;

    return executionParams;
}


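/*
 * The ExecuteTaskList*() wrappers above all follow the same pattern on top of
 * CreateBasicExecutionParams(). As an illustrative sketch (mirroring
 * ExecuteTaskListIntoTupleDest above, with the task list construction elided):
 *
 *   ExecutionParams *params =
 *       CreateBasicExecutionParams(ROW_MODIFY_NONE, taskList,
 *                                  MaxAdaptiveExecutorPoolSize,
 *                                  true);   // localExecutionSupported
 *   params->xactProperties =
 *       DecideTransactionPropertiesForTaskList(ROW_MODIFY_NONE, taskList, false);
 *   params->tupleDestination = tupleDest;
 *   params->expectResults = true;
 *   uint64 rowsProcessed = ExecuteTaskListExtended(params);
 */
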
/*
 * CreateDistributedExecution creates a distributed execution data structure for
 * a distributed plan.
 */
static DistributedExecution *
CreateDistributedExecution(RowModifyLevel modLevel, List *taskList,
                           ParamListInfo paramListInfo,
                           int targetPoolSize, TupleDestination *defaultTupleDest,
                           TransactionProperties *xactProperties,
                           List *jobIdList, bool localExecutionSupported)
{
    DistributedExecution *execution =
        (DistributedExecution *) palloc0(sizeof(DistributedExecution));

    execution->modLevel = modLevel;
    execution->remoteAndLocalTaskList = taskList;
    execution->transactionProperties = xactProperties;

    /* we are going to calculate these values below */
    execution->localTaskList = NIL;
    execution->remoteTaskList = NIL;

    execution->paramListInfo = paramListInfo;
    execution->workerList = NIL;
    execution->sessionList = NIL;
    execution->targetPoolSize = targetPoolSize;
    execution->defaultTupleDest = defaultTupleDest;

    execution->rowsProcessed = 0;

    execution->raiseInterrupts = true;

    execution->rebuildWaitEventSet = false;
    execution->waitFlagsChanged = false;

    execution->jobIdList = jobIdList;

    /*
     * Since a task can have multiple queries, we are not sure how many columns we
     * should allocate for. We start with 16, and reallocate when we need more.
     */
    execution->allocatedColumnCount = 16;
    execution->columnArray = palloc0(execution->allocatedColumnCount * sizeof(void *));
    if (EnableBinaryProtocol)
    {
        /*
         * Initialize enough StringInfos for each column. These StringInfos
         * (and thus the backing buffers) will be reused for each row.
         * We will reference these StringInfos in the columnArray if the value
         * is not NULL.
         *
         * NOTE: StringInfos are always grown in the memory context in which
         * they were initially created. So appending in any memory context will
         * result in buffers that are still valid after removing that memory
         * context.
         */
        execution->stringInfoDataArray = palloc0(
            execution->allocatedColumnCount *
            sizeof(StringInfoData));
        for (int i = 0; i < execution->allocatedColumnCount; i++)
        {
            initStringInfo(&execution->stringInfoDataArray[i]);
        }
    }

    if (localExecutionSupported && ShouldExecuteTasksLocally(taskList))
    {
        bool readOnlyPlan = !TaskListModifiesDatabase(modLevel, taskList);
        ExtractLocalAndRemoteTasks(readOnlyPlan, taskList, &execution->localTaskList,
                                   &execution->remoteTaskList);
    }
    else
    {
        /*
         * Get a shallow copy of the list as we rely on remoteAndLocalTaskList
         * across the execution.
         */
        execution->remoteTaskList = list_copy(execution->remoteAndLocalTaskList);
    }

    execution->totalTaskCount = list_length(execution->remoteTaskList);
    execution->unfinishedTaskCount = list_length(execution->remoteTaskList);

    return execution;
}


/*
 * DecideTransactionPropertiesForTaskList decides whether to use remote transaction
 * blocks, whether to use 2PC for the given task list, and whether to error on any
 * failure.
 *
 * Since these decisions have specific dependencies on each other (e.g. 2PC implies
 * errorOnAnyFailure, but not the other way around) we keep them in the same place.
 */
static TransactionProperties
DecideTransactionPropertiesForTaskList(RowModifyLevel modLevel, List *taskList,
                                       bool exludeFromTransaction)
{
    TransactionProperties xactProperties;

    /* ensure uninitialized padding doesn't escape the function */
    memset_struct_0(xactProperties);
    xactProperties.errorOnAnyFailure = false;
    xactProperties.useRemoteTransactionBlocks = TRANSACTION_BLOCKS_ALLOWED;
    xactProperties.requires2PC = false;

    if (taskList == NIL)
    {
        /* nothing to do, return defaults */
        return xactProperties;
    }

    if (exludeFromTransaction)
    {
        xactProperties.useRemoteTransactionBlocks = TRANSACTION_BLOCKS_DISALLOWED;
        return xactProperties;
    }

    if (MultiShardCommitProtocol == COMMIT_PROTOCOL_BARE)
    {
        /*
         * We prefer to error on any failures for CREATE INDEX
         * CONCURRENTLY or VACUUM/VACUUM ANALYZE (i.e., COMMIT_PROTOCOL_BARE).
         */
        xactProperties.errorOnAnyFailure = true;
        xactProperties.useRemoteTransactionBlocks = TRANSACTION_BLOCKS_DISALLOWED;
        return xactProperties;
    }

    if (DistributedExecutionRequiresRollback(taskList))
    {
        /* transaction blocks are required if the task list needs to roll back */
        xactProperties.useRemoteTransactionBlocks = TRANSACTION_BLOCKS_REQUIRED;

        if (TaskListRequires2PC(taskList))
        {
            /*
             * Although using the two phase commit protocol is a decision
             * independent of failing on any error, we prefer to couple them.
             * Our motivation is that failures are rare, and we prefer to avoid
             * marking placements invalid in case of failures.
             */
            xactProperties.errorOnAnyFailure = true;
            xactProperties.requires2PC = true;
        }
        else if (MultiShardCommitProtocol != COMMIT_PROTOCOL_2PC &&
                 IsMultiShardModification(modLevel, taskList))
        {
            /*
             * Even if we're not using 2PC, we prefer to error out
             * on any failures during multi shard modifications/DDLs.
             */
            xactProperties.errorOnAnyFailure = true;
        }
    }
    else if (InCoordinatedTransaction())
    {
        /*
         * If we are already in a coordinated transaction then transaction blocks
         * are required even if they are not strictly required for the current
         * execution.
         */
        xactProperties.useRemoteTransactionBlocks = TRANSACTION_BLOCKS_REQUIRED;
    }

    return xactProperties;
}


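/*
 * Worked example of the decision above (illustrative): a two-shard UPDATE
 * requires rollback because it consists of multiple tasks, so
 * useRemoteTransactionBlocks becomes TRANSACTION_BLOCKS_REQUIRED. If
 * MultiShardCommitProtocol is COMMIT_PROTOCOL_2PC (or the first task targets
 * a reference table shard), TaskListRequires2PC() additionally sets
 * requires2PC and errorOnAnyFailure; otherwise only errorOnAnyFailure is set,
 * since this is a multi-shard modification.
 */
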
/*
 * StartDistributedExecution sets up the coordinated transaction and 2PC for
 * the execution whenever necessary. It also keeps track of parallel relation
 * accesses to enforce restrictions that arise due to foreign keys to reference
 * tables.
 */
void
StartDistributedExecution(DistributedExecution *execution)
{
    TransactionProperties *xactProperties = execution->transactionProperties;

    if (xactProperties->useRemoteTransactionBlocks == TRANSACTION_BLOCKS_REQUIRED)
    {
        UseCoordinatedTransaction();
    }

    if (xactProperties->requires2PC)
    {
        CoordinatedTransactionShouldUse2PC();
    }

    /*
     * Prevent unsafe concurrent modifications of replicated shards by taking
     * locks.
     *
     * When modifying a reference table in MX mode, we take the lock via RPC
     * to the first worker in a transaction block, which activates a coordinated
     * transaction. We need to do this before determining whether the execution
     * should use transaction blocks (see below).
     */
    AcquireExecutorShardLocksForExecution(execution);

    /*
     * We should not record parallel access if the target pool size is less than 2.
     * The reason is that we define parallel access as at least two connections
     * established to a worker node.
     *
     * It is not ideal to have this check here; it would have been better to simply
     * pass DistributedExecution directly to the RecordParallelAccess*() function.
     * However, since we have two other executors that rely on the function, we only
     * pass the task list to have a common API.
     */
    if (execution->targetPoolSize > 1)
    {
        /*
         * Record the access for both the local and remote tasks. The main goal
         * is to make sure that Citus behaves consistently even if the local
         * shards are moved away.
         */
        RecordParallelRelationAccessForTaskList(execution->remoteAndLocalTaskList);
    }
}


/*
 * DistributedExecutionModifiesDatabase returns true if the execution modifies the data
 * or the schema.
 */
static bool
DistributedExecutionModifiesDatabase(DistributedExecution *execution)
{
    return TaskListModifiesDatabase(execution->modLevel,
                                    execution->remoteAndLocalTaskList);
}


/*
 * DistributedPlanModifiesDatabase returns true if the plan modifies the data
 * or the schema.
 */
bool
DistributedPlanModifiesDatabase(DistributedPlan *plan)
{
    return TaskListModifiesDatabase(plan->modLevel, plan->workerJob->taskList);
}


/*
 * IsMultiShardModification returns true if the task list is a modification
 * across shards.
 */
static bool
IsMultiShardModification(RowModifyLevel modLevel, List *taskList)
{
    return list_length(taskList) > 1 && TaskListModifiesDatabase(modLevel, taskList);
}


/*
 * TaskListModifiesDatabase is a helper function for DistributedExecutionModifiesDatabase
 * and DistributedPlanModifiesDatabase.
 */
static bool
TaskListModifiesDatabase(RowModifyLevel modLevel, List *taskList)
{
    if (modLevel > ROW_MODIFY_READONLY)
    {
        return true;
    }

    /*
     * If we cannot decide by only checking the row modify level,
     * we should look more closely at the tasks.
     */
    if (list_length(taskList) < 1)
    {
        /* is this ever possible? */
        return false;
    }

    Task *firstTask = (Task *) linitial(taskList);

    return !ReadOnlyTask(firstTask->taskType);
}


/*
 * DistributedExecutionRequiresRollback returns true if the distributed
 * execution should start a CoordinatedTransaction. In other words, if the
 * function returns true, the execution sends BEGIN; to every connection
 * involved in the distributed execution.
 */
static bool
DistributedExecutionRequiresRollback(List *taskList)
{
    int taskCount = list_length(taskList);

    if (MultiShardCommitProtocol == COMMIT_PROTOCOL_BARE)
    {
        return false;
    }

    if (taskCount == 0)
    {
        return false;
    }

    Task *task = (Task *) linitial(taskList);

    bool selectForUpdate = task->relationRowLockList != NIL;
    if (selectForUpdate)
    {
        /*
         * Do not check SelectOpensTransactionBlock; always open a transaction
         * block if SELECT FOR UPDATE is executed inside a distributed transaction.
         */
        return IsTransactionBlock();
    }

    if (ReadOnlyTask(task->taskType))
    {
        return SelectOpensTransactionBlock &&
               IsTransactionBlock();
    }

    if (IsMultiStatementTransaction())
    {
        return true;
    }

    if (list_length(taskList) > 1)
    {
        return true;
    }

    if (list_length(task->taskPlacementList) > 1)
    {
        if (SingleShardCommitProtocol == COMMIT_PROTOCOL_2PC)
        {
            /*
             * The adaptive executor opts to error out on queries if a placement is
             * unhealthy, not marking the placement itself unhealthy in the process.
             * Use 2PC to roll back placements before the unhealthy replica failed.
             */
            return true;
        }

        /*
         * Some tasks don't set replicationModel, thus we only
         * rely on the anchorShardId, not replicationModel.
         *
         * TODO: Do we ever need replicationModel in the Task structure?
         * Can't we always rely on anchorShardId?
         */
        if (task->anchorShardId != INVALID_SHARD_ID && ReferenceTableShardId(
                task->anchorShardId))
        {
            return true;
        }

        /*
         * Single DML/DDL tasks with replicated tables (non-reference)
         * should not require BEGIN/COMMIT/ROLLBACK.
         */
        return false;
    }

    return false;
}


/*
 * TaskListRequires2PC determines whether the given task list requires 2PC
 * because the tasks provided operate on a reference table or there are multiple
 * tasks and the commit protocol is 2PC.
 *
 * Note that we currently do not generate task lists that involve multiple different
 * tables; thus we only check the first task in the list for reference tables.
 */
static bool
TaskListRequires2PC(List *taskList)
{
    if (taskList == NIL)
    {
        return false;
    }

    Task *task = (Task *) linitial(taskList);
    if (task->replicationModel == REPLICATION_MODEL_2PC)
    {
        return true;
    }

    /*
     * Some tasks don't set replicationModel, thus we rely on
     * the anchorShardId as well as the replicationModel.
     *
     * TODO: Do we ever need replicationModel in the Task structure?
     * Can't we always rely on anchorShardId?
     */
    uint64 anchorShardId = task->anchorShardId;
    if (anchorShardId != INVALID_SHARD_ID && ReferenceTableShardId(anchorShardId))
    {
        return true;
    }

    bool multipleTasks = list_length(taskList) > 1;
    if (!ReadOnlyTask(task->taskType) &&
        multipleTasks && MultiShardCommitProtocol == COMMIT_PROTOCOL_2PC)
    {
        return true;
    }

    if (task->taskType == DDL_TASK)
    {
        if (MultiShardCommitProtocol == COMMIT_PROTOCOL_2PC)
        {
            return true;
        }
    }

    return false;
}


/*
 * ReadOnlyTask returns true if the input task does a read-only operation
 * on the database.
 */
bool
ReadOnlyTask(TaskType taskType)
{
    switch (taskType)
    {
        case READ_TASK:
        case MAP_OUTPUT_FETCH_TASK:
        case MAP_TASK:
        case MERGE_TASK:
        {
            return true;
        }

        default:
        {
            return false;
        }
    }
}


/*
 * SelectForUpdateOnReferenceTable returns true if the input task
 * contains a FOR UPDATE clause that locks any reference tables.
 */
static bool
SelectForUpdateOnReferenceTable(List *taskList)
{
    if (list_length(taskList) != 1)
    {
        /* we currently do not support SELECT FOR UPDATE on multi task queries */
        return false;
    }

    Task *task = (Task *) linitial(taskList);
    RelationRowLock *relationRowLock = NULL;
    foreach_ptr(relationRowLock, task->relationRowLockList)
    {
        Oid relationId = relationRowLock->relationId;

        if (IsCitusTableType(relationId, REFERENCE_TABLE))
        {
            return true;
        }
    }

    return false;
}


/*
 * LockPartitionsForDistributedPlan ensures commands take locks on all partitions
 * of a distributed table that appears in the query. We do this primarily out of
 * consistency with PostgreSQL locking.
 */
static void
LockPartitionsForDistributedPlan(DistributedPlan *distributedPlan)
{
    if (DistributedPlanModifiesDatabase(distributedPlan))
    {
        Oid targetRelationId = distributedPlan->targetRelationId;

        LockPartitionsInRelationList(list_make1_oid(targetRelationId), RowExclusiveLock);
    }

    /*
     * Lock partitions of tables that appear in a SELECT or subquery. In the
     * DML case this also includes the target relation, but since we already
     * have a stronger lock this doesn't do any harm.
     */
    LockPartitionsInRelationList(distributedPlan->relationIdList, AccessShareLock);
}


/*
 * AcquireExecutorShardLocksForExecution acquires advisory locks on shard IDs
 * to prevent unsafe concurrent modifications of shards.
 *
 * We prevent concurrent modifications of shards in two cases:
 * 1. Any non-commutative writes to a replicated table
 * 2. Multi-shard writes that are executed in parallel
 *
 * The first case ensures we do not apply updates in different orders on
 * different replicas (e.g. of a reference table), which could lead the
 * replicas to diverge.
 *
 * The second case prevents deadlocks due to out-of-order execution.
 *
 * We do not take executor shard locks for utility commands such as
 * TRUNCATE because the table locks already prevent concurrent access.
 */
static void
AcquireExecutorShardLocksForExecution(DistributedExecution *execution)
{
    RowModifyLevel modLevel = execution->modLevel;

    /* acquire the locks for both the remote and local tasks */
    List *taskList = execution->remoteAndLocalTaskList;

    if (modLevel <= ROW_MODIFY_READONLY &&
        !SelectForUpdateOnReferenceTable(taskList))
    {
        /*
         * Executor locks only apply to DML commands and SELECT FOR UPDATE queries
         * touching reference tables.
         */
        return;
    }

    /*
     * When executing in sequential mode or only executing a single task, we
     * do not need multi-shard locks.
     */
    if (list_length(taskList) == 1 || ShouldRunTasksSequentially(taskList))
    {
        Task *task = NULL;
        foreach_ptr(task, taskList)
        {
            AcquireExecutorShardLocks(task, modLevel);
        }
    }
    else if (list_length(taskList) > 1)
    {
        AcquireExecutorMultiShardLocks(taskList);
    }
}


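/*
 * For example (illustrative): a single-shard UPDATE takes its per-task executor
 * shard locks via AcquireExecutorShardLocks(), a parallel multi-shard UPDATE
 * goes through AcquireExecutorMultiShardLocks(), and a plain multi-shard SELECT
 * returns early above without taking any executor shard locks.
 */
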
/*
 * FinishDistributedExecution cleans up resources associated with a
 * distributed execution.
 */
static void
FinishDistributedExecution(DistributedExecution *execution)
{
    if (DistributedExecutionModifiesDatabase(execution))
    {
        /* prevent copying shards in the same transaction */
        XactModificationLevel = XACT_MODIFICATION_DATA;
    }
}


/*
 * CleanUpSessions does any clean-up necessary for the sessions used
 * during the execution. We only reach this function after successfully
 * completing all the tasks, and we expect no tasks to still be in progress.
 */
static void
CleanUpSessions(DistributedExecution *execution)
{
    List *sessionList = execution->sessionList;

    /* we get to this function only after successful executions */
    Assert(!execution->failed && execution->unfinishedTaskCount == 0);

    /* always trigger wait event set in the first round */
    WorkerSession *session = NULL;
    foreach_ptr(session, sessionList)
    {
        MultiConnection *connection = session->connection;

        ereport(DEBUG4, (errmsg("Total number of commands sent over the session %ld: %ld",
                                session->sessionId, session->commandsSent)));

        UnclaimConnection(connection);

        if (connection->connectionState == MULTI_CONNECTION_CONNECTING ||
            connection->connectionState == MULTI_CONNECTION_FAILED ||
            connection->connectionState == MULTI_CONNECTION_LOST ||
            connection->connectionState == MULTI_CONNECTION_TIMED_OUT)
        {
            /*
             * We want the MultiConnection to go away and not be used in
             * subsequent executions.
             *
             * We cannot get MULTI_CONNECTION_LOST via the ConnectionStateMachine,
             * but we might get it via the connection API and find ourselves here
             * before changing any states in the ConnectionStateMachine.
             */
            CloseConnection(connection);
        }
        else if (connection->connectionState == MULTI_CONNECTION_CONNECTED)
        {
            RemoteTransaction *transaction = &(connection->remoteTransaction);
            RemoteTransactionState transactionState = transaction->transactionState;

            if (transactionState == REMOTE_TRANS_CLEARING_RESULTS)
            {
                /*
                 * We might have established the connection, and even sent BEGIN,
                 * but not gotten to the point where we assigned a task to this
                 * specific connection (because other connections in the pool
                 * already finished all the tasks).
                 */
                Assert(session->commandsSent == 0);

                ClearResults(connection, false);
            }
            else if (!(transactionState == REMOTE_TRANS_NOT_STARTED ||
                       transactionState == REMOTE_TRANS_STARTED))
            {
                /*
                 * We don't have to handle anything else. Note that the execution
                 * could only finish on connectionStates of MULTI_CONNECTION_CONNECTING,
                 * MULTI_CONNECTION_FAILED and MULTI_CONNECTION_CONNECTED. The first two
                 * are already handled above.
                 *
                 * When we're on MULTI_CONNECTION_CONNECTED, TransactionStateMachine
                 * ensures that all the necessary commands are successfully sent over
                 * the connection and everything is cleared up. Otherwise, we'd have
                 * been in the MULTI_CONNECTION_FAILED state.
                 */
                ereport(WARNING, (errmsg("unexpected transaction state at the end of "
                                         "execution: %d", transactionState)));
            }

            /* get ready for the next executions if we need to use the same connection */
            connection->waitFlags = WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE;
        }
        else
        {
            ereport(WARNING, (errmsg("unexpected connection state at the end of "
                                     "execution: %d", connection->connectionState)));
        }
    }
}


/*
 * UnclaimAllSessionConnections unclaims all of the connections for the given
 * sessionList.
 */
static void
UnclaimAllSessionConnections(List *sessionList)
{
	WorkerSession *session = NULL;
	foreach_ptr(session, sessionList)
	{
		MultiConnection *connection = session->connection;

		UnclaimConnection(connection);
	}
}


/*
|
|
* AssignTasksToConnectionsOrWorkerPool goes through the list of tasks to determine whether any
|
|
* task placements need to be assigned to particular connections because of preceding
|
|
* operations in the transaction. It then adds those connections to the pool and adds
|
|
* the task placement executions to the assigned task queue of the connection.
|
|
*/
|
|
static void
|
|
AssignTasksToConnectionsOrWorkerPool(DistributedExecution *execution)
|
|
{
|
|
RowModifyLevel modLevel = execution->modLevel;
|
|
List *taskList = execution->remoteTaskList;
|
|
|
|
Task *task = NULL;
|
|
foreach_ptr(task, taskList)
|
|
{
|
|
bool placementExecutionReady = true;
|
|
int placementExecutionIndex = 0;
|
|
int placementExecutionCount = list_length(task->taskPlacementList);
|
|
|
|
/*
|
|
* Execution of a command on a shard, which may have multiple replicas.
|
|
*/
|
|
ShardCommandExecution *shardCommandExecution =
|
|
(ShardCommandExecution *) palloc0(sizeof(ShardCommandExecution));
|
|
shardCommandExecution->task = task;
|
|
shardCommandExecution->executionOrder = ExecutionOrderForTask(modLevel, task);
|
|
shardCommandExecution->executionState = TASK_EXECUTION_NOT_FINISHED;
|
|
shardCommandExecution->placementExecutions =
|
|
(TaskPlacementExecution **) palloc0(placementExecutionCount *
|
|
sizeof(TaskPlacementExecution *));
|
|
shardCommandExecution->placementExecutionCount = placementExecutionCount;
|
|
|
|
SetAttributeInputMetadata(execution, shardCommandExecution);
|
|
ShardPlacement *taskPlacement = NULL;
|
|
foreach_ptr(taskPlacement, task->taskPlacementList)
|
|
{
|
|
int connectionFlags = 0;
|
|
char *nodeName = taskPlacement->nodeName;
|
|
int nodePort = taskPlacement->nodePort;
|
|
WorkerPool *workerPool = FindOrCreateWorkerPool(execution, nodeName,
|
|
nodePort);
|
|
|
|
/*
|
|
* Execution of a command on a shard placement, which may not always
|
|
* happen if the query is read-only and the shard has multiple placements.
|
|
*/
|
|
TaskPlacementExecution *placementExecution =
|
|
(TaskPlacementExecution *) palloc0(sizeof(TaskPlacementExecution));
|
|
placementExecution->shardCommandExecution = shardCommandExecution;
|
|
placementExecution->shardPlacement = taskPlacement;
|
|
placementExecution->workerPool = workerPool;
|
|
placementExecution->placementExecutionIndex = placementExecutionIndex;
|
|
placementExecution->queryIndex = 0;
|
|
|
|
if (placementExecutionReady)
|
|
{
|
|
placementExecution->executionState = PLACEMENT_EXECUTION_READY;
|
|
}
|
|
else
|
|
{
|
|
placementExecution->executionState = PLACEMENT_EXECUTION_NOT_READY;
|
|
}
|
|
|
|
shardCommandExecution->placementExecutions[placementExecutionIndex] =
|
|
placementExecution;
|
|
|
|
placementExecutionIndex++;
|
|
|
|
List *placementAccessList = PlacementAccessListForTask(task, taskPlacement);
|
|
|
|
MultiConnection *connection = NULL;
|
|
if (execution->transactionProperties->useRemoteTransactionBlocks !=
|
|
TRANSACTION_BLOCKS_DISALLOWED)
|
|
{
|
|
/*
|
|
* Determine whether the task has to be assigned to a particular connection
|
|
* due to a preceding access to the placement in the same transaction.
|
|
*/
|
|
connection = GetConnectionIfPlacementAccessedInXact(
|
|
connectionFlags,
|
|
placementAccessList,
|
|
NULL);
|
|
}
|
|
|
|
if (connection != NULL)
|
|
{
|
|
/*
|
|
* Note: We may get the same connection for multiple task placements.
|
|
* FindOrCreateWorkerSession ensures that we only have one session per
|
|
* connection.
|
|
*/
|
|
WorkerSession *session =
|
|
FindOrCreateWorkerSession(workerPool, connection);
|
|
|
|
ereport(DEBUG4, (errmsg("Session %ld (%s:%d) has an assigned task",
|
|
session->sessionId, connection->hostname,
|
|
connection->port)));
|
|
|
|
placementExecution->assignedSession = session;
|
|
|
|
/* if executed, this task placement must use this session */
|
|
if (placementExecutionReady)
|
|
{
|
|
dlist_push_tail(&session->readyTaskQueue,
|
|
&placementExecution->sessionReadyQueueNode);
|
|
}
|
|
else
|
|
{
|
|
dlist_push_tail(&session->pendingTaskQueue,
|
|
&placementExecution->sessionPendingQueueNode);
|
|
}
|
|
|
|
/* always poll the connection in the first round */
|
|
UpdateConnectionWaitFlags(session,
|
|
WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
|
|
|
|
/*
 * If the connections are already available, make sure to activate
 * 2PC when necessary.
 */
|
|
Activate2PCIfModifyingTransactionExpandsToNewNode(session);
|
|
}
|
|
else
|
|
{
|
|
placementExecution->assignedSession = NULL;
|
|
|
|
if (placementExecutionReady)
|
|
{
|
|
/* task is ready to execute on any session */
|
|
dlist_push_tail(&workerPool->readyTaskQueue,
|
|
&placementExecution->workerReadyQueueNode);
|
|
|
|
workerPool->readyTaskCount++;
|
|
}
|
|
else
|
|
{
|
|
/* task can be executed on any session, but is not yet ready */
|
|
dlist_push_tail(&workerPool->pendingTaskQueue,
|
|
&placementExecution->workerPendingQueueNode);
|
|
}
|
|
}
|
|
|
|
if (shardCommandExecution->executionOrder != EXECUTION_ORDER_PARALLEL)
|
|
{
|
|
/*
|
|
* Except for commands that can be executed across all placements
|
|
* in parallel, only the first placement execution is immediately
|
|
* ready. Set placementExecutionReady to false for the remaining
|
|
* placements.
|
|
*/
|
|
placementExecutionReady = false;
|
|
}
|
|
}
|
|
}
|
|
|
|
/*
|
|
* We sort the workerList because adaptive connection management
|
|
* (e.g., OPTIONAL_CONNECTION) requires any concurrent executions
|
|
* to wait for the connections in the same order to prevent any
|
|
* starvation. If we don't sort, we might end up with:
|
|
* Execution 1: Get connection for worker 1, wait for worker 2
|
|
* Execution 2: Get connection for worker 2, wait for worker 1
|
|
*
|
|
* and, none could proceed. Instead, we enforce every execution establish
|
|
* the required connections to workers in the same order.
|
|
*/
|
|
execution->workerList = SortList(execution->workerList, WorkerPoolCompare);
|
|
|
|
/*
|
|
* The executor claims connections exclusively to make sure that calls to
|
|
* StartNodeUserDatabaseConnection do not return the same connections.
|
|
*
|
|
* We need to do this after assigning tasks to connections because the same
|
|
* connection may be returned multiple times by GetPlacementListConnectionIfCached.
|
|
*/
|
|
WorkerSession *session = NULL;
|
|
foreach_ptr(session, execution->sessionList)
|
|
{
|
|
MultiConnection *connection = session->connection;
|
|
|
|
ClaimConnectionExclusively(connection);
|
|
}
|
|
}
|
|
|
|
|
|
/*
 * WorkerPoolCompare is based on the WorkerNodeCompare function. The function
 * compares two worker pools by their host name and port number.
 */
static int
WorkerPoolCompare(const void *lhsKey, const void *rhsKey)
{
	const WorkerPool *workerLhs = *(const WorkerPool **) lhsKey;
	const WorkerPool *workerRhs = *(const WorkerPool **) rhsKey;

	return NodeNamePortCompare(workerLhs->nodeName, workerRhs->nodeName,
							   workerLhs->nodePort, workerRhs->nodePort);
}


/*
 * SetAttributeInputMetadata sets attributeInputMetadata in
 * shardCommandExecution for all the queries that are part of its task.
 * This contains the deserialization functions for the tuples that will be
 * received. It also sets binaryResults when applicable.
 */
static void
SetAttributeInputMetadata(DistributedExecution *execution,
						  ShardCommandExecution *shardCommandExecution)
{
	TupleDestination *tupleDest = shardCommandExecution->task->tupleDest ?
								  shardCommandExecution->task->tupleDest :
								  execution->defaultTupleDest;
	uint32 queryCount = shardCommandExecution->task->queryCount;
	shardCommandExecution->attributeInputMetadata = palloc0(queryCount *
															sizeof(AttInMetadata *));

	for (uint32 queryIndex = 0; queryIndex < queryCount; queryIndex++)
	{
		AttInMetadata *attInMetadata = NULL;
		TupleDesc tupleDescriptor = tupleDest->tupleDescForQuery(tupleDest,
																 queryIndex);
		if (tupleDescriptor == NULL)
		{
			attInMetadata = NULL;
		}

		/*
		 * We only allow binary results when queryCount is 1, because we
		 * cannot use binary results with SendRemoteCommand, which must be
		 * used when queryCount is larger than 1.
		 */
		else if (EnableBinaryProtocol && queryCount == 1 &&
				 CanUseBinaryCopyFormat(tupleDescriptor))
		{
			attInMetadata = TupleDescGetAttBinaryInMetadata(tupleDescriptor);
			shardCommandExecution->binaryResults = true;
		}
		else
		{
			attInMetadata = TupleDescGetAttInMetadata(tupleDescriptor);
		}

		shardCommandExecution->attributeInputMetadata[queryIndex] = attInMetadata;
	}
}


/*
 * ExecutionOrderForTask gives the appropriate execution order for a task.
 */
static PlacementExecutionOrder
ExecutionOrderForTask(RowModifyLevel modLevel, Task *task)
{
	switch (task->taskType)
	{
		case READ_TASK:
		{
			return EXECUTION_ORDER_ANY;
		}

		case MODIFY_TASK:
		{
			/*
			 * For non-commutative modifications we take aggressive locks, so
			 * there is no risk of deadlock and we can run them in parallel.
			 * When the modification is commutative, we take no additional
			 * locks, so we take a conservative approach and execute
			 * sequentially to avoid deadlocks.
			 */
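			/*
			 * For instance (assuming the usual mapping of modify levels), a
			 * plain multi-placement INSERT is commutative and takes the
			 * sequential branch below, while an UPDATE or DELETE is
			 * non-commutative and may run on all placements in parallel.
			 */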
			if (modLevel < ROW_MODIFY_NONCOMMUTATIVE)
			{
				return EXECUTION_ORDER_SEQUENTIAL;
			}
			else
			{
				return EXECUTION_ORDER_PARALLEL;
			}
		}

		case DDL_TASK:
		case VACUUM_ANALYZE_TASK:
		case MAP_TASK:
		case MERGE_TASK:
		case MAP_OUTPUT_FETCH_TASK:
		case MERGE_FETCH_TASK:
		{
			return EXECUTION_ORDER_PARALLEL;
		}

		default:
		{
			ereport(ERROR, (errmsg("unsupported task type %d in adaptive executor",
								   task->taskType)));
		}
	}
}


/*
 * FindOrCreateWorkerPool gets the pool of connections for a particular worker.
 */
static WorkerPool *
FindOrCreateWorkerPool(DistributedExecution *execution, char *nodeName, int nodePort)
{
	WorkerPool *workerPool = NULL;
	foreach_ptr(workerPool, execution->workerList)
	{
		if (strncmp(nodeName, workerPool->nodeName, WORKER_LENGTH) == 0 &&
			nodePort == workerPool->nodePort)
		{
			return workerPool;
		}
	}

	workerPool = (WorkerPool *) palloc0(sizeof(WorkerPool));
	workerPool->nodeName = pstrdup(nodeName);
	workerPool->nodePort = nodePort;

	WorkerNode *workerNode = FindWorkerNode(nodeName, nodePort);
	if (workerNode)
	{
		workerPool->poolToLocalNode =
			workerNode->groupId == GetLocalGroupId();
	}

	/* "open" connections aggressively when there are cached connections */
	int nodeConnectionCount = MaxCachedConnectionsPerWorker;
	workerPool->maxNewConnectionsPerCycle = Max(1, nodeConnectionCount);

	dlist_init(&workerPool->pendingTaskQueue);
	dlist_init(&workerPool->readyTaskQueue);

	workerPool->distributedExecution = execution;

	execution->workerList = lappend(execution->workerList, workerPool);

	return workerPool;
}


/*
 * FindOrCreateWorkerSession returns a session with the given connection,
 * either existing or new. New sessions are added to the worker pool and
 * the distributed execution.
 */
static WorkerSession *
FindOrCreateWorkerSession(WorkerPool *workerPool, MultiConnection *connection)
{
	DistributedExecution *execution = workerPool->distributedExecution;
	static uint64 sessionId = 1;

	WorkerSession *session = NULL;
	foreach_ptr(session, workerPool->sessionList)
	{
		if (session->connection == connection)
		{
			return session;
		}
	}

	session = (WorkerSession *) palloc0(sizeof(WorkerSession));
	session->sessionId = sessionId++;
	session->connection = connection;
	session->workerPool = workerPool;
	session->commandsSent = 0;
	session->waitEventSetIndex = WAIT_EVENT_SET_INDEX_NOT_INITIALIZED;

	dlist_init(&session->pendingTaskQueue);
	dlist_init(&session->readyTaskQueue);

	/* keep track of how many connections are ready */
	if (connection->connectionState == MULTI_CONNECTION_CONNECTED)
	{
		workerPool->activeConnectionCount++;
		workerPool->idleConnectionCount++;
	}

	workerPool->unusedConnectionCount++;

	/*
	 * Record the pool's first connection establishment time. We need this
	 * to enforce NodeConnectionTimeout.
	 */
	if (list_length(workerPool->sessionList) == 0)
	{
		INSTR_TIME_SET_CURRENT(workerPool->poolStartTime);
		workerPool->checkForPoolTimeout = true;
	}

	workerPool->sessionList = lappend(workerPool->sessionList, session);
	execution->sessionList = lappend(execution->sessionList, session);

	return session;
}


/*
 * ShouldRunTasksSequentially returns true if each of the individual tasks
 * should be executed one by one. Note that this is different than the
 * MultiShardConnectionType == SEQUENTIAL_CONNECTION case. In that case,
 * running the tasks across the nodes in parallel is acceptable and implemented
 * in that way.
 *
 * However, the executions that are qualified here would perform poorly if the
 * tasks across the workers are executed in parallel. We currently qualify only
 * one class of distributed queries here, multi-row INSERTs. If we do not enforce
 * true sequential execution, concurrent multi-row upserts could easily form
 * a distributed deadlock when the upserts touch the same rows.
 */
bool
ShouldRunTasksSequentially(List *taskList)
{
	if (list_length(taskList) < 2)
	{
		/* single task plans are already qualified as sequential by definition */
		return false;
	}

	/* all the tasks are the same, so we only look at one */
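	/*
	 * For illustration (hypothetical table name): a command such as
	 * INSERT INTO dist_table VALUES (1, 'a'), (2, 'b') produces tasks with a
	 * non-empty rowValuesLists, so it is caught by the check below.
	 */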
	Task *initialTask = (Task *) linitial(taskList);
	if (initialTask->rowValuesLists != NIL)
	{
		/* found a multi-row INSERT */
		return true;
	}

	return false;
}


/*
 * SequentialRunDistributedExecution gets a distributed execution and
 * executes each individual task in the execution sequentially, one
 * task at a time. See the related function ShouldRunTasksSequentially()
 * for more detail on the definition of SequentialRun.
 */
static void
SequentialRunDistributedExecution(DistributedExecution *execution)
{
	List *taskList = execution->remoteTaskList;
	int connectionMode = MultiShardConnectionType;

	/*
	 * There are some implicit assumptions about this setting for the sequential
	 * executions, so make sure to set it.
	 */
	MultiShardConnectionType = SEQUENTIAL_CONNECTION;

	Task *taskToExecute = NULL;
	foreach_ptr(taskToExecute, taskList)
	{
		execution->remoteAndLocalTaskList = list_make1(taskToExecute);
		execution->remoteTaskList = list_make1(taskToExecute);
		execution->totalTaskCount = 1;
		execution->unfinishedTaskCount = 1;

		CHECK_FOR_INTERRUPTS();

		if (IsHoldOffCancellationReceived())
		{
			break;
		}

		/* simply call the regular execution function */
		RunDistributedExecution(execution);
	}

	/* set back the original execution mode */
	MultiShardConnectionType = connectionMode;
}


/*
|
|
* RunDistributedExecution runs a distributed execution to completion. It first opens
|
|
* connections for distributed execution and assigns each task with shard placements
|
|
* that have previously been modified in the current transaction to the connection
|
|
* that modified them. Then, it creates a wait event set to listen for events on
|
|
* any of the connections and runs the connection state machine when a connection
|
|
* has an event.
|
|
*/
|
|
void
|
|
RunDistributedExecution(DistributedExecution *execution)
|
|
{
|
|
WaitEvent *events = NULL;
|
|
|
|
AssignTasksToConnectionsOrWorkerPool(execution);
|
|
|
|
PG_TRY();
|
|
{
|
|
/* Preemptively step state machines in case of immediate errors */
|
|
WorkerSession *session = NULL;
|
|
foreach_ptr(session, execution->sessionList)
|
|
{
|
|
ConnectionStateMachine(session);
|
|
}
|
|
|
|
bool cancellationReceived = false;
|
|
|
|
int eventSetSize = GetEventSetSize(execution->sessionList);
|
|
|
|
/* always (re)build the wait event set the first time */
|
|
execution->rebuildWaitEventSet = true;
|
|
|
|
while (execution->unfinishedTaskCount > 0 && !cancellationReceived)
|
|
{
|
|
WorkerPool *workerPool = NULL;
|
|
foreach_ptr(workerPool, execution->workerList)
|
|
{
|
|
ManageWorkerPool(workerPool);
|
|
}
|
|
|
|
bool skipWaitEvents = false;
|
|
if (execution->remoteTaskList == NIL)
|
|
{
|
|
/*
|
|
* All the tasks are failed over to the local execution, no need
|
|
* to wait for any connection activity.
|
|
*/
|
|
continue;
|
|
}
|
|
else if (execution->rebuildWaitEventSet)
|
|
{
|
|
if (events != NULL)
|
|
{
|
|
/*
|
|
* The execution might take a while, so explicitly free at this point
|
|
* because we don't need it anymore.
|
|
*/
|
|
pfree(events);
|
|
events = NULL;
|
|
}
|
|
eventSetSize = RebuildWaitEventSet(execution);
|
|
events = palloc0(eventSetSize * sizeof(WaitEvent));
|
|
|
|
skipWaitEvents =
|
|
ProcessSessionsWithFailedWaitEventSetOperations(execution);
|
|
}
|
|
else if (execution->waitFlagsChanged)
|
|
{
|
|
RebuildWaitEventSetFlags(execution->waitEventSet, execution->sessionList);
|
|
execution->waitFlagsChanged = false;
|
|
|
|
skipWaitEvents =
|
|
ProcessSessionsWithFailedWaitEventSetOperations(execution);
|
|
}
|
|
|
|
if (skipWaitEvents)
|
|
{
|
|
/*
|
|
* Some operation on the wait event set has failed. Retry,
* as we already removed the problematic connections.
|
|
*/
|
|
execution->rebuildWaitEventSet = true;
|
|
|
|
continue;
|
|
}
|
|
|
|
/* wait for I/O events */
|
|
long timeout = NextEventTimeout(execution);
|
|
int eventCount = WaitEventSetWait(execution->waitEventSet, timeout, events,
|
|
eventSetSize, WAIT_EVENT_CLIENT_READ);
|
|
ProcessWaitEvents(execution, events, eventCount, &cancellationReceived);
|
|
}
|
|
|
|
if (events != NULL)
|
|
{
|
|
pfree(events);
|
|
}
|
|
|
|
if (execution->waitEventSet != NULL)
|
|
{
|
|
FreeWaitEventSet(execution->waitEventSet);
|
|
execution->waitEventSet = NULL;
|
|
}
|
|
|
|
CleanUpSessions(execution);
|
|
}
|
|
PG_CATCH();
|
|
{
|
|
/*
|
|
* We can still recover from error using ROLLBACK TO SAVEPOINT,
|
|
* unclaim all connections to allow that.
|
|
*/
|
|
UnclaimAllSessionConnections(execution->sessionList);
|
|
|
|
/* do repartition cleanup if this is a repartition query*/
|
|
if (list_length(execution->jobIdList) > 0)
|
|
{
|
|
DoRepartitionCleanup(execution->jobIdList);
|
|
}
|
|
|
|
if (execution->waitEventSet != NULL)
|
|
{
|
|
FreeWaitEventSet(execution->waitEventSet);
|
|
execution->waitEventSet = NULL;
|
|
}
|
|
|
|
PG_RE_THROW();
|
|
}
|
|
PG_END_TRY();
|
|
}
|
|
|
|
|
|
/*
 * ProcessSessionsWithFailedWaitEventSetOperations goes over the session list
 * and processes sessions with failed wait event set operations.
 *
 * Failed sessions are not going to generate any further events, so it is our
 * only chance to process the failure by calling into `ConnectionStateMachine`.
 *
 * The function returns true if any session failed.
 */
static bool
ProcessSessionsWithFailedWaitEventSetOperations(DistributedExecution *execution)
{
	bool foundFailedSession = false;
	WorkerSession *session = NULL;
	foreach_ptr(session, execution->sessionList)
	{
		if (session->waitEventSetIndex == WAIT_EVENT_SET_INDEX_FAILED)
		{
			/*
			 * We can only lose connections that were already connected;
			 * others are regular failures.
			 */
			MultiConnection *connection = session->connection;
			if (connection->connectionState == MULTI_CONNECTION_CONNECTED)
			{
				connection->connectionState = MULTI_CONNECTION_LOST;
			}
			else
			{
				connection->connectionState = MULTI_CONNECTION_FAILED;
			}

			ConnectionStateMachine(session);

			session->waitEventSetIndex = WAIT_EVENT_SET_INDEX_NOT_INITIALIZED;

			foundFailedSession = true;
		}
	}

	return foundFailedSession;
}


/*
 * RebuildWaitEventSet updates the waitEventSet for the distributed execution.
 * This happens when the connection set for the distributed execution is changed,
 * which means that we need to update which connections we wait on for events.
 * It returns the new event set size.
 */
static int
RebuildWaitEventSet(DistributedExecution *execution)
{
	if (execution->waitEventSet != NULL)
	{
		FreeWaitEventSet(execution->waitEventSet);
		execution->waitEventSet = NULL;
	}

	execution->waitEventSet = BuildWaitEventSet(execution->sessionList);
	execution->rebuildWaitEventSet = false;
	execution->waitFlagsChanged = false;

	return GetEventSetSize(execution->sessionList);
}


/*
 * ProcessWaitEvents processes the received events from connections.
 */
static void
ProcessWaitEvents(DistributedExecution *execution, WaitEvent *events, int eventCount,
				  bool *cancellationReceived)
{
	int eventIndex = 0;

	/* process I/O events */
	for (; eventIndex < eventCount; eventIndex++)
	{
		WaitEvent *event = &events[eventIndex];

		if (event->events & WL_POSTMASTER_DEATH)
		{
			ereport(ERROR, (errmsg("postmaster was shut down, exiting")));
		}

		if (event->events & WL_LATCH_SET)
		{
			ResetLatch(MyLatch);

			if (execution->raiseInterrupts)
			{
				CHECK_FOR_INTERRUPTS();
			}

			if (IsHoldOffCancellationReceived())
			{
				/*
				 * Break out of event loop immediately in case of cancellation.
				 * We cannot use "return" here inside a PG_TRY() block since
				 * then the exception stack won't be reset.
				 */
				*cancellationReceived = true;
			}

			continue;
		}

		WorkerSession *session = (WorkerSession *) event->user_data;
		session->latestUnconsumedWaitEvents = event->events;

		ConnectionStateMachine(session);
	}
}


/*
 * ManageWorkerPool ensures the worker pool has the appropriate number of connections
 * based on the number of pending tasks.
 */
static void
ManageWorkerPool(WorkerPool *workerPool)
{
	DistributedExecution *execution = workerPool->distributedExecution;

	/* we do not expand the pool further if there was any failure */
	if (HasAnyConnectionFailure(workerPool))
	{
		return;
	}

	/* we wait until a slow start interval has passed before expanding the pool */
	if (ShouldWaitForSlowStart(workerPool))
	{
		return;
	}

	int newConnectionCount = CalculateNewConnectionCount(workerPool);
	if (newConnectionCount <= 0)
	{
		return;
	}

	OpenNewConnections(workerPool, newConnectionCount, execution->transactionProperties);

	/*
	 * Cannot establish new connections to the local host, most probably because the
	 * local node cannot accept new connections (e.g., hit max_connections). Switch
	 * the tasks to the local execution.
	 *
	 * We prefer initiatedConnectionCount over the new connection establishments that
	 * happen in this iteration via OpenNewConnections(). The reason is that it is
	 * expected for OpenNewConnections() to not open any new connections as long as
	 * the connections are optional (e.g., the second or later connections in the
	 * pool). But, for initiatedConnectionCount to be zero, the connection to the
	 * local pool must have failed.
	 */
	int initiatedConnectionCount = list_length(workerPool->sessionList);
	if (initiatedConnectionCount == 0)
	{
		/*
		 * Only the pools to the local node are allowed to have optional
		 * connections for the first connection. Hence, initiatedConnectionCount
		 * could only be zero for poolToLocalNode. For other pools, the connection
		 * manager would wait until it gets at least one connection.
		 */
		Assert(workerPool->poolToLocalNode);

		WorkerPoolFailed(workerPool);

		if (execution->failed)
		{
			ereport(ERROR, (errcode(ERRCODE_CONNECTION_FAILURE),
							errmsg(
								"could not establish any connections to the node %s:%d "
								"when local execution is also disabled.",
								workerPool->nodeName,
								workerPool->nodePort),
							errhint("Enable local execution via SET "
									"citus.enable_local_execution TO true;")));
		}

		return;
	}

	INSTR_TIME_SET_CURRENT(workerPool->lastConnectionOpenTime);
	execution->rebuildWaitEventSet = true;
}


/*
 * HasAnyConnectionFailure returns true if the worker pool has failed, a
 * connection has timed out, or any connection in the pool has failed.
 */
static bool
HasAnyConnectionFailure(WorkerPool *workerPool)
{
	if (workerPool->failureState == WORKER_POOL_FAILED ||
		workerPool->failureState == WORKER_POOL_FAILED_OVER_TO_LOCAL)
	{
		/* connection pool failed */
		return true;
	}

	/* we might fail the execution or warn the user about connection timeouts */
	if (workerPool->checkForPoolTimeout)
	{
		CheckConnectionTimeout(workerPool);
	}

	int failedConnectionCount = workerPool->failedConnectionCount;
	if (failedConnectionCount >= 1)
	{
		/* do not attempt to open more connections after one failed */
		return true;
	}

	return false;
}


/*
 * ShouldWaitForSlowStart returns true if we should wait before
 * opening a new connection because of the slow start algorithm.
 */
static bool
ShouldWaitForSlowStart(WorkerPool *workerPool)
{
	/* if we can use a connection per placement, we don't need to wait for slow start */
	if (UseConnectionPerPlacement())
	{
		return false;
	}

	/* if slow start is disabled, we can open new connections */
	if (ExecutorSlowStartInterval == SLOW_START_DISABLED)
	{
		return false;
	}

	double milliSecondsPassedSince = MillisecondsPassedSince(
		workerPool->lastConnectionOpenTime);
	if (milliSecondsPassedSince < ExecutorSlowStartInterval)
	{
		return true;
	}

	/*
	 * Refrain from establishing new connections unless we have already
	 * finalized all the earlier connection attempts. This prevents unnecessary
	 * load on the remote nodes and emulates the TCP slow-start algorithm.
	 */
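	/*
	 * Example with hypothetical counts: if 4 connections were initiated but
	 * only 2 are active and 1 has failed, one attempt is still in flight and
	 * the check below keeps us waiting.
	 */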
	int initiatedConnectionCount = list_length(workerPool->sessionList);
	int finalizedConnectionCount =
		workerPool->activeConnectionCount + workerPool->failedConnectionCount;
	if (finalizedConnectionCount < initiatedConnectionCount)
	{
		return true;
	}

	return false;
}


/*
 * CalculateNewConnectionCount returns the number of connections
 * that we can currently open.
 */
static int
CalculateNewConnectionCount(WorkerPool *workerPool)
{
	DistributedExecution *execution = workerPool->distributedExecution;

	int targetPoolSize = execution->targetPoolSize;
	int initiatedConnectionCount = list_length(workerPool->sessionList);
	int activeConnectionCount PG_USED_FOR_ASSERTS_ONLY =
		workerPool->activeConnectionCount;
	int idleConnectionCount PG_USED_FOR_ASSERTS_ONLY =
		workerPool->idleConnectionCount;
	int readyTaskCount = workerPool->readyTaskCount;
	int newConnectionCount = 0;

	/* we should always have more (or equal) active connections than idle connections */
	Assert(activeConnectionCount >= idleConnectionCount);

	/* we should always have more (or equal) initiated connections than active connections */
	Assert(initiatedConnectionCount >= activeConnectionCount);

	/* we should never have less than 0 connections ever */
	Assert(activeConnectionCount >= 0 && idleConnectionCount >= 0);

	if (UseConnectionPerPlacement())
	{
		int unusedConnectionCount = workerPool->unusedConnectionCount;

		/*
		 * If force_max_query_parallelization is enabled then we ignore pool size
		 * and idle connections. Instead, we open new connections as long as there
		 * are more tasks than unused connections.
		 */
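		/*
		 * For example (hypothetical counts): 8 ready tasks and 3 unused
		 * connections yield Max(8 - 3, 0) = 5 new connections below.
		 */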
		newConnectionCount = Max(readyTaskCount - unusedConnectionCount, 0);
	}
	else
	{
		/* cannot open more than targetPoolSize connections */
		int maxNewConnectionCount = targetPoolSize - initiatedConnectionCount;

		/* total number of connections that are (almost) available for tasks */
		int usableConnectionCount = UsableConnectionCount(workerPool);

		/*
		 * Number of additional connections we would need to run all ready tasks in
		 * parallel.
		 */
		int newConnectionsForReadyTasks = Max(0, readyTaskCount - usableConnectionCount);

		/*
		 * If slow start is enabled, we cap the number of new connections to
		 * the current cycle's maximum.
		 */
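		/*
		 * Illustration with hypothetical numbers (not from the original
		 * source): with targetPoolSize = 16, 3 initiated connections and
		 * maxNewConnectionsPerCycle = 2, the cap below becomes
		 * Min(2, 16 - 3) = 2; the per-cycle limit then grows by one on every
		 * cycle in which we actually open new connections.
		 */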
		if (ExecutorSlowStartInterval != SLOW_START_DISABLED)
		{
			maxNewConnectionCount = Min(workerPool->maxNewConnectionsPerCycle,
										maxNewConnectionCount);
		}

		/*
		 * Open enough connections to handle all tasks that are ready, but no more
		 * than the target pool size.
		 */
		newConnectionCount = Min(newConnectionsForReadyTasks, maxNewConnectionCount);
		if (newConnectionCount > 0)
		{
			/* increase the open rate every cycle (like TCP slow start) */
			workerPool->maxNewConnectionsPerCycle += 1;
		}
	}

	return newConnectionCount;
}


/*
 * OpenNewConnections opens the given number of connections for the given workerPool.
 */
static void
OpenNewConnections(WorkerPool *workerPool, int newConnectionCount,
				   TransactionProperties *transactionProperties)
{
	ereport(DEBUG4, (errmsg("opening %d new connections to %s:%d", newConnectionCount,
							workerPool->nodeName, workerPool->nodePort)));

	for (int connectionIndex = 0; connectionIndex < newConnectionCount; connectionIndex++)
	{
		/* experimental: just to see the perf benefits of caching connections */
		int connectionFlags = 0;

		if (transactionProperties->useRemoteTransactionBlocks ==
			TRANSACTION_BLOCKS_DISALLOWED)
		{
			connectionFlags |= OUTSIDE_TRANSACTION;
		}

		/*
		 * Enforce the requirements for adaptive connection management (i.e.,
		 * throttle connections if citus.max_shared_pool_size is reached).
		 */
		int adaptiveConnectionManagementFlag =
			AdaptiveConnectionManagementFlag(workerPool->poolToLocalNode,
											 list_length(workerPool->sessionList));
		connectionFlags |= adaptiveConnectionManagementFlag;

		/* open a new connection to the worker */
		MultiConnection *connection = StartNodeUserDatabaseConnection(connectionFlags,
																	   workerPool->nodeName,
																	   workerPool->nodePort,
																	   NULL, NULL);
		if (!connection)
		{
			/* connection can only be NULL for optional connections */
			Assert((connectionFlags & OPTIONAL_CONNECTION));

			continue;
		}

		/*
		 * Assign the initial state in the connection state machine. The connection
		 * may already be open, but ConnectionStateMachine will immediately detect
		 * this.
		 */
		connection->connectionState = MULTI_CONNECTION_CONNECTING;

		/*
		 * Ensure that subsequent calls to StartNodeUserDatabaseConnection get a
		 * different connection.
		 */
		connection->claimedExclusively = true;

		if (list_length(workerPool->sessionList) == 0)
		{
			/*
			 * The worker pool has just started to establish connections. We need to
			 * defer this initialization until after StartNodeUserDatabaseConnection()
			 * because for non-optional connections, we have some logic to wait
			 * until a connection is allowed to be established.
			 */
			INSTR_TIME_SET_ZERO(workerPool->poolStartTime);
		}

		/* create a session for the connection */
		WorkerSession *session = FindOrCreateWorkerSession(workerPool, connection);

		/* immediately run the state machine to handle potential failure */
		ConnectionStateMachine(session);
	}
}


/*
|
|
* CheckConnectionTimeout makes sure that the execution enforces the connection
|
|
* establishment timeout defined by the user (NodeConnectionTimeout).
|
|
*
|
|
* The rule is that if a worker pool has already initiated connection establishment
|
|
* and has not succeeded to finish establishments that are necessary to execute tasks,
|
|
* take an action. For the types of actions, see the comments in the function.
|
|
*
|
|
* Enforcing the timeout per pool (over per session) helps the execution to continue
|
|
* even if we can establish a single connection as we expect to have target pool size
|
|
* number of connections. In the end, the executor is capable of using one connection
|
|
* to execute multiple tasks.
|
|
*/
|
|
static void
|
|
CheckConnectionTimeout(WorkerPool *workerPool)
|
|
{
|
|
DistributedExecution *execution = workerPool->distributedExecution;
|
|
instr_time poolStartTime = workerPool->poolStartTime;
|
|
instr_time now;
|
|
INSTR_TIME_SET_CURRENT(now);
|
|
|
|
int initiatedConnectionCount = list_length(workerPool->sessionList);
|
|
int activeConnectionCount = workerPool->activeConnectionCount;
|
|
int requiredActiveConnectionCount = 1;
|
|
|
|
if (initiatedConnectionCount == 0)
|
|
{
|
|
/* no connection has been planned for the pool yet */
|
|
Assert(INSTR_TIME_IS_ZERO(poolStartTime));
|
|
return;
|
|
}
|
|
|
|
/*
|
|
* This is a special case where we assign tasks to sessions even before
|
|
* the connections are established. So, make sure to apply similar
|
|
* restrictions. In this case, make sure that we get all the connections
|
|
* established.
|
|
*/
|
|
if (UseConnectionPerPlacement())
|
|
{
|
|
requiredActiveConnectionCount = initiatedConnectionCount;
|
|
}
|
|
|
|
if (MillisecondsBetweenTimestamps(poolStartTime, now) >= NodeConnectionTimeout)
|
|
{
|
|
if (activeConnectionCount < requiredActiveConnectionCount)
|
|
{
|
|
int logLevel = WARNING;
|
|
|
|
/*
|
|
* First fail the pool and create an opportunity to execute tasks
|
|
* over other pools when tasks have more than one placement to execute.
|
|
*/
|
|
WorkerPoolFailed(workerPool);
|
|
|
|
if (workerPool->failureState == WORKER_POOL_FAILED_OVER_TO_LOCAL)
|
|
{
|
|
/*
* When the pool is failed over to local execution, warning
|
|
* the user just creates chatter as the executor is capable of
|
|
* finishing the execution.
|
|
*/
|
|
logLevel = DEBUG1;
|
|
}
|
|
else if (execution->transactionProperties->errorOnAnyFailure ||
|
|
execution->failed)
|
|
{
|
|
/*
|
|
* The enforcement is not always erroring out. For example, if a SELECT task
|
|
* has two different placements, we'd warn the user, fail the pool and continue
|
|
* with the next placement.
|
|
*/
|
|
logLevel = ERROR;
|
|
}
|
|
|
|
ereport(logLevel, (errcode(ERRCODE_CONNECTION_FAILURE),
|
|
errmsg("could not establish any connections to the node "
|
|
"%s:%d after %u ms", workerPool->nodeName,
|
|
workerPool->nodePort,
|
|
NodeConnectionTimeout)));
|
|
|
|
/*
|
|
* We hit the connection timeout. In that case, we should not let the
|
|
* connection establishment to continue because the execution logic
|
|
* pretends that failed sessions are not going to be used anymore.
|
|
*
|
|
* That's why we mark the connection as timed out to trigger the state
|
|
* changes in the executor.
|
|
*/
|
|
MarkEstablishingSessionsTimedOut(workerPool);
|
|
}
|
|
else
|
|
{
|
|
/* stop interrupting WaitEventSetWait for timeouts */
|
|
workerPool->checkForPoolTimeout = false;
|
|
}
|
|
}
|
|
}
|
|
|
|
|
|
/*
 * MarkEstablishingSessionsTimedOut goes over the sessions in the given
 * workerPool and marks them timed out. ConnectionStateMachine()
 * later cleans up the sessions.
 */
static void
MarkEstablishingSessionsTimedOut(WorkerPool *workerPool)
{
	WorkerSession *session = NULL;
	foreach_ptr(session, workerPool->sessionList)
	{
		MultiConnection *connection = session->connection;

		if (connection->connectionState == MULTI_CONNECTION_CONNECTING ||
			connection->connectionState == MULTI_CONNECTION_INITIAL)
		{
			connection->connectionState = MULTI_CONNECTION_TIMED_OUT;
		}
	}
}


/*
 * UsableConnectionCount returns the number of connections in the worker pool
 * that are (soon to be) usable for sending commands; this includes both idle
 * connections and connections that are still establishing.
 */
static int
UsableConnectionCount(WorkerPool *workerPool)
{
	int initiatedConnectionCount = list_length(workerPool->sessionList);
	int activeConnectionCount = workerPool->activeConnectionCount;
	int failedConnectionCount = workerPool->failedConnectionCount;
	int idleConnectionCount = workerPool->idleConnectionCount;

	/* connections that are still establishing will soon be available for tasks */
	int establishingConnectionCount =
		initiatedConnectionCount - activeConnectionCount - failedConnectionCount;
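	/*
	 * Example with hypothetical counts: 10 initiated, 6 active (of which
	 * 2 idle) and 1 failed connection leave 3 still establishing, so
	 * 2 idle + 3 establishing = 5 usable connections.
	 */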
	int usableConnectionCount = idleConnectionCount + establishingConnectionCount;

	return usableConnectionCount;
}


/*
 * NextEventTimeout finds the earliest time at which we need to interrupt
 * WaitEventSetWait because of a timeout and returns the number of milliseconds
 * until that event, with a minimum of 1ms and a maximum of 1000ms.
 *
 * This code may be sensitive to clock jumps, but that only has the effect of
 * waking up WaitEventSetWait slightly earlier or later.
 */
static long
NextEventTimeout(DistributedExecution *execution)
{
	instr_time now;
	INSTR_TIME_SET_CURRENT(now);
	long eventTimeout = 1000; /* milliseconds */

	WorkerPool *workerPool = NULL;
	foreach_ptr(workerPool, execution->workerList)
	{
		if (workerPool->failureState == WORKER_POOL_FAILED)
		{
			/* worker pool may have already timed out */
			continue;
		}

		if (!INSTR_TIME_IS_ZERO(workerPool->poolStartTime) &&
			workerPool->checkForPoolTimeout)
		{
			long timeSincePoolStartMs =
				MillisecondsBetweenTimestamps(workerPool->poolStartTime, now);

			/*
			 * This could go into the negative if the connection timeout just passed.
			 * In that case we want to wake up as soon as possible. Once the timeout
			 * has been processed, checkForPoolTimeout will be false so we will skip
			 * this check.
			 */
			long timeUntilConnectionTimeoutMs =
				NodeConnectionTimeout - timeSincePoolStartMs;

			if (timeUntilConnectionTimeoutMs < eventTimeout)
			{
				eventTimeout = timeUntilConnectionTimeoutMs;
			}
		}

		int initiatedConnectionCount = list_length(workerPool->sessionList);

		/*
		 * If there are connections to open we wait at most up to the end of the
		 * current slow start interval.
		 */
		if (workerPool->readyTaskCount > UsableConnectionCount(workerPool) &&
			initiatedConnectionCount < execution->targetPoolSize)
		{
			long timeSinceLastConnectMs =
				MillisecondsBetweenTimestamps(workerPool->lastConnectionOpenTime, now);
			long timeUntilSlowStartInterval =
				ExecutorSlowStartInterval - timeSinceLastConnectMs;

			if (timeUntilSlowStartInterval < eventTimeout)
			{
				eventTimeout = timeUntilSlowStartInterval;
			}
		}
	}

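	/*
	 * Example with hypothetical settings: if NodeConnectionTimeout is 5000ms
	 * and 4993ms have passed since the pool started, the loop above lowers
	 * eventTimeout to 7ms; the clamp below then keeps the result between 1ms
	 * and the 1000ms default.
	 */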
	return Max(1, eventTimeout);
}


/*
 * MillisecondsBetweenTimestamps is a helper to get the number of milliseconds
 * between timestamps when it is expected to be small enough to fit in a
 * long.
 */
static long
MillisecondsBetweenTimestamps(instr_time startTime, instr_time endTime)
{
	INSTR_TIME_SUBTRACT(endTime, startTime);
	return INSTR_TIME_GET_MILLISEC(endTime);
}


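/*
 * Overview of the connection state transitions driven below (summarized from
 * the switch statement in ConnectionStateMachine, for orientation only):
 *
 *   INITIAL    -> CONNECTING
 *   TIMED_OUT  -> FAILED
 *   CONNECTING -> CONNECTED once PQconnectPoll() finishes, or FAILED
 *   CONNECTED  -> drives TransactionStateMachine
 *   LOST       -> FAILED
 *   FAILED     -> the connection is shut down and removed from the wait
 *                 event set
 */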
/*
|
|
* ConnectionStateMachine opens a connection and descends into the transaction
|
|
* state machine when ready.
|
|
*/
|
|
static void
|
|
ConnectionStateMachine(WorkerSession *session)
|
|
{
|
|
WorkerPool *workerPool = session->workerPool;
|
|
DistributedExecution *execution = workerPool->distributedExecution;
|
|
|
|
MultiConnection *connection = session->connection;
|
|
MultiConnectionState currentState;
|
|
|
|
do {
|
|
currentState = connection->connectionState;
|
|
|
|
switch (currentState)
|
|
{
|
|
case MULTI_CONNECTION_INITIAL:
|
|
{
|
|
/* simply iterate the state machine */
|
|
connection->connectionState = MULTI_CONNECTION_CONNECTING;
|
|
break;
|
|
}
|
|
|
|
case MULTI_CONNECTION_TIMED_OUT:
|
|
{
|
|
/*
 * When the connection timeout happens, the connection might still be
 * established successfully later on. However, the executor should not try
 * to use this connection as the state machines might have already
 * progressed and used new pools/sessions instead. That's why we terminate
 * the connection and clear any state associated with it.
 */
|
|
connection->connectionState = MULTI_CONNECTION_FAILED;
|
|
break;
|
|
}
|
|
|
|
case MULTI_CONNECTION_CONNECTING:
|
|
{
|
|
ConnStatusType status = PQstatus(connection->pgConn);
|
|
if (status == CONNECTION_OK)
|
|
{
|
|
HandleMultiConnectionSuccess(session);
|
|
UpdateConnectionWaitFlags(session,
|
|
WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
|
|
|
|
connection->connectionState = MULTI_CONNECTION_CONNECTED;
|
|
break;
|
|
}
|
|
else if (status == CONNECTION_BAD)
|
|
{
|
|
connection->connectionState = MULTI_CONNECTION_FAILED;
|
|
break;
|
|
}
|
|
|
|
int beforePollSocket = PQsocket(connection->pgConn);
|
|
PostgresPollingStatusType pollMode = PQconnectPoll(connection->pgConn);
|
|
|
|
if (beforePollSocket != PQsocket(connection->pgConn))
|
|
{
|
|
/* rebuild the wait events if PQconnectPoll() changed the socket */
|
|
execution->rebuildWaitEventSet = true;
|
|
}
|
|
|
|
if (pollMode == PGRES_POLLING_FAILED)
|
|
{
|
|
connection->connectionState = MULTI_CONNECTION_FAILED;
|
|
}
|
|
else if (pollMode == PGRES_POLLING_READING)
|
|
{
|
|
UpdateConnectionWaitFlags(session, WL_SOCKET_READABLE);
|
|
|
|
/* we should have a valid socket */
|
|
Assert(PQsocket(connection->pgConn) != -1);
|
|
}
|
|
else if (pollMode == PGRES_POLLING_WRITING)
|
|
{
|
|
UpdateConnectionWaitFlags(session, WL_SOCKET_WRITEABLE);
|
|
|
|
/* we should have a valid socket */
|
|
Assert(PQsocket(connection->pgConn) != -1);
|
|
}
|
|
else
|
|
{
|
|
HandleMultiConnectionSuccess(session);
|
|
UpdateConnectionWaitFlags(session,
|
|
WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
|
|
|
|
connection->connectionState = MULTI_CONNECTION_CONNECTED;
|
|
|
|
/* we should have a valid socket */
|
|
Assert(PQsocket(connection->pgConn) != -1);
|
|
}
|
|
|
|
break;
|
|
}
|
|
|
|
case MULTI_CONNECTION_CONNECTED:
|
|
{
|
|
/* connection is ready, run the transaction state machine */
|
|
TransactionStateMachine(session);
|
|
break;
|
|
}
|
|
|
|
case MULTI_CONNECTION_LOST:
|
|
{
|
|
/* managed to connect, but connection was lost */
|
|
workerPool->activeConnectionCount--;
|
|
|
|
if (session->currentTask == NULL)
|
|
{
|
|
/* this was an idle connection */
|
|
workerPool->idleConnectionCount--;
|
|
}
|
|
|
|
connection->connectionState = MULTI_CONNECTION_FAILED;
|
|
break;
|
|
}
|
|
|
|
case MULTI_CONNECTION_FAILED:
|
|
{
|
|
/* connection failed or was lost */
|
|
int totalConnectionCount = list_length(workerPool->sessionList);
|
|
|
|
workerPool->failedConnectionCount++;
|
|
|
|
/* if the connection executed a critical command it should fail */
|
|
MarkRemoteTransactionFailed(connection, false);
|
|
|
|
/* mark all assigned placement executions as failed */
|
|
WorkerSessionFailed(session);
|
|
|
|
if (workerPool->failedConnectionCount >= totalConnectionCount)
|
|
{
|
|
/*
|
|
* All current connection attempts have failed.
|
|
* Mark all unassigned placement executions as failed.
|
|
*
|
|
* We do not currently retry if the first connection
|
|
* attempt fails.
|
|
*/
|
|
WorkerPoolFailed(workerPool);
|
|
}
|
|
|
|
/*
|
|
* The execution may have failed as a result of WorkerSessionFailed
|
|
* or WorkerPoolFailed.
|
|
*/
|
|
if (execution->failed ||
|
|
(execution->transactionProperties->errorOnAnyFailure &&
|
|
workerPool->failureState != WORKER_POOL_FAILED_OVER_TO_LOCAL))
|
|
{
|
|
/* a task has failed due to this connection failure */
|
|
ReportConnectionError(connection, ERROR);
|
|
}
|
|
else if (workerPool->activeConnectionCount > 0 ||
|
|
workerPool->failureState == WORKER_POOL_FAILED_OVER_TO_LOCAL)
|
|
{
|
|
/*
|
|
* We already have active connection(s) to the node, and the
|
|
* executor is capable of using those connections to successfully
|
|
* finish the execution. So, there is not much value in warning
|
|
* the user.
|
|
*
|
|
* Similarly when the pool is failed over to local execution, warning
|
|
* the user just creates chatter.
|
|
*/
|
|
ReportConnectionError(connection, DEBUG1);
|
|
}
|
|
else
|
|
{
|
|
ReportConnectionError(connection, WARNING);
|
|
}
|
|
|
|
/* remove the connection */
|
|
UnclaimConnection(connection);
|
|
|
|
/*
|
|
* We forcefully close the underlying libpq connection because
|
|
* we don't want any subsequent execution (either subPlan executions
|
|
* or new command executions within a transaction block) use the
|
|
* connection.
|
|
*
|
|
* However, we prefer to keep the MultiConnection around until
|
|
* the end of FinishDistributedExecution() to simplify the code.
|
|
* Thus, we prefer ShutdownConnection() over CloseConnection().
|
|
*/
|
|
ShutdownConnection(connection);
|
|
|
|
/* remove connection from wait event set */
|
|
execution->rebuildWaitEventSet = true;
|
|
|
|
/*
|
|
* Reset the transaction state machine since CloseConnection()
|
|
* relies on it and even if we're not inside a distributed transaction
|
|
* we set the transaction state (e.g., REMOTE_TRANS_SENT_COMMAND).
|
|
*/
|
|
if (!connection->remoteTransaction.beginSent)
|
|
{
|
|
connection->remoteTransaction.transactionState =
|
|
REMOTE_TRANS_NOT_STARTED;
|
|
}
|
|
|
|
break;
|
|
}
|
|
|
|
default:
|
|
{
|
|
break;
|
|
}
|
|
}
|
|
} while (connection->connectionState != currentState);
|
|
}
|
|
|
|
|
|
/*
 * HandleMultiConnectionSuccess logs the established connection and updates
 * the connection's state.
 */
static void
HandleMultiConnectionSuccess(WorkerSession *session)
{
	MultiConnection *connection = session->connection;
	WorkerPool *workerPool = session->workerPool;

	ereport(DEBUG4, (errmsg("established connection to %s:%d for "
							"session %ld",
							connection->hostname, connection->port,
							session->sessionId)));

	workerPool->activeConnectionCount++;
	workerPool->idleConnectionCount++;
}


/*
|
|
* Activate2PCIfModifyingTransactionExpandsToNewNode sets the coordinated
|
|
* transaction to use 2PC under the following circumstances:
|
|
* - We're already in a transaction block
|
|
* - At least one of the previous commands in the transaction block
|
|
* made a modification, which has not set 2PC itself because it
* was a single shard command
* - The input "session" is used for a distributed execution which
* modifies the database. However, the session (and hence the
* connection) is established to a different worker than the ones
* used previously in the transaction.
|
|
*
|
|
* To give an example,
|
|
* BEGIN;
|
|
* -- assume that the following INSERT goes to worker-A
|
|
* -- also note that this single command does not activate
|
|
* -- 2PC itself since it is a single shard modification
|
|
* INSERT INTO distributed_table (dist_key) VALUES (1);
|
|
*
|
|
* -- do one more single shard UPDATE hitting the same
|
|
* shard (or worker node in general)
|
|
* -- this wouldn't activate 2PC, since we're operating on the
|
|
* -- same worker node that we've modified earlier
|
|
* -- so the executor would use the same connection
|
|
* UPDATE distributed_table SET value = 10 WHERE dist_key = 1;
|
|
*
|
|
* -- now, do one more INSERT, which goes to worker-B
|
|
* -- At this point, this function would activate 2PC
|
|
* -- since we're now expanding to a new node
|
|
* -- for example, if this command were a SELECT, we wouldn't
|
|
* -- activate 2PC since we're only interested in modifications/DDLs
|
|
* INSERT INTO distributed_table (dist_key) VALUES (2);
|
|
*/
|
|
static void
|
|
Activate2PCIfModifyingTransactionExpandsToNewNode(WorkerSession *session)
|
|
{
|
|
if (MultiShardCommitProtocol != COMMIT_PROTOCOL_2PC)
|
|
{
|
|
/* we don't need 2PC, so no need to continue */
|
|
return;
|
|
}
|
|
|
|
DistributedExecution *execution = session->workerPool->distributedExecution;
|
|
if (TransactionModifiedDistributedTable(execution) &&
|
|
DistributedExecutionModifiesDatabase(execution) &&
|
|
!ConnectionModifiedPlacement(session->connection))
|
|
{
|
|
/*
|
|
* We already did a modification, but not on the connection that we
|
|
* just opened, which means we're now going to make modifications
|
|
* over multiple connections. Activate 2PC!
|
|
*/
|
|
CoordinatedTransactionShouldUse2PC();
|
|
}
|
|
}
|
|
|
|
|
|
/*
 * TransactionModifiedDistributedTable returns true if the current transaction
 * has already executed a command which modified at least one distributed table.
 */
static bool
TransactionModifiedDistributedTable(DistributedExecution *execution)
{
	/*
	 * We need to explicitly check for TRANSACTION_BLOCKS_REQUIRED due to the
	 * citus.function_opens_transaction_block flag. When set to false, we
	 * should not be pretending that we're in a coordinated transaction even
	 * if XACT_MODIFICATION_DATA is set. That's why we implemented this workaround.
	 */
	return execution->transactionProperties->useRemoteTransactionBlocks ==
		   TRANSACTION_BLOCKS_REQUIRED &&
		   XactModificationLevel == XACT_MODIFICATION_DATA;
}


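/*
 * A rough sketch of the remote transaction states handled below (summarized
 * from the switch statement in TransactionStateMachine, for orientation only):
 *
 *   NOT_STARTED      -> send BEGIN when transaction blocks are required,
 *                       otherwise pop a ready task and send its query
 *   SENT_BEGIN /
 *   CLEARING_RESULTS -> consume pending results, then STARTED or NOT_STARTED
 *   STARTED          -> pop the next ready task and send its query
 *   SENT_COMMAND     -> receive results; either send the next query of a
 *                       multi-query task or move to CLEARING_RESULTS
 */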
/*
|
|
* TransactionStateMachine manages the execution of tasks over a connection.
|
|
*/
|
|
static void
|
|
TransactionStateMachine(WorkerSession *session)
|
|
{
|
|
WorkerPool *workerPool = session->workerPool;
|
|
DistributedExecution *execution = workerPool->distributedExecution;
|
|
TransactionBlocksUsage useRemoteTransactionBlocks =
|
|
execution->transactionProperties->useRemoteTransactionBlocks;
|
|
|
|
MultiConnection *connection = session->connection;
|
|
RemoteTransaction *transaction = &(connection->remoteTransaction);
|
|
RemoteTransactionState currentState;
|
|
|
|
do {
|
|
currentState = transaction->transactionState;
|
|
|
|
if (!CheckConnectionReady(session))
|
|
{
|
|
/* connection is busy, no state transitions to make */
|
|
break;
|
|
}
|
|
|
|
switch (currentState)
|
|
{
|
|
case REMOTE_TRANS_NOT_STARTED:
|
|
{
|
|
if (useRemoteTransactionBlocks == TRANSACTION_BLOCKS_REQUIRED)
|
|
{
|
|
/* if we're expanding the nodes in a transaction, use 2PC */
|
|
Activate2PCIfModifyingTransactionExpandsToNewNode(session);
|
|
|
|
/* need to open a transaction block first */
|
|
StartRemoteTransactionBegin(connection);
|
|
|
|
transaction->transactionState = REMOTE_TRANS_CLEARING_RESULTS;
|
|
}
|
|
else
|
|
{
|
|
TaskPlacementExecution *placementExecution = PopPlacementExecution(
|
|
session);
|
|
if (placementExecution == NULL)
|
|
{
|
|
/*
|
|
* No tasks are ready to be executed at the moment. But we
|
|
* still mark the socket readable to get any notices, if any exist.
|
|
*/
|
|
UpdateConnectionWaitFlags(session, WL_SOCKET_READABLE);
|
|
|
|
break;
|
|
}
|
|
|
|
bool placementExecutionStarted =
|
|
StartPlacementExecutionOnSession(placementExecution, session);
|
|
if (!placementExecutionStarted)
|
|
{
|
|
/* no need to continue, connection is lost */
|
|
Assert(session->connection->connectionState ==
|
|
MULTI_CONNECTION_LOST);
|
|
|
|
return;
|
|
}
|
|
|
|
transaction->transactionState = REMOTE_TRANS_SENT_COMMAND;
|
|
}
|
|
|
|
UpdateConnectionWaitFlags(session,
|
|
WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
|
|
break;
|
|
}
|
|
|
|
case REMOTE_TRANS_SENT_BEGIN:
|
|
case REMOTE_TRANS_CLEARING_RESULTS:
|
|
{
|
|
PGresult *result = PQgetResult(connection->pgConn);
|
|
if (result != NULL)
|
|
{
|
|
if (!IsResponseOK(result))
|
|
{
|
|
/* query failures are always hard errors */
|
|
ReportResultError(connection, result, ERROR);
|
|
}
|
|
|
|
PQclear(result);
|
|
|
|
/* wake up WaitEventSetWait */
|
|
UpdateConnectionWaitFlags(session,
|
|
WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
|
|
|
|
break;
|
|
}
|
|
|
|
if (session->currentTask != NULL)
|
|
{
|
|
TaskPlacementExecution *placementExecution = session->currentTask;
|
|
bool succeeded = true;
|
|
|
|
/*
|
|
* Once we finished a task on a connection, we no longer
|
|
* allow that connection to fail.
|
|
*/
|
|
MarkRemoteTransactionCritical(connection);
|
|
|
|
session->currentTask = NULL;
|
|
|
|
PlacementExecutionDone(placementExecution, succeeded);
|
|
|
|
/* connection is ready to use for executing commands */
|
|
workerPool->idleConnectionCount++;
|
|
}
|
|
|
|
/* connection needs to be writeable to send next command */
|
|
UpdateConnectionWaitFlags(session,
|
|
WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
|
|
|
|
if (transaction->beginSent)
|
|
{
|
|
transaction->transactionState = REMOTE_TRANS_STARTED;
|
|
}
|
|
else
|
|
{
|
|
transaction->transactionState = REMOTE_TRANS_NOT_STARTED;
|
|
}
|
|
break;
|
|
}
|
|
|
|
case REMOTE_TRANS_STARTED:
|
|
{
|
|
TaskPlacementExecution *placementExecution = PopPlacementExecution(
|
|
session);
|
|
if (placementExecution == NULL)
|
|
{
|
|
/* no tasks are ready to be executed at the moment */
|
|
UpdateConnectionWaitFlags(session, WL_SOCKET_READABLE);
|
|
break;
|
|
}
|
|
|
|
bool placementExecutionStarted =
|
|
StartPlacementExecutionOnSession(placementExecution, session);
|
|
if (!placementExecutionStarted)
|
|
{
|
|
/* no need to continue, connection is lost */
|
|
Assert(session->connection->connectionState == MULTI_CONNECTION_LOST);
|
|
|
|
return;
|
|
}
|
|
|
|
transaction->transactionState = REMOTE_TRANS_SENT_COMMAND;
|
|
break;
|
|
}
|
|
|
|
case REMOTE_TRANS_SENT_COMMAND:
|
|
{
|
|
TaskPlacementExecution *placementExecution = session->currentTask;
|
|
if (placementExecution == NULL)
|
|
{
|
|
/*
|
|
* We have seen accounts in production where the placementExecution
|
|
* could inadvertently be not set. Investigation documented on
|
|
* https://github.com/citusdata/citus-enterprise/issues/493
|
|
* (due to sensitive data in the initial report it is not discussed
|
|
* in our community repository)
|
|
*
|
|
* Currently we don't have a reliable way of reproducing this issue.
|
|
* Erroring here seems to be a more desirable approach compared to a
|
|
* SEGFAULT on the dereference of placementExecution, with a possible
|
|
* crash recovery as a result.
|
|
*/
|
|
ereport(ERROR, (errmsg(
|
|
"unable to recover from inconsistent state in "
|
|
"the connection state machine on coordinator")));
|
|
}
|
|
|
|
ShardCommandExecution *shardCommandExecution =
|
|
placementExecution->shardCommandExecution;
|
|
Task *task = shardCommandExecution->task;
|
|
|
|
/*
|
|
* In EXPLAIN ANALYZE we need to store results except for multiple placements,
|
|
* regardless of query type. In other cases, doing the same doesn't seem to have
|
|
* a drawback.
|
|
*/
|
|
bool storeRows = true;
|
|
|
|
if (shardCommandExecution->gotResults)
|
|
{
|
|
/* already received results from another replica */
|
|
storeRows = false;
|
|
}
|
|
else if (task->partiallyLocalOrRemote)
|
|
{
|
|
/*
|
|
* For the tasks that involves placements from both
|
|
* remote and local placments, such as modifications
|
|
* to reference tables, we store the rows during the
|
|
* local placement/execution.
|
|
*/
|
|
storeRows = false;
|
|
}
|
|
|
|
bool fetchDone = ReceiveResults(session, storeRows);
|
|
if (!fetchDone)
|
|
{
|
|
break;
|
|
}
|
|
|
|
/* if this is a multi-query task, send the next query */
|
|
if (placementExecution->queryIndex < task->queryCount)
|
|
{
|
|
bool querySent = SendNextQuery(placementExecution, session);
|
|
if (!querySent)
|
|
{
|
|
/* no need to continue, connection is lost */
|
|
Assert(session->connection->connectionState ==
|
|
MULTI_CONNECTION_LOST);
|
|
|
|
return;
|
|
}
|
|
|
|
/*
|
|
* At this point the query might be just in pgconn buffers. We
|
|
* need to wait until it becomes writeable to actually send
|
|
* the query.
|
|
*/
|
|
UpdateConnectionWaitFlags(session,
|
|
WL_SOCKET_WRITEABLE | WL_SOCKET_READABLE);
|
|
|
|
transaction->transactionState = REMOTE_TRANS_SENT_COMMAND;
|
|
|
|
break;
|
|
}
|
|
|
|
shardCommandExecution->gotResults = true;
|
|
transaction->transactionState = REMOTE_TRANS_CLEARING_RESULTS;
|
|
break;
|
|
}
|
|
|
|
default:
|
|
{
|
|
break;
|
|
}
|
|
}
|
|
}
|
|
/* iterate in case we can perform multiple transitions at once */
|
|
while (transaction->transactionState != currentState);
|
|
}
|
|
|
|
|
|
/*
 * UpdateConnectionWaitFlags is a wrapper around setting waitFlags of the connection.
 *
 * This function could be further improved by calling ModifyWaitEvent on waitFlag
 * changes, as opposed to what we do now: always rebuilding the wait event sets.
 * Our initial benchmarks didn't show any significant performance improvement, but
 * the potential improvement is worth keeping in mind.
 */
static void
UpdateConnectionWaitFlags(WorkerSession *session, int waitFlags)
{
	MultiConnection *connection = session->connection;
	DistributedExecution *execution = session->workerPool->distributedExecution;

	/* do not take any actions if the flags have not changed */
	if (connection->waitFlags == waitFlags)
	{
		return;
	}

	connection->waitFlags = waitFlags;

	/* without signalling the execution, the flag changes won't be reflected */
	execution->waitFlagsChanged = true;
}


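/*
 * Illustrative sketch (not compiled into the executor): the two wait-flag
 * settings the connection state machine above toggles between. While a command
 * still needs to be sent or flushed, the session also waits for the socket to
 * become writeable; while it is only waiting for results or notices, readable
 * suffices.
 */
#if 0
	/* a command is queued or partially flushed: wake up when we can write */
	UpdateConnectionWaitFlags(session, WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);

	/* only waiting for the worker's results (or notices): wake up on reads */
	UpdateConnectionWaitFlags(session, WL_SOCKET_READABLE);
#endif

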
/*
 * CheckConnectionReady returns true if the connection is ready to
 * read or write, or false if it still has bytes to send/receive.
 */
static bool
CheckConnectionReady(WorkerSession *session)
{
	MultiConnection *connection = session->connection;
	int waitFlags = WL_SOCKET_READABLE;
	bool connectionReady = false;

	ConnStatusType status = PQstatus(connection->pgConn);
	if (status == CONNECTION_BAD)
	{
		connection->connectionState = MULTI_CONNECTION_LOST;
		return false;
	}

	/* try to send all pending data */
	int sendStatus = PQflush(connection->pgConn);
	if (sendStatus == -1)
	{
		connection->connectionState = MULTI_CONNECTION_LOST;
		return false;
	}
	else if (sendStatus == 1)
	{
		/* more data to send, wait for socket to become writable */
		waitFlags = waitFlags | WL_SOCKET_WRITEABLE;
	}

	if ((session->latestUnconsumedWaitEvents & WL_SOCKET_READABLE) != 0)
	{
		if (PQconsumeInput(connection->pgConn) == 0)
		{
			connection->connectionState = MULTI_CONNECTION_LOST;
			return false;
		}
	}

	if (!PQisBusy(connection->pgConn))
	{
		connectionReady = true;
	}

	UpdateConnectionWaitFlags(session, waitFlags);

	/* don't consume input redundantly if we cycle back into CheckConnectionReady */
	session->latestUnconsumedWaitEvents = 0;

	return connectionReady;
}


/*
 * PopPlacementExecution returns the next available assigned or unassigned
 * placement execution for the given session.
 */
static TaskPlacementExecution *
PopPlacementExecution(WorkerSession *session)
{
	WorkerPool *workerPool = session->workerPool;

	TaskPlacementExecution *placementExecution = PopAssignedPlacementExecution(session);
	if (placementExecution == NULL)
	{
		if (session->commandsSent > 0 && UseConnectionPerPlacement())
		{
			/*
			 * Only send one command per connection if force_max_query_parallelization
			 * is enabled, unless it's an assigned placement execution.
			 */
			return NULL;
		}

		/* no more assigned tasks, pick an unassigned task */
		placementExecution = PopUnassignedPlacementExecution(workerPool);
	}

	return placementExecution;
}


/*
 * PopAssignedPlacementExecution finds an executable task from the queue of assigned tasks.
 */
static TaskPlacementExecution *
PopAssignedPlacementExecution(WorkerSession *session)
{
	dlist_head *readyTaskQueue = &(session->readyTaskQueue);

	if (dlist_is_empty(readyTaskQueue))
	{
		return NULL;
	}

	TaskPlacementExecution *placementExecution = dlist_container(TaskPlacementExecution,
																  sessionReadyQueueNode,
																  dlist_pop_head_node(
																	  readyTaskQueue));

	return placementExecution;
}


/*
 * PopUnassignedPlacementExecution finds an executable task from the queue of
 * unassigned tasks in the worker pool.
 */
static TaskPlacementExecution *
PopUnassignedPlacementExecution(WorkerPool *workerPool)
{
	dlist_head *readyTaskQueue = &(workerPool->readyTaskQueue);

	if (dlist_is_empty(readyTaskQueue))
	{
		return NULL;
	}

	TaskPlacementExecution *placementExecution = dlist_container(TaskPlacementExecution,
																  workerReadyQueueNode,
																  dlist_pop_head_node(
																	  readyTaskQueue));

	workerPool->readyTaskCount--;

	return placementExecution;
}


/*
 * StartPlacementExecutionOnSession gets a TaskPlacementExecution and a
 * WorkerSession, and sends the task's query to the worker via the session.
 *
 * The function does some bookkeeping such as associating the placement
 * accesses with the connection and updating the session's local variables.
 * For details read the comments in the function.
 *
 * The function returns true if the query is successfully sent over the
 * connection, otherwise false.
 */
static bool
StartPlacementExecutionOnSession(TaskPlacementExecution *placementExecution,
								 WorkerSession *session)
{
	WorkerPool *workerPool = session->workerPool;
	DistributedExecution *execution = workerPool->distributedExecution;
	MultiConnection *connection = session->connection;
	ShardCommandExecution *shardCommandExecution =
		placementExecution->shardCommandExecution;
	Task *task = shardCommandExecution->task;
	ShardPlacement *taskPlacement = placementExecution->shardPlacement;
	List *placementAccessList = PlacementAccessListForTask(task, taskPlacement);

	if (execution->transactionProperties->useRemoteTransactionBlocks !=
		TRANSACTION_BLOCKS_DISALLOWED)
	{
		/*
		 * Make sure that subsequent commands on the same placement
		 * use the same connection.
		 */
		AssignPlacementListToConnection(placementAccessList, connection);
	}

	if (session->commandsSent == 0)
	{
		/* first time we send a command, consider the connection used (not unused) */
		workerPool->unusedConnectionCount--;
	}

	/* connection is going to be in use */
	workerPool->idleConnectionCount--;
	session->currentTask = placementExecution;
	placementExecution->executionState = PLACEMENT_EXECUTION_RUNNING;

	bool querySent = SendNextQuery(placementExecution, session);
	if (querySent)
	{
		session->commandsSent++;

		if (workerPool->poolToLocalNode)
		{
			/*
			 * As we started remote execution to the local node,
			 * we cannot switch back to local execution as that
			 * would cause self-deadlocks and break
			 * read-your-own-writes consistency.
			 */
			SetLocalExecutionStatus(LOCAL_EXECUTION_DISABLED);
		}
	}

	return querySent;
}


/*
 * SendNextQuery sends the next query for placementExecution on the given
 * session.
 */
static bool
SendNextQuery(TaskPlacementExecution *placementExecution,
			  WorkerSession *session)
{
	WorkerPool *workerPool = session->workerPool;
	DistributedExecution *execution = workerPool->distributedExecution;
	MultiConnection *connection = session->connection;
	ShardCommandExecution *shardCommandExecution =
		placementExecution->shardCommandExecution;
	bool binaryResults = shardCommandExecution->binaryResults;
	Task *task = shardCommandExecution->task;
	ParamListInfo paramListInfo = execution->paramListInfo;
	int querySent = 0;
	uint32 queryIndex = placementExecution->queryIndex;

	Assert(queryIndex < task->queryCount);
	char *queryString = TaskQueryStringAtIndex(task, queryIndex);

	if (paramListInfo != NULL && !task->parametersInQueryStringResolved)
	{
		int parameterCount = paramListInfo->numParams;
		Oid *parameterTypes = NULL;
		const char **parameterValues = NULL;

		/* force evaluation of bound params */
		paramListInfo = copyParamList(paramListInfo);

		ExtractParametersForRemoteExecution(paramListInfo, &parameterTypes,
											&parameterValues);
		querySent = SendRemoteCommandParams(connection, queryString, parameterCount,
											parameterTypes, parameterValues,
											binaryResults);
	}
	else
	{
		/*
		 * We only need to use SendRemoteCommandParams when we desire
		 * binaryResults. One downside of SendRemoteCommandParams is that it
		 * only supports one query in the query string. In some cases we have
		 * more than one query. In those cases we have already made sure that
		 * binaryResults is false.
		 *
		 * XXX: It also seems that SendRemoteCommandParams does something
		 * strange/incorrect with SELECT statements. In
		 * isolation_select_vs_all.spec, an s1-router-select in one session
		 * blocked an s2-ddl-create-index-concurrently in another.
		 */
		if (!binaryResults)
		{
			querySent = SendRemoteCommand(connection, queryString);
		}
		else
		{
			querySent = SendRemoteCommandParams(connection, queryString, 0, NULL, NULL,
												binaryResults);
		}
	}

	if (querySent == 0)
	{
		connection->connectionState = MULTI_CONNECTION_LOST;
		return false;
	}

	int singleRowMode = PQsetSingleRowMode(connection->pgConn);
	if (singleRowMode == 0)
	{
		connection->connectionState = MULTI_CONNECTION_LOST;
		return false;
	}

	return true;
}


/*
 * ReceiveResults reads the result of a command or query and writes returned
 * rows to the tuple store of the scan state. It returns whether fetching the
 * results is done. On failure, it throws an error.
 */
static bool
ReceiveResults(WorkerSession *session, bool storeRows)
{
	bool fetchDone = false;
	MultiConnection *connection = session->connection;
	WorkerPool *workerPool = session->workerPool;
	DistributedExecution *execution = workerPool->distributedExecution;
	TaskPlacementExecution *placementExecution = session->currentTask;
	ShardCommandExecution *shardCommandExecution =
		placementExecution->shardCommandExecution;
	Task *task = placementExecution->shardCommandExecution->task;
	TupleDestination *tupleDest = task->tupleDest ?
								  task->tupleDest :
								  execution->defaultTupleDest;

	/*
	 * We use this context while converting each row fetched from the remote
	 * node into a tuple. The context is reset after every row, thus we create
	 * it before the loop and reset it on every iteration.
	 */
	MemoryContext rowContext = AllocSetContextCreate(CurrentMemoryContext,
													 "RowContext",
													 ALLOCSET_DEFAULT_MINSIZE,
													 ALLOCSET_DEFAULT_INITSIZE,
													 ALLOCSET_DEFAULT_MAXSIZE);

	while (!PQisBusy(connection->pgConn))
	{
		uint32 columnIndex = 0;
		uint32 rowsProcessed = 0;

		PGresult *result = PQgetResult(connection->pgConn);
		if (result == NULL)
		{
			/* no more results, break out of loop and free allocated memory */
			fetchDone = true;
			break;
		}

		ExecStatusType resultStatus = PQresultStatus(result);
		if (resultStatus == PGRES_COMMAND_OK)
		{
			char *currentAffectedTupleString = PQcmdTuples(result);
			int64 currentAffectedTupleCount = 0;

			/* if there are multiple replicas, make sure to consider only one */
			if (storeRows && *currentAffectedTupleString != '\0')
			{
				scanint8(currentAffectedTupleString, false, &currentAffectedTupleCount);
				Assert(currentAffectedTupleCount >= 0);
				execution->rowsProcessed += currentAffectedTupleCount;
			}

			PQclear(result);

			/* task query might contain multiple queries, so fetch until we reach NULL */
			placementExecution->queryIndex++;
			continue;
		}
		else if (resultStatus == PGRES_TUPLES_OK)
		{
			/*
			 * We've already consumed all the tuples, no more results. Break out
			 * of loop and free allocated memory before returning.
			 */
			Assert(PQntuples(result) == 0);
			PQclear(result);

			/* task query might contain multiple queries, so fetch until we reach NULL */
			placementExecution->queryIndex++;
			continue;
		}
		else if (resultStatus != PGRES_SINGLE_TUPLE)
		{
			/* query failures are always hard errors */
			ReportResultError(connection, result, ERROR);
		}
		else if (!storeRows)
		{
			/*
			 * Already received rows from executing on another shard placement,
			 * or the rows are not needed at all (e.g., DDL).
			 */
			PQclear(result);
			continue;
		}

		uint32 queryIndex = placementExecution->queryIndex;
		if (queryIndex >= task->queryCount)
		{
			ereport(ERROR, (errmsg("unexpected query index while processing"
								   " query results")));
		}

		TupleDesc tupleDescriptor = tupleDest->tupleDescForQuery(tupleDest, queryIndex);
		if (tupleDescriptor == NULL)
		{
			PQclear(result);
			continue;
		}

		rowsProcessed = PQntuples(result);
		uint32 columnCount = PQnfields(result);
		uint32 expectedColumnCount = tupleDescriptor->natts;

		if (columnCount != expectedColumnCount)
		{
			ereport(ERROR, (errmsg("unexpected number of columns from worker: %d, "
								   "expected %d",
								   columnCount, expectedColumnCount)));
		}

		if (columnCount > execution->allocatedColumnCount)
		{
			pfree(execution->columnArray);
			int oldColumnCount = execution->allocatedColumnCount;
			execution->allocatedColumnCount = columnCount;
			execution->columnArray = palloc0(execution->allocatedColumnCount *
											 sizeof(void *));
			if (EnableBinaryProtocol)
			{
				/*
				 * Using repalloc here, to not throw away any previously
				 * created StringInfos.
				 */
				execution->stringInfoDataArray = repalloc(
					execution->stringInfoDataArray,
					execution->allocatedColumnCount *
					sizeof(StringInfoData));
				for (int i = oldColumnCount; i < columnCount; i++)
				{
					initStringInfo(&execution->stringInfoDataArray[i]);
				}
			}
		}

		void **columnArray = execution->columnArray;
		StringInfoData *stringInfoDataArray = execution->stringInfoDataArray;
		bool binaryResults = shardCommandExecution->binaryResults;

		/*
		 * stringInfoDataArray is NULL when EnableBinaryProtocol is false. So
		 * we make sure binaryResults is also false in that case. Otherwise we
		 * cannot store them anywhere.
		 */
		Assert(EnableBinaryProtocol || !binaryResults);

		for (uint32 rowIndex = 0; rowIndex < rowsProcessed; rowIndex++)
		{
			uint64 tupleLibpqSize = 0;

			/*
			 * Switch to a temporary memory context that we reset after each
			 * tuple. This protects us from any memory leaks that might be
			 * present in anything we do to parse a tuple.
			 */
			MemoryContext oldContext = MemoryContextSwitchTo(rowContext);

			memset(columnArray, 0, columnCount * sizeof(void *));

			for (columnIndex = 0; columnIndex < columnCount; columnIndex++)
			{
				if (PQgetisnull(result, rowIndex, columnIndex))
				{
					columnArray[columnIndex] = NULL;
				}
				else
				{
					int valueLength = PQgetlength(result, rowIndex, columnIndex);
					char *value = PQgetvalue(result, rowIndex, columnIndex);
					if (binaryResults)
					{
						if (PQfformat(result, columnIndex) == 0)
						{
							ereport(ERROR, (errmsg("unexpected text result")));
						}
						resetStringInfo(&stringInfoDataArray[columnIndex]);
						appendBinaryStringInfo(&stringInfoDataArray[columnIndex],
											   value, valueLength);
						columnArray[columnIndex] = &stringInfoDataArray[columnIndex];
					}
					else
					{
						if (PQfformat(result, columnIndex) == 1)
						{
							ereport(ERROR, (errmsg("unexpected binary result")));
						}
						columnArray[columnIndex] = value;
					}

					tupleLibpqSize += valueLength;
				}
			}

			AttInMetadata *attInMetadata =
				shardCommandExecution->attributeInputMetadata[queryIndex];
			HeapTuple heapTuple;
			if (binaryResults)
			{
				heapTuple = BuildTupleFromBytes(attInMetadata,
												(fmStringInfo *) columnArray);
			}
			else
			{
				heapTuple = BuildTupleFromCStrings(attInMetadata,
												   (char **) columnArray);
			}

			MemoryContextSwitchTo(oldContext);

			tupleDest->putTuple(tupleDest, task,
								placementExecution->placementExecutionIndex, queryIndex,
								heapTuple, tupleLibpqSize);

			MemoryContextReset(rowContext);

			execution->rowsProcessed++;
		}

		PQclear(result);
	}

	/* the context is local to the function, so not needed anymore */
	MemoryContextDelete(rowContext);

	return fetchDone;
}


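/*
 * Illustrative sketch (not compiled; names mirror the function above): the
 * libpq result sequence ReceiveResults consumes once SendNextQuery has enabled
 * single-row mode. Each row arrives as a PGRES_SINGLE_TUPLE result, the end of
 * a query is signalled by an empty PGRES_TUPLES_OK (or PGRES_COMMAND_OK for
 * commands), and a NULL result marks the end of all results. Unlike this
 * blocking sketch, ReceiveResults only calls PQgetResult while PQisBusy()
 * reports false, so it never blocks on the socket.
 */
#if 0
	PGresult *result = NULL;
	while ((result = PQgetResult(connection->pgConn)) != NULL)
	{
		ExecStatusType status = PQresultStatus(result);
		if (status == PGRES_SINGLE_TUPLE)
		{
			/* one row of the current query, store it */
		}
		else if (status == PGRES_TUPLES_OK || status == PGRES_COMMAND_OK)
		{
			/* current query is done, the next results belong to the next query */
		}
		PQclear(result);
	}
#endif

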
/*
 * TupleDescGetAttBinaryInMetadata - Build an AttInMetadata structure based on
 * the supplied TupleDesc. AttInMetadata can be used in conjunction with
 * fmStringInfos containing binary encoded types to produce a properly formed
 * tuple.
 *
 * NOTE: This function is a copy of the PG function TupleDescGetAttInMetadata,
 * except that it uses getTypeBinaryInputInfo instead of getTypeInputInfo.
 */
static AttInMetadata *
TupleDescGetAttBinaryInMetadata(TupleDesc tupdesc)
{
	int natts = tupdesc->natts;
	int i;
	Oid atttypeid;
	Oid attinfuncid;

	AttInMetadata *attinmeta = (AttInMetadata *) palloc(sizeof(AttInMetadata));

	/* "Bless" the tupledesc so that we can make rowtype datums with it */
	attinmeta->tupdesc = BlessTupleDesc(tupdesc);

	/*
	 * Gather info needed later to call the "in" function for each attribute
	 */
	FmgrInfo *attinfuncinfo = (FmgrInfo *) palloc0(natts * sizeof(FmgrInfo));
	Oid *attioparams = (Oid *) palloc0(natts * sizeof(Oid));
	int32 *atttypmods = (int32 *) palloc0(natts * sizeof(int32));

	for (i = 0; i < natts; i++)
	{
		Form_pg_attribute att = TupleDescAttr(tupdesc, i);

		/* Ignore dropped attributes */
		if (!att->attisdropped)
		{
			atttypeid = att->atttypid;
			getTypeBinaryInputInfo(atttypeid, &attinfuncid, &attioparams[i]);
			fmgr_info(attinfuncid, &attinfuncinfo[i]);
			atttypmods[i] = att->atttypmod;
		}
	}
	attinmeta->attinfuncs = attinfuncinfo;
	attinmeta->attioparams = attioparams;
	attinmeta->atttypmods = atttypmods;

	return attinmeta;
}


/*
 * BuildTupleFromBytes - build a HeapTuple given user data in binary form.
 * values is an array of StringInfos, one for each attribute of the return
 * tuple. A NULL StringInfo pointer indicates we want to create a NULL field.
 *
 * NOTE: This function is a copy of the PG function BuildTupleFromCStrings,
 * except that it uses ReceiveFunctionCall instead of InputFunctionCall.
 */
static HeapTuple
BuildTupleFromBytes(AttInMetadata *attinmeta, fmStringInfo *values)
{
	TupleDesc tupdesc = attinmeta->tupdesc;
	int natts = tupdesc->natts;
	int i;

	Datum *dvalues = (Datum *) palloc(natts * sizeof(Datum));
	bool *nulls = (bool *) palloc(natts * sizeof(bool));

	/*
	 * Call the "in" function for each non-dropped attribute, even for nulls,
	 * to support domains.
	 */
	for (i = 0; i < natts; i++)
	{
		if (!TupleDescAttr(tupdesc, i)->attisdropped)
		{
			/* Non-dropped attributes */
			dvalues[i] = ReceiveFunctionCall(&attinmeta->attinfuncs[i],
											 values[i],
											 attinmeta->attioparams[i],
											 attinmeta->atttypmods[i]);
			if (values[i] != NULL)
			{
				nulls[i] = false;
			}
			else
			{
				nulls[i] = true;
			}
		}
		else
		{
			/* Handle dropped attributes by setting to NULL */
			dvalues[i] = (Datum) 0;
			nulls[i] = true;
		}
	}

	/*
	 * Form a tuple
	 */
	HeapTuple tuple = heap_form_tuple(tupdesc, dvalues, nulls);

	/*
	 * Release locally palloc'd space. XXX would probably be good to pfree
	 * values of pass-by-reference datums, as well.
	 */
	pfree(dvalues);
	pfree(nulls);

	return tuple;
}


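/*
 * Illustrative sketch (not compiled; tupleDescriptor and the single int4 column
 * are hypothetical): the binary pair above mirrors PostgreSQL's
 * TupleDescGetAttInMetadata/BuildTupleFromCStrings. For an int4 column, the
 * StringInfo must contain the 4-byte network-order representation that the
 * type's receive function (int4recv) expects.
 */
#if 0
	AttInMetadata *binaryMetadata = TupleDescGetAttBinaryInMetadata(tupleDescriptor);

	StringInfoData value;
	initStringInfo(&value);
	pq_sendint32(&value, 42);       /* network-order int4, as int4recv expects */

	fmStringInfo columnValues[1] = { &value };
	HeapTuple tuple = BuildTupleFromBytes(binaryMetadata, columnValues);
#endif

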
/*
 * WorkerPoolFailed marks a worker pool and all the placement executions scheduled
 * on it as failed.
 */
static void
WorkerPoolFailed(WorkerPool *workerPool)
{
	bool succeeded = false;
	dlist_iter iter;

	/*
	 * A pool cannot fail multiple times, the necessary actions have already
	 * been taken, so bail out.
	 */
	if (workerPool->failureState == WORKER_POOL_FAILED ||
		workerPool->failureState == WORKER_POOL_FAILED_OVER_TO_LOCAL)
	{
		return;
	}

	dlist_foreach(iter, &workerPool->pendingTaskQueue)
	{
		TaskPlacementExecution *placementExecution =
			dlist_container(TaskPlacementExecution, workerPendingQueueNode, iter.cur);

		PlacementExecutionDone(placementExecution, succeeded);
	}

	dlist_foreach(iter, &workerPool->readyTaskQueue)
	{
		TaskPlacementExecution *placementExecution =
			dlist_container(TaskPlacementExecution, workerReadyQueueNode, iter.cur);

		PlacementExecutionDone(placementExecution, succeeded);
	}

	WorkerSession *session = NULL;
	foreach_ptr(session, workerPool->sessionList)
	{
		WorkerSessionFailed(session);
	}

	/* we do not want more connections in this pool */
	workerPool->readyTaskCount = 0;
	if (workerPool->failureState != WORKER_POOL_FAILED_OVER_TO_LOCAL)
	{
		/* we prefer not to override WORKER_POOL_FAILED_OVER_TO_LOCAL */
		workerPool->failureState = WORKER_POOL_FAILED;
	}

	/*
	 * The reason for the loop below is that when the replication factor is > 1
	 * and we are performing a SELECT, we only establish connections for the
	 * specific placements that we will read from. However, when a worker pool
	 * fails, we will need to establish multiple new connections to other
	 * workers and the query can only succeed if all those connections are
	 * established.
	 */
	if (UseConnectionPerPlacement())
	{
		List *workerList = workerPool->distributedExecution->workerList;

		WorkerPool *pool = NULL;
		foreach_ptr(pool, workerList)
		{
			/* failed pools or pools without any connection attempts ignored */
			if (pool->failureState == WORKER_POOL_FAILED ||
				INSTR_TIME_IS_ZERO(pool->poolStartTime))
			{
				continue;
			}

			/*
			 * This should give another NodeConnectionTimeout until all
			 * the necessary connections are established.
			 */
			INSTR_TIME_SET_CURRENT(pool->poolStartTime);
			pool->checkForPoolTimeout = true;
		}
	}
}


/*
 * WorkerSessionFailed marks all placement executions scheduled on the
 * connection as failed.
 */
static void
WorkerSessionFailed(WorkerSession *session)
{
	TaskPlacementExecution *placementExecution = session->currentTask;
	bool succeeded = false;
	dlist_iter iter;

	if (placementExecution != NULL)
	{
		/* connection failed while a task was active */
		PlacementExecutionDone(placementExecution, succeeded);
	}

	dlist_foreach(iter, &session->pendingTaskQueue)
	{
		placementExecution =
			dlist_container(TaskPlacementExecution, sessionPendingQueueNode, iter.cur);

		PlacementExecutionDone(placementExecution, succeeded);
	}

	dlist_foreach(iter, &session->readyTaskQueue)
	{
		placementExecution =
			dlist_container(TaskPlacementExecution, sessionReadyQueueNode, iter.cur);

		PlacementExecutionDone(placementExecution, succeeded);
	}
}


/*
 * PlacementExecutionDone marks the given placement execution as done when
 * the results have been received or a failure occurred and sets the succeeded
 * flag accordingly. It also adds other placement executions of the same
 * task to the appropriate ready queues.
 */
static void
PlacementExecutionDone(TaskPlacementExecution *placementExecution, bool succeeded)
{
	WorkerPool *workerPool = placementExecution->workerPool;
	DistributedExecution *execution = workerPool->distributedExecution;
	ShardCommandExecution *shardCommandExecution =
		placementExecution->shardCommandExecution;
	TaskExecutionState executionState = shardCommandExecution->executionState;
	bool failedPlacementExecutionIsOnPendingQueue = false;

	if (placementExecution->executionState == PLACEMENT_EXECUTION_FAILED)
	{
		/*
		 * We may mark placements as failed multiple times, but should only act
		 * the first time. Nor should we accept success after failure.
		 */
		return;
	}

	if (succeeded)
	{
		/* mark the placement execution as finished */
		placementExecution->executionState = PLACEMENT_EXECUTION_FINISHED;
	}
	else if (CanFailoverPlacementExecutionToLocalExecution(placementExecution))
	{
		/*
		 * The placement execution can be done over local execution, so it is a soft
		 * failure for now.
		 */
		placementExecution->executionState =
			PLACEMENT_EXECUTION_FAILOVER_TO_LOCAL_EXECUTION;
	}
	else
	{
		if (ShouldMarkPlacementsInvalidOnFailure(execution))
		{
			ShardPlacement *shardPlacement = placementExecution->shardPlacement;

			/*
			 * We only set shard state if it currently is SHARD_STATE_ACTIVE, which
			 * prevents overwriting shard state if it was already set somewhere else.
			 */
			if (shardPlacement->shardState == SHARD_STATE_ACTIVE)
			{
				MarkShardPlacementInactive(shardPlacement);
			}
		}

		if (placementExecution->executionState == PLACEMENT_EXECUTION_NOT_READY)
		{
			/*
			 * If the placement is in NOT_READY state, it means that the placement
			 * execution is assigned to the pending queue of a failed pool or
			 * session. So, we should not schedule the next placement execution based
			 * on this failure.
			 */
			failedPlacementExecutionIsOnPendingQueue = true;
		}

		placementExecution->executionState = PLACEMENT_EXECUTION_FAILED;
	}

	if (executionState != TASK_EXECUTION_NOT_FINISHED)
	{
		/*
		 * Task execution has already been finished, no need to continue with
		 * the next placement.
		 */
		return;
	}

	/*
	 * Update unfinishedTaskCount only when the state changes from not finished
	 * to finished or failed.
	 */
	TaskExecutionState newExecutionState =
		TaskExecutionStateMachine(shardCommandExecution);
	if (newExecutionState == TASK_EXECUTION_FINISHED)
	{
		execution->unfinishedTaskCount--;
		return;
	}
	else if (newExecutionState == TASK_EXECUTION_FAILOVER_TO_LOCAL_EXECUTION)
	{
		execution->unfinishedTaskCount--;

		/* move the task to the local execution */
		Task *task = shardCommandExecution->task;
		execution->localTaskList = lappend(execution->localTaskList, task);

		/* remove the task from the remote execution list */
		execution->remoteTaskList = list_delete_ptr(execution->remoteTaskList, task);

		/*
		 * As we decided to failover this task to local execution, we cannot
		 * allow remote execution to this pool during this distributedExecution.
		 */
		SetLocalExecutionStatus(LOCAL_EXECUTION_REQUIRED);
		workerPool->failureState = WORKER_POOL_FAILED_OVER_TO_LOCAL;

		ereport(DEBUG4, (errmsg("Task %d execution is failed over to local execution",
								task->taskId)));

		return;
	}
	else if (newExecutionState == TASK_EXECUTION_FAILED)
	{
		execution->unfinishedTaskCount--;

		/*
		 * Even if a single task execution fails, there is no way to
		 * successfully finish the execution.
		 */
		execution->failed = true;
		return;
	}
	else if (!failedPlacementExecutionIsOnPendingQueue)
	{
		ScheduleNextPlacementExecution(placementExecution, succeeded);
	}
}


/*
 * CanFailoverPlacementExecutionToLocalExecution returns true if the given
 * TaskPlacementExecution can be failed over to local execution. In other
 * words, the execution can be deferred to local execution.
 */
static bool
CanFailoverPlacementExecutionToLocalExecution(TaskPlacementExecution *placementExecution)
{
	if (!EnableLocalExecution)
	{
		/* the user explicitly disabled local execution */
		return false;
	}

	if (GetCurrentLocalExecutionStatus() == LOCAL_EXECUTION_DISABLED)
	{
		/*
		 * If the current transaction accessed the local node over a connection
		 * then we can't use local execution because of visibility issues.
		 */
		return false;
	}

	WorkerPool *workerPool = placementExecution->workerPool;
	if (!workerPool->poolToLocalNode)
	{
		/* we can only fail over tasks to local execution for local pools */
		return false;
	}

	if (workerPool->activeConnectionCount > 0)
	{
		/*
		 * The pool already has active connections and the executor is capable
		 * of using them, so there is no need to fail over to local execution.
		 */
		return false;
	}

	if (placementExecution->assignedSession != NULL)
	{
		/*
		 * If the placement execution has been assigned to a specific session,
		 * it has to be executed over that session. Otherwise, it would cause
		 * self-deadlocks and break read-your-own-writes consistency.
		 */
		return false;
	}

	return true;
}


/*
 * ScheduleNextPlacementExecution is triggered if the query needs to be
 * executed on any or all placements in order and there is a placement on
 * which the execution has not happened yet. If so, make that placement
 * ready-to-start by adding it to the appropriate queue.
 */
static void
ScheduleNextPlacementExecution(TaskPlacementExecution *placementExecution, bool succeeded)
{
	ShardCommandExecution *shardCommandExecution =
		placementExecution->shardCommandExecution;
	PlacementExecutionOrder executionOrder = shardCommandExecution->executionOrder;

	if ((executionOrder == EXECUTION_ORDER_ANY && !succeeded) ||
		executionOrder == EXECUTION_ORDER_SEQUENTIAL)
	{
		TaskPlacementExecution *nextPlacementExecution = NULL;
		int placementExecutionCount PG_USED_FOR_ASSERTS_ONLY =
			shardCommandExecution->placementExecutionCount;

		/* find a placement execution that is not yet marked as failed */
		do {
			int nextPlacementExecutionIndex =
				placementExecution->placementExecutionIndex + 1;

			/*
			 * If all tasks failed then we should already have errored out.
			 * Still, be defensive and throw an error instead of crashing.
			 */
			if (nextPlacementExecutionIndex >= placementExecutionCount)
			{
				WorkerPool *workerPool = placementExecution->workerPool;
				ereport(ERROR, (errmsg("execution cannot recover from multiple "
									   "connection failures. Last node failed "
									   "%s:%d", workerPool->nodeName,
									   workerPool->nodePort)));
			}

			/* get the next placement in the planning order */
			nextPlacementExecution =
				shardCommandExecution->placementExecutions[nextPlacementExecutionIndex];

			if (nextPlacementExecution->executionState == PLACEMENT_EXECUTION_NOT_READY)
			{
				/* move the placement execution to the ready queue */
				PlacementExecutionReady(nextPlacementExecution);
			}
		} while (nextPlacementExecution->executionState == PLACEMENT_EXECUTION_FAILED);
	}
}


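/*
 * Worked example (for illustration only): an INSERT into a reference table with
 * three placements uses EXECUTION_ORDER_SEQUENTIAL. Initially only placement 0
 * sits in a ready queue; placements 1 and 2 are pending. Once placement 0
 * finishes, PlacementExecutionDone calls ScheduleNextPlacementExecution, which
 * moves placement 1 to the ready queue via PlacementExecutionReady, and so on.
 * Under EXECUTION_ORDER_ANY (e.g. a SELECT on a replicated table), the next
 * placement is only scheduled when the previous one failed.
 */

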
/*
 * ShouldMarkPlacementsInvalidOnFailure returns true if the failure
 * should trigger marking placements invalid.
 */
static bool
ShouldMarkPlacementsInvalidOnFailure(DistributedExecution *execution)
{
	if (!DistributedExecutionModifiesDatabase(execution) ||
		execution->transactionProperties->errorOnAnyFailure)
	{
		/*
		 * Failures that do not modify the database (e.g., mainly SELECTs) should
		 * never lead to invalid placements.
		 *
		 * Failures that lead to throwing an error have no need to mark any
		 * placement invalid either.
		 */
		return false;
	}

	return true;
}


/*
 * PlacementExecutionReady adds a placement execution to the ready queue when
 * its dependent placement executions have finished.
 */
static void
PlacementExecutionReady(TaskPlacementExecution *placementExecution)
{
	WorkerPool *workerPool = placementExecution->workerPool;

	if (placementExecution->assignedSession != NULL)
	{
		WorkerSession *session = placementExecution->assignedSession;
		MultiConnection *connection = session->connection;
		RemoteTransaction *transaction = &(connection->remoteTransaction);
		RemoteTransactionState transactionState = transaction->transactionState;

		if (placementExecution->executionState == PLACEMENT_EXECUTION_NOT_READY)
		{
			/* remove from not-ready task queue */
			dlist_delete(&placementExecution->sessionPendingQueueNode);

			/* add to ready-to-start task queue */
			dlist_push_tail(&session->readyTaskQueue,
							&placementExecution->sessionReadyQueueNode);
		}

		if (transactionState == REMOTE_TRANS_NOT_STARTED ||
			transactionState == REMOTE_TRANS_STARTED)
		{
			/*
			 * If the connection is idle, wake it up by checking whether
			 * the connection is writeable.
			 */
			UpdateConnectionWaitFlags(session, WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
		}
	}
	else
	{
		if (placementExecution->executionState == PLACEMENT_EXECUTION_NOT_READY)
		{
			/* remove from not-ready task queue */
			dlist_delete(&placementExecution->workerPendingQueueNode);

			/* add to ready-to-start task queue */
			dlist_push_tail(&workerPool->readyTaskQueue,
							&placementExecution->workerReadyQueueNode);
		}

		workerPool->readyTaskCount++;

		/* wake up an idle connection by checking whether the connection is writeable */
		WorkerSession *session = NULL;
		foreach_ptr(session, workerPool->sessionList)
		{
			MultiConnection *connection = session->connection;
			RemoteTransaction *transaction = &(connection->remoteTransaction);
			RemoteTransactionState transactionState = transaction->transactionState;

			if (transactionState == REMOTE_TRANS_NOT_STARTED ||
				transactionState == REMOTE_TRANS_STARTED)
			{
				UpdateConnectionWaitFlags(session,
										  WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);

				break;
			}
		}
	}

	/* update the state to ready for further processing */
	placementExecution->executionState = PLACEMENT_EXECUTION_READY;
}


/*
 * TaskExecutionStateMachine returns whether a shard command execution
 * finished or failed according to its execution order. If the task is
 * already finished, simply return the state. Else, calculate the state
 * and return it.
 */
static TaskExecutionState
TaskExecutionStateMachine(ShardCommandExecution *shardCommandExecution)
{
	PlacementExecutionOrder executionOrder = shardCommandExecution->executionOrder;
	int donePlacementCount = 0;
	int failedPlacementCount = 0;
	int failedOverPlacementCount = 0;
	int placementCount = 0;
	int placementExecutionIndex = 0;
	int placementExecutionCount = shardCommandExecution->placementExecutionCount;
	TaskExecutionState currentTaskExecutionState = shardCommandExecution->executionState;

	if (currentTaskExecutionState != TASK_EXECUTION_NOT_FINISHED)
	{
		/* we've already calculated the state, simply return it */
		return currentTaskExecutionState;
	}

	for (; placementExecutionIndex < placementExecutionCount; placementExecutionIndex++)
	{
		TaskPlacementExecution *placementExecution =
			shardCommandExecution->placementExecutions[placementExecutionIndex];
		TaskPlacementExecutionState executionState = placementExecution->executionState;

		if (executionState == PLACEMENT_EXECUTION_FINISHED)
		{
			donePlacementCount++;
		}
		else if (executionState == PLACEMENT_EXECUTION_FAILED)
		{
			failedPlacementCount++;
		}
		else if (executionState == PLACEMENT_EXECUTION_FAILOVER_TO_LOCAL_EXECUTION)
		{
			failedOverPlacementCount++;
		}

		placementCount++;
	}

	if (failedPlacementCount == placementCount)
	{
		currentTaskExecutionState = TASK_EXECUTION_FAILED;
	}
	else if (executionOrder == EXECUTION_ORDER_ANY && donePlacementCount > 0)
	{
		currentTaskExecutionState = TASK_EXECUTION_FINISHED;
	}
	else if (donePlacementCount + failedPlacementCount == placementCount)
	{
		currentTaskExecutionState = TASK_EXECUTION_FINISHED;
	}
	else if (failedOverPlacementCount + donePlacementCount + failedPlacementCount ==
			 placementCount)
	{
		/*
		 * For any given task, we could have 3 end states:
		 * - "donePlacementCount" indicates the successful placement executions
		 * - "failedPlacementCount" indicates the failed placement executions
		 * - "failedOverPlacementCount" indicates the placement executions that
		 *   failed during remote execution due to connection errors, but may
		 *   still succeed via local execution. So, for now they are considered
		 *   soft errors.
		 */
		currentTaskExecutionState = TASK_EXECUTION_FAILOVER_TO_LOCAL_EXECUTION;
	}
	else
	{
		currentTaskExecutionState = TASK_EXECUTION_NOT_FINISHED;
	}

	shardCommandExecution->executionState = currentTaskExecutionState;

	return shardCommandExecution->executionState;
}


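/*
 * Worked example (for illustration only): a task over two placements with
 * EXECUTION_ORDER_ANY. If the first placement fails and the second finishes,
 * donePlacementCount becomes 1 and the task ends up TASK_EXECUTION_FINISHED.
 * Only when every placement fails does the task become TASK_EXECUTION_FAILED.
 */

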
/*
 * BuildWaitEventSet creates a WaitEventSet for the given array of connections
 * which can be used to wait for any of the sockets to become read-ready or
 * write-ready.
 */
static WaitEventSet *
BuildWaitEventSet(List *sessionList)
{
	/* additional 2 is for postmaster and latch */
	int eventSetSize = GetEventSetSize(sessionList);

	WaitEventSet *waitEventSet =
		CreateWaitEventSet(CurrentMemoryContext, eventSetSize);

	WorkerSession *session = NULL;
	foreach_ptr(session, sessionList)
	{
		MultiConnection *connection = session->connection;

		if (connection->pgConn == NULL)
		{
			/* connection died earlier in the transaction */
			continue;
		}

		if (connection->waitFlags == 0)
		{
			/* not currently waiting for this connection */
			continue;
		}

		int sock = PQsocket(connection->pgConn);
		if (sock == -1)
		{
			/* connection was closed */
			continue;
		}

		int waitEventSetIndex =
			CitusAddWaitEventSetToSet(waitEventSet, connection->waitFlags, sock,
									  NULL, (void *) session);
		session->waitEventSetIndex = waitEventSetIndex;
	}

	CitusAddWaitEventSetToSet(waitEventSet, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL,
							  NULL);
	CitusAddWaitEventSetToSet(waitEventSet, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch,
							  NULL);

	return waitEventSet;
}


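/*
 * Illustrative sketch (not compiled; timeoutMs and the surrounding loop are
 * hypothetical): how a wait event set built above is typically consumed. The
 * execution loop waits on the set and maps each returned event back to its
 * WorkerSession through user_data, remembering the readiness bits until
 * CheckConnectionReady consumes them.
 */
#if 0
	WaitEvent *events = palloc0(eventSetSize * sizeof(WaitEvent));
	int eventCount = WaitEventSetWait(waitEventSet, timeoutMs, events,
									  eventSetSize, WAIT_EVENT_CLIENT_READ);

	for (int eventIndex = 0; eventIndex < eventCount; eventIndex++)
	{
		WorkerSession *session = (WorkerSession *) events[eventIndex].user_data;
		if (session != NULL)
		{
			session->latestUnconsumedWaitEvents = events[eventIndex].events;
		}
	}
#endif

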
/*
 * CitusAddWaitEventSetToSet is a wrapper around Postgres' AddWaitEventToSet().
 *
 * AddWaitEventToSet() may throw hard errors. For example, when the
 * underlying socket for a connection is closed by the remote server
 * and already reflected by the OS, but Citus hasn't had a chance
 * to get this information. In that case, if the replication factor is >1,
 * Citus can failover to other nodes for executing the query. Even if the
 * replication factor = 1, Citus can give much nicer errors.
 *
 * So CitusAddWaitEventSetToSet simply puts AddWaitEventToSet into a
 * PG_TRY/PG_CATCH block in order to catch any hard errors, and
 * returns this information to the caller.
 */
static int
CitusAddWaitEventSetToSet(WaitEventSet *set, uint32 events, pgsocket fd,
						  Latch *latch, void *user_data)
{
	volatile int waitEventSetIndex = WAIT_EVENT_SET_INDEX_NOT_INITIALIZED;
	MemoryContext savedContext = CurrentMemoryContext;

	PG_TRY();
	{
		waitEventSetIndex =
			AddWaitEventToSet(set, events, fd, latch, (void *) user_data);
	}
	PG_CATCH();
	{
		/*
		 * We might be in an arbitrary memory context when the
		 * error is thrown and we should get back to one we had
		 * at PG_TRY() time, especially because we are not
		 * re-throwing the error.
		 */
		MemoryContextSwitchTo(savedContext);

		FlushErrorState();

		if (user_data != NULL)
		{
			WorkerSession *workerSession = (WorkerSession *) user_data;

			ereport(DEBUG1, (errcode(ERRCODE_CONNECTION_FAILURE),
							 errmsg("Adding wait event for node %s:%d failed. "
									"The socket was: %d",
									workerSession->workerPool->nodeName,
									workerSession->workerPool->nodePort, fd)));
		}

		/* let the callers know about the failure */
		waitEventSetIndex = WAIT_EVENT_SET_INDEX_FAILED;
	}
	PG_END_TRY();

	return waitEventSetIndex;
}


/*
 * GetEventSetSize returns the event set size for a list of sessions.
 */
static int
GetEventSetSize(List *sessionList)
{
	/* additional 2 is for postmaster and latch */
	return list_length(sessionList) + 2;
}


/*
 * RebuildWaitEventSetFlags modifies the given waitEventSet with the wait flags
 * for connections in the sessionList.
 */
static void
RebuildWaitEventSetFlags(WaitEventSet *waitEventSet, List *sessionList)
{
	WorkerSession *session = NULL;
	foreach_ptr(session, sessionList)
	{
		MultiConnection *connection = session->connection;
		int waitEventSetIndex = session->waitEventSetIndex;

		if (connection->pgConn == NULL)
		{
			/* connection died earlier in the transaction */
			continue;
		}

		if (connection->waitFlags == 0)
		{
			/* not currently waiting for this connection */
			continue;
		}

		int sock = PQsocket(connection->pgConn);
		if (sock == -1)
		{
			/* connection was closed */
			continue;
		}

		bool success =
			CitusModifyWaitEvent(waitEventSet, waitEventSetIndex,
								 connection->waitFlags, NULL);
		if (!success)
		{
			ereport(DEBUG1, (errcode(ERRCODE_CONNECTION_FAILURE),
							 errmsg("Modifying wait event for node %s:%d failed. "
									"The wait event index was: %d",
									connection->hostname, connection->port,
									waitEventSetIndex)));

			session->waitEventSetIndex = WAIT_EVENT_SET_INDEX_FAILED;
		}
	}
}


/*
 * CitusModifyWaitEvent is a wrapper around Postgres' ModifyWaitEvent().
 *
 * ModifyWaitEvent may throw hard errors. For example, when the underlying
 * socket for a connection is closed by the remote server and already
 * reflected by the OS, but Citus hasn't had a chance to get this
 * information. In that case, if the replication factor is >1, Citus can
 * failover to other nodes for executing the query. Even if the replication
 * factor = 1, Citus can give much nicer errors.
 *
 * So CitusModifyWaitEvent simply puts ModifyWaitEvent into a PG_TRY/PG_CATCH
 * block in order to catch any hard errors, and returns this information to the
 * caller.
 */
static bool
CitusModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch)
{
	volatile bool success = true;
	MemoryContext savedContext = CurrentMemoryContext;

	PG_TRY();
	{
		ModifyWaitEvent(set, pos, events, latch);
	}
	PG_CATCH();
	{
		/*
		 * We might be in an arbitrary memory context when the
		 * error is thrown and we should get back to one we had
		 * at PG_TRY() time, especially because we are not
		 * re-throwing the error.
		 */
		MemoryContextSwitchTo(savedContext);

		FlushErrorState();

		/* let the callers know about the failure */
		success = false;
	}
	PG_END_TRY();

	return success;
}


/*
 * SetLocalForceMaxQueryParallelization is simply a C interface for setting
 * the following:
 *      SET LOCAL citus.force_max_query_parallelization TO on;
 */
void
SetLocalForceMaxQueryParallelization(void)
{
	set_config_option("citus.force_max_query_parallelization", "on",
					  (superuser() ? PGC_SUSET : PGC_USERSET), PGC_S_SESSION,
					  GUC_ACTION_LOCAL, true, 0, false);
}


/*
 * ExtractParametersForRemoteExecution extracts parameter types and values from
 * the given ParamListInfo structure, and fills the parameter type and value
 * arrays. It changes the oid of custom types to InvalidOid because the oids of
 * custom types may differ between the workers and the coordinator.
 */
static void
ExtractParametersForRemoteExecution(ParamListInfo paramListInfo, Oid **parameterTypes,
									const char ***parameterValues)
{
	ExtractParametersFromParamList(paramListInfo, parameterTypes,
								   parameterValues, false);
}


/*
 * ExtractParametersFromParamList extracts parameter types and values from
 * the given ParamListInfo structure, and fills the parameter type and value
 * arrays. If useOriginalCustomTypeOids is true, it uses the original oids
 * for custom types.
 */
void
ExtractParametersFromParamList(ParamListInfo paramListInfo,
							   Oid **parameterTypes,
							   const char ***parameterValues, bool
							   useOriginalCustomTypeOids)
{
	int parameterCount = paramListInfo->numParams;

	*parameterTypes = (Oid *) palloc0(parameterCount * sizeof(Oid));
	*parameterValues = (const char **) palloc0(parameterCount * sizeof(char *));

	/* get parameter types and values */
	for (int parameterIndex = 0; parameterIndex < parameterCount; parameterIndex++)
	{
		ParamExternData *parameterData = &paramListInfo->params[parameterIndex];
		Oid typeOutputFunctionId = InvalidOid;
		bool variableLengthType = false;

		/*
		 * Use 0 for data types where the oid values can be different on
		 * the coordinator and worker nodes, so that the worker nodes can
		 * infer the correct oid.
		 */
		if (parameterData->ptype >= FirstNormalObjectId && !useOriginalCustomTypeOids)
		{
			(*parameterTypes)[parameterIndex] = 0;
		}
		else
		{
			(*parameterTypes)[parameterIndex] = parameterData->ptype;
		}

		/*
		 * If the parameter is not referenced / used (ptype == 0), and it
		 * would otherwise have errored out inside standard_planner(),
		 * don't pass a value to the remote side, and pass the text oid to
		 * prevent undetermined data type errors on workers.
		 */
		if (parameterData->ptype == 0)
		{
			(*parameterValues)[parameterIndex] = NULL;
			(*parameterTypes)[parameterIndex] = TEXTOID;

			continue;
		}

		/*
		 * If the parameter is NULL then we preserve its type, but
		 * don't need to evaluate its value.
		 */
		if (parameterData->isnull)
		{
			(*parameterValues)[parameterIndex] = NULL;

			continue;
		}

		getTypeOutputInfo(parameterData->ptype, &typeOutputFunctionId,
						  &variableLengthType);

		(*parameterValues)[parameterIndex] = OidOutputFunctionCall(typeOutputFunctionId,
																   parameterData->value);
	}
}
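

/*
 * Worked example (for illustration only, with hypothetical parameters): given
 * a ParamListInfo carrying $1 = 42::int4 and $2 = a value of a user-defined
 * type, ExtractParametersForRemoteExecution produces
 *   parameterTypes[0] = INT4OID,  parameterValues[0] = "42"
 *   parameterTypes[1] = 0,        parameterValues[1] = <output-function text>
 * so that the worker resolves the custom type's oid itself. A parameter with
 * ptype == 0 is sent as TEXTOID with a NULL value, and a NULL parameter keeps
 * its type with a NULL value.
 */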