citus

Commit Graph

Author	SHA1	Message	Date
Muhammad Usama	f743b35fc2	Parallelize Shard Rebalancing & Unlock Concurrent Logical Shard Moves (#7983)
DESCRIPTION: Parallelizes shard rebalancing and removes the bottlenecks that previously blocked concurrent logical-replication moves. These improvements reduce rebalance windows, particularly for clusters with large reference tables, and enable multiple shard transfers to run in parallel.
Motivation: Citus’ shard rebalancer has some key performance bottlenecks:
- Sequential Movement of Reference Tables: Reference tables are often assumed to be small, but in real-world deployments they can grow very large. Previously, reference table shards were transferred as a single unit, making the process monolithic and time-consuming.
- No Parallelism Within a Colocation Group: Although Citus distributes data using colocated shards, shard movements within the same colocation group were serialized. In environments with hundreds of distributed tables colocated together, this serialization significantly slowed down rebalance operations.
- Excessive Locking: The rebalancer used restrictive locks and redundant logical replication guards, further limiting concurrency.
The goal of this commit is to eliminate these inefficiencies and enable maximum parallelism during rebalance, without compromising correctness or compatibility. Parallelize shard rebalancing to reduce rebalance time.
Feature Summary:
1. Parallel Reference Table Rebalancing
Each reference-table shard is now copied in its own background task. Foreign key and other constraints are deferred until all shards are copied. For single-shard movement without considering colocation, a new internal-only UDF `citus_internal_copy_single_shard_placement` is introduced to allow single-shard copy/move operations. Since this function is internal, users are not allowed to call it directly.
Temporary Hack to Set Background Task Context
Background tasks cannot currently set custom GUCs like application_name before executing internal-only functions. As a workaround, a 'citus_rebalancer ...' statement is added as a prefix in the task command. This is a temporary hack to label internal tasks until proper GUC injection support is added to the background task executor.
2. Changes in Locking Strategy
- Drop the leftover replication lock that previously serialized shard moves performed via logical replication. This lock was only needed when we used to drop and recreate the subscriptions/publications before each move. Since Citus now removes those objects later as part of the “unused distributed objects” cleanup, shard moves via logical replication can safely run in parallel without additional locking.
- Introduce a per-shard advisory lock to prevent concurrent operations on the same shard while allowing maximum parallelism elsewhere.
- Change the lock mode in AcquirePlacementColocationLock from ExclusiveLock to RowExclusiveLock to allow concurrent updates within the same colocation group, while still preventing concurrent DDL operations.
3. citus_rebalance_start() enhancements
The citus_rebalance_start() function now accepts two new optional parameters:
```
- parallel_transfer_colocated_shards BOOLEAN DEFAULT false,
- parallel_transfer_reference_tables BOOLEAN DEFAULT false
```
Both default to false, which preserves the existing behavior for backward compatibility. When both are set to true, the rebalancer operates with full parallelism.
Previous Rebalancer Behavior:
`SELECT citus_rebalance_start(shard_transfer_mode := 'force_logical');`
This would:
- Start a single background task for replicating all reference tables.
- Then, move all shards serially, one at a time.
```
Task 1: replicate_reference_tables()
  ↓
Task 2: move_shard_1()
  ↓
Task 3: move_shard_2()
  ↓
Task 4: move_shard_3()
```
Slow and sequential. Reference table copy is a bottleneck.
Colocated shards must wait for each other.
New Parallel Rebalancer:
```
SELECT citus_rebalance_start(
    shard_transfer_mode := 'force_logical',
    parallel_transfer_colocated_shards := true,
    parallel_transfer_reference_tables := true
);
```
This would:
- Schedule independent background tasks for each reference-table shard.
- Move colocated shards in parallel, while still maintaining dependency order.
- Defer constraint application until all reference shards are in place.
```
Task 1: copy_ref_shard_1()
Task 2: copy_ref_shard_2()
Task 3: copy_ref_shard_3()
  → Task 4: apply_constraints()
  ↓
Task 5: copy_shard_1()
Task 6: copy_shard_2()
Task 7: copy_shard_3()
  ↓
Task 8-10: move_shard_1..3()
```
Each operation is scheduled independently and can run as soon as dependencies are satisfied.	2025-08-18 17:44:14 +03:00
Naisila Puka	84f2d8685a	Adds control for background task executors involving a node (#6771)
DESCRIPTION: Adds control for background task executors involving a node
### Background and motivation
Nonblocking concurrent task execution via background workers was introduced in [#6459](https://github.com/citusdata/citus/pull/6459), and concurrent shard moves in the background rebalancer were introduced in [#6756](https://github.com/citusdata/citus/pull/6756), with a hard dependency that limits the rebalancer to 1 shard move per node. A shard move consists of a shard moving from a source node to a target node. The hard dependency was used because the background task runner didn't have an option to limit the parallel shard moves per node. To control the number of concurrent shard moves that involve a particular node, either as source or target, this PR introduces a general new GUC citus.max_background_task_executors_per_node to be used in the background task runner infrastructure.
So, why do we even want to control and limit the concurrency? It's all about resource availability: because the moves involve the same nodes, extra parallelism won’t make the rebalance complete faster if some resource is already maxed out (usually CPU or disk). Or, if the cluster is being used in a production setting, the moves might compete for resources with production queries much more than if they had been executed sequentially.
### How does it work?
A new column named nodes_involved is added to pg_dist_background_task, the catalog table that keeps track of the scheduled background tasks. It is of type integer[], to store a list of node ids. It is NULL by default: the column is filled by the rebalancer, but other users of the background task runner may not care about the nodes involved.
Table "pg_catalog.pg_dist_background_task"
     Column     |           Type
============================================
 job_id         | bigint
 task_id        | bigint
 owner          | regrole
 pid            | integer
 status         | citus_task_status
 command        | text
 retry_count    | integer
 not_before     | timestamp with time zone
 message        | text
+nodes_involved | integer[]
A hashtable named ParallelTasksPerNode keeps track of the number of parallel running background tasks per node. An entry in the hashtable is as follows:
ParallelTasksPerNodeEntry
{
    node_id // The node is used as the hash table key
    counter // Number of concurrent background tasks that involve node node_id
            // The counter limit is citus.max_background_task_executors_per_node
}
When the background task runner assigns a runnable task to a new executor, it increments the counter for each of the nodes involved with that runnable task. The limit of each counter is citus.max_background_task_executors_per_node. If the limit is reached for any of the nodes involved, this runnable task is skipped. Later, when the running task finishes, the background task runner decrements the counter for each of the nodes involved with the finished task. The following functions take care of these increment-decrement steps:
IncrementParallelTaskCountForNodesInvolved(task)
DecrementParallelTaskCountForNodesInvolved(task)
citus.max_background_task_executors_per_node can be changed on the fly. In the background rebalancer, we simply give {source_node, target_node} as the nodesInvolved input to the ScheduleBackgroundTask function. The rest is taken care of by the general background task runner infrastructure explained above. Check the background_task_queue_monitor.sql and background_rebalance_parallel.sql tests for detailed examples.
#### Note
This PR also adds a hard node dependency if a node is first being used as a source for a move, and then later as a target. The reason this should be a hard dependency is that the first move might make space for the second move.
So, we could run out of disk space (or at least overload the node) if we move the second shard to it before the first one is moved away. Fixes https://github.com/citusdata/citus/issues/6716	2023-04-06 14:12:39 +03:00
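The per-node limiting described in commit 84f2d8685a above can be sketched roughly as follows. This is a minimal Python illustration and not the Citus C implementation: `MAX_EXECUTORS_PER_NODE`, `try_start`, and `finish` are hypothetical names standing in for the GUC and for the increment/decrement functions named in the message, and treating a NULL `nodes_involved` as "no limit applies" is an assumption.

```python
from collections import defaultdict

# Hypothetical stand-in for citus.max_background_task_executors_per_node.
MAX_EXECUTORS_PER_NODE = 1

# node_id -> number of running background tasks that involve that node.
parallel_tasks_per_node = defaultdict(int)


def try_start(nodes_involved):
    """Start a task only if every involved node is below the limit."""
    nodes = nodes_involved or []  # assumption: NULL means no per-node limit
    if any(parallel_tasks_per_node[n] >= MAX_EXECUTORS_PER_NODE for n in nodes):
        return False  # limit reached on some node: skip this task for now
    for n in nodes:
        parallel_tasks_per_node[n] += 1
    return True


def finish(nodes_involved):
    """Decrement the counters once a running task is done."""
    for n in nodes_involved or []:
        parallel_tasks_per_node[n] -= 1


# Two moves share node 2, so the second is skipped until the first finishes.
first = try_start([1, 2])
second = try_start([2, 3])
third = try_start([4, 5])
finish([1, 2])
retry = try_start([2, 3])
print(first, second, third, retry)
```

A skipped task changes no counters, so it can simply be retried on a later pass over the runnable tasks.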
aykut-bozkurt	1ad1a0a336	add citus_task_wait udf to wait on desired task status (#6475) We already have citus_job_wait to wait until the job reaches the desired state. This PR adds waiting on task state to allow more granular waiting. It can be used for Citus operations, and it is also useful for testing (waiting until a task reaches a specified state). Related to #6459.	2022-12-12 22:41:03 +03:00
aykut-bozkurt	65f256eec4	add SIGTERM handler to gracefully terminate task executors (#6473)
Adds signal handlers for graceful termination and cancellation of task executors and for detecting config updates. Related to PR #6459.
#### How to handle termination signal?
The monitor needs to gracefully terminate all running task executors before terminating. Hence, the monitor has a SIGTERM handler.
#### How to handle cancellation signal?
The monitor needs to gracefully cancel all running task executors before terminating. Hence, the monitor has a SIGINT handler.
#### How to detect configuration changes?
The monitor has a SIGHUP handler to reflect configuration changes while executing tasks.	2022-12-02 18:15:31 +03:00
aykut-bozkurt	1f8675da43	nonblocking concurrent task execution via background workers (#6459) Improvement on our background task monitoring API (PR #6296) to support concurrent and nonblocking task execution. Mainly, we have a queue monitor background process which forks task executors for `Runnable` tasks and then monitors their status by fetching messages from a shared memory queue in a nonblocking way.	2022-11-30 14:29:46 +03:00
Nils Dijk	00a94c7f13	Implement infrastructure to run sql jobs in the background (#6296) DESCRIPTION: Add infrastructure to run long-running management operations in the background. This infrastructure introduces the primitives of jobs and tasks. A task consists of a SQL statement and an owner. Tasks belong to a job and can depend on other tasks from the same job. When there are either runnable or running tasks, we would like to make sure a background task queue monitor process is running. A task could be in the running state while there is actually no monitor present, due to a database restart or failover. Once the monitor starts, it will reset any running task to its runnable state. To make sure only one background task queue monitor is ever running at once, it will acquire an advisory lock that self-conflicts. Once a task is done, the monitor will find all tasks depending on this task. After checking that a dependent task doesn't have unmet dependencies, it will transition that task from the blocked to the runnable state, so that it can be picked up on a subsequent task start. Currently only one task can be running at a time. This can be improved upon in later releases without changes to the higher-level API. The initial goal for these background tasks is to allow a rebalance to run in the background. This will be implemented in a subsequent PR.	2022-09-09 16:11:19 +03:00
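The blocked-to-runnable transition described in commit 00a94c7f13 above can be sketched as follows. This is a simplified Python model under stated assumptions: the `status` and `depends_on` dictionaries and the `mark_done` helper are hypothetical, and the real state lives in Citus catalog tables rather than in memory.

```python
# Hypothetical in-memory model of three tasks in one job.
status = {1: "runnable", 2: "blocked", 3: "blocked"}
depends_on = {2: {1}, 3: {1, 2}}  # task_id -> tasks it waits for


def mark_done(task_id):
    """Mark a task done and unblock dependents with no unmet dependencies."""
    status[task_id] = "done"
    unblocked = []
    for dependent, deps in sorted(depends_on.items()):
        # Only look at blocked tasks that depend on the finished task.
        if task_id not in deps or status[dependent] != "blocked":
            continue
        # Transition only when every dependency has finished.
        if all(status[d] == "done" for d in deps):
            status[dependent] = "runnable"
            unblocked.append(dependent)
    return unblocked


after_first = mark_done(1)   # task 2 becomes runnable; task 3 still waits on 2
after_second = mark_done(2)  # now task 3 becomes runnable
print(after_first, after_second)
```

Checking dependencies only when a task finishes means a blocked task is re-examined exactly when one of its prerequisites changes state.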

6 Commits (f743b35fc27f3386e0ef99dbaaf1431880c46415)