citus

Commit Graph

Author	SHA1	Message	Date
Muhammad Usama	be6668e440	Snapshot-Based Node Split – Foundation and Core Implementation (#8122)
DESCRIPTION: This pull request introduces the foundation and core logic for the snapshot-based node split feature in Citus. This feature enables promoting a streaming replica (referred to as a clone in this feature and UI) to a primary node and rebalancing shards between the original and the newly promoted node without requiring a full data copy. This significantly reduces rebalance times for scale-out operations where the new node already contains a full copy of the data via streaming replication.
Key Highlights:
1. Replica (Clone) Registration & Management Infrastructure
Introduces a new set of UDFs to register and manage clone nodes:
- citus_add_clone_node()
- citus_add_clone_node_with_nodeid()
- citus_remove_clone_node()
- citus_remove_clone_node_with_nodeid()
These functions allow administrators to register a streaming replica of an existing worker node as a clone, making it eligible for later promotion via snapshot-based split.
2. Snapshot-Based Node Split (Core Implementation)
New core UDF:
- citus_promote_clone_and_rebalance()
This function implements the full workflow to promote a clone and rebalance shards between the old and new primaries. Steps include:
   1. Ensuring Exclusivity – Blocks any concurrent placement-changing operations.
   2. Blocking Writes – Temporarily blocks writes on the primary to ensure consistency.
   3. Replica Catch-up – Waits for the replica to be fully in sync.
   4. Promotion – Promotes the replica to a primary using pg_promote.
   5. Metadata Update – Updates metadata to reflect the newly promoted primary node.
   6. Shard Rebalancing – Redistributes shards between the old and new primary nodes.
3. Split Plan Preview
A new helper UDF get_snapshot_based_node_split_plan() provides a preview of the shard distribution post-split, without executing the promotion.
Example:
```
reb 63796> select * from pg_catalog.get_snapshot_based_node_split_plan('127.0.0.1',5433,'127.0.0.1',5453);
  table_name  | shardid | shard_size | placement_node
--------------+---------+------------+----------------
 companies    |  102008 |          0 | Primary Node
 campaigns    |  102010 |          0 | Primary Node
 ads          |  102012 |          0 | Primary Node
 mscompanies  |  102014 |          0 | Primary Node
 mscampaigns  |  102016 |          0 | Primary Node
 msads        |  102018 |          0 | Primary Node
 mscompanies2 |  102020 |          0 | Primary Node
 mscampaigns2 |  102022 |          0 | Primary Node
 msads2       |  102024 |          0 | Primary Node
 companies    |  102009 |          0 | Clone Node
 campaigns    |  102011 |          0 | Clone Node
 ads          |  102013 |          0 | Clone Node
 mscompanies  |  102015 |          0 | Clone Node
 mscampaigns  |  102017 |          0 | Clone Node
 msads        |  102019 |          0 | Clone Node
 mscompanies2 |  102021 |          0 | Clone Node
 mscampaigns2 |  102023 |          0 | Clone Node
 msads2       |  102025 |          0 | Clone Node
(18 rows)
```
4. Test Infrastructure Enhancements
- Added a new test case scheduler for snapshot-based split scenarios.
- Enhanced pg_regress_multi.pl to support creating node backups with slightly modified options to simulate real-world backup-based clone creation.
### 5. Usage Guide
The snapshot-based node split can be performed using the following workflow:
- Take a Backup of the Worker Node: Run pg_basebackup (or an equivalent tool) against the existing worker node to create a physical backup. `pg_basebackup -h <primary_worker_host> -p <port> -D /path/to/replica/data --write-recovery-conf`
- Start the Replica Node: Start PostgreSQL on the replica using the backup data directory, ensuring it is configured as a streaming replica of the original worker node.
- Register the Backup Node as a Clone: Mark the registered replica as a clone of its original worker node: `SELECT * FROM citus_add_clone_node('<clone_host>', <clone_port>, '<primary_host>', <primary_port>);`
- Promote and Rebalance the Clone: Promote the clone to a primary and rebalance shards between it and the original worker: `SELECT * FROM citus_promote_clone_and_rebalance('clone_node_id');`
- Drop Any Replication Slots from the Original Worker: After promotion, clean up any unused replication slots from the original worker: `SELECT pg_drop_replication_slot('<slot_name>');`	2025-08-19 14:13:55 +03:00
Muhammad Usama	f743b35fc2	Parallelize Shard Rebalancing & Unlock Concurrent Logical Shard Moves (#7983)
DESCRIPTION: Parallelizes shard rebalancing and removes the bottlenecks that previously blocked concurrent logical-replication moves. These improvements reduce rebalance windows, particularly for clusters with large reference tables, and enable multiple shard transfers to run in parallel.
Motivation: Citus’ shard rebalancer has some key performance bottlenecks:
- Sequential Movement of Reference Tables: Reference tables are often assumed to be small, but in real-world deployments they can grow significantly large. Previously, reference table shards were transferred as a single unit, making the process monolithic and time-consuming.
- No Parallelism Within a Colocation Group: Although Citus distributes data using colocated shards, shard movements within the same colocation group were serialized. In environments with hundreds of distributed tables colocated together, this serialization significantly slowed down rebalance operations.
- Excessive Locking: The rebalancer used restrictive locks and redundant logical replication guards, further limiting concurrency.
The goal of this commit is to eliminate these inefficiencies and enable maximum parallelism during rebalance, without compromising correctness or compatibility.
Parallelize shard rebalancing to reduce rebalance time.
Feature Summary:
1. Parallel Reference Table Rebalancing
Each reference-table shard is now copied in its own background task. Foreign key and other constraints are deferred until all shards are copied. For single-shard movement without considering colocation, a new internal-only UDF '`citus_internal_copy_single_shard_placement`' is introduced to allow single-shard copy/move operations. Since this function is internal, we do not allow users to call it directly.
Temporary Hack to Set Background Task Context
Background tasks cannot currently set custom GUCs like application_name before executing internal-only functions. As a workaround, a 'citus_rebalancer ...' statement is added as a prefix in the task command. This is a temporary hack to label internal tasks until proper GUC injection support is added to the background task executor.
2. Changes in Locking Strategy
- Drop the leftover replication lock that previously serialized shard moves performed via logical replication. This lock was only needed when we used to drop and recreate the subscriptions/publications before each move. Since Citus now removes those objects later as part of the “unused distributed objects” cleanup, shard moves via logical replication can safely run in parallel without additional locking.
- Introduced a per-shard advisory lock to prevent concurrent operations on the same shard while allowing maximum parallelism elsewhere.
- Change the lock mode in AcquirePlacementColocationLock from ExclusiveLock to RowExclusiveLock to allow concurrent updates within the same colocation group, while still preventing concurrent DDL operations.
3. citus_rebalance_start() enhancements
The citus_rebalance_start() function now accepts two new optional parameters:
```
- parallel_transfer_colocated_shards BOOLEAN DEFAULT false,
- parallel_transfer_reference_tables BOOLEAN DEFAULT false
```
Both default to false, which preserves the existing behavior for backward compatibility and avoids any disruption to user expectations. When both are set to true, the rebalancer operates with full parallelism.
Previous Rebalancer Behavior:
`SELECT citus_rebalance_start(shard_transfer_mode := 'force_logical');`
This would start a single background task for replicating all reference tables, then move all shards serially, one at a time.
```
Task 1: replicate_reference_tables()
↓
Task 2: move_shard_1()
↓
Task 3: move_shard_2()
↓
Task 4: move_shard_3()
```
Slow and sequential. Reference table copy is a bottleneck.
Colocated shards must wait for each other.
New Parallel Rebalancer:
```
SELECT citus_rebalance_start(
    shard_transfer_mode := 'force_logical',
    parallel_transfer_colocated_shards := true,
    parallel_transfer_reference_tables := true
);
```
This would:
- Schedule independent background tasks for each reference-table shard.
- Move colocated shards in parallel, while still maintaining dependency order.
- Defer constraint application until all reference shards are in place.
```
Task 1: copy_ref_shard_1()
Task 2: copy_ref_shard_2()
Task 3: copy_ref_shard_3()
    → Task 4: apply_constraints()
↓
Task 5: copy_shard_1()
Task 6: copy_shard_2()
Task 7: copy_shard_3()
↓
Task 8-10: move_shard_1..3()
```
Each operation is scheduled independently and can run as soon as dependencies are satisfied.	2025-08-18 17:44:14 +03:00
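The "run as soon as dependencies are satisfied" behaviour described in f743b35fc2 can be sketched as a small dependency graph. This is an illustration only, not the Citus background task executor: the `schedule_waves` helper is hypothetical, and the dependency edges are an assumption read off the task diagram in the commit message (each `move_shard_N` is assumed to wait only for its own `copy_shard_N`).

```python
from graphlib import TopologicalSorter


def schedule_waves(deps):
    """Group tasks into waves; every task in a wave has all its dependencies done.

    `deps` maps a task name to the set of tasks it must wait for.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run in parallel now
        waves.append(ready)
        sorter.done(*ready)                 # mark them finished, unblocking dependents
    return waves


# Assumed edges, mirroring the diagram in the commit message.
ref_copies = {f"copy_ref_shard_{i}" for i in (1, 2, 3)}
deps = {"apply_constraints": ref_copies}
for i in (1, 2, 3):
    deps[f"copy_shard_{i}"] = {"apply_constraints"}
    deps[f"move_shard_{i}"] = {f"copy_shard_{i}"}

waves = schedule_waves(deps)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Under these assumed edges the ten tasks collapse into four waves, whereas the previous behaviour ran every task strictly one after another.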
Onur Tirtir	87a1b631e8	Not automatically create citus_columnar when creating citus extension (#8081)
DESCRIPTION: Not automatically create citus_columnar when there are no relations using it.
Previously, we always created citus_columnar when creating citus with version >= 11.1. We did that as follows:
* Detach SQL objects owned by old columnar, i.e., "drop" them from citus, but not actually drop them from the database
  * "old columnar" is the one that we had before Citus 11.1 as part of citus, i.e., before splitting the access method and its catalog to citus_columnar.
* Create citus_columnar and attach the SQL objects left over from old columnar to it so that we can continue supporting the columnar tables that the user had before Citus 11.1 with citus_columnar.
The first part is unchanged. However, we no longer create citus_columnar automatically if the user doesn't have any relations using columnar. For this reason, as of Citus 13.2, when these SQL objects are not owned by an extension and there are no relations using the columnar access method, we drop these SQL objects when updating Citus to 13.2. The net effect is the same as if we had automatically created citus_columnar and the user had dropped citus_columnar later, so we should not have any issues with dropping them. (Update: It seems we've made some assumptions in citus, e.g., citus_finish_pg_upgrade() still assumes columnar metadata exists and tries to apply some fixes for it, so this PR fixes them as well. See the last section of this PR description.)
Also, ideally I was hoping to just remove some lines of code from extension.c, where we decide whether to automatically create citus_columnar when creating citus. However, this did not turn out to be possible, for two reasons:
* We still need to automatically create it for the servers using the columnar access method.
* We need to clean up the leftover SQL objects from old columnar when the above is not the case; otherwise we would have leftover SQL objects from old columnar for no reason, and that would confuse users too.
  * Old columnar cannot be used to create columnar tables properly, so we should clean them up and let the user decide whether they want to create citus_columnar when they really need it later.
---
Also made several changes in the test suite because, similarly, we don't always want to have citus_columnar created in citus tests anymore:
* Now, columnar-specific test targets, which cover 41 test sql files, always install columnar by default, by using "--load-extension=citus_columnar".
* "--load-extension=citus_columnar" is not added to citus-specific test targets because by default we don't want to have citus_columnar created during citus tests.
* Excluding citus_columnar-specific tests, we have 601 sql files as citus tests, and in 27 of them we manually create citus_columnar at the very beginning of the test because these tests exercise some functionality of citus together with columnar tables.
Also, the before and after schedules for PG upgrade tests are now duplicated so we have two versions of each: one with columnar tests and one without. To choose between them, check-pg-upgrade now supports a "test-with-columnar" option, which can be set to "true" or anything else to logically indicate "false". In CI, we run the check-pg-upgrade test target with both options. The purpose is to ensure we can also test PG upgrades where citus_columnar is not created in the cluster before the upgrade.
Finally, added more tests to multi_extension.sql to test Citus upgrade scenarios with / without columnar tables / citus_columnar extension.
---
Also, it seems citus_finish_pg_upgrade was assuming that citus_columnar is always created, but we should never have made such an assumption.
To fix that, moved columnar-specific post-PG-upgrade work from citus to a new columnar UDF, columnar_finish_pg_upgrade. But to avoid breaking existing customer / managed service scripts, we continue to automatically perform post-PG-upgrade work for columnar within citus_finish_pg_upgrade, but now only if the columnar access method exists.	2025-08-18 08:29:27 +01:00
Teja Mupparti	889aa92ac0	EXPLAIN ANALYZE - Prevent execution of the plan during the plan-print (#8017)
DESCRIPTION: Fixed a bug in EXPLAIN ANALYZE to prevent unintended (duplicate) execution of the (sub)plans during the explain phase. Fixes #4212
### 🐞 Bug #4212 : Redundant (Subplan) Execution in `EXPLAIN ANALYZE` codepath
#### 🔍 Background
In the standard PostgreSQL execution path, `ExplainOnePlan()` is responsible for two distinct operations depending on whether `EXPLAIN ANALYZE` is requested:
1. Execute the plan
```c
if (es->analyze)
    ExecutorRun(queryDesc, direction, 0L, true);
```
2. Print the plan tree
```c
ExplainPrintPlan(es, queryDesc);
```
When printing the plan, the executor should not run the plan again. Execution is only expected to happen once, at the top level when `es->analyze = true`.
---
#### ⚠️ Issue in Citus
The Citus implementation of `CustomScanMethods.ExplainCustomScan = CitusExplainScan`, a custom scan explain callback function used to print explain information of a Citus plan, incorrectly performs redundant execution inside the explain path of `ExplainPrintPlan()`:
```c
ExplainOnePlan()
  ExplainPrintPlan()
    ExplainNode()
      CitusExplainScan()
        if (distributedPlan->subPlanList != NIL)
        {
          ExplainSubPlans(distributedPlan, es);
          {
            PlannedStmt *plan = subPlan->plan;
            ExplainOnePlan(plan, ...); // ⚠️ May re-execute subplan if es->analyze is true
          }
        }
```
This causes the subplans to be executed again, even though they have already been executed during the top-level plan execution. This behavior violates the expectation in PostgreSQL that `EXPLAIN ANALYZE` should execute each node exactly once for analysis.
---
#### ✅ Fix (proposed)
Save the output of Subplans during `ExecuteSubPlans()`, and later use it in `ExplainSubPlans()`.	2025-07-30 11:29:50 -07:00
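The fix proposed in 889aa92ac0 can be illustrated with a toy model. The sketch below is not Citus code: the `SubPlan` class and the `execute_subplans` / `explain_subplans` helpers are hypothetical stand-ins for `ExecuteSubPlans()` and `ExplainSubPlans()`. It only shows the idea of recording each subplan's output during execution and reading it back in the print phase, so that every subplan runs exactly once.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SubPlan:
    """Hypothetical stand-in for a distributed subplan."""
    name: str
    run: Callable[[], int]            # returns the number of rows produced
    executions: int = 0               # how many times the subplan actually ran
    saved_rows: Optional[int] = None  # output recorded during execution


def execute_subplans(subplans: List[SubPlan]) -> None:
    """Run every subplan once and save what the explain phase will need later."""
    for sp in subplans:
        sp.executions += 1
        sp.saved_rows = sp.run()


def explain_subplans(subplans: List[SubPlan]) -> List[str]:
    """Print phase only: read the saved output, never call sp.run() again."""
    lines = []
    for sp in subplans:
        if sp.saved_rows is None:
            lines.append(f"{sp.name}: (never executed)")
        else:
            lines.append(f"{sp.name}: actual rows={sp.saved_rows}")
    return lines


plans = [SubPlan("subplan_1", lambda: 3), SubPlan("subplan_2", lambda: 7)]
execute_subplans(plans)
report = explain_subplans(plans)
print("\n".join(report))
```

Calling `explain_subplans` any number of times leaves each `executions` counter at 1, which is the property the bug report says the old explain path violated.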
naisila	4cd8bb1b67	Bump Citus version to 13.2devel	2025-06-24 16:21:48 +02:00

5 Commits (be6668e4400deb89cf0fd0198d157b649c10b09d)