Commit Graph

5 Commits (be6668e4400deb89cf0fd0198d157b649c10b09d)

Author SHA1 Message Date
Muhammad Usama be6668e440
Snapshot-Based Node Split – Foundation and Core Implementation (#8122)
**DESCRIPTION:**
This pull request introduces the foundation and core logic for the
snapshot-based node split feature in Citus. This feature enables
promoting a streaming replica (referred to as a clone in this feature
and UI) to a primary node and rebalancing shards between the original
and the newly promoted node without requiring a full data copy.

This significantly reduces rebalance times for scale-out operations
where the new node already contains a full copy of the data via
streaming replication.

Key Highlights:
**1. Replica (Clone) Registration & Management Infrastructure**

Introduces a new set of UDFs to register and manage clone nodes:

- citus_add_clone_node()
- citus_add_clone_node_with_nodeid()
- citus_remove_clone_node()
- citus_remove_clone_node_with_nodeid()

These functions allow administrators to register a streaming replica of
an existing worker node as a clone, making it eligible for later
promotion via snapshot-based split.

**2. Snapshot-Based Node Split (Core Implementation)**
New core UDF: 

- citus_promote_clone_and_rebalance()

This function implements the full workflow to promote a clone and
rebalance shards between the old and new primaries. Steps include:

1. Ensuring Exclusivity – Blocks any concurrent placement-changing
operations.
2. Blocking Writes – Temporarily blocks writes on the primary to ensure
consistency.
3. Replica Catch-up – Waits for the replica to be fully in sync.
4. Promotion – Promotes the replica to a primary using pg_promote.
5. Metadata Update – Updates metadata to reflect the newly promoted
primary node.
6. Shard Rebalancing – Redistributes shards between the old and new
primary nodes.


**3. Split Plan Preview**
A new helper UDF get_snapshot_based_node_split_plan() provides a preview
of the shard distribution post-split, without executing the promotion.

**Example:**

```
reb 63796> select * from pg_catalog.get_snapshot_based_node_split_plan('127.0.0.1',5433,'127.0.0.1',5453);
  table_name  | shardid | shard_size | placement_node 
--------------+---------+------------+----------------
 companies    |  102008 |          0 | Primary Node
 campaigns    |  102010 |          0 | Primary Node
 ads          |  102012 |          0 | Primary Node
 mscompanies  |  102014 |          0 | Primary Node
 mscampaigns  |  102016 |          0 | Primary Node
 msads        |  102018 |          0 | Primary Node
 mscompanies2 |  102020 |          0 | Primary Node
 mscampaigns2 |  102022 |          0 | Primary Node
 msads2       |  102024 |          0 | Primary Node
 companies    |  102009 |          0 | Clone Node
 campaigns    |  102011 |          0 | Clone Node
 ads          |  102013 |          0 | Clone Node
 mscompanies  |  102015 |          0 | Clone Node
 mscampaigns  |  102017 |          0 | Clone Node
 msads        |  102019 |          0 | Clone Node
 mscompanies2 |  102021 |          0 | Clone Node
 mscampaigns2 |  102023 |          0 | Clone Node
 msads2       |  102025 |          0 | Clone Node
(18 rows)

```
**4 Test Infrastructure Enhancements**

- Added a new test case scheduler for snapshot-based split scenarios.
- Enhanced pg_regress_multi.pl to support creating node backups with
slightly modified options to simulate real-world backup-based clone
creation.

### 5. Usage Guide
The snapshot-based node split can be performed using the following
workflow:

**- Take a Backup of the Worker Node**
Run pg_basebackup (or an equivalent tool) against the existing worker
node to create a physical backup.

`pg_basebackup -h <primary_worker_host> -p <port> -D
/path/to/replica/data --write-recovery-conf
`

**- Start the Replica Node**
Start PostgreSQL on the replica using the backup data directory,
ensuring it is configured as a streaming replica of the original worker
node.

**- Register the Backup Node as a Clone**
Mark the registered replica as a clone of its original worker node:

`SELECT * FROM citus_add_clone_node('<clone_host>', <clone_port>,
'<primary_host>', <primary_port>);
`

**- Promote and Rebalance the Clone**
Promote the clone to a primary and rebalance shards between it and the
original worker:

`SELECT * FROM citus_promote_clone_and_rebalance('clone_node_id');
`

**- Drop Any Replication Slots from the Original Worker**
After promotion, clean up any unused replication slots from the original
worker:

`SELECT pg_drop_replication_slot('<slot_name>');
`
2025-08-19 14:13:55 +03:00
Muhammad Usama f743b35fc2
Parallelize Shard Rebalancing & Unlock Concurrent Logical Shard Moves (#7983)
DESCRIPTION: Parallelizes shard rebalancing and removes the bottlenecks
that previously blocked concurrent logical-replication moves.
These improvements reduce rebalance windows—particularly for clusters
with large reference tables and enable multiple shard transfers to run in parallel.

Motivation:
Citus’ shard rebalancer has some key performance bottlenecks:
**Sequential Movement of Reference Tables:**
Reference tables are often assumed to be small, but in real-world
deployments, they can grow significantly large. Previously, reference
table shards were transferred as a single unit, making the process
monolithic and time-consuming.
**No Parallelism Within a Colocation Group:**
Although Citus distributes data using colocated shards, shard
movements within the same colocation group were serialized. In
environments with hundreds of distributed tables colocated
together, this serialization significantly slowed down rebalance
operations.
 **Excessive Locking:**
 Rebalancer used restrictive locks and redundant logical replication
guards, further limiting concurrency.
The goal of this commit is to eliminate these inefficiencies and enable
maximum parallelism during rebalance, without compromising correctness
or compatibility. Parallelize shard rebalancing to reduce rebalance
time.

Feature Summary:

**1. Parallel Reference Table Rebalancing**
Each reference-table shard is now copied in its own background task.
Foreign key and other constraints are deferred until all shards are
copied.
For single shard movement without considering colocation a new
internal-only UDF '`citus_internal_copy_single_shard_placement`' is
introduced to allow single-shard copy/move operations.
Since this function is internal, we do not allow users to call it
directly.

**Temporary Hack to Set Background Task Context** Background tasks
cannot currently set custom GUCs like application_name before executing
internal-only functions. 'citus_rebalancer ...' statement as a prefix in
the task command. This is a temporary hack to label internal tasks until
proper GUC injection support is added to the background task executor.

**2. Changes in Locking Strategy**

- Drop the leftover replication lock that previously serialized shard
moves performed via logical replication. This lock was only needed when
we used to drop and recreate the subscriptions/publications before each
move. Since Citus now removes those objects later as part of the “unused
distributed objects” cleanup, shard moves via logical replication can
safely run in parallel without additional locking.

- Introduced a per-shard advisory lock to prevent concurrent operations
on the same shard while allowing maximum parallelism elsewhere.

- Change the lock mode in AcquirePlacementColocationLock from
ExclusiveLock to RowExclusiveLock to allow concurrent updates within the
same colocation group, while still preventing concurrent DDL operations.

**3. citus_rebalance_start() enhancements**
The citus_rebalance_start() function now accepts two new optional
parameters:

```
- parallel_transfer_colocated_shards BOOLEAN DEFAULT false,
- parallel_transfer_reference_tables BOOLEAN DEFAULT false
```
This ensures backward compatibility by preserving the existing behavior
and avoiding any disruption to user expectations and when both are set
to true, the rebalancer operates with full parallelism.

**Previous Rebalancer Behavior:**
`SELECT citus_rebalance_start(shard_transfer_mode := 'force_logical');`
This would:
Start a single background task for replicating all reference tables
Then, move all shards serially, one at a time.
```
Task 1: replicate_reference_tables()
         ↓
         Task 2: move_shard_1()
         ↓
         Task 3: move_shard_2()
         ↓
         Task 4: move_shard_3()
```
Slow and sequential. Reference table copy is a bottleneck. Colocated
shards must wait for each other.

**New Parallel Rebalancer:**
```
SELECT citus_rebalance_start(
        shard_transfer_mode := 'force_logical',
        parallel_transfer_colocated_shards := true,
        parallel_transfer_reference_tables := true
      );
```
This would:

- Schedule independent background tasks for each reference-table shard.
- Move colocated shards in parallel, while still maintaining dependency
order.
- Defer constraint application until all reference shards are in place.
-     
```
Task 1: copy_ref_shard_1()
          Task 2: copy_ref_shard_2()
          Task 3: copy_ref_shard_3()
            → Task 4: apply_constraints()
          ↓
         Task 5: copy_shard_1()
         Task 6: copy_shard_2()
         Task 7: copy_shard_3()
         ↓
         Task 8-10: move_shard_1..3()
```
Each operation is scheduled independently and can run as soon as
dependencies are satisfied.
2025-08-18 17:44:14 +03:00
Onur Tirtir 87a1b631e8
Not automatically create citus_columnar when creating citus extension (#8081)
DESCRIPTION: Not automatically create citus_columnar when there are no
relations using it.

Previously, we were always creating citus_columnar when creating citus
with version >= 11.1. And how we were doing was as follows:
* Detach SQL objects owned by old columnar, i.e., "drop" them from
citus, but not actually drop them from the database
* "old columnar" is the one that we had before Citus 11.1 as part of
citus, i.e., before splitting the access method ands its catalog to
citus_columnar.
* Create citus_columnar and attach the SQL objects leftover from old
columnar to it so that we can continue supporting the columnar tables
that user had before Citus 11.1 with citus_columnar.

First part is unchanged, however, now we don't create citus_columnar
automatically anymore if the user didn't have any relations using
columnar. For this reason, as of Citus 13.2, when these SQL objects are
not owned by an extension and there are no relations using columnar
access method, we drop these SQL objects when updating Citus to 13.2.

The net effect is still the same as if we automatically created
citus_columnar and user dropped citus_columnar later, so we should not
have any issues with dropping them.

(**Update:** Seems we've made some assumptions in citus, e.g.,
citus_finish_pg_upgrade() still assumes columnar metadata exists and
tries to apply some fixes for it, so this PR fixes them as well. See the
last section of this PR description.)

Also, ideally I was hoping to just remove some lines of code from
extension.c, where we decide automatically creating citus_columnar when
creating citus, however, this didn't happen to be the case for two
reasons:
* We still need to automatically create it for the servers using
columnar access method.
* We need to clean-up the leftover SQL objects from old columnar when
the above is not case otherwise we would have leftover SQL objects from
old columnar for no reason, and that would confuse users too.
* Old columnar cannot be used to create columnar tables properly, so we
should clean them up and let the user decide whether they want to create
citus_columnar when they really need it later.

---

Also made several changes in the test suite because similarly, we don't
always want to have citus_columnar created in citus tests anymore:
* Now, columnar specific test targets, which cover **41** test sql
files, always install columnar by default, by using
"--load-extension=citus_columnar".
* "--load-extension=citus_columnar" is not added to citus specific test
targets because by default we don't want to have citus_columnar created
during citus tests.
* Excluding citus_columnar specific tests, we have **601** sql files
that we have as citus tests and in **27** of them we manually create
citus_columnar at the very beginning of the test because these tests do
test some functionalities of citus together with columnar tables.

Also, before and after schedules for PG upgrade tests are now duplicated
so we have two versions of each: one with columnar tests and one
without. To choose between them, check-pg-upgrade now supports a
"test-with-columnar" option, which can be set to "true" or anything else
to logically indicate "false". In CI, we run the check-pg-upgrade test
target with both options. The purpose is to ensure we can test PG
upgrades where citus_columnar is not created in the cluster before the
upgrade as well.

Finally, added more tests to multi_extension.sql to test Citus upgrade
scenarios with / without columnar tables / citus_columnar extension.

---

Also, seems citus_finish_pg_upgrade was assuming that citus_columnar is
always created but actually we should have never made such an
assumption. To fix that, moved columnar specific post-PG-upgrade work
from citus to a new columnar UDF, which is columnar_finish_pg_upgrade.
But to avoid breaking existing customer / managed service scripts, we
continue to automatically perform post PG-upgrade work for columnar
within citus_finish_pg_upgrade, but only if columnar access method
exists this time.
2025-08-18 08:29:27 +01:00
Teja Mupparti 889aa92ac0
EXPLAIN ANALYZE - Prevent execution of the plan during the plan-print (#8017)
DESCRIPTION: Fixed a bug in EXPLAIN ANALYZE to prevent unintended (duplicate) execution of the (sub)plans during the explain phase.

Fixes #4212 

### 🐞 Bug #4212 : Redundant (Subplan) Execution in `EXPLAIN ANALYZE`
codepath

#### 🔍 Background
In the standard PostgreSQL execution path, `ExplainOnePlan()` is
responsible for two distinct operations depending on whether `EXPLAIN
ANALYZE` is requested:

1. **Execute the plan**

   ```c
   if (es->analyze)
       ExecutorRun(queryDesc, direction, 0L, true);
   ```

2. **Print the plan tree** 

   ```c
   ExplainPrintPlan(es, queryDesc);
   ```

When printing the plan, the executor should **not run the plan again**.
Execution is only expected to happen once—at the top level when
`es->analyze = true`.

---

#### ⚠️ Issue in Citus

In the Citus implementation of `CustomScanMethods.ExplainCustomScan =
CitusExplainScan`, which is a custom scan explain callback function used
to print explain information of a Citus plan incorrectly performs
**redundant execution** inside the explain path of `ExplainPrintPlan()`

```c
ExplainOnePlan()
  ExplainPrintPlan()
      ExplainNode()
        CitusExplainScan()
          if (distributedPlan->subPlanList != NIL)
          {
              ExplainSubPlans(distributedPlan, es);
             {
              PlannedStmt *plan = subPlan->plan;
              ExplainOnePlan(plan, ...);  // ⚠️ May re-execute subplan if es->analyze is true
             }
         }
```
This causes the subplans to be **executed again**, even though they have
already been executed during the top-level plan execution. This behavior
violates the expectation in PostgreSQL where `EXPLAIN ANALYZE` should
**execute each node exactly once** for analysis.

---
####  Fix (proposed)
Save the output of Subplans during `ExecuteSubPlans()`, and later use it
in `ExplainSubPlans()`
2025-07-30 11:29:50 -07:00
naisila 4cd8bb1b67 Bump Citus version to 13.2devel 2025-06-24 16:21:48 +02:00