DESCRIPTION: Avoid query deparse and planning of shard query in local execution. Adds citus.enable_local_execution_local_plan GUC to allow avoiding unnecessary query deparsing to improve performance of fast-path queries targeting local shards.
If a fast path query resolves to a shard that is local to the node planning the query, a shortcut can be taken so that the OID of the shard is plugged into the parse tree, which is then planned by Postgres. In `local_executor.c` the task uses that plan instead of parsing and planning a shard query. How this is done: The fast path planner identifies if the shortcut is possible, and then the distributed planner checks, using `CheckAndBuildDelayedFastPathPlan()`, if a local plan can be generated or if the shard query should be generated.
This optimization is controlled by a GUC `citus.enable_local_execution_local_plan` which is on by default. A new
regress test `local_execution_local_plan` tests both row-sharding and schema sharding. Negative tests are added to
`local_shard_execution_dropped_column` to verify that the optimization is not taken when the shard is local but there is a difference between the shard and distributed table because of a dropped column.
This PR provides successful build against PG18Beta1. RuleUtils PR was
reviewed separately: #8010
## PG 18Beta1–related changes for building Citus
### TupleDesc / Attr layout
**What changed in PG:** Postgres consolidated the
`TupleDescData.attrs[]` array into a more compact representation. Direct
field access (tupdesc->attrs[i]) was replaced by the new
`TupleDescAttr()` API.
**Citus adaptation:** Everywhere we previously used
`tupdesc->attrs[...]`, we now call `TupleDescAttr(tupdesc, idx)` (or our
own `Attr()` macro) under a compatibility guard.
*
5983a4cffc
General Logic:
* Use `Attr(...)` in places where `columnar_version_compat.h` is
included. This avoids the need to sprinkle `#if PG_VERSION_NUM` guards
around each attribute access.
* Use `TupleDescAttr(tupdesc, i)` when the relevant PostgreSQL header is
already included and the additional macro indirection is unnecessary.
### Collation‐aware `LIKE`
**What changed in PG:** The `textlike` operator now requires an explicit
collation, to avoid ambiguous‐collation errors. Core code switched from
`DirectFunctionCall2(textlike, ...)` to
`DirectFunctionCall2Coll(textlike, DEFAULT_COLLATION_OID, ...)`.
**Citus adaptation:** In `remote_commands.c` and any other LIKE call, we
now use `DirectFunctionCall2Coll(textlike, DEFAULT_COLLATION_OID, ...)`
and `#include <utils/pg_collation.h>`.
*
85b7efa1cd
### Columnar storage API
* Adapt `columnar_relation_set_new_filelocator` (and related init
routines) for PG 18’s revised SMGR and storage-initialization hooks.
* Pull in the new headers (`explain_format.h`,
`columnar_version_compat.h`) so the columnar module compiles cleanly
against PG 18.
- heap_modify_tuple + heap_inplace_update only exist on PG < 18; on PG18
the in-place helper was removed upstream
-
a07e03fd8f
### OpenSSL / TLS integration
**What changed in PG:** Moved from the legacy `SSL_library_init()` to
`OPENSSL_init_ssl(OPENSSL_INIT_LOAD_CONFIG, NULL)`, updated certificate
API calls (`X509_getm_notBefore`, `X509_getm_notAfter`), and
standardized on `TLS_method()`.
**Citus adaptation:** We now `#include <openssl/opensslv.h>` and use
`#if OPENSSL_VERSION_NUMBER >= 0x10100000L` to choose between`
OPENSSL_init_ssl()` or `SSL_library_init()`, and wrap`
X509_gmtime_adj()` calls around the new accessor functions.
*
6c66b7443c
### Adapt `ExtractColumns()` to the new PG-18 `expandRTE()` signature
PostgreSQL 18
80feb727c8
added a fourth argument of type `VarReturningType` to `expandRTE()`, so
calls that used the old 7-parameter form no longer compile. This patch:
* Wraps the `expandRTE(...)` call in a `#if PG_VERSION_NUM >= 180000`
guard.
* On PG 18+ passes the new `VAR_RETURNING_DEFAULT` argument before
`location`.
* On PG 15–17 continues to call the original 7-arg form.
* Adds the necessary includes (`parser/parse_relation.h` for `expandRTE`
and `VarReturningType`, and `pg_version_constants.h` for
`PG_VERSION_NUM`).
### Adapt `ExecutorStart`/`ExecutorRun` hooks to PG-18’s new signatures
PostgreSQL 18
525392d572
changed the signatures of the executor hooks:
* `ExecutorStart_hook` now returns `bool` instead of `void`, and
* `ExecutorRun_hook` drops its old `run_once` argument.
This patch preserves Citus’s existing hook logic by:
1. **Adding two adapter functions** under `#if PG_VERSION_NUM >=
PG_VERSION_18`:
* `citus_executor_start_adapter(QueryDesc *queryDesc, int eflags)`
Calls the old `CitusExecutorStart(queryDesc, eflags)` and then returns
`true` to satisfy the new hook’s `bool` return type.
* `citus_executor_run_adapter(QueryDesc *queryDesc, ScanDirection
direction, uint64 count)`
Calls the old `CitusExecutorRun(queryDesc, direction, count, true)`
(passing `true` for the dropped `run_once` argument), and returns
`void`.
2. **Installing the adapters** in `_PG_init()` instead of the original
hooks when building against PG 18+:
```c
#if PG_VERSION_NUM >= PG_VERSION_18
ExecutorStart_hook = citus_executor_start_adapter;
ExecutorRun_hook = citus_executor_run_adapter;
#else
ExecutorStart_hook = CitusExecutorStart;
ExecutorRun_hook = CitusExecutorRun;
#endif
```
### Adapt to PG-18’s removal of the “run\_once” flag from
ExecutorRun/PortalRun
PostgreSQL commit
[[3eea7a0](3eea7a0c97)
rationalized the executor’s parallelism logic by moving the “execute a
plan only once” check into `ExecutePlan()` itself and dropping the old
`bool run_once` argument from the public APIs:
```diff
- void ExecutorRun(QueryDesc *queryDesc,
- ScanDirection direction,
- uint64 count,
- bool run_once);
+ void ExecutorRun(QueryDesc *queryDesc,
+ ScanDirection direction,
+ uint64 count);
```
(and similarly for `PortalRun()`).
To stay compatible across PG 15–18, Citus now:
1. **Updates all internal calls** to `ExecutorRun(...)` and
`PortalRun(...)`:
* On PG 18+, use the new three-argument form (`ExecutorRun(qd, dir,
count)`).
* On PG 15–17, keep the old four-arg form (`ExecutorRun(qd, dir, count,
true)`) under a `#if PG_VERSION_NUM < 180000` guard.
2. **Guards the dispatcher hooks** via the adapter functions (from the
earlier patch) so that Citus’s executor hooks continue to work under
both the old and new signatures.
### Adapt to PG-18’s shortened PortalRun signature
PostgreSQL 18’s refactoring (see commit
[3eea7a0](3eea7a0c97))
also removed the old run_once and alternate‐dest arguments from the
public PortalRun() API. The signature changed from:
```diff
- bool PortalRun(Portal portal,
- long count,
- bool isTopLevel,
- bool run_once,
- DestReceiver *dest,
- DestReceiver *altdest,
- QueryCompletion *qc);
+ bool PortalRun(Portal portal,
+ long count,
+ bool isTopLevel,
+ DestReceiver *dest,
+ DestReceiver *altdest,
+ QueryCompletion *qc);
```
To support both versions in Citus, we:
1. **Version-guard each call** to `PortalRun()`:
* **On PG 18+** invoke the new 6-argument form.
* **On PG 15–17** fall back to the legacy 7-argument form, passing
`true` for `run_once`.
### Add support for PG-18’s new `plansource` argument in
`PortalDefineQuery`**
PostgreSQL 18 extended the `PortalDefineQuery` API to carry a
`CachedPlanSource *plansource` pointer so that the portal machinery can
track cached‐plan invalidation (as introduced alongside deferred-locking
in commit
525392d572.
To remain compatible across PG 15–18, Citus now wraps its calls under a
version guard:
```diff
- PortalDefineQuery(portal, NULL, sql, commandTag, plantree_list, NULL);
+#if PG_VERSION_NUM >= 180000
+ /* PG 18+: seven-arg signature (adds plansource) */
+ PortalDefineQuery(
+ portal,
+ NULL, /* no prepared-stmt name */
+ sql, /* the query text */
+ commandTag, /* the CommandTag */
+ plantree_list, /* List of PlannedStmt* */
+ NULL, /* no CachedPlan */
+ NULL /* no CachedPlanSource */
+ );
+#else
+ /* PG 15–17: six-arg signature */
+ PortalDefineQuery(
+ portal,
+ NULL, /* no prepared-stmt name */
+ sql, /* the query text */
+ commandTag, /* the CommandTag */
+ plantree_list, /* List of PlannedStmt* */
+ NULL /* no CachedPlan */
+ );
+#endif
```
### Adapt ExecInitRangeTable() calls to PG-18’s new signature
PostgreSQL commit
[cbc127917e04a978a788b8bc9d35a70244396d5b](cbc127917e)
overhauled the planner API for range‐table initialization:
**PG 18+**: added a fourth `Bitmapset *unpruned_relids` argument to
support deferred partition pruning
In Citus’s `create_estate_for_relation()` (in `columnar_metadata.c`), we
now wrap the call in a compile‐time guard so that the code compiles
correctly on all supported PostgreSQL versions:
```
/* Prepare permission info on PG 16+ */
#if PG_VERSION_NUM >= PG_VERSION_16
List *perminfos = NIL;
addRTEPermissionInfo(&perminfos, rte);
#else
List *perminfos = NIL; /* unused on PG 15 */
#endif
/* Initialize the range table, with the right signature for each PG version */
#if PG_VERSION_NUM >= PG_VERSION_18
/* PG 18+: four‐arg signature (adds unpruned_relids) */
ExecInitRangeTable(
estate,
list_make1(rte),
perminfos,
NULL /* unpruned_relids: not used by columnar */
);
#elif PG_VERSION_NUM >= PG_VERSION_16
/* PG 16–17: three‐arg signature (permInfos) */
ExecInitRangeTable(
estate,
list_make1(rte),
perminfos
);
#else
/* PG 15: two‐arg signature */
ExecInitRangeTable(
estate,
list_make1(rte)
);
#endif
estate->es_output_cid = GetCurrentCommandId(true);
```
### Adapt `pgstat_report_vacuum()` to PG-18’s new timestamp argument
PostgreSQL commit
[[30a6ed0ce4bb18212ec38cdb537ea4b43bc99b83](30a6ed0ce4)
extended the `pgstat_report_vacuum()` API by adding a `TimestampTz
start_time` parameter at the end so that the VACUUM statistics collector
can record when the operation began:
```diff
/* PG ≤17: four-arg signature */
- void pgstat_report_vacuum(Oid tableoid,
- bool shared,
- double num_live_tuples,
- double num_dead_tuples);
+/* PG ≥18: five-arg signature adds a start_time */
+ void pgstat_report_vacuum(Oid tableoid,
+ bool shared,
+ double num_live_tuples,
+ double num_dead_tuples,
+ TimestampTz start_time);
```
To support both versions, we now wrap the call in `columnar_tableam.c`
with a version guard, supplying `GetCurrentTimestamp()` for PG-18+:
```c
#if PG_VERSION_NUM >= 180000
/* PG 18+: include start_timestamp */
pgstat_report_vacuum(
RelationGetRelid(rel),
rel->rd_rel->relisshared,
Max(new_live_tuples, 0), /* live tuples */
0, /* dead tuples */
GetCurrentTimestamp() /* start time */
);
#else
/* PG 15–17: original signature */
pgstat_report_vacuum(
RelationGetRelid(rel),
rel->rd_rel->relisshared,
Max(new_live_tuples, 0), /* live tuples */
0 /* dead tuples */
);
#endif
```
### Adapt `ExecuteTaskPlan()` to PG-18’s expanded `CreateQueryDesc()`
signature
PostgreSQL 18 changed `CreateQueryDesc()` from an eight-argument to a
nine-argument call by inserting a `CachedPlan *cplan` parameter
immediately after the `PlannedStmt *plannedstmt` argument (see commit
525392d572).
To remain compatible with PG 15–17, Citus now wraps its invocation in
`local_executor.c` with a version guard:
```diff
- /* PG15–17: eight-arg CreateQueryDesc without cached plan */
- QueryDesc *queryDesc = CreateQueryDesc(
- taskPlan, /* PlannedStmt *plannedstmt */
- queryString, /* const char *sourceText */
- GetActiveSnapshot(),/* Snapshot snapshot */
- InvalidSnapshot, /* Snapshot crosscheck_snapshot */
- destReceiver, /* DestReceiver *dest */
- paramListInfo, /* ParamListInfo params */
- queryEnv, /* QueryEnvironment *queryEnv */
- 0 /* int instrument_options */
- );
+#if PG_VERSION_NUM >= 180000
+ /* PG18+: nine-arg CreateQueryDesc with a CachedPlan slot */
+ QueryDesc *queryDesc = CreateQueryDesc(
+ taskPlan, /* PlannedStmt *plannedstmt */
+ NULL, /* CachedPlan *cplan (none) */
+ queryString, /* const char *sourceText */
+ GetActiveSnapshot(),/* Snapshot snapshot */
+ InvalidSnapshot, /* Snapshot crosscheck_snapshot */
+ destReceiver, /* DestReceiver *dest */
+ paramListInfo, /* ParamListInfo params */
+ queryEnv, /* QueryEnvironment *queryEnv */
+ 0 /* int instrument_options */
+ );
+#else
+ /* PG15–17: eight-arg CreateQueryDesc without cached plan */
+ QueryDesc *queryDesc = CreateQueryDesc(
+ taskPlan, /* PlannedStmt *plannedstmt */
+ queryString, /* const char *sourceText */
+ GetActiveSnapshot(),/* Snapshot snapshot */
+ InvalidSnapshot, /* Snapshot crosscheck_snapshot */
+ destReceiver, /* DestReceiver *dest */
+ paramListInfo, /* ParamListInfo params */
+ queryEnv, /* QueryEnvironment *queryEnv */
+ 0 /* int instrument_options */
+ );
+#endif
```
### Adapt `RelationGetPrimaryKeyIndex()` to PG-18’s new “deferrable\_ok”
flag
PostgreSQL commit
14e87ffa5c
added a new Boolean `deferrable_ok` parameter to
`RelationGetPrimaryKeyIndex()` so that the lock manager can defer
unique‐constraint locks when requested. The API changed from:
```c
RelationGetPrimaryKeyIndex(Relation relation)
```
to:
```c
RelationGetPrimaryKeyIndex(Relation relation, bool deferrable_ok)
```
```diff
diff --git a/src/backend/distributed/metadata/node_metadata.c
b/src/backend/distributed/metadata/node_metadata.c
index e3a1b2c..f4d5e6f 100644
--- a/src/backend/distributed/metadata/node_metadata.c
+++ b/src/backend/distributed/metadata/node_metadata.c
@@ -2965,8 +2965,18 @@
*/
- Relation replicaIndex =
index_open(RelationGetPrimaryKeyIndex(pgDistNode),
- AccessShareLock);
+ #if PG_VERSION_NUM >= PG_VERSION_18
+ /* PG 18+ adds a bool "deferrable_ok" parameter */
+ Relation replicaIndex =
+ index_open(
+ RelationGetPrimaryKeyIndex(pgDistNode, false),
+ AccessShareLock);
+ #else
+ Relation replicaIndex =
+ index_open(
+ RelationGetPrimaryKeyIndex(pgDistNode),
+ AccessShareLock);
+ #endif
ScanKeyInit(&scanKey[0], Anum_pg_dist_node_nodename,
BTEqualStrategyNumber, F_TEXTEQ, CStringGetTextDatum(nodeName));
```
```diff
diff --git a/src/backend/distributed/operations/node_protocol.c b/src/backend/distributed/operations/node_protocol.c
index e3a1b2c..f4d5e6f 100644
--- a/src/backend/distributed/operations/node_protocol.c
+++ b/src/backend/distributed/operations/node_protocol.c
@@ -746,7 +746,12 @@
if (!OidIsValid(idxoid))
{
- idxoid = RelationGetPrimaryKeyIndex(rel);
+ /* Determine the index OID of the primary key (PG18 adds a second parameter) */
+#if PG_VERSION_NUM >= PG_VERSION_18
+ idxoid = RelationGetPrimaryKeyIndex(rel, false);
+#else
+ idxoid = RelationGetPrimaryKeyIndex(rel);
+#endif
}
return idxoid;
```
Because Citus has always taken the lock immediately—just as the old
two-arg call did—we pass `false` to keep that same immediate-lock
behavior. Passing `true` would switch to deferred locking, which we
don’t want.
### Adapt `ExplainOnePlan()` to PG-18’s expanded API
PostgreSQL 18 extended
525392d572
the `ExplainOnePlan()` function to carry the `CachedPlan *` and
`CachedPlanSource *` pointers plus an explicit `query_index`, letting
the EXPLAIN machinery track plan‐source invalidation. The old signature:
```c
/* PG ≤17 */
void
ExplainOnePlan(PlannedStmt *plannedstmt,
IntoClause *into,
struct ExplainState *es,
const char *queryString,
ParamListInfo params,
QueryEnvironment *queryEnv,
const instr_time *planduration,
const BufferUsage *bufusage);
```
became, in PG 18:
```c
/* PG ≥18 */
void
ExplainOnePlan(PlannedStmt *plannedstmt,
CachedPlan *cplan,
CachedPlanSource *plansource,
int query_index,
IntoClause *into,
struct ExplainState *es,
const char *queryString,
ParamListInfo params,
QueryEnvironment *queryEnv,
const instr_time *planduration,
const BufferUsage *bufusage,
const MemoryContextCounters *mem_counters);
```
To compile under both versions, Citus now wraps each call in
`multi_explain.c` with:
```c
#if PG_VERSION_NUM >= PG_VERSION_18
/* PG 18+: pass NULL for the new cached‐plan fields and zero for query_index */
ExplainOnePlan(
plan, /* PlannedStmt *plannedstmt */
NULL, /* CachedPlan *cplan */
NULL, /* CachedPlanSource *plansource */
0, /* query_index */
into, /* IntoClause *into */
es, /* ExplainState *es */
queryString, /* const char *queryString */
params, /* ParamListInfo params */
NULL, /* QueryEnvironment *queryEnv */
&planduration,/* const instr_time *planduration */
(es->buffers ? &bufusage : NULL),
(es->memory ? &mem_counters : NULL)
);
#elif PG_VERSION_NUM >= PG_VERSION_17
/* PG 17: same as before, plus passing mem_counters if enabled */
ExplainOnePlan(
plan,
into,
es,
queryString,
params,
queryEnv,
&planduration,
(es->buffers ? &bufusage : NULL),
(es->memory ? &mem_counters : NULL)
);
#else
/* PG 15–16: original seven-arg form */
ExplainOnePlan(
plan,
into,
es,
queryString,
params,
queryEnv,
&planduration,
(es->buffers ? &bufusage : NULL)
);
#endif
```
### Adapt to the unified “index interpretation” API in PG 18 (commit
a8025f544854)
PostgreSQL commit
a8025f5448
generalized the old btree‐specific operator‐interpretation API into a
single “index interpretation” interface:
* **Renamed type**:
`OpBtreeInterpretation` → `OpIndexInterpretation`
* **Renamed function**:
`get_op_btree_interpretation(opno)` →
`get_op_index_interpretation(opno)`
* **Unified field**:
Each interpretation now carries `cmptype` instead of `strategy`.
To build cleanly on PG 18 while still supporting PG 15–17, Citus’s
shard‐pruning code now wraps these changes:
```c
#include "pg_version_constants.h"
#if PG_VERSION_NUM >= PG_VERSION_18
/* On PG 18+ the btree‐only APIs vanished; alias them to the new generic versions */
typedef OpIndexInterpretation OpBtreeInterpretation;
#define get_op_btree_interpretation(opno) get_op_index_interpretation(opno)
#define ROWCOMPARE_NE COMPARE_NE
#endif
/* … later, when checking an interpretation … */
OpBtreeInterpretation *interp =
(OpBtreeInterpretation *) lfirst(cell);
#if PG_VERSION_NUM >= PG_VERSION_18
/* use cmptype on PG 18+ */
if (interp->cmptype == ROWCOMPARE_NE)
#else
/* use strategy on PG 15–17 */
if (interp->strategy == ROWCOMPARE_NE)
#endif
{
/* … */
}
```
### Adapt `create_foreignscan_path()` for PG-18’s revised signature
PostgreSQL commit
e222534679
reordered and removed a couple of parameters in the FDW‐path builder:
* **PG 15–17 signature (11 args)**
```c
create_foreignscan_path(PlannerInfo *root,
RelOptInfo *rel,
PathTarget *target,
double rows,
Cost startup_cost,
Cost total_cost,
List *pathkeys,
Relids required_outer,
Path *fdw_outerpath,
List *fdw_restrictinfo,
List *fdw_private);
```
* **PG 18+ signature (9 args)**
```c
create_foreignscan_path(PlannerInfo *root,
RelOptInfo *rel,
PathTarget *target,
double rows,
int disabled_nodes,
Cost startup_cost,
Cost total_cost,
Relids required_outer,
Path *fdw_outerpath,
List *fdw_private);
```
To support both, Citus now defines a compatibility macro in
`pg_version_compat.h`:
```c
#include "nodes/bitmapset.h" /* for Relids */
#include "nodes/pg_list.h" /* for List */
#include "optimizer/pathnode.h" /* for create_foreignscan_path() */
#if PG_VERSION_NUM >= PG_VERSION_18
/* PG18+: drop pathkeys & fdw_restrictinfo, add disabled_nodes */
#define create_foreignscan_path_compat(a, b, c, d, e, f, g, h, i, j, k) \
create_foreignscan_path( \
(a), /* root */ \
(b), /* rel */ \
(c), /* target */ \
(d), /* rows */ \
(0), /* disabled_nodes (unused by Citus) */ \
(e), /* startup_cost */ \
(f), /* total_cost */ \
(g), /* required_outer */ \
(h), /* fdw_outerpath */ \
(k) /* fdw_private */ \
)
#else
/* PG15–17: original signature */
#define create_foreignscan_path_compat(a, b, c, d, e, f, g, h, i, j, k) \
create_foreignscan_path( \
(a), (b), (c), (d), \
(e), (f), \
(g), (h), (i), (j), (k) \
)
#endif
```
Now every call to `create_foreignscan_path_compat(...)`—even in tests
like `fake_fdw.c`—automatically picks the correct argument list for
PG 15 through PG 18.
### Drop the obsolete bitmap‐scan hooks on PG 18+
PostgreSQL commit
c3953226a0
cleaned up the `TableAmRoutine` API by removing the two bitmap‐scan
callback slots:
* `scan_bitmap_next_block`
* `scan_bitmap_next_tuple`
Since those hook‐slots no longer exist in PG 18, Citus now wraps their
NULL‐initialization in a `#if PG_VERSION_NUM < PG_VERSION_18` guard. On
PG 15–17 we still explicitly set them to `NULL` (to satisfy the old
struct layout), and on PG 18+ we omit them entirely:
```c
#if PG_VERSION_NUM < PG_VERSION_18
/* PG 15–17 only: these fields were removed upstream in PG 18 */
.scan_bitmap_next_block = NULL,
.scan_bitmap_next_tuple = NULL,
#endif
```
### Adapt `vac_update_relstats()` invocation to PG-18’s new
“all\_frozen” argument
PostgreSQL commit
99f8f3fbbc
extended the `vac_update_relstats()` API by inserting a
`num_all_frozen_pages` parameter between the existing
`num_all_visible_pages` and `hasindex` arguments:
```diff
- /* PG ≤17: */
- void
- vac_update_relstats(Relation relation,
- BlockNumber num_pages,
- double num_tuples,
- BlockNumber num_all_visible_pages,
- bool hasindex,
- TransactionId frozenxid,
- MultiXactId minmulti,
- bool *frozenxid_updated,
- bool *minmulti_updated,
- bool in_outer_xact);
+ /* PG ≥18: adds num_all_frozen_pages */
+ void
+ vac_update_relstats(Relation relation,
+ BlockNumber num_pages,
+ double num_tuples,
+ BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
+ bool hasindex,
+ TransactionId frozenxid,
+ MultiXactId minmulti,
+ bool *frozenxid_updated,
+ bool *minmulti_updated,
+ bool in_outer_xact);
```
To compile cleanly on both PG 15–17 and PG 18+, Citus wraps its call in
a version guard and supplies a zero placeholder for the new field:
```c
#if PG_VERSION_NUM >= 180000
/* PG 18+: supply explicit “all_frozen” count */
vac_update_relstats(
rel,
new_rel_pages,
new_live_tuples,
new_rel_allvisible, /* allvisible */
0, /* all_frozen */
nindexes > 0,
newRelFrozenXid,
newRelminMxid,
&frozenxid_updated,
&minmulti_updated,
false /* in_outer_xact */
);
#else
/* PG 15–17: original signature */
vac_update_relstats(
rel,
new_rel_pages,
new_live_tuples,
new_rel_allvisible,
nindexes > 0,
newRelFrozenXid,
newRelminMxid,
&frozenxid_updated,
&minmulti_updated,
false /* in_outer_xact */
);
#endif
```
**Why all_frozen = 0?**
Columnar storage never embeds transaction IDs in its pages, so it never
needs to track “all‐frozen” pages the way a heap does. Setting both
allvisible and allfrozen to zero simply tells Postgres “there are no
pages with the visibility or frozen‐status bits set,” matching our
existing behavior.
This change ensures Citus’s VACUUM‐statistic updates work unmodified
across all supported Postgres versions.
DESCRIPTION: Adds citus_stat_counters view that can be used to query
stat counters that Citus collects while the feature is enabled, which is
controlled by citus.enable_stat_counters. citus_stat_counters() can be
used to query the stat counters for the provided database oid and
citus_stat_counters_reset() can be used to reset them for the provided
database oid or for the current database if nothing or 0 is provided.
Today we don't persist stat counters on server shutdown. In other words,
stat counters are automatically reset in case of a server restart.
Details on the underlying design can be found in header comment of
stat_counters.c and in the technical readme.
-------
Here are the details about what we track as of this PR:
For connection management, we have three statistics about the inter-node
connections initiated by the node itself:
* **connection_establishment_succeeded**
* **connection_establishment_failed**
* **connection_reused**
While the first two are relatively easier to understand, the third one
covers the case where a connection is reused. This can happen when a
connection was already established to the desired node, Citus decided to
cache it for some time (see citus.max_cached_conns_per_worker &
citus.max_cached_connection_lifetime), and then reused it for a new
remote operation. Here are the other important details about these
connection statistics:
1. connection_establishment_failed doesn't care about the connections
that we could establish but are lost later in the transaction. Plus, we
cannot guarantee that the connections that are counted in
connection_establishment_succeeded were not lost later.
2. connection_establishment_failed doesn't care about the optional
connections (see OPTIONAL_CONNECTION flag) that we gave up establishing
because of the connection throttling rules we follow (see
citus.max_shared_pool_size & citus.local_shared_pool_size). The reaason
for this is that we didn't even try to establish these connections.
3. For the rest of the cases where a connection failed for some reason,
we always increment connection_establishment_failed even if the caller
was okay with the failure and know how to recover from it (e.g., the
adaptive executor knows how to fall back local execution when the target
node is the local node and if it cannot establish a connection to the
local node). The reason is that even if it's likely that we can still
serve the operation, we still failed to establish the connection and we
want to track this.
4. Finally, the connection failures that we count in
connection_establishment_failed might be caused by any of the following
reasons and for now we prefer to _not_ further distinguish them for
simplicity:
a. remote node is down or cannot accept any more connections, or
overloaded such that citus.node_connection_timeout is not enough to
establish a connection
b. any internal Citus error that might result in preparing a bad
connection string so that libpq fails when parsing the connection string
even before actually trying to establish a connection via connect() call
c. broken citus.node_conninfo or such Citus configuration that was
incorrectly set by the user can also result in similar outcomes as in b
d. internal waitevent set / poll errors or OOM in local node
We also track two more statistics for query execution:
* **query_execution_single_shard**
* **query_execution_multi_shard**
And more importantly, both query_execution_single_shard and
query_execution_multi_shard are not only tracked for the top-level
queries but also for the subplans etc. The reason is that for some
queries, e.g., the ones that go through recursive planning, after Citus
performs the heavy work as part of subplans, the work that needs to be
done for the top-level query becomes quite straightforward. And for such
query types, it would be deceiving if we only incremented the query stat
counters for the top-level query. Similarly, for non-pushable INSERT ..
SELECT and MERGE queries, we perform separate counter increments for the
SELECT / source part of the query besides the final INSERT / MERGE
query.
DESCRIPTION: Ensure that a MERGE command on a distributed table with a
`WHEN NOT MATCHED BY SOURCE` clause runs against all shards of the
distributed table.
The Postgres MERGE command updates a table using a table or a query as a
data source. It provides three ways to match the target table with the
source: `WHEN MATCHED` means that there is a row in both the target and
source; `WHEN NOT MATCHED` means that there is a row in the source that
has no match (is not present) in the target; and, as of PG17, `WHEN NOT
MATCHED BY SOURCE` means that there is a row in the target that has no
match in the source.
In Citus, when a MERGE command updates a distributed table using a
local/reference table or a distributed query as source, that source is
repartitioned, and for each repartitioned shard that has data (i.e. 1 or
more rows) the MERGE is run against the corresponding distributed table
shard. Suppose the distributed table has 32 shards, and the source
repartitions into 4 shards that have data, with the remaining 28 shards
being empty; then the MERGE command is performed on the 4 corresponding
shards of the distributed table. However, the semantics of `WHEN NOT
MATCHED BY SOURCE` are that the specified action must be performed on
the target for each row in the target that is not in the source; so if
the source is empty, all target rows should be updated. To see this,
consider the following MERGE command:
```
MERGE INTO target AS t
USING source AS s ON t.id = s.id
WHEN NOT MATCHED BY SOURCE THEN UPDATE t SET t.col1 = 100
```
If the source has zero rows then every row in the target is updated s.t.
its col1 value is 100. Currently in Citus a MERGE on a distributed table
with a local/reference table or a distributed query as source ignores
shards of the distributed table when the corresponding shard of the
repartitioned source has zero rows. However, if the MERGE command
specifies a `WHEN NOT MATCHED BY SOURCE` clause, then the MERGE should
be performed on all shards of the distributed table, to ensure that the
specified action is performed on the target for each row in the target
that is not in the source. This PR enhances Citus MERGE execution so
that when a repartitioned source shard has zero rows, and the MERGE
command specifies a `WHEN NOT MATCHED BY SOURCE` clause, the MERGE is
performed against the corresponding shard of the distributed table using
an empty (zero row) relation as source, by generating a query of the
form:
```
MERGE INTO target_shard_0002 AS t
USING (SELECT id FROM (VALUES (NULL) ) source_0002(id) WHERE FALSE) AS s ON t.id = s.id
WHEN NOT MATCHED BY SOURCE THEN UPDATE t set t.col1 = 100
```
This works because each row in the target shard will be updated, and
`WHEN MATCHED` and `WHEN NOT MATCHED`, if specified, will be no-ops
because the source has zero rows.
To implement this when the source is a local or reference table involves
teaching function `ExcuteSourceAtCoordAndRedistribution()` in
`merge_executor.c` to not prune tasks when the query has `WHEN NOT
MATCHED BY SOURCE` but to instead replace the task's query to one that
uses an empty relation as source. And when the source is a distributed
query, function
`ExecuteMergeSourcePlanIntoColocatedIntermediateResults()` (also in
`merge_executor.c`) instead of skipping empty tasks now generates a
query that uses an empty relation as source for the corresponding target
shard of the distributed table, but again only when the query has `WHEN
NOT MATCHED BY SOURCE`. A new function `BuildEmptyResultQuery()` is
added to `recursive_planning.c` and it is used by both the
aforementioned functions in `merge_executor.c` to build an empty
relation to use as the source. It applies the appropriate type to each
column of the empty relation so the join with the target makes sense to
the query compiler.
DESCRIPTION: Drops PG14 support
1. Remove "$version_num" != 'xx' from configure file
2. delete all PG_VERSION_NUM = PG_VERSION_XX references in the code
3. Look at pg_version_compat.h file, remove all _compat functions etc
defined specifically for PGXX differences
4. delete all PG_VERSION_NUM >= PG_VERSION_(XX+1), PG_VERSION_NUM <
PG_VERSION_(XX+1) ifs in the codebase
5. delete ruleutils_xx.c file
6. cleanup normalize.sed file from pg14 specific lines
7. delete all alternative output files for that particular PG version,
server_version_ge variable helps here
This PR provides successful compilation against PG17.0.
- Remove ExecFreeExprContext call
Relevant PG commit
d060e921ea5aa47b6265174c32e1128cebdbc3df
d060e921ea
- PG17 uses streaming IO in analyze, fix scan_analyze_next_block function
Relevant PG commit
041b96802efa33d2bc9456f2ad946976b92b5ae1
041b96802e
- Define ObjectClass for PG17+ only since it's removed
Relevant PG commit:
89e5ef7e21812916c9cf9fcf56e45f0f74034656
89e5ef7e21
- Remove ReorderBufferTupleBuf structure.
Relevant PG commit:
08e6344fd6423210b339e92c069bb979ba4e7cd6
08e6344fd6
- Define colliculocale and daticulocale since they have been renamed
Relevant PG commit:
f696c0cd5f299f1b51e214efc55a22a782cc175d
f696c0cd5f
- makeStringConst defined in PG17
Relevant PG commit:
de3600452b61d1bc3967e9e37e86db8956c8f577
de3600452b
- RangeVarCallbackOwnsTable was replaced by RangeVarCallbackMaintainsTable
Relevant PG commit:
ecb0fd33720fab91df1207e85704f382f55e1eb7
ecb0fd3372
- attstattarget is nullable, define pg compatible functions for it
Relevant PG commit:
4f622503d6de975ac87448aea5cea7de4bc140d5
4f622503d6
- stxstattarget is nullable in PG17, write compat functions for it
Relevant PG commit:
012460ee93c304fbc7220e5b55d9d0577fc766ab
012460ee93
- Use ResourceOwner to track WaitEventSet in PG17
Relevant PG commit:
50c67c2019ab9ade8aa8768bfe604cd802fe8591
50c67c2019
- getIdentitySequence now uses Relation instead of relation_id
Relevant PG commit:
509199587df73f06eda898ae13284292f4ae573a
509199587d
- Remove no-op tuplestore_donestoring function
Relevant PG commit:
75680c3d805e2323cd437ac567f0677fdfc7b680
75680c3d80
- MergeAction can have 3 merge kinds (now enum) in PG17, write compat
Relevant PG commit:
0294df2f1f842dfb0eed79007b21016f486a3c6c
0294df2f1f
- EXPLAIN (MEMORY) is added, make changes to ExplainOnePlan
Relevant PG commit:
5de890e3610d5a12cdaea36413d967cf5c544e20
5de890e361
- LIMIT_OPTION_DEFAULT has been removed as it's useless, use LIMIT_OPTION_COUNT
Relevant PG commit:
a6be0600ac3b71dda8277ab0fcbe59ee101ac1ce
a6be0600ac
- write compat for create_foreignscan_path bcs of more arguments in PG17
Relevant PG commit:
9e9931d2bf40e2fea447d779c2e133c2c1256ef3
9e9931d2bf
- pgprocno and lxid have been combined into a struct in PGPROC
Relevant PG commits:
28f3915b73f75bd1b50ba070f56b34241fe53fd1
28f3915b73
ab355e3a88de745607f6dd4c21f0119b5c68f2ad
ab355e3a88
024c521117579a6d356050ad3d78fdc95e44eefa
024c521117
- Simplify CitusNewNode (#7434)
postgres refactored newNode() in PG 17, the main point for doing this is
the original tricks is no longer neccessary for modern compilers[1].
This does the same for Citus.
This should have no backward compatibility issues since it just replaces
palloc0fast with palloc0.
This is good for forward compatibility since palloc0fast no longer
exists in PG 17.
[1]
https://www.postgresql.org/message-id/b51f1fa7-7e6a-4ecc-936d-90a8a1659e7c@iki.fi
(cherry picked from commit 4b295cc)
This is prep work for successful compilation with PG17
PG17added foreach_ptr, foreach_int and foreach_oid macros
Relevant PG commit
14dd0f27d7cd56ffae9ecdbe324965073d01a9ff
14dd0f27d7
We already have these macros, but they are different with the
PG17 ones because our macros take a DECLARED variable, whereas
the PG16 macros declare a locally-scoped loop variable themselves.
Hence I am renaming our macros to foreach_declared_
I am separating this into its own PR since it touches many files. The
main compilation PR is https://github.com/citusdata/citus/pull/7699
Since Postgres commit da9b580d files and directories are supposed to
be created with pg_file_create_mode and pg_dir_create_mode permissions
when default permissions are expected.
This fixes a failure of one of the postgres tests:
If we create file add.conf containing
```
shared_preload_libraries='citus'
```
and run postgres tests
```
TEMP_CONFIG=/path/to/add.conf make installcheck -C src/bin/pg_ctl/
```
then 001_start_stop.pl fails with
```
.../data/base/pgsql_job_cache mode must be 0750
```
in the log.
In passing this also stops creating directories that we haven't used
since Citus 7.4
This change explicitely doesn't change permissions of certificates/keys
that we create.
---------
Co-authored-by: Karina Litskevich <litskevichkarina@gmail.com>
ExecuteTaskListIntoTupleDestWithParam and ExecuteTaskListIntoTupleDest
are nearly the same. I parameterized and a made a reusable structure
here
---------
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
This change refactors the code by using generate_qualified_relation_name
from id instead of using a sequence of functions to generate the
relation name.
Fixes#6602
This fixes#7230.
First of all, using HeapTupleHeaderGetDatumLength(heapTuple) is
definetly wrong, it gives a number that's 4 times less than the correct
tuple size (heapTuple.t_len). See
https://github.com/postgres/postgres/blob/REL_16_0/src/include/access/htup_details.h#L455-L456https://github.com/postgres/postgres/blob/REL_16_0/src/include/varatt.h#L279https://github.com/postgres/postgres/blob/REL_16_0/src/include/varatt.h#L225-L226
When I fixed it, the limit_intermediate_size test failed, so I tried to
understand what's going on there. In original commit fd546cf these
queries were supposed to fail. Then in b3af63c three of the queries that
were supposed to fail suddenly worked and tests were changed to pass
without understanding why the output had changed or how to keep test
testing what it had to test. Even comments saying that these queries
should fail were left untouched. Commit message gives no clue about why
exactly test has changed:
> It seems that when we use adaptive executor instead of task tracker,
we
> exceed the intermediate result size less in the test. Therefore
updated
> the tests accordingly.
Then 3fda2c3 also blindly raised the limit for one of the queries to
keep it working:
3fda2c3254 (diff-a9b7b617f9dfd345318cb8987d5897143ca1b723c87b81049bbadd94dcc86570R19)
When in fe3caf3 that HeapTupleHeaderGetDatumLength(heapTuple) call was
finally added, one of those test queries became failing again.
The other two of them now also failing after the fix. I don't understand
how exactly the calculation of "intermediate result size" that is
limited by citus.max_intermediate_result_size had changed through
b3af63c and fe3caf3, but these numbers are now closer to what
they originally were when this limitation was added in
fd546cf. So these queries should fail, like in the original
version of the limit_intermediate_size test.
Co-authored-by: Karina Litskevich <litskevichkarina@gmail.com>
This change adds a script to programatically group all includes in a
specific order. The script was used as a one time invocation to group
and sort all includes throught our formatted code. The grouping is as
follows:
- System includes (eg. `#include<...>`)
- Postgres.h (eg. `#include "postgres.h"`)
- Toplevel imports from postgres, not contained in a directory (eg.
`#include "miscadmin.h"`)
- General postgres includes (eg . `#include "nodes/..."`)
- Toplevel citus includes, not contained in a directory (eg. `#include
"citus_verion.h"`)
- Columnar includes (eg. `#include "columnar/..."`)
- Distributed includes (eg. `#include "distributed/..."`)
Because it is quite hard to understand the difference between toplevel
citus includes and toplevel postgres includes it hardcodes the list of
toplevel citus includes. In the same manner it assumes anything not
prefixed with `columnar/` or `distributed/` as a postgres include.
The sorting/grouping is enforced by CI. Since we do so with our own
script there are not changes required in our uncrustify configuration.
Similar to https://github.com/citusdata/citus/pull/7077.
As PG 16+ has changed the join restriction information for certain outer
joins, MERGE is also impacted given that is is also underlying an outer
join.
See #7077 for the details.
Prior to this commit, the code would skip processing the
errors happened for local commands.
Prior to https://github.com/citusdata/citus/pull/5379, it might
make sense to allow the execution continue. But, as of today,
if a modification fails on any placement, we can safely fail
the execution.
The first commit show the problem in action. The second commit
includes the fix and the test fixes.
This commit is the second and last phase of dropping PG13 support.
It consists of the following:
- Removes all PG_VERSION_13 & PG_VERSION_14 from codepaths
- Removes pg_version_compat entries and columnar_version_compat entries
specific for PG13
- Removes alternative pg13 test outputs
- Removes PG13 normalize lines and fix the test outputs based on that
It is a continuation of 5bf163a27d
We need to rewind the tuplestorestate's tuple index to get correct
results on fetching scrollable with hold cursors.
`PersistHoldablePortal` is responsible for persisting out
tuplestorestate inside a with hold cursor before commiting a
transaction.
It rewinds the cursor like below (`ExecutorRewindcalls` calls `rescan`):
```c
if (portal->cursorOptions & CURSOR_OPT_SCROLL)
{
ExecutorRewind(queryDesc);
}
```
At the end, it adjusts tuple index for holdStore in the portal properly.
```c
if (portal->cursorOptions & CURSOR_OPT_SCROLL)
{
if (!tuplestore_skiptuples(portal->holdStore,
portal->portalPos,
true))
elog(ERROR, "unexpected end of tuple stream");
}
```
DESCRIPTION: Fixes incorrect results on fetching scrollable with hold
cursors.
Fixes https://github.com/citusdata/citus/issues/7010
1) For distributed tables that are not colocated.
2) When joining on a non-distribution column for colocated tables.
3) When merging into a distributed table using reference or citus-local tables as the data source.
This is accomplished primarily through the implementation of the following two strategies.
Repartition: Plan the source query independently,
execute the results into intermediate files, and repartition the files to
co-locate them with the merge-target table. Subsequently, compile a final
merge query on the target table using the intermediate results as the data
source.
Pull-to-coordinator: Execute the plan that requires evaluation at the coordinator,
run the query on the coordinator, and redistribute the resulting rows to ensure
colocation with the target shards. Direct the MERGE SQL operation to the worker
nodes' target shards, using the intermediate files colocated with the data as the
data source.
PG16beta1 added some sanity checks for GUCS, find the Relevant PG
commits below:
1- Add check on initial and boot values when loading GUCs
a73952b795
2- Extend check_GUC_init() with checks on flag combinations when loading
GUCs
009f8d1714
I fixed our currently problematic GUCS, we can merge this directly into
main as these make sense for any PG version.
There was a particular NodeConninfo issue:
Previously we would rely on the fact that NodeConninfo initial value
is an empty string. However, with PG16 enforcing same initial and boot
values, we can't use an empty initial value for NodeConninfo anymore.
Therefore we add a new flag to indicate whether we are at boot check.
DESCRIPTION: Enabling citus_stat_tenants to support schema-based
tenants.
This pull request modifies the existing logic to enable tenant
monitoring with schema-based tenants. The changes made are as follows:
- If a query has a partitionKeyValue (which serves as a tenant
key/identifier for distributed tables), Citus annotates the query with
both the partitionKeyValue and colocationId. This allows for accurate
tracking of the query.
- If a query does not have a partitionKeyValue, but its colocationId
belongs to a distributed schema, Citus annotates the query with only the
colocationId. The tenant monitor can then easily look up the schema to
determine if it's a distributed schema and make a decision on whether to
track the query.
---------
Co-authored-by: Jelte Fennema <jelte.fennema@microsoft.com>
DESCRIPTION: Fixes a crash when explain analyze is requested for a query
that is normally locally executed.
When explain analyze is requested for a query, a task with two queries
is created. Those two queries are
1. Wrapped Query --> `SELECT ... FROM
worker_save_query_explain_analyze(<query>, <explain analyze options>)`
2. Fetch Query -->` SELECT explain_analyze_output, execution_duration
FROM worker_last_saved_explain_analyze();`
When the query is locally executed a task with multiple queries causes a
crash in production. See the Assert at
57455dc64d/src/backend/distributed/executor/tuple_destination.c#:~:text=Assert(task%2D%3EqueryCount%20%3D%3D%201)%3B
This becomes a critical issue when auto_explain extension is used. When
auto_explain extension is enabled, explain analyze is automatically
requested for every query.
One possible solution could be not to create two queries for a locally
executed query. The fetch part may not have to be a query since the
values are available in local variables.
Until we enable local execution for explain analyze, it is best to
disable local execution.
Fixes#6777.
We mark objects as distributed objects in Citus metadata only if we need
to propagate given the command that creates it to worker nodes. For this
reason, we were not doing this for the objects that are created while
pg_dist_node is empty.
One implication of doing so is that we defer the schema propagation to
the time when user creates the first distributed table in the schema.
However, this doesn't help for schema-based sharding (#6866) because we
want to sync pg_dist_tenant_schema to the worker nodes even for empty
schemas too.
* Support test dependencies for isolation tests without a schedule
* Comment out a test due to a known issue (#6901)
* Also, reduce the verbosity for some log messages and make some
tests compatible with run_test.py.
This PR updates the tenant stats implementation to set partitionKeyValue
and colocationId in ExecuteLocalTaskListExtended, in addition to
LocallyExecuteTaskPlan. This ensures that tenant stats can be properly
gathered regardless of the code path taken. The changes were initially
made while testing stored procedure calls for tenant stats.
DESCRIPTION: Adds views that monitor statistics on tenant usages
This PR adds `citus_stats_tenants` view that monitors the tenants on the
cluster.
`citus_stats_tenants` shows the node id, colocation id, tenant
attribute, read count in this period and last period, and query count in
this period and last period of the tenant.
Tenant attribute currently is the tenant's distribution column value,
later when schema based sharding is introduced, this meaning might
change.
A period is a time bucket the queries are counted by. Read and query
counts for this period can increase until the current period ends. After
that those counts are moved to last period's counts, which cannot
change. The period length can be set using 'citus.stats_tenants_period'.
`SELECT` queries are counted as _read_ queries, `INSERT`, `UPDATE` and
`DELETE` queries are counted as _write_ queries. So in the view read
counts are `SELECT` counts and query counts are `SELECT`, `INSERT`,
`UPDATE` and `DELETE` count.
The data is stored in shared memory, in a struct named
`MultiTenantMonitor`.
`citus_stats_tenants` shows the data from local tenants.
`citus_stats_tenants` show up to `citus.stats_tenant_limit` number of
tenants.
The tenants are scored based on the number of queries they run and the
recency of those queries. Every query ran increases the score of tenant
by `ONE_QUERY_SCORE`, and after every period ends the scores are halved.
Halving is done lazily.
To retain information a longer the monitor keeps up to 3 times
`citus.stats_tenant_limit` tenants. When the tenant count hits `3 *
citus.stats_tenant_limit`, last `citus.stats_tenant_limit` tenants are
removed. To see all stored tenants you can use
`citus_stats_tenants(return_all_tenants := true)`
- [x] Create collector view that gets data from all nodes. #6761
- [x] Add monitoring log #6762
- [x] Create enable/disable GUC #6769
- [x] Parse the annotation string correctly #6796
- [x] Add local queries and prepared statements #6797
- [x] Rename to citus_stat_statements #6821
- [x] Run pgbench
- [x] Fix role permissions #6812
---------
Co-authored-by: Gokhan Gulbiz <ggulbiz@gmail.com>
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
Description:
Implementing CDC changes using Logical Replication to avoid
re-publishing events multiple times by setting up replication origin
session, which will add "DoNotReplicateId" to every WAL entry.
- shard splits
- shard moves
- create distributed table
- undistribute table
- alter distributed tables (for some cases)
- reference table operations
The citus decoder which will be decoding WAL events for CDC clients,
ignores any WAL entry with replication origin that is not zero.
It also maps the shard names to distributed table names.
We should not omit to free PGResult when we receive single tuple result
from an internal backend.
Single tuple results are normally freed by our ReceiveResults for
`tupleDescriptor != NULL` flow but not for those with `tupleDescriptor
== NULL`. See PR #6722 for details.
DESCRIPTION: Fixes memory leak issue with query results that returns
single row.
DESCRIPTION: Fix foreign key validation skip at the end of shard move
In eadc88a we started completely skipping foreign key constraint
validation at the end of a non blocking shard move, instead of only for
foreign keys to reference tables. However, it turns out that this didn't
work at all because of a hard to notice bug: By resetting the
SkipConstraintValidation flag at the end of our utility hook, we
actually make the SET command that sets it a no-op.
This fixes that bug by removing the code that resets it. This is fine
because #6543 removed the only place where we set the flag in C code. So
the resetting of the flag has no purpose anymore. This PR also adds a
regression test, because it turned out we didn't have any otherwise we
would have caught that the feature was completely broken.
It also moves the constraint validation skipping to the utility hook.
The reason is that #6550 showed us that this is the better place to skip
it, because it will also skip the planning phase and not just the
execution.
Before this commit, we created an additional WaitEventSet for
checking whether the remote socket is closed per connection -
only once at the start of the execution.
However, for certain workloads, such as pgbench select-only
workloads, the creation/deletion of the additional WaitEventSet
adds ~7% CPU overhead, which is also reflected on the benchmark
results.
With this commit, we use the same WaitEventSet for the purposes
of checking the remote socket at the start of the execution.
We use "rebuildWaitEventSet" flag so that the executor can re-use
the existing WaitEventSet.
As a result, we see the following improvements on PG 15:
main : 120051 tps, 0.532 ms latency avg.
avoid_wes_rebuild: 127119 tps, 0.503 ms latency avg.
And, on PG 14, as expected, there is no difference
main : 129191 tps, 0.495 ms latency avg.
avoid_wes_rebuild: 129480 tps, 0.494 ms latency avg.
But, note that PG 15 is slightly (~1.5%) slower than PG 14.
That is probably the overhead of checking the remote socket.
PostgreSQL 15 exposes WL_SOCKET_CLOSED in WaitEventSet API, which is
useful for detecting closed remote sockets. In this patch, we use this
new event and try to detect closed remote sockets in the executor.
When a closed socket is detected, the executor now has the ability to
retry the connection establishment. Note that, the executor can retry
connection establishments only for the connection that has not been
used. Basically, this patch is mostly useful for preventing the executor
to fail if a cached connection is closed because of the worker node
restart (or worker failover).
In other words, the executor cannot retry connection establishment if we
are in a distributed transaction AND any command has been sent over the
connection. That requires more sophisticated retry mechanisms. For now,
fixing the above use case is enough.
Fixes#5538
Earlier discussions: #5908, #6259 and #6283
### Summary of the current approach regards to earlier trials
As noted, we explored some alternatives before getting into this.
https://github.com/citusdata/citus/pull/6283 is simple, but lacks an
important property. We should be checking for `WL_SOCKET_CLOSED`
_before_ sending anything over the wire. Otherwise, it becomes very
tricky to understand which connection is actually safe to retry. For
example, in the current patch, we can safely check
`transaction->transactionState == REMOTE_TRANS_NOT_STARTED` before
restarting a connection.
#6259 does what we intent here (e.g., check for sending any command).
However, as @marcocitus noted, it is very tricky to handle
`WaitEventSets` in multiple places. And, the executor is designed such
that it reacts to the events. So, adding anything `pre-executor` seemed
too ugly.
In the end, I converged into this patch. This patch relies on the
simplicity of #6283 and also does a very limited handling of
`WaitEventSets`, just for our purpose. Just before we add any connection
to the execution, we check if the remote session has already closed.
With that, we do a brief interaction of multiple wait event processing,
but with different purposes. The new wait event processing we added does
not even consider cancellations. We let that handled by the main event
processing loop.
Co-authored-by: Marco Slot <marco.slot@gmail.com>
Introduces a new GUC named citus.skip_constraint_validation, which basically skips constraint validation when set to on.
For some several places that we hack to skip the foreign key validation phase, now we use this GUC.
This is a refactoring PR that starts using our new hash table creation
helper function. It adds a few more macros for ease of use, because C
doesn't have default arguments. It also adds a macro to check if a
struct contains automatic padding bytes. No struct that is hashed using
tag_hash should have automatic padding bytes, because those bytes are
undefined and thus using them to create a hash will result in undefined
behaviour (usually a random hash).
While testing 5670dffd33, I realized
that we have a missing RecordNonDistTableAccessesForTask() for
local utility commands.
Although we don't have to record the relation access for local
only cases, we really want to keep the behaviour for scale-out
be the same with single node on all aspects. We wouldn't want
any single node complex transaction to work on single machine,
but not on multi node cluster. Hence, we apply the same restrictions.
For example, on a distributed cluster, the following errors, and
after this commit this errors locally as well
```SQL
CREATE TABLE ref(a int primary key);
INSERT INTO ref VALUES (1);
CREATE TABLE dist(a int REFERENCES ref(a));
SELECT create_reference_table('ref');
SELECT create_distributed_table('dist', 'a');
BEGIN;
SELECT * FROM dist;
TRUNCATE ref CASCADE;
ERROR: cannot execute DDL on table "ref" because there was a parallel SELECT access to distributed table "dist" in the same transaction
HINT: Try re-running the transaction with "SET LOCAL citus.multi_shard_modify_mode TO 'sequential';"
COMMIT;
```
We also add the comprehensive test suite and run the same locally.