In preparation for sorting and grouping all includes, we wanted to move
this file to the top-level includes for good grouping/sorting.
(cherry picked from commit 0dac63afc0)
What we do to collect foreign key constraint commands in
WorkerCreateShardCommandList is quite similar to what we do in
CopyShardForeignConstraintCommandList. Plus, the code that we previously
used in WorkerCreateShardCommandList was not able to properly handle
foreign key constraints between Citus local tables, such as when creating
a reference table from the referencing one.
With a few slight modifications made to
CopyShardForeignConstraintCommandList, we can use the same logic in
WorkerCreateShardCommandList too.
This PR provides successful compilation against PG16Beta2. It does some
necessary refactoring to prepare for full support of PG16, tracked in
https://github.com/citusdata/citus/pull/6952.
Change RelFileNode to RelFileNumber or RelFileLocator
Relevant PG commit:
b0a55e43299c4ea2a9a8c757f9c26352407d0ccc
new header for varatt.h
Relevant PG commit:
d952373a987bad331c0e499463159dd142ced1ef
drop support for Abs, use fabs
Relevant PG commit:
357cfefb09115292cfb98d504199e6df8201c957
tuplesort changes
Relevant PG commit:
d37aa3d35832afde94e100c4d2a9618b3eb76472
Fix vacuum in columnar
Relevant PG commit:
4ce3afb82ecfbf64d4f6247e725004e1da30f47c
older one:
b6074846cebc33d752f1d9a66e5a9932f21ad177
Add alloc_flags to pg_clean_ascii
Relevant PG commit:
45b1a67a0fcb3f1588df596431871de4c93cb76f
Merge GetNumConfigOptions() into get_guc_variables()
Relevant PG commit:
3057465acfbea2f3dd7a914a1478064022c6eecd
Minor PG refactor: PG_FUNCNAME_MACRO replaced with __func__
Relevant PG commit:
320f92b744b44f961e5d56f5f21de003e8027a7f
Pass NULL context to stringToQualifiedNameList, typeStringToTypeName
Before PG16, stringToQualifiedNameList and typeStringToTypeName reported
errors with ereport(ERROR, ...).
With PG16, these functions take a context input. We preserve the same
behaviour by passing a NULL context, because of the following
(comment copied from PG16):
If "context" isn't an ErrorSaveContext node, this behaves as
errstart(ERROR, domain), and the errsave() macro ends up acting
exactly like ereport(ERROR, ...).
Relevant PG commit:
858e776c84f48841e7e16fba7b690b76e54f3675
Use RangeVarCallbackMaintainsTable instead of RangeVarCallbackOwnsTable
Relevant PG commit:
60684dd834a222fefedd49b19d1f0a6189c1632e
FIX THIS: grant-level control of role inheritance is not implemented yet
see PG commit
e3ce2de09d814f8770b2e3b3c152b7671bcdb83f
Make Scan node abstract
PG commit:
8c73c11a0d39049de2c1f400d8765a0eb21f5228
Change in Var representations, get_relids_in_jointree
PG commit:
2489d76c4906f4461a364ca8ad7e0751ead8aa0d
Deadlock detection changes because SHM_QUEUE is removed
Relevant PG Commit:
d137cb52cb7fd44a3f24f3c750fbf7924a4e9532
TU_UpdateIndexes
Relevant PG commit:
19d8e2308bc51ec4ab993ce90077342c915dd116
Use object_ownercheck and object_aclcheck functions
Relevant PG commits:
afbfc02983f86c4d71825efa6befd547fe81a926
c727f511bd7bf3c58063737bcf7a8f331346f253
Rework Permission Info for successful compilation
Relevant PG commits:
postgres/postgres@a61b1f7
postgres/postgres@b803b7d
---------
Co-authored-by: onderkalaci <onderkalaci@gmail.com>
Index scans in PG16 return empty sets because of extra compatibility
enforcement for `ScanKeyInit` arguments.
Possibly relevant PG commit:
c8b2ef05f4
This PR fixes all incompatible `RegProcedure` and `Datum` arguments in
all `ScanKeyInit` calls throughout the codebase.
Helpful for https://github.com/citusdata/citus/pull/6952
One problem with rebalancing by disk size is that shards in newly
created colocation groups are considered extremely small. This can
easily result in bad balances if there are some other colocation groups
that do have some data. An extremely bad example of this:
1. You have 2 workers
2. Both contain about 100GB of data, but there's a 70MB difference.
3. You create 100 new distributed schemas with a few empty tables in
them
4. You run the rebalancer
5. Now all new distributed schemas are placed on the node that had
70MB less.
6. You start loading some data in these shards and quickly the balance
is completely off
To address this edge case, this PR changes the by_disk_size rebalance
strategy to add a base size of 100MB to the actual size of each
shard group. This can still result in a bad balance when shard groups
are empty, but it solves some of the worst cases.
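As a rough usage sketch (the 100MB base is applied internally by the strategy, so the invocation itself is unchanged):
```sql
-- rebalance using the by_disk_size strategy; each shard group's size is now
-- treated as its actual size plus a 100MB base, so freshly created (empty)
-- colocation groups no longer look like free space to pile onto one node
SELECT rebalance_table_shards(rebalance_strategy := 'by_disk_size');
```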
This commit is the second and last phase of dropping PG13 support.
It consists of the following:
- Removes all PG_VERSION_13 & PG_VERSION_14 references from code paths
- Removes pg_version_compat entries and columnar_version_compat entries
specific to PG13
- Removes alternative PG13 test outputs
- Removes PG13 normalize lines and fixes the test outputs accordingly
It is a continuation of 5bf163a27d
The citus_shard_sizes view had a shard name column that we used to extract
the shard id. This PR changes the column to the shard id, so we don't do
unnecessary string operations.
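A quick sketch of querying the changed view (column names here follow this description; check the view definition for the exact names):
```sql
-- list the ten largest shards, using the shard id directly instead of
-- parsing it out of a shard name
SELECT shard_id, size
FROM citus_shard_sizes
ORDER BY size DESC
LIMIT 10;
```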
Creating a second PR to make reviewing easier.
This PR tests:
- replicate_reference_tables
- fix_partition_shard_index_names
- isolate_tenant_to_new_shard
- replicate_table_shards
DESCRIPTION: Fixes a bug in background shard rebalancer where the
replicate reference tables task fails if the current user is not a
superuser.
This change is to be backported to earlier releases. We should fix the
permissions for replicate_reference_tables on main branch such that it
can be run by non-superuser roles.
Fixes #6925.
Fixes #6926.
With this PR, we allow creating distributed tables without
specifying a shard key via create_distributed_table(). Here are the
important details about those tables:
* Specifying `shard_count` is not allowed because it is assumed to be 1.
* We mostly refer to such tables as "null shard-key" tables in code /
comments.
* To avoid a breaking layout change in create_distributed_table(),
instead of throwing an error it informs the user that the
`distribution_type` param is ignored unless it's explicitly set to NULL
or 'h'.
* `colocate_with` param allows colocating such null shard-key tables to
each other.
* We define this table type, i.e., NULL_SHARD_KEY_TABLE, as a subclass of
DISTRIBUTED_TABLE because we mostly want to treat them as distributed
tables in terms of SQL / DDL / operation support.
* Metadata for such tables looks like:
- distribution method => DISTRIBUTE_BY_NONE
- replication model => REPLICATION_MODEL_STREAMING
- colocation id => **!=** INVALID_COLOCATION_ID (distinguishes from
Citus local tables)
* We assign colocation groups for such tables to different nodes in a
round-robin fashion based on the modulo of "colocation id".
Note that this PR doesn't cover DDL (except CREATE TABLE) / SQL /
operation (i.e., Citus UDFs) support for such tables, but adds a
preliminary API.
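A minimal sketch of the intended usage, with illustrative table names:
```sql
-- create a null shard-key table by passing NULL as the distribution column
CREATE TABLE events (id bigserial PRIMARY KEY, payload jsonb);
SELECT create_distributed_table('events', null);

-- colocate another null shard-key table with it
CREATE TABLE event_details (event_id bigint, details text);
SELECT create_distributed_table('event_details', null, colocate_with => 'events');
```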
We handle colocation groups whose shard group count is less than the
worker node count using a method different from the usual rebalancer.
See #6739.
When deciding whether to use this method, we should have ignored the
nodes that are marked `shouldhaveshards = false`. This PR excludes those
nodes when making the decision.
Adds a test such that:
coordinator: []
worker 1: [1_1, 1_2]
worker 2: [2_1, 2_2]
(rebalance)
coordinator: []
worker 1: [1_1, 2_1]
worker 2: [1_2, 2_2]
If we take the coordinator into account, the rebalancer considers the
first state balanced and does nothing (because shard_count <
worker_count).
But with this PR, we ignore the coordinator because it has
shouldhaveshards = false, so the rebalancer distributes each colocation
group across both workers.
Also fixes an unrelated flaky test in the same file.
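A sketch of the scenario, assuming the coordinator listens on localhost:57636:
```sql
-- mark the coordinator as a node that should not hold shards
SELECT citus_set_node_property('localhost', 57636, 'shouldhaveshards', false);

-- the rebalancer now ignores the coordinator when deciding whether the
-- "fewer shard groups than nodes" placement method should be used
SELECT rebalance_table_shards();
```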
DESCRIPTION: Fixes memory errors, caught by valgrind, of type
"conditional jump or move depends on uninitialized value"
When running Citus tests under Postgres with valgrind, the test cases
calling into `NonBlockingShardSplit` function produce valgrind errors of
type "conditional jump or move depends on uninitialized value".
The issue is caused by creating an HTAB the wrong way: the HASH_COMPARE
flag should have been used when creating an HTAB with a user-defined
comparison function. In the absence of the HASH_COMPARE flag, the HTAB
falls back to the built-in string comparison function. However, valgrind
somehow discovers that the match function is not assigned to the
user-defined function as intended.
Fixes #6835
DESCRIPTION: Adds control for background task executors involving a node
### Background and motivation
Nonblocking concurrent task execution via background workers was
introduced in [#6459](https://github.com/citusdata/citus/pull/6459), and
concurrent shard moves in the background rebalancer were introduced in
[#6756](https://github.com/citusdata/citus/pull/6756) - with a hard
dependency that limits to 1 shard move per node. As we know, a shard
move consists of a shard moving from a source node to a target node. The
hard dependency was used because the background task runner didn't have
an option to limit the parallel shard moves per node.
With the motivation of controlling the number of concurrent shard
moves that involve a particular node, either as source or target, this
PR introduces a general new GUC
citus.max_background_task_executors_per_node to be used in the
background task runner infrastructure. So, why do we even want to
control and limit the concurrency? Well, it's all about resource
availability: because the moves involve the same nodes, extra
parallelism won’t make the rebalance complete faster if some resource is
already maxed out (usually CPU or disk). Or, if the cluster is being
used in a production setting, the moves might compete for resources with
production queries much more than if they had been executed
sequentially.
### How does it work?
A new column named nodes_involved is added to the catalog table that
keeps track of the scheduled background tasks,
pg_dist_background_task. It is of type integer[] - to store a list
of node ids. It is NULL by default - the column will be filled by the
rebalancer, but we may not care about the nodes involved in other uses
of the background task runner.
Table "pg_catalog.pg_dist_background_task"
Column | Type
============================================
job_id | bigint
task_id | bigint
owner | regrole
pid | integer
status | citus_task_status
command | text
retry_count | integer
not_before | timestamp with time zone
message | text
+nodes_involved | integer[]
A hashtable named ParallelTasksPerNode keeps track of the number of
parallel running background tasks per node. An entry in the hashtable is
as follows:
ParallelTasksPerNodeEntry
{
    node_id   // the node id is used as the hash table key
    counter   // number of concurrent background tasks that involve node node_id
              // the counter limit is citus.max_background_task_executors_per_node
}
When the background task runner assigns a runnable task to a new
executor, it increments the counter for each of the nodes involved with
that runnable task. The limit of each counter is
citus.max_background_task_executors_per_node. If the limit is reached
for any of the nodes involved, this runnable task is skipped. And then,
later, when the running task finishes, the background task runner
decrements the counter for each of the nodes involved with the done
task. The following functions take care of these increment-decrement
steps:
IncrementParallelTaskCountForNodesInvolved(task)
DecrementParallelTaskCountForNodesInvolved(task)
citus.max_background_task_executors_per_node can be changed on the
fly. In the background rebalancer, we simply give {source_node,
target_node} as the nodesInvolved input to the
ScheduleBackgroundTask function. The rest is taken care of by the
general background task runner infrastructure explained above. Check
background_task_queue_monitor.sql and
background_rebalance_parallel.sql tests for detailed examples.
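A hedged usage sketch (the value 2 is illustrative):
```sql
-- allow at most 2 concurrent background task executors involving any given node
ALTER SYSTEM SET citus.max_background_task_executors_per_node = 2;
SELECT pg_reload_conf();

-- schedule a rebalance in the background; the task runner respects the per-node limit
SELECT citus_rebalance_start();
```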
#### Note
This PR also adds a hard node dependency if a node is first being used
as a source for a move, and then later as a target. The reason this
should be a hard dependency is that the first move might make space for
the second move. So, we could run out of disk space (or at least
overload the node) if we move the second shard to it before the first
one is moved away.
Fixes https://github.com/citusdata/citus/issues/6716
DESCRIPTION: Refactor and unify shard move and copy functions
Shard move and copy functions share a lot of code in common. This PR
unifies these functions into one, along with some helper functions. To
preserve the current behavior, we'll introduce and use an enum
parameter, and hardcoded strings for producing error/warning messages.
This pull request proposes a change to the logic used for propagating
identity columns to worker nodes in Citus. Instead of creating a
dependent sequence for each identity column and changing its default
value to `nextval(seq)/worker_nextval(seq)`, this update passes the
identity columns as-is to the worker nodes.
Please note that there are a few limitations to this change.
1. Only bigint identity columns will be allowed in distributed tables to
ensure compatibility with the DDL from any node functionality. Our
current distributed sequence implementation only allows insert
statements from all nodes for bigint sequences.
2. `alter_distributed_table` and `undistribute_table` operations will
not be allowed for tables with identity columns. This is because we do
not have a proper way of keeping sequence states consistent across the
cluster.
DESCRIPTION: Prevents using identity columns on data types other than
`bigint` on distributed tables
DESCRIPTION: Prevents using `alter_distributed_table` and
`undistribute_table` UDFs when a table has identity columns
DESCRIPTION: Fixes a bug that prevents enforcing identity column
restrictions on worker nodes
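A brief sketch of the restriction, with illustrative tables:
```sql
-- bigint identity columns are propagated to the workers as-is
CREATE TABLE orders (order_id bigint GENERATED ALWAYS AS IDENTITY, customer_id bigint);
SELECT create_distributed_table('orders', 'customer_id');

-- identity columns of other types are rejected for distributed tables
CREATE TABLE orders_int (order_id int GENERATED ALWAYS AS IDENTITY, customer_id bigint);
SELECT create_distributed_table('orders_int', 'customer_id');  -- errors out
```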
Depends on #6740. Fixes #6694.
DESCRIPTION: This PR removes the task dependencies between shard moves
for which the shards belong to different colocation groups. This change
results in scheduling multiple tasks in the RUNNABLE state. Therefore it
is possible that the background task monitor can run them concurrently.
Previously, all the shard moves planned in a rebalance operation took
dependency on each other sequentially.
For instance, given the following table and shards
    colocation group 1               colocation group 2
  table1   table2   table3          table4    table5
  shard11  shard21  shard31         shard41   shard51
  shard12  shard22  shard32         shard42   shard52
if the rebalancer planner returned the below set of moves
` {move(shard11), move(shard12), move(shard41), move(shard42)}`
background rebalancer scheduled them such that they depend on each other
sequentially.
```
{move(reftables) if there is any, none}
         |
    move(shard11)
         |
    move(shard12)
         |    <--- {move(shard41) <--- move(shard12)} is an artificial dependency
    move(shard41)
         |
    move(shard42)
```
This results in artificial dependencies between otherwise independent
moves.
Considering that the shards in different colocation groups can be moved
concurrently, this PR changes the dependency relationship between the
moves as follows:
```
{move(reftables) if there is any, none} {move(reftables) if there is any, none}
| |
move(shard11) move(shard41)
| |
move(shard12) move(shard42)
```
---------
Co-authored-by: Jelte Fennema <jelte.fennema@microsoft.com>
Description:
Implement CDC changes using logical replication to avoid re-publishing
events multiple times by setting up a replication origin session, which
adds "DoNotReplicateId" to every WAL entry, for the following operations:
- shard splits
- shard moves
- create distributed table
- undistribute table
- alter distributed tables (for some cases)
- reference table operations
The Citus decoder, which decodes WAL events for CDC clients, ignores
any WAL entry whose replication origin is not zero.
It also maps shard names to distributed table names.
DESCRIPTION: Check before logicalrep for rebalancer, error if needed
Check if we can use logical replication or not, in case of shard
transfer mode = auto, before executing the shard moves. If we can't,
error out. Before this PR, we used to error out in the middle of shard
moves:
```sql
set citus.shard_count = 4; -- just to get the error sooner
select citus_remove_node('localhost',9702);
create table t1 (a int primary key);
select create_distributed_table('t1','a');
create table t2 (a bigint);
select create_distributed_table('t2','a');
select citus_add_node('localhost',9702);
select rebalance_table_shards();
NOTICE: Moving shard 102008 from localhost:9701 to localhost:9702 ...
NOTICE: Moving shard 102009 from localhost:9701 to localhost:9702 ...
NOTICE: Moving shard 102012 from localhost:9701 to localhost:9702 ...
ERROR: cannot use logical replication to transfer shards of the relation t2 since it doesn't have a REPLICA IDENTITY or PRIMARY KEY
```
Now we check and error out in the beginning, without moving the shards.
fixes: #6727
DESCRIPTION: Fixes a bug in shard copy operations.
For copying shards in both shard move and shard split operations, Citus
uses the COPY statement.
A COPY statement that copies all columns, in the following form
` COPY target_shard FROM STDIN;`
throws an error when there is a GENERATED column in the shard table.
In order to fix this issue, we need to exclude the GENERATED columns in
the COPY and the matching SELECT statements. Hence this fix converts the
COPY and SELECT all statements to the following form:
```
COPY target_shard (col1, col2, ..., coln) FROM STDIN;
SELECT col1, col2, ..., coln FROM source_shard;
```
where (col1, col2, ..., coln) does not include a GENERATED column.
GENERATED column values are created in the target_shard as the values
are inserted.
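An illustrative example (table and shard names are made up):
```sql
CREATE TABLE sensors (
    id bigint,
    reading double precision,
    reading_doubled double precision GENERATED ALWAYS AS (reading * 2) STORED
);
SELECT create_distributed_table('sensors', 'id');

-- a shard copy now uses an explicit column list that skips the generated column, e.g.
--   COPY sensors_102008 (id, reading) FROM STDIN;
-- and reading_doubled is computed again on the target shard
```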
Fixes #6705.
---------
Co-authored-by: Teja Mupparti <temuppar@microsoft.com>
Co-authored-by: aykut-bozkurt <51649454+aykut-bozkurt@users.noreply.github.com>
Co-authored-by: Jelte Fennema <jelte.fennema@microsoft.com>
Co-authored-by: Gürkan İndibay <gindibay@microsoft.com>
DESCRIPTION: Adds logic to distribute unbalanced shards
If the number of shard placements (for a colocation group) is less than
the number of workers, some of the workers will remain empty. With this
PR, we take such shard groups into account, in order to distribute them
across the cluster as evenly as possible.
Example:
```sql
create table t1 (a int primary key);
create table t2 (a int primary key);
create table t3 (a int primary key);
set citus.shard_count =1;
select create_distributed_table('t1','a');
select create_distributed_table('t2','a',colocate_with=>'t1');
select create_distributed_table('t3','a',colocate_with=>'t2');
create table tb1 (a bigint);
create table tb2 (a bigint);
select create_distributed_table('tb1','a');
select create_distributed_table('tb2','a',colocate_with=>'tb1');
select citus_add_node('localhost',9702);
select rebalance_table_shards();
```
Here we have two colocation groups, each with one shard group. Both
shard groups are placed on the first worker node. When we add a new
worker node and try to rebalance table shards, the rebalance planner
considers it well balanced and does nothing. With this PR, the
rebalancer tries to distribute these shard groups evenly across the
cluster as much as possible. For this example, with this PR, the
rebalancer moves one of the shard groups to the second worker node.
fixes: #6715
We have a memory leak during distribution of a table with a lot of
partitions, as we do not release memory in the ExprContext until all
partitions are distributed. We improved two things to resolve the
issue:
1. We create and delete a MemoryContext for each call to
`CreateDistributedTable` for partitions,
2. We rebuild the cache after we insert all the placements, instead of
after each placement for a shard.
DESCRIPTION: Fixes memory leak during distribution of a table with a lot
of partitions and shards.
Fixes https://github.com/citusdata/citus/issues/6572.
DESCRIPTION: Fix background rebalance when reference table has no PK
For the background rebalance, we would always fail if a reference table
that was not replicated to all nodes did not have a PK (or replica
identity), even when we used force_logical or block_writes as the shard
transfer mode. This fixes that and adds some regression tests.
Fixes #6680
citus_job_list() lists all background jobs by simply showing the records
in pg_dist_background_job.
citus_job_status(job_id bigint, raw boolean default false) shows the
status of a single background job by appending a jsonb details column to
the associated row from pg_dist_background_job. If the raw argument is
set, machine readable sizes are used instead of human readable
alternatives.
citus_rebalance_status(raw boolean default false) shows the status of
the last rebalance operation. If the raw argument is set, machine
readable sizes are used instead of human readable alternatives.
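A hedged usage sketch (the job id is illustrative):
```sql
SELECT citus_rebalance_start();                     -- schedules a background rebalance
SELECT * FROM citus_job_list();                     -- all background jobs
SELECT * FROM citus_job_status(1);                  -- details for job 1, human readable sizes
SELECT * FROM citus_rebalance_status(raw => true);  -- last rebalance, machine readable sizes
```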
DESCRIPTION: Drop `SHARD_STATE_TO_DELETE` and use the cleanup records
instead
Drops the shard state that is used to mark shards as orphaned. Now we
insert cleanup records into `pg_dist_cleanup`, so "orphaned" shards will
be dropped either by the maintenance daemon or by internal cleanup calls.
With this PR, we make the "cleanup orphaned shards" functions no-ops, as
they are not needed anymore.
This PR includes some naming changes about placement functions. We don't
need functions that filter orphaned shards, as there will be no orphaned
shards anymore.
We will also be introducing a small script with this PR, for users with
orphaned shards. We'll basically delete the orphaned shard entries from
`pg_dist_placement` and insert cleanup records into `pg_dist_cleanup`
for each one of them, during Citus upgrade.
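For reference, a sketch of how the deferred cleanup can be observed and triggered by hand:
```sql
-- cleanup records for orphaned resources are stored in pg_dist_cleanup
SELECT * FROM pg_dist_cleanup;

-- trigger the cleanup explicitly instead of waiting for the maintenance daemon
CALL citus_cleanup_orphaned_resources();
```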
We also have a lot of flakiness fixes in this PR.
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
DESCRIPTION: Defers cleanup after a failure in shard move or split
We don't need to do a cleanup in case of failure of a shard transfer or
split anymore, because:
* The maintenance daemon will clean them up anyway.
* We trigger a cleanup at the beginning of shard transfers/splits.
* The cleanup-on-failure logic can also fail sometimes, and instead of
the original error, we throw the error raised by the cleanup procedure,
which causes confusion.
DESCRIPTION: Cleanup the shard on the target node in case of a
failed/aborted shard move
Inserts a cleanup record for the moved shard placement on the target
node. If the move operation succeeds, the record will be deleted. If
not, it will remain there to be cleaned up later.
fixes: #6580
DESCRIPTION: Create replication artifacts with unique names
We're creating replication objects with generic names. This prevents us
from enabling parallel shard moves, as two operations might use the same
objects. With this PR, we'll create the objects below with
operation-specific names, by appending the OperationId to the names.
* Subscriptions
* Publications
* Replication Slots
* Users created for subscriptions
When using multiline strings, we occasionally forget to add a single
space at the end of the first line. When this line is concatenated with
the next one, the resulting string has a missing space.
DESCRIPTION: Extend cleanup process for replication artifacts
This PR adds new cleanup record types for:
* Subscriptions
* Replication slots
* Publications
* Users created for subscriptions
We add records for these object types, to `pg_dist_cleanup` during
creation phase. Once the operation is done, in case of success or
failure, we iterate those records and drop the objects. With this PR we
will not be dropping any of these objects during the operation. In
short, we will always be deferring the drop.
One thing that's worth mentioning is that we sort cleanup records before
processing (dropping) them, because of dependency relations among those
objects, e.g. a subscription might depend on a publication. Therefore, we
always drop subscriptions before publications.
We have some renames in this PR:
* `TryDropOrphanedShards` -> `TryDropOrphanedResources`
* `DropOrphanedShardsForCleanup` -> `DropOrphanedResourcesForCleanup`
* `run_try_drop_marked_shards` -> `run_try_drop_marked_resources`
as these functions now process replication artifacts as well.
This PR drops the function `DropAllLogicalReplicationLeftovers` and all
its usages, since we now rely on the deferred drop mechanism.
When debugging issues it's quite useful to see the originating gpid in
the application_name of a query on a worker. This already happens for
most queries, but not for queries created by the rebalancer or by
run_command_on_worker. This adds a gpid to those two application_names
too.
Note that if the GPID of the new application_name is different from
the current GPID of the backend, the backend will continue to keep
the old gpid as its actual GPID. This PR is just meant to make sure
that the application_name is as useful as it can be for users to
look at. Updating of gpids will be done in a follow-up PR, and
adding gpids to all internal connections will make this easier.
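A hedged way to observe this from the coordinator (the exact application_name format is not spelled out here):
```sql
-- each worker backend opened for this command should now report a
-- gpid-bearing application_name
SELECT nodename, result
FROM run_command_on_workers(
    $$ SELECT application_name FROM pg_stat_activity WHERE pid = pg_backend_pid() $$);
```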