**Motivation**
Some customers experienced **out of memory** or **max allocation block
size** errors during metadata sync when they had a lot of shards,
partitions, indexes, or columns. This PR has motivation to prevent those
2 types of memory failures to boost the scalability of Citus and unlock
some customers with huge clusters by letting them **add new nodes** and
**upgrade their Citus version above 11.0** which introduced important
features e.g. query from any node.
**Problems**
Memory errors are caused by the fact that we finish all the metadata
sync operations within a single coordinated transaction,
which causes mainly 3 problems:
1. Collecting metadata sync commands without freeing until the end of
the transaction,
2. Each modification causes PG invalidations related to cache memory. PG
stores those invalidations until the end of transaction (for visibility
guarantees) to notify other backends about the invalidations. As we do a
lot of modifications during the metadata syncing within single
coordinated transaction, PG can sometimes exceed max allocation block
size at worker nodes due to huge invalidation messages,
3. Citus has MetadataCacheMemory for fast access to metadata objects. To
see the effects of the modifications inside the same transaction, we
locally process PG invalidations and rebuild many objects without
freeing invalidated ones until the end of transaction for simplicity.
**Solution**
We decided to add nontransactional mode for metadata sync, where we send
each command in separate transaction and reset memory context after each
transaction. User can switch to nontransactional mode via a GUC if they
hit memory problems during the sync. (Default mode is transactional) We
created a common api for both transactional (old mode) and
nontransactional modes to have a uniform code and to not disturb test
coverage by introducing new code paths.
Below items are addressed for the solution:
- [x] **Commit-1** Add a method to send multiple commands to worker list
reusing bare connections. Change will be useful for metadata sync api,
- [x] **Commit-2** Create MetadataSyncContext api to encapsulate both
transactional and nontransactional modes,
- [x] **Commit-3** Let nontransactional sync mode create transaction per
shell table during dropping the shell tables from worker,
- [x] **Commit-4** Add new metadata sync methods which uses
MetadataSyncContext api so that during the sync we can
1. free memory to prevent OOM,
2. use either transactional or nontransactional modes according to the
GUC `citus.metadata_sync_transaction_mode`.
- [x] **Commit-5** Let `ActivateNode` use new metadata sync api,
- [x] **Commit-6** Let `activate_node_snapshot` use new metadata sync
api,
- [x] **Commit-7** Remove unused old metadata sync methods,
- [x] **Commit-8** Drop table, if exists, during table dependency
creation,
- [x] **Commit-9** Do not enforce distributed transaction at
`EnsureCoordinatorInitiatedOperation`,
- [x] **Commit-10** Do not acquire strict lock on separate transaction
to localhost as we already take the lock before,
- [x] **Commit-11** Let `AddNodeMetadata` to use metadatasync api during
`citus_add_node`,
- [x] **Commit-12** Force activated bare connections to close at
transaction end,
- [x] **Commit-13** Add failure tests for nontransactional metadata sync
mode,
- [x] Verify OOM and max allowed allocation block errors do not happen
with nontransactional sync mode.
DESCRIPTION: Fixes memory leak and max allocation block errors during
metadata syncing.
DESCRIPTION: Introduces nontransactional mode for metadata sync.
DESCRIPTION: Introduces the GUC `citus.metadata_sync_mode` to switch
sync modes.
Add new metadata sync methods which uses MemorySyncContext api so that during the sync we can
- free memory to prevent OOM,
- use either transactional or nontransactional modes according to the GUC .
- Create MetadataSyncContext api to encapsulate
both transactional and nontransactional modes,
- Add a GUC to switch between metadata sync transaction modes.
This pull request proposes a change to the logic used for propagating
identity columns to worker nodes in citus. Instead of creating a
dependent sequence for each identity column and changing its default
value to `nextval(seq)/worker_nextval(seq)`, this update will pass the
identity columns as-is to the worker nodes.
Please note that there are a few limitations to this change.
1. Only bigint identity columns will be allowed in distributed tables to
ensure compatibility with the DDL from any node functionality. Our
current distributed sequence implementation only allows insert
statements from all nodes for bigint sequences.
2. `alter_distributed_table` and `undistribute_table` operations will
not be allowed for tables with identity columns. This is because we do
not have a proper way of keeping sequence states consistent across the
cluster.
DESCRIPTION: Prevents using identity columns on data types other than
`bigint` on distributed tables
DESCRIPTION: Prevents using `alter_distributed_table` and
`undistribute_table` UDFs when a table has identity columns
DESCRIPTION: Fixes a bug that prevents enforcing identity column
restrictions on worker nodes
Depends on #6740Fixes#6694
DESCRIPTION: This PR removes the task dependencies between shard moves
for which the shards belong to different colocation groups. This change
results in scheduling multiple tasks in the RUNNABLE state. Therefore it
is possible that the background task monitor can run them concurrently.
Previously, all the shard moves planned in a rebalance operation took
dependency on each other sequentially.
For instance, given the following table and shards
colocation group 1 colocation group 2
table1 table2 table3 table4 table 5
shard11 shard21 shard31 shard41 shard51
shard12 shard22 shard32 shard42 shard52
if the rebalancer planner returned the below set of moves
` {move(shard11), move(shard12), move(shard41), move(shard42)}`
background rebalancer scheduled them such that they depend on each other
sequentially.
```
{move(reftables) if there is any, none}
|
move( shard11)
|
move(shard12)
| {move(shard41)<--- move(shard12)} This is an artificial dependency
move(shard41)
|
move(shard42)
```
This results in artificial dependencies between otherwise independent
moves.
Considering that the shards in different colocation groups can be moved
concurrently, this PR changes the dependency relationship between the
moves as follows:
```
{move(reftables) if there is any, none} {move(reftables) if there is any, none}
| |
move(shard11) move(shard41)
| |
move(shard12) move(shard42)
```
---------
Co-authored-by: Jelte Fennema <jelte.fennema@microsoft.com>
Description:
Implementing CDC changes using Logical Replication to avoid
re-publishing events multiple times by setting up replication origin
session, which will add "DoNotReplicateId" to every WAL entry.
- shard splits
- shard moves
- create distributed table
- undistribute table
- alter distributed tables (for some cases)
- reference table operations
The citus decoder which will be decoding WAL events for CDC clients,
ignores any WAL entry with replication origin that is not zero.
It also maps the shard names to distributed table names.
Today we allow planning the queries that reference non-colocated tables
if the shards that query targets are placed on the same node. However,
this may not be the case, e.g., after rebalancing shards because it's
not guaranteed to have those shards on the same node anymore.
This commit adds citus.enable_non_colocated_router_query_pushdown GUC
that can be used to disallow planning such queries via router planner,
when it's set to false. Note that the default value for this GUC will be
"true" for 11.3, but we will alter it to "false" on 12.0 to not
introduce
a breaking change in a minor release.
Closes#692.
Even more, allowing such queries to go through router planner also
causes
generating an incorrect plan for the DML queries that reference
distributed
tables that are sharded based on different replication factor settings.
For
this reason, #6779 can be closed after altering the default value for
this
GUC to "false", hence not now.
DESCRIPTION: Adds `citus.enable_non_colocated_router_query_pushdown` GUC
to ensure generating a consistent distributed plan for the queries that
reference non-colocated distributed tables (when set to "false", the
default is "true").
Soon I will be doing some changes related to #692 in router planner
and those changes require updating ~5/6 tests related to router
planning. And to make those test files runnable by run_test.py
multiple times, we need to make some other tests (that they're
run in parallel / they badly depend on) ready for run_test.py too.
Because they're only interested in distributed tables. Even more,
this replaces HasDistributionKey() check with
IsCitusTableType(DISTRIBUTED_TABLE) because this doesn't make a
difference on main and sounds slightly more intuitive. Plus, this
would also allow safely using this function in
https://github.com/citusdata/citus/pull/6773.
This would be useful for testing #6773. This is because, given that
#6773
only adds support for router / fast-path queries, theoretically almost
all
the tests that we have in that test file should work for null-shard-key
tables too (and they indeed do).
I deliberately did not replace multi_router_planner_fast_path.sql with
the one that I'm adding into arbitrary configs because we might still
want to see when we're able to go through fast-path planning for the
usual distributed tables (the ones that have a shard key).
DESCRIPTION: Check before logicalrep for rebalancer, error if needed
Check if we can use logical replication or not, in case of shard
transfer mode = auto, before executing the shard moves. If we can't,
error out. Before this PR, we used to error out in the middle of shard
moves:
```sql
set citus.shard_count = 4; -- just to get the error sooner
select citus_remove_node('localhost',9702);
create table t1 (a int primary key);
select create_distributed_table('t1','a');
create table t2 (a bigint);
select create_distributed_table('t2','a');
select citus_add_node('localhost',9702);
select rebalance_table_shards();
NOTICE: Moving shard 102008 from localhost:9701 to localhost:9702 ...
NOTICE: Moving shard 102009 from localhost:9701 to localhost:9702 ...
NOTICE: Moving shard 102012 from localhost:9701 to localhost:9702 ...
ERROR: cannot use logical replication to transfer shards of the relation t2 since it doesn't have a REPLICA IDENTITY or PRIMARY KEY
```
Now we check and error out in the beginning, without moving the shards.
fixes: #6727
ci/fix_styles.sh were complaining about `black` and `isort` packages are
not found even if I `pipenv install --dev` due to broken lock file. I
regenerated the lock file and now it works fine. We also wanted to
upgrade required python version for the pipfile.
Fixes#6672
2) Move all MERGE related routines to a new file merge_planner.c
3) Make ConjunctionContainsColumnFilter() static again, and rearrange the code in MergeQuerySupported()
4) Restore the original format in the comments section.
5) Add big serial test. Implement latest set of comments
This implements the phase - II of MERGE sql support
Support routable query where all the tables in the merge-sql are distributed, co-located, and both the source and
target relations are joined on the distribution column with a constant qual. This should be a Citus single-task
query. Below is an example.
SELECT create_distributed_table('t1', 'id');
SELECT create_distributed_table('s1', 'id', colocate_with => ‘t1’);
MERGE INTO t1
USING s1 ON t1.id = s1.id AND t1.id = 100
WHEN MATCHED THEN
UPDATE SET val = s1.val + 10
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED THEN
INSERT (id, val, src) VALUES (s1.id, s1.val, s1.src)
Basically, MERGE checks to see if
There are a minimum of two distributed tables (source and a target).
All the distributed tables are indeed colocated.
MERGE relations are joined on the distribution column
MERGE .. USING .. ON target.dist_key = source.dist_key
The query should touch only a single shard i.e. JOIN AND with a constant qual
MERGE .. USING .. ON target.dist_key = source.dist_key AND target.dist_key = <>
If any of the conditions are not met, it raises an exception.
(cherry picked from commit 44c387b978)
This implements MERGE phase3
Support pushdown query where all the tables in the merge-sql are Citus-distributed, co-located, and both
the source and target relations are joined on the distribution column. This will generate multiple tasks
which execute independently after pushdown.
SELECT create_distributed_table('t1', 'id');
SELECT create_distributed_table('s1', 'id', colocate_with => ‘t1’);
MERGE INTO t1
USING s1
ON t1.id = s1.id
WHEN MATCHED THEN
UPDATE SET val = s1.val + 10
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED THEN
INSERT (id, val, src) VALUES (s1.id, s1.val, s1.src)
*The only exception for both the phases II and III is, UPDATEs and INSERTs must be done on the same shard-group
as the joined key; for example, below scenarios are NOT supported as the key-value to be inserted/updated is not
guaranteed to be on the same node as the id distribution-column.
MERGE INTO target t
USING source s ON (t.customer_id = s.customer_id)
WHEN NOT MATCHED THEN - -
INSERT(customer_id, …) VALUES (<non-local-constant-key-value>, ……);
OR this scenario where we update the distribution column itself
MERGE INTO target t
USING source s On (t.customer_id = s.customer_id)
WHEN MATCHED THEN
UPDATE SET customer_id = 100;
(cherry picked from commit fa7b8949a8)
In #6720 I'm adding a `pytest` based testing framework. This adds the
dependencies for those. They have already been [merged into our docker
files][the-process-merge] in the the-process repo preparation for #6720.
But by not having them on our citus main branch it is impossible to
make changes to the Pipfile, because our CI Dockerfiles and master
are out of date.
Since #6720 will need some more discussion and might take a few more
weeks to be merged, this takes out the Pipfile changes. By merging this
PR we can unblock new Pipfile changes.
Unblocks and partially addresses #6766
[the-process-merge]: https://github.com/citusdata/the-process/pull/117
DESCRIPTION: Fixes (pg_dump/pg_upgrade) dependency loop warnings caused
by pg_depend entries inserted by citus_columnar
Fixes#5510.
In the past, having columnar tables in the cluster was causing pg
upgrades to fail when attempting to access columnar metadata. This is
because, pg_dump doesn't see objects that we use for columnar-am related
booking as the dependencies of the tables using columnar-am.
To fix that; in #5456, we inserted some "normal dependency" edges (from
those objects to columnar-am) into pg_depend.
This helped us ensuring the existency of a class of metadata objects
--such as columnar.storageid_seq-- and helped fixing #5437.
However, the normal-dependency edges that we added for indexes on
columnar metadata tables --such columnar.stripe_pkey-- didn't help at
all because they were indeed causing dependency loops (#5510) and
pg_dump was not able to take those dependency edges into the account.
For this reason, this commit deletes those dependency edges so that
pg_dump stops complaining about them. Note that it's not critical to
delete those edges from pg_depend since they're not breaking pg upgrades
but were triggering some warning messages. And given that backporting
a sql change into older versions is hard a lot, we skip backporting
this.
In the past, having columnar tables in the cluster was causing pg
upgrades to fail when attempting to access columnar metadata. This is
because, pg_dump doesn't see objects that we use for columnar-am related
booking as the dependencies of the tables using columnar-am.
To fix that; in #5456, we inserted some "normal dependency" edges (from
those objects to columnar-am) into pg_depend.
This helped us ensuring the existency of a class of metadata objects
--such as columnar.storageid_seq-- and helped fixing #5437.
However, the normal-dependency edges that we added for indexes on
columnar metadata tables --such columnar.stripe_pkey-- didn't help at
all because they were indeed causing dependency loops (#5510) and
pg_dump was not able to take those dependency edges into the account.
For this reason, this commit deletes those dependency edges so that
pg_dump stops complaining about them. Note that it's not critical to
delete those edges from pg_depend since they're not breaking pg upgrades
but were triggering some warning messages. And given that backporting
a sql change into older versions is hard a lot, we skip backporting
this.
This commit hides port numbers in upgrade_columnar_after because the
port numbers assigned to nodes in upgrade schedule differ from the ones
that flaky test detector assigns.
When run_test.py is run for an upgrade_.*_after.sql then, then
automatically run the corresponding uprade_.*_before.sql file first.
This is because all those upgrade_.*_after.sql files depend on the
objects created in upgrade_.*_before.sql files by definition.
Decide core distribution params in CreateCitusTable to reduce the
chances of
creating Citus tables based on incorrect combinations of distribution
method
and replication model params.
Also introduce DistributedTableParams struct to encapsulate the
parameters
that are specific to distributed tables.
So that we can run the tests that require fake_fdw by using minimal
schedule too.
Also move multi_create_fdw.sql up in multi_1_schedule to make it
available to more tests.
Now that we will soon add another table type having DISTRIBUTE_BY_NONE
as distribution method and that we want the code to interpret such
tables mostly as distributed tables, let's make the definition of those
other two table types more strict by removing
CITUS_TABLE_WITH_NO_DIST_KEY
macro.
And instead, use HasDistributionKey() check in the places where the
logic applies to all table types that have / don't have a distribution
key. In future PRs, we might want to convert some of those
HasDistributionKey() checks if logic only applies to Citus local /
reference tables, not the others.
And adding HasDistributionKey() also allows us to consider having
DISTRIBUTE_BY_NONE as the distribution method as a "table attribute"
that can apply to distributed tables too, rather something that
determines the table type.
Split the main logic that allows creating a Citus table into the
internal function CreateCitusTable().
Old CreateDistributedTable() function was assuming that it's creating
a reference table when the distribution method is DISTRIBUTE_BY_NONE.
However, soon this won't be the case when adding support for creating
single-shard distributed tables because their distribution method would
also be the same.
Now the internal method CreateCitusTable() doesn't make any assumptions
about table's replication model or such. Instead, it expects callers to
properly set all such metadata bits.
Even more, some of the parameters the old CreateDistributedTable() takes
--such as the shard count-- were not meaningful for a reference table,
and would be the same as for new table type.
DESCRIPTION: Fixes a bug in shard copy operations.
For copying shards in both shard move and shard split operations, Citus
uses the COPY statement.
A COPY all statement in the following form
` COPY target_shard FROM STDIN;`
throws an error when there is a GENERATED column in the shard table.
In order to fix this issue, we need to exclude the GENERATED columns in
the COPY and the matching SELECT statements. Hence this fix converts the
COPY and SELECT all statements to the following form:
```
COPY target_shard (col1, col2, ..., coln) FROM STDIN;
SELECT (col1, col2, ..., coln) FROM source_shard;
```
where (col1, col2, ..., coln) does not include a GENERATED column.
GENERATED column values are created in the target_shard as the values
are inserted.
Fixes#6705.
---------
Co-authored-by: Teja Mupparti <temuppar@microsoft.com>
Co-authored-by: aykut-bozkurt <51649454+aykut-bozkurt@users.noreply.github.com>
Co-authored-by: Jelte Fennema <jelte.fennema@microsoft.com>
Co-authored-by: Gürkan İndibay <gindibay@microsoft.com>
DESCRIPTION: Adds logic to distribute unbalanced shards
If the number of shard placements (for a colocation group) is less than
the number of workers, it means that some of the workers will remain
empty. With this PR, we consider these shard groups as a colocation
group, in order to make them be distributed evenly as much as possible
across the cluster.
Example:
```sql
create table t1 (a int primary key);
create table t2 (a int primary key);
create table t3 (a int primary key);
set citus.shard_count =1;
select create_distributed_table('t1','a');
select create_distributed_table('t2','a',colocate_with=>'t1');
select create_distributed_table('t3','a',colocate_with=>'t2');
create table tb1 (a bigint);
create table tb2 (a bigint);
select create_distributed_table('tb1','a');
select create_distributed_table('tb2','a',colocate_with=>'tb1');
select citus_add_node('localhost',9702);
select rebalance_table_shards();
```
Here we have two colocation groups, each with one shard group. Both
shard groups are placed on the first worker node. When we add a new
worker node and try to rebalance table shards, the rebalance planner
considers it well balanced and does nothing. With this PR, the
rebalancer tries to distribute these shard groups evenly across the
cluster as much as possible. For this example, with this PR, the
rebalancer moves one of the shard groups to the second worker node.
fixes: #6715