* Currently we do not allow any Citus tables other than Citus local
tables inside a regular schema before executing
`citus_schema_distribute`.
* `citus_schema_undistribute` expects only single shard distributed
tables inside a tenant schema.
DESCRIPTION: Adds the udf `citus_schema_distribute` to convert a regular
schema into a tenant schema.
DESCRIPTION: Adds the udf `citus_schema_undistribute` to convert a
tenant schema back to a regular schema.
---------
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
Citus build with PG16 fails because of the following warnings:
- using char* instead of Datum
- using pointer instead of oid
- candidate function for format attribute
- remove old definition from PG11 compatibility 62bf571ced
This commit fixes the above.
DESCRIPTION: Fixes a bug which causes an error when creating a FOREIGN
KEY constraint without a name if the referenced table is schema
qualified.
In deparsing the `ALTER TABLE s1.t1 ADD FOREIGN KEY (key) REFERENCES
s2.t2; `, command back from its cooked form, we should schema qualify
the REFERENCED table.
Fixes#6982.
`citus_table_type` column of `citus_tables` and `citus_shards` will show
"schema" for tenants schema tables and "distributed" for single shard
tables that are not in a tenant schema.
PG16 removed them. They were already identical to Assert. We can merge
this directly to main branch
Relevant PG commit:
b1099eca8f
b1099eca8f38ff5cfaf0901bb91cb6a22f909bc6
Co-authored-by: onderkalaci <onderkalaci@gmail.com>
Creating a second PR to make reviewing easier.
This PR tests:
- replicate_reference_tables
- fix_partition_shard_index_names
- isolate_tenant_to_new_shard
- replicate_table_shards
Adds Support for Single Shard Tables in
`update_distributed_table_colocation`.
This PR changes checks that make sure tables should be hash distributed
table to hash or single shard distributed tables.
Verify Citus UDFs work well with single shard tables
SUPPORTED
* citus_table_size
* citus_total_relation_size
* citus_relation_size
* citus_shard_sizes
* truncate_local_data_after_distributing_table
* create_distributed_function // test function colocated with a single
shard table
* undistribute_table
* alter_table_set_access_method
UNSUPPORTED - error out for single shard tables
* master_create_empty_shard
* create_distributed_table_concurrently
* create_distributed_table
* create_reference_table
* citus_add_local_table_to_metadata
* citus_split_shard_by_split_points
* alter_distributed_table
DESCRIPTION: Adds citus.enable_schema_based_sharding GUC that allows
sharding the database based on schemas when enabled.
* Refactor the logic that automatically creates Citus managed tables
* Refactor CreateSingleShardTable() to allow specifying colocation id
instead
* Add support for schema-based-sharding via a GUC
### What this PR is about:
Add **citus.enable_schema_based_sharding GUC** to enable schema-based
sharding. Each schema created while this GUC is ON will be considered
as a tenant schema. Later on, regardless of whether the GUC is ON or
OFF, any table created in a tenant schema will be converted to a
single shard distributed table (without a shard key). All the tenant
tables that belong to a particular schema will be co-located with each
other and will have a shard count of 1.
We introduce a new metadata table --pg_dist_tenant_schema-- to do the
bookkeeping for tenant schemas:
```sql
psql> \d pg_dist_tenant_schema
Table "pg_catalog.pg_dist_tenant_schema"
┌───────────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├───────────────┼─────────┼───────────┼──────────┼─────────┤
│ schemaid │ oid │ │ not null │ │
│ colocationid │ integer │ │ not null │ │
└───────────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
"pg_dist_tenant_schema_pkey" PRIMARY KEY, btree (schemaid)
"pg_dist_tenant_schema_unique_colocationid_index" UNIQUE, btree (colocationid)
psql> table pg_dist_tenant_schema;
┌───────────┬───────────────┐
│ schemaid │ colocationid │
├───────────┼───────────────┤
│ 41963 │ 91 │
│ 41962 │ 90 │
└───────────┴───────────────┘
(2 rows)
```
Colocation id column of pg_dist_tenant_schema can never be NULL even
for the tenant schemas that don't have a tenant table yet. This is
because, we assign colocation ids to tenant schemas as soon as they
are created. That way, we can keep associating tenant schemas with
particular colocation groups even if all the tenant tables of a tenant
schema are dropped and recreated later on.
When a tenant schema is dropped, we delete the corresponding row from
pg_dist_tenant_schema. In that case, we delete the corresponding
colocation group from pg_dist_colocation as well.
### Future work for 12.0 release:
We're building schema-based sharding on top of the infrastructure that
adds support for creating distributed tables without a shard key
(https://github.com/citusdata/citus/pull/6867).
However, not all the operations that can be done on distributed tables
without a shard key necessarily make sense (in the same way) in the
context of schema-based sharding. For example, we need to think about
what happens if user attempts altering schema of a tenant table. We
will tackle such scenarios in a future PR.
We will also add a new UDF --citus.schema_tenant_set() or such-- to
allow users to use an existing schema as a tenant schema, and another
one --citus.schema_tenant_unset() or such-- to stop using a schema as
a tenant schema in future PRs.
DESCRIPTION: Fixes a crash when explain analyze is requested for a query
that is normally locally executed.
When explain analyze is requested for a query, a task with two queries
is created. Those two queries are
1. Wrapped Query --> `SELECT ... FROM
worker_save_query_explain_analyze(<query>, <explain analyze options>)`
2. Fetch Query -->` SELECT explain_analyze_output, execution_duration
FROM worker_last_saved_explain_analyze();`
When the query is locally executed a task with multiple queries causes a
crash in production. See the Assert at
57455dc64d/src/backend/distributed/executor/tuple_destination.c#:~:text=Assert(task%2D%3EqueryCount%20%3D%3D%201)%3B
This becomes a critical issue when auto_explain extension is used. When
auto_explain extension is enabled, explain analyze is automatically
requested for every query.
One possible solution could be not to create two queries for a locally
executed query. The fetch part may not have to be a query since the
values are available in local variables.
Until we enable local execution for explain analyze, it is best to
disable local execution.
Fixes#6777.
DESCRIPTION: Fixes a bug in background shard rebalancer where the
replicate reference tables task fails if the current user is not a
superuser.
This change is to be backported to earlier releases. We should fix the
permissions for replicate_reference_tables on main branch such that it
can be run by non-superuser roles.
Fixes#6925.
Fixes#6926.
Previously INSERT .. SELECT planner were pushing down some queries that should not be pushed down due to wrong colocation checks. It was checking whether one of the table in SELECT part and target table are colocated. But now, we check colocation for all tables in SELECT part and the target table.
Another problem with INSERT .. SELECT planner was that some queries, which is valid to be pushed down, were not pushed down due to unnecessary checks which are currently supported. e.g. UNION check. As solution, we reused the pushdown planner checks for INSERT .. SELECT planner.
DESCRIPTION: Fixes a bug that causes incorrectly pushing down some
INSERT .. SELECT queries that we shouldn't
DESCRIPTION: Prevents unnecessarily pulling the data into coordinator
for some INSERT .. SELECT queries
DESCRIPTION: Drops support for pushing down INSERT .. SELECT with append
table as target
Fixes#6749.
Fixes#1428.
Fixes#6920.
---------
Co-authored-by: aykutbozkurt <aykut.bozkurt1995@gmail.com>
We mark objects as distributed objects in Citus metadata only if we need
to propagate given the command that creates it to worker nodes. For this
reason, we were not doing this for the objects that are created while
pg_dist_node is empty.
One implication of doing so is that we defer the schema propagation to
the time when user creates the first distributed table in the schema.
However, this doesn't help for schema-based sharding (#6866) because we
want to sync pg_dist_tenant_schema to the worker nodes even for empty
schemas too.
* Support test dependencies for isolation tests without a schedule
* Comment out a test due to a known issue (#6901)
* Also, reduce the verbosity for some log messages and make some
tests compatible with run_test.py.
Fixes#6779.
DESCRIPTION: Disables citus.enable_non_colocated_router_query_pushdown
GUC by default to ensure generating a consistent distributed plan for
the queries that reference non-colocated distributed tables
We already have tests for the cases where this GUC is disabled,
so I'm not adding any more tests in this PR.
Also make multi_insert_select_window idempotent.
Related to: #6793
DESCRIPTION: Forward to existing emit_log_hook in our log hook
This makes us work better with other extensions installed in Postgres.
Without this change we would overwrite their emit_log_hook, causing it
to never be called.
Fixes#6874
* Add support for dist insert select by selecting from a reference
table.
This was the only pushable insert .. select case that
#6773 didn't cover.
* For the cases where we insert into a Citus table but the INSERT ..
SELECT
query cannot be pushed down, allow pull-to-coordinator when possible.
Remove the checks that we had at the very beginning of
CreateInsertSelectPlanInternal so that we can try insert .. select via
pull-to-coordinator for the cases where we cannot push-down the insert
.. select query. What we support via pull-to-coordinator is still
limited due to lacking of logical planner support for SELECT queries,
but this commit at least allows using pull-to-coordinator for the cases
where the select query can be planned via router planner, without
limiting ourselves to restrictive top-level checks.
Also introduce some additional restrictions into
CreateDistributedInsertSelectPlan for the cases it was missing to check
for null-shard-key tables. Indeed, it would make more sense to have
those checks for distributed tables in general, via separate PRs against
main branch. See https://github.com/citusdata/citus/pull/6817.
* Add support for inserting into a Postgres table.
Enable router planner and a limited version of INSERT .. SELECT planner
for the queries that reference colocated null shard key tables.
* SELECT / UPDATE / DELETE / MERGE is supported as long as it's a router
query.
* INSERT .. SELECT is supported as long as it only references colocated
null shard key tables.
Note that this is not only limited to distributed INSERT .. SELECT but
also
covers a limited set of query types that require pull-to-coordinator,
e.g.,
due to LIMIT clause, generate_series() etc. ...
(Ideally distributed INSERT .. SELECT could handle such queries too,
e.g.,
when we're only referencing tables that don't have a shard key, but
today
this is not the case. See
https://github.com/citusdata/citus/pull/6773#discussion_r1140130562.
Add tests for ddl coverage:
* indexes
* partitioned tables + indexes with long names
* triggers
* foreign keys
* statistics
* grant & revoke statements
* truncate & vacuum
* create/test/drop view that depends on a dist table with no shard key
* policy & rls test
* alter table add/drop/alter_type column (using sequences/different data
types/identity columns)
* alter table add constraint (not null, check, exclusion constraint)
* alter table add column with a default value / set default / drop
default
* alter table set option (autovacuum)
* indexes / constraints without names
* multiple subcommands
Adds support for
* Creating new partitions after distributing (with null key) the parent
table
* Attaching partitions to a distributed table with null distribution key
(and automatically distribute the new partition with null key as well)
* Detaching partitions from it
With this PR, we allow creating distributed tables with without
specifying a shard key via create_distributed_table(). Here are the
the important details about those tables:
* Specifying `shard_count` is not allowed because it is assumed to be 1.
* We mostly call such tables as "null shard-key" table in code /
comments.
* To avoid doing a breaking layout change in create_distributed_table();
instead of throwing an error, it will inform the user that
`distribution_type`
param is ignored unless it's explicitly set to NULL or 'h'.
* `colocate_with` param allows colocating such null shard-key tables to
each other.
* We define this table type, i.e., NULL_SHARD_KEY_TABLE, as a subclass
of
DISTRIBUTED_TABLE because we mostly want to treat them as distributed
tables in terms of SQL / DDL / operation support.
* Metadata for such tables look like:
- distribution method => DISTRIBUTE_BY_NONE
- replication model => REPLICATION_MODEL_STREAMING
- colocation id => **!=** INVALID_COLOCATION_ID (distinguishes from
Citus local tables)
* We assign colocation groups for such tables to different nodes in a
round-robin fashion based on the modulo of "colocation id".
Note that this PR doesn't care about DDL (except CREATE TABLE) / SQL /
operation (i.e., Citus UDFs) support for such tables but adds a
preliminary
API.
We are handling colocation groups with shard group count less than the
worker node count, using a method different than the usual rebalancer.
See #6739
While making the decision of using this method or not, we should've
ignored the nodes that are marked `shouldhaveshards = false`. This PR
excludes those nodes when making the decision.
Adds a test such that:
coordinator: []
worker 1: [1_1, 1_2]
worker 2: [2_1, 2_2]
(rebalance)
coordinator: []
worker 1: [1_1, 2_1]
worker 2: [1_2, 2_2]
If we take the coordinator into account, the rebalancer considers the
first state as balanced and does nothing (because shard_count <
worker_count)
But with this pr, we ignore the coordinator because it's
shouldhaveshards = false
So the rebalancer distributes each colocation group to both workers
Also, fixes an unrelated flaky test in the same file
We need to break sequence dependency for a table while creating the
table during non-transactional metadata sync to ensure idempotency of
the creation of the table.
**Problem:**
When we send `SELECT
pg_catalog.worker_drop_sequence_dependency(logicalrelid::regclass::text)
FROM pg_dist_partition` to workers during the non-transactional sync,
table might not be in `pg_dist_partition` at worker, and sequence
dependency is not broken at the worker.
**Solution:**
We break sequence dependency via `SELECT
pg_catalog.worker_drop_sequence_dependency(logicalrelid::regclass::text)`
for each table while creating it at the workers. It is safe to send
since the udf is a no-op when there is no sequence dependency.
DESCRIPTION: Fixes a bug related to sequence idempotency at
non-transactional sync.
Fixes https://github.com/citusdata/citus/issues/6888.
`PlaceHolderVar` is not relevant to be processed inside a restriction
clause. Otherwise, `pull_var_clause_default` would throw error. PG would
create the restriction to physical `Var` that `PlaceHolderVar` points to
anyway, so it is safe to skip this restriction.
DESCRIPTION: Fixes a bug related to WHERE clause list which contains
placeholder.
Fixes https://github.com/citusdata/citus/issues/6758
This PR updates the tenant stats implementation to set partitionKeyValue
and colocationId in ExecuteLocalTaskListExtended, in addition to
LocallyExecuteTaskPlan. This ensures that tenant stats can be properly
gathered regardless of the code path taken. The changes were initially
made while testing stored procedure calls for tenant stats.
.. rather than having it in user facing functions. That way, we
can use the same logic for creating Citus tables from other places
too.
This would be useful for creating tenant tables via a simple function
call in the utility hook, for schema-based sharding purposes.
DESCRIPTION: Fixes memory errors, caught by valgrind, of type
"conditional jump or move depends on uninitialized value"
When running Citus tests under Postgres with valgrind, the test cases
calling into `NonBlockingShardSplit` function produce valgrind errors of
type "conditional jump or move depends on uninitialized value".
The issue is caused by creating a HTAB in a wrong way. HASH_COMPARE flag
should have been used when creating a HTAB with user defined comparison
function. In the absence of HASH_COMPARE flag, HTAB falls back into
built-in string comparison function. However, valgrind somehow discovers
that the match function is not assigned to the user defined function as
intended.
Fixes#6835
Fixes the bug that causes updating the citus_stat_tenants periods
incorrectly.
`TimestampDifferenceExceeds` expects the difference in milliseconds but
it was microseconds, this is fixed.
`tenantStats->lastQueryTime` was updated during monitoring too, now it's
updated only when there are tenant queries.
DESCRIPTION:
Makefile changes to build different versions of CDC decoder for different base decoders like pgoutput and wal2json with the same name and copy it to $packagelib/cdc_decoders dir. This helps the user to use logical replication slots normally with pgoutput without being aware of CDC decoder.
1) Changed src/backend/distributed/cdc/Makefile to setup a build directory
for CDC in build-cdc-$(DECODER) dir and copy the source files (.c.h and Makefile.decoder) to
the build dir and build it for each base decoder.
2) copy the pgoutput.so and wal2json.so into the above build dir and
install them in PG packagelibdir/citus_decoders directory.
3)Added a testcase 016_cdc_wal2json.pl for testing the wal2json decoder
using pg_recv_logical_changes function.
DESCRIPTION: Adds control for background task executors involving a node
### Background and motivation
Nonblocking concurrent task execution via background workers was
introduced in [#6459](https://github.com/citusdata/citus/pull/6459), and
concurrent shard moves in the background rebalancer were introduced in
[#6756](https://github.com/citusdata/citus/pull/6756) - with a hard
dependency that limits to 1 shard move per node. As we know, a shard
move consists of a shard moving from a source node to a target node. The
hard dependency was used because the background task runner didn't have
an option to limit the parallel shard moves per node.
With the motivation of controlling the number of concurrent shard
moves that involve a particular node, either as source or target, this
PR introduces a general new GUC
citus.max_background_task_executors_per_node to be used in the
background task runner infrastructure. So, why do we even want to
control and limit the concurrency? Well, it's all about resource
availability: because the moves involve the same nodes, extra
parallelism won’t make the rebalance complete faster if some resource is
already maxed out (usually cpu or disk). Or, if the cluster is being
used in a production setting, the moves might compete for resources with
production queries much more than if they had been executed
sequentially.
### How does it work?
A new column named nodes_involved is added to the catalog table that
keeps track of the scheduled background tasks,
pg_dist_background_task. It is of type integer[] - to store a list
of node ids. It is NULL by default - the column will be filled by the
rebalancer, but we may not care about the nodes involved in other uses
of the background task runner.
Table "pg_catalog.pg_dist_background_task"
Column | Type
============================================
job_id | bigint
task_id | bigint
owner | regrole
pid | integer
status | citus_task_status
command | text
retry_count | integer
not_before | timestamp with time zone
message | text
+nodes_involved | integer[]
A hashtable named ParallelTasksPerNode keeps track of the number of
parallel running background tasks per node. An entry in the hashtable is
as follows:
ParallelTasksPerNodeEntry
{
node_id // The node is used as the hash table key
counter // Number of concurrent background tasks that involve node node_id
// The counter limit is citus.max_background_task_executors_per_node
}
When the background task runner assigns a runnable task to a new
executor, it increments the counter for each of the nodes involved with
that runnable task. The limit of each counter is
citus.max_background_task_executors_per_node. If the limit is reached
for any of the nodes involved, this runnable task is skipped. And then,
later, when the running task finishes, the background task runner
decrements the counter for each of the nodes involved with the done
task. The following functions take care of these increment-decrement
steps:
IncrementParallelTaskCountForNodesInvolved(task)
DecrementParallelTaskCountForNodesInvolved(task)
citus.max_background_task_executors_per_node can be changed in the
fly. In the background rebalancer, we simply give {source_node,
target_node} as the nodesInvolved input to the
ScheduleBackgroundTask function. The rest is taken care of by the
general background task runner infrastructure explained above. Check
background_task_queue_monitor.sql and
background_rebalance_parallel.sql tests for detailed examples.
#### Note
This PR also adds a hard node dependency if a node is first being used
as a source for a move, and then later as a target. The reason this
should be a hard dependency is that the first move might make space for
the second move. So, we could run out of disk space (or at least
overload the node) if we move the second shard to it before the first
one is moved away.
Fixes https://github.com/citusdata/citus/issues/6716
DESCRIPTION: PR description that will go into the change log, up to 78
characters
---------
Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
DESCRIPTION: Adds views that monitor statistics on tenant usages
This PR adds `citus_stats_tenants` view that monitors the tenants on the
cluster.
`citus_stats_tenants` shows the node id, colocation id, tenant
attribute, read count in this period and last period, and query count in
this period and last period of the tenant.
Tenant attribute currently is the tenant's distribution column value,
later when schema based sharding is introduced, this meaning might
change.
A period is a time bucket the queries are counted by. Read and query
counts for this period can increase until the current period ends. After
that those counts are moved to last period's counts, which cannot
change. The period length can be set using 'citus.stats_tenants_period'.
`SELECT` queries are counted as _read_ queries, `INSERT`, `UPDATE` and
`DELETE` queries are counted as _write_ queries. So in the view read
counts are `SELECT` counts and query counts are `SELECT`, `INSERT`,
`UPDATE` and `DELETE` count.
The data is stored in shared memory, in a struct named
`MultiTenantMonitor`.
`citus_stats_tenants` shows the data from local tenants.
`citus_stats_tenants` show up to `citus.stats_tenant_limit` number of
tenants.
The tenants are scored based on the number of queries they run and the
recency of those queries. Every query ran increases the score of tenant
by `ONE_QUERY_SCORE`, and after every period ends the scores are halved.
Halving is done lazily.
To retain information a longer the monitor keeps up to 3 times
`citus.stats_tenant_limit` tenants. When the tenant count hits `3 *
citus.stats_tenant_limit`, last `citus.stats_tenant_limit` tenants are
removed. To see all stored tenants you can use
`citus_stats_tenants(return_all_tenants := true)`
- [x] Create collector view that gets data from all nodes. #6761
- [x] Add monitoring log #6762
- [x] Create enable/disable GUC #6769
- [x] Parse the annotation string correctly #6796
- [x] Add local queries and prepared statements #6797
- [x] Rename to citus_stat_statements #6821
- [x] Run pgbench
- [x] Fix role permissions #6812
---------
Co-authored-by: Gokhan Gulbiz <ggulbiz@gmail.com>
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
DESCRIPTION: Fix an issue that caused some queries with custom
aggregates to fail
While playing around with https://github.com/pgvector/pgvector I noticed
that the AVG query was broken. That's because we treat it as any other
AVG by breaking it down in SUM and COUNT, but there are no SUM/COUNT
functions in this case, but there is a perfectly usable combinefunc.
This PR changes our aggregate logic to prefer custom aggregates with a
combinefunc even if they have a common name.
Co-authored-by: Marco Slot <marco.slot@gmail.com>
DESCRIPTION:
- The CDC decoder is refacroted into a seperate extension that can be used loaded dynamically without having to reload citus.
- CDC decoder code can be compiled using DECODER flag to work with different decoders like pgoutput and wal2json.
by default the base decode is "pgoutput".
- the dynamic_library_path config is adjusted dynamically to prefer the decoders in cdc_decoders directory in citus init
so that the users can use the replication subscription commands without having to make any config changes.
DESCRIPTION: Refactor and unify shard move and copy functions
Shard move and copy functions share a lot of code in common. This PR
unifies these functions into one, along with some helper functions. To
preserve the current behavior, we'll introduce and use an enum
parameter, and hardcoded strings for producing error/warning messages.
Add new metadata sync methods which uses MemorySyncContext api so that during the sync we can
- free memory to prevent OOM,
- use either transactional or nontransactional modes according to the GUC .
- Create MetadataSyncContext api to encapsulate
both transactional and nontransactional modes,
- Add a GUC to switch between metadata sync transaction modes.
This pull request proposes a change to the logic used for propagating
identity columns to worker nodes in citus. Instead of creating a
dependent sequence for each identity column and changing its default
value to `nextval(seq)/worker_nextval(seq)`, this update will pass the
identity columns as-is to the worker nodes.
Please note that there are a few limitations to this change.
1. Only bigint identity columns will be allowed in distributed tables to
ensure compatibility with the DDL from any node functionality. Our
current distributed sequence implementation only allows insert
statements from all nodes for bigint sequences.
2. `alter_distributed_table` and `undistribute_table` operations will
not be allowed for tables with identity columns. This is because we do
not have a proper way of keeping sequence states consistent across the
cluster.
DESCRIPTION: Prevents using identity columns on data types other than
`bigint` on distributed tables
DESCRIPTION: Prevents using `alter_distributed_table` and
`undistribute_table` UDFs when a table has identity columns
DESCRIPTION: Fixes a bug that prevents enforcing identity column
restrictions on worker nodes
Depends on #6740Fixes#6694
DESCRIPTION: This PR removes the task dependencies between shard moves
for which the shards belong to different colocation groups. This change
results in scheduling multiple tasks in the RUNNABLE state. Therefore it
is possible that the background task monitor can run them concurrently.
Previously, all the shard moves planned in a rebalance operation took
dependency on each other sequentially.
For instance, given the following table and shards
colocation group 1 colocation group 2
table1 table2 table3 table4 table 5
shard11 shard21 shard31 shard41 shard51
shard12 shard22 shard32 shard42 shard52
if the rebalancer planner returned the below set of moves
` {move(shard11), move(shard12), move(shard41), move(shard42)}`
background rebalancer scheduled them such that they depend on each other
sequentially.
```
{move(reftables) if there is any, none}
|
move( shard11)
|
move(shard12)
| {move(shard41)<--- move(shard12)} This is an artificial dependency
move(shard41)
|
move(shard42)
```
This results in artificial dependencies between otherwise independent
moves.
Considering that the shards in different colocation groups can be moved
concurrently, this PR changes the dependency relationship between the
moves as follows:
```
{move(reftables) if there is any, none} {move(reftables) if there is any, none}
| |
move(shard11) move(shard41)
| |
move(shard12) move(shard42)
```
---------
Co-authored-by: Jelte Fennema <jelte.fennema@microsoft.com>
Description:
Implementing CDC changes using Logical Replication to avoid
re-publishing events multiple times by setting up replication origin
session, which will add "DoNotReplicateId" to every WAL entry.
- shard splits
- shard moves
- create distributed table
- undistribute table
- alter distributed tables (for some cases)
- reference table operations
The citus decoder which will be decoding WAL events for CDC clients,
ignores any WAL entry with replication origin that is not zero.
It also maps the shard names to distributed table names.
Today we allow planning the queries that reference non-colocated tables
if the shards that query targets are placed on the same node. However,
this may not be the case, e.g., after rebalancing shards because it's
not guaranteed to have those shards on the same node anymore.
This commit adds citus.enable_non_colocated_router_query_pushdown GUC
that can be used to disallow planning such queries via router planner,
when it's set to false. Note that the default value for this GUC will be
"true" for 11.3, but we will alter it to "false" on 12.0 to not
introduce
a breaking change in a minor release.
Closes#692.
Even more, allowing such queries to go through router planner also
causes
generating an incorrect plan for the DML queries that reference
distributed
tables that are sharded based on different replication factor settings.
For
this reason, #6779 can be closed after altering the default value for
this
GUC to "false", hence not now.
DESCRIPTION: Adds `citus.enable_non_colocated_router_query_pushdown` GUC
to ensure generating a consistent distributed plan for the queries that
reference non-colocated distributed tables (when set to "false", the
default is "true").
Because they're only interested in distributed tables. Even more,
this replaces HasDistributionKey() check with
IsCitusTableType(DISTRIBUTED_TABLE) because this doesn't make a
difference on main and sounds slightly more intuitive. Plus, this
would also allow safely using this function in
https://github.com/citusdata/citus/pull/6773.
DESCRIPTION: Check before logicalrep for rebalancer, error if needed
Check if we can use logical replication or not, in case of shard
transfer mode = auto, before executing the shard moves. If we can't,
error out. Before this PR, we used to error out in the middle of shard
moves:
```sql
set citus.shard_count = 4; -- just to get the error sooner
select citus_remove_node('localhost',9702);
create table t1 (a int primary key);
select create_distributed_table('t1','a');
create table t2 (a bigint);
select create_distributed_table('t2','a');
select citus_add_node('localhost',9702);
select rebalance_table_shards();
NOTICE: Moving shard 102008 from localhost:9701 to localhost:9702 ...
NOTICE: Moving shard 102009 from localhost:9701 to localhost:9702 ...
NOTICE: Moving shard 102012 from localhost:9701 to localhost:9702 ...
ERROR: cannot use logical replication to transfer shards of the relation t2 since it doesn't have a REPLICA IDENTITY or PRIMARY KEY
```
Now we check and error out in the beginning, without moving the shards.
fixes: #6727
Fixes#6672
2) Move all MERGE related routines to a new file merge_planner.c
3) Make ConjunctionContainsColumnFilter() static again, and rearrange the code in MergeQuerySupported()
4) Restore the original format in the comments section.
5) Add big serial test. Implement latest set of comments
This implements the phase - II of MERGE sql support
Support routable query where all the tables in the merge-sql are distributed, co-located, and both the source and
target relations are joined on the distribution column with a constant qual. This should be a Citus single-task
query. Below is an example.
SELECT create_distributed_table('t1', 'id');
SELECT create_distributed_table('s1', 'id', colocate_with => ‘t1’);
MERGE INTO t1
USING s1 ON t1.id = s1.id AND t1.id = 100
WHEN MATCHED THEN
UPDATE SET val = s1.val + 10
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED THEN
INSERT (id, val, src) VALUES (s1.id, s1.val, s1.src)
Basically, MERGE checks to see if
There are a minimum of two distributed tables (source and a target).
All the distributed tables are indeed colocated.
MERGE relations are joined on the distribution column
MERGE .. USING .. ON target.dist_key = source.dist_key
The query should touch only a single shard i.e. JOIN AND with a constant qual
MERGE .. USING .. ON target.dist_key = source.dist_key AND target.dist_key = <>
If any of the conditions are not met, it raises an exception.
(cherry picked from commit 44c387b978)
This implements MERGE phase3
Support pushdown query where all the tables in the merge-sql are Citus-distributed, co-located, and both
the source and target relations are joined on the distribution column. This will generate multiple tasks
which execute independently after pushdown.
SELECT create_distributed_table('t1', 'id');
SELECT create_distributed_table('s1', 'id', colocate_with => ‘t1’);
MERGE INTO t1
USING s1
ON t1.id = s1.id
WHEN MATCHED THEN
UPDATE SET val = s1.val + 10
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED THEN
INSERT (id, val, src) VALUES (s1.id, s1.val, s1.src)
*The only exception for both the phases II and III is, UPDATEs and INSERTs must be done on the same shard-group
as the joined key; for example, below scenarios are NOT supported as the key-value to be inserted/updated is not
guaranteed to be on the same node as the id distribution-column.
MERGE INTO target t
USING source s ON (t.customer_id = s.customer_id)
WHEN NOT MATCHED THEN - -
INSERT(customer_id, …) VALUES (<non-local-constant-key-value>, ……);
OR this scenario where we update the distribution column itself
MERGE INTO target t
USING source s On (t.customer_id = s.customer_id)
WHEN MATCHED THEN
UPDATE SET customer_id = 100;
(cherry picked from commit fa7b8949a8)
Decide core distribution params in CreateCitusTable to reduce the
chances of
creating Citus tables based on incorrect combinations of distribution
method
and replication model params.
Also introduce DistributedTableParams struct to encapsulate the
parameters
that are specific to distributed tables.
Now that we will soon add another table type having DISTRIBUTE_BY_NONE
as distribution method and that we want the code to interpret such
tables mostly as distributed tables, let's make the definition of those
other two table types more strict by removing
CITUS_TABLE_WITH_NO_DIST_KEY
macro.
And instead, use HasDistributionKey() check in the places where the
logic applies to all table types that have / don't have a distribution
key. In future PRs, we might want to convert some of those
HasDistributionKey() checks if logic only applies to Citus local /
reference tables, not the others.
And adding HasDistributionKey() also allows us to consider having
DISTRIBUTE_BY_NONE as the distribution method as a "table attribute"
that can apply to distributed tables too, rather something that
determines the table type.
Split the main logic that allows creating a Citus table into the
internal function CreateCitusTable().
Old CreateDistributedTable() function was assuming that it's creating
a reference table when the distribution method is DISTRIBUTE_BY_NONE.
However, soon this won't be the case when adding support for creating
single-shard distributed tables because their distribution method would
also be the same.
Now the internal method CreateCitusTable() doesn't make any assumptions
about table's replication model or such. Instead, it expects callers to
properly set all such metadata bits.
Even more, some of the parameters the old CreateDistributedTable() takes
--such as the shard count-- were not meaningful for a reference table,
and would be the same as for new table type.
DESCRIPTION: Fixes a bug in shard copy operations.
For copying shards in both shard move and shard split operations, Citus
uses the COPY statement.
A COPY all statement in the following form
` COPY target_shard FROM STDIN;`
throws an error when there is a GENERATED column in the shard table.
In order to fix this issue, we need to exclude the GENERATED columns in
the COPY and the matching SELECT statements. Hence this fix converts the
COPY and SELECT all statements to the following form:
```
COPY target_shard (col1, col2, ..., coln) FROM STDIN;
SELECT (col1, col2, ..., coln) FROM source_shard;
```
where (col1, col2, ..., coln) does not include a GENERATED column.
GENERATED column values are created in the target_shard as the values
are inserted.
Fixes#6705.
---------
Co-authored-by: Teja Mupparti <temuppar@microsoft.com>
Co-authored-by: aykut-bozkurt <51649454+aykut-bozkurt@users.noreply.github.com>
Co-authored-by: Jelte Fennema <jelte.fennema@microsoft.com>
Co-authored-by: Gürkan İndibay <gindibay@microsoft.com>
DESCRIPTION: Adds logic to distribute unbalanced shards
If the number of shard placements (for a colocation group) is less than
the number of workers, it means that some of the workers will remain
empty. With this PR, we consider these shard groups as a colocation
group, in order to make them be distributed evenly as much as possible
across the cluster.
Example:
```sql
create table t1 (a int primary key);
create table t2 (a int primary key);
create table t3 (a int primary key);
set citus.shard_count =1;
select create_distributed_table('t1','a');
select create_distributed_table('t2','a',colocate_with=>'t1');
select create_distributed_table('t3','a',colocate_with=>'t2');
create table tb1 (a bigint);
create table tb2 (a bigint);
select create_distributed_table('tb1','a');
select create_distributed_table('tb2','a',colocate_with=>'tb1');
select citus_add_node('localhost',9702);
select rebalance_table_shards();
```
Here we have two colocation groups, each with one shard group. Both
shard groups are placed on the first worker node. When we add a new
worker node and try to rebalance table shards, the rebalance planner
considers it well balanced and does nothing. With this PR, the
rebalancer tries to distribute these shard groups evenly across the
cluster as much as possible. For this example, with this PR, the
rebalancer moves one of the shard groups to the second worker node.
fixes: #6715
DESCRIPTION: Correctly report shard size in citus_shards view
When looking at citus_shards, people are interested in the actual size
that all the data related to the shard takes up on disk.
`pg_total_relation_size` is the function to use for that purpose. The
previously used `pg_relation_size` does not include indexes or TOAST.
Especially the missing toast can have enormous impact on the size of the
shown data.
2 improvements to prevent memory leaks during altering or undistributing
distributed tables with a lot of partitions and shards:
1. Free memory for each call to ConvertTable so that colocated and partition tables at
`AlterDistributedTable`, `UndistributeTable`, or
`AlterTableSetAccessMethod` will not cause an increase
in memory usage,
2. Free memory while executing attach partition commands for each partition table at
`AlterDistributedTable` to prevent an increase in memory usage.
DESCRIPTION: Fixes memory leak issue during altering distributed table
with a lot of partition and shards.
Fixes https://github.com/citusdata/citus/issues/6503.
We have memory leak during distribution of a table with a lot of
partitions as we do not release memory at ExprContext until all
partitions are not distributed. We improved 2 things to resolve the
issue:
1. We create and delete MemoryContext for each call to
`CreateDistributedTable` by partitions,
2. We rebuild the cache after we insert all the placements instead of
each placement for a shard.
DESCRIPTION: Fixes memory leak during distribution of a table with a lot
of partitions and shards.
Fixes https://github.com/citusdata/citus/issues/6572.