DESCRIPTION: Adds support for 2PC from non-Citus main databases
This PR only adds support for `CREATE USER` queries, other queries need
to be added. But it should be simple because this PR creates the
underlying structure.
Citus main database is the database where the Citus extension is
created. A non-main database is all the other databases that are in the
same node with a Citus main database.
When a `CREATE USER` query is run on a non-main database we:
1. Run `start_management_transaction` on the main database. This
function saves the outer transaction's xid (the non-main database
query's transaction id) and marks the current query as main db command.
2. Run `execute_command_on_remote_nodes_as_user("CREATE USER
<username>", <username to run the command>)` on the main database. This
function creates the users in the rest of the cluster by running the
query on the other nodes. The user on the current node is created by the
query on the outer, non-main db, query to make sure consequent commands
in the same transaction can see this user.
3. Run `mark_object_distributed` on the main database. This function
adds the user to `pg_dist_object` in all of the nodes, including the
current one.
This PR also implements transaction recovery for the queries from
non-main databases.
DESCRIPTION: Adds support for propagating `CREATE`/`DROP` database
In this PR, create and drop database support is added.
For CREATE DATABASE:
* "oid" option is not supported
* specifying "strategy" to be different than "wal_log" is not supported
* specifying "template" to be different than "template1" is not
supported
The last two are because those are not saved in `pg_database` and when
activating a node, we cannot assume what parameters were provided when
creating the database.
And "oid" is not supported because whether user specified an arbitrary
oid when creating the database is not saved in pg_database and we want
to avoid from oid collisions that might arise from attempting to use an
auto-assigned oid on workers.
Finally, in case of node activation, GRANTs for the database are also
propagated.
---------
Co-authored-by: Jelte Fennema-Nio <github-tech@jeltef.nl>
Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
Add citus_schema_move() that can be used to move tenant tables within a distributed
schema to another node. The function has two variations as simple wrappers around
citus_move_shard_placement() and citus_move_shard_placement_with_nodeid() respectively.
They pick a shard that belongs to the given tenant schema and resolve the source node
that contain the shards under given tenant schema. Hence their signatures are quite
similar to underlying functions:
```sql
-- citus_schema_move(), using target node name and node port
CREATE OR REPLACE FUNCTION pg_catalog.citus_schema_move(
schema_id regnamespace,
target_node_name text,
target_node_port integer,
shard_transfer_mode citus.shard_transfer_mode default 'auto')
RETURNS void
LANGUAGE C STRICT
AS 'MODULE_PATHNAME', $$citus_schema_move$$;
-- citus_schema_move(), using target node id
CREATE OR REPLACE FUNCTION pg_catalog.citus_schema_move(
schema_id regnamespace,
target_node_id integer,
shard_transfer_mode citus.shard_transfer_mode default 'auto')
RETURNS void
LANGUAGE C STRICT
AS 'MODULE_PATHNAME', $$citus_schema_move_with_nodeid$$;
```
DESCRIPTION: Presenting citus_pause_node UDF enabling pausing by
node_id.
citus_pause_node takes a node_id parameter and fetches all the shards in
that node and puts AccessExclusiveLock on all the shards inside that
node. With this lock, insert is disabled, until citus_pause_node
transaction is closed.
---------
Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
PG16 compatibility - part 7
Check out part 1 42d956888d
part 2 0d503dd5ac
part 3 907d72e60d
part 4 7c6b4ce103
part 5 6056cb2c29
part 6 b36c431abb
part 7 ee3153fe50
This commit is in the series of PG16 compatibility commits. PG16 introduced a new entry
varnnullingrels to Var, which represents our partkey in pg_dist_partition.
This commit does the necessary changes in Citus to support this.
Relevant PG commit:
2489d76c49
2489d76c4906f4461a364ca8ad7e0751ead8aa0d
More PG16 compatibility commits are coming soon ...
PG16 compatibility - Part 2
Part 1 provided successful compilation against pg16beta2.
42d956888d
This PR provides ruleutils changes with pg16beta2 and successful CREATE EXTENSION command.
Note that more changes are needed in order to have successful regression tests.
More commits are coming soon ...
For any_value changes, I referred to this commit
8ef94dc1f5
where we did something similar for PG14 support.
This is to implement custom cast of table partition column
type from / to `timestamptz` in time partition management UDFs, as
proposed in ticket #6454
The general idea is for a time partition column with type other than
`date`, `timestamp`, or `timestamptz`, users can provide custom
bidirectional cast between the column type and `timestamptz`, the UDFs
then will be able to create and drop time partitions for such tables.
Fixes#6454
---------
Signed-off-by: Xin Li <xin@swirldslabs.com>
Co-authored-by: Marco Slot <marco.slot@microsoft.com>
Co-authored-by: Ahmet Gedemenli <afgedemenli@gmail.com>
DESCRIPTION: Adds citus_schemas view
The citus_schemas view will be created in public schema if it exists, if
not the view will be created in pg_catalog.
Need to:
- [x] Add tests
- [x] Fix tests
citus_shard_sizes view had a shard name column we use to extract shard
id. This PR changes the column to shard id so we don't do unnecessary
string operation.
DESCRIPTION: Enabling citus_stat_tenants to support schema-based
tenants.
This pull request modifies the existing logic to enable tenant
monitoring with schema-based tenants. The changes made are as follows:
- If a query has a partitionKeyValue (which serves as a tenant
key/identifier for distributed tables), Citus annotates the query with
both the partitionKeyValue and colocationId. This allows for accurate
tracking of the query.
- If a query does not have a partitionKeyValue, but its colocationId
belongs to a distributed schema, Citus annotates the query with only the
colocationId. The tenant monitor can then easily look up the schema to
determine if it's a distributed schema and make a decision on whether to
track the query.
---------
Co-authored-by: Jelte Fennema <jelte.fennema@microsoft.com>
* Currently we do not allow any Citus tables other than Citus local
tables inside a regular schema before executing
`citus_schema_distribute`.
* `citus_schema_undistribute` expects only single shard distributed
tables inside a tenant schema.
DESCRIPTION: Adds the udf `citus_schema_distribute` to convert a regular
schema into a tenant schema.
DESCRIPTION: Adds the udf `citus_schema_undistribute` to convert a
tenant schema back to a regular schema.
---------
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
`citus_table_type` column of `citus_tables` and `citus_shards` will show
"schema" for tenants schema tables and "distributed" for single shard
tables that are not in a tenant schema.
DESCRIPTION: Adds citus.enable_schema_based_sharding GUC that allows
sharding the database based on schemas when enabled.
* Refactor the logic that automatically creates Citus managed tables
* Refactor CreateSingleShardTable() to allow specifying colocation id
instead
* Add support for schema-based-sharding via a GUC
### What this PR is about:
Add **citus.enable_schema_based_sharding GUC** to enable schema-based
sharding. Each schema created while this GUC is ON will be considered
as a tenant schema. Later on, regardless of whether the GUC is ON or
OFF, any table created in a tenant schema will be converted to a
single shard distributed table (without a shard key). All the tenant
tables that belong to a particular schema will be co-located with each
other and will have a shard count of 1.
We introduce a new metadata table --pg_dist_tenant_schema-- to do the
bookkeeping for tenant schemas:
```sql
psql> \d pg_dist_tenant_schema
Table "pg_catalog.pg_dist_tenant_schema"
┌───────────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├───────────────┼─────────┼───────────┼──────────┼─────────┤
│ schemaid │ oid │ │ not null │ │
│ colocationid │ integer │ │ not null │ │
└───────────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
"pg_dist_tenant_schema_pkey" PRIMARY KEY, btree (schemaid)
"pg_dist_tenant_schema_unique_colocationid_index" UNIQUE, btree (colocationid)
psql> table pg_dist_tenant_schema;
┌───────────┬───────────────┐
│ schemaid │ colocationid │
├───────────┼───────────────┤
│ 41963 │ 91 │
│ 41962 │ 90 │
└───────────┴───────────────┘
(2 rows)
```
Colocation id column of pg_dist_tenant_schema can never be NULL even
for the tenant schemas that don't have a tenant table yet. This is
because, we assign colocation ids to tenant schemas as soon as they
are created. That way, we can keep associating tenant schemas with
particular colocation groups even if all the tenant tables of a tenant
schema are dropped and recreated later on.
When a tenant schema is dropped, we delete the corresponding row from
pg_dist_tenant_schema. In that case, we delete the corresponding
colocation group from pg_dist_colocation as well.
### Future work for 12.0 release:
We're building schema-based sharding on top of the infrastructure that
adds support for creating distributed tables without a shard key
(https://github.com/citusdata/citus/pull/6867).
However, not all the operations that can be done on distributed tables
without a shard key necessarily make sense (in the same way) in the
context of schema-based sharding. For example, we need to think about
what happens if user attempts altering schema of a tenant table. We
will tackle such scenarios in a future PR.
We will also add a new UDF --citus.schema_tenant_set() or such-- to
allow users to use an existing schema as a tenant schema, and another
one --citus.schema_tenant_unset() or such-- to stop using a schema as
a tenant schema in future PRs.
DESCRIPTION: PR description that will go into the change log, up to 78
characters
---------
Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
DESCRIPTION: Adds views that monitor statistics on tenant usages
This PR adds `citus_stats_tenants` view that monitors the tenants on the
cluster.
`citus_stats_tenants` shows the node id, colocation id, tenant
attribute, read count in this period and last period, and query count in
this period and last period of the tenant.
Tenant attribute currently is the tenant's distribution column value,
later when schema based sharding is introduced, this meaning might
change.
A period is a time bucket the queries are counted by. Read and query
counts for this period can increase until the current period ends. After
that those counts are moved to last period's counts, which cannot
change. The period length can be set using 'citus.stats_tenants_period'.
`SELECT` queries are counted as _read_ queries, `INSERT`, `UPDATE` and
`DELETE` queries are counted as _write_ queries. So in the view read
counts are `SELECT` counts and query counts are `SELECT`, `INSERT`,
`UPDATE` and `DELETE` count.
The data is stored in shared memory, in a struct named
`MultiTenantMonitor`.
`citus_stats_tenants` shows the data from local tenants.
`citus_stats_tenants` show up to `citus.stats_tenant_limit` number of
tenants.
The tenants are scored based on the number of queries they run and the
recency of those queries. Every query ran increases the score of tenant
by `ONE_QUERY_SCORE`, and after every period ends the scores are halved.
Halving is done lazily.
To retain information a longer the monitor keeps up to 3 times
`citus.stats_tenant_limit` tenants. When the tenant count hits `3 *
citus.stats_tenant_limit`, last `citus.stats_tenant_limit` tenants are
removed. To see all stored tenants you can use
`citus_stats_tenants(return_all_tenants := true)`
- [x] Create collector view that gets data from all nodes. #6761
- [x] Add monitoring log #6762
- [x] Create enable/disable GUC #6769
- [x] Parse the annotation string correctly #6796
- [x] Add local queries and prepared statements #6797
- [x] Rename to citus_stat_statements #6821
- [x] Run pgbench
- [x] Fix role permissions #6812
---------
Co-authored-by: Gokhan Gulbiz <ggulbiz@gmail.com>
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
This pull request proposes a change to the logic used for propagating
identity columns to worker nodes in citus. Instead of creating a
dependent sequence for each identity column and changing its default
value to `nextval(seq)/worker_nextval(seq)`, this update will pass the
identity columns as-is to the worker nodes.
Please note that there are a few limitations to this change.
1. Only bigint identity columns will be allowed in distributed tables to
ensure compatibility with the DDL from any node functionality. Our
current distributed sequence implementation only allows insert
statements from all nodes for bigint sequences.
2. `alter_distributed_table` and `undistribute_table` operations will
not be allowed for tables with identity columns. This is because we do
not have a proper way of keeping sequence states consistent across the
cluster.
DESCRIPTION: Prevents using identity columns on data types other than
`bigint` on distributed tables
DESCRIPTION: Prevents using `alter_distributed_table` and
`undistribute_table` UDFs when a table has identity columns
DESCRIPTION: Fixes a bug that prevents enforcing identity column
restrictions on worker nodes
Depends on #6740Fixes#6694
Description:
Implementing CDC changes using Logical Replication to avoid
re-publishing events multiple times by setting up replication origin
session, which will add "DoNotReplicateId" to every WAL entry.
- shard splits
- shard moves
- create distributed table
- undistribute table
- alter distributed tables (for some cases)
- reference table operations
The citus decoder which will be decoding WAL events for CDC clients,
ignores any WAL entry with replication origin that is not zero.
It also maps the shard names to distributed table names.
If there is a problem with an ongoing rebalance, we did not show details
on background tasks that are stuck in runnable state. Similar to how we
show details for errored tasks, we now show details on tasks that are
being retried.
Earlier we showed the following output when a task was stuck:
```
┌────────────────────────────┐
│ { ↵│
│ "tasks": [ ↵│
│ ], ↵│
│ "task_state_counts": {↵│
│ "done": 13, ↵│
│ "blocked": 2, ↵│
│ "runnable": 1 ↵│
│ } ↵│
│ } │
└────────────────────────────┘
```
Now we show details like the following:
```
+-----------------------------------------------------------------------
| {
| "tasks": [
| {
| "state": "runnable",
| "command": "SELECT pg_catalog.citus_move_shard_placement(1
| "message": "ERROR: Moving shards to a node that shouldn't
| "retried": 2,
| "task_id": 3
| }
| ],
| "task_state_counts": {
| "blocked": 1,
| "runnable": 1
| }
| }
+-----------------------------------------------------------------------
```
citus_job_list() lists all background jobs by simply showing the records
in pg_dist_background_job.
citus_job_status(job_id bigint, raw boolean default false) shows the
status of a single background job by appending a jsonb details column to
the associated row from pg_dist_background_job. If the raw argument is
set, machine readable sizes are used instead of human readable
alternatives.
citus_rebalance_status(raw boolean default false) shows the status of
the last rebalance operation. If the raw argument is set, machine
readable sizes are used instead of human readable alternatives.
Fixes a missed include in #6315.
While adding the cluster clock we have added some extra steps to
`citus_prepare_pg_upgrade` and `citus_finish_pg_upgrade`. These changes
were not added to the citus upgrade and downgrade scripts, this allowed
for a syntax error to slip in.
This PR adds the new versions of both UDF's to the upgrade script while
adding the old version to the downgrade script. This exposed the syntax
error which is also solved.
We already have citus_job_wait to wait until the job reaches the desired
state. That PR adds waiting on task state to allow more granular
waiting. It can be used for Citus operations. Moreover, it is also
useful for testing purposes. (wait until a task reaches specified state)
Related to #6459.
DESCRIPTION: Create replication artifacts with unique names
We're creating replication objects with generic names. This disallows us
to enable parallel shard moves, as two operations might use the same
objects. With this PR, we'll create below objects with operation
specific names, by appending OparationId to the names.
* Subscriptions
* Publications
* Replication Slots
* Users created for subscriptions
1) Regular users fail to use clock UDF with permission issue.
2) Clock functions were declared as STABLE, whereas by definition they are VOLATILE. By design, any clock/time
functions will return different results for each call even within a single SQL statement.
Note: UDF citus_get_transaction_clock() is a misnomer as it internally calls the clock tick which always returns
different results for every invocation in the same transaction.
increasing logical clock. Clock guarantees to never go back in value after restarts,
and makes best attempt to keep the value close to unix epoch time in milliseconds.
Also, introduces a new GUC "citus.enable_cluster_clock", when true, every
distributed transaction is stamped with logical causal clock and persisted
in a catalog pg_dist_commit_transaction.
DESCRIPTION: Adds status column to get_rebalance_progress()
Introduces a new column named `status` for the function
`get_rebalance_progress()`. For each ongoing shard move, this column
will reveal information about that shard move operation's current
status.
For now, candidate status messages could be one of the below.
* Not Started
* Setting Up
* Copying Data
* Catching Up
* Creating Constraints
* Final Catchup
* Creating Foreign Keys
* Completing
* Completed
Sometimes our CI randomly fails on a test in a way similar to this:
```diff
step s2-drop:
DROP TABLE cancel_table;
-
+ <waiting ...>
+step s2-drop: <... completed>
starting permutation: s1-timeout s1-begin s1-sleep10000 s1-rollback s1-reset s1-drop
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/26524/workflows/5415b84f-13a3-482f-bef9-648314c79a67/jobs/756377
I tried to fix that already in #6252 by disabling the maintenance daemon
during isolation tests. But it seems that hasn't fixed all cases of
these errors. This is another attempt at fixing these issues that seems
to have better results.
What it does is that it starts using the pInterestingPids parameter that
citus_isolation_test_session_is_blocked receives. With this change we
start filter out block-edges that are not caused by any of these pids.
In passing this change also makes it possible to run
`isolation_create_distributed_table_concurrently` with
`check-isolation-base`
DESCRIPTION: Adds source_lsn and target_lsn fields into
get_rebalance_progress
Adding two fields named `source_lsn` and `target_lsn` to the function
`get_rebalance_progress`.
Target lsn data is fetched in `GetShardStatistics`, by expanding the
query sent to workers (joining with pg_subscription_rel and
pg_stat_subscription). Then put into the hashmap, for each shard.
Source lsn data is fetched in `BuildWorkerShardStatististicsHash`, in
the loop that iterate each node, by sending a pg_current_wal_lsn query
to each node. Then put into the hashmap, for each node.
DESCRIPTION: Show citus_copy_shard_placement progress in
get_rebalance_progress
When rebalancing to a new node that does not have reference tables yet
the rebalancer will first copy the reference tables to the nodes.
Depending on the size of the reference tables, this might take a long
time. However, there's no indication of what's happening at this stage
of the rebalance.
This PR improves this situation by also showing the progress of any
citus_copy_shard_placement calls when calling get_rebalance_progress.
DESCRIPTION: Add a rebalancer that uses background tasks for its
execution
Based on the baclground jobs and tasks introduced in #6296 we implement
a new rebalancer on top of the primitives of background execution. This
allows the user to initiate a rebalance and let Citus execute the long
running steps in the background until completion.
Users can invoke the new background rebalancer with `SELECT
citus_rebalance_start();`. It will output information on its job id and
how to track progress. Also it returns its job id for automation
purposes. If you simply want to wait till the rebalance is done you can
use `SELECT citus_rebalance_wait();`
A running rebalance can be canelled/stopped with `SELECT
citus_rebalance_stop();`.
DESCRIPTION: Add infrastructure to run long running management operations in background
This infrastructure introduces the primitives of jobs and tasks.
A task consists of a sql statement and an owner. Tasks belong to a
Job and can depend on other tasks from the same job.
When there are either runnable or running tasks we would like to
make sure a bacgrkound task queue monitor process is running. A Task
could be in running state while there is actually no monitor present
due to a database restart or failover. Once the monitor starts it
will reset any running task to its runnable state.
To make sure only one background task queue monitor is ever running
at once it will acquire an advisory lock that self conflicts.
Once a task is done it will find all tasks depending on this task.
After checking that the task doesn't have unmet dependencies it will
transition the task from blocked to runnable state for the task to
be picked up on a subsequent task start.
Currently only one task can be running at a time. This can be
improved upon in later releases without changes to the higher level
API.
The initial goal for this background tasks is to allow a rebalance
to run in the background. This will be implemented in a subsequent PR.
When introducing our overrides of pg_cancel_backend and
pg_terminate_backend we accidentally did that in such a way that we
cannot call the original pg_cancel_backend and pg_terminate_backend from
C anymore. This happened because we defined the exact same symbols in
our shared library as postgres does in its own binary.
This fixes that by using a different names for the C function than for
the SQL function.
Making this work in all upgrade and downgrade scenarios is not trivial
though, because we actually need to remove the C function definition.
Postgres errors in two different times when the symbol that a C function
wants to call is not defined in the library it expects it in:
1. When creating the SQL function definition
2. When calling the SQL function
Item 1 causes an issue when creating our extension for the first time.
We then go execute all the migrations that we have. So if the 11.0
migration contains a SQL function definition that still references the
pg_cancel_backend symbol, that migration will fail. This issue is solved
by actually changing the SQL definition in the old migration.
This is not enough to fix all issues though. Item 2 causes an issue
after an upgrade to 11.1, because it won't have the new definition of
the SQL function. This is solved by recreating the SQL functions in the
migration to 11.1. That way it gets the new definition.
Then finally there's the case of downgrades. To continue to make our
pg_cancel_backend SQL function work after downgrading, we will need to
make a patch release for 11.0 that includes the new citus_cancel_backend
symbol. This is done in a separate commit.
DESCRIPTION:
This PR adds support for 'Deferred Drop' and robust 'Shard Cleanup' for Splits.
Common Infrastructure
This PR introduces new common infrastructure so as any operation that wants robust cleanup of resources can register with the cleaner and have the resources cleaned appropriately based on a specified policy. 'Shard Split' is the first consumer using this new infrastructure.
Note : We only support adding 'shards' as resources to be cleaned-up right now but the framework will be extended to support other resources in future.
Deferred Drop for Split
Deferred Drop Support ensures that shards undergoing split are not dropped inline as part of operation but dropped later when no active read queries are running on shard. This helps with :
Avoids any potential deadlock scenarios that can cause long running Split operation to rollback.
Avoids Split operation blocking writes and then getting blocked (due to running queries on the shard) when trying to drop shards.
Deferred drop is the new default behavior going forward.
Shard Cleaner Extension
Shard Cleaner is a background task responsible for deferred drops in case of 'Move' operations.
The cleaner has been extended to ensure robust cleanup of shards (dummy shards and split children) in case of a failure based on the new infrastructure mentioned above. The cleaner also handles deferred drop for 'Splits'.
TESTING:
New test ''citus_split_shard_by_split_points_deferred_drop' to test deferred drop support.
New test 'failure_split_cleanup' to test shard cleanup with failures in different stages.
Update 'isolation_blocking_shard_split and isolation_non_blocking_shard_split' for deferred drop.
Added non-deferred drop version of existing tests : 'citus_split_shard_no_deferred_drop' and 'citus_non_blocking_splits_no_deferred_drop'
Added create_distributed_table_concurrently which is nonblocking variant of create_distributed_table.
It bases on the split API which takes advantage of logical replication to support nonblocking split operations.
Co-authored-by: Marco Slot <marco.slot@gmail.com>
Co-authored-by: aykutbozkurt <aykut.bozkurt1995@gmail.com>