Commit Graph

221 Commits (48f7d6c27925af3813d787bbe9e9d1458f54413e)

Author SHA1 Message Date
Marco Slot 48f7d6c279
Show local managed tables in citus_tables and hide tables owned by extensions (#6321)
Co-authored-by: Marco Slot <marco.slot@gmail.com>
2022-09-12 17:49:17 +03:00
Marco Slot ba2fe3e3c4
Remove do_repair option from citus_copy_shard_placement (#6299)
Co-authored-by: Marco Slot <marco.slot@gmail.com>
2022-09-09 15:44:30 +02:00
Nils Dijk 00a94c7f13
Implement infrastructure to run sql jobs in the background (#6296)
DESCRIPTION: Add infrastructure to run long running management operations in background

This infrastructure introduces the primitives of jobs and tasks.
A task consists of a sql statement and an owner. Tasks belong to a
Job and can depend on other tasks from the same job.

When there are either runnable or running tasks we would like to
make sure a bacgrkound task queue monitor process is running. A Task
could be in running state while there is actually no monitor present
due to a database restart or failover. Once the monitor starts it
will reset any running task to its runnable state.

To make sure only one background task queue monitor is ever running
at once it will acquire an advisory lock that self conflicts.

Once a task is done it will find all tasks depending on this task.
After checking that the task doesn't have unmet dependencies it will
transition the task from blocked to runnable state for the task to
be picked up on a subsequent task start.

Currently only one task can be running at a time. This can be
improved upon in later releases without changes to the higher level
API.

The initial goal for this background tasks is to allow a rebalance
to run in the background. This will be implemented in a subsequent PR.
2022-09-09 16:11:19 +03:00
Jelte Fennema e29db74a19
Don't override postgres C symbols with our own (#6300)
When introducing our overrides of pg_cancel_backend and
pg_terminate_backend we accidentally did that in such a way that we
cannot call the original pg_cancel_backend and pg_terminate_backend from
C anymore. This happened because we defined the exact same symbols in
our shared library as postgres does in its own binary.

This fixes that by using a different names for the C function than for
the SQL function.

Making this work in all upgrade and downgrade scenarios is not trivial
though, because we actually need to remove the C function definition.
Postgres errors in two different times when the symbol that a C function
wants to call is not defined in the library it expects it in:
1. When creating the SQL function definition
2. When calling the SQL function

Item 1 causes an issue when creating our extension for the first time.
We then go execute all the migrations that we have. So if the 11.0
migration contains a SQL function definition that still references the
pg_cancel_backend symbol, that migration will fail. This issue is solved
by actually changing the SQL definition in the old migration.

This is not enough to fix all issues though. Item 2 causes an issue
after an upgrade to 11.1, because it won't have the new definition of
the SQL function. This is solved by recreating the SQL functions in the
migration to 11.1. That way it gets the new definition.

Then finally there's the case of downgrades. To continue to make our
pg_cancel_backend SQL function work after downgrading, we will need to
make a patch release for 11.0 that includes the new citus_cancel_backend
symbol. This is done in a separate commit.
2022-09-07 11:27:05 +02:00
Nitish Upreti d7404a9446
'Deferred Drop' and robust 'Shard Cleanup' for Splits. (#6258)
DESCRIPTION:
This PR adds support for 'Deferred Drop' and robust 'Shard Cleanup' for Splits.

Common Infrastructure
This PR introduces new common infrastructure so as any operation that wants robust cleanup of resources can register with the cleaner and have the resources cleaned appropriately based on a specified policy. 'Shard Split' is the first consumer using this new infrastructure.
Note : We only support adding 'shards' as resources to be cleaned-up right now but the framework will be extended to support other resources in future.

Deferred Drop for Split
Deferred Drop Support ensures that shards undergoing split are not dropped inline as part of operation but dropped later when no active read queries are running on shard. This helps with :

Avoids any potential deadlock scenarios that can cause long running Split operation to rollback.
Avoids Split operation blocking writes and then getting blocked (due to running queries on the shard) when trying to drop shards.
Deferred drop is the new default behavior going forward.
Shard Cleaner Extension
Shard Cleaner is a background task responsible for deferred drops in case of 'Move' operations.
The cleaner has been extended to ensure robust cleanup of shards (dummy shards and split children) in case of a failure based on the new infrastructure mentioned above. The cleaner also handles deferred drop for 'Splits'.

TESTING:
New test ''citus_split_shard_by_split_points_deferred_drop' to test deferred drop support.
New test 'failure_split_cleanup' to test shard cleanup with failures in different stages.
Update 'isolation_blocking_shard_split and isolation_non_blocking_shard_split' for deferred drop.
Added non-deferred drop version of existing tests : 'citus_split_shard_no_deferred_drop' and 'citus_non_blocking_splits_no_deferred_drop'
2022-09-06 12:11:20 -07:00
Marco Slot 6bb31c5d75
Add non-blocking variant of create_distributed_table (#6087)
Added create_distributed_table_concurrently which is nonblocking variant of create_distributed_table.

It bases on the split API which takes advantage of logical replication to support nonblocking split operations.

Co-authored-by: Marco Slot <marco.slot@gmail.com>
Co-authored-by: aykutbozkurt <aykut.bozkurt1995@gmail.com>
2022-08-30 15:35:40 +03:00
Sameer Awasekar 4df8eca77f
Add worker_split_shard_release_dsm udf to release dynamic shared memory (#6248)
The code introduces worker_split_shard_release_dsm udf to release the dynamic shared memory segment allocated during non-blocking split workflow.
2022-08-26 18:27:32 +05:30
Ahmet Gedemenli 0631e1998b
Fix upgrade paths for #6100 (#6176)
* Fix upgrade paths for #6100

Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
2022-08-17 18:56:53 +03:00
aykut-bozkurt 52efe08642
default mode for shard splitting is set to auto. (#6179) 2022-08-17 12:18:47 +03:00
aykut-bozkurt be06d65721
Nonblocking tenant isolation is supported by using split api. (#6167) 2022-08-17 11:13:07 +03:00
Jelte Fennema 8017693b2f
Allow specifying the shard_transfer_mode when replicating reference tables (#6070)
When using `citus.replicate_reference_tables_on_activate = off`,
reference tables need to be replicated later. This can be done using the
`replicate_reference_tables()` UDF. However, this function only allowed
blocking replication. This changes the function to default to logical
replication instead, and allows choosing any of our existing shard
transfer modes.
2022-08-09 13:21:31 +03:00
Jelte Fennema dd548ee3c7
Use faster custom copy logic for non-blocking shard moves (#6119)
DESCRIPTION: Use faster custom copy logic for non-blocking shard moves

Non-blocking shard moves consist of two main phases:
1. Initial data copy
2. Catchup phase

This changes the first of these phases significantly. Previously we used the
copy logic provided by postgres subscriptions. This meant we didn't have
to implement it ourselves, but it came with the downside of little control.
When implementing shard splits we needed more control to even make it
work, so we implemented our own logic for copying data between nodes.

This PR starts using that logic for non-blocking shard moves. Doing so
has four main advantages:
1. It uses COPY in binary format when possible, which is cheaper to encode 
    and decode. Furthermore it very often results in less data that needs to 
    be sent over the network.
2. It allows us to create the primary key (or other replica identity) after doing
    the initial data copy. This should give some speed up over the total run,
    because creating an index is bulk is much faster than incrementally building it.
3. It doesn't require a replication slot per parallel copy. Increasing the maximum
    number of replication slots uses resources in postgres, even if they are not used.
    So reducing the number of replication slots that shard moves need is nice.
4. Logical replication table_sync workers are slow to start up, so if lots of shards
    need to be copied that can make it quite slow. This can happen easily when
    combining Postgres partitioning with Citus.
2022-08-08 17:09:43 +02:00
Ahmet Gedemenli 8b68b0b5bb
Fix pg upgrade script for foreign tables (#6100)
Fixes unexpected error for foreign tables when upgrading pg
2022-08-05 13:35:17 +03:00
Sameer Awasekar e236711eea Introduce Non-Blocking Shard Split Workflow 2022-08-04 16:32:38 +02:00
Jelte Fennema abffa6c3b9
Use shard split copy code for blocking shard moves (#6098)
The new shard copy code that was created for shard splits has some
advantages over the old shard copy code. The old code was using 
worker_append_table_to_shard, which wrote to disk twice. And it also 
didn't use binary copy when that was possible. Both of these issues
were fixed in the new copy code. This PR starts using this new copy
logic also for shard moves, not just for shard splits.

On my local machine I created a single shard table like this.
```sql
set citus.shard_count = 1;
create table t(id bigint, a bigint);
select create_distributed_table('t', 'id');

INSERT into t(id, a) SELECT i, i from generate_series(1, 100000000) i;
```

I then turned `fsync` off to make sure I wasn't bottlenecked by disk. 
Finally I moved this shard between nodes with `citus_move_shard_placement`
with `block_writes`.

Before this PR a move took ~127s, after this PR it took only ~38s. So for this 
small test this resulted in spending ~70% less time.

And I also tried the same test for a table that contained large strings:
```sql
set citus.shard_count = 1;
create table t(id bigint, a bigint, content text);
select create_distributed_table('t', 'id');

INSERT into t(id, a, content) SELECT i, i, 'aunethautnehoautnheaotnuhetnohueoutnehotnuhetncouhaeohuaeochgrhgd.athbetndairgexdbuhaobulrhdbaetoausnetohuracehousncaoehuesousnaceohuenacouhancoexdaseohusnaetobuetnoduhasneouhaceohusnaoetcuhmsnaetohuacoeuhebtokteaoshetouhsanetouhaoug.lcuahesonuthaseauhcoerhuaoecuh.lg;rcydabsnetabuesabhenth' from generate_series(1, 20000000) i;
```
2022-08-01 20:10:36 +03:00
Hanefi Onaldi eb3e5ee227 Introduce citus_locks view
citus_locks combines the pg_locks views from all nodes and adds
global_pid, nodeid, and relation_name. The columns of citus_locks don't
change based on the Postgres version, however the pg_locks's columns do.
Postgres 14 added one more column to pg_locks (waitstart timestamptz).
citus_locks has the most expansive column set, including the newly added
column. If citus_locks is queried in a Postgres version where pg_locks
doesn't have some columns, the values for those columns in citus_locks
will be NULL
2022-07-21 03:06:57 +03:00
Nitish Upreti 5b3537cdff
Shard Split for Citus (#6029)
* Blocking split setup

* Add missing type

* Missing API from Metadata Sync

* Shard Split e2e code

* Worker Split Copy DestReceiver skeleton

* Basic destreceiver code

* worker_split_copy UDF

* UDF calling

* Split points are text

* Isolate Tenant and Split Shard Unification

* Fixing executor and misc

* Reindent code

* Fixing UDF definitions

* Hello World Local Copy works

* Remote copy hello world works

* Local and Remote binary test

* Fixing text local copy and adding tests

* Hello World shard split works

* Negative tests

* Blocking Split workflow works

* Refactor

* Bug fix

* Reindent

* Cleaning up and adding comments

* Basic test for shard split workflow

* ReIndent

* Circle CI integration

* Removing include causing circle-ci build failure

* Remove SplitCopyDestReceiver and use PartitionedResultDestReceiver

* Add support for citus.enable_binary_protocol

* Reindent

* Fix build break

* Update Test

* Cleanup on catch

* Addressing open comments

* Update downgrade script and quote schema/table in COPY statement

* Fix metadata sync issue. Update regression test

* Isolation test and bug fix

* Add Isolation test, fix foreign constraint deadlock issue

* Misc code review comments

* Test name needing to be quoted

* Refactor code from review comments

* Explaining shardGroupSplitIntervalListList

* Fix upgrade & downgrade

* Fix broken test

* Test fix Round 2

* Fixing bug and modifying test appropriately

* Fully qualify copy udf name. Run Reindent

* Address PR comments

* Fix null handling when creating AuxiliaryStructures

* Ensure local copy is triggered in tests

* Limit max shards that can be created with split

* Test failure fix

* Remove split_mode and use shard_transfer_mode instead'

* Fix test failure

* Fix test failure

* Fixing permission issue when splitting non-superuser owned tables

* Fix test expected output

* Remove extra space

* Fix test

* attempt to fix test

* Addressing Marco's PR comment

* Only clean shards created by workflow

* Remove from merge

* Update test
2022-07-18 02:54:15 -07:00
ywj 1675519f93
Support citus_columnar as separate extension (#5911)
* Support upgrade and downgrade and separate columnar as citus_columnar extension

Co-authored-by: Yanwen Jin <yanwjin@microsoft.com>
Co-authored-by: Jeff Davis <jeff@j-davis.com>
2022-07-13 21:08:29 -07:00
Halil Ozan Akgul 1490acbbe9 Removes incorrect parameter from get_all_active_transactions 2022-07-06 11:35:46 +03:00
Hanefi Onaldi f60809a6c1
Fix downgrade scripts from 11.0-2 to 11.0-1 (#6039) 2022-06-29 22:43:50 +03:00
Onder Kalaci bab4c0a8c3 Fixes a bug that prevents upgrades when there are no worker nodes 2022-06-28 15:54:49 +02:00
Jelte Fennema 184c7c0bce
Make enterprise features open source (#6008)
This PR makes all of the features open source that were previously only
available in Citus Enterprise.

Features that this adds:
1. Non blocking shard moves/shard rebalancer
   (`citus.logical_replication_timeout`)
2. Propagation of CREATE/DROP/ALTER ROLE statements
3. Propagation of GRANT statements
4. Propagation of CLUSTER statements
5. Propagation of ALTER DATABASE ... OWNER TO ...
6. Optimization for COPY when loading JSON to avoid double parsing of
   the JSON object (`citus.skip_jsonb_validation_in_copy`)
7. Support for row level security
8. Support for `pg_dist_authinfo`, which allows storing different
   authentication options for different users, e.g. you can store
   passwords or certificates here.
9. Support for `pg_dist_poolinfo`, which allows using connection poolers
   in between coordinator and workers
10. Tracking distributed query execution times using
   citus_stat_statements (`citus.stat_statements_max`,
   `citus.stat_statements_purge_interval`,
   `citus.stat_statements_track`). This is disabled by default.
11. Blocking tenant_isolation
12. Support for `sslkey` and `sslcert` in `citus.node_conninfo`
2022-06-16 00:23:46 -07:00
Marco Slot 36c4ec6d53 Introduce a citus_finish_citus_upgrade() function 2022-06-13 13:15:15 +02:00
Onder Kalaci dd02e1755f Parallelize metadata syncing on node activate
It is often useful to be able to sync the metadata in parallel
across nodes.

Also citus_finalize_upgrade_to_citus11() uses
start_metadata_sync_to_primary_nodes() after this commit.

Note that this commit does not parallelize all pieces of node
activation or metadata syncing. Instead, it tries to parallelize
potenially large parts of metadata, which is the objects and
distributed tables (in general Citus tables).

In the future, it would be nice to sync the reference tables
in parallel across nodes.

Create ~720 distributed tables / ~23450 shards
```SQL
-- declaratively partitioned table
CREATE TABLE github_events_looooooooooooooong_name (
  event_id bigint,
  event_type text,
  event_public boolean,
  repo_id bigint,
  payload jsonb,
  repo jsonb,
  actor jsonb,
  org jsonb,
  created_at timestamp
) PARTITION BY RANGE (created_at);

SELECT create_time_partitions(
  table_name         := 'github_events_looooooooooooooong_name',
  partition_interval := '1 day',
  end_at             := now() + '24 months'
);

CREATE INDEX ON github_events_looooooooooooooong_name USING btree (event_id, event_type, event_public, repo_id);
SELECT create_distributed_table('github_events_looooooooooooooong_name', 'repo_id');

SET client_min_messages TO ERROR;

```

across 1 node: almost same as expected
```SQL

SELECT start_metadata_sync_to_primary_nodes();
Time: 15664.418 ms (00:15.664)

select start_metadata_sync_to_node(nodename,nodeport) from pg_dist_node;
Time: 14284.069 ms (00:14.284)
```

across 7 nodes: ~3.5x improvement
```SQL

SELECT start_metadata_sync_to_primary_nodes();
┌──────────────────────────────────────┐
│ start_metadata_sync_to_primary_nodes │
├──────────────────────────────────────┤
│ t                                    │
└──────────────────────────────────────┘
(1 row)

Time: 25711.192 ms (00:25.711)

-- across 7 nodes
select start_metadata_sync_to_node(nodename,nodeport) from pg_dist_node;
Time: 82126.075 ms (01:22.126)
```
2022-05-23 09:15:48 +02:00
jeff-davis a9f8a60007
Columnar: support relation options with ALTER TABLE. (#5935)
Columnar: support relation options with ALTER TABLE.

Use ALTER TABLE ... SET/RESET to specify relation options rather than
alter_columnar_table_set() and alter_columnar_table_reset().

Not only is this more ergonomic, but it also allows better integration
because it can be treated like DDL on a regular table. For instance,
citus can use its own ProcessUtility_hook to distribute the new
settings to the shards.

DESCRIPTION: Columnar: support relation options with ALTER TABLE.
2022-05-20 08:35:00 -07:00
Marco Slot 79d7e860e6 Add a run_command_on_coordinator function 2022-05-19 10:26:09 +02:00
Marco Slot fa9cee409c Fix downgrade scripts and add new downgrade tests 2022-05-19 10:26:09 +02:00
Onder Kalaci 127450466e Do not warn unncessarily when a node is removed
In the past (pre-11), we allowed removing worker nodes
that had active placements for replicated distributed
table, without even checking if there are any other
replicas of the same placement.

However, with #5469, we prevent disabling nodes via a hard
error when there is the last active placement of shard, as we
do for reference tables. Note that otherwise, we'd allow
users to lose data.

As of today, the NOTICE is completely irrelevant.
2022-05-18 17:23:38 +02:00
Onder Kalaci db998b3d66 Adds "sync" option to citus_disable_node() UDF
Before this commit, we had:
```SQL
SELECT citus_disable_node(nodename, nodeport, force boolean DEFAULT false)
```

Where, we allow forcing to disable first worker node with
`force:=true`. However, it entails the risk for losing
data / diverging placement data etc.

With `force` flag, we control disabling the first worker node,
and with `async` flag we control whether the changes are done
via bg worker or immediately.

```SQL
SELECT citus_disable_node(nodename, nodeport, force boolean DEFAULT false, sync boolean DEFAULT false)
```

Where we can achieve all the following:

| Mode  | Data loss possibility | Can run in 2PC | Handle multiple node failures | Immediately effective |
| --- |--- |--- |--- |--- |
| force:false, sync: false  | false   | true  | true  | false |
| force:false, sync: true   | false  | false | false | true |
| force:true, sync: false   | true   | true  | true   | false |
| force:true, sync: true    | false  | false | false  | true |
2022-05-18 17:21:12 +02:00
Marco Slot 6fad5dc207 Add a citus_is_coordinator function 2022-05-13 10:02:52 +02:00
Gledis Zeneli 4c6f62efc6
Switch to using LOCK instead of lock_relation_if_exists in TRUNCATE (#5930)
Breaking down #5899 into smaller PR-s

This particular PR changes the way TRUNCATE acquires distributed locks on the relations it is truncating to use the LOCK command instead of lock_relation_if_exists. This has the benefit of using pg's recursive locking logic it implements for the LOCK command instead of us having to resolve relation dependencies and lock them explicitly. While this does not directly affect truncate, it will allow us to generalize this locking logic to then log different relations where the pg recursive locking will become useful (e.g. locking views).

This implementation is a bit more complex that it needs to be due to pg not supporting locking foreign tables. We can however, still lock foreign tables with lock_relation_if_exists. So for a command:

TRUNCATE dist_table_1, dist_table_2, foreign_table_1, foreign_table_2, dist_table_3;

We generate and send the following command to all the workers in metadata:
```sql
SEL citus.enable_ddl_propagation TO FALSE;
LOCK dist_table_1, dist_table_2 IN ACCESS EXCLUSIVE MODE;
SELECT lock_relation_if_exists('foreign_table_1', 'ACCESS EXCLUSIVE');
SELECT lock_relation_if_exists('foreign_table_2', 'ACCESS EXCLUSIVE');
LOCK dist_table_3 IN ACCESS EXCLUSIVE MODE;
SEL citus.enable_ddl_propagation TO TRUE;
```

Note that we need to alternate between the lock command and lock_table_if_exists in order to preserve the TRUNCATE order of relations.
When pg supports locking foreign tables, we will be able to massive simplify this logic and send a single LOCK command.
2022-05-11 18:38:48 +03:00
Marco Slot ceb593c9da Convert citus.hide_shards_from_app_name_prefixes to citus.show_shards_for_app_name_prefixes 2022-05-03 14:22:13 +02:00
Marco Slot 9476f377b5 Remove old re-partitioning functions 2022-04-04 18:11:52 +02:00
Halil Ozan Akgul 4dbc760603 Introduces citus_coordinator_node_id 2022-03-22 10:34:22 +03:00
Marco Slot 5bb5359da0 Fix worker node version check 2022-03-17 13:23:02 +01:00
Marco Slot 22a18fc1f2 Fix typo in upgrade function 2022-03-17 13:23:02 +01:00
Jelte Fennema c8839de68b
Don't use cascading deletes in Citus 11 migration script (#5767)
Using CASCADE in a DELETE can inadvertently delete things we don't
intend to. It's safer to fail hard and make the user delete depending
things manually.
2022-03-09 14:35:23 +01:00
Halil Ozan Akgül 333bcc7948
Global PID Helper Functions (#5768)
* Introduces citus_nodename_for_nodeid and citus_nodeport_for_nodeid functions

* Introduces citus_nodeid_for_gpid and citus_pid_for_gpid functions

* Add tests
2022-03-09 13:15:59 +03:00
Onder Kalaci c32b2de1a7 Improve citus_lock_waits
1) Remove useless columns
2) Show backends that are blocked on a DDL even before
   gpid is assigned
3) One minor bugfix, where we clear distributedCommandOriginator
   properly.
2022-03-07 11:10:44 +01:00
Ahmet Gedemenli 2a3c0c1914
Revert upgrade script changes (#5757) 2022-03-07 13:04:58 +03:00
Nils Dijk 3801576dfb
Move pg_dist_object to pg_catalog (#5765)
DESCRIPTION: Move pg_dist_object to pg_catalog

Historically `pg_dist_object` had been created in the `citus` schema as an experiment to understand if we could move our catalog tables to a branded schema. We quickly realised that this interfered with the UX on our managed services and other environments, where users connected via a user with the name of `citus`.

By default postgres put the username on the search_path. To be able to read the catalog in the `citus` schema we would need to grant access permissions to the schema. This caused newly created objects like tables etc, to default to this schema for creation. This failed due to the write permissions to that schema.

With this change we move the `pg_dist_object` catalog table to the `pg_catalog` schema, where our other schema's are also located. This makes the catalog table visible and readable by any user, like our other catalog tables, for debugging purposes.

Note: due to the change of schema, we had to disable 1 test that was running into a discrepancy between the schema and binary. Secondly, we needed to make the lookup functions for the `pg_dist_object` relation and their indexes less strict on the fallback of the naming due to an other test that, due to an unfortunate cache invalidation, needed to lookup the relation again. This makes that we won't default to _only_ resolving from `pg_catalog` outside of upgrades.
2022-03-04 17:40:38 +00:00
Halil Ozan Akgul 0500a62515 Updates citus_dist_stat_activity to use citus_stat_activity 2022-03-04 17:28:17 +03:00
Onder Kalaci c7b67ba0ea Add citus_backend_gpid()
And also citus_calculate_gpid(nodeId,pid). These UDFs are just
wrappers for the existing functions. Useful for testing and simple
manipulation of citus_stat_activity.
2022-03-03 15:29:40 +01:00
Halil Ozan Akgul 06a0509b1a Introduces citus_stat_activity view 2022-03-03 16:19:20 +03:00
Marco Slot 3ba61244b8 Synchronize pg_dist_colocation metadata 2022-03-03 11:01:59 +01:00
Onder Kalaci 35ec9721b4 Add a new API for enabling Citus MX for clusters upgrading from earlier versions
Clusters created pre-Citus 11 mostly didn't have metadata sync enabled.
For those clusters, we add a utility UDF which fixes some minor issues
and sync the necessary objects to the workers.
2022-03-02 17:02:55 +01:00
Ahmet Gedemenli e1809af376 Propagate CREATE AGGREGATE commands 2022-03-02 10:52:43 +03:00
Hanefi Onaldi 6c25eea62f Fix some typos in comments 2022-02-24 19:48:52 +03:00
Marco Slot 0c4e3cb69c Drop worker_partition_query_result on downgrade 2022-02-24 10:18:56 +01:00
Marco Slot 72d8fde28b Use intermediate results for re-partition joins 2022-02-23 19:40:21 +01:00