98 Commits (a77d35f4af4b44f90580540f5f5cda124eb23902)

Author SHA1 Message Date
Gledis Zeneli 27ddb4fc8e
Do not obtain AccessShareLock before actual lock (#5965)
Do not obtain AccessShareLock before acquiring the distributed locks.

Acquiring an AccessShareLock ensures that the relations on which we are trying to get a distributed lock will not be dropped between the time the LOCK command is issued and the time the LOCK commands are sent to the workers. However, this also leads to distributed deadlocks in scenarios such as:

```sql
-- for dist lock acquiring order coor, w1, w2

-- on w2
LOCK t1 IN ACCESS EXCLUSIVE MODE;
-- acquire AccessShareLock locally on t1 to ensure it is not dropped while we get ready to distribute the lock

      -- concurrently on w1
      LOCK t1 IN ACCESS EXCLUSIVE MODE;
      -- acquire AccessShareLock locally on t1 to ensure it is not dropped while we get ready to distribute the lock
      -- acquire dist lock on coor, w1, gets blocked on local AccessShareLock on w2

-- on w2 continuation of the execution above
-- starts to acquire dist locks and gets blocked on the coor by the lock acquired by w1

-- distributed deadlock

``` 

We opt to avoid such deadlocks at the cost of possibly running into errors when the relations we are trying to lock get dropped.
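
The scenario above can be read as a wait-for graph: a distributed deadlock exists when that graph contains a cycle. The following is a minimal sketch for illustration only. The session names and the `has_cycle` helper are hypothetical and are not part of Citus.

```python
def has_cycle(wait_for):
    """Return True if the wait-for graph (session -> sessions it waits on) has a cycle."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            # We came back to a session that is still on the current path.
            return True
        visiting.add(node)
        for blocker in wait_for.get(node, ()):
            if visit(blocker):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(node) for node in list(wait_for))


# With the early AccessShareLock: the session started on w1 is blocked by the
# local AccessShareLock held by the session on w2, while the session on w2 is
# blocked on the coordinator by the distributed lock already taken for w1.
with_early_lock = {"session_w1": ["session_w2"], "session_w2": ["session_w1"]}

# Without the early AccessShareLock: the session on w2 holds nothing yet,
# so it simply waits behind the session on w1.
without_early_lock = {"session_w2": ["session_w1"], "session_w1": []}

print(has_cycle(with_early_lock), has_cycle(without_early_lock))
```

Only the first graph has a cycle, which matches the trade-off described above: dropping the early lock removes the cycle but no longer protects against a concurrent drop.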
2022-05-23 13:06:38 +03:00
Hanefi Onaldi 52541c5802 Add normalization rules for flaky isolation tests
We remove `<waiting ...>` and `<... completed>` outputs for some CREATE
INDEX CONCURRENTLY commands since they can cause flakiness in some scenarios.

Postgres calls WaitForOlderSnapshots() and this can cause CREATE INDEX
CONCURRENTLY commands for shards to get blocked by each other for brief
periods of time. The extra waits can pop up, or they can get completed
at different lines in the output files. To remedy that, we rename those
indexes to be captured by the new normalization rule.
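
As a rough illustration of what such a normalization rule does, the sketch below strips the two markers from output lines that mention a specially named index. The marker strings come from the message above; the `flaky_idx` prefix, the line format, and the `normalize` helper are assumptions made for this example, and the real rule in the test suite may look different.

```python
import re

# Hypothetical naming convention: only indexes with this prefix are normalized.
FLAKY_INDEX = re.compile(r"\bflaky_idx\w*")
WAITING = re.compile(r"\s*<waiting \.\.\.>")
COMPLETED = re.compile(r"\s*<\.\.\. completed>")


def normalize(lines):
    """Strip wait markers from output lines that mention a flaky index."""
    result = []
    for line in lines:
        if FLAKY_INDEX.search(line):
            line = WAITING.sub("", line)
            line = COMPLETED.sub("", line)
        result.append(line)
    return result
```

Lines that do not mention a renamed index pass through unchanged, so waits that are meaningful for other tests stay visible.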
2022-05-21 00:55:47 +03:00
Marco Slot ad5214b50c Allow distributed execution from run_command_on_* functions 2022-05-20 15:26:47 +02:00
gledis69 4731630741 Add distributing lock command support 2022-05-20 12:28:07 +03:00
Halil Ozan Akgul d171a736ab Revert "Creates new colocation for colocate_with:='none' too"
This reverts commit f74447b3b7.
2022-05-17 15:32:22 +03:00
Halil Ozan Akgul f74447b3b7 Creates new colocation for colocate_with:='none' too 2022-05-16 13:39:05 +03:00
Gledis Zeneli 4c6f62efc6
Switch to using LOCK instead of lock_relation_if_exists in TRUNCATE (#5930)
Breaking down #5899 into smaller PRs

This particular PR changes the way TRUNCATE acquires distributed locks on the relations it is truncating: it now uses the LOCK command instead of lock_relation_if_exists. The benefit is that we rely on the recursive locking logic pg implements for the LOCK command instead of having to resolve relation dependencies and lock them explicitly. While this does not directly affect TRUNCATE, it will allow us to generalize this locking logic to lock other kinds of relations where the pg recursive locking will become useful (e.g. locking views).

This implementation is a bit more complex than it needs to be because pg does not support locking foreign tables. We can, however, still lock foreign tables with lock_relation_if_exists. So for a command:

TRUNCATE dist_table_1, dist_table_2, foreign_table_1, foreign_table_2, dist_table_3;

We generate and send the following command to all the workers in metadata:
```sql
SET citus.enable_ddl_propagation TO FALSE;
LOCK dist_table_1, dist_table_2 IN ACCESS EXCLUSIVE MODE;
SELECT lock_relation_if_exists('foreign_table_1', 'ACCESS EXCLUSIVE');
SELECT lock_relation_if_exists('foreign_table_2', 'ACCESS EXCLUSIVE');
LOCK dist_table_3 IN ACCESS EXCLUSIVE MODE;
SET citus.enable_ddl_propagation TO TRUE;
```

Note that we need to alternate between the LOCK command and lock_relation_if_exists in order to preserve the TRUNCATE order of relations.
When pg supports locking foreign tables, we will be able to massively simplify this logic and send a single LOCK command.
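
The alternation can be sketched as grouping consecutive relations by whether they are foreign tables. This is an illustrative Python sketch, not the actual C implementation: `build_lock_commands` and its arguments are hypothetical names, and identifier quoting is omitted for brevity.

```python
from itertools import groupby


def build_lock_commands(relations, foreign_tables):
    """Build worker commands for TRUNCATE while preserving relation order.

    relations: relation names in the order given to TRUNCATE.
    foreign_tables: set of names known to be foreign tables (assumed input).
    """
    commands = ["SET citus.enable_ddl_propagation TO FALSE;"]
    # groupby only merges *consecutive* items, which keeps the original order.
    for is_foreign, group in groupby(relations, key=lambda r: r in foreign_tables):
        names = list(group)
        if is_foreign:
            commands += [
                f"SELECT lock_relation_if_exists('{name}', 'ACCESS EXCLUSIVE');"
                for name in names
            ]
        else:
            commands.append(f"LOCK {', '.join(names)} IN ACCESS EXCLUSIVE MODE;")
    commands.append("SET citus.enable_ddl_propagation TO TRUE;")
    return commands
```

Called with the five relations from the TRUNCATE example and the two foreign tables, this yields a command list of the shape shown above. If no relation is a foreign table, it collapses to a single LOCK command.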
2022-05-11 18:38:48 +03:00
Burak Velioglu 1460452442 Introduce CREATE/DROP VIEW
Adds support for propagating create/drop view commands and views to
worker nodes while scaling out the cluster. Since views are dropped while
converting the table type, the metadata connection is used while
propagating view commands to avoid switching to sequential mode.
2022-05-10 13:07:14 +03:00
Önder Kalacı dd78c81378
Fix flaky isolation - 1 (#5900)
* Do not show any PG internal queries
2022-04-11 20:43:51 -07:00
Burak Velioglu 5d9599f964
Create function in transaction according to create object propagation guc 2022-04-08 17:15:31 +03:00
Halil Ozan Akgül 37fafd007c
Turn metadata sync on in isolation_update_node and isolation_update_node_lock_writes tests (#5779) 2022-03-11 16:39:20 +03:00
Halil Ozan Akgül c9913b135c
Turn metadata sync on in isolation_ref2ref_foreign_keys test (#5791) 2022-03-11 13:30:11 +03:00
Halil Ozan Akgül 2edaf0971c
Turn metadata sync on in isolation reference copy vs all (#5790)
* Turn metadata sync on in isolation_reference_copy_vs_all test

* Update the output of isolation_reference_copy_vs_all test
2022-03-11 11:27:46 +03:00
Marco Slot 7559ad12ba Change create_object_propagation default to immediate 2022-03-09 17:40:50 +01:00
Halil Ozan Akgül 333bcc7948
Global PID Helper Functions (#5768)
* Introduces citus_nodename_for_nodeid and citus_nodeport_for_nodeid functions

* Introduces citus_nodeid_for_gpid and citus_pid_for_gpid functions

* Add tests
2022-03-09 13:15:59 +03:00
Onder Kalaci c32b2de1a7 Improve citus_lock_waits
1) Remove useless columns
2) Show backends that are blocked on a DDL even before
   gpid is assigned
3) One minor bugfix, where we clear distributedCommandOriginator
   properly.
2022-03-07 11:10:44 +01:00
Nils Dijk 3801576dfb
Move pg_dist_object to pg_catalog (#5765)
DESCRIPTION: Move pg_dist_object to pg_catalog

Historically `pg_dist_object` had been created in the `citus` schema as an experiment to understand if we could move our catalog tables to a branded schema. We quickly realised that this interfered with the UX on our managed services and other environments, where users connected via a user with the name of `citus`.

By default postgres puts the username on the search_path. To be able to read the catalog in the `citus` schema we would need to grant access permissions to the schema. This caused newly created objects, such as tables, to default to this schema for creation, which then failed due to missing write permissions on that schema.

With this change we move the `pg_dist_object` catalog table to the `pg_catalog` schema, where our other catalog tables are also located. This makes the catalog table visible and readable by any user, like our other catalog tables, for debugging purposes.

Note: due to the change of schema, we had to disable one test that was running into a discrepancy between the schema and the binary. Secondly, we needed to make the lookup functions for the `pg_dist_object` relation and its indexes less strict about the naming fallback, because another test needed to look up the relation again after an unfortunate cache invalidation. As a result, we no longer default to resolving _only_ from `pg_catalog` outside of upgrades.
2022-03-04 17:40:38 +00:00
Halil Ozan Akgul 0500a62515 Updates citus_dist_stat_activity to use citus_stat_activity 2022-03-04 17:28:17 +03:00
Halil Ozan Akgul 06a0509b1a Introduces citus_stat_activity view 2022-03-03 16:19:20 +03:00
Marco Slot 43e4dd3808 Add a citus.internal_reserved_connections setting 2022-03-02 19:13:53 +01:00
Onder Kalaci e80a36c4b6 Improve visibility rules for non-privileged roles
It seems like our approach is way too restrictive and wrong in some
places. Now, we follow an approach very similar to pg_stat_activity.

Some of the changes are prerequisites for implementing citus_dist_stat_activity
via citus_stat_activity.
2022-03-02 18:04:01 +01:00
Onder Kalaci df95d59e33 Drop support for CitusInitiatedBackend
CitusInitiatedBackend was a premature implementation of the whole
GlobalPID infrastructure. We used it to track whether any individual
query is triggered by Citus or not.

Now that GlobalPID is in place, we don't need
CitusInitiatedBackend; in fact, it could even be wrong.
2022-02-24 12:12:43 +01:00
Hanefi Onaldi 7bd6c2c9ac
Isolation tests for various ddl operations and metadata sync 2022-02-24 03:19:56 +03:00
Onder Kalaci 95d5918967 Properly set worker_query and use 2022-02-21 18:22:33 +01:00
Onder Kalaci dffcafc096 Use global pids in citus_lock_waits 2022-02-21 17:46:34 +01:00
Hanefi Onaldi 2e5ca8ba2b
Add isolation tests for metadata sync vs all
This commit introduces several test cases for concurrent operations that
change metadata, and a concurrent metadata sync operation.

The overall structure is as follows:
- Session#1 starts metadata syncing in a transaction block
- Session#2 does an operation that changes metadata
- Both sessions are committed
- Another session checks whether the metadata are the same across all
  nodes in the cluster.
2022-02-11 01:55:04 +03:00
Önder Kalacı dc6c194916
Show IDLE backends in citus_dist_stat_activity (#5700)
* Break the dependency to CitusInitiatedBackend infrastructure

With this change, we start to show non-distributed backends as well
in citus_dist_stat_activity. I think that
  (a) it is essential for making citus_lock_waits work for backends
      blocked on DDL commands.
  (b) it is more expected from the user's perspective. The name of
      the view is a little inconsistent now (e.g., citus_dist_stat_activity)
      but we are already planning to improve the names with followup
      PRs.

Also, now that we have global pids assigned, CitusInitiatedBackend
becomes obsolete.
2022-02-10 08:59:28 -08:00
Ahmet Gedemenli 76b63a307b Propagate create/drop schema commands 2022-02-10 14:58:09 +03:00
Halil Ozan Akgul 8ee02b29d0 Introduce global PID 2022-02-08 16:49:38 +03:00
Onder Kalaci 923bb194a4 Move isolation_multiuser_locking to MX tests 2022-02-04 10:52:57 +01:00
Onder Kalaci ff234fbfd2 Unify old GUCs into a single one
Replaces citus.enable_object_propagation with citus.enable_metadata_sync

Also, within the Citus 11 release cycle, we added citus.enable_metadata_sync_by_default,
which is also replaced with citus.enable_metadata_sync.

In essence, when citus.enable_metadata_sync is set to true, all the objects
and the metadata are sent to the remote node.

We strongly advise that users never change the value of
this GUC.
2022-02-04 10:52:56 +01:00
Burak Velioglu f88cc230bf
Handle tables and objects as metadata. Update UDFs accordingly
With this commit we've started to propagate sequences and shell
tables within the object dependency resolution. So, ensuring any
dependencies for any object will consider shell tables and sequences
as well. The separate logic for both shell tables and sequences has
been removed.

Since both the shell table and sequence logic were implemented as a
part of metadata handling before this change, we were propagating
them while syncing table metadata. With this commit we've divided the
metadata (which means anything except shards hereafter) syncing
logic into multiple parts and implemented each as a part of
ActivateNode. You can check the functions called in ActivateNode
to see the definition of the different kinds of metadata.

Definitions of start_metadata_sync_to_node and citus_activate_node
have also been updated. citus_activate_node will basically create
an active node with all metadata and reference table shards.
start_metadata_sync_to_node will be the same as citus_activate_node
except that it does not replicate reference tables. stop_metadata_sync_to_node
will remove all the metadata. All of those UDFs need to be called
by a superuser.
2022-01-31 16:20:15 +03:00
Halil Ozan Akgul 9547228e8d Add isolation_check_mx test 2021-12-30 14:58:30 +03:00
Onder Kalaci 549edcabb6 Allow disabling node(s) when multiple failures happen
As of the master branch, Citus does all the modifications to replicated tables
(e.g., reference tables and distributed tables with replication factor > 1)
via 2PC and avoids any shardstate=3. As a side effect of those changes,
the handling of node failures for replicated tables changes.

With this PR, when one (or multiple) node failures happen, the users would
see query errors on modifications. If the problem is intermittent, that's OK:
once the node failure(s) recover by themselves, the modification queries would
succeed. If the node failure(s) are permanent, the users should call
`SELECT citus_disable_node(...)` to disable the node. As soon as the node is
disabled, modifications would start to succeed. However, the old node now falls
behind. This means that, when the node is up again, the placements should be
re-created on the node. First, use `SELECT citus_activate_node()`. Then, use
`SELECT replicate_table_shards(...)` to replicate the missing placements on
the re-activated node.
2021-12-01 10:19:48 +01:00
Onder Kalaci 38b08ebde9 Generalize the error checks while removing node
The checks that prevent removing a node are very much reference
table centric. We are soon going to add the same checks for replicated
tables. So, make the checks generic such that:
	 (a) replicated tables fit naturally
	 (b) we can use the same checks in `citus_disable_node`.
2021-11-26 14:25:29 +01:00
Hanefi Onaldi 4c135de9e4 Introduce CI checks for hash comments in specs
We do not use comments starting with # in spec files because they cause
errors from the C preprocessor, which expects directives after this character.
Instead use C style comments, i.e.:
// single line comment

You can also use multiline comments:
/*
 * multi line comment
 */
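
A check of this kind can be sketched in a few lines. The snippet below is only an illustration of the idea, assuming that any line whose first non-blank character is `#` should be flagged; the actual CI script may be implemented differently, and `find_hash_comments` is a hypothetical name.

```python
def find_hash_comments(spec_text):
    """Return (line_number, line) pairs for lines whose first non-blank character is '#'."""
    return [
        (number, line)
        for number, line in enumerate(spec_text.splitlines(), start=1)
        if line.lstrip().startswith("#")
    ]


sample = "\n".join([
    "// single line comment",
    "# this comment would break the C preprocessor",
    "step \"s1-begin\" { BEGIN; }",
])
print(find_hash_comments(sample))
```

A CI job could fail whenever the returned list is non-empty for any spec file.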
2021-11-26 14:52:51 +03:00
Marco Slot f49d26fbeb Remove citus_update_table_statistics isolation test 2021-11-19 10:51:15 +01:00
Marco Slot 9e6ca23286 Remove cstore_fdw-related logic 2021-11-16 13:59:03 +01:00
Önder Kalacı 8c0bc94b51
Enable replication factor > 1 in metadata syncing (#5392)
- [x] Add some more regression test coverage
- [x] Make sure returning works fine in case of
     local execution + remote execution
     (task->partiallyLocalOrRemote works as expected, already added tests)
- [x] Implement locking properly (and add isolation tests)
     - [x] We used to do #shardcount round-trips on `SerializeNonCommutativeWrites`.
           We made it a single round-trip.
- [x] Acquire locks for subselects on the workers & add isolation tests
- [x] Add a GUC to prevent modification from the workers, hence increase the
      coordinator-only throughput
       - The performance slightly drops (~15%), unless
         `citus.allow_modifications_from_workers_to_replicated_tables`
         is set to false
2021-11-15 15:10:18 +03:00
Önder Kalacı d5b371b2e0
Merge branch 'master' into naisila/fix-partitioned-index 2021-11-08 10:53:16 +01:00
naisila 385ba94d15 Run fix_partition_shard_index_names after each wrong naming command 2021-11-08 10:43:34 +01:00
Marco Slot 78866df13c Remove master_append_table_to_shard UDF 2021-11-08 10:43:24 +01:00
Marco Slot fba93df4b0 Remove copy into new append shard logic 2021-11-07 21:01:40 +01:00
Halil Ozan Akgul a8f3f712cc Turns mx on in isolations tests 2021-11-04 17:12:30 +03:00
Marco Slot 096660d61d Remove master_apply_delete_command 2021-10-18 22:29:37 +02:00
Halil Ozan Akgul 9c9d4b5eeb Turn MX on by default 2021-10-08 18:17:21 +03:00
Naisila Puka d0390af72d
Add fix_partition_shard_index_names udf to fix currently broken names (#5291)
* Add udf to include shardId in broken partition shard index names

* Address reviews: rename index such that operations can be done on it

* More comprehensive index tests

* Final touches and formatting
2021-10-07 19:34:52 +03:00
Sait Talha Nisanci f3fa133caa Bind seg version to 1.3 in isolation_textension_commands 2021-09-03 15:41:28 +03:00
Onur Tirtir 889a2731cb
Split columnar stripe reservation into two phases (#5188)
Previously, we were doing `first_row_number` reservation for the first
row written to current `WriteState` but were doing `stripe_id`
reservation when flushing the `WriteState` and were inserting the
related record to `columnar.stripe` at that time as well.

However, inserting the `columnar.stripe` record at flush-time is
problematic. This is because, as described in #5160, if the relation has
any index-based constraints and there are two concurrent
writes that are inserting conflicting key values for that constraint,
then postgres relies on the `tableAM->fetch_index_tuple`
(=`columnar_fetch_index_tuple`) callback to return `true` when
indexAM is checking against possible constraint violations.

However, pending writes of other backends are not visible to concurrent
sessions in columnar since we were not inserting the stripe metadata
record until flushing the stripe.

With this commit, we split stripe reservation into two phases:
i) Reserve `stripe_id` and insert a "dummy" record to `columnar.stripe`
at the very same time we reserve `first_row_number`, i.e. when writing
the first row to the current `WriteState`.
ii) At flush time, do the storage level allocation and complete the
missing fields of the dummy record inserted into `columnar.stripe`
during i).

That way, any concurrent writes would be able to check against possible
constraint violations by using `SnapshotDirty` when scanning
`columnar.stripe`.
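
The two phases can be illustrated with a toy in-memory model. This is a sketch under simplifying assumptions, not the columnar implementation: `StripeCatalog`, its methods, and the `row_count` and `file_offset` fields are hypothetical stand-ins for the metadata kept in `columnar.stripe`.

```python
class StripeCatalog:
    """Toy stand-in for columnar.stripe; field names are illustrative."""

    def __init__(self):
        self.next_stripe_id = 1
        self.next_row_number = 1
        self.rows = {}  # stripe_id -> metadata dict

    def reserve(self, row_limit):
        """Phase i: reserve stripe_id and first_row_number, insert a dummy record."""
        stripe_id = self.next_stripe_id
        first_row_number = self.next_row_number
        self.next_stripe_id += 1
        self.next_row_number += row_limit
        self.rows[stripe_id] = {
            "first_row_number": first_row_number,
            "row_count": None,    # unknown until flush
            "file_offset": None,  # unknown until flush
        }
        return stripe_id

    def complete(self, stripe_id, row_count, file_offset):
        """Phase ii: at flush time, fill in the fields left empty in phase i."""
        self.rows[stripe_id].update(row_count=row_count, file_offset=file_offset)

    def has_pending_stripe(self):
        """True if some stripe is reserved but not yet flushed."""
        return any(meta["row_count"] is None for meta in self.rows.values())


catalog = StripeCatalog()
stripe_id = catalog.reserve(row_limit=150_000)
print(catalog.has_pending_stripe())  # a concurrent writer can already see the reservation
catalog.complete(stripe_id, row_count=3, file_offset=0)
print(catalog.has_pending_stripe())
```

The point of the model is that the reservation is observable before the flush, which is what lets a concurrent writer notice a possibly conflicting pending write.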

Note that `columnar_fetch_index_tuple` still wouldn't be able to fill
the output tupleslot for the requested tid but it would at least return
`true` for such index look-up's and we believe this should be sufficient
for the caller indexAM callback to make the concurrent writer block on
prior one.

That is how we fix #5160.

The only downside of reserving `stripe_id` at the same time we reserve
`first_row_number` is that any aborted writes would now also waste
some `stripe_id` values, as is already the case for `first_row_number`, but
we are just wasting them one-by-one.

Considering that we waste `first_row_number` by the amount of the
stripe row limit (=150k by default) in such cases, this shouldn't be
important at all.
2021-09-02 11:49:14 +03:00
Onur Tirtir bf4dfad6f7 Update curcid of given snapshot if it is MVCC
Before starting to scan a columnar table, we always flush the pending
writes to disk.

However, we increment command counter after modifying metadata tables.

On the other hand, now that we _don't always use_ xact snapshot to scan
a columnar table, writes that we just flushed might not be visible to
the query that just flushed pending writes to disk since curcid of
provided snapshot would become smaller than the command id being used
when modifying metadata tables.

To give an example, before this change, below was a possible scenario
due to the changes that we made to use the correct snapshot.

```sql
CREATE TABLE t(a int, b int) USING columnar;
BEGIN;
  INSERT INTO t VALUES (5, 10);

  SELECT * FROM t;
  ┌───┬───┐
  │ a │ b │
  ├───┼───┤
  └───┴───┘
  (0 rows)

  SELECT * FROM t;
  ┌───┬────┐
  │ a │ b  │
  ├───┼────┤
  │ 5 │ 10 │
  └───┴────┘
  (1 row)
```
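
The visibility rule behind this behaviour can be modelled in a few lines: a row written by the current transaction is visible to a snapshot only if the command that wrote it is older than the snapshot's `curcid`. The sketch below is a simplified model, not PostgreSQL or Citus code. `Snapshot`, `visible_to_own_transaction`, and `flush_pending_writes` are hypothetical names, and the model ignores everything except command ids.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    curcid: int  # command id the snapshot was taken at


def visible_to_own_transaction(tuple_cmin, snapshot):
    """A row written by the current transaction is visible only if the
    command that wrote it is older than the snapshot's command id."""
    return tuple_cmin < snapshot.curcid


def flush_pending_writes(snapshot, current_command_id, update_curcid):
    """Model a flush: metadata rows are written at current_command_id and the
    command counter is then incremented. Returns the cmin of those rows.

    With update_curcid=True the snapshot's curcid is advanced to the new
    command id, which is the behaviour this commit describes.
    """
    metadata_cmin = current_command_id
    new_command_id = current_command_id + 1
    if update_curcid:
        snapshot.curcid = new_command_id
    return metadata_cmin


stale = Snapshot(curcid=1)
cmin = flush_pending_writes(stale, current_command_id=1, update_curcid=False)
print(visible_to_own_transaction(cmin, stale))   # flushed rows are not visible

fresh = Snapshot(curcid=1)
cmin = flush_pending_writes(fresh, current_command_id=1, update_curcid=True)
print(visible_to_own_transaction(cmin, fresh))   # flushed rows are visible
```

In this model the first `SELECT` of the example corresponds to the stale snapshot, and the second to a snapshot whose `curcid` is already past the flush.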
2021-09-02 11:11:59 +03:00