We remove `<waiting ...>` and `<... completed>` outputs for some CREATE
INDEX CONCURRENTLY commands since they can cause flakiness in some scenarios.
Postgres calls WaitForOlderSnapshots() and this can cause CREATE INDEX
CONCURRENTLY commands for shards to get blocked by each other for brief
periods of time. The extra waits can pop-up, or they can get completed
at different lines in the output files. To remedy that, we rename those
indexes to be captured by the new normalization rule.
(cherry picked from commit 52541c5802)
The error comes due to the datum jsonb in pg_dist_metadata_node.metadata being 0 in some scenarios. This is likely due to not copying the data when receiving a datum from a tuple and pg deciding to deallocate that memory when the table that the tuple was from is closed.
Also fix another place in the code that might have been susceptible to this issue.
I tested on both multi-vg and multi-1-vg and the test were successful.
(cherry picked from commit beef392f5a)
The general rule is:
If the data is used within the bounds of table_open ... table_close > no need to copy
If the data is required for use even after the table is closed > copy
(cherry picked from commit dc9da7630f)
altering the distributed table.
To be able to alter view's owner without enforcing sequential mode.
Alter view process functions have been udpated to use metadata
connection.
Do not obtain AccessShareLock before acquiring the distributed locks.
Acquiring an AccessShareLock ensures that the relations which we are trying to get a distributed lock on will not be dropped in the time between when the LOCK command is issued and the LOCK commands are send to the worker. However, this also leads to distributed deadlocks in such scenarios:
```sql
-- for dist lock acquiring order coor, w1, w2
-- on w2
LOCK t1 IN ACCESS EXLUSIVE MODE;
-- acquire AccessShareLock locally on t1 to ensure it is not dropped while we get ready to distribute the lock
-- concurrently on w1
LOCK t1 IN ACCESS EXLUSIVE MODE;
-- acquire AccessShareLock locally on t1 to ensure it is not dropped while we get ready to distribute the lock
-- acquire dist lock on coor, w1, gets blocked on local AccessShareLock on w2
-- on w2 continuation of the execution above
-- starts to acquire dist locks and gets blocked on the coor by the lock acquired by w1
-- distributed deadlock
```
We opt for avoiding such deadlocks with the cost of the possibility of running into errors when the relations on which we are trying to acquire locks on get dropped.
(cherry picked from commit 27ddb4fc8e)
It is often useful to be able to sync the metadata in parallel
across nodes.
Also citus_finalize_upgrade_to_citus11() uses
start_metadata_sync_to_primary_nodes() after this commit.
Note that this commit does not parallelize all pieces of node
activation or metadata syncing. Instead, it tries to parallelize
potenially large parts of metadata, which is the objects and
distributed tables (in general Citus tables).
In the future, it would be nice to sync the reference tables
in parallel across nodes.
Create ~720 distributed tables / ~23450 shards
```SQL
-- declaratively partitioned table
CREATE TABLE github_events_looooooooooooooong_name (
event_id bigint,
event_type text,
event_public boolean,
repo_id bigint,
payload jsonb,
repo jsonb,
actor jsonb,
org jsonb,
created_at timestamp
) PARTITION BY RANGE (created_at);
SELECT create_time_partitions(
table_name := 'github_events_looooooooooooooong_name',
partition_interval := '1 day',
end_at := now() + '24 months'
);
CREATE INDEX ON github_events_looooooooooooooong_name USING btree (event_id, event_type, event_public, repo_id);
SELECT create_distributed_table('github_events_looooooooooooooong_name', 'repo_id');
SET client_min_messages TO ERROR;
```
across 1 node: almost same as expected
```SQL
SELECT start_metadata_sync_to_primary_nodes();
Time: 15664.418 ms (00:15.664)
select start_metadata_sync_to_node(nodename,nodeport) from pg_dist_node;
Time: 14284.069 ms (00:14.284)
```
across 7 nodes: ~3.5x improvement
```SQL
SELECT start_metadata_sync_to_primary_nodes();
┌──────────────────────────────────────┐
│ start_metadata_sync_to_primary_nodes │
├──────────────────────────────────────┤
│ t │
└──────────────────────────────────────┘
(1 row)
Time: 25711.192 ms (00:25.711)
-- across 7 nodes
select start_metadata_sync_to_node(nodename,nodeport) from pg_dist_node;
Time: 82126.075 ms (01:22.126)
```
(cherry picked from commit dd02e1755f)
There are two problems in this area. First, when there are expressions
on the index name, we should call `transformIndexExpression()` before
generating the index name. That is what Postgres does.
Second, because of 40c24bfef9
PG 13 and PG 14 generates different names for indexes with function calls even for local PG tables.
Assume we have:
```SQL
create table t(id int);
select create_distributed_table('t', 'id');
create index ON t (my_very_boring_function(id));
```
On PG 13, the name of the index is `t_expr_idx`
```SQL
\d t
Table "public.t"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ id │ integer │ │ │ │
└────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
"t_expr_idx" btree (my_very_boring_function(id::bigint))
```
On PG 14, the name of the index is `t_my_very_boring_function_idx`
```SQL
\d t
Table "public.t"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ id │ integer │ │ │ │
└────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
"t_my_very_boring_function_idx" btree (my_very_boring_function(id::bigint))
```
The second issue is not very critical. The important part is that
we adjust regression tests to drop all the indexes, which ensures
the index names are sane on any version.
(cherry picked from commit 2cc4053fc1)
We have a mechanism which ensures that newly distributed
objects are recorded during `alter extension citus update`.
However, the logic was lacking "view"s. With this commit, we make
sure that existing views are also marked as distributed during
upgrade.
(cherry picked from commit ee45e7bfbf)
Breaking down #5899 into smaller PR-s
This particular PR changes the way TRUNCATE acquires distributed locks on the relations it is truncating to use the LOCK command instead of lock_relation_if_exists. This has the benefit of using pg's recursive locking logic it implements for the LOCK command instead of us having to resolve relation dependencies and lock them explicitly. While this does not directly affect truncate, it will allow us to generalize this locking logic to then log different relations where the pg recursive locking will become useful (e.g. locking views).
This implementation is a bit more complex that it needs to be due to pg not supporting locking foreign tables. We can however, still lock foreign tables with lock_relation_if_exists. So for a command:
TRUNCATE dist_table_1, dist_table_2, foreign_table_1, foreign_table_2, dist_table_3;
We generate and send the following command to all the workers in metadata:
```sql
SEL citus.enable_ddl_propagation TO FALSE;
LOCK dist_table_1, dist_table_2 IN ACCESS EXCLUSIVE MODE;
SELECT lock_relation_if_exists('foreign_table_1', 'ACCESS EXCLUSIVE');
SELECT lock_relation_if_exists('foreign_table_2', 'ACCESS EXCLUSIVE');
LOCK dist_table_3 IN ACCESS EXCLUSIVE MODE;
SEL citus.enable_ddl_propagation TO TRUE;
```
Note that we need to alternate between the lock command and lock_table_if_exists in order to preserve the TRUNCATE order of relations.
When pg supports locking foreign tables, we will be able to massive simplify this logic and send a single LOCK command.
(cherry picked from commit 4c6f62efc6)
Adds support for propagation ALTER VIEW commands to
- Change owner of view
- SET/RESET option
- Rename view and view's column name
- Change schema of the view
Since PG also supports targeting views with ALTER TABLE
commands, related code also added to direct such ALTER TABLE
commands to ALTER VIEW commands while sending them to workers.
Adds support for propagating create/drop view commands and views to
worker node while scaling out the cluster. Since views are dropped while
converting the table type, metadata connection will be used while
propagating view commands to not switch to sequential mode.
With Citus MX enabled, when a reference table is modified, it does
some operations on the first worker node(e.g., acquire locks).
If node metadata is locked (via add node or create restore point),
the changes to the reference tables should be blocked.
First, it is not needed. Second, in the past we had issues regarding
this: https://github.com/citusdata/citus/pull/4344
When I create 10k tables, ~120K shards, this saves
40Mb of memory during ALTER EXTENSION citus UPDATE.
Before the change: MetadataCacheMemoryContext: 41943040 ~ 40MB
After the change: MetadataCacheMemoryContext: 8192
(cherry picked from commit f193e16a01)
In the past, for all modifications on the local execution,
we enabled 2PC (with 6a7ed7b309).
This also required us to enable coordinated transactions
via https://github.com/citusdata/citus/pull/4831 .
However, it does have a very substantial impact on the
distributed deadlock detection. The distributed deadlock
detection is designed to avoid single-statement transactions
because they cannot lead to any actual deadlocks.
The implementation is to skip backends without distributed
transactions are assigned. Now that we assign single
statement local executions in the lock graphs, we are
conflicting with the design of distributed deadlock
detection.
In general, we should fix it. However, one might
think that it is not a big deal, even if the processes
show up in the lock graphs, the deadlock detection
should not be causing any false positives. That is
false, unless https://github.com/citusdata/citus/issues/1803
is fixed. Now that local processes are considered as a single
distributed backend, the lock graphs might find:
local execution 1 [tx id: 1] -> any local process [tx id: 0]
any local process [tx id: 0] -> local execution 2 [tx id: 2]
And, decides that there is a distributed deadlock.
This commit is:
(a) right thing to do, as local execuion should not need any
distributed tx id
(b) Eliminates performance issues that might come up with
deadlock detection does a lot of unncessary checks
(c) After moving local execution after the remote execution
via https://github.com/citusdata/citus/pull/4301, the
vauge requirement for assigning distributed tx ids are
already gone.
(cherry picked from commit a2debe0f02)
The aim of hiding shards is to hide shards from client applications.
Certain bg workers (such as pg_cron or Citus maintanince daemon)
should be treated like client applications because users can run
queries from such bg workers. And, these bg workers should follow
the similar application_name checks as client backeends.
Certain other bg workers, such as logical replication or postgres'
parallel workers, should never hide shards. They are internal
operations.
Similarly the other backend types like the walsender or
checkpointer or autovacuum should never hide shards.
(cherry picked from commit 9043a1ed3f)
We've had custom versions of Postgres its `foreach` macro which with a
hidden ListCell for quite some time now. People like these custom
macros, because they are easier to use and require less boilerplate.
This adds similar custom versions of Postgres its `forboth` macro. Now
you don't need ListCells anymore when looping over two lists at the same
time.
Since now we don't throw an error for enums that user attempts creating
in temp schema, the preprocess / DDL job that contains the prepared
statement (to idempotently create the enum type) gets executed. As a
result, we were emitting the following warning because of the error the
underlying worker connection throws:
```sql
WARNING: cannot PREPARE a transaction that has operated on temporary objects
CONTEXT: while executing command on localhost:xxxxx
WARNING: connection to the remote node localhost:xxxxx failed with the following error: another command is already in progress
ERROR: cannot PREPARE a transaction that has operated on temporary objects
CONTEXT: while executing command on localhost:xxxxx
```
We were already doing so for functions & types believing that
this cannot be the case for other object types.
However, as in #5830, we cannot distribute an object that user
attempts creating in temp schema. Even more, this doesn't only
apply to functions and types but also to many other object types.
So with this commit, we teach preprocess/postprocess functions
(that need to create dependencies on worker nodes) how to skip
trying to distribute such objects.
We also start identifying temp schemas as the objects that we
don't know how to propagate to worker nodes so that we can
simply create objects locally if user attempts creating them
in a temp schema.
There are 36 callers of `EnsureDependenciesExistOnAllNodes` in
the codebase atm and for the most we still need to throw a hard
error (i.e.: not use `DeferErrorIfHasUnsupportedDependency`
beforehand), such as:
i) user explicitly wants to create a distributed object
* CreateCitusLocalTable
* CreateDistributedTable
* master_create_worker_shards
* master_create_empty_shard
* create_distributed_function
* EnsureExtensionFunctionCanBeDistributed
ii) we don't want to skip altering distributed table on worker nodes
* PostprocessIndexStmt
* PostprocessCreateTriggerStmt
* PostprocessCreateStatisticsStmt
iii) object is already distributed / handled by Citus before, so we
aren't okay with not propagating the ALTER command
* PostprocessAlterTableSchemaStmt
* PostprocessAlterCollationOwnerStmt
* PostprocessAlterCollationSchemaStmt
* PostprocessAlterDatabaseOwnerStmt
* PostprocessAlterExtensionSchemaStmt
* PostprocessAlterFunctionOwnerStmt
* PostprocessAlterFunctionSchemaStmt
* PostprocessAlterSequenceOwnerStmt
* PostprocessAlterSequenceSchemaStmt
* PostprocessAlterStatisticsSchemaStmt
* PostprocessAlterStatisticsOwnerStmt
* PostprocessAlterTextSearchConfigurationSchemaStmt
* PostprocessAlterTextSearchDictionarySchemaStmt
* PostprocessAlterTextSearchConfigurationOwnerStmt
* PostprocessAlterTextSearchDictionaryOwnerStmt
* PostprocessAlterTypeSchemaStmt
* PostprocessAlterForeignServerOwnerStmt
iv) we already cannot create those objects in temp schemas, so skipping
for now
* PostprocessCreateExtensionStmt
* PostprocessCreateForeignServerStmt
Also note that there are 3 more callers of
`EnsureDependenciesExistOnAllNodes` in enterprise in addition to those
36 but we don't need to do anything specific about them due to the same
reasoning given in iii).
In `pg_regress_multi.pl` we're running `initdb` with some options that
the `common.py` `initdb` is currently not using. All these flags seem
reasonable, so this brings `common.py` in line with
`pg_regress_multi.pl`.
In passing change the `--nosync` flag to `--no-sync`, since that's what
the PG documentation lists as the official option name (but both work).
Cluster setup time is significant in arbitrary configs. We can
parallelize this a bit more.
Runtime of the following command decreases from ~25 seconds to ~22
seconds on my machine with this change:
```
make -C src/test/regress/ check-arbitrary-base CONFIGS=CitusDefaultClusterConfig EXTRA_TESTS=prepared_statements_1
```
Currently we can only run different configs in parallel. However, when working on a feature or trying to fix a bug this is not important. In those cases you simply want to run a single test file on a single config. And you want to run that every time you made a change to the code that you think fixes the issue.
This PR allows parallelising running of bash commands. So `initdb` and `pg_ctl start` is run in parallel for all nodes in the cluster. Instead of one waiting for the other.
When you run the above command nothing is being run in parallel.
After this PR, cluster setup is being run in parallel.
We have fsync enabled for regular tests already in `pg_regress_multi.pl`.
This does the same for the arbitrary config tests.
On my machine this changes the runtime from the following command from
~37 to ~25 seconds:
```bash
make -C src/test/regress/ check-arbitrary-configs CONFIGS=CitusDefaultClusterConfig
```
Here is a list of some functions, and the `TargetWorkerSet` parameters
they supply to `NodeDDLTaskList`:
PostprocessCreateTextSearchConfigurationStmt -
NON_COORDINATOR_NODES
PreprocessDropTextSearchConfigurationStmt -
NON_COORDINATOR_METADATA_NODES
PreprocessAlterTextSearchConfigurationSchemaStmt -
NON_COORDINATOR_METADATA_NODES
I guess this means that, if metadata
syncing is disabled on the node, we may have some issues. Consider the
following:
Let's assume the user has metadata syncing disabled. 2 workers.
`CREATE TEXT SEARCH CONFIGURATION ...` will get propagated to all
workers. `ALTER ... CONFIGURATION ...` will not get propagated to
workers.
After adding a new non-metadata node, the new node will get the altered
configuration as it reads from catalog. At this point CONFIGURATION
definitions got diverged in the cluster.
I suggest that we always use `NON_COORDINATOR_METADATA_NODES` in all the
TEXT SEARCH operations here.
Before this commit, we erroneously converted the sequence
type to the column's type it is used. However, it is possible
that the sequence is used in an expression which then converted
to a type that cannot be a sequence, such as text.
With this commit, we only try this conversion if the column
type is a supported sequence type (e.g., smallint, int and bigint).
Note that we do this conversion because if the column type is a
bigint and the sequence is NOT a bigint, users would be in trouble
because sequences would generate values that are out of the range
of the column. (The other ways are already not supported such as
the column is int and the sequence is bigint would fail on the worker.)
In other words, with this commit, we scope this optimization only
when the target column type is a supported sequence type. Otherwise,
we let users to more freely use the sequences.
With the introduction of #4385 we inadvertently started allowing and
pushing down certain lateral subqueries that were unsafe to push down.
To be precise the type of LATERAL subqueries that is unsafe to push down
has all of the following properties:
1. The lateral subquery contains some non recurring tuples
2. The lateral subquery references a recurring tuple from
outside of the subquery (recurringRelids)
3. The lateral subquery requires a merge step (e.g. a LIMIT)
4. The reference to the recurring tuple should be something else than an
equality check on the distribution column, e.g. equality on a non
distribution column.
Property number four is considered both hard to detect and probably not
used very often. Thus this PR ignores property number four and causes
query planning to error out if the first three properties hold.
Fixes#5327
TEXT SEARCH DICTIONARY objects depend on TEXT SEARCH TEMPLATE objects.
Since we do not yet support distributed TS TEMPLATE objects, we skip
dependency checks for text search templates, similar to what we do for
roles.
The user is expected to manually create the TEXT SEARCH TEMPLATE objects
before a) adding new nodes, b) creating TEXT SEARCH DICTIONARY objects.
If a worker node is being added, a command is sent to get the server_id of the worker from the pg_dist_node_metadata table. If the worker's id is the same as the node executing the code, we will know the node is trying to add itself. If the node tries to add itself without specifying `groupid:=0` the operation will result in an error.
Using CASCADE in a DELETE can inadvertently delete things we don't
intend to. It's safer to fail hard and make the user delete depending
things manually.
1) Remove useless columns
2) Show backends that are blocked on a DDL even before
gpid is assigned
3) One minor bugfix, where we clear distributedCommandOriginator
properly.
DESCRIPTION: Move pg_dist_object to pg_catalog
Historically `pg_dist_object` had been created in the `citus` schema as an experiment to understand if we could move our catalog tables to a branded schema. We quickly realised that this interfered with the UX on our managed services and other environments, where users connected via a user with the name of `citus`.
By default postgres put the username on the search_path. To be able to read the catalog in the `citus` schema we would need to grant access permissions to the schema. This caused newly created objects like tables etc, to default to this schema for creation. This failed due to the write permissions to that schema.
With this change we move the `pg_dist_object` catalog table to the `pg_catalog` schema, where our other schema's are also located. This makes the catalog table visible and readable by any user, like our other catalog tables, for debugging purposes.
Note: due to the change of schema, we had to disable 1 test that was running into a discrepancy between the schema and binary. Secondly, we needed to make the lookup functions for the `pg_dist_object` relation and their indexes less strict on the fallback of the naming due to an other test that, due to an unfortunate cache invalidation, needed to lookup the relation again. This makes that we won't default to _only_ resolving from `pg_catalog` outside of upgrades.
* Notice when create_distributed_function called without params
* Move variable comments to top
* Add valid check for cache entry
* add objtype to notice msg
* update test outputs
* Add more tests
* Address feedback
And also citus_calculate_gpid(nodeId,pid). These UDFs are just
wrappers for the existing functions. Useful for testing and simple
manipulation of citus_stat_activity.
It seems like our approach is way too restrictive and some places
are wrong. Now, we follow very similar approach to pg_stat_activity.
Some of the changes are pre-requsite for implementing citus_dist_stat_activity
via citus_stat_activity.
Clusters created pre-Citus 11 mostly didn't have metadata sync enabled.
For those clusters, we add a utility UDF which fixes some minor issues
and sync the necessary objects to the workers.
* [Columnar] Build columnar.so and let citus depends on it
Co-authored-by: Yanwen Jin <yanwjin@microsoft.com>
Co-authored-by: Ying Xu <32597660+yxu2162@users.noreply.github.com>
Co-authored-by: jeff-davis <Jeffrey.Davis@microsoft.com>
DESCRIPTION: Add GUC to control ddl creation behaviour in transactions
Historically we would _not_ propagate objects when we are in a transaction block. Creation of distributed tables would not always work in sequential mode, hence objects created in the same transaction as distributing a table that would use the just created object wouldn't work. The benefit was that the user could still benefit from parallelism.
Now that the creation of distributed tables is supported in sequential mode it would make sense for users to force transactional consistency of ddl commands for distributed tables. A transaction could switch more aggressively to sequential mode when creating new objects in a transaction.
We don't change the default behaviour just yet.
Also, many objects would not even propagate their creation when the transaction was already set to sequential, leaving the probability of a self deadlock. The new policy checks solve this discrepancy between objects as well.
The issue in question is caused when rebalance / replication call `FullShardPlacementList` which returns all shard placements (including those in disabled nodes with `citus_disable_node`). Eventually, `FindFillStateForPlacement` looks for the state across active workers and fails to find a state for the placements which are in the disabled workers causing a seg fault shortly after.
Approach:
* `ActivePlacementHash` was not using the status of the shard placement's node to determine if the node it is active. Initially, I just fixed that.
* Additionally, I refactored the code which handles active shards in replication / rebalance to:
* use a single function to determine if a shard placement is active.
* do the shard active shard filtering before calling `RebalancePlacementUpdates` and `ReplicationPlacementUpdates`, so test methods like `shard_placement_rebalance_array` and `shard_placement_replication_array` which have different shard placement active requirements can do their own filtering while using the same rebalance / replicate logic that `rebalance_table_shards` and `replicate_table_shards` use.
Fix#5664
CitusInitiatedBackend was a pre-mature implemenation of the whole
GlobalPID infrastructure. We used it to track whether any individual
query is triggered by Citus or not.
As of now, after GlobalPID is already in place, we don't need
CitusInitiatedBackend, in fact it could even be wrong.
#5685 introduced the resolution of dependencies for indices. This missed support for indices on partitioned tables. This change adds support for partitioned indices to the dependency resolution code.
It turns out `whereis` is incredibly slow on WSL2 (at least on my
machine):
```
$ time whereis diff
diff: /usr/bin/diff /usr/share/man/man1/diff.1.gz
real 0m0.408s
user 0m0.010s
sys 0m0.101s
```
This command is run by our custom `diff` script, which is run for every
test file that is run. So this adds lots of unnecessary runtime time to
tests.
This changes our custom `diff` script to only call `whereis` in the
strange case that `/usr/bin/diff` does not exist.
The impact of this small change on the total runtime of the tests on WSL
is huge. As an example the following command takes 18 seconds without
this change and 7 seconds with it:
```
make -C src/test/regress/ check-arbitrary-configs CONFIGS=PostgresConfig
```
(cherry picked from commit 4e93afd1f78854e1aaab63690c441b0b0598a82c)
(cherry picked from commit 0295fe2f5b)
(cherry picked from commit 878510725fab9cb6870b4504e0b1f055d7bbc68d)
Before this commit, dumping wait edges can only be used for
distributed deadlock detection purposes. With this commit,
we open the possibility that we can use it for any backend.