While testing 5670dffd33, I realized
that we have a missing RecordNonDistTableAccessesForTask() for
local utility commands.
Although we don't have to record the relation access for local-only
cases, we really want the behaviour of scale-out to be the same as
single node in all aspects. We wouldn't want a complex transaction
to work on a single machine but not on a multi-node cluster. Hence,
we apply the same restrictions.
For example, the following errors on a distributed cluster, and
after this commit it errors locally as well:
```SQL
CREATE TABLE ref(a int primary key);
INSERT INTO ref VALUES (1);
CREATE TABLE dist(a int REFERENCES ref(a));
SELECT create_reference_table('ref');
SELECT create_distributed_table('dist', 'a');
BEGIN;
SELECT * FROM dist;
TRUNCATE ref CASCADE;
ERROR: cannot execute DDL on table "ref" because there was a parallel SELECT access to distributed table "dist" in the same transaction
HINT: Try re-running the transaction with "SET LOCAL citus.multi_shard_modify_mode TO 'sequential';"
COMMIT;
```
We also add a comprehensive test suite and run the same tests locally.
A code snippet in the Makefile was blocking the Citus build when the USE_PGXS flag was set. It was included for the port to FSPG but is not needed for the Citus engine and can be safely removed.
Reported bug #5803 shows that we are currently not sending the IN clause to our planner for columnar. This PR fixes it by checking for ScalarArrayOpExpr in ExtractPushdownClause so that we do not skip it. Also added a test case for this new addition.
It turns out that create_distributed_table
and citus_move/copy_shard_placement do not
work well concurrently.
To fix that, we need to acquire a lock, which
sounds like a good use of colocation lock.
However, the current usage of colocation lock is
limited to higher level UDFs like rebalance_table_shards
etc. Those usages of the lock are still useful, but
we cannot acquire the same lock on citus_move_shard_placement
etc. because the coordinator connects to itself to acquire
the lock. Hence, the high level UDF blocks itself.
To fix that, we use one more colocation lock, where the placements
are the main objects to consider.
Before this commit, we required multiple copies of the
same stringInfo if we needed to append/prepend data to
the stringInfo. Now, we optionally get prefix/postfix.
For large string operations, this can save up to 10%
memory.
Previously, CreateFixPartitionShardIndexNames() created all
the relevant query strings for all the shards, and executed
the large query string. In terms of memory consumption,
this huge command (and the ExprContext generated while running
the command) is the main bottleneck.
With this change, we are reducing the total amount of memory
usage to almost 1/shard_count.
On my local machine, a distributed partitioned table with 120 partitions,
each 32 shards, the total memory consumption reduced from ~3GB
to ~0.1GB. And, the total execution time increased from ~28 seconds
to ~30 seconds. This seems like a good trade-off.
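The 1/shard_count figure is consistent with the numbers above: 3 GB divided by 32 shards is roughly 0.09 GB. A minimal Python sketch of the idea follows, assuming the commands are now built and executed one shard at a time instead of as one concatenated string. The command text and function names are placeholders for illustration, not the actual Citus code.

```python
PARTITIONS = 120
SHARDS = 32


def fix_command(partition, shard):
    # Placeholder text, zero-padded so that every command has the same length.
    return f"-- fix index names: partition {partition:03d}, shard {shard:02d};\n"


def one_big_command():
    # Old approach (as described above): materialize every command at once.
    return "".join(
        fix_command(p, s) for s in range(SHARDS) for p in range(PARTITIONS)
    )


def per_shard_commands():
    # Assumed new approach: yield one batch per shard, so only one batch
    # needs to be held in memory at a time.
    for s in range(SHARDS):
        yield "".join(fix_command(p, s) for p in range(PARTITIONS))


peak_before = len(one_big_command())
peak_after = max(len(batch) for batch in per_shard_commands())
print(peak_before, peak_after, peak_before / peak_after)
```

In this toy model the peak string size shrinks by exactly the shard count; the real saving also depends on the memory context used while each command runs.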
We used to only check whether the PID is valid
or not. However, Postgres does not necessarily
set the PID of the backend to 0 when it exits.
Instead, we need to be able to check it from procArray.
IsBackendPid() is what pg_stat_activity also relies
on for a similar purpose.
Historically we have been testing with the 'latest' version of libpq
when the CI images were built. This has the downside that rebuilding the
images often breaks our tests due to different errors returned from
libpq.
With this change we will actually test with a stable version of libpq
that is based on the postgres minor version that we test against.
This will make it easier to maintain postgres images over time, as well
as running _all_ tests locally, where we change libpq in sync with the
postgres server version.
use RecurseObjectDependencies api to find if an object is citus depended
make vanilla tests runnable to see if citus_depended function is working correctly
citus_locks combines the pg_locks views from all nodes and adds
global_pid, nodeid, and relation_name. The columns of citus_locks don't
change based on the Postgres version, however the pg_locks's columns do.
Postgres 14 added one more column to pg_locks (waitstart timestamptz).
citus_locks has the most expansive column set, including the newly added
column. If citus_locks is queried in a Postgres version where pg_locks
doesn't have some columns, the values for those columns in citus_locks
will be NULL.
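The NULL-filling behaviour can be pictured with a small Python sketch. This is not how the view is implemented (that lives in SQL/C); the column list is a shortened assumption and `to_citus_locks_row` is a hypothetical helper.

```python
# Fixed output columns. This short list is an assumption for illustration;
# the real citus_locks view exposes more columns than shown here.
CITUS_LOCKS_COLUMNS = [
    "global_pid", "nodeid", "relation_name",
    "locktype", "mode", "granted", "waitstart",
]


def to_citus_locks_row(pg_locks_row, global_pid, nodeid, relation_name):
    """Project one pg_locks row (a dict) onto the fixed column set.

    Columns that the node's pg_locks does not provide come out as None,
    which stands in for SQL NULL.
    """
    merged = dict(pg_locks_row)
    merged.update(global_pid=global_pid, nodeid=nodeid, relation_name=relation_name)
    return {column: merged.get(column) for column in CITUS_LOCKS_COLUMNS}


# A row from a server whose pg_locks has no waitstart column (made-up values).
old_row = {"locktype": "relation", "mode": "AccessShareLock", "granted": True}
result = to_citus_locks_row(old_row, global_pid=10000000123, nodeid=1, relation_name="dist")
print(result["waitstart"])
```

Keeping the output column list fixed means callers never need to branch on the Postgres version.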
DESCRIPTION:
This PR extends support for Partitioned and Columnar tables in blocking 'citus_split_shard_by_split_points' workflow.
Columnar Support : No special handling required. Just removing checks that fail split for columnar tables and adding test coverage.
Partitioned Table Support :
Skip copying of parent tables as they are empty. The partitions contain data and are treated as co-located shards that will be copied separately.
Attach partitions to parent on destination after inserting new shard metadata and before creating foreign key constraints.
MISC:
Fix Bug #4949 where blocking shard moves fail if there is a foreign key between partitioned distributed tables (from child to parent).
TEST:
Added new test 'citus_split_shards_columnar_partitioned' for splitting 'partitioned' and 'columnar + partitioned' table.
Added new test 'shard_move_constraints_blocking' to add coverage for shard move bug fix.
Updated test 'citus_split_shard_by_split_points_negative' to allow columnar and partitioned table.
* Remove if conditions with PG_VERSION_NUM < 13
* Remove server_above_twelve(&eleven) checks from tests
* Fix tests
* Remove pg12 and pg11 alternative test output files
* Remove pg12 specific normalization rules
* Some more if conditions in the code
* Change RemoteCollationIdExpression and some pg12/pg13 comments
* Remove some more normalization rules
When building packages on ubuntu jammy, we started to see some warnings.
autoreconf: warning: autoconf input should be named 'configure.ac', not
'configure.in'
* Blocking split setup
* Add missing type
* Missing API from Metadata Sync
* Shard Split e2e code
* Worker Split Copy DestReceiver skeleton
* Basic destreceiver code
* worker_split_copy UDF
* UDF calling
* Split points are text
* Isolate Tenant and Split Shard Unification
* Fixing executor and misc
* Reindent code
* Fixing UDF definitions
* Hello World Local Copy works
* Remote copy hello world works
* Local and Remote binary test
* Fixing text local copy and adding tests
* Hello World shard split works
* Negative tests
* Blocking Split workflow works
* Refactor
* Bug fix
* Reindent
* Cleaning up and adding comments
* Basic test for shard split workflow
* ReIndent
* Circle CI integration
* Removing include causing circle-ci build failure
* Remove SplitCopyDestReceiver and use PartitionedResultDestReceiver
* Add support for citus.enable_binary_protocol
* Reindent
* Fix build break
* Update Test
* Cleanup on catch
* Addressing open comments
* Update downgrade script and quote schema/table in COPY statement
* Fix metadata sync issue. Update regression test
* Isolation test and bug fix
* Add Isolation test, fix foreign constraint deadlock issue
* Misc code review comments
* Test name needing to be quoted
* Refactor code from review comments
* Explaining shardGroupSplitIntervalListList
* Fix upgrade & downgrade
* Fix broken test
* Test fix Round 2
* Fixing bug and modifying test appropriately
* Fully qualify copy udf name. Run Reindent
* Address PR comments
* Fix null handling when creating AuxiliaryStructures
* Ensure local copy is triggered in tests
* Limit max shards that can be created with split
* Test failure fix
* Remove split_mode and use shard_transfer_mode instead
* Fix test failure
* Fix test failure
* Fixing permission issue when splitting non-superuser owned tables
* Fix test expected output
* Remove extra space
* Fix test
* attempt to fix test
* Addressing Marco's PR comment
* Only clean shards created by workflow
* Remove from merge
* Update test
Similar to #5897, one more step for running Citus with PG 15.
This PR at least makes Citus run with PG 15. I have not tried running the tests with PG 15.
Shmem changes are based on 4f2400cb3f
Compile breaks are mostly due to #6008
This is a continuation of a refactor (with commit sha
2b7cf0c097) that aimed to use Citus helper
UDFs by default in iso tests.
PostgreSQL isolation test infrastructure uses some UDFs to detect
whether concurrent sessions block each other. Citus implements
alternatives to that UDF so that we are able to detect and report
distributed transactions that get blocked on the worker nodes as well.
We needed to explicitly replace PG helper functions with Citus
implementations in each isolation file. Now we replace them by default.
* Support upgrade and downgrade and separate columnar as citus_columnar extension
Co-authored-by: Yanwen Jin <yanwjin@microsoft.com>
Co-authored-by: Jeff Davis <jeff@j-davis.com>
Use Citus helper UDFs by default in iso tests
PostgreSQL isolation test infrastructure uses some UDFs to detect
whether concurrent sessions block each other. Citus implements
alternatives to that UDF so that we are able to detect and report
distributed transactions that get blocked on the worker nodes as well.
We needed to explicitly replace PG helper functions with Citus
implementations in each isolation file. Now we replace them by default.
* Added more regression tests for more vacuum options,
* Fixed deadlock for unqualified vacuum when there is only 1 worker,
* Supported lock_skipped for vacuum.
This PR makes all of the features open source that were previously only
available in Citus Enterprise.
Features that this adds:
1. Non blocking shard moves/shard rebalancer
(`citus.logical_replication_timeout`)
2. Propagation of CREATE/DROP/ALTER ROLE statements
3. Propagation of GRANT statements
4. Propagation of CLUSTER statements
5. Propagation of ALTER DATABASE ... OWNER TO ...
6. Optimization for COPY when loading JSON to avoid double parsing of
the JSON object (`citus.skip_jsonb_validation_in_copy`)
7. Support for row level security
8. Support for `pg_dist_authinfo`, which allows storing different
authentication options for different users, e.g. you can store
passwords or certificates here.
9. Support for `pg_dist_poolinfo`, which allows using connection poolers
in between coordinator and workers
10. Tracking distributed query execution times using
citus_stat_statements (`citus.stat_statements_max`,
`citus.stat_statements_purge_interval`,
`citus.stat_statements_track`). This is disabled by default.
11. Blocking tenant_isolation
12. Support for `sslkey` and `sslcert` in `citus.node_conninfo`
We already have tests relying on citus_finalize_upgrade_to_citus11().
Now, adjust those to rely on citus_finish_citus_upgrade() and
always call citus_finish_citus_upgrade().
The error comes due to the datum jsonb in pg_dist_metadata_node.metadata being 0 in some scenarios. This is likely due to not copying the data when receiving a datum from a tuple and pg deciding to deallocate that memory when the table that the tuple was from is closed.
Also fix another place in the code that might have been susceptible to this issue.
I tested on both multi-vg and multi-1-vg and the test were successful.
altering the distributed table.
To be able to alter view's owner without enforcing sequential mode.
Alter view process functions have been updated to use metadata
connection.
Do not obtain AccessShareLock before acquiring the distributed locks.
Acquiring an AccessShareLock ensures that the relations which we are trying to get a distributed lock on will not be dropped in the time between when the LOCK command is issued and the LOCK commands are sent to the worker. However, this also leads to distributed deadlocks in such scenarios:
```sql
-- for dist lock acquiring order coor, w1, w2
-- on w2
LOCK t1 IN ACCESS EXCLUSIVE MODE;
-- acquire AccessShareLock locally on t1 to ensure it is not dropped while we get ready to distribute the lock
-- concurrently on w1
LOCK t1 IN ACCESS EXCLUSIVE MODE;
-- acquire AccessShareLock locally on t1 to ensure it is not dropped while we get ready to distribute the lock
-- acquire dist lock on coor, w1, gets blocked on local AccessShareLock on w2
-- on w2 continuation of the execution above
-- starts to acquire dist locks and gets blocked on the coor by the lock acquired by w1
-- distributed deadlock
```
We opt to avoid such deadlocks at the cost of possibly running into errors when the relations we are trying to lock get dropped.
It is often useful to be able to sync the metadata in parallel
across nodes.
Also citus_finalize_upgrade_to_citus11() uses
start_metadata_sync_to_primary_nodes() after this commit.
Note that this commit does not parallelize all pieces of node
activation or metadata syncing. Instead, it tries to parallelize
potentially large parts of metadata, which are the objects and
distributed tables (in general Citus tables).
In the future, it would be nice to sync the reference tables
in parallel across nodes.
Create ~720 distributed tables / ~23450 shards
```SQL
-- declaratively partitioned table
CREATE TABLE github_events_looooooooooooooong_name (
event_id bigint,
event_type text,
event_public boolean,
repo_id bigint,
payload jsonb,
repo jsonb,
actor jsonb,
org jsonb,
created_at timestamp
) PARTITION BY RANGE (created_at);
SELECT create_time_partitions(
table_name := 'github_events_looooooooooooooong_name',
partition_interval := '1 day',
end_at := now() + '24 months'
);
CREATE INDEX ON github_events_looooooooooooooong_name USING btree (event_id, event_type, event_public, repo_id);
SELECT create_distributed_table('github_events_looooooooooooooong_name', 'repo_id');
SET client_min_messages TO ERROR;
```
across 1 node: almost same as expected
```SQL
SELECT start_metadata_sync_to_primary_nodes();
Time: 15664.418 ms (00:15.664)
select start_metadata_sync_to_node(nodename,nodeport) from pg_dist_node;
Time: 14284.069 ms (00:14.284)
```
across 7 nodes: ~3.2x improvement
```SQL
SELECT start_metadata_sync_to_primary_nodes();
┌──────────────────────────────────────┐
│ start_metadata_sync_to_primary_nodes │
├──────────────────────────────────────┤
│ t │
└──────────────────────────────────────┘
(1 row)
Time: 25711.192 ms (00:25.711)
-- across 7 nodes
select start_metadata_sync_to_node(nodename,nodeport) from pg_dist_node;
Time: 82126.075 ms (01:22.126)
```
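For reference, the improvement factor for the 7-node case follows directly from the two timings quoted above. A small Python check (the variable names are only for this illustration):

```python
# Timings in milliseconds, copied from the psql output above (7 nodes).
parallel_ms = 25711.192    # start_metadata_sync_to_primary_nodes()
sequential_ms = 82126.075  # start_metadata_sync_to_node() for each node

speedup = sequential_ms / parallel_ms
print(f"speedup: {speedup:.2f}x")
```

The ratio comes out at roughly 3.2.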
Move internal storage details to a separate schema with no public
access to limit the possibility for information leakage.
Create views with public access that show storage details for those
columnar tables where the user has ownership privileges. Include
mapping between relation ID and storage ID for easier interpretation.
We remove `<waiting ...>` and `<... completed>` outputs for some CREATE
INDEX CONCURRENTLY commands since they can cause flakiness in some scenarios.
Postgres calls WaitForOlderSnapshots() and this can cause CREATE INDEX
CONCURRENTLY commands for shards to get blocked by each other for brief
periods of time. The extra waits can pop-up, or they can get completed
at different lines in the output files. To remedy that, we rename those
indexes to be captured by the new normalization rule.
* Bug fix for bug #5876. Memset MetadataCacheSystem every time there is an abort
* Created an ObjectAccessHook that saves the transactionlevel of when citus was created and will clear metadatacache if that transaction level is rolled back. Added additional tests to make sure metadatacache is cleared
Columnar: support relation options with ALTER TABLE.
Use ALTER TABLE ... SET/RESET to specify relation options rather than
alter_columnar_table_set() and alter_columnar_table_reset().
Not only is this more ergonomic, but it also allows better integration
because it can be treated like DDL on a regular table. For instance,
citus can use its own ProcessUtility_hook to distribute the new
settings to the shards.
DESCRIPTION: Columnar: support relation options with ALTER TABLE.
With Citus MX enabled, when a reference table is modified, it does
some operations on the first worker node (e.g., acquire locks).
If node metadata is locked (via add node or create restore point),
the changes to the reference tables should be blocked.
In the past (pre-11), we allowed removing worker nodes
that had active placements for replicated distributed
table, without even checking if there are any other
replicas of the same placement.
However, with #5469, we prevent disabling a node via a hard
error when it holds the last active placement of a shard, as we
do for reference tables. Note that otherwise, we'd allow
users to lose data.
As of today, the NOTICE is completely irrelevant.
The first worker node has a special meaning for modifications on replicated tables.
It is used to acquire a remote lock, such that the modifications are serialized.
With this commit, we make sure that we do not let any distributed query see a
different 'first worker node' while the first worker node is disabled.
Note that, as implied above, when the first worker node is disabled,
the first worker node changes; that's why we have to handle the situation.
Before this commit, we had:
```SQL
SELECT citus_disable_node(nodename, nodeport, force boolean DEFAULT false)
```
Here, we allow forcing the first worker node to be disabled with
`force:=true`. However, that entails the risk of losing
data, diverging placement data, etc.
With the `force` flag, we control disabling the first worker node,
and with the `sync` flag we control whether the changes are done
via a bg worker or immediately.
```SQL
SELECT citus_disable_node(nodename, nodeport, force boolean DEFAULT false, sync boolean DEFAULT false)
```
Where we can achieve all the following:
| Mode | Data loss possibility | Can run in 2PC | Handle multiple node failures | Immediately effective |
| --- |--- |--- |--- |--- |
| force:false, sync: false | false | true | true | false |
| force:false, sync: true | false | false | false | true |
| force:true, sync: false | true | true | true | false |
| force:true, sync: true | false | false | false | true |
There are two problems in this area. First, when there are expressions
on the index name, we should call `transformIndexExpression()` before
generating the index name. That is what Postgres does.
Second, because of 40c24bfef9,
PG 13 and PG 14 generate different names for indexes with function calls, even for local PG tables.
Assume we have:
```SQL
create table t(id int);
select create_distributed_table('t', 'id');
create index ON t (my_very_boring_function(id));
```
On PG 13, the name of the index is `t_expr_idx`
```SQL
\d t
Table "public.t"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ id │ integer │ │ │ │
└────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
"t_expr_idx" btree (my_very_boring_function(id::bigint))
```
On PG 14, the name of the index is `t_my_very_boring_function_idx`
```SQL
\d t
Table "public.t"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ id │ integer │ │ │ │
└────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
"t_my_very_boring_function_idx" btree (my_very_boring_function(id::bigint))
```
The second issue is not very critical. The important part is that
we adjust regression tests to drop all the indexes, which ensures
the index names are sane on any version.
Over time we have significantly improved the support for objects propagated by Citus, to make scaling out the database more seamless. It became evident that a lot of code duplication got into the codebase to implement the propagation.
This PR tries to reduce the amount of repeated code that is at most only slightly different. To make things worse, most of the differences were actually oversights rather than intentional.
This patch introduces 3 reusable sets of pre/post processing steps for, respectively:
- create
- alter
- drop
With the use of the common functionality we should have more coherent behaviour between the different objects supported by Citus.
Some steps either omit the Pre or Post processing step if they would not make sense to include.
All tests pass, only 1 test needed changing, foreign servers, as the dropping of foreign servers didn't implement support for dropping multiple foreign servers at once. Given the common approach correctly supports dropping of multiple objects, either distributed or not, the test that assumed it wouldn't work was now obsolete.
We have a mechanism which ensures that newly distributed
objects are recorded during `alter extension citus update`.
However, the logic was lacking "view"s. With this commit, we make
sure that existing views are also marked as distributed during
upgrade.
We are nearing 100 objects being propagated in `master_copy_shard_placement`, and the extra supported objects push this over 100.
When 100 objects are reached for propagation, a notice is shown to the user, informing them that the operation might take a while to finish.
During testing this is not important to see. Since the message contains the exact number of objects to be propagated, the tests become very unstable when merging community into enterprise.
This change keeps the test output stable.
Adds support for propagating ALTER VIEW commands to
- Change owner of view
- SET/RESET option
- Rename view and view's column name
- Change schema of the view
Since PG also supports targeting views with ALTER TABLE
commands, related code is also added to direct such ALTER TABLE
commands to ALTER VIEW commands while sending them to workers.
Breaking down #5899 into smaller PR-s
This particular PR changes the way TRUNCATE acquires distributed locks on the relations it is truncating to use the LOCK command instead of lock_relation_if_exists. This has the benefit of using the recursive locking logic pg implements for the LOCK command instead of us having to resolve relation dependencies and lock them explicitly. While this does not directly affect truncate, it will allow us to generalize this locking logic to lock different relations where the pg recursive locking will become useful (e.g. locking views).
This implementation is a bit more complex than it needs to be due to pg not supporting locking foreign tables. We can, however, still lock foreign tables with lock_relation_if_exists. So for a command:
TRUNCATE dist_table_1, dist_table_2, foreign_table_1, foreign_table_2, dist_table_3;
We generate and send the following command to all the workers in metadata:
```sql
SET citus.enable_ddl_propagation TO FALSE;
LOCK dist_table_1, dist_table_2 IN ACCESS EXCLUSIVE MODE;
SELECT lock_relation_if_exists('foreign_table_1', 'ACCESS EXCLUSIVE');
SELECT lock_relation_if_exists('foreign_table_2', 'ACCESS EXCLUSIVE');
LOCK dist_table_3 IN ACCESS EXCLUSIVE MODE;
SET citus.enable_ddl_propagation TO TRUE;
```
Note that we need to alternate between the LOCK command and lock_relation_if_exists in order to preserve the TRUNCATE order of relations.
When pg supports locking foreign tables, we will be able to massively simplify this logic and send a single LOCK command.
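The alternation can be sketched in Python as follows. This only mirrors the command list shown above; `build_lock_commands` is a hypothetical helper, not the C function in Citus, and identifier quoting is left out for brevity.

```python
from itertools import groupby


def build_lock_commands(relations):
    """Build the lock commands for a TRUNCATE.

    relations: list of (name, is_foreign) pairs in TRUNCATE order.
    Consecutive regular tables share one LOCK command; each foreign
    table gets its own lock_relation_if_exists() call.
    """
    commands = ["SET citus.enable_ddl_propagation TO FALSE;"]
    for is_foreign, group in groupby(relations, key=lambda rel: rel[1]):
        names = [name for name, _ in group]
        if is_foreign:
            commands.extend(
                f"SELECT lock_relation_if_exists('{name}', 'ACCESS EXCLUSIVE');"
                for name in names
            )
        else:
            commands.append(f"LOCK {', '.join(names)} IN ACCESS EXCLUSIVE MODE;")
    commands.append("SET citus.enable_ddl_propagation TO TRUE;")
    return commands


truncate_order = [
    ("dist_table_1", False),
    ("dist_table_2", False),
    ("foreign_table_1", True),
    ("foreign_table_2", True),
    ("dist_table_3", False),
]
commands = build_lock_commands(truncate_order)
print("\n".join(commands))
```

Grouping only consecutive relations of the same kind is what keeps the lock acquisition order identical to the order in the TRUNCATE statement.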
Adds support for propagating create/drop view commands and views to
worker node while scaling out the cluster. Since views are dropped while
converting the table type, metadata connection will be used while
propagating view commands to not switch to sequential mode.
First, it is not needed. Second, in the past we had issues regarding
this: https://github.com/citusdata/citus/pull/4344
When I create 10k tables, ~120K shards, this saves
40MB of memory during ALTER EXTENSION citus UPDATE.
Before the change: MetadataCacheMemoryContext: 41943040 ~ 40MB
After the change: MetadataCacheMemoryContext: 8192
Here is a flaky test output that is quite hard to fix:
```diff
diff -dU10 -w /home/circleci/project/src/test/regress/expected/isolation_master_update_node_1.out /home/circleci/project/src/test/regress/results/isolation_master_update_node.out
--- /home/circleci/project/src/test/regress/expected/isolation_master_update_node_1.out.modified 2022-03-21 19:03:54.237042562 +0000
+++ /home/circleci/project/src/test/regress/results/isolation_master_update_node.out.modified 2022-03-21 19:03:54.257043084 +0000
@@ -49,18 +49,20 @@
<waiting ...>
step s2-update-node-1-force: <... completed>
master_update_node
------------------
(1 row)
step s2-abort: ABORT;
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
-SSL connection has been closed unexpectedly
+server closed the connection unexpectedly
+ This probably means the server terminated abnormally
+ before or while processing the request.
```
I could not come up with a solution that would decrease the flakiness in the test outputs. We already have 3 output files for the same test and now I introduced a 4th one.
I can also add complex regular expressions that span multiple lines, and normalize these error messages. Feel free to suggest a normalized error message in a comment here.
## Current alternative file contents
`isolation_master_update_node.out`
```
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
SSL connection has been closed unexpectedly
```
`isolation_master_update_node_0.out`
```
step s1-abort: ABORT;
WARNING: this step had a leftover error message
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```
`isolation_master_update_node_1.out`
```
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
SSL connection has been closed unexpectedly
```
new file: `isolation_master_update_node_2.out`
```
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```
In the past, for all modifications on the local execution,
we enabled 2PC (with 6a7ed7b309).
This also required us to enable coordinated transactions
via https://github.com/citusdata/citus/pull/4831 .
However, it does have a very substantial impact on the
distributed deadlock detection. The distributed deadlock
detection is designed to avoid single-statement transactions
because they cannot lead to any actual deadlocks.
The implementation skips backends that have no distributed
transaction assigned. Now that we include single
statement local executions in the lock graphs, we are
conflicting with the design of distributed deadlock
detection.
In general, we should fix it. However, one might
think that it is not a big deal, even if the processes
show up in the lock graphs, the deadlock detection
should not be causing any false positives. That is
false, unless https://github.com/citusdata/citus/issues/1803
is fixed. Now that local processes are considered as a single
distributed backend, the lock graphs might find:
local execution 1 [tx id: 1] -> any local process [tx id: 0]
any local process [tx id: 0] -> local execution 2 [tx id: 2]
And it decides that there is a distributed deadlock.
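A toy Python model of why collapsing every local process into tx id 0 can produce a false positive. The two edges listed above do not close a cycle on their own, so this sketch assumes a hypothetical third wait (local execution 2 waiting on local execution 1). The process names and the `has_cycle` helper are illustrative and not part of Citus.

```python
def has_cycle(edges):
    """Return True if the directed wait-for graph contains a cycle (DFS)."""
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, set()).add(holder)
        graph.setdefault(holder, set())

    WHITE, GRAY, BLACK = 0, 1, 2
    state = {node: WHITE for node in graph}

    def visit(node):
        state[node] = GRAY
        for nxt in graph[node]:
            if state[nxt] == GRAY:
                return True  # back edge: found a cycle
            if state[nxt] == WHITE and visit(nxt):
                return True
        state[node] = BLACK
        return False

    return any(state[node] == WHITE and visit(node) for node in graph)


# Distributed tx id per process; plain local processes all report tx id 0.
tx_id = {"exec1": 1, "procA": 0, "procB": 0, "exec2": 2}

# Waits between the real processes (the last edge is the assumed one).
real_edges = [("exec1", "procA"), ("procB", "exec2"), ("exec2", "exec1")]

# The same waits once every process is replaced by its tx id.
collapsed_edges = [(tx_id[a], tx_id[b]) for a, b in real_edges]

print(has_cycle(real_edges), has_cycle(collapsed_edges))
```

Among the real processes nothing waits in a circle, but after collapsing procA and procB into one node the graph reads 1 -> 0 -> 2 -> 1, which looks like a deadlock.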
This commit is:
(a) right thing to do, as local execution should not need any
distributed tx id
(b) Eliminates performance issues that might come up when
deadlock detection does a lot of unnecessary checks
(c) After moving local execution after the remote execution
via https://github.com/citusdata/citus/pull/4301, the
vague requirement for assigning distributed tx ids is
already gone.
For some reason search_path is not always set correctly on the worker
when calling a distributed function. This shows up when calling
`insert_document` in our distributed_triggers test. The underlying
reason is currently unknown and warrants deeper investigation.
Currently this test is one of the main causes of random CI failures, so
this change sets the search_path of each function explicitly to reduce
these failures. That way other devs can be more efficient while I continue
investigating the root cause of this issue.
Also changes explicit `SET citus.enable_unsafe_triggers = false` to
`RESET citus.enable_unsafe_triggers` in passing.
* Separate build of citus.so and citus_columnar.so.
Because columnar code is statically-linked to both modules, it doesn't
make sense to load them both at once.
A subsequent commit will make the modules entirely separate and allow
loading them both simultaneously.
Author: Yanwen Jin
* Separate citus and citus_columnar modules.
Now the modules are independent. Columnar can be loaded by itself, or
along with citus.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
The aim of hiding shards is to hide shards from client applications.
Certain bg workers (such as pg_cron or the Citus maintenance daemon)
should be treated like client applications because users can run
queries from such bg workers. These bg workers should follow
the same application_name checks as client backends.
Certain other bg workers, such as logical replication or postgres'
parallel workers, should never hide shards. They are internal
operations.
Similarly the other backend types like the walsender or
checkpointer or autovacuum should never hide shards.
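The classification above can be summarized as a small decision function. This is a Python sketch only: the string labels and the `should_hide_shards` helper are invented for illustration and do not correspond to Postgres or Citus identifiers, and the application_name check is reduced to a boolean input.

```python
# Backend categories named in the text above (labels are made up).
CLIENT_LIKE = {"client backend", "pg_cron", "citus maintenance daemon"}
NEVER_HIDE = {
    "logical replication worker",
    "parallel worker",
    "walsender",
    "checkpointer",
    "autovacuum",
}


def should_hide_shards(backend_kind, app_name_check_says_hide):
    """Decide whether shards are hidden from a backend.

    Client-like backends defer to the application_name check;
    internal backends never hide shards.
    """
    if backend_kind in CLIENT_LIKE:
        return app_name_check_says_hide
    if backend_kind in NEVER_HIDE:
        return False
    # The text does not say what happens for other kinds, so refuse to guess.
    raise ValueError(f"unclassified backend kind: {backend_kind}")


print(should_hide_shards("pg_cron", True), should_hide_shards("walsender", True))
```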
We've had custom versions of Postgres's `foreach` macro with a
hidden ListCell for quite some time now. People like these custom
macros, because they are easier to use and require less boilerplate.
This adds similar custom versions of Postgres's `forboth` macro. Now
you don't need ListCells anymore when looping over two lists at the same
time.
Since we no longer throw an error for enums that the user attempts to create
in a temp schema, the preprocess / DDL job that contains the prepared
statement (to idempotently create the enum type) gets executed. As a
result, we were emitting the following warning because of the error the
underlying worker connection throws:
```sql
WARNING: cannot PREPARE a transaction that has operated on temporary objects
CONTEXT: while executing command on localhost:xxxxx
WARNING: connection to the remote node localhost:xxxxx failed with the following error: another command is already in progress
ERROR: cannot PREPARE a transaction that has operated on temporary objects
CONTEXT: while executing command on localhost:xxxxx
```
We were already doing so for functions & types believing that
this cannot be the case for other object types.
However, as in #5830, we cannot distribute an object that user
attempts creating in temp schema. Even more, this doesn't only
apply to functions and types but also to many other object types.
So with this commit, we teach preprocess/postprocess functions
(that need to create dependencies on worker nodes) how to skip
trying to distribute such objects.
We also start identifying temp schemas as the objects that we
don't know how to propagate to worker nodes so that we can
simply create objects locally if user attempts creating them
in a temp schema.
There are 36 callers of `EnsureDependenciesExistOnAllNodes` in
the codebase at the moment, and for most of them we still need to throw a hard
error (i.e.: not use `DeferErrorIfHasUnsupportedDependency`
beforehand), such as:
i) user explicitly wants to create a distributed object
* CreateCitusLocalTable
* CreateDistributedTable
* master_create_worker_shards
* master_create_empty_shard
* create_distributed_function
* EnsureExtensionFunctionCanBeDistributed
ii) we don't want to skip altering distributed table on worker nodes
* PostprocessIndexStmt
* PostprocessCreateTriggerStmt
* PostprocessCreateStatisticsStmt
iii) object is already distributed / handled by Citus before, so we
aren't okay with not propagating the ALTER command
* PostprocessAlterTableSchemaStmt
* PostprocessAlterCollationOwnerStmt
* PostprocessAlterCollationSchemaStmt
* PostprocessAlterDatabaseOwnerStmt
* PostprocessAlterExtensionSchemaStmt
* PostprocessAlterFunctionOwnerStmt
* PostprocessAlterFunctionSchemaStmt
* PostprocessAlterSequenceOwnerStmt
* PostprocessAlterSequenceSchemaStmt
* PostprocessAlterStatisticsSchemaStmt
* PostprocessAlterStatisticsOwnerStmt
* PostprocessAlterTextSearchConfigurationSchemaStmt
* PostprocessAlterTextSearchDictionarySchemaStmt
* PostprocessAlterTextSearchConfigurationOwnerStmt
* PostprocessAlterTextSearchDictionaryOwnerStmt
* PostprocessAlterTypeSchemaStmt
* PostprocessAlterForeignServerOwnerStmt
iv) we already cannot create those objects in temp schemas, so skipping
for now
* PostprocessCreateExtensionStmt
* PostprocessCreateForeignServerStmt
Also note that there are 3 more callers of
`EnsureDependenciesExistOnAllNodes` in enterprise in addition to those
36 but we don't need to do anything specific about them due to the same
reasoning given in iii).
In `pg_regress_multi.pl` we're running `initdb` with some options that
the `common.py` `initdb` is currently not using. All these flags seem
reasonable, so this brings `common.py` in line with
`pg_regress_multi.pl`.
In passing change the `--nosync` flag to `--no-sync`, since that's what
the PG documentation lists as the official option name (but both work).
Cluster setup time is significant in arbitrary configs. We can
parallelize this a bit more.
Runtime of the following command decreases from ~25 seconds to ~22
seconds on my machine with this change:
```
make -C src/test/regress/ check-arbitrary-base CONFIGS=CitusDefaultClusterConfig EXTRA_TESTS=prepared_statements_1
```
Currently we can only run different configs in parallel. However, when working on a feature or trying to fix a bug this is not important. In those cases you simply want to run a single test file on a single config, and you want to run that every time you make a change to the code that you think fixes the issue.
This PR allows running bash commands in parallel. So `initdb` and `pg_ctl start` are run in parallel for all nodes in the cluster, instead of one waiting for the other.
Before this PR, nothing was run in parallel when you ran the above command.
After this PR, cluster setup is run in parallel.
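A minimal sketch of the idea, assuming a thread pool around `subprocess`. The helper name `run_in_parallel` and the placeholder commands are hypothetical and are not taken from `common.py`:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(commands):
    """Run each shell command concurrently; return exit codes in input order."""

    def run(command):
        # shell=True mirrors running a bash command line.
        return subprocess.run(command, shell=True, check=False).returncode

    # One worker per command, so no command waits for another to finish.
    with ThreadPoolExecutor(max_workers=max(len(commands), 1)) as pool:
        return list(pool.map(run, commands))


if __name__ == "__main__":
    # Hypothetical stand-ins for the per-node `initdb` / `pg_ctl start` calls.
    print(run_in_parallel(["sleep 0.2", "sleep 0.2", "sleep 0.2"]))
```

With three nodes, the total wall-clock time is then roughly that of the slowest command rather than the sum of all three.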
We have fsync enabled for regular tests already in `pg_regress_multi.pl`.
This does the same for the arbitrary config tests.
On my machine this changes the runtime of the following command from
~37 to ~25 seconds:
```bash
make -C src/test/regress/ check-arbitrary-configs CONFIGS=CitusDefaultClusterConfig
```
Here is a list of some functions, and the `TargetWorkerSet` parameters
they supply to `NodeDDLTaskList`:
PostprocessCreateTextSearchConfigurationStmt -
NON_COORDINATOR_NODES
PreprocessDropTextSearchConfigurationStmt -
NON_COORDINATOR_METADATA_NODES
PreprocessAlterTextSearchConfigurationSchemaStmt -
NON_COORDINATOR_METADATA_NODES
I guess this means that, if metadata
syncing is disabled on the node, we may have some issues. Consider the
following:
Let's assume the user has metadata syncing disabled. 2 workers.
`CREATE TEXT SEARCH CONFIGURATION ...` will get propagated to all
workers. `ALTER ... CONFIGURATION ...` will not get propagated to
workers.
After adding a new non-metadata node, the new node will get the altered
configuration as it reads from the catalog. At this point the CONFIGURATION
definitions have diverged in the cluster.
I suggest that we always use `NON_COORDINATOR_METADATA_NODES` in all the
TEXT SEARCH operations here.
Before this commit, we erroneously converted the sequence
type to the type of the column it is used in. However, it is possible
that the sequence is used in an expression which is then converted
to a type that cannot be a sequence type, such as text.
With this commit, we only try this conversion if the column
type is a supported sequence type (e.g., smallint, int and bigint).
Note that we do this conversion because if the column type is a
bigint and the sequence is NOT a bigint, users would be in trouble
because sequences would generate values that are out of the range
of the column. (The other direction is already not supported: for
example, if the column is int and the sequence is bigint, it would
fail on the worker.)
In other words, with this commit, we scope this optimization to the
cases where the target column type is a supported sequence type.
Otherwise, we let users use sequences more freely.
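The rule can be sketched as below. This is a Python illustration only, not the C implementation; the function name and the type-name strings are hypothetical:

```python
# Column types that a sequence can take on (assumed spelling of the type names).
SUPPORTED_SEQUENCE_TYPES = {"smallint", "integer", "bigint"}


def sequence_type_for_column(column_type, current_sequence_type):
    """Return the type the sequence should have after the check."""
    if column_type in SUPPORTED_SEQUENCE_TYPES:
        # Align the sequence with the column so generated values fit.
        return column_type
    # E.g. the sequence feeds an expression cast to text: leave it alone.
    return current_sequence_type
```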
With the introduction of #4385 we inadvertently started allowing and
pushing down certain lateral subqueries that were unsafe to push down.
To be precise the type of LATERAL subqueries that is unsafe to push down
has all of the following properties:
1. The lateral subquery contains some non recurring tuples
2. The lateral subquery references a recurring tuple from
outside of the subquery (recurringRelids)
3. The lateral subquery requires a merge step (e.g. a LIMIT)
4. The reference to the recurring tuple should be something else than an
equality check on the distribution column, e.g. equality on a non
distribution column.
Property number four is considered both hard to detect and probably not
used very often. Thus this PR ignores property number four and causes
query planning to error out if the first three properties hold.
Fixes #5327
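As a rough illustration of the resulting check, the planner's decision reduces to a conjunction of the first three properties. The class and field names here are hypothetical and do not correspond to planner structs:

```python
from dataclasses import dataclass


@dataclass
class LateralSubquery:
    has_non_recurring_tuples: bool          # property 1
    references_outer_recurring_tuple: bool  # property 2
    requires_merge_step: bool               # property 3, e.g. a LIMIT


def must_reject_pushdown(subquery):
    """Property 4 is deliberately ignored, so this may reject some safe queries."""
    return (
        subquery.has_non_recurring_tuples
        and subquery.references_outer_recurring_tuple
        and subquery.requires_merge_step
    )
```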
TEXT SEARCH DICTIONARY objects depend on TEXT SEARCH TEMPLATE objects.
Since we do not yet support distributed TS TEMPLATE objects, we skip
dependency checks for text search templates, similar to what we do for
roles.
The user is expected to manually create the TEXT SEARCH TEMPLATE objects
before a) adding new nodes, b) creating TEXT SEARCH DICTIONARY objects.
If a worker node is being added, a command is sent to get the server_id of the worker from the pg_dist_node_metadata table. If the worker's id is the same as the node executing the code, we will know the node is trying to add itself. If the node tries to add itself without specifying `groupid:=0` the operation will result in an error.
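A minimal sketch of that check in Python. The function name, exception type and error text are hypothetical; only the comparison of server ids and the `groupid:=0` exception come from the description above:

```python
class SelfAddError(Exception):
    """Raised when a node tries to add itself with a non-zero group id."""


def ensure_not_adding_self(local_server_id, remote_server_id, group_id):
    # The remote id is what the node being added reports from its
    # pg_dist_node_metadata; equal ids mean we connected to ourselves.
    if remote_server_id == local_server_id and group_id != 0:
        raise SelfAddError("node cannot add itself unless groupid is 0")
```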
Using CASCADE in a DELETE can inadvertently delete things we don't
intend to. It's safer to fail hard and make the user delete depending
things manually.
1) Remove useless columns
2) Show backends that are blocked on a DDL even before
gpid is assigned
3) One minor bugfix, where we clear distributedCommandOriginator
properly.
DESCRIPTION: Move pg_dist_object to pg_catalog
Historically `pg_dist_object` had been created in the `citus` schema as an experiment to understand if we could move our catalog tables to a branded schema. We quickly realised that this interfered with the UX on our managed services and other environments, where users connected via a user with the name of `citus`.
By default Postgres puts the username on the search_path. To be able to read the catalog in the `citus` schema we would need to grant access permissions to the schema. This caused newly created objects, like tables, to default to this schema for creation, which then failed due to the write permissions on that schema.
With this change we move the `pg_dist_object` catalog table to the `pg_catalog` schema, where our other catalog tables are also located. This makes the catalog table visible and readable by any user, like our other catalog tables, for debugging purposes.
Note: due to the change of schema, we had to disable 1 test that was running into a discrepancy between the schema and binary. Secondly, we needed to make the lookup functions for the `pg_dist_object` relation and their indexes less strict on the fallback of the naming due to another test that, due to an unfortunate cache invalidation, needed to look up the relation again. This means that we won't default to _only_ resolving from `pg_catalog` outside of upgrades.
* Notice when create_distributed_function called without params
* Move variable comments to top
* Add valid check for cache entry
* add objtype to notice msg
* update test outputs
* Add more tests
* Address feedback
And also citus_calculate_gpid(nodeId,pid). These UDFs are just
wrappers for the existing functions. Useful for testing and simple
manipulation of citus_stat_activity.
It seems like our approach is way too restrictive and wrong in some
places. Now, we follow a very similar approach to pg_stat_activity.
Some of the changes are prerequisites for implementing citus_dist_stat_activity
via citus_stat_activity.
Clusters created pre-Citus 11 mostly didn't have metadata sync enabled.
For those clusters, we add a utility UDF which fixes some minor issues
and syncs the necessary objects to the workers.
* [Columnar] Build columnar.so and let citus depend on it
Co-authored-by: Yanwen Jin <yanwjin@microsoft.com>
Co-authored-by: Ying Xu <32597660+yxu2162@users.noreply.github.com>
Co-authored-by: jeff-davis <Jeffrey.Davis@microsoft.com>
DESCRIPTION: Add GUC to control ddl creation behaviour in transactions
Historically we would _not_ propagate objects when we are in a transaction block. Creation of distributed tables would not always work in sequential mode, hence objects created in the same transaction as distributing a table that would use the just created object wouldn't work. The benefit was that the user could still benefit from parallelism.
Now that the creation of distributed tables is supported in sequential mode it would make sense for users to force transactional consistency of ddl commands for distributed tables. A transaction could switch more aggressively to sequential mode when creating new objects in a transaction.
We don't change the default behaviour just yet.
Also, many objects would not even propagate their creation when the transaction was already set to sequential, leaving the possibility of a self deadlock. The new policy checks solve this discrepancy between objects as well.
The issue in question is caused when rebalance / replication call `FullShardPlacementList` which returns all shard placements (including those in disabled nodes with `citus_disable_node`). Eventually, `FindFillStateForPlacement` looks for the state across active workers and fails to find a state for the placements which are in the disabled workers causing a seg fault shortly after.
Approach:
* `ActivePlacementHash` was not using the status of the shard placement's node to determine whether the placement is active. Initially, I just fixed that.
* Additionally, I refactored the code which handles active shards in replication / rebalance to:
* use a single function to determine if a shard placement is active.
* do the active shard filtering before calling `RebalancePlacementUpdates` and `ReplicationPlacementUpdates`, so test methods like `shard_placement_rebalance_array` and `shard_placement_replication_array` which have different shard placement active requirements can do their own filtering while using the same rebalance / replicate logic that `rebalance_table_shards` and `replicate_table_shards` use.
Fix #5664
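The shape of the refactoring can be sketched as follows. This is Python for illustration, with hypothetical names; the real code is C and works on Citus placement structs:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    shard_id: int
    node_id: int
    shard_is_active: bool


def is_active_placement(placement, active_node_ids):
    """Single place that decides whether a placement counts as active."""
    # A placement on a disabled node is not active, even if the shard
    # state itself looks fine.
    return placement.shard_is_active and placement.node_id in active_node_ids


def filter_active_placements(placements, active_node_ids):
    """Filter before handing the list to the rebalance / replicate logic."""
    return [p for p in placements if is_active_placement(p, active_node_ids)]
```

Because the filtering happens up front, the later lookup of per-node state only ever sees placements whose node is in the active set.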
CitusInitiatedBackend was a premature implementation of the whole
GlobalPID infrastructure. We used it to track whether any individual
query is triggered by Citus or not.
Now that GlobalPID is in place, we don't need
CitusInitiatedBackend; in fact, it could even be wrong.
#5685 introduced the resolution of dependencies for indices. This missed support for indices on partitioned tables. This change adds support for partitioned indices to the dependency resolution code.
It turns out `whereis` is incredibly slow on WSL2 (at least on my
machine):
```
$ time whereis diff
diff: /usr/bin/diff /usr/share/man/man1/diff.1.gz
real 0m0.408s
user 0m0.010s
sys 0m0.101s
```
This command is run by our custom `diff` script, which is run for every
test file that is run. So this adds a lot of unnecessary runtime to
tests.
This changes our custom `diff` script to only call `whereis` in the
strange case that `/usr/bin/diff` does not exist.
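The lookup order can be sketched as below. This is a Python illustration, not the actual script: `shutil.which` stands in for the slow `whereis` fallback, and the function name is hypothetical:

```python
import os
import shutil


def find_diff(preferred="/usr/bin/diff"):
    """Return a path to diff, trying the cheap well-known location first."""
    if os.path.isfile(preferred) and os.access(preferred, os.X_OK):
        # Fast path: a stat call instead of spawning a lookup process.
        return preferred
    # Slow path, only reached when the usual location is missing.
    return shutil.which("diff")
```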
The impact of this small change on the total runtime of the tests on WSL
is huge. As an example the following command takes 18 seconds without
this change and 7 seconds with it:
```
make -C src/test/regress/ check-arbitrary-configs CONFIGS=PostgresConfig
```
(cherry picked from commit 4e93afd1f78854e1aaab63690c441b0b0598a82c)
(cherry picked from commit 0295fe2f5b)
(cherry picked from commit 878510725fab9cb6870b4504e0b1f055d7bbc68d)
Before this commit, dumping wait edges can only be used for
distributed deadlock detection purposes. With this commit,
we open the possibility that we can use it for any backend.
CREATE FUNCTION command together with its dependencies.
If the function depends on any nondistributable object, the
function will be created only locally. The parameterless
version of create_distributed_function becomes obsolete
with this change; it will be deprecated in a subsequent PR.
* When a worker tried to create a collation which had a dependency in the same worker node,
it would cause a deadlock, now it throws the correct "not a coordinator" error.
DESCRIPTION: Implement TEXT SEARCH CONFIGURATION propagation
The change adds support to Citus for propagating TEXT SEARCH CONFIGURATION objects. TSConfig objects cannot always be created in one create statement, and instead require a create statement followed by many alter statements to get turned into the object they should represent.
To support this we add functionality to the worker to create or replace objects based on a list of statements. When the lists of the local object and the remote object correspond 1:1 we skip the creation of the object and simply mark it distributed. This is especially important for TSConfig objects as initdb pre-populates databases with a dozen configurations (for many different languages).
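The worker-side decision described above could be sketched roughly as follows. This is Python for illustration; the function name and the returned action strings are hypothetical:

```python
def decide_action(existing_statements, incoming_statements):
    """Compare the statement lists that define the two objects.

    existing_statements: statements that would recreate the object already
    present on the worker. incoming_statements: statements sent for the
    object being propagated.
    """
    if list(existing_statements) == list(incoming_statements):
        # Lists correspond 1:1, so the object is already identical.
        return "mark_distributed"
    return "create_or_replace"
```

For the configurations pre-populated by initdb, the two lists would typically match, so nothing needs to be recreated.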
When the user creates a new TSConfig based on the copy of an existing configuration there is no direct link to the object copied from. Since there is no link we can't simply rely on propagating the dependencies to the worker and send a qualified
We check for metadata consistency across the cluster in the test
isolation_metadata_sync_vs_all. However, some earlier tests in
enterprise repo leave invalid pg_dist_node entries in the worker nodes
that have Oid values for already dropped role objects.
To remedy that, I suggest that we move the test to earlier in the
schedule, thereby making the tests pass for the time being. We should
later introduce metadata checking either in a new isolation test or by
moving this test later in the schedule. However, we should do that after
we fix the underlying issue.
The low-level StoreAllActiveTransactions() function filters out
backends that exited.
Before this commit, if you run a pgbench, after that you'd still
see the backends show up:
```SQL
select count(*) from get_global_active_transactions();
┌───────┐
│ count │
├───────┤
│ 538 │
└───────┘
```
After this patch, only active backends show-up:
```SQL
select count(*) from get_global_active_transactions();
┌───────┐
│ count │
├───────┤
│ 72 │
└───────┘
```
DESCRIPTION: Prevent Citus table functions from being called on shards
The operations that guard against using shards are:
* Create Local Table
* Create distributed table (which affects reference table creation as well).
* I used a `ErrorIfRaltionIsKnownShard` instead of `ErrorIfIllegallyChangingKnownShard`.
`ErrorIfIllegallyChangingKnownShard` allows the operation if `citus.enable_manual_changes_to_shards`,
but I am not sure if it ever makes sense to create a distributed, reference, or citus local table out of a shard.
I tried to go over the code to identify other UDFs where shards could be illegally changed, but I could not find any others.
My knowledge of the codebase is not solid enough for me to say for sure.
Fixes #5610
This commit introduces several test cases for concurrent operations that
change metadata, and a concurrent metadata sync operation.
The overall structure is as follows:
- Session#1 starts metadata syncing in a transaction block
- Session#2 does an operation that changes metadata
- Both sessions are committed
- Another session checks whether the metadata are the same across all
nodes in the cluster.
* Break the dependency to CitusInitiatedBackend infrastructure
With this change, we start to show non-distributed backends as well
in citus_dist_stat_activity. I think that
(a) it is essential for making citus_lock_waits work for backends
blocked on DDL commands.
(b) it is more expected from the user's perspective. The name of
the view is a little inconsistent now (e.g., citus_dist_stat_activity)
but we are already planning to improve the names with followup
PRs.
Also, now that we have global pids assigned, CitusInitiatedBackend
becomes obsolete.
With https://github.com/citusdata/citus/pull/5657, Citus uses
a fixed application_name while connecting to remote nodes
for internal purposes.
It means that we cannot allow users to override it via
citus.node_conninfo.
Implement #5649
Allow create_distributed_function() on functions owned by extensions
1) Only update pg_dist_object, and do not propagate CREATE FUNCTION.
2) Ensure corresponding extension is in pg_dist_object.
3) Verify if dependencies exist on the function they should resolve to the extension.
4) Impact on node-scaling: We build a list of ddl commands based on all objects in
pg_dist_object. We need to omit the ddl's for the extension-function, as it
will get propagated by the virtue of the extension creation.
5) Extra checks for functions coming from extensions, to not propagate changes
via ddl commands, even though the function is marked as distributed in pg_dist_object
If the expression is simple, such as SELECT function() or PERFORM function()
in PL/pgSQL code, the PL engine does a simple expression evaluation which can't
interpret the Citus CustomScan node. The code checks for simple expressions when
executing a UDF but missed the DO block scenario; this commit fixes it.
Removed dependency for EnsureTableOwner. Also removed pg_fini() and columnar_tableam_finish(). Still need to remove the CheckCitusVersion dependency to make Columnar_tableam.h dependency-free from Citus.
Previously, we were wrapping targetlist nodes with Vars that reference
to the result of the worker query, if the node itself is not `Const` or
not a `Param`. Indeed, we should not do that unless the node itself is
a `Var` node or contains a `Var` within it (e.g.: `OpExpr(Var(column_a) > 2)`).
Otherwise, when worker query returns empty result set, then combine
query exec would crash since the `Var` would be pointing to an empty
tuple slot, which is not desirable for the node-executor methods.
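The new condition amounts to a recursive "contains a `Var`" test on the targetlist expression. A toy sketch in Python, using a hypothetical `Node` class rather than PostgreSQL's node types:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # e.g. "Var", "Const", "Param", "OpExpr"
    args: list = field(default_factory=list)


def contains_var(node):
    """True if the node is a Var or has a Var anywhere beneath it."""
    if node.kind == "Var":
        return True
    return any(contains_var(arg) for arg in node.args)


def should_wrap_with_var(node):
    # Only expressions that depend on worker output need to reference it.
    return contains_var(node)
```

An expression such as a function call over constants no longer gets wrapped, so it can be evaluated in the combine query even when the worker query returns no rows.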
Replaces citus.enable_object_propagation with citus.enable_metadata_sync
Also, within Citus 11 release cycle, we added citus.enable_metadata_sync_by_default,
that is also replaced with citus.enable_metadata_sync.
In essence, when citus.enable_metadata_sync is set to true, all the objects
and the metadata are sent to the remote node.
We strongly advise that users never change the value of
this GUC.
With this commit, rebalancer backends are identified by application_name = citus_rebalancer
and the regular internal backends are identified by application_name = citus_internal
With this commit we've started to propagate sequences and shell
tables within the object dependency resolution. So, ensuring any
dependencies for any object will consider shell tables and sequences
as well. Separate logics for both shell tables and sequences have
been removed.
Since both shell tables and sequences logic were implemented as a
part of the metadata handling before that logic, we were propagating
them while syncing table metadata. With this commit we've divided
metadata (which means anything except shards thereafter) syncing
logic into multiple parts and implemented it as a part of
ActivateNode. You can check the functions called in ActivateNode
to check definition of different metadata.
Definitions of start_metadata_sync_to_node and citus_activate_node
have also been updated. citus_activate_node will basically create
an active node with all metadata and reference table shards.
start_metadata_sync_to_node will be the same as citus_activate_node
except that it does not replicate reference tables. stop_metadata_sync_to_node
will remove all the metadata. All of those UDFs need to be called
by superuser.
When creating a new table, we bypass the buffer cache and write the
initial pages directly with smgrwrite(). However, you're supposed to
use smgrextend() when extending a relation, rather than smgrwrite().
There isn't much difference between them, but smgrextend() updates the
relation size cache, which seems important, although I haven't seen
any real bugs caused by that.
Also, write the block to disk only after WAL-logging it, so that we
can include the LSN of the WAL record in the version that we write
out. Currently, the page as written to disk has LSN 0. That doesn't
cause any user-visible issues either, at worst it could make us
WAL-log a full page image of the page earlier than necessary, but that
doesn't matter currently because we WAL-log full page images of all
changes anyway.
I bumped into that issue with LSN 0 in the page header when testing
Citus with Zenith (https://github.com/zenithdb/zenith/issues/1176).
Zenith contains a check that PANICs if you write a block to disk
without WAL-logging it, and it works by checking the LSN of the page
that's written out. In this case, we are WAL-logging the page even
though the LSN on the page is 0, so it was a false alarm, but I'd love
to get this changed in Citus to keep the check in Zenith simple.
A downside of WAL-logging the page first is that if you run out of
disk space, you have already created the WAL record. So if you then
crash and restart, WAL recovery will likely run out of disk space,
too, which is bad. In practice, we have the same problem in other
places, like rewriteheap.c. Also, if you are on the brink of running
out of disk space, you will probably run out at WAL replay anyway,
regardless of which order we write these few pages. But if we wanted
to fix that, we could first extend the relation with zeros, and then
WAL-log the pages. That's how heap extension works.
It would be even nicer to use the buffer cache for this, and skip the
smgrimmedsync() on the relation. However, that would require more
work, because we don't have the Relation struct for the relation here.
We could use ReadBufferWithoutRelcache(), but that doesn't work for
unlogged tables. Unlogged tables are currently not supported
(https://github.com/citusdata/citus/issues/4742), but that would
become a problem if we want to support them in the future.
CreateFakeRelcacheEntry() also doesn't work with unlogged tables. We
could do things differently for logged and unlogged tables, but that
complicates the code further.
Co-authored-by: jeff-davis <Jeffrey.Davis@microsoft.com>
Citus heavily relies on application_name, see
`IsCitusInitiatedRemoteBackend()`.
But if the user set the application name, such as export PGAPPNAME=test_name,
Citus uses that name while connecting to the remote node.
With this commit, we ensure that Citus always connects with
the "citus" application name to the remote nodes.
With https://github.com/citusdata/citus/pull/2780, we allow
COPY to use any number of connections that the executor used
in a tx block.
Meaning that, while COPYing data to the shards, create_distributed_table
could allow sequential mode.
We fall back to local execution if we cannot establish any more
connections to local node. However, we should not do that for the
commands that we don't know how to execute locally (or we know we
shouldn't execute locally). To fix that, we take localExecutionSupported
into account in CanFailoverPlacementExecutionToLocalExecution too.
Moreover, we also emit a more accurate hint message to inform the user
about whether the execution failed because local execution is
disabled by them, or because local execution wasn't possible for the given
command.
The multi_log_hook() hook is called by EmitErrorReport() when emitting the
ereport either to the frontend or to the server logs. Some callers of
EmitErrorReport() (e.g., errfinish()) seem to assume that string fields
of the given ErrorData object need to be freed. For this reason, we copy the
message into heap here.
I don't think we have faced such a problem before but it seems worth
fixing as it is theoretically possible due to the reasoning above.
BEGIN/COMMIT transaction block or in a UDF calling another UDF.
(2) Prohibit/Limit the delegated function not to do a 2PC (or any work on a
remote connection).
(3) Have a safety net to ensure the (2) i.e. we should block the connections
from the delegated procedure or make sure that no 2PC happens on the node.
(4) Such delegated functions are restricted to use only the distributed argument
value.
Note: To limit the scope of the project we are considering only Functions(not
procedures) for the initial work.
DESCRIPTION: Introduce a new flag "force_delegation" in create_distributed_function(),
which will allow a function to be delegated in an explicit transaction block.
Fixes #3265
Once the function is delegated to the worker, on that node during the planning
distributed_planner()
TryToDelegateFunctionCall()
CheckDelegatedFunctionExecution()
EnableInForceDelegatedFuncExecution()
Save the distribution argument (Constant)
ExecutorStart()
CitusBeginScan()
IsShardKeyValueAllowed()
Ensure to not use non-distribution argument.
ExecutorRun()
AdaptiveExecutor()
StartDistributedExecution()
EnsureNoRemoteExecutionFromWorkers()
Ensure all the shards are local to the node in the remoteTaskList.
NonPushableInsertSelectExecScan()
InitializeCopyShardState()
EnsureNoRemoteExecutionFromWorkers()
Ensure all the shards are local to the node in the placementList.
This also fixes a minor issue: Properly handle expressions+parameters in distribution arguments
* Removed distributed dependency in columnar_metadata.c
* Changed columnar_debug.c so that it no longer needed distributed/tuplestore and made it return a record instead of a tuplestore
* removed distributed/commands.h dependency
* Made columnar_tableam.c dependency-free
* Fixed spacing for columnar_store_memory_stats function
* indentation fix
* fixed test failures
* Require superuser while activating a node
With this change, we make ActivateNode() (hence citus_add_node(),
citus_activate_node()) explicitly require a superuser.
Before this commit, these functions were designed to work with
non-superuser roles with the relevant GRANTs given.
However, that is not a widely used way for calling the functions
above.
Due to possibility of non-super user calling the UDFs, they were
designed in a way that some commands were using some additional
short-lived superuser connections. That is:
(a) breaking transactional behavior (e.g., ROLLBACK
wouldn't fully rollback the whole transaction)
(b) Making it very complicated to reason about which
parts of the node activation goes over which connections,
and becoming vulnerable to deadlocks / visibility issues.
In addition to starting a new transaction, we also need to tell other
backends --including the ones spawned for connections opened to
localhost to build indexes on shards of this relation-- that concurrent
index builds can safely ignore us.
Normally, DefineIndex() only does that if index doesn't have any
predicates (i.e.: where clause) and no index expressions at all.
However, now that we already called standard process utility, index
build on the shell table is finished anyway.
The reason behind doing so is that we cannot guarantee not grabbing any
snapshots via adaptive executor, and the backends creating indexes on
local shards (if any) might block on waiting for current xact of the
current backend to finish, which would cause self deadlocks that are not
detectable.
With https://github.com/citusdata/citus/pull/5493 we introduced
metadata specific connections.
With this connection we guarantee that there is a single metadata connection.
But note that this connection can be used for any other operation.
In other words, this connection is not only reserved for metadata
operations.
However, https://github.com/citusdata/citus-enterprise/issues/715 showed
us that the logic has a flaw. We allowed ineligible connections to be
picked as metadata connections, such as exclusively claimed connections
or not fully initialized connections.
With this commit, we make sure that we only consider eligible connections
for metadata operations.
We prefer the background daemon to only sync node metadata. That's
why we move placement metadata changes from disable node to
activate node. With that, we can make sure that disable node
only changes node metadata, whereas activate node syncs all
the metadata changes. In essence, we already expect all
nodes to be up when a node is activated. So, this does not change
the behavior much.
Dropping sequences means we need to recreate them,
and hence we lose the sequence.
With this commit, we keep the existing sequences
such that resyncing wouldn't drop the sequence.
We do that by breaking the dependency of the sequence
from the table.
Split distributed/version_compat.h into dependency-free
pg_version_compat.h, and the original which still has
dependencies. The original doesn't have much purpose, but until other
files have better discipline about including the correct header files,
then it's still needed.
Also make distributed/listutils.h dependency-free. Should be moved
outside of 'distributed' subdirectory, but that will cause significant
code churn, so leave for another cleanup patch.
Now both files can be included in columnar without creating a
dependency on citus.
Previously, we cheated by using the RM_GENERIC_ID record type, but not
actually using the generic WAL API. This worked because we always took
a full page image, and saved the extra work of allocating and copying
to a temporary page.
But it introduced complexity, and perhaps fragility, so better to just
use the API properly. The performance penalty for a serial data load
seems to be less than 1%.
Before this commit, Citus was triggering metadata syncing
in the background when a function is distributed. However,
with Citus 11, we expect all clusters to have metadata sync
enabled. So, we do not expect any nodes not to have the metadata.
This change:
(a) pro: simplifies the code and opens up possibilities
to simplify further by reducing the scope of
bg worker to only sync node metadata
(b) pro: explicitly asks users to sync the metadata such that
any unforeseen impact can be easily detected
(c) con: For distributed functions without distribution
argument, we do not necessarily require the metadata
synced. However, for completeness and simplicity, we
do so.
With Citus 11, the default behavior is to sync the metadata.
However, partitioned tables created pre-Citus 11 might have
index names that are not compatible with metadata syncing.
See https://github.com/citusdata/citus/issues/4962 for the
details.
With this commit, we record the existence of partitioned tables
such that we can fix it later if any exists.
With this commit, fix_partition_shard_index_names()
works significantly faster.
For example,
32 shards, 365 partitions, 5 indexes drop from ~120 seconds to ~44 seconds
32 shards, 1095 partitions, 5 indexes drop from ~600 seconds to ~265 seconds
`queryStringList` can be really long, because it may contain #partitions * #indexes entries.
Before this change, we were actually going through the executor where each command
in the query string triggers 1 round trip per entry in queryStringList.
The aim of this commit is to avoid the round-trips by creating a single query string.
I first simply tried sending `q1;q2;..;qn` . However, the executor is designed to
handle `q1;q2;..;qn` type of query executions via the infrastructure mentioned
above (e.g., by tracking the query indexes in the list and doing 1 statement
per round trip).
Another option could have been to change the executor such that it only tracks
the query index when `queryStringList` is provided, and not when a queryString
including multiple `;`s is provided. That is (a) more work, (b) could cause weird
edge cases with failure handling, and (c) felt like coding a special case into the executor.
(cherry picked from commit 90928cfd74)
Fix function signature generation
Fix comment typo
Add test for worker_create_or_replace_object
Add test for recreating distributed functions with OUT/TABLE params
Add test for recreating distributed function that returns setof int
Fix test output
Fix comment
Simply applies
```SQL
SELECT textlike(command, citus.grep_remote_commands)
```
If that returns true, the command is logged. Otherwise, the command is not logged.
When citus.grep_remote_commands is empty string, all commands are
logged.
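To illustrate the filtering rule, here is an approximation of `LIKE` matching in Python. The helper names are hypothetical and the pattern translation is a simplification of PostgreSQL's `LIKE` semantics (`%` matches any run of characters, `_` matches one character, backslash escapes the next character):

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped character: match it literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def should_log(command, grep_pattern):
    """Mirror the rule above: an empty pattern logs every command."""
    if grep_pattern == "":
        return True
    # LIKE must match the whole string, hence fullmatch.
    return like_to_regex(grep_pattern).fullmatch(command) is not None
```

Note that, as with `LIKE`, a pattern without `%` on both sides has to match the entire command text, so a substring search needs the form `%...%`.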