This commit is inspired by a commit
dc9c3b0ff21465fa89d71eecf5e6cc956d647eca from PostgreSQL 15 that shares
the same header.
I also removed some gitignore rules so that I can add some files to git
worktree. We used to ignore the generated files, that are no longer
generated after this commit.
--------------------
Below is the commit message from PostgreSQL 15 commit
dc9c3b0ff21465fa89d71eecf5e6cc956d647eca :
"git mv" all the input/*.source and output/*.source files into
the corresponding sql/ and expected/ directories. Then remove
the pg_regress and Makefile infrastructure associated with
dynamic translation.
Discussion: https://postgr.es/m/1655733.1639871614@sss.pgh.pa.us
This commit is inspired by a commit
d1029bb5a26cb84b116b0dee4dde312291359f2a from PostgreSQL 15 that shares
the same header.
--------------------
Below is the commit message from PostgreSQL 15 commit
d1029bb5a26cb84b116b0dee4dde312291359f2a :
pg_regress has long had provisions for dynamically substituting path
names into regression test scripts and result files, but use of that
feature has always been a serious pain in the neck, mainly because
updating the result files requires tedious manual editing. Let's
get rid of that in favor of passing down the paths in environment
variables.
In addition to being easier to maintain, this way is capable of
dealing with path names that require escaping at runtime, for example
paths containing single-quote marks. (There are other stumbling
blocks in the way of actually building in a path that looks like
that, but removing this one seems like a good thing to do.) The key
coding rule that makes that possible is to concatenate pieces of a
dynamically-variable string using psql's \set command, and then use
the :'variable' notation to quote and escape the string for the next
level of interpretation.
In hopes of making this change more transparent to "git blame",
I've split it into two steps. This commit adds the necessary
pg_regress.c support and changes all the *.source files in-place
so that they no longer require any dynamic translation. The next
commit will just "git mv" them into the regular sql/ and expected/
directories.
Discussion: https://postgr.es/m/1655733.1639871614@sss.pgh.pa.us
PostgreSQL 15 dropped usage of .source files that are used to generate
.sql and .out files by replacing some placeholders with the actual
values before test runs. Instead, the information is passed from
pg_regress to the .sql and .out files directly via env variables. Those
variables are read via \getenv psql command in relevant test files.
PostgreSQL 15 commit d1029bb5a26cb84b116b0dee4dde312291359f2a introduced
some changes to pg_regress binary that allowed this to happen. However
this change is not backported to earlier versions of PG, and thus we
come up with a similar mechanism in pg_regress_multi that works in all
available PG versions.
When using `citus.replicate_reference_tables_on_activate = off`,
reference tables need to be replicated later. This can be done using the
`replicate_reference_tables()` UDF. However, this function only allowed
blocking replication. This changes the function to default to logical
replication instead, and allows choosing any of our existing shard
transfer modes.
DESCRIPTION: Use faster custom copy logic for non-blocking shard moves
Non-blocking shard moves consist of two main phases:
1. Initial data copy
2. Catchup phase
This changes the first of these phases significantly. Previously we used the
copy logic provided by postgres subscriptions. This meant we didn't have
to implement it ourselves, but it came with the downside of little control.
When implementing shard splits we needed more control to even make it
work, so we implemented our own logic for copying data between nodes.
This PR starts using that logic for non-blocking shard moves. Doing so
has four main advantages:
1. It uses COPY in binary format when possible, which is cheaper to encode
and decode. Furthermore it very often results in less data that needs to
be sent over the network.
2. It allows us to create the primary key (or other replica identity) after doing
the initial data copy. This should give some speed up over the total run,
because creating an index is bulk is much faster than incrementally building it.
3. It doesn't require a replication slot per parallel copy. Increasing the maximum
number of replication slots uses resources in postgres, even if they are not used.
So reducing the number of replication slots that shard moves need is nice.
4. Logical replication table_sync workers are slow to start up, so if lots of shards
need to be copied that can make it quite slow. This can happen easily when
combining Postgres partitioning with Citus.
master_drain_node in distributed_triggers.sql test file takes too
long to execute. It is directly dependent on the shard count.
Hence I reduced shard count from 32 to 4 (default in tests),
since this doesn't affect the validity of the tests.
This change reduces the setup time of our minimal schedules in two ways:
1. Don't run `multi_cluster_managament`, but instead run a much smaller
sql file with almost the same results. `multi_cluster_management`
adds and removes lots of nodes and tests all kinds of failure
scenarios. This is not needed for the minimal schedules. The only
reason we were using it there was to get a working cluster of the
layout that the tests expected. The new `minimal_cluster_management`
test achieves this with much less work, going from ~2s to ~0.5s.
2. Parallelize a bit more of the helper tests.
We are reducing the log level here to avoid alternative test output
in PG15 because of the change in the display of SQL-standard
function's arguments in INSERT/SELECT in PG15.
The log level changes can be reverted when we drop support for PG14
Relevant PG commit:
a8d8445a7b2f80f6d0bfe97b19f90bd2cbef8759
The new shard copy code that was created for shard splits has some
advantages over the old shard copy code. The old code was using
worker_append_table_to_shard, which wrote to disk twice. And it also
didn't use binary copy when that was possible. Both of these issues
were fixed in the new copy code. This PR starts using this new copy
logic also for shard moves, not just for shard splits.
On my local machine I created a single shard table like this.
```sql
set citus.shard_count = 1;
create table t(id bigint, a bigint);
select create_distributed_table('t', 'id');
INSERT into t(id, a) SELECT i, i from generate_series(1, 100000000) i;
```
I then turned `fsync` off to make sure I wasn't bottlenecked by disk.
Finally I moved this shard between nodes with `citus_move_shard_placement`
with `block_writes`.
Before this PR a move took ~127s, after this PR it took only ~38s. So for this
small test this resulted in spending ~70% less time.
And I also tried the same test for a table that contained large strings:
```sql
set citus.shard_count = 1;
create table t(id bigint, a bigint, content text);
select create_distributed_table('t', 'id');
INSERT into t(id, a, content) SELECT i, i, 'aunethautnehoautnheaotnuhetnohueoutnehotnuhetncouhaeohuaeochgrhgd.athbetndairgexdbuhaobulrhdbaetoausnetohuracehousncaoehuesousnaceohuenacouhancoexdaseohusnaetobuetnoduhasneouhaceohusnaoetcuhmsnaetohuacoeuhebtokteaoshetouhsanetouhaoug.lcuahesonuthaseauhcoerhuaoecuh.lg;rcydabsnetabuesabhenth' from generate_series(1, 20000000) i;
```
While testing 5670dffd33, I realized
that we have a missing RecordNonDistTableAccessesForTask() for
local utility commands.
Although we don't have to record the relation access for local
only cases, we really want to keep the behaviour for scale-out
be the same with single node on all aspects. We wouldn't want
any single node complex transaction to work on single machine,
but not on multi node cluster. Hence, we apply the same restrictions.
For example, on a distributed cluster, the following errors, and
after this commit this errors locally as well
```SQL
CREATE TABLE ref(a int primary key);
INSERT INTO ref VALUES (1);
CREATE TABLE dist(a int REFERENCES ref(a));
SELECT create_reference_table('ref');
SELECT create_distributed_table('dist', 'a');
BEGIN;
SELECT * FROM dist;
TRUNCATE ref CASCADE;
ERROR: cannot execute DDL on table "ref" because there was a parallel SELECT access to distributed table "dist" in the same transaction
HINT: Try re-running the transaction with "SET LOCAL citus.multi_shard_modify_mode TO 'sequential';"
COMMIT;
```
We also add the comprehensive test suite and run the same locally.
Code snippet in Makefile was blocking Citus build when USE_PGXS flag was set. This was included for port to FSPG but is not needed for Citus engine and can be safely removed.
Reported bug #5803 shows that we are currently not sending the IN clause to our planner for columnar. This PR fixes it by checking for ScalarArrayOpExpr in ExtractPushdownClause so that we do not skip it. Also added a test case for this new addition.
It turns out that create_distributed_table
and citus_move/copy_shard_placement does not
work well concurrently.
To fix that, we need to acquire a lock, which
sounds like a good use of colocation lock.
However, the current usage of colocation lock is
limited to higher level UDFs like rebalance_table_shards
etc. Those usage of lock is still useful, but
we cannot acquire the same lock on citus_move_shard_placement
etc. because the coordinator connects to itself to acquire
the lock. Hence, the high level UDF blocks itself.
To fix that, we use one more colocation lock, with the placements
are the main objects to consider.
Before this commit, we required multiple copies of the
same stringInfo if we needed to append/prepend data to
the stringInfo. Now, we optionally get prefix/postfix.
For large string operations, this can save up to %10
memory.
Previously, CreateFixPartitionShardIndexNames() created all
the relevant query strings for all the shards, and executed
the large query string. And, in terms of the memory consumption,
this huge command (and its ExprContext generated while running
the command) is the main bottleneck/
With this change, we are reducing the total amount of memory
usage to almost 1/shard_count.
On my local machine, a distributed partitioned table with 120 partitions,
each 32 shards, the total memory consumption reduced from ~3GB
to ~0.1GB. And, the total execution time increased from ~28 seconds
to ~30 seconds. This seems like a good trade-off.
We used to only check whether the PID is valid
or not. However, Postgres does not necessarily
set the PID of the backend to 0 when it exists.
Instead, we need to be able to check it from procArray.
IsBackendPid() is what pg_stat_activity also relies
on for a similar purpose.
Historically we have been testing with the 'latest' version of libpq
when the CI images were build. This has the downside that rebuilding the
images often break our tests due to different errors returned from
libpq.
With this change we will actually test with a stable version of libpq
that is based on the postgres minor version that we test against.
This will make it easier to maintain postgres images over time, as well
as running _all_ tests locally, where we change libpq in sync with the
postgres server version.
use RecurseObjectDependencies api to find if an object is citus depended
make vanilla tests runnable to see if citus_depended function is working correctly
citus_locks combines the pg_locks views from all nodes and adds
global_pid, nodeid, and relation_name. The columns of citus_locks don't
change based on the Postgres version, however the pg_locks's columns do.
Postgres 14 added one more column to pg_locks (waitstart timestamptz).
citus_locks has the most expansive column set, including the newly added
column. If citus_locks is queried in a Postgres version where pg_locks
doesn't have some columns, the values for those columns in citus_locks
will be NULL
DESCRIPTION:
This PR extends support for Partitioned and Columnar tables in blocking 'citus_split_shard_by_split_points' workflow.
Columnar Support : No special handling required. Just removing checks that fails split for columnar table and adding test coverage.
Partitioned Table Support :
Skip copying of parent table as they are empty, The partitions contain data and are treated as co-located shards that will be copied separately.
Attach partitions to parent on destination after inserting new shard metadata and before creating foreign key constraints.
MISC:
Fix Bug #4949 where Blocking shard moves fails if there is a foreign key between partitioned distributed tables (from child to parent).
TEST:
Added new test 'citus_split_shards_columnar_partitioned' for splitting 'partitioned' and 'columnar + partitioned' table.
Added new test 'shard_move_constraints_blocking' to add coverage for shard move bug fix.
Updated test 'citus_split_shard_by_split_points_negative' to allow columnar and partitioned table.
* Remove if conditions with PG_VERSION_NUM < 13
* Remove server_above_twelve(&eleven) checks from tests
* Fix tests
* Remove pg12 and pg11 alternative test output files
* Remove pg12 specific normalization rules
* Some more if conditions in the code
* Change RemoteCollationIdExpression and some pg12/pg13 comments
* Remove some more normalization rules
When building packages on ubuntu jammy, we started to see some warnings.
autoreconf: warning: autoconf input should be named 'configure.ac', not
'configure.in'
* Blocking split setup
* Add missing type
* Missing API from Metadata Sync
* Shard Split e2e code
* Worker Split Copy DestReceiver skeleton
* Basic destreceiver code
* worker_split_copy UDF
* UDF calling
* Split points are text
* Isolate Tenant and Split Shard Unification
* Fixing executor and misc
* Reindent code
* Fixing UDF definitions
* Hello World Local Copy works
* Remote copy hello world works
* Local and Remote binary test
* Fixing text local copy and adding tests
* Hello World shard split works
* Negative tests
* Blocking Split workflow works
* Refactor
* Bug fix
* Reindent
* Cleaning up and adding comments
* Basic test for shard split workflow
* ReIndent
* Circle CI integration
* Removing include causing circle-ci build failure
* Remove SplitCopyDestReceiver and use PartitionedResultDestReceiver
* Add support for citus.enable_binary_protocol
* Reindent
* Fix build break
* Update Test
* Cleanup on catch
* Addressing open comments
* Update downgrade script and quote schema/table in COPY statement
* Fix metadata sync issue. Update regression test
* Isolation test and bug fix
* Add Isolation test, fix foreign constraint deadlock issue
* Misc code review comments
* Test name needing to be quoted
* Refactor code from review comments
* Explaining shardGroupSplitIntervalListList
* Fix upgrade & downgrade
* Fix broken test
* Test fix Round 2
* Fixing bug and modifying test appropriately
* Fully qualify copy udf name. Run Reindent
* Address PR comments
* Fix null handling when creating AuxiliaryStructures
* Ensure local copy is triggered in tests
* Limit max shards that can be created with split
* Test failure fix
* Remove split_mode and use shard_transfer_mode instead'
* Fix test failure
* Fix test failure
* Fixing permission issue when splitting non-superuser owned tables
* Fix test expected output
* Remove extra space
* Fix test
* attempt to fix test
* Addressing Marco's PR comment
* Only clean shards created by workflow
* Remove from merge
* Update test
Similar to #5897, one more step for running Citus with PG 15.
This PR at least make Citus run with PG 15. I have not tried running the tests with PG 15.
Shmem changes are based on 4f2400cb3f
Compile breaks are mostly due to #6008
This is a continuation of a refactor (with commit sha
2b7cf0c097) that aimed to use Citus helper
UDFs by default in iso tests.
PostgreSQL isolation test infrastructure uses some UDFs to detect
whether concurrent sessions block each other. Citus implements
alternatives to that UDF so that we are able to detect and report
distributed transactions that get blocked on the worker nodes as well.
We needed to explicitly replace PG helper functions with Citus
implementations in each isolation file. Now we replace them by default.
* Support upgrade and downgrade and separate columnar as citus_columnar extension
Co-authored-by: Yanwen Jin <yanwjin@microsoft.com>
Co-authored-by: Jeff Davis <jeff@j-davis.com>
Use Citus helper UDFs by default in iso tests
PostgreSQL isolation test infrastructure uses some UDFs to detect
whether concurrent sessions block each other. Citus implements
alternatives to that UDF so that we are able to detect and report
distributed transactions that get blocked on the worker nodes as well.
We needed to explicitly replace PG helper functions with Citus
implementations in each isolation file. Now we replace them by default.
* Added more regression tests for more vacuum options,
* Fixed deadlock for unqualified vacuum when there is only 1 worker,
* Supported lock_skipped for vacuum.
This PR makes all of the features open source that were previously only
available in Citus Enterprise.
Features that this adds:
1. Non blocking shard moves/shard rebalancer
(`citus.logical_replication_timeout`)
2. Propagation of CREATE/DROP/ALTER ROLE statements
3. Propagation of GRANT statements
4. Propagation of CLUSTER statements
5. Propagation of ALTER DATABASE ... OWNER TO ...
6. Optimization for COPY when loading JSON to avoid double parsing of
the JSON object (`citus.skip_jsonb_validation_in_copy`)
7. Support for row level security
8. Support for `pg_dist_authinfo`, which allows storing different
authentication options for different users, e.g. you can store
passwords or certificates here.
9. Support for `pg_dist_poolinfo`, which allows using connection poolers
in between coordinator and workers
10. Tracking distributed query execution times using
citus_stat_statements (`citus.stat_statements_max`,
`citus.stat_statements_purge_interval`,
`citus.stat_statements_track`). This is disabled by default.
11. Blocking tenant_isolation
12. Support for `sslkey` and `sslcert` in `citus.node_conninfo`
We already have tests relying on citus_finalize_upgrade_to_citus11().
Now, adjust those to rely on citus_finish_citus_upgrade() and
always call citus_finish_citus_upgrade().
The error comes due to the datum jsonb in pg_dist_metadata_node.metadata being 0 in some scenarios. This is likely due to not copying the data when receiving a datum from a tuple and pg deciding to deallocate that memory when the table that the tuple was from is closed.
Also fix another place in the code that might have been susceptible to this issue.
I tested on both multi-vg and multi-1-vg and the test were successful.
altering the distributed table.
To be able to alter view's owner without enforcing sequential mode.
Alter view process functions have been udpated to use metadata
connection.
Do not obtain AccessShareLock before acquiring the distributed locks.
Acquiring an AccessShareLock ensures that the relations which we are trying to get a distributed lock on will not be dropped in the time between when the LOCK command is issued and the LOCK commands are send to the worker. However, this also leads to distributed deadlocks in such scenarios:
```sql
-- for dist lock acquiring order coor, w1, w2
-- on w2
LOCK t1 IN ACCESS EXLUSIVE MODE;
-- acquire AccessShareLock locally on t1 to ensure it is not dropped while we get ready to distribute the lock
-- concurrently on w1
LOCK t1 IN ACCESS EXLUSIVE MODE;
-- acquire AccessShareLock locally on t1 to ensure it is not dropped while we get ready to distribute the lock
-- acquire dist lock on coor, w1, gets blocked on local AccessShareLock on w2
-- on w2 continuation of the execution above
-- starts to acquire dist locks and gets blocked on the coor by the lock acquired by w1
-- distributed deadlock
```
We opt for avoiding such deadlocks with the cost of the possibility of running into errors when the relations on which we are trying to acquire locks on get dropped.
It is often useful to be able to sync the metadata in parallel
across nodes.
Also citus_finalize_upgrade_to_citus11() uses
start_metadata_sync_to_primary_nodes() after this commit.
Note that this commit does not parallelize all pieces of node
activation or metadata syncing. Instead, it tries to parallelize
potenially large parts of metadata, which is the objects and
distributed tables (in general Citus tables).
In the future, it would be nice to sync the reference tables
in parallel across nodes.
Create ~720 distributed tables / ~23450 shards
```SQL
-- declaratively partitioned table
CREATE TABLE github_events_looooooooooooooong_name (
event_id bigint,
event_type text,
event_public boolean,
repo_id bigint,
payload jsonb,
repo jsonb,
actor jsonb,
org jsonb,
created_at timestamp
) PARTITION BY RANGE (created_at);
SELECT create_time_partitions(
table_name := 'github_events_looooooooooooooong_name',
partition_interval := '1 day',
end_at := now() + '24 months'
);
CREATE INDEX ON github_events_looooooooooooooong_name USING btree (event_id, event_type, event_public, repo_id);
SELECT create_distributed_table('github_events_looooooooooooooong_name', 'repo_id');
SET client_min_messages TO ERROR;
```
across 1 node: almost same as expected
```SQL
SELECT start_metadata_sync_to_primary_nodes();
Time: 15664.418 ms (00:15.664)
select start_metadata_sync_to_node(nodename,nodeport) from pg_dist_node;
Time: 14284.069 ms (00:14.284)
```
across 7 nodes: ~3.5x improvement
```SQL
SELECT start_metadata_sync_to_primary_nodes();
┌──────────────────────────────────────┐
│ start_metadata_sync_to_primary_nodes │
├──────────────────────────────────────┤
│ t │
└──────────────────────────────────────┘
(1 row)
Time: 25711.192 ms (00:25.711)
-- across 7 nodes
select start_metadata_sync_to_node(nodename,nodeport) from pg_dist_node;
Time: 82126.075 ms (01:22.126)
```
Move internal storage details to a separate schema with no public
access to limit the possibility for information leakage.
Create views with public access that show storage details for those
columnar tables where the user has ownership privileges. Include
mapping between relation ID and storage ID for easier interpretation.
We remove `<waiting ...>` and `<... completed>` outputs for some CREATE
INDEX CONCURRENTLY commands since they can cause flakiness in some scenarios.
Postgres calls WaitForOlderSnapshots() and this can cause CREATE INDEX
CONCURRENTLY commands for shards to get blocked by each other for brief
periods of time. The extra waits can pop-up, or they can get completed
at different lines in the output files. To remedy that, we rename those
indexes to be captured by the new normalization rule.
* Bug fix for bug #5876. Memset MetadataCacheSystem every time there is an abort
* Created an ObjectAccessHook that saves the transactionlevel of when citus was created and will clear metadatacache if that transaction level is rolled back. Added additional tests to make sure metadatacache is cleared
Columnar: support relation options with ALTER TABLE.
Use ALTER TABLE ... SET/RESET to specify relation options rather than
alter_columnar_table_set() and alter_columnar_table_reset().
Not only is this more ergonomic, but it also allows better integration
because it can be treated like DDL on a regular table. For instance,
citus can use its own ProcessUtility_hook to distribute the new
settings to the shards.
DESCRIPTION: Columnar: support relation options with ALTER TABLE.
With Citus MX enabled, when a reference table is modified, it does
some operations on the first worker node(e.g., acquire locks).
If node metadata is locked (via add node or create restore point),
the changes to the reference tables should be blocked.
In the past (pre-11), we allowed removing worker nodes
that had active placements for replicated distributed
table, without even checking if there are any other
replicas of the same placement.
However, with #5469, we prevent disabling nodes via a hard
error when there is the last active placement of shard, as we
do for reference tables. Note that otherwise, we'd allow
users to lose data.
As of today, the NOTICE is completely irrelevant.
First worker node has a special meaning for modifications on the replicated tables
It is used to acquire a remote lock, such that the modifications are serialized.
With this commit, we make sure that we do not let any distributed query to see a
different 'first worker node' while first worker node is disabled.
Note that, maybe implicitly mentioned above, when first worker node is disabled,
the first worker node changes, that's why we have to handle the situation.
Before this commit, we had:
```SQL
SELECT citus_disable_node(nodename, nodeport, force boolean DEFAULT false)
```
Where, we allow forcing to disable first worker node with
`force:=true`. However, it entails the risk for losing
data / diverging placement data etc.
With `force` flag, we control disabling the first worker node,
and with `async` flag we control whether the changes are done
via bg worker or immediately.
```SQL
SELECT citus_disable_node(nodename, nodeport, force boolean DEFAULT false, sync boolean DEFAULT false)
```
Where we can achieve all the following:
| Mode | Data loss possibility | Can run in 2PC | Handle multiple node failures | Immediately effective |
| --- |--- |--- |--- |--- |
| force:false, sync: false | false | true | true | false |
| force:false, sync: true | false | false | false | true |
| force:true, sync: false | true | true | true | false |
| force:true, sync: true | false | false | false | true |
There are two problems in this area. First, when there are expressions
on the index name, we should call `transformIndexExpression()` before
generating the index name. That is what Postgres does.
Second, because of 40c24bfef9
PG 13 and PG 14 generates different names for indexes with function calls even for local PG tables.
Assume we have:
```SQL
create table t(id int);
select create_distributed_table('t', 'id');
create index ON t (my_very_boring_function(id));
```
On PG 13, the name of the index is `t_expr_idx`
```SQL
\d t
Table "public.t"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ id │ integer │ │ │ │
└────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
"t_expr_idx" btree (my_very_boring_function(id::bigint))
```
On PG 14, the name of the index is `t_my_very_boring_function_idx`
```SQL
\d t
Table "public.t"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ id │ integer │ │ │ │
└────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
"t_my_very_boring_function_idx" btree (my_very_boring_function(id::bigint))
```
The second issue is not very critical. The important part is that
we adjust regression tests to drop all the indexes, which ensures
the index names are sane on any version.
Over time we have added significantly improved the support for objects to be propagated by Citus as to make scaling out the database more seamless. It became evident that there was a lot of code duplication that got into the codebase to implement the propagation.
This PR tries to reduce the amount of repeated code that is at most only slightly different. To make things worse, most of the differences were actually oversights instead of correct.
This Patch introduces 3 reusable sets of pre/post processing steps for respectively
- create
- alter
- drop
With the use of the common functionality we should have more coherent behaviour between different supported object by Citus.
Some steps either omit the Pre or Post processing step if they would not make sense to include.
All tests pass, only 1 test needed changing, foreign servers, as the dropping of foreign servers didn't implement support for dropping multiple foreign servers at once. Given the common approach correctly supports dropping of multiple objects, either distributed or not, the test that assumed it wouldn't work was now obsolete.
We have a mechanism which ensures that newly distributed
objects are recorded during `alter extension citus update`.
However, the logic was lacking "view"s. With this commit, we make
sure that existing views are also marked as distributed during
upgrade.
We are nearing the 100 objects being propagated in `master_copy_shard_placement` and with the extra supported objects this gets pushed over a 100 objects.
When a 100 objects are reached for propagation a notice will be shown to the user, informing them it might take a while to finish the operation.
During testing this is not important to see. Since the message contains the exact number of objects to be propagated the tests becomes very unstable when merging community into enterprsie.
This change makes that the test output stays stable.
Adds support for propagation ALTER VIEW commands to
- Change owner of view
- SET/RESET option
- Rename view and view's column name
- Change schema of the view
Since PG also supports targeting views with ALTER TABLE
commands, related code also added to direct such ALTER TABLE
commands to ALTER VIEW commands while sending them to workers.
Breaking down #5899 into smaller PR-s
This particular PR changes the way TRUNCATE acquires distributed locks on the relations it is truncating to use the LOCK command instead of lock_relation_if_exists. This has the benefit of using pg's recursive locking logic it implements for the LOCK command instead of us having to resolve relation dependencies and lock them explicitly. While this does not directly affect truncate, it will allow us to generalize this locking logic to then log different relations where the pg recursive locking will become useful (e.g. locking views).
This implementation is a bit more complex that it needs to be due to pg not supporting locking foreign tables. We can however, still lock foreign tables with lock_relation_if_exists. So for a command:
TRUNCATE dist_table_1, dist_table_2, foreign_table_1, foreign_table_2, dist_table_3;
We generate and send the following command to all the workers in metadata:
```sql
SEL citus.enable_ddl_propagation TO FALSE;
LOCK dist_table_1, dist_table_2 IN ACCESS EXCLUSIVE MODE;
SELECT lock_relation_if_exists('foreign_table_1', 'ACCESS EXCLUSIVE');
SELECT lock_relation_if_exists('foreign_table_2', 'ACCESS EXCLUSIVE');
LOCK dist_table_3 IN ACCESS EXCLUSIVE MODE;
SEL citus.enable_ddl_propagation TO TRUE;
```
Note that we need to alternate between the lock command and lock_table_if_exists in order to preserve the TRUNCATE order of relations.
When pg supports locking foreign tables, we will be able to massive simplify this logic and send a single LOCK command.
Adds support for propagating create/drop view commands and views to
worker node while scaling out the cluster. Since views are dropped while
converting the table type, metadata connection will be used while
propagating view commands to not switch to sequential mode.
First, it is not needed. Second, in the past we had issues regarding
this: https://github.com/citusdata/citus/pull/4344
When I create 10k tables, ~120K shards, this saves
40Mb of memory during ALTER EXTENSION citus UPDATE.
Before the change: MetadataCacheMemoryContext: 41943040 ~ 40MB
After the change: MetadataCacheMemoryContext: 8192
Here is a flaky test output that is quite hard to fix:
```diff
diff -dU10 -w /home/circleci/project/src/test/regress/expected/isolation_master_update_node_1.out /home/circleci/project/src/test/regress/results/isolation_master_update_node.out
--- /home/circleci/project/src/test/regress/expected/isolation_master_update_node_1.out.modified 2022-03-21 19:03:54.237042562 +0000
+++ /home/circleci/project/src/test/regress/results/isolation_master_update_node.out.modified 2022-03-21 19:03:54.257043084 +0000
@@ -49,18 +49,20 @@
<waiting ...>
step s2-update-node-1-force: <... completed>
master_update_node
------------------
(1 row)
step s2-abort: ABORT;
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
-SSL connection has been closed unexpectedly
+server closed the connection unexpectedly
+ This probably means the server terminated abnormally
+ before or while processing the request.
```
I could not come up with a solution that would decrease the flakiness in the test outputs. We already have 3 output files for the same test and now I introduced a 4th one.
I can also add complex regular expressions that span multiple lines, and normalize these error messages. Feel free to suggest a normalized error message in a comment here.
## Current alternative file contents
`isolation_master_update_node.out`
```
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
SSL connection has been closed unexpectedly
```
`isolation_master_update_node_0.out`
```
step s1-abort: ABORT;
WARNING: this step had a leftover error message
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```
`isolation_master_update_node_1.out`
```
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
SSL connection has been closed unexpectedly
```
new file: `isolation_master_update_node_2.out`
```
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```
In the past, for all modifications on the local execution,
we enabled 2PC (with 6a7ed7b309).
This also required us to enable coordinated transactions
via https://github.com/citusdata/citus/pull/4831 .
However, it does have a very substantial impact on the
distributed deadlock detection. The distributed deadlock
detection is designed to avoid single-statement transactions
because they cannot lead to any actual deadlocks.
The implementation is to skip backends without distributed
transactions are assigned. Now that we assign single
statement local executions in the lock graphs, we are
conflicting with the design of distributed deadlock
detection.
In general, we should fix it. However, one might
think that it is not a big deal, even if the processes
show up in the lock graphs, the deadlock detection
should not be causing any false positives. That is
false, unless https://github.com/citusdata/citus/issues/1803
is fixed. Now that local processes are considered as a single
distributed backend, the lock graphs might find:
local execution 1 [tx id: 1] -> any local process [tx id: 0]
any local process [tx id: 0] -> local execution 2 [tx id: 2]
And, decides that there is a distributed deadlock.
This commit is:
(a) right thing to do, as local execuion should not need any
distributed tx id
(b) Eliminates performance issues that might come up with
deadlock detection does a lot of unncessary checks
(c) After moving local execution after the remote execution
via https://github.com/citusdata/citus/pull/4301, the
vauge requirement for assigning distributed tx ids are
already gone.
For some reason search_path is not always set correctly on the worker
when calling a distributed function, this shows up when calling
`insert_document` in our distributed_triggers test. The underlying
reason is currently unknown and warrants deeper investigation.
Currently this test is one of the main causes for random CI failures. So
this change sets the search_path of each function explicitly, to reduce
these failures. So other devs can be more efficient, while I continue
investigating the root cause of this issue.
Also changes explicit `SET citus.enable_unsafe_triggers = false` to
`RESET citus.enable_unsafe_triggers` in passing.
* Separate build of citus.so and citus_columnar.so.
Because columnar code is statically-linked to both modules, it doesn't
make sense to load them both at once.
A subsequent commit will make the modules entirely separate and allow
loading them both simultaneously.
Author: Yanwen Jin
* Separate citus and citus_columnar modules.
Now the modules are independent. Columnar can be loaded by itself, or
along with citus.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
The aim of hiding shards is to hide shards from client applications.
Certain bg workers (such as pg_cron or Citus maintanince daemon)
should be treated like client applications because users can run
queries from such bg workers. And, these bg workers should follow
the similar application_name checks as client backeends.
Certain other bg workers, such as logical replication or postgres'
parallel workers, should never hide shards. They are internal
operations.
Similarly the other backend types like the walsender or
checkpointer or autovacuum should never hide shards.
We've had custom versions of Postgres its `foreach` macro which with a
hidden ListCell for quite some time now. People like these custom
macros, because they are easier to use and require less boilerplate.
This adds similar custom versions of Postgres its `forboth` macro. Now
you don't need ListCells anymore when looping over two lists at the same
time.
Since now we don't throw an error for enums that user attempts creating
in temp schema, the preprocess / DDL job that contains the prepared
statement (to idempotently create the enum type) gets executed. As a
result, we were emitting the following warning because of the error the
underlying worker connection throws:
```sql
WARNING: cannot PREPARE a transaction that has operated on temporary objects
CONTEXT: while executing command on localhost:xxxxx
WARNING: connection to the remote node localhost:xxxxx failed with the following error: another command is already in progress
ERROR: cannot PREPARE a transaction that has operated on temporary objects
CONTEXT: while executing command on localhost:xxxxx
```
We were already doing so for functions & types believing that
this cannot be the case for other object types.
However, as in #5830, we cannot distribute an object that user
attempts creating in temp schema. Even more, this doesn't only
apply to functions and types but also to many other object types.
So with this commit, we teach preprocess/postprocess functions
(that need to create dependencies on worker nodes) how to skip
trying to distribute such objects.
We also start identifying temp schemas as the objects that we
don't know how to propagate to worker nodes so that we can
simply create objects locally if user attempts creating them
in a temp schema.
There are 36 callers of `EnsureDependenciesExistOnAllNodes` in
the codebase atm and for the most we still need to throw a hard
error (i.e.: not use `DeferErrorIfHasUnsupportedDependency`
beforehand), such as:
i) user explicitly wants to create a distributed object
* CreateCitusLocalTable
* CreateDistributedTable
* master_create_worker_shards
* master_create_empty_shard
* create_distributed_function
* EnsureExtensionFunctionCanBeDistributed
ii) we don't want to skip altering distributed table on worker nodes
* PostprocessIndexStmt
* PostprocessCreateTriggerStmt
* PostprocessCreateStatisticsStmt
iii) object is already distributed / handled by Citus before, so we
aren't okay with not propagating the ALTER command
* PostprocessAlterTableSchemaStmt
* PostprocessAlterCollationOwnerStmt
* PostprocessAlterCollationSchemaStmt
* PostprocessAlterDatabaseOwnerStmt
* PostprocessAlterExtensionSchemaStmt
* PostprocessAlterFunctionOwnerStmt
* PostprocessAlterFunctionSchemaStmt
* PostprocessAlterSequenceOwnerStmt
* PostprocessAlterSequenceSchemaStmt
* PostprocessAlterStatisticsSchemaStmt
* PostprocessAlterStatisticsOwnerStmt
* PostprocessAlterTextSearchConfigurationSchemaStmt
* PostprocessAlterTextSearchDictionarySchemaStmt
* PostprocessAlterTextSearchConfigurationOwnerStmt
* PostprocessAlterTextSearchDictionaryOwnerStmt
* PostprocessAlterTypeSchemaStmt
* PostprocessAlterForeignServerOwnerStmt
iv) we already cannot create those objects in temp schemas, so skipping
for now
* PostprocessCreateExtensionStmt
* PostprocessCreateForeignServerStmt
Also note that there are 3 more callers of
`EnsureDependenciesExistOnAllNodes` in enterprise in addition to those
36 but we don't need to do anything specific about them due to the same
reasoning given in iii).
In `pg_regress_multi.pl` we're running `initdb` with some options that
the `common.py` `initdb` is currently not using. All these flags seem
reasonable, so this brings `common.py` in line with
`pg_regress_multi.pl`.
In passing change the `--nosync` flag to `--no-sync`, since that's what
the PG documentation lists as the official option name (but both work).
Cluster setup time is significant in arbitrary configs. We can
parallelize this a bit more.
Runtime of the following command decreases from ~25 seconds to ~22
seconds on my machine with this change:
```
make -C src/test/regress/ check-arbitrary-base CONFIGS=CitusDefaultClusterConfig EXTRA_TESTS=prepared_statements_1
```
Currently we can only run different configs in parallel. However, when working on a feature or trying to fix a bug this is not important. In those cases you simply want to run a single test file on a single config. And you want to run that every time you made a change to the code that you think fixes the issue.
This PR allows parallelising running of bash commands. So `initdb` and `pg_ctl start` is run in parallel for all nodes in the cluster. Instead of one waiting for the other.
When you run the above command nothing is being run in parallel.
After this PR, cluster setup is being run in parallel.
We have fsync enabled for regular tests already in `pg_regress_multi.pl`.
This does the same for the arbitrary config tests.
On my machine this changes the runtime from the following command from
~37 to ~25 seconds:
```bash
make -C src/test/regress/ check-arbitrary-configs CONFIGS=CitusDefaultClusterConfig
```
Here is a list of some functions, and the `TargetWorkerSet` parameters
they supply to `NodeDDLTaskList`:
PostprocessCreateTextSearchConfigurationStmt -
NON_COORDINATOR_NODES
PreprocessDropTextSearchConfigurationStmt -
NON_COORDINATOR_METADATA_NODES
PreprocessAlterTextSearchConfigurationSchemaStmt -
NON_COORDINATOR_METADATA_NODES
I guess this means that, if metadata
syncing is disabled on the node, we may have some issues. Consider the
following:
Let's assume the user has metadata syncing disabled. 2 workers.
`CREATE TEXT SEARCH CONFIGURATION ...` will get propagated to all
workers. `ALTER ... CONFIGURATION ...` will not get propagated to
workers.
After adding a new non-metadata node, the new node will get the altered
configuration as it reads from catalog. At this point CONFIGURATION
definitions got diverged in the cluster.
I suggest that we always use `NON_COORDINATOR_METADATA_NODES` in all the
TEXT SEARCH operations here.
Before this commit, we erroneously converted the sequence
type to the column's type it is used. However, it is possible
that the sequence is used in an expression which then converted
to a type that cannot be a sequence, such as text.
With this commit, we only try this conversion if the column
type is a supported sequence type (e.g., smallint, int and bigint).
Note that we do this conversion because if the column type is a
bigint and the sequence is NOT a bigint, users would be in trouble
because sequences would generate values that are out of the range
of the column. (The other ways are already not supported such as
the column is int and the sequence is bigint would fail on the worker.)
In other words, with this commit, we scope this optimization only
when the target column type is a supported sequence type. Otherwise,
we let users to more freely use the sequences.
With the introduction of #4385 we inadvertently started allowing and
pushing down certain lateral subqueries that were unsafe to push down.
To be precise the type of LATERAL subqueries that is unsafe to push down
has all of the following properties:
1. The lateral subquery contains some non recurring tuples
2. The lateral subquery references a recurring tuple from
outside of the subquery (recurringRelids)
3. The lateral subquery requires a merge step (e.g. a LIMIT)
4. The reference to the recurring tuple should be something else than an
equality check on the distribution column, e.g. equality on a non
distribution column.
Property number four is considered both hard to detect and probably not
used very often. Thus this PR ignores property number four and causes
query planning to error out if the first three properties hold.
Fixes#5327
TEXT SEARCH DICTIONARY objects depend on TEXT SEARCH TEMPLATE objects.
Since we do not yet support distributed TS TEMPLATE objects, we skip
dependency checks for text search templates, similar to what we do for
roles.
The user is expected to manually create the TEXT SEARCH TEMPLATE objects
before a) adding new nodes, b) creating TEXT SEARCH DICTIONARY objects.
If a worker node is being added, a command is sent to get the server_id of the worker from the pg_dist_node_metadata table. If the worker's id is the same as the node executing the code, we will know the node is trying to add itself. If the node tries to add itself without specifying `groupid:=0` the operation will result in an error.
Using CASCADE in a DELETE can inadvertently delete things we don't
intend to. It's safer to fail hard and make the user delete depending
things manually.
1) Remove useless columns
2) Show backends that are blocked on a DDL even before
gpid is assigned
3) One minor bugfix, where we clear distributedCommandOriginator
properly.
DESCRIPTION: Move pg_dist_object to pg_catalog
Historically `pg_dist_object` had been created in the `citus` schema as an experiment to understand if we could move our catalog tables to a branded schema. We quickly realised that this interfered with the UX on our managed services and other environments, where users connected via a user with the name of `citus`.
By default postgres put the username on the search_path. To be able to read the catalog in the `citus` schema we would need to grant access permissions to the schema. This caused newly created objects like tables etc, to default to this schema for creation. This failed due to the write permissions to that schema.
With this change we move the `pg_dist_object` catalog table to the `pg_catalog` schema, where our other schema's are also located. This makes the catalog table visible and readable by any user, like our other catalog tables, for debugging purposes.
Note: due to the change of schema, we had to disable 1 test that was running into a discrepancy between the schema and binary. Secondly, we needed to make the lookup functions for the `pg_dist_object` relation and their indexes less strict on the fallback of the naming due to an other test that, due to an unfortunate cache invalidation, needed to lookup the relation again. This makes that we won't default to _only_ resolving from `pg_catalog` outside of upgrades.
* Notice when create_distributed_function called without params
* Move variable comments to top
* Add valid check for cache entry
* add objtype to notice msg
* update test outputs
* Add more tests
* Address feedback
And also citus_calculate_gpid(nodeId,pid). These UDFs are just
wrappers for the existing functions. Useful for testing and simple
manipulation of citus_stat_activity.
It seems like our approach is way too restrictive and some places
are wrong. Now, we follow very similar approach to pg_stat_activity.
Some of the changes are pre-requsite for implementing citus_dist_stat_activity
via citus_stat_activity.
Clusters created pre-Citus 11 mostly didn't have metadata sync enabled.
For those clusters, we add a utility UDF which fixes some minor issues
and sync the necessary objects to the workers.
* [Columnar] Build columnar.so and let citus depends on it
Co-authored-by: Yanwen Jin <yanwjin@microsoft.com>
Co-authored-by: Ying Xu <32597660+yxu2162@users.noreply.github.com>
Co-authored-by: jeff-davis <Jeffrey.Davis@microsoft.com>
DESCRIPTION: Add GUC to control ddl creation behaviour in transactions
Historically we would _not_ propagate objects when we are in a transaction block. Creation of distributed tables would not always work in sequential mode, hence objects created in the same transaction as distributing a table that would use the just created object wouldn't work. The benefit was that the user could still benefit from parallelism.
Now that the creation of distributed tables is supported in sequential mode it would make sense for users to force transactional consistency of ddl commands for distributed tables. A transaction could switch more aggressively to sequential mode when creating new objects in a transaction.
We don't change the default behaviour just yet.
Also, many objects would not even propagate their creation when the transaction was already set to sequential, leaving the probability of a self deadlock. The new policy checks solve this discrepancy between objects as well.
The issue in question is caused when rebalance / replication call `FullShardPlacementList` which returns all shard placements (including those in disabled nodes with `citus_disable_node`). Eventually, `FindFillStateForPlacement` looks for the state across active workers and fails to find a state for the placements which are in the disabled workers causing a seg fault shortly after.
Approach:
* `ActivePlacementHash` was not using the status of the shard placement's node to determine if the node it is active. Initially, I just fixed that.
* Additionally, I refactored the code which handles active shards in replication / rebalance to:
* use a single function to determine if a shard placement is active.
* do the shard active shard filtering before calling `RebalancePlacementUpdates` and `ReplicationPlacementUpdates`, so test methods like `shard_placement_rebalance_array` and `shard_placement_replication_array` which have different shard placement active requirements can do their own filtering while using the same rebalance / replicate logic that `rebalance_table_shards` and `replicate_table_shards` use.
Fix#5664
CitusInitiatedBackend was a pre-mature implemenation of the whole
GlobalPID infrastructure. We used it to track whether any individual
query is triggered by Citus or not.
As of now, after GlobalPID is already in place, we don't need
CitusInitiatedBackend, in fact it could even be wrong.
#5685 introduced the resolution of dependencies for indices. This missed support for indices on partitioned tables. This change adds support for partitioned indices to the dependency resolution code.
It turns out `whereis` is incredibly slow on WSL2 (at least on my
machine):
```
$ time whereis diff
diff: /usr/bin/diff /usr/share/man/man1/diff.1.gz
real 0m0.408s
user 0m0.010s
sys 0m0.101s
```
This command is run by our custom `diff` script, which is run for every
test file that is run. So this adds lots of unnecessary runtime time to
tests.
This changes our custom `diff` script to only call `whereis` in the
strange case that `/usr/bin/diff` does not exist.
The impact of this small change on the total runtime of the tests on WSL
is huge. As an example the following command takes 18 seconds without
this change and 7 seconds with it:
```
make -C src/test/regress/ check-arbitrary-configs CONFIGS=PostgresConfig
```
(cherry picked from commit 4e93afd1f78854e1aaab63690c441b0b0598a82c)
(cherry picked from commit 0295fe2f5b)
(cherry picked from commit 878510725fab9cb6870b4504e0b1f055d7bbc68d)
Before this commit, dumping wait edges can only be used for
distributed deadlock detection purposes. With this commit,
we open the possibility that we can use it for any backend.
CREATE FUNCTION command together with it's dependencies.
If the function depends on any nondistributable object,
function will be created only locally. Parameterless
version of create_distributed_function becomes obsolete
with this change, it will deprecated from the code with a subsequent PR.
* When a worker tried to create a collation which had a dependency in the same worker node,
it would cause a deadlock, now it throws the correct "not a coordinator" error.
DESCRIPTION: Implement TEXT SEARCH CONFIGURATION propagation
The change adds support to Citus for propagating TEXT SEARCH CONFIGURATION objects. TSConfig objects cannot always be created in one create statement, and instead require a create statement followed by many alter statements to get turned into the object they should represent.
To support this we add functionality to the worker to create or replace objects based on a list of statements. When the lists of the local object and the remote object correspond 1:1 we skip the creation of the object and simply mark it distributed. This is especially important for TSConfig objects as initdb pre-populates databases with a dozen configurations (for many different languages).
When the user creates a new TSConfig based on the copy of an existing configuration there is no direct link to the object copied from. Since there is no link we can't simply rely on propagating the dependencies to the worker and send a qualified
We check for metadata consistency across the cluster in the test
isolation_metadata_sync_vs_all. However, some earlier tests in
enterprise repo leave invalid pg_dist_node entries in the worker nodes
that have Oid values for already dropped role objects.
To remedy that, I suggest that we move the test to earlier in the
schedule, thereby making the tests pass for the time being. We should
later introduce metadata checking either in a new isolation test or by
moving this test later in the schedule. However, we should do that after
we fix the underlying issue.
The low-level StoreAllActiveTransactions() function filters out
backends that exited.
Before this commit, if you run a pgbench, after that you'd still
see the backends show up:
```SQL
select count(*) from get_global_active_transactions();
┌───────┐
│ count │
├───────┤
│ 538 │
└───────┘
```
After this patch, only active backends show-up:
```SQL
select count(*) from get_global_active_transactions();
┌───────┐
│ count │
├───────┤
│ 72 │
└───────┘
```
DESCRIPTION: Prevent Citus table functions from being called on shards
The operations that guard against using shards are:
* Create Local Table
* Create distributed table (which affects reference table creation as well).
* I used a `ErrorIfRaltionIsKnownShard` instead of `ErrorIfIllegallyChangingKnownShard`.
`ErrorIfIllegallyChangingKnownShard` allows the operation if `citus.enable_manual_changes_to_shards`,
but I am not sure if it ever makes sense to create a distributed, reference, or citus local table out of a shard.
I tried to go over the code to identify other UDF-s where shards could be illegaly changed, but I could not find any other.
My knowledge of the codebase is not solid enough for me to say for sure.
Fixes#5610
This commit introduces several test cases for concurrent operations that
change metadata, and a concurrent metadata sync operation.
The overall structure is as follows:
- Session#1 starts metadata syncing in a transaction block
- Session#2 does an operation that change metadata
- Both sessions are committed
- Another session checks whether the metadata are the same accross all
nodes in the cluster.
* Break the dependency to CitusInitiatedBackend infrastructure
With this change, we start to show non-distributed backends as well
in citus_dist_stat_activity. I think that
(a) it is essential for making citus_lock_waits to work for blocked
on DDL commands.
(b) it is more expected from the user's perspective. The name of
the view is a little inconsistent now (e.g., citus_dist_stat_activity)
but we are already planning to improve the names with followup
PRs.
Also, we have global pids assigned, the CitusInitiatedBackend
becomes obsolete.
With https://github.com/citusdata/citus/pull/5657, Citus uses
a fixed application_name while connecting to remote nodes
for internal purposes.
It means that we cannot allow users to override it via
citus.node_conninfo.
Implement #5649
Allow create_distributed_function() on functions owned by extensions
1) Only update pg_dist_object, and do not propagate CREATE FUNCTION.
2) Ensure corresponding extension is in pg_dist_object.
3) Verify if dependencies exist on the function they should resolve to the extension.
4) Impact on node-scaling: We build a list of ddl commands based on all objects in
pg_dist_object. We need to omit the ddl's for the extension-function, as it
will get propagated by the virtue of the extension creation.
5) Extra checks for functions coming from extensions, to not propagate changes
via ddl commands, even though the function is marked as distributed in pg_dist_object
If the expression is simple, such as, SELECT function() or PEFORM function()
in PL/PgSQL code, PL engine does a simple expression evaluation which can't
interpret the Citus CustomScan Node. Code checks for simple expressions when
executing an UDF but missed the DO-Block scenario, this commit fixes it.
Removed dependency for EnsureTableOwner. Also removed pg_fini() and columnar_tableam_finish() Still need to remove CheckCitusVersion dependency to make Columnar_tableam.h dependency free from Citus.
Previously, we were wrapping targetlist nodes with Vars that reference
to the result of the worker query, if the node itself is not `Const` or
not a `Param`. Indeed, we should not do that unless the node itself is
a `Var` node or contains a `Var` within it (e.g.: `OpExpr(Var(column_a) > 2)`).
Otherwise, when worker query returns empty result set, then combine
query exec would crash since the `Var` would be pointing to an empty
tuple slot, which is not desirable for the node-executor methods.
Replaces citus.enable_object_propagation with citus.enable_metadata_sync
Also, within Citus 11 release cycle, we added citus.enable_metadata_sync_by_default,
that is also replaced with citus.enable_metadata_sync.
In essence, when citus.enable_metadata_sync is set to true, all the objects
and the metadata is send to the remote node.
We strongly advice that the users never changes the value of
this GUC.
With this commit, rebalancer backends are identified by application_name = citus_rebalancer
and the regular internal backends are identified by application_name = citus_internal
With this commit we've started to propagate sequences and shell
tables within the object dependency resolution. So, ensuring any
dependencies for any object will consider shell tables and sequences
as well. Separate logics for both shell tables and sequences have
been removed.
Since both shell tables and sequences logic were implemented as a
part of the metadata handling before that logic, we were propagating
them while syncing table metadata. With this commit we've divided
metadata (which means anything except shards thereafter) syncing
logic into multiple parts and implemented it either as a part of
ActivateNode. You can check the functions called in ActivateNode
to check definition of different metadata.
Definitions of start_metadata_sync_to_node and citus_activate_node
have also been updated. citus_activate_node will basically create
an active node with all metadata and reference table shards.
start_metadata_sync_to_node will be same with citus_activate_node
except replicating reference tables. stop_metadata_sync_to_node
will remove all the metadata. All of those UDFs need to be called
by superuser.
When creating a new table, we bypass the buffer cache and write the
initial pages directly with smgrwrite(). However, you're supposed to
use smgrextend() when extending a relation, rather than smgrwrite().
There isn't much difference between them, but smgrextend() updates the
relation size cache, which seems important, although I haven't seen
any real bugs caused by that.
Also, write the block to disk only after WAL-logging it, so that we
can include the LSN of the WAL record in the version that we write
out. Currently, the page as written to disk has LSN 0. That doesn't
cause any user-visible issues either, at worst it could make us
WAL-log a full page image of the page earlier than necessary, but that
doesn't matter currently because we WAL-log full page images of all
changes anyway.
I bumped into that issue with LSN 0 in the page header when testing
Citus with Zenith (https://github.com/zenithdb/zenith/issues/1176).
Zenith contains a check that PANICs if you write a block to disk
without WAL-logging it, and it works by checking the LSN of the page
that's written out. In this case, we are WAL-logging the page even
though the LSN on the page is 0, so it was a false alarm, but I'd love
to get this changed in Citus to keep the check in Zenith simple.
A downside of WAL-logging the page first is that if you run out of
disk space, you have already created the WAL record. So if you then
crash and restart, WAL recovery will likely run out of disk space,
too, which is bad. In practice, we have the same problem in other
places, like rewriteheap.c. Also, if you are on the brink of running
out of disk space, you will probably run out at WAL replay anyway,
regardless of which order we write these few pages. But if we wanted
to fix that, we could first extend the relation with zeros, and then
WAL-log the pages. That's how heap extension works.
It would be even nicer to use the buffer cache for this, and skip the
smgrimmedsync() on the relation. However, that would require more
work, because we don't have the Relation struct for the relation here.
We could use ReadBufferWithoutRelcache(), but that doesn't work for
unlogged tables. Unlogged tables are currently not supported
(https://github.com/citusdata/citus/issues/4742), but that would
become a problem if we want to support them in the future.
CreateFakeRelcacheEntry() also doesn't work with unlogged tables. We
could do things differently for logged and unlogged tables, but that
complicates the code further.
Co-authored-by: jeff-davis <Jeffrey.Davis@microsoft.com>
Citus heavily relies on application_name, see
`IsCitusInitiatedRemoteBackend()`.
But if the user set the application name, such as export PGAPPNAME=test_name,
Citus uses that name while connecting to the remote node.
With this commit, we ensure that Citus always connects with
the "citus" user name to the remote nodes.
With https://github.com/citusdata/citus/pull/2780, we allow
COPY to use any number of connections that the executor used
in a tx block.
Meaning that, while COPYing data to the shards, create_distributed_table
could allow sequential mode.
We fall back to local execution if we cannot establish any more
connections to local node. However, we should not do that for the
commands that we don't know how to execute locally (or we know we
shouldn't execute locally). To fix that, we take localExecutionSupported
take into account in CanFailoverPlacementExecutionToLocalExecution too.
Moreover, we also prompt a more accurate hint message to inform user
about whether the execution is failed because local execution is
disabled by them, or because local execution wasn't possible for given
command.
multi_log_hook() hook is called by EmitErrorReport() when emitting the
ereport either to frontend or to the server logs. And some callers of
EmitErrorReport() (e.g.: errfinish()) seems to assume that string fields
of given ErrorData object needs to be freed. For this reason, we copy the
message into heap here.
I don't think we have faced with such a problem before but it seems worth
fixing as it is theoretically possible due to the reasoning above.
BEGIN/COMMIT transaction block or in a UDF calling another UDF.
(2) Prohibit/Limit the delegated function not to do a 2PC (or any work on a
remote connection).
(3) Have a safety net to ensure the (2) i.e. we should block the connections
from the delegated procedure or make sure that no 2PC happens on the node.
(4) Such delegated functions are restricted to use only the distributed argument
value.
Note: To limit the scope of the project we are considering only Functions(not
procedures) for the initial work.
DESCRIPTION: Introduce a new flag "force_delegation" in create_distributed_function(),
which will allow a function to be delegated in an explicit transaction block.
Fixes#3265
Once the function is delegated to the worker, on that node during the planning
distributed_planner()
TryToDelegateFunctionCall()
CheckDelegatedFunctionExecution()
EnableInForceDelegatedFuncExecution()
Save the distribution argument (Constant)
ExecutorStart()
CitusBeginScan()
IsShardKeyValueAllowed()
Ensure to not use non-distribution argument.
ExecutorRun()
AdaptiveExecutor()
StartDistributedExecution()
EnsureNoRemoteExecutionFromWorkers()
Ensure all the shards are local to the node in the remoteTaskList.
NonPushableInsertSelectExecScan()
InitializeCopyShardState()
EnsureNoRemoteExecutionFromWorkers()
Ensure all the shards are local to the node in the placementList.
This also fixes a minor issue: Properly handle expressions+parameters in distribution arguments
* Removed distributed dependency in columnar_metadata.c
* Changed columnar_debug.c so that it no longer needed distributed/tuplestore and made it return a record instead of a tuplestore
* removed distributed/commands.h dependency
* Made columnar_tableam.c dependency-free
* Fixed spacing for columnar_store_memory_stats function
* indentation fix
* fixed test failures
* Require superuser while activating a node
With this change, we require ActiveNode() (hence citus_add_node(),
citus_activate_node()) explicitly require for a superuser.
Before this commit, these functions were designed to work with
non-superuser roles with the relevent GRANTs given.
However, that is not a widely used way for calling the functions
above.
Due to possibility of non-super user calling the UDFs, they were
designed in a way that some commands were using some additional
short-lived superuser connections. That is:
(a) breaking transactional behavior (e.g., ROLLBACK
wouldn't fully rollback the whole transaction)
(b) Making it very complicated to reason about which
parts of the node activation goes over which connections,
and becoming vulnerable to deadlocks / visibility issues.
In addition to starting a new transaction, we also need to tell other
backends --including the ones spawned for connections opened to
localhost to build indexes on shards of this relation-- that concurrent
index builds can safely ignore us.
Normally, DefineIndex() only does that if index doesn't have any
predicates (i.e.: where clause) and no index expressions at all.
However, now that we already called standard process utility, index
build on the shell table is finished anyway.
The reason behind doing so is that we cannot guarantee not grabbing any
snapshots via adaptive executor, and the backends creating indexes on
local shards (if any) might block on waiting for current xact of the
current backend to finish, which would cause self deadlocks that are not
detectable.
With https://github.com/citusdata/citus/pull/5493 we introduced
metadata specific connections.
With this connection we guarantee that there is a single metadata connection.
But note that this connection can be used for any other operation.
In other words, this connection is not only reserved for metadata
operations.
However, as https://github.com/citusdata/citus-enterprise/issues/715 showed
us that the logic has a flaw. We allowed ineligible connections to be
picked as metadata connections: such as exclusively claimed connections
or not fully initialized connections.
With this commit, we make sure that we only consider eligable connections
for metadata operations.
We prefer the background daemon to only sync node metadata. That's
why we move placement metadata changes from disable node to
activate node. With that, we can make sure that disable node
only changes node metadata, whereas activate node syncs all
the metadata changes. In essence, we already expect all
nodes to be up when a node is activated. So, this does not change
the behavior much.
Dropping sequences means we need to recreate
and hence losing the sequence.
With this commit, we keep the existing sequences
such that resyncing wouldn't drop the sequence.
We do that by breaking the dependency of the sequence
from the table.
Split distributed/version_compat.h into dependency-free
pg_version_compat.h, and the original which still has
dependencies. The original doesn't have much purpose, but until other
files have better discipline about including the correct header files,
then it's still needed.
Also make distributed/listutils.h dependency-free. Should be moved
outside of 'distributed' subdirectory, but that will cause significant
code churn, so leave for another cleanup patch.
Now both files can be included in columnar without creating a
dependency on citus.
Previously, we cheated by using the RM_GENERIC_ID record type, but not
actually using the generic WAL API. This worked because we always took
a full page image, and saved the extra work of allocating and copying
to a temporary page.
But it introduced complexity, and perhaps fragility, so better to just
use the API properly. The performance penalty for a serial data load
seems to be less than 1%.
Before this commit, Citus was triggering metadata syncing
in the background when a function is distributed. However,
with Citus 11, we expect all clusters to have metadata synced
enabled. So, we do not expect any nodes not to have the metadata.
This change:
(a) pro: simplifies the code and opens up possibilities
to simplify futher by reducing the scope of
bg worker to only sync node metadata
(b) pro: explicitly asks users to sync the metadata such that
any unforseen impact can be easily detected
(c) con: For distributed functions without distribution
argument, we do not necessarily require the metadata
sycned. However, for completeness and simplicity, we
do so.
With Citus 11, the default behavior is to sync the metadata.
However, partitioned tables created pre-Citus 11 might have
index names that are not compatiable with metadata syncing.
See https://github.com/citusdata/citus/issues/4962 for the
details.
With this commit, we record the existence of partitioned tables
such that we can fix it later if any exists.
With this commit, fix_partition_shard_index_names()
works significantly faster.
For example,
32 shards, 365 partitions, 5 indexes drop from ~120 seconds to ~44 seconds
32 shards, 1095 partitions, 5 indexes drop from ~600 seconds to ~265 seconds
`queryStringList` can be really long, because it may contain #partitions * #indexes entries.
Before this change, we were actually going through the executor where each command
in the query string triggers 1 round trip per entry in queryStringList.
The aim of this commit is to avoid the round-trips by creating a single query string.
I first simply tried sending `q1;q2;..;qn` . However, the executor is designed to
handle `q1;q2;..;qn` type of query executions via the infrastructure mentioned
above (e.g., by tracking the query indexes in the list and doing 1 statement
per round trip).
One another option could have been to change the executor such that only track
the query index when `queryStringList` is provided not with queryString
including multiple `;`s . That is (a) more work (b) could cause weird edge
cases with failure handling (c) felt like coding a special case in to the executor
(cherry picked from commit 90928cfd74)
Fix function signature generation
Fix comment typo
Add test for worker_create_or_replace_object
Add test for recreating distributed functions with OUT/TABLE params
Add test for recreating distributed function that returns setof int
Fix test output
Fix comment
Simply applies
```SQL
SELECT textlike(command, citus.grep_remote_commands)
```
And, if returns true, the command is logged. Else, the log is ignored.
When citus.grep_remote_commands is empty string, all commands are
logged.
This UDF coordinates connectivity checks accross the whole cluster.
This UDF gets the list of active readable nodes in the cluster, and
coordinates all connectivity checks in sequential order.
The algorithm is:
for sourceNode in activeReadableWorkerList:
c = connectToNode(sourceNode)
for targetNode in activeReadableWorkerList:
result = c.execute(
"SELECT citus_check_connection_to_node(targetNode.name,
targetNode.port")
emit sourceNode.name,
sourceNode.port,
targetNode.name,
targetNode.port,
result
- result -> true -> connection attempt from source to target succeeded
- result -> false -> connection attempt from source to target failed
- result -> NULL -> connection attempt from the current node to source node failed
I suggest you use the following query to get an overview on the connectivity:
SELECT bool_and(COALESCE(result, false))
FROM citus_check_cluster_node_health();
Whenever this query returns false, there is a connectivity issue, check in detail.
PostgreSQL does not need calling this function since 7.4 release, and it
is a NOOP.
For more details, check PostgreSQL commit below :
commit dd04e958c8b03c0f0512497651678c7816af3198
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun Mar 9 03:34:10 2003 +0000
tuplestore_donestoring() isn't needed anymore, but provide a no-op
macro definition so as not to create compatibility problems.
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index b46babacd1..76fe9fb428 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -17,7 +17,7 @@
* Portions Copyright (c) 1996-2002, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * $Id: tuplestore.h,v 1.8 2003/03/09 02:19:13 tgl Exp $
+ * $Id: tuplestore.h,v 1.9 2003/03/09 03:34:10 tgl Exp $
*
*-------------------------------------------------------------------------
*/
@@ -41,6 +41,9 @@ extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
extern void tuplestore_puttuple(Tuplestorestate *state, void *tuple);
+/* tuplestore_donestoring() used to be required, but is no longer used */
+#define tuplestore_donestoring(state) ((void) 0)
+
/* backwards scan is only allowed if randomAccess was specified 'true' */
extern void *tuplestore_gettuple(Tuplestorestate *state, bool forward,
bool *should_free);
We had 2 class definitions for CitusCacheManyConnectionsConfig, where
one of them was a copy of CitusSmallCopyBuffersConfig.
This commit leaves the intended class definition that configures caching
many connections, and removes the one that is a copy of another class
Since sequences are not marked as distributed while creating table if no
metadata worker node exists, we are marking all sequences distributed
while syncing metadata explicitly.
We've both allowed delegating functions and procedures from worker nodes
and also prevented delegation if a function/procedure has already been
propagated from another node.
Before that PR we were updating citus.pg_dist_object metadata, which keeps
the metadata related to objects on Citus, only on the coordinator node. In
order to allow using those object from worker nodes (or erroring out with
proper error message) we've started to propagate that metedata to worker
nodes as well.
citus_check_connection_to_node runs a simple query on a remote node and
reports whether this attempt was successful.
This UDF will be used to make sure each worker node can connect to all
the worker nodes in the cluster.
parameters:
nodename: required
nodeport: optional (default: 5432)
return value:
boolean success
* Update broken link for upgrade tests
* Update src/test/regress/README.md
Co-authored-by: Nils Dijk <nils@citusdata.com>
Co-authored-by: Nils Dijk <nils@citusdata.com>
As of master branch, Citus does all the modifications to replicated tables
(e.g., reference tables and distributed tables with replication factor > 1),
via 2PC and avoids any shardstate=3. As a side-effect of those changes,
handling node failures for replicated tables change.
With this PR, when one (or multiple) node failures happen, the users would
see query errors on modifications. If the problem is intermitant, that's OK,
once the node failure(s) recover by themselves, the modification queries would
succeed. If the node failure(s) are permenant, the users should call
`SELECT citus_disable_node(...)` to disable the node. As soon as the node is
disabled, modification would start to succeed. However, now the old node gets
behind. It means that, when the node is up again, the placements should be
re-created on the node. First, use `SELECT citus_activate_node()`. Then, use
`SELECT replicate_table_shards(...)` to replicate the missing placements on
the re-activated node.
With this commit, we make sure to use a dedicated connection per
node for all the metadata operations within the same transaction.
This is needed because the same metadata (e.g., metadata includes
the distributed table on the workers) can be modified accross
multiple connections.
With this connection we guarantee that there is a single metadata connection.
But note that this connection can be used for any other operation.
In other words, this connection is not only reserved for metadata
operations.
The checks for preventing to remove a node are very much reference
table centric. We are soon going to add the same checks for replicated
tables. So, make the checks generic such that:
(a) replicated tables fit naturally
(b) we can the same checks in `citus_disable_node`.
We do not use comments starting with # in spec files because it creates
errors from C preprocessor that expects directives after this character.
Instead use C style comments, i.e:
// single line comment
You can also use multiline comments as well
/*
* multi line comment
*/
We re-define the meaning of active shard placement. It used
to only be defined via shardstate == SHARD_STATE_ACTIVE.
Now, we also add one more check. The worker node that the
placement is on should be active as well.
This is a preparation for supporting citus_disable_node()
for MX with multiple failures at the same time.
With this change, the maintanince daemon only needs to
sync the "node metadata" (e.g., pg_dist_node), not the
shard metadata.
Before this commit, we acquire the metadata locks on the reference
tables while removing/disabling a node on all the MX nodes.
Although it has some marginal benefits, such as a concurrent
modification during remove/disable node blocks, instead of erroring
out, the drawbacks seems worse. Both citus_remove_node and citus_disable_node
are not tolerant to multiple node failures.
With this commit, we relax the locks. The implication is that while
a node is removed/disabled, users might see query errors. On the
other hand, this change becomes removing/disabling nodes more
tolerant to multiple node failures.
When refactoring storage layer in #4907, we deleted the code that allows
overwriting a disk page previously written but not known by metadata.
Readers can see the change that introduced the code allows doing so in
commit a8da9acc63.
The reasoning was that; as of 10.2, we started aligning page
reservations (`AlignReservation`) for subsequent writes right after
allocating pages from disk. That means, even if writer transaction
fails, subsequent writes are guaranteed to allocate a new page and write
to there. For this reason, attempting to write to a page allocated
before is not possible for a columnar table that user created when using
v10.2.x.
However, since the older versions of columnar doesn't do that, following
example scenario can still result in writing to such disk page, even if
user now upgraded to v10.2.x. This is because, when upgrading storage to
2.0 (`ColumnarStorageUpdateIfNeeded`), we calculate `reservedOffset` of
the metapage based on the highest used address known by stripe
metadata (`GetHighestUsedAddressAndId`). However, stripe metadata
doesn't have entries for aborted writes. As a result, highest used
address would be computed by ignoring pages that are allocated but not
used.
- User attempts writing to columnar table on Citus v10.0x/v10.1x.
- Write operation fails for some reason.
- User upgrades Citus to v10.2.x.
- When attempting to write to same columnar table, they hit to "attempt
to write columnar data .." error since write operation done in the
older version of columnar already allocated that page, and now we are
overwriting it.
For this reason, with this commit, we re-do the change done in
a8da9acc63.
And for the reasons given above, it wasn't possible to add a test for
this commit via usual code-paths. For this reason, added a UDF only for
testing purposes so that we can reproduce the exact scenario in our
regression test suite.
During pg upgrades, we have seen that it is not guaranteed that a
columnar table will be created after metadata objects got created.
Prior to changes done in this commit, we had such a dependency
relationship in `pg_depend`:
```
columnar_table ----> columnarAM ----> citus extension
^ ^
| |
columnar.storage_id_seq -------------------- |
|
columnar.stripe -------------------------------
```
Since `pg_upgrade` just knows to follow topological sort of the objects
when creating database dump, above dependency graph doesn't imply that
`columnar_table` should be created before metadata objects such as
`columnar.storage_id_seq` and `columnar.stripe` are created.
For this reason, with this commit we add new records to `pg_depend` to
make columnarAM depending on all rel objects living in `columnar`
schema. That way, `pg_upgrade` will know it needs to create those before
creating `columnarAM`, and similarly, before creating any tables using
`columnarAM`.
Note that in addition to inserting those records via installation script,
we also do the same in `citus_finish_pg_upgrade()`. This is because,
`pg_upgrade` rebuilds catalog tables in the new cluster and that means,
we must insert them in the new cluster too.
- [x] Add some more regression test coverage
- [x] Make sure returning works fine in case of
local execution + remote execution
(task->partiallyLocalOrRemote works as expected, already added tests)
- [x] Implement locking properly (and add isolation tests)
- [x] We do #shardcount round-trips on `SerializeNonCommutativeWrites`.
We made it a single round-trip.
- [x] Acquire locks for subselects on the workers & add isolation tests
- [x] Add a GUC to prevent modification from the workers, hence increase the
coordinator-only throughput
- The performance slightly drops (~%15), unless
`citus.allow_modifications_from_workers_to_replicated_tables`
is set to false
Drop extension might cascade to columnar.options before dropping a
columnar table. In that case, we were getting below error when opening
columnar.options to delete records for the columnar table that we are
about to drop.: "ERROR: could not open relation with OID 0".
I somehow reproduced this bug easily when upgrading pg, that is why
adding added the test to after_pg_upgrade_schedule.
We recently introduced a set of patches to 10.2, and introduced 10.2-4
migration version. This migration version only resides on `release-10.2`
branch, and is missing on our default branch. This creates a problem
because we do not have a valid migration path from 10.2 to latest 11.0.
To remedy this issue, I copied the relevant migration files from
`release-10.2` branch, and renamed some of our migration files on
default branch to make sure we have a linear upgrade path.
Before this commit, we required the user to be owner of the shard/table
in order to call lock_shard_resources.
However, that is too restrictive. We can have users with GRANTS
to the table who are not owners of the tables/shards.
With this commit, we allow such patterns.
This change creates a slightly higher abstraction of the `PartitionedResultDestReceiver` where it decouples the partitioning from writing it to a file. This allows for easier reuse for other `DestReceiver`'s that would like to route different tuples to different `DestReceiver`'s.
Originally there was a lot of state kept in `PartitionedResultDestReceiver` to be able to lazily create `FileDestReceivers` when the first tuple arrived for that target. This convoluted the implementation of the processing of tuples with where they should go.
This refactor changes that where it makes the `PartitionedResultDestReceiver` completely agnostic of what kind of Receivers it is writing to. When constructed you pass it a list of `DestReceiver` compatible pointers with the length of `partitionCount`. Internally the `PartitionedResultDestReceiver` keeps track of which `DestReceiver`'s have been started or not, and start them when they first receive a tuple.
Alternatively, if the instantiating code of the `PartitionedResultDestReceiver` wants, the startup can be turned from lazily to eagerly. When the startup is eager (not lazy) all `rStartup` functions on the list of `DestReceiver`'s are called during the startup of the `PartitionedResultDestReceiver` and marked as such.
A downside of this approach is the following. On highly partitioned destinations we now need to allocate a `FileDestReceiver` for every target, _always_. When the data passed into the `PartitionedResultDestReceiver` is highly skewed to a small set of `FileDestReceiver`'s this will waste some memory. Given the small size of a `FileDestReceiver`, and the fact that actual file handles are only created during the processing of the startup of the `FileDestReceiver` I think this memory waste is not a problem. If this would become a problem we could refactor the source list into some kind of generator object which can generate the `DestReceiver`'s on the fly.
* Refactor some checks in citus local tables
* all existing citus local tables are auto converted after upgrade
* Update warning messages in CreateCitusLocalTable
* Hide notice msg for auto converting local tables
* Hide hint msg
Co-authored-by: Ahmet Gedemenli <afgedemenli@gmail.com>
This PR is fixing 2 separate issues related to the local run of citus upgrade tests.
d3e7c825ab fixes the issue that, with our new testing infrastructure, we moved/renamed some of existing folders. This created a problem for local runs of citus upgrade tests since some paths were sensitive to such changes. This commit tries to make it more generic so that this issue is less likely to happen in the future, while also fixing the current issue.
93de6b60c3 we are fixing an issue that a new environment variable was added for citus upgrade tests, which is defined in the CI. 0cb51f8c37/.circleci/config.yml (L294)
This environment variable wasn't set in our local runs hence it would create problems. Instead of defining this environment variable in the local run, we change the citus_upgrade run command to use an existing env variable, which is now also set in the CI.
We fixed some crashes a while back that would only occur in cases where
the value of a distribution column would have result in a high or a very
low hash value. This adds a regression test for those crashes.
This test starts passing because of PR #4508, to be precise commit:
24e60b44a1
When I undo that commit this newly added test starts failing. This adds
this test to make sure we don't regress on this again.
Clang 13 complains about a suspicious string concatenation. It thinks we
might have missed a comma. This adds parentheses to make it clear that
concatenation is indeed what we meant.
There is a vulnerability in mitmproxy with the version we are using.
It would be hard to exploit anything with regards to the artifacts we ship as its only used in our test suite. Still its good hygiene to _not_ use software with known vulnerabilities.
This PR updates the version of python, mitmproxy and the crypto libraries used.
The latest version of mitmproxy for python 3.6 is not patched, hence the upgrade of python.
For our CI images this cascades into upgrading debian as well :)
For CI we bake these versions in our images so we need to update them as well.
Changes to the CI images: https://github.com/citusdata/the-process/pull/65
It seems like the decision for 2PC is more complicated than
it should be.
With this change, we do one behavioral change. In essense,
before this commit, when a SELECT task with replication factor > 1
is executed, the executor was triggering 2PC. And, in fact,
the transaction manager (`ConnectionModifiedPlacement()`) was
able to understand not to trigger 2PC when no modification happens.
However, for transaction blocks like:
BEGIN;
-- a command that triggers 2PC
-- A SELECT command on replication > 1
..
COMMIT;
The SELECT was used to be qualified as required 2PC. And, as a side-effect
the executor was setting `xactProperties.errorOnAnyFailure = true;`
So, the commands was failing at the time of execution. Now, they fail at
the end of the transaction.
In the past, we allowed users to manually switch to 1PC
(e.g., one phase commit). However, with this commit, we
don't. All multi-shard modifications are done via 2PC.
With Citus 9.0, we introduced `citus.single_shard_commit_protocol` which
defaults to 2PC.
With this commit, we prevent any user to set it to 1PC and drop support
for `citus.single_shard_commit_protocol`.
Although this might add some overhead for users, it is already the default
behaviour (so less likely) and marking placements as INVALID is much
worse.
- citus_get_all_dependencies_for_object: emulate what Citus
would qualify as
dependency when adding
a new node
- citus_get_dependencies_for_object: emulate what Citus would qualify
as dependency when creating an
object
Example use:
```SQL
-- find all the depedencies of table test
SELECT
pg_identify_object(t.classid, t.objid, t.objsubid)
FROM
(SELECT * FROM pg_get_object_address('table', '{test}', '{}')) as addr
JOIN LATERAL
citus_get_all_dependencies_for_object(addr.classid, addr.objid, addr.objsubid) as t(classid oid, objid oid, objsubid int)
ON TRUE
ORDER BY 1;
```
To run tests in parallel use:
```bash
make check-arbitrary-configs parallel=4
```
To run tests sequentially use:
```bash
make check-arbitrary-configs parallel=1
```
To run only some configs:
```bash
make check-arbitrary-base CONFIGS=CitusSingleNodeClusterConfig,CitusSmallSharedPoolSizeConfig
```
To run only some test files with some config:
```bash
make check-arbitrary-base CONFIGS=CitusSingleNodeClusterConfig EXTRA_TESTS=dropped_columns_1
```
To get a deterministic run, you can give the random's seed:
```bash
make check-arbitrary-configs parallel=4 seed=12312
```
The `seed` will be in the output of the run.
In our regular regression tests, we can see all the details about either planning or execution but this means
we need to run the same query under different configs/cluster setups again and again, which is not really maintanable.
When we don't care about the internals of how planning/execution is done but the correctness, especially with different configs
this infrastructure can be used.
With `check-arbitrary-configs` target, the following happens:
- a bunch of configs are loaded, which are defined in `config.py`. These configs have different settings such as different shard count, different citus settings, postgres settings, worker amount, or different metadata.
- For each config, a separate data directory is created for tests in `tmp_citus_test` with the config's name.
- For each config, `create_schedule` is run on the coordinator to setup the necessary tables.
- For each config, `sql_schedule` is run. `sql_schedule` is run on the coordinator if it is a non-mx cluster. And if it is mx, it is either run on the coordinator or a random worker.
- Tests results are checked if they match with the expected.
When tests results don't match, you can see the regression diffs in a config's datadir, such as `tmp_citus_tests/dataCitusSingleNodeClusterConfig`.
We also have a PostgresConfig which runs all the test suite with Postgres.
By default configs use regular user, but we have a config to run as a superuser as well.
So the infrastructure tests:
- Postgres vs Citus
- Mx vs Non-Mx
- Superuser vs regular user
- Arbitrary Citus configs
When you want to add a new test, you can add the create statements to `create_schedule` and add the sql queries to `sql_schedule`.
If you are adding Citus UDFs that should be a NO-OP for Postgres, make sure to override the UDFs in `postgres.sql`.
You can add your new config to `config.py`. Make sure to extend either `CitusDefaultClusterConfig` or `CitusMXBaseClusterConfig`.
On the CI, upon a failure, all logfiles will be uploaded as artifacts, so you can check the artifacts tab.
All the regressions will be shown as part of the job on CI.
In your local, you can check the regression diffs in config's datadirs as in `tmp_citus_tests/dataCitusSingleNodeClusterConfig`.
Add/fix tests
Fix creating partitions
Add test for mx - partition creating case
Enable cascading to partitioned tables
Fix mx partition adding test
Fix cascading through fkeys
Style
Disable converting with non-inherited fkeys
Fix detach bug
Early return in case of cascade & Add tests
Style
Fix undistribute_table bug & Fix test outputs
Remove RemovePartitionRelationIds
Test with undistribute_table
Add test for mx+convert+undistribute
Remove redundant usage of CreatePartitionedCitusLocalTable
Add some comments
Introduce bulk functions for generating attach/detach partition commands
Fix: Convert partitioned tables after adding fkey
Change the error message for partitions
Introduce function ErrorIfPartitionTableAddedToMetadata
Polish attach/detach command generation functions
Use time_partitions for testing
Move mx tests to citus_local_tables_mx
Add new partitioned table to cascade test
Add test with time series management UDFs
Fix test output
Fix: Assertion fail on relation access tracking
Style
Refactor creating partitioned citus local tables
Remove CreatePartitionedCitusLocalTable
Style
Error out if converting multi-level table
Revert some old tests
Error out adding partitioned partition
Polish
Polish/address
Fix create table partition of case
Use CascadeOperationForRelationIdList if no cascade needed
Fix create partition bug
Revert / Add new tests to mx
Style
Fix dropping fkey bug
Add test with IF NOT EXISTS
Convert to CLT when doing ATTACH PARTITION
Add comments
Add more tests with time series management
Edit the error message for converting the child
Use OR instead of AND in ErrorIfUnsupportedAlterTableStmt
Edit/improve tests
Disable ddl prop when dropping default column definitions
Disable/enable ddl prop just before/after the command
Add comment
Add sequence test
Add trigger test
Remove NeedCascadeViaForeignKeys
Add one more insert to sequence test
Add comment
Style
Fix test output shard ids
Update comments
Disable creating fkey on partitions
Move partition check to CreateCitusLocalTable
Add comment
Add check for attachingmulti-level partition
Add test for pg_constraint
Check pg_dist_partition in tests
Add test inserting on the worker
* Add udf to include shardId in broken partition shard index names
* Address reviews: rename index such that operations can be done on it
* More comprehensive index tests
* Final touches and formatting
Under high write concurrency, we were sometimes reading columnar
metapage as all zeros.
In `WriteToBlock()`, if `clear == true`, then it will clear the page before
writing the new one, rather than just adding data to the page. That
means any concurrent connection that is holding only a pin will be
able to see the all-zero state between the `InitPage()` and the
`memcpy_s()`.
Moreover, postgres/storage/buffer/README states that:
> Buffer access rules:
>
> 1. To scan a page for tuples, one must hold a pin and either shared or
> exclusive content lock. To examine the commit status (XIDs and status bits)
> of a tuple in a shared buffer, one must likewise hold a pin and either shared
> or exclusive lock.
For those reasons, we have to make sure to never keep a pin on the
page without (at least) the shared lock, to avoid having such problems.
A write operation might trigger index deletion if index already had
dead entries for the key we are about to insert.
There are two ways of index deletion:
a) simple deletion
b) bottom-up deletion (>= pg14)
Since columnar_index_fetch_tuple never sets all_dead to true,
columnarAM doesn't ever expect to receive simple deletion requests
(columnar_index_delete_tuples) as we don't mark any index entries
as dead.
However, since columnarAM doesn't delete any dead entries via simple
deletion, postgres might ask for a more comprehensive deletion
(i.e.: bottom-up) at some point when pg >= 14.
So with this commit, we start gracefully ignoring bottom-up deletion
requests made to columnar_index_delete_tuples.
Given that users can anyway "VACUUM FULL" their columnar tables,
we don't see any problem in ignoring deletion requests.
* Make (columnar.stripe) first_row_number index a unique constraint
Since stripe_first_row_number_idx is required to scan a columnar
table, we need to make sure that it is created before doing anything
with columnar tables during pg upgrades.
However, a plain btree index is not a dependency of a table, so
pg_upgrade cannot guarantee that stripe_first_row_number_idx gets
created when creating columnar.stripe, unless we make it a unique
"constraint".
To do that, drop stripe_first_row_number_idx and create a unique
constraint with the same name to keep the code change at minimum.
* Add more pg upgrade tests for columnar
* Fix a logic error in uprade_columnar_after test
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
We were trying to find the cause for a strange update bug. We thought
`pg_upgrade` succeeded and then were surprised that certain data was not
in the database after the upgrade. Instead `pg_upgrade` had failed
halfway through with an actionable error. It took us pretty long to
realise this.
This commit adds checking of exit codes to a lot more subprocess
executions. That should make debugging in the future much easier.
BuildStripeMetadata() calls HeapTupleHeaderGetXmin(), which must only
be called on a proper heap tuple with MVCC information. Make sure the
caller passes the heap tuple, and not a datum tuple.
Fixes#5318.
Considering all code-paths that we might interact with a columnar table,
add `CheckCitusVersion` calls to tableAM callbacks:
- initializing table scan (`columnar_beginscan` & `columnar_index_fetch_begin`)
- setting a new filenode for a relation (storage initializiation or a table rewrite)
- truncating the storage
- inserting tuple (single and multi)
Also add `CheckCitusVersion` call to:
- drop hook (`ColumnarTableDropHook`)
- `alter_columnar_table_set` & `alter_columnar_table_reset` UDFs
* Columnar: separate plain and exec quals.
Make a clear separation between plain quals, which contain constants
or extern params; and exec quals, which contain exec params and can't
be evaluated until a rescan.
Fixes#5258.
* more vanilla tests
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
When performing a partition-wise join, the planner will adjust paths
parameterized by the parent rel to instead parameterize by the child
rel directly. When this reparameterization happens, we also need to
adjust the join quals to reference the child rather than the parent.
Fixes#5257.
Not flush pending writes if given tid belongs to a "flushed" or
"aborted" stripe write, or to an "in-progress" stripe write of
another backend.
That way, we would reduce the cases where we flush single-tuple
stripes during index scan.
To do that, we follow below steps for index look-up's:
- Do not flush any pending writes and do stripe metadata look-up for
given tid.
If tuple with tid is found, then no need to do another look-up
since we already found the tuple without needing to flush pending
writes.
- If tuple is not found without flushing pending writes, then we have two
scenarios:
- If given tid belongs to a pending write of my backend, then do stripe
metadata look-up for given tid. But this time first **flush any pending
writes**.
- Otherwise, just return false from `index_fetch_tuple` since flushing
pending writes wouldn't help.
With 5825c44d5f, we made the changes to
skip aborted writes when scanning a columnar table.
However, looks like we forgot to handle such cases for the very first
call made to columnar_getnextslot. That means, that commit only
considered the intermediate stripe read operations.
However, functions called by columnar_getnextslot to find first stripe
to read (ColumnarBeginRead & ColumnarRescan) were not caring about
those aborted writes.
To fix that, we teach AdvanceStripeRead to find the very first stripe
to read, and then start using it where were blindly calling
FindNextStripeByRowNumber.
Recently there are some warnings during the compilation of Citus.
Part of the warnings come due to the `columnar_tableam.h` header not being properly guarded with defines and ifndef's.
This PR fixes these warnings.
Previously, even when `EXPLAIN` output tells that we will do
index-only scan, it was never the case since columnar tables
don't have the visibility fork that postgres is looking for.
For this reason, visibility check done in
`IndexOnlyNext->VM_ALL_VISIBLE`
code-path was always returning false and postgres was reading
the tuple from the columnar relation itself.
Previously, for regular table scans, we were setting `RelOptInfo->partial_pathlist`
to `NIL` via `set_rel_pathlist_hook` to discard scan `Path`s that need to use any
parallel workers, this was working nicely.
However, when building indexes, this hook doesn't get called so we were not
able to prevent spawning parallel workers when building an index. For this
reason, 9b4dc2f804 added basic
implementation for `columnar_parallelscan_*` callbacks but also made some
changes to skip using those workers when building the index.
However, now that we are doing stripe reservation in two stages, we call
`heap_inplace_update` at some point to complete stripe reservation.
However, postgres throws an error if we call `heap_inplace_update` during
a parallel operation, even if we don't actually make use of those workers.
For this reason, with this pr, we make sure to not generate scan `Path`s that
need to use any parallel workers by using `get_relation_info_hook`.
This is indeed useful to prevent spawning parallel workers during index builds.
If it is certain that we will not use any `parallel_worker`s for a columnar table,
then stripe entries inserted by aborted transactions become visible to
`SnapshotAny` and that causes `REINDEX` to fail by throwing a duplicate key
error.
To fix that:
* consider three states for a stripe write operation:
"flushed", "aborted", or "in-progress",
* make sure to have a clear separation between them, and
* act according to those three states when reading from a columnar table
Since PG14 we can now use binary encoding for arrays and composite types
that contain user defined types. This was fixed in this commit in
Postgres: 670c0a1d47
This change starts using that knowledge, by not necessarily falling back
to text encoding anymore for those types.
While doing this and testing a bit more I found various cases where
binary encoding would fail that our checks didn't cover. This fixes
those cases and adds tests for those. It also fixes EXPLAIN ANALYZE
never using binary encoding, which was a leftover of workaround that
was not necessary anymore.
Finally, it changes the default for both `citus.enable_binary_protocol`
and `citus.binary_worker_copy_format` to `true` for PG14 and up. In our
cloud offering `binary_worker_copy_format` already was true by default.
`enable_binary_protocol` had some bug with MX and user defined types,
this bug was fixed by the above mentioned fixes.
- get_missing_time_partition_ranges: Gets the ranges of missing partitions for the given table, interval and range unless any existing partition conflicts with calculated missing ranges.
- create_time_partitions: Creates partitions by getting range values from get_missing_time_partition_ranges.
- drop_old_time_partitions: Drops partitions of the table older than given threshold.
* Rename RecostColumnarPaths to CostColumnarPaths
* Rename RecostColumnarIndexPath to CostColumnarIndexPath
* Reorder args of CostColumnarScan to align with other two costing functions
* Not adjust index scan start-up cost
* Rename ColumnarIndexScanAddTotalCost to ColumnarIndexScanAdditionalCost
* Reflect that index scan will at least read one stripe in totalCost calculation
* Organize declarations in columnar_customscan.c
In PG 14, procedures can have OUT parameters. In Citus' procedure
delegation framework, we need to adjust the function expression
to get the outargs parameters.
Releven PG change:
e56bce5d43
Simply call Postgres' function to report the progress on
each row recieved.
Note that we currently do not support "COPY dist/ref TO .." progress
report nicely. Citus has some specialized logic to support
"COPY dist/ref TO .." such that it either converts the underlying
command into "COPY (SELECT * FROM dist/ref ) ..." or sends COPY
command to shards directly. In the former case, "tuples_processed"
is only updated when the executor returns all the tuples, so the
progress is not accurate. In the latter case, Citus can actually
implement the progress report. But, for the sake of consistency,
we prefer to not implement at all.
Added to PG 14 with https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8a4f618e7ae3cb11b0b37d0f06f05c8ff905833f
It seems like there is a problem with Postgres14 with SELECT DISTINCT
COUNT. The issue is reported to Postgres and an alternative output is
added. We can remove the alternative output when the issue is fixed on
PG. If this is not an issue on PG(which is unlikely) we should consider
some other solution.
In order to avoid adding an alternative output, a function to check if a
given explan plan has a single task added. This doesn't change what the
changed tests intend to do.
Postgres changed stats expression types as of PG14. Hence we needed to
write the AppendColumnNames method. Also they removed the error on PG
side so we remove it as well.
Relevant commits on pg14:
a4d75c86bf15220df22de0a92c819ecef9db3849
388e75ad33489b77cfb9a8590a91e9287d8fb960
When queryId is not 0 and verbose is true, the query identifier is
emitted to the explain output. This is breaking Postgres outputs.
We disable de query identifier calculation in the tests.
Commit on PG that introduced the query identifier in the explain output:
4f0b0966c866ae9f0e15d7cc73ccf7ce4e1af84b
These changes were removed in commit: Introduces ExecSimpleRelationInsert_compat and modifyStateResultRelInfo macros
We shouldn't have removed them but instead kept them for before PG14
There was a small part in multi_partitioning that would need an
alternative output for pg14. Instead of adding an alternative for the
whole file, we created a new file, called partition_wise_join.sql and
added the alternative output for that.
When we check the exact version of the seg extension, it becomes a
problem when its version changes, such as from 1.3 to 1.4. So now we
modified the changes to check for that the version is the same in all
the cluster.
make_simple_restrictinfo and pull_varnos functions now have a new parameter
These new macros give us the ability to use this new parameter for PG14 and they don't give the parameter for previous versions
Relevant PG commit:
55dc86eca70b1dc18a79c141b3567efed910329d
Postgres tightened up its checks for invalid GUC names hence we started
to get an alternative output for one of our tests. We add an alternative
output since the file is relatively small.
Commit on PG:
3db826bd55cd1df0dd8c3d811f8e5b936d7ba1e4
Relevant PG commit:
9e38c2bb5093ceb0c04d6315ccd8975bd17add66
fix array_cat_agg for pg upgrades
array_cat_agg now needs to take anycompatiblearray instead of anyarray
because array_cat changed its type from anyarray to anycompatiblearray
with pg14.
To handle upgrades correctly, we drop the aggregate in
citus_pg_prepare_upgrade. To be able to drop it, we first remove the
dependency from pg_depend.
Then we create the right aggregate in citus_finish_pg_upgrade and we
also add the dependency back to pg_depend.
Postgres doesn't accept NULL for queryStrings in explain plans anymore.
Internally, there are some places in Postgres where they modified the
NULLS to ""(the empty string). So we do the same on citus side.
Commit on Postgres:
1111b2668d89bfcb6f502789158b1233ab4217a6
Postgres expects to set the HASH_STRINGS explicitly in case of the
default behaivor for string hash function.
Postgres Commit
b3817f5f774663d55931dd4fab9c5a94a15ae7ab
index_insert function now has a new parameter, indexUnchanged
This new macro give us the ability to use these new parameter for PG14 and they don't give the parameters for previous versions
Existing parameter is set to false
Relevant PG commit:
9dc718bdf2b1a574481a45624d42b674332e2903
es_result_relation_info is removed from Estate. In this commit we make some changes to handle that.
resultRelationInfo filed is added to ModifyState to support the removed field.
Relevant PG commits:
1375422c7826a2bf387be29895e961614f69de4b
a04daa97a4339c38e304cd6164d37da540d665a8
GetOldestXmin function is removed so we use GetOldestNonRemovableTransactionId functions instead
GetOldestNonRemovableTransactionId_compat picks the appropriate one
Relevant PG commit:
dc7420c2c9274a283779ec19718d2d16323640c0
get_partition_parent and RelationGetPartitionDesc functions now have new parameters to also include detached partitions
Thess new macros give us the ability to use these new parameter for PG14 and they don't give the parameters for previous versions
Existing parameters are set to not accept detached partitions
Relevant PG commit:
71f4c8c6f74ba021e55d35b1128d22fb8c6e1629
In two commits vacuumFlags in PGXACT is moved and then renamed to status flags
This macro uses the appropriate version of the flag
Relevant PG commits:
5788e258bb26495fab65ff3aa486268d1c50b123
cd9c1b3e197a9b53b840dcc87eb41b04d601a5f9
SetTuplestoreDestReceiverParams function now has two new parameters
This new macro give us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
Existing parameters are set to NULL to keep previous behavior
Relevant PG commit:
2f48ede080f42b97b594fb14102c82ca1001b80c
Some Copy related functions copied from Postgres had support for both old and new protocols
Postgres removed support for old version so we remove it too
Relevant PG commit:
3174d69fb96a66173224e60ec7053b988d5ed4d9
New macros: standard_ProcessUtility_compat, ProcessUtility_compat, ColumnarProcessUtility_compat, PrevProcessUtilityHook_compat
The functions now have a new bool parameter: readOnlyTree
These new macros give us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
In multi_ProcessUtility and ColumnarProcessUtility, before doing anything else, we check if readOnlyTree parameter is true and create a copy of pstmt
Existing readOnlyTree parameters are set to false since we already handle the read only case at multi_ProcessUtility and ColumnarProcessUtility
Relevant PG commit:
7c337b6b527b7052e6a751f966d5734c56f668b5
This function was copied from Postgres but it is not static at PG14
So we keep the definition only for previous versions
Relevant PG commit:
c532d15dddff14b01fe9ef1d465013cb8ef186df
BeginCopyFrom function now has a new whereClause parameter.
In the function this parameter is assigned to the whereClause field of the CopyFromState returned
Currently in Postgres there is only one place where this argument isn't NULL, and in previous PG version the whereClause argument of copy state is set right after the function call
Since we don't have such example all current whereClause parameters are set to NULL
Relevant PG commit:
c532d15dddff14b01fe9ef1d465013cb8ef186df
CopyState struct is divided into parts and one of them is CopyFromState
This macro uses the appropriate one for PG versions
Relevant PG commit:
c532d15dddff14b01fe9ef1d465013cb8ef186df
In ReindexStmt concurrent field is moved to options and then options are converted to params list.
This macro uses previous fields for previous versions and the new params list with a new function named IsReindexWithParam for PG14
Relevant PG commits:
844c05abc3f1c1703bf17cf44ab66351ed9711d2
b5913f6120792465f4394b93c15c2e2ac0c08376
VacOptTernaryValue enum is renamed to VacOptValue.
In the enum there were three values, VACOPT_TERNARY_DEFAULT, VACOPT_TERNARY_DISABLED, and VACOPT_TERNARY_ENABLED
Now there are four values VACOPTVALUE_UNSPECIFIED, VACOPTVALUE_AUTO, VACOPTVALUE_DISABLED, and VACOPTVALUE_ENABLED
New macros are VacOptValue_compat, VACOPTVALUE_UNSPECIFIED_COMPAT, VACOPTVALUE_DISABLED_COMPAT, and VACOPTVALUE_ENABLED_COMPAT
The VACOPTVALUE_UNSPECIFIED_COMPAT matches VACOPT_TERNARY_DEFAULT and VACOPTVALUE_UNSPECIFIED. And there are no macro for VACOPTVALUE_AUTO.
Relevant PG commit:
3499df0dee8c4ea51d264a674df5b5e31991319a
New macros: FuncnameGetCandidates_compat and expand_function_arguments_compat
The functions (the ones without _compat) now have a new bool include_out_arguments parameter
These new macros give us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
Existing include_out_arguments parameters are set to 'false' to keep current behavior
Relevant PG commit:
e56bce5d43789cce95d099554ae9593ada92b3b7
stats function now have a new bool print_to_stderr parameter
This new macro gives us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
Existing print_to_stderr parameter is set to true to keep current behavior
Relevant PG commit:
43620e328617c1f41a2a54c8cee01723064e3ffa
getObjectTypeDescription and getObjectIdentity functions now have a new bool missing_ok parameter
These new macros give us the ability to use this new parameter for PG14 and they don't give the parameter for previous versions
Currently all missing_ok parameters are set to false to keep current behavior
Relevant PG commit:
2a10fdc4307a667883f7a3369cb93a721ade9680
The STATUS_WAITING define is removed and an enum with PROC_WAIT_STATUS_WAITING is added instead
This macro uses appropriate one
Relevant PG commit:
a513f1dfbf2c29a51b0f7cbd5913ce2d2ee452c5
AlterTableStmt's relkind field is changed into objtype
New AlterTableStmtObjType macro uses the appropriate one
Relevant PG commit:
cc35d8933a211d9965eb1c1d2749a903d5735db2
Allow ColumnarScans to push down join quals by generating
parameterized paths. This significantly expands the utility of chunk
group filtering, making a ColumnarScan behave similar to an index when
on the inner of a nested loop join.
Also, evaluate all parameters on beginscan/rescan, which also works
for external parameters.
Fixes#4488.