When working on changelog, Marco suggested in
https://github.com/citusdata/citus/pull/6856#pullrequestreview-1386601215
that we should bump columnar version to 11.3 as well.
This PR aims to contain all the necessary changes to allow upgrades to
and downgrades from 11.3.0 for columnar. Note that updating citus
extension version does not affect columnar as the two extension versions
are not really coupled.
The same changes will also be applied to the release branch in
https://github.com/citusdata/citus/pull/6897
We are handling colocation groups with shard group count less than the
worker node count, using a method different than the usual rebalancer.
See #6739
While making the decision of using this method or not, we should've
ignored the nodes that are marked `shouldhaveshards = false`. This PR
excludes those nodes when making the decision.
Adds a test such that:
coordinator: []
worker 1: [1_1, 1_2]
worker 2: [2_1, 2_2]
(rebalance)
coordinator: []
worker 1: [1_1, 2_1]
worker 2: [1_2, 2_2]
If we take the coordinator into account, the rebalancer considers the
first state as balanced and does nothing (because shard_count <
worker_count)
But with this pr, we ignore the coordinator because it's
shouldhaveshards = false
So the rebalancer distributes each colocation group to both workers
Also, fixes an unrelated flaky test in the same file
We need to break sequence dependency for a table while creating the
table during non-transactional metadata sync to ensure idempotency of
the creation of the table.
**Problem:**
When we send `SELECT
pg_catalog.worker_drop_sequence_dependency(logicalrelid::regclass::text)
FROM pg_dist_partition` to workers during the non-transactional sync,
table might not be in `pg_dist_partition` at worker, and sequence
dependency is not broken at the worker.
**Solution:**
We break sequence dependency via `SELECT
pg_catalog.worker_drop_sequence_dependency(logicalrelid::regclass::text)`
for each table while creating it at the workers. It is safe to send
since the udf is a no-op when there is no sequence dependency.
DESCRIPTION: Fixes a bug related to sequence idempotency at
non-transactional sync.
Fixes https://github.com/citusdata/citus/issues/6888.
In #6814 we started using the Python test runner for upgrade tests in
run_test.py, instead of the Perl based one. This had a problem though,
not all tests in minimal_schedule can be run with the Python runner.
This adds a separate minimal schedule for the pg_upgrade tests which
doesn't include the tests that break with the Python runner.
This PR also fixes various other issues that came up while testing
the upgrade tests.
`PlaceHolderVar` is not relevant to be processed inside a restriction
clause. Otherwise, `pull_var_clause_default` would throw error. PG would
create the restriction to physical `Var` that `PlaceHolderVar` points to
anyway, so it is safe to skip this restriction.
DESCRIPTION: Fixes a bug related to WHERE clause list which contains
placeholder.
Fixes https://github.com/citusdata/citus/issues/6758
DESCRIPTION: Changes the regression test setups adding the coordinator
to metadata by default.
When creating a Citus cluster, coordinator can be added in metadata
explicitly by running `citus_set_coordinator_host ` function. Adding the
coordinator to metadata allows to create citus managed local tables.
Other Citus functionality is expected to be unaffected.
This change adds the coordinator to metadata by default when creating
test clusters in regression tests.
There are 3 ways to run commands in a sql file (or a schedule which is a
sequence of sql files) with Citus regression tests. Below is how this PR
adds the coordinator to metadata for each.
1. `make <schedule_name>`
Changed the sql files (sql/multi_cluster_management.sql and
sql/minimal_cluster_management.sql) which sets up the test clusters such
that they call `citus_set_coordinator_host`. This ensures any following
tests will have the coordinator in metadata by default.
2. `citus_tests/run_test.py <sql_file_name>`
Changed the python code that sets up the cluster to always call `
citus_set_coordinator_host`.
For the upgrade tests, a version check is included to make sure
`citus_set_coordinator_host` function is available for a given version.
3. ` make check-arbitrary-configs `
Changed the python code that sets up the cluster to always call
`citus_set_coordinator_host `.
#6864 will be used to track the remaining work which is to change the
tests where coordinator is added/removed as a node.
This PR updates the tenant stats implementation to set partitionKeyValue
and colocationId in ExecuteLocalTaskListExtended, in addition to
LocallyExecuteTaskPlan. This ensures that tenant stats can be properly
gathered regardless of the code path taken. The changes were initially
made while testing stored procedure calls for tenant stats.
Fixes the bug that causes updating the citus_stat_tenants periods
incorrectly.
`TimestampDifferenceExceeds` expects the difference in milliseconds but
it was microseconds, this is fixed.
`tenantStats->lastQueryTime` was updated during monitoring too, now it's
updated only when there are tenant queries.
DESCRIPTION: Adds control for background task executors involving a node
### Background and motivation
Nonblocking concurrent task execution via background workers was
introduced in [#6459](https://github.com/citusdata/citus/pull/6459), and
concurrent shard moves in the background rebalancer were introduced in
[#6756](https://github.com/citusdata/citus/pull/6756) - with a hard
dependency that limits to 1 shard move per node. As we know, a shard
move consists of a shard moving from a source node to a target node. The
hard dependency was used because the background task runner didn't have
an option to limit the parallel shard moves per node.
With the motivation of controlling the number of concurrent shard
moves that involve a particular node, either as source or target, this
PR introduces a general new GUC
citus.max_background_task_executors_per_node to be used in the
background task runner infrastructure. So, why do we even want to
control and limit the concurrency? Well, it's all about resource
availability: because the moves involve the same nodes, extra
parallelism won’t make the rebalance complete faster if some resource is
already maxed out (usually cpu or disk). Or, if the cluster is being
used in a production setting, the moves might compete for resources with
production queries much more than if they had been executed
sequentially.
### How does it work?
A new column named nodes_involved is added to the catalog table that
keeps track of the scheduled background tasks,
pg_dist_background_task. It is of type integer[] - to store a list
of node ids. It is NULL by default - the column will be filled by the
rebalancer, but we may not care about the nodes involved in other uses
of the background task runner.
Table "pg_catalog.pg_dist_background_task"
Column | Type
============================================
job_id | bigint
task_id | bigint
owner | regrole
pid | integer
status | citus_task_status
command | text
retry_count | integer
not_before | timestamp with time zone
message | text
+nodes_involved | integer[]
A hashtable named ParallelTasksPerNode keeps track of the number of
parallel running background tasks per node. An entry in the hashtable is
as follows:
ParallelTasksPerNodeEntry
{
node_id // The node is used as the hash table key
counter // Number of concurrent background tasks that involve node node_id
// The counter limit is citus.max_background_task_executors_per_node
}
When the background task runner assigns a runnable task to a new
executor, it increments the counter for each of the nodes involved with
that runnable task. The limit of each counter is
citus.max_background_task_executors_per_node. If the limit is reached
for any of the nodes involved, this runnable task is skipped. And then,
later, when the running task finishes, the background task runner
decrements the counter for each of the nodes involved with the done
task. The following functions take care of these increment-decrement
steps:
IncrementParallelTaskCountForNodesInvolved(task)
DecrementParallelTaskCountForNodesInvolved(task)
citus.max_background_task_executors_per_node can be changed in the
fly. In the background rebalancer, we simply give {source_node,
target_node} as the nodesInvolved input to the
ScheduleBackgroundTask function. The rest is taken care of by the
general background task runner infrastructure explained above. Check
background_task_queue_monitor.sql and
background_rebalance_parallel.sql tests for detailed examples.
#### Note
This PR also adds a hard node dependency if a node is first being used
as a source for a move, and then later as a target. The reason this
should be a hard dependency is that the first move might make space for
the second move. So, we could run out of disk space (or at least
overload the node) if we move the second shard to it before the first
one is moved away.
Fixes https://github.com/citusdata/citus/issues/6716
DESCRIPTION: PR description that will go into the change log, up to 78
characters
---------
Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
Fixes flakiness in multi_metadata_sync test
https://app.circleci.com/pipelines/github/citusdata/citus/31863/workflows/ea937480-a4cc-4646-815c-bb2634361d98/jobs/1074457
```diff
SELECT
logicalrelid, repmodel
FROM
pg_dist_partition
WHERE
logicalrelid = 'mx_test_schema_1.mx_table_1'::regclass
OR logicalrelid = 'mx_test_schema_2.mx_table_2'::regclass;
logicalrelid | repmodel
-----------------------------+----------
- mx_test_schema_1.mx_table_1 | s
mx_test_schema_2.mx_table_2 | s
+ mx_test_schema_1.mx_table_1 | s
(2 rows)
```
This is a simple issue of missing `ORDER BY` clauses. I went ahead and
added some other missing ones in the same file as well. Also, I replaced
existing `ORDER BY logicalrelid` with `ORDER BY logicalrelid::text`, in
order to compare names, not OIDs.
DESCRIPTION: Adds views that monitor statistics on tenant usages
This PR adds `citus_stats_tenants` view that monitors the tenants on the
cluster.
`citus_stats_tenants` shows the node id, colocation id, tenant
attribute, read count in this period and last period, and query count in
this period and last period of the tenant.
Tenant attribute currently is the tenant's distribution column value,
later when schema based sharding is introduced, this meaning might
change.
A period is a time bucket the queries are counted by. Read and query
counts for this period can increase until the current period ends. After
that those counts are moved to last period's counts, which cannot
change. The period length can be set using 'citus.stats_tenants_period'.
`SELECT` queries are counted as _read_ queries, `INSERT`, `UPDATE` and
`DELETE` queries are counted as _write_ queries. So in the view read
counts are `SELECT` counts and query counts are `SELECT`, `INSERT`,
`UPDATE` and `DELETE` count.
The data is stored in shared memory, in a struct named
`MultiTenantMonitor`.
`citus_stats_tenants` shows the data from local tenants.
`citus_stats_tenants` show up to `citus.stats_tenant_limit` number of
tenants.
The tenants are scored based on the number of queries they run and the
recency of those queries. Every query ran increases the score of tenant
by `ONE_QUERY_SCORE`, and after every period ends the scores are halved.
Halving is done lazily.
To retain information a longer the monitor keeps up to 3 times
`citus.stats_tenant_limit` tenants. When the tenant count hits `3 *
citus.stats_tenant_limit`, last `citus.stats_tenant_limit` tenants are
removed. To see all stored tenants you can use
`citus_stats_tenants(return_all_tenants := true)`
- [x] Create collector view that gets data from all nodes. #6761
- [x] Add monitoring log #6762
- [x] Create enable/disable GUC #6769
- [x] Parse the annotation string correctly #6796
- [x] Add local queries and prepared statements #6797
- [x] Rename to citus_stat_statements #6821
- [x] Run pgbench
- [x] Fix role permissions #6812
---------
Co-authored-by: Gokhan Gulbiz <ggulbiz@gmail.com>
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
In CI we would sometimes get this failure:
```diff
-- The original shard is marked for deferred drop with policy_type = 2.
-- The previous shard should be dropped at the beginning of the second split call
SELECT * from pg_dist_cleanup;
record_id | operation_id | object_type | object_name | node_group_id | policy_type
-----------+--------------+-------------+--------------------------------------------------------------------------+---------------+-------------
+ 60 | 778 | 3 | citus_shard_split_slot_18_21216_778 | 16 | 0
512 | 778 | 1 | citus_split_shard_by_split_points_deferred_schema.table_to_split_8981001 | 16 | 2
-(1 row)
+(2 rows)
```
Replication slots sometimes cannot be deleted right away. Which is hard
to resolve, but luckily we can filter these cleanup records out easily
by filtering by policy_type.
While debugging this issue I learnt that we did not use
`GetNextCleanupRecordId` in all places where we created cleanup
records. This caused test failures when running tests multiple times,
when they set `citus.next_cleanup_record_id`. I tried fixing that by
calling GetNextCleanupRecordId in all places but that caused many
other tests to fail due to deadlocks. So, instead this adresses
that issue by using `ALTER SEQUENCE ... RESTART` instead of
`citus.next_cleanup_record_id`. In a follow up PR we should
probably get rid of `citus.next_cleanup_record_id`, since it's
only used in one other file.
DESCRIPTION: Fix an issue that caused some queries with custom
aggregates to fail
While playing around with https://github.com/pgvector/pgvector I noticed
that the AVG query was broken. That's because we treat it as any other
AVG by breaking it down in SUM and COUNT, but there are no SUM/COUNT
functions in this case, but there is a perfectly usable combinefunc.
This PR changes our aggregate logic to prefer custom aggregates with a
combinefunc even if they have a common name.
Co-authored-by: Marco Slot <marco.slot@gmail.com>
Add new metadata sync methods which uses MemorySyncContext api so that during the sync we can
- free memory to prevent OOM,
- use either transactional or nontransactional modes according to the GUC .
This pull request proposes a change to the logic used for propagating
identity columns to worker nodes in citus. Instead of creating a
dependent sequence for each identity column and changing its default
value to `nextval(seq)/worker_nextval(seq)`, this update will pass the
identity columns as-is to the worker nodes.
Please note that there are a few limitations to this change.
1. Only bigint identity columns will be allowed in distributed tables to
ensure compatibility with the DDL from any node functionality. Our
current distributed sequence implementation only allows insert
statements from all nodes for bigint sequences.
2. `alter_distributed_table` and `undistribute_table` operations will
not be allowed for tables with identity columns. This is because we do
not have a proper way of keeping sequence states consistent across the
cluster.
DESCRIPTION: Prevents using identity columns on data types other than
`bigint` on distributed tables
DESCRIPTION: Prevents using `alter_distributed_table` and
`undistribute_table` UDFs when a table has identity columns
DESCRIPTION: Fixes a bug that prevents enforcing identity column
restrictions on worker nodes
Depends on #6740Fixes#6694
DESCRIPTION: This PR removes the task dependencies between shard moves
for which the shards belong to different colocation groups. This change
results in scheduling multiple tasks in the RUNNABLE state. Therefore it
is possible that the background task monitor can run them concurrently.
Previously, all the shard moves planned in a rebalance operation took
dependency on each other sequentially.
For instance, given the following table and shards
colocation group 1 colocation group 2
table1 table2 table3 table4 table 5
shard11 shard21 shard31 shard41 shard51
shard12 shard22 shard32 shard42 shard52
if the rebalancer planner returned the below set of moves
` {move(shard11), move(shard12), move(shard41), move(shard42)}`
background rebalancer scheduled them such that they depend on each other
sequentially.
```
{move(reftables) if there is any, none}
|
move( shard11)
|
move(shard12)
| {move(shard41)<--- move(shard12)} This is an artificial dependency
move(shard41)
|
move(shard42)
```
This results in artificial dependencies between otherwise independent
moves.
Considering that the shards in different colocation groups can be moved
concurrently, this PR changes the dependency relationship between the
moves as follows:
```
{move(reftables) if there is any, none} {move(reftables) if there is any, none}
| |
move(shard11) move(shard41)
| |
move(shard12) move(shard42)
```
---------
Co-authored-by: Jelte Fennema <jelte.fennema@microsoft.com>
Description:
Implementing CDC changes using Logical Replication to avoid
re-publishing events multiple times by setting up replication origin
session, which will add "DoNotReplicateId" to every WAL entry.
- shard splits
- shard moves
- create distributed table
- undistribute table
- alter distributed tables (for some cases)
- reference table operations
The citus decoder which will be decoding WAL events for CDC clients,
ignores any WAL entry with replication origin that is not zero.
It also maps the shard names to distributed table names.
Today we allow planning the queries that reference non-colocated tables
if the shards that query targets are placed on the same node. However,
this may not be the case, e.g., after rebalancing shards because it's
not guaranteed to have those shards on the same node anymore.
This commit adds citus.enable_non_colocated_router_query_pushdown GUC
that can be used to disallow planning such queries via router planner,
when it's set to false. Note that the default value for this GUC will be
"true" for 11.3, but we will alter it to "false" on 12.0 to not
introduce
a breaking change in a minor release.
Closes#692.
Even more, allowing such queries to go through router planner also
causes
generating an incorrect plan for the DML queries that reference
distributed
tables that are sharded based on different replication factor settings.
For
this reason, #6779 can be closed after altering the default value for
this
GUC to "false", hence not now.
DESCRIPTION: Adds `citus.enable_non_colocated_router_query_pushdown` GUC
to ensure generating a consistent distributed plan for the queries that
reference non-colocated distributed tables (when set to "false", the
default is "true").
Soon I will be doing some changes related to #692 in router planner
and those changes require updating ~5/6 tests related to router
planning. And to make those test files runnable by run_test.py
multiple times, we need to make some other tests (that they're
run in parallel / they badly depend on) ready for run_test.py too.
This would be useful for testing #6773. This is because, given that
#6773
only adds support for router / fast-path queries, theoretically almost
all
the tests that we have in that test file should work for null-shard-key
tables too (and they indeed do).
I deliberately did not replace multi_router_planner_fast_path.sql with
the one that I'm adding into arbitrary configs because we might still
want to see when we're able to go through fast-path planning for the
usual distributed tables (the ones that have a shard key).
DESCRIPTION: Check before logicalrep for rebalancer, error if needed
Check if we can use logical replication or not, in case of shard
transfer mode = auto, before executing the shard moves. If we can't,
error out. Before this PR, we used to error out in the middle of shard
moves:
```sql
set citus.shard_count = 4; -- just to get the error sooner
select citus_remove_node('localhost',9702);
create table t1 (a int primary key);
select create_distributed_table('t1','a');
create table t2 (a bigint);
select create_distributed_table('t2','a');
select citus_add_node('localhost',9702);
select rebalance_table_shards();
NOTICE: Moving shard 102008 from localhost:9701 to localhost:9702 ...
NOTICE: Moving shard 102009 from localhost:9701 to localhost:9702 ...
NOTICE: Moving shard 102012 from localhost:9701 to localhost:9702 ...
ERROR: cannot use logical replication to transfer shards of the relation t2 since it doesn't have a REPLICA IDENTITY or PRIMARY KEY
```
Now we check and error out in the beginning, without moving the shards.
fixes: #6727
Fixes#6672
2) Move all MERGE related routines to a new file merge_planner.c
3) Make ConjunctionContainsColumnFilter() static again, and rearrange the code in MergeQuerySupported()
4) Restore the original format in the comments section.
5) Add big serial test. Implement latest set of comments
This implements the phase - II of MERGE sql support
Support routable query where all the tables in the merge-sql are distributed, co-located, and both the source and
target relations are joined on the distribution column with a constant qual. This should be a Citus single-task
query. Below is an example.
SELECT create_distributed_table('t1', 'id');
SELECT create_distributed_table('s1', 'id', colocate_with => ‘t1’);
MERGE INTO t1
USING s1 ON t1.id = s1.id AND t1.id = 100
WHEN MATCHED THEN
UPDATE SET val = s1.val + 10
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED THEN
INSERT (id, val, src) VALUES (s1.id, s1.val, s1.src)
Basically, MERGE checks to see if
There are a minimum of two distributed tables (source and a target).
All the distributed tables are indeed colocated.
MERGE relations are joined on the distribution column
MERGE .. USING .. ON target.dist_key = source.dist_key
The query should touch only a single shard i.e. JOIN AND with a constant qual
MERGE .. USING .. ON target.dist_key = source.dist_key AND target.dist_key = <>
If any of the conditions are not met, it raises an exception.
(cherry picked from commit 44c387b978)
This implements MERGE phase3
Support pushdown query where all the tables in the merge-sql are Citus-distributed, co-located, and both
the source and target relations are joined on the distribution column. This will generate multiple tasks
which execute independently after pushdown.
SELECT create_distributed_table('t1', 'id');
SELECT create_distributed_table('s1', 'id', colocate_with => ‘t1’);
MERGE INTO t1
USING s1
ON t1.id = s1.id
WHEN MATCHED THEN
UPDATE SET val = s1.val + 10
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED THEN
INSERT (id, val, src) VALUES (s1.id, s1.val, s1.src)
*The only exception for both the phases II and III is, UPDATEs and INSERTs must be done on the same shard-group
as the joined key; for example, below scenarios are NOT supported as the key-value to be inserted/updated is not
guaranteed to be on the same node as the id distribution-column.
MERGE INTO target t
USING source s ON (t.customer_id = s.customer_id)
WHEN NOT MATCHED THEN - -
INSERT(customer_id, …) VALUES (<non-local-constant-key-value>, ……);
OR this scenario where we update the distribution column itself
MERGE INTO target t
USING source s On (t.customer_id = s.customer_id)
WHEN MATCHED THEN
UPDATE SET customer_id = 100;
(cherry picked from commit fa7b8949a8)
In the past, having columnar tables in the cluster was causing pg
upgrades to fail when attempting to access columnar metadata. This is
because, pg_dump doesn't see objects that we use for columnar-am related
booking as the dependencies of the tables using columnar-am.
To fix that; in #5456, we inserted some "normal dependency" edges (from
those objects to columnar-am) into pg_depend.
This helped us ensuring the existency of a class of metadata objects
--such as columnar.storageid_seq-- and helped fixing #5437.
However, the normal-dependency edges that we added for indexes on
columnar metadata tables --such columnar.stripe_pkey-- didn't help at
all because they were indeed causing dependency loops (#5510) and
pg_dump was not able to take those dependency edges into the account.
For this reason, this commit deletes those dependency edges so that
pg_dump stops complaining about them. Note that it's not critical to
delete those edges from pg_depend since they're not breaking pg upgrades
but were triggering some warning messages. And given that backporting
a sql change into older versions is hard a lot, we skip backporting
this.
This commit hides port numbers in upgrade_columnar_after because the
port numbers assigned to nodes in upgrade schedule differ from the ones
that flaky test detector assigns.
DESCRIPTION: Fixes a bug in shard copy operations.
For copying shards in both shard move and shard split operations, Citus
uses the COPY statement.
A COPY all statement in the following form
` COPY target_shard FROM STDIN;`
throws an error when there is a GENERATED column in the shard table.
In order to fix this issue, we need to exclude the GENERATED columns in
the COPY and the matching SELECT statements. Hence this fix converts the
COPY and SELECT all statements to the following form:
```
COPY target_shard (col1, col2, ..., coln) FROM STDIN;
SELECT (col1, col2, ..., coln) FROM source_shard;
```
where (col1, col2, ..., coln) does not include a GENERATED column.
GENERATED column values are created in the target_shard as the values
are inserted.
Fixes#6705.
---------
Co-authored-by: Teja Mupparti <temuppar@microsoft.com>
Co-authored-by: aykut-bozkurt <51649454+aykut-bozkurt@users.noreply.github.com>
Co-authored-by: Jelte Fennema <jelte.fennema@microsoft.com>
Co-authored-by: Gürkan İndibay <gindibay@microsoft.com>
DESCRIPTION: Adds logic to distribute unbalanced shards
If the number of shard placements (for a colocation group) is less than
the number of workers, it means that some of the workers will remain
empty. With this PR, we consider these shard groups as a colocation
group, in order to make them be distributed evenly as much as possible
across the cluster.
Example:
```sql
create table t1 (a int primary key);
create table t2 (a int primary key);
create table t3 (a int primary key);
set citus.shard_count =1;
select create_distributed_table('t1','a');
select create_distributed_table('t2','a',colocate_with=>'t1');
select create_distributed_table('t3','a',colocate_with=>'t2');
create table tb1 (a bigint);
create table tb2 (a bigint);
select create_distributed_table('tb1','a');
select create_distributed_table('tb2','a',colocate_with=>'tb1');
select citus_add_node('localhost',9702);
select rebalance_table_shards();
```
Here we have two colocation groups, each with one shard group. Both
shard groups are placed on the first worker node. When we add a new
worker node and try to rebalance table shards, the rebalance planner
considers it well balanced and does nothing. With this PR, the
rebalancer tries to distribute these shard groups evenly across the
cluster as much as possible. For this example, with this PR, the
rebalancer moves one of the shard groups to the second worker node.
fixes: #6715
DESCRIPTION: Correctly report shard size in citus_shards view
When looking at citus_shards, people are interested in the actual size
that all the data related to the shard takes up on disk.
`pg_total_relation_size` is the function to use for that purpose. The
previously used `pg_relation_size` does not include indexes or TOAST.
Especially the missing toast can have enormous impact on the size of the
shown data.
First of all, this commit sets next_shard_id for
single_node_truncate.sql because shard ids in the test output were
changing whenever we modify a prior test file.
Then the flaky test detector started complaining about
single_node_truncate.sql. We fix that by specifying the correct
test dependency for it in run_test.py.
O Simple fix is to add ORDER BY to have definitive results.
O Add search_path explicitly after reconnecting, this avoids creating objects in public schema
which prevents us from repetitive running of tests.
O multi_mx_modification is not designed to run repetitive, so isolate it.
The failure_create_distributed_table_non_empty test would sometimes fail
like this:
```diff
-- in the first test, cancel the first connection we sent from the coordinator
SELECT citus.mitmproxy('conn.cancel(' || pg_backend_pid() || ')');
- mitmproxy
----------------------------------------------------------------------
-
-(1 row)
-
+ERROR: canceling statement due to user request
+CONTEXT: COPY mitmproxy_result, line 0
+SQL statement "COPY mitmproxy_result FROM '/home/circleci/project/src/test/regress/tmp_check/mitmproxy.fifo'"
+PL/pgSQL function citus.mitmproxy(text) line 11 at EXECUTE
SELECT create_distributed_table('test_table', 'id');
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/30474/workflows/be1c9f9d-22c9-465c-964a-dcdd1cb8c99c/jobs/985441
Because the cancel command had no filter it would actually sometimes
cancel the mitmproxy cancel command itself. This PR addresses that by
simply removing this test.
This is basically the exact same issue as #6217, only in a different
place in the file. It's fixed here by removing the test since there's
already many different similar tests.
DESCRIPTION: Fix background rebalance when reference table has no PK
For the background rebalance we would always fail if a reference table
that was not replicated to all nodes would not have a PK (or replica
identity). Even when we used force_logical or block_writes as the shard
transfer mode. This fixes that and adds some regression tests.
Fixes#6680
We should disallow dropping table_name option if foreign table is in
metadata. Otherwise, we get table not found error which contains
shardid.
DESCRIPTION: Fixes an unexpected foreign table error by disallowing to drop the table_name option.
Fixes#6663
Fixes#6570.
In the past, having columnar tables in the cluster was causing pg
upgrades to fail when attempting to access columnar metadata. This is
because, pg_dump doesn't see objects that we use for columnar-am related
booking as the dependencies of the tables using columnar-am.
To fix that; in #5456, we inserted some "normal dependency" edges (from
those objects to columnar-am) into pg_depend.
This helped us ensuring the existency of a class of metadata objects
--such as columnar.storageid_seq-- and helped fixing #5437.
However, the normal-dependency edges that we added for indexes on
columnar metadata tables --such columnar.stripe_pkey-- didn't help at
all because they were indeed causing dependency loops (#5510) and
pg_dump was not able to take those dependency edges into the account.
For this reason, instead of inserting such dependency edges from indexes
to columnar-am, we allow columnar metadata accessors to fall-back to
sequential scan during pg upgrades.
Sometimes isolation_non_blocking_shard_split would fail like this:
```diff
step s2-show-pg_dist_cleanup:
SELECT object_name, object_type, policy_type FROM pg_dist_cleanup;
object_name |object_type|policy_type
------------------------------+-----------+-----------
+citus_shard_split_slot_2_10_39| 3| 0
public.to_split_table_1500001 | 1| 2
-(1 row)
+(2 rows)
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/30237/workflows/edcf34b7-d7d3-4d10-8293-b6f59b00cdf2/jobs/970960
The reason is that replication slots have now become part of
pg_dist_cleanup too, and sometimes they cannot be cleaned up right away.
This is harmless as they will be cleaned up eventually. So this simply
filters out the replication slots for those tests.
Recursive planner should handle all the tree from bottom to top at
single pass. i.e. It should have already recursively planned all
required parts in its first pass. Otherwise, this means we have bug at
recursive planner, which needs to be handled. We add a check here and
return error.
DESCRIPTION: Fixes wrong results by throwing error in case recursive
planner multipass the query.
We found 3 different cases which causes recursive planner passes the
query multiple times.
1. Sublink in WHERE clause is planned at second pass after we
recursively planned a distributed table at the first pass. Fixed by PR
#6657.
2. Local-distributed joins are recursively planned at both the first and
the second pass. Issue #6659.
3. Some parts of the query is considered to be noncolocated at the
second pass as we do not generate attribute equivalances between
nondistributed and distributed tables. Issue #6653
DESCRIPTION: Fix foreign key validation skip at the end of shard move
In eadc88a we started completely skipping foreign key constraint
validation at the end of a non blocking shard move, instead of only for
foreign keys to reference tables. However, it turns out that this didn't
work at all because of a hard to notice bug: By resetting the
SkipConstraintValidation flag at the end of our utility hook, we
actually make the SET command that sets it a no-op.
This fixes that bug by removing the code that resets it. This is fine
because #6543 removed the only place where we set the flag in C code. So
the resetting of the flag has no purpose anymore. This PR also adds a
regression test, because it turned out we didn't have any otherwise we
would have caught that the feature was completely broken.
It also moves the constraint validation skipping to the utility hook.
The reason is that #6550 showed us that this is the better place to skip
it, because it will also skip the planning phase and not just the
execution.
We should do the sublink conversations at the end of the recursive
planning because earlier steps might have transformed the query into a
shape that needs recursively planning the sublinks.
DESCRIPTION: Fixes early sublink check at recursive planner.
Related to PR https://github.com/citusdata/citus/pull/6650
This change allows creating a constraint without a name using an index.
The index name will be used as the constraint name the same way postgres
handles it.
Fixes issue #6644
This commit also cleans up some leftovers from nameless constraint checks.
With this commit, we now fully support adding all nameless constraints
directly to a table.
Co-authored-by: naisila <nicypp@gmail.com>
Adds NOT VALID option to deparser. When we need to deparse:
"ALTER TABLE ADD FOREIGN KEY ... NOT VALID"
"ALTER TABLE ADD CHECK ... NOT VALID"
NOT VALID option should be propagated to workers.
Fixes issue #6646
This commit also uses AppendColumnNameList function
instead of repeated code blocks in two appropriate places
in the "ALTER TABLE" deparser.
If an update query on a reference table has a returns clause with a
subquery that accesses some other local table, we end-up with an crash.
This commit prevents the crash, but does not prevent other error
messages from happening due to Citus not being able to pushdown the
results of that subquery in a valid SQL command.
Related: #6634
DESCRIPTION: Fix regression in allowed foreign keys on distributed
tables
In commit eadc88a we changed how we skip foreign key validation. The
goal was to skip it in more cases. However, one change had the
unintended regression of introducing failures when trying to create
certain foreign keys. This reverts that part of the change.
The way of skipping validation of foreign keys that was introduced in
eadc88a was skipping validation during execution. The reason that
this caused this regression was because some foreign key validation
queries already fail during planning. In those cases it never gets to
the execution step where it would later be skipped.
Fixes#6543
DESCRIPTION: Enable adding FOREIGN KEY constraints on Citus tables
without a name
This PR enables adding a foreign key to a distributed/reference/Citus
local table without specifying the name of the constraint, e.g. `ALTER
TABLE items ADD FOREIGN KEY (user_id) REFERENCES users (id);`
This implements the phase - II of MERGE sql support
Support routable query where all the tables in the merge-sql are distributed, co-located, and both the source and
target relations are joined on the distribution column with a constant qual. This should be a Citus single-task
query. Below is an example.
SELECT create_distributed_table('t1', 'id');
SELECT create_distributed_table('s1', 'id', colocate_with => ‘t1’);
MERGE INTO t1
USING s1 ON t1.id = s1.id AND t1.id = 100
WHEN MATCHED THEN
UPDATE SET val = s1.val + 10
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED THEN
INSERT (id, val, src) VALUES (s1.id, s1.val, s1.src)
Basically, MERGE checks to see if
There are a minimum of two distributed tables (source and a target).
All the distributed tables are indeed colocated.
MERGE relations are joined on the distribution column
MERGE .. USING .. ON target.dist_key = source.dist_key
The query should touch only a single shard i.e. JOIN AND with a constant qual
MERGE .. USING .. ON target.dist_key = source.dist_key AND target.dist_key = <>
If any of the conditions are not met, it raises an exception.
citus_job_list() lists all background jobs by simply showing the records
in pg_dist_background_job.
citus_job_status(job_id bigint, raw boolean default false) shows the
status of a single background job by appending a jsonb details column to
the associated row from pg_dist_background_job. If the raw argument is
set, machine readable sizes are used instead of human readable
alternatives.
citus_rebalance_status(raw boolean default false) shows the status of
the last rebalance operation. If the raw argument is set, machine
readable sizes are used instead of human readable alternatives.
DESCRIPTION: Enable adding CHECK constraints on distributed tables
without the client having to provide a constraint name.
This PR enables the following command syntax for adding check
constraints to distributed tables.
ALTER TABLE ... ADD CHECK ...
by creating a default constraint name and transforming the command into
the below syntax before sending it to workers.
ALTER TABLE ... ADD CONSTRAINT \<conname> CHECK ...
Table Constraints UNIQUE, PRIMARY KEY and EXCLUDE may have option
DEFERRABLE in their command syntax. This PR handles the option when
deparsing the relevant constraint statements.
NOT DEFERRABLE
and
INITIALLY IMMEDIATE (if DEFERRABLE}
are the default values for the option so we only append the non-default
values to the alter table statement.
DESCRIPTION: Adds support for creating table constraints UNIQUE and
EXCLUDE via ALTER TABLE command without client having to specify a name.
ALTER TABLE ... ADD CONSTRAINT <conname> UNIQUE ...
ALTER TABLE ... ADD CONSTRAINT <conname> EXCLUDE ...
commands require the client to provide an explicit constraint name.
However, in postgres it is possible for clients not to provide a name
and let the postgres generate it using the following commands
ALTER TABLE ... ADD UNIQUE ...
ALTER TABLE ... ADD EXCLUDE ...
This PR enables the same functionality for citus tables.
DESCRIPTION: Drop `SHARD_STATE_TO_DELETE` and use the cleanup records
instead
Drops the shard state that is used to mark shards as orphaned. Now we
insert cleanup records into `pg_dist_cleanup` so "orphaned" shards will
be dropped either by maintenance daemon or internal cleanup calls. With
this PR, we make the "cleanup orphaned shards" functions to be no-op, as
they would not be needed anymore.
This PR includes some naming changes about placement functions. We don't
need functions that filter orphaned shards, as there will be no orphaned
shards anymore.
We will also be introducing a small script with this PR, for users with
orphaned shards. We'll basically delete the orphaned shard entries from
`pg_dist_placement` and insert cleanup records into `pg_dist_cleanup`
for each one of them, during Citus upgrade.
We also have a lot of flakiness fixes in this PR.
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
Sometimes our `isolation_insert_vs_vacuum` test would fail like this.
```diff
step s2-vacuum-analyze:
VACUUM ANALYZE test_insert_vacuum;
-
+ <waiting ...>
step s1-commit:
COMMIT;
+step s2-vacuum-analyze: <... completed>
```
The reason seems to be that VACUUM ANALYZE tries to take some locks that
conflict with the other transaction, but these locks somehow get
released or VACUUM ANALYZE stops waiting for them. This is somewhat
expected since VACUUM has some special locking logic.
To solve the flakyness we now trigger VACUUM ANALYZE to always report as
blocking and after that we wait explicitly wait for it to complete. This
is done
like is suggested by the flaky test tips from postgres:
c68a183990/src/test/isolation/README (L152)
I've confirmed that this fixes the issue suing our flaky-test-debugging
CI workflow.
DESCRIPTION: Defers cleanup after a failure in shard move or split
We don't need to do a cleanup in case of failure on a shard transfer or
split anymore. Because,
* Maintenance daemon will clean them up anyway.
* We trigger a cleanup at the beginning of shard transfers/splits.
* The cleanup on failure logic also can fail sometimes and instead of
the original error, we throw the error that is raised by the cleanup
procedure, and it causes confusion.
DESCRIPTION: Cleanup the shard on the target node in case of a
failed/aborted shard move
Inserts a cleanup record for the moved shard placement on the target
node. If the move operation succeeds, the record will be deleted. If
not, it will remain there to be cleaned up later.
fixes: #6580
We have several version checks in our Citus upgrade tests. However, as
we drop support for PG versions, we need to update the Citus versions
used in our CI images. Therefore we must compare Citus versions in our
tests instead of using equality checks so that the queries are ran in
all the associated Citus versions.
For example, we have many conditionals where we early exit if the Citus
version is not equal to 9.0. However, as of today we never use version
9.0 and thus we always early exit in those tests.
All the tables (target, source or any CTE present) in the SQL statement are local i.e. a merge-sql with a combination of Citus local and
Non-Citus tables (regular Postgres tables) should work and give the same result as Postgres MERGE on regular tables. Catch and throw an
exception (not-yet-supported) for all other scenarios during Citus-planning phase.
DESCRIPTION: Support ALTER TABLE .. ADD PRIMARY KEY ... command
Before processing
> **ALTER TABLE ... ADD PRIMARY KEY ...**
command
1. Create a primary key name to use as the constraint name.
2. Change the **ALTER TABLE ... ADD PRIMARY KEY ...** command to into
**ALTER TABLE ... ADD CONSTRAINT \<constraint name> PRIMARY KEY ...**
form.
This is the only form we can specify a name for a primary key. If we run
ALTER TABLE .. ADD PRIMARY KEY, postgres
would create a constraint name internally in its own scheme. But the
problem is that we need to create constraint names
for shards in our own scheme which is \<constraint name>_\<shardid>.
Hence we need to create a name and send it to workers so that the
workers can append the shardid.
4. Run the changed command on the coordinator to make sure we are using
the same constraint name across the board.
5. Send the changed command to workers such that it is executed for the
main table as well as for the shards.
Fixes#6515.
Before this commit, we created an additional WaitEventSet for
checking whether the remote socket is closed per connection -
only once at the start of the execution.
However, for certain workloads, such as pgbench select-only
workloads, the creation/deletion of the additional WaitEventSet
adds ~7% CPU overhead, which is also reflected on the benchmark
results.
With this commit, we use the same WaitEventSet for the purposes
of checking the remote socket at the start of the execution.
We use "rebuildWaitEventSet" flag so that the executor can re-use
the existing WaitEventSet.
As a result, we see the following improvements on PG 15:
main : 120051 tps, 0.532 ms latency avg.
avoid_wes_rebuild: 127119 tps, 0.503 ms latency avg.
And, on PG 14, as expected, there is no difference
main : 129191 tps, 0.495 ms latency avg.
avoid_wes_rebuild: 129480 tps, 0.494 ms latency avg.
But, note that PG 15 is slightly (~1.5%) slower than PG 14.
That is probably the overhead of checking the remote socket.
We already have citus_job_wait to wait until the job reaches the desired
state. That PR adds waiting on task state to allow more granular
waiting. It can be used for Citus operations. Moreover, it is also
useful for testing purposes. (wait until a task reaches specified state)
Related to #6459.
Fixes task executor SIGTERM handling.
Problem:
When task executors are sent SIGTERM, their default handler
`bgworker_die`, which is set at worker startup, logs FATAL error. But
they do not release locks there before logging the error, which
sometimes causes hanging of the monitor. e.g. Monitor waits for the lock
forever at pg_stat flush after calling proc_exit.
Solution:
Because executors have connection to backend, they should handle SIGTERM
similar to normal backends. Normal backends uses `die` handler, in which
they set ProcDiePending flag and the next CHECK_FOR_INTERRUPTS call
handles it gracefully by releasing any lock before termination.
Adding a testing function `wait_for_resource_cleanup` which waits until
all records in `pg_dist_cleanup` are cleaned up. The motivation is to
prevent flakiness in our tests, since the `NOTICE: cleaned up X orphaned
resources` message is not consistent in many cases. This PR replaces
`citus_cleanup_orphaned_resources` calls with
`wait_for_resource_cleanup` calls.
DESCRIPTION: Create replication artifacts with unique names
We're creating replication objects with generic names. This disallows us
to enable parallel shard moves, as two operations might use the same
objects. With this PR, we'll create below objects with operation
specific names, by appending OparationId to the names.
* Subscriptions
* Publications
* Replication Slots
* Users created for subscriptions
1) Regular users fail to use clock UDF with permission issue.
2) Clock functions were declared as STABLE, whereas by definition they are VOLATILE. By design, any clock/time
functions will return different results for each call even within a single SQL statement.
Note: UDF citus_get_transaction_clock() is a misnomer as it internally calls the clock tick which always returns
different results for every invocation in the same transaction.
Adds signal handlers for graceful termination, cancellation of
task executors and detecting config updates. Related to PR #6459.
#### How to handle termination signal?
Monitor need to gracefully terminate all running task executors before
terminating. Hence, we have sigterm handler for the monitor.
#### How to handle cancellation signal?
Monitor need to gracefully cancel all running task executors before
terminating. Hence, we have sigint handler for the monitor.
#### How to detect configuration changes?
Monitor has SIGHUP handler to reflect configuration changes while
executing tasks.
We are having some flakiness in our test schedule because of the objects
leftover from shard moves/splits. With this commit we prevent logging
cleanup object counts.
fixes: #6534
DESCRIPTION: Extend cleanup process for replication artifacts
This PR adds new cleanup record types for:
* Subscriptions
* Replication slots
* Publications
* Users created for subscriptions
We add records for these object types, to `pg_dist_cleanup` during
creation phase. Once the operation is done, in case of success or
failure, we iterate those records and drop the objects. With this PR we
will not be dropping any of these objects during the operation. In
short, we will always be deferring the drop.
One thing that's worth mentioning is that we sort cleanup records before
processing (dropping) them, because of dependency relations among those
objects, e.g a subscription might depend on a publication. Therefore, we
always drop subscriptions before publications.
We have some renames in this PR:
* `TryDropOrphanedShards` -> `TryDropOrphanedResources`
* `DropOrphanedShardsForCleanup` -> `DropOrphanedResourcesForCleanup`
* `run_try_drop_marked_shards` -> `run_try_drop_marked_resources`
as these functions now process replication artifacts as well.
This PR drops function `DropAllLogicalReplicationLeftovers` and its all
usages, since now we rely on the deferring drop mechanism.
Improvement on our background task monitoring API (PR #6296) to support
concurrent and nonblocking task execution.
Mainly we have a queue monitor background process which forks task
executors for `Runnable` tasks and then monitors their status by
fetching messages from shared memory queue in nonblocking way.
**Problem**: Currently, we error out if we detect recurring tuples in
one side without checking the other side of the join.
**Solution**: When one side of the full join consists recurring tuples
and the other side consists nonrecurring tuples, we should not pushdown
to prevent duplicate results. Otherwise, safe to pushdown.
This PR changes
```citus.propagate_session_settings_for_loopback_connection``` default
value to off not to expose this feature publicly at this point. See
#6488 for details.
When debugging issues it's quite useful to see the originating gpid in
the application_name of a query on a worker. This already happens for
most queries, but not for queries created by the rebalancer or by
run_command_on_worker. This adds a gpid to those two application_names
too.
Note, that if the GPID of the new application_names is different than
the current GPID of the backend the backend will continue to keep
the old gpid as its actual GPID. This PR is just meant to make sure
that the application_name is as useful as it can be for users to
look at. Updating of gpids will be done in a follow-up PR, and
adding gpids to all internal connections will make this easier.
Fixes a bug that causes crash when using auto_explain extension with
ALTER TABLE...ADD FOREIGN KEY... queries.
Those queries trigger a SELECT query on the citus tables as part of the
foreign key constraint validation check. At the explain hook, workers
try to explain this SELECT query as a distributed query causing memory
corruption in the connection data structures. Hence, we will not explain
ALTER TABLE...ADD FOREIGN KEY... and the triggered queries on the
workers.
Fixes#6424.
DESCRIPTION: Makes sure to disallow triggers that depend on extensions
We were already doing so for `ALTER trigger DEPENDS ON EXTENSION`
commands. However, we also need to disallow creating Citus tables
having such triggers already, so this PR fixes that.
DESCRIPTION: Allow citus_update_node() to work with nodes from different clusters
citus_update_node(), citus_nodename_for_nodeid(), and citus_nodeport_for_nodeid() functions only checked for nodes in their own clusters and hence last two returned NULLs and the first one showed an error is the nodeId was from a different cluster.
Fixes https://github.com/citusdata/citus/issues/6433
increasing logical clock. Clock guarantees to never go back in value after restarts,
and makes best attempt to keep the value close to unix epoch time in milliseconds.
Also, introduces a new GUC "citus.enable_cluster_clock", when true, every
distributed transaction is stamped with logical causal clock and persisted
in a catalog pg_dist_commit_transaction.
DESCRIPTION: Drops GUC defer_drop_after_shard_split
DESCRIPTION: Drops GUC defer_drop_after_shard_move
Drop GUCs and related parts from the code.
Delete tests that specifically added for the GUCs.
Keep tests that can be used without the GUCs.
Update test output changes.
The motivation for this PR is to have an "always deferring" mechanism.
These two GUCs provide an option to not deferring dropping objects
during a shard move/split, and dropping them immediately. With this PR,
we will be always deferring dropping orphaned shards and other types of
objects.
We will have a separate PR to extend the deferred cleanup operation, so
that we would create records for deferred drop, for Subscriptions,
Publications, Replication Slots etc. This will make us be able to keep
track of created objects that needs to be dropped, during a shard
move/split. We will have objects created specifically for the current
operation; and those objects will be dropped at the end.
We have an issue (a draft roadmap) for enabling parallel shard moves.
For details please see: https://github.com/citusdata/citus/issues/6437
Sometimes in CI our failure_split_cleanup test would fail like this:
```diff
CALL pg_catalog.citus_cleanup_orphaned_resources();
-NOTICE: cleaned up 79 orphaned resources
+NOTICE: cleaned up 82 orphaned resources
SELECT operation_id, object_type, object_name, node_group_id, policy_type
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/28107/workflows/4ec712c9-98b5-4e90-9806-e02a37d71679/jobs/846107
The reason was that previous tests in the schedule would also create
some orphaned resources. Sometimes some of those would already be
cleaned up by the maintenance daemon, resulting in a different number of
cleaned up resources than expected. This cleans up any previously
created resources at the start of the test without logging how many
exactly were cleaned up. As a bonus this now also allows running this
test using check-failure-base.
DESCRIPTION: Don't leak search_path to workers on DDL
For DDL we have to set the `search_path` on workers to the same as on
the coordinator for some DDL to work. Previously this search_path would
leak outside of the transaction that was used for the DDL. This fixes
that by using `SET LOCAL` instead of `SET`. The only place where we
still use plain `SET` is for DDL commands that are not allowed within
transactions, such as `CREATE INDEX CONCURRENLTY`.
This fixes this flaky test:
```diff
CONTEXT: SQL statement "SELECT change_id FROM distributed_triggers.data_changes
WHERE shard_key_value = NEW.shard_key_value AND object_id = NEW.object_id
ORDER BY change_id DESC LIMIT 1"
-PL/pgSQL function record_change() line XX at SQL statement
+PL/pgSQL function distributed_triggers.record_change() line 17 at SQL statement
while executing command on localhost:57638
DELETE FROM data_ref_table where shard_key_value = 'hello';
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/27849/workflows/75ae5f1a-100b-4b7a-b991-7de069f39ee1/jobs/831429
I had tried to fix this flaky test in #5894 and then I tried
implementing a better fix in #5896, where @marcocitus suggested this
better fix. This change reverts the fix from #5894 and implements the
fix suggested by Marco.
Our multi_mx_alter_distributed_table test actually depended on the old
buggy search_path leaking behavior. After fixing the bug that test would
fail like this:
```diff
CALL proc_0(1.0);
DEBUG: pushing down the procedure
-NOTICE: Res: 3
-DETAIL: from localhost:xxxxx
+ERROR: relation "test_proc_colocation_0" does not exist
+CONTEXT: PL/pgSQL function mx_alter_distributed_table.proc_0(double precision) line 5 at SQL statement
+while executing command on localhost:57637
RESET client_min_messages;
```
I fixed this test by fully qualifying the table names used in the
procedure. I think it's quite unlikely that actual users depend
on this behavior though. Since it would require first doing
DDL before calling a procedure in a session where the
search_path was changed after connecting.
DESCRIPTION: Adds failure test for shard move
DESCRIPTION: Remove function `WaitForAllSubscriptionsToBecomeReady` and
related tests
Adding some failure tests for shard moves.
Dropping the not-needed-anymore function
`WaitForAllSubscriptionsToBecomeReady`, as the subscriptions now start
as ready from the beginning because we don't use logical replication
table sync workers anymore.
fixes: #6260
In CI shard_rebalancer sometimes fails with this error:
```diff
SET citus.node_connection_timeout to 60;
BEGIN;
SET LOCAL citus.shard_replication_factor TO 2;
SET citus.log_remote_commands TO ON;
SET SESSION citus.max_adaptive_executor_pool_size TO 5;
SELECT replicate_table_shards('dist_table_test_2', max_shard_copies := 4, shard_transfer_mode:='block_writes');
+WARNING: could not establish connection after 60 ms
```
Source
https://app.circleci.com/pipelines/github/citusdata/citus/28128/workflows/38eeacc4-4191-4366-87ed-9a628414965a/jobs/847458?invite=true#step-107-21
This PR avoids this issue by increasing
```citus.node_connection_timeout``` to 35s.
To be able to test non-blocking shard moves we take an advisory lock, so
we can pause the shard move at an interesting moment. Originally this
was during the logical replication catch up phase. But when I added
tests for the rebalancer progress I moved this lock before the initial
data copy. This allowed testing of the rebalance progress, but
inadvertently made our non-blocking tests not actually test if we held
unintended locks during logical replication catch up.
This fixes that by creating two types of advisory locks, one before the
copy and one after. This causes the tests to actually test their
intended scenario again.
Furthermore it starts using one of these locks for blocking shard moves
too. Which allowed me to reduce the complexity of the rebalance progress
test suite quite a bit. It also allowed enabling some flaky tests again,
because this stopped them from being flaky. And finally it allowed
testing of rebalance progress for blocking shard copy operations as
well.
In passing it fixes a flaky test during parallel blocking shard moves by
ordering the output.
DESCRIPTION: Adds status column to get_rebalance_progress()
Introduces a new column named `status` for the function
`get_rebalance_progress()`. For each ongoing shard move, this column
will reveal information about that shard move operation's current
status.
For now, candidate status messages could be one of the below.
* Not Started
* Setting Up
* Copying Data
* Catching Up
* Creating Constraints
* Final Catchup
* Creating Foreign Keys
* Completing
* Completed
Deparser function set_relation_column_names() knows that it needs to
re-evaluate column names based on relation's tuple descriptor when
the rte belongs to a relation (RTE_RELATION).
However before this commit, it didn't know about the fact that citus
might wrap such an rte with an rte that points to
citus_extradata_container() placeholder.
And because of this, it was simply taking the column aliases
(e.g., "bar" in "foo AS bar") into the account and this might result in
an incorrectly deparsed query as in below case:
* Say, if we had view based on following query:
```sql
SELECT a FROM table;
```
* And if we rename column "a" to "b", the view query normally becomes:
```sql
SELECT b AS a FROM table;
```
* So before this commit, deparsing a query based on that view was
resulting in such a query due to deparsing based on the column aliases,
which is not correct:
```sql
SELECT a FROM table;
```
Fixes#5932.
DESCRIPTION: Fixes a bug that might cause failing to query the views
based on tables that have renamed columns
PostgreSQL 15 exposes WL_SOCKET_CLOSED in WaitEventSet API, which is
useful for detecting closed remote sockets. In this patch, we use this
new event and try to detect closed remote sockets in the executor.
When a closed socket is detected, the executor now has the ability to
retry the connection establishment. Note that, the executor can retry
connection establishments only for the connection that has not been
used. Basically, this patch is mostly useful for preventing the executor
to fail if a cached connection is closed because of the worker node
restart (or worker failover).
In other words, the executor cannot retry connection establishment if we
are in a distributed transaction AND any command has been sent over the
connection. That requires more sophisticated retry mechanisms. For now,
fixing the above use case is enough.
Fixes#5538
Earlier discussions: #5908, #6259 and #6283
### Summary of the current approach regards to earlier trials
As noted, we explored some alternatives before getting into this.
https://github.com/citusdata/citus/pull/6283 is simple, but lacks an
important property. We should be checking for `WL_SOCKET_CLOSED`
_before_ sending anything over the wire. Otherwise, it becomes very
tricky to understand which connection is actually safe to retry. For
example, in the current patch, we can safely check
`transaction->transactionState == REMOTE_TRANS_NOT_STARTED` before
restarting a connection.
#6259 does what we intent here (e.g., check for sending any command).
However, as @marcocitus noted, it is very tricky to handle
`WaitEventSets` in multiple places. And, the executor is designed such
that it reacts to the events. So, adding anything `pre-executor` seemed
too ugly.
In the end, I converged into this patch. This patch relies on the
simplicity of #6283 and also does a very limited handling of
`WaitEventSets`, just for our purpose. Just before we add any connection
to the execution, we check if the remote session has already closed.
With that, we do a brief interaction of multiple wait event processing,
but with different purposes. The new wait event processing we added does
not even consider cancellations. We let that handled by the main event
processing loop.
Co-authored-by: Marco Slot <marco.slot@gmail.com>
If an operation requires having coordinator in pg_dist_node and if that
is not the case, then we automatically add the coordinator into
pg_dist_node if user didn't add any worker nodes yet.
However, if user have already added some worker nodes before, we throw
an error. With this commit, we improve the error thrown in that case.
Closes#6423 based on the discussion made there.
DESCRIPTION: Fix bug in global PID assignment for rebalancer
sub-connections
In CI our isolation_shard_rebalancer_progress test would sometimes fail
like this:
```diff
+isolationtester: canceling step s1-rebalance-c1-block-writes after 60 seconds
step s1-rebalance-c1-block-writes:
SELECT rebalance_table_shards('colocated1', shard_transfer_mode:='block_writes');
- <waiting ...>
+
+ERROR: canceling statement due to user request
step s7-get-progress:
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/27855/workflows/2a7e335a-f3e8-46ed-b6bd-6920d42f7214/jobs/831710
It turned out this was an actual bug in the way our assigning of global
PIDs interacts with the way we connect to ourselves as the shard
rebalancer. The first command the shard rebalancer sends is a SET
ommand to change the application_name to `citus_rebalancer`. If
`StartupCitusBackend` is called after this command is processed, then it
overwrites the global PID that was extracted from the previous
application_name. This makes sure that we don't do that, and continue to
use the original global PID. While it might seem that we only call
`StartupCitusBackend` once for each query backend, this isn't actually
the case. Whenever pg_dist_partition gets ANALYZEd by autovacuum
we indirectly call `StartupCitusBackend` again, because we invalidate
the cache then.
In passing this fixes two other things as well:
1. It sets `distributedCommandOriginator` correctly in
`AssignGlobalPID`, by using IsExternalClientBackend(). This doesn't
matter much anymore, since AssignGlobalPID effectively becomes a
no-op in this PR for any non-external client backends.
2. It passes the application_name to InitializeBackendData in
StartupCitusBackend, instead of INVALID_CITUS_INTERNAL_BACKEND_GPID
(which effectively got casted to NULL). In practice this doesn't
change the behaviour of the call, since the call is a no-op for every
backend except the maintenance daemon. And the behaviour of the call
is the same for NULL as for the application_name of the maintenance
daemon.
We decrease verbosity level here to avoid the flaky output
https://app.circleci.com/pipelines/github/citusdata/citus/27936/workflows/dc63128a-1570-41a0-8722-08f3e3cfe301/jobs/836153
```diff
select alter_table_set_access_method('ref','heap');
NOTICE: creating a new table for alter_table_set_access_method.ref
NOTICE: moving the data of alter_table_set_access_method.ref
NOTICE: dropping the old alter_table_set_access_method.ref
NOTICE: drop cascades to 2 other objects
-DETAIL: drop cascades to materialized view m_ref
-drop cascades to view v_ref
+DETAIL: drop cascades to view v_ref
+drop cascades to materialized view m_ref
CONTEXT: SQL statement "DROP TABLE alter_table_set_access_method.ref CASCADE"
NOTICE: renaming the new table to alter_table_set_access_method.ref
alter_table_set_access_method
-------------------------------
(1 row)
```