In PG16, REINDEX DATABASE/SYSTEM name is optional.
We already don't propagate these commands automatically.
Testing here with run_command_on_workers.
Relevant PG commit:
https://github.com/postgres/postgres/commit/2cbc3c1
In PG16, REINDEX DATABASE/SYSTEM name is optional.
We already don't propagate these commands automatically.
Testing here with run_command_on_workers.
Relevant PG commit:
https://github.com/postgres/postgres/commit/2cbc3c1
In PG16, REINDEX DATABASE/SYSTEM name is optional.
We already don't propagate these commands automatically.
Testing here with run_command_on_workers.
Relevant PG commit:
https://github.com/postgres/postgres/commit/2cbc3c1
In PG16, REINDEX DATABASE/SYSTEM name is optional.
We already don't propagate these commands automatically.
Testing here with run_command_on_workers.
Relevant PG commit:
https://github.com/postgres/postgres/commit/2cbc3c1
Postgres got minor updates on Nov9, this starts using the images with
the latest version for our tests, namely 14.10, 15.5 and 16.1.
These minor updates were compatible with Citus.
Sister PR: https://github.com/citusdata/the-process/pull/152
DESCRIPTION: Adds support from issuing role management commands from worker nodes
It's unlikely to get into a distributed deadlock with role commands, we
don't care much about them at the moment.
There were several attempts to reduce the chances of a deadlock but we
didn't any of them merged into main branch yet, see:
#7325#7016#7009
When I run this test in my local, the size of the table after the DELETE
command is around 58785792. Hence, I assume that the diffs suggest that
the Vacuum had no effect. The current solution is to run the VACUUM
command three times instead of once.
Example diff:
https://github.com/citusdata/citus/actions/runs/6722231142/attempts/1#summary-18269870674
```diff
insert into local_vacuum_table select i from generate_series(1,1000000) i;
delete from local_vacuum_table;
VACUUM local_vacuum_table;
SELECT CASE WHEN s BETWEEN 20000000 AND 25000000 THEN 22500000 ELSE s END
FROM pg_total_relation_size('local_vacuum_table') s ;
s
----------
- 22500000
+ 58785792
(1 row)
```
See more diff examples in the PR description
https://github.com/citusdata/citus/pull/7334
https://github.com/citusdata/citus/actions/runs/6745019678/attempts/1#summary-18336188930
```diff
insert into target_table SELECT a*2 FROM source_table RETURNING a;
-NOTICE: executing the command locally: SELECT bytes FROM fetch_intermediate_results(ARRAY['repartitioned_results_xxxxx_from_4213582_to_0','repartitioned_results_xxxxx_from_4213584_to_0']::text[],'localhost',57638) bytes
+NOTICE: executing the command locally: SELECT bytes FROM fetch_intermediate_results(ARRAY['repartitioned_results_3940758121873413_from_4213584_to_0','repartitioned_results_3940758121873413_from_4213582_to_0']::text[],'localhost',57638) bytes
```
The elements in the array passed to `fetch_intermediate_results` are the
same, but in the opposite order than expected.
To fix this flakiness, we can omit the `"SELECT bytes FROM
fetch_intermediate_results..."` line. From the following logs, it is
understandable that the intermediate results have been fetched.
Fix the flaky test that results in following diff by waiting until the
backend that we want to terminate really terminates, until 5secs.
```diff
--- /__w/citus/citus/src/test/regress/expected/isolation_get_all_active_transactions.out.modified 2023-11-01 16:30:57.648749795 +0000
+++ /__w/citus/citus/src/test/regress/results/isolation_get_all_active_transactions.out.modified 2023-11-01 16:30:57.656749877 +0000
@@ -114,13 +114,13 @@
--------------------
t
(1 row)
step s3-show-activity:
SET ROLE postgres;
select count(*) from get_all_active_transactions() where process_id IN (SELECT * FROM selected_pid);
count
-----
- 0
+ 1
(1 row)
```
Sometimes multi_alter_table_statements would fail in CI like this:
```diff
-- Verify that DROP NOT NULL works
ALTER TABLE lineitem_alter ALTER COLUMN int_column2 DROP NOT NULL;
SELECT "Column", "Type", "Modifiers" FROM table_desc WHERE relid='lineitem_alter'::regclass;
- Column | Type | Modifiers
----------------------------------------------------------------------
- l_orderkey | bigint | not null
- l_partkey | integer | not null
- l_suppkey | integer | not null
- l_linenumber | integer | not null
- l_quantity | numeric(15,2) | not null
- l_extendedprice | numeric(15,2) | not null
- l_discount | numeric(15,2) | not null
- l_tax | numeric(15,2) | not null
- l_returnflag | character(1) | not null
- l_linestatus | character(1) | not null
- l_shipdate | date | not null
- l_commitdate | date | not null
- l_receiptdate | date | not null
- l_shipinstruct | character(25) | not null
- l_shipmode | character(10) | not null
- l_comment | character varying(44) | not null
- float_column | double precision | default 1
- date_column | date |
- int_column1 | integer |
- int_column2 | integer |
- null_column | integer |
-(21 rows)
-
+ERROR: schema "alter_table_add_column" does not exist
-- COPY should succeed now
SELECT master_create_empty_shard('lineitem_alter') as shardid \gset
```
Reading from table_desc apparantly has an issue that if the schema gets
deleted from one of the items, while it is being read that we get such
an error.
This change fixes that by not running multi_alter_table_statements in parallel
with alter_table_add_column anymore.
This is another instance of the same issue as in #7294
Sometimes in CI we run into this failure:
```diff
SELECT resultId, nodeport, rowcount, targetShardId, targetShardIndex
FROM partition_task_list_results('test', $$ SELECT * FROM source_table $$, 'target_table')
NATURAL JOIN pg_dist_node;
-WARNING: connection to the remote node localhost:xxxxx failed with the following error: connection not open
+ERROR: connection to the remote node localhost:9060 failed with the following error: connection not open
SELECT * FROM distributed_result_info ORDER BY resultId;
- resultid | nodeport | rowcount | targetshardid | targetshardindex
----------------------------------------------------------------------
- test_from_100800_to_0 | 9060 | 22 | 100805 | 0
- test_from_100801_to_0 | 57637 | 2 | 100805 | 0
- test_from_100801_to_1 | 57637 | 15 | 100806 | 1
- test_from_100802_to_1 | 57637 | 10 | 100806 | 1
- test_from_100802_to_2 | 57637 | 5 | 100807 | 2
- test_from_100803_to_2 | 57637 | 18 | 100807 | 2
- test_from_100803_to_3 | 57637 | 4 | 100808 | 3
- test_from_100804_to_3 | 9060 | 24 | 100808 | 3
-(8 rows)
-
+ERROR: current transaction is aborted, commands ignored until end of transaction block
-- fetch from worker 2 should fail
SAVEPOINT s1;
+ERROR: current transaction is aborted, commands ignored until end of transaction block
SELECT fetch_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[], 'localhost', :worker_2_port) > 0 AS fetched;
-ERROR: could not open file "base/pgsql_job_cache/xx_x_xxx/test_from_100802_to_1.data": No such file or directory
-CONTEXT: while executing command on localhost:xxxxx
+ERROR: current transaction is aborted, commands ignored until end of transaction block
ROLLBACK TO SAVEPOINT s1;
+ERROR: savepoint "s1" does not exist
-- fetch from worker 1 should succeed
SELECT fetch_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[], 'localhost', :worker_1_port) > 0 AS fetched;
- fetched
----------------------------------------------------------------------
- t
-(1 row)
-
+ERROR: current transaction is aborted, commands ignored until end of transaction block
-- make sure the results read are same as the previous transaction block
SELECT count(*), sum(x) FROM
read_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[],'binary') AS res (x int);
- count | sum
----------------------------------------------------------------------
- 15 | 863
-(1 row)
-
+ERROR: current transaction is aborted, commands ignored until end of transaction block
ROLLBACk;
```
As outlined in the #7306 I created, the reason for this is related to
only having a single connection open to the node. Finding and fixing the
full cause is not trivial, so instead this PR starts working around
this bug by forcing maximum parallelism. Preferably we'd want
this workaround not to be necessary, but that requires
spending time to fix this. For now having a less flaky CI is
good enough.
Sometimes in CI insert_select_connection_leak would fail like this:
```diff
END;
SELECT worker_connection_count(:worker_1_port) - :pre_xact_worker_1_connections AS leaked_worker_1_connections,
worker_connection_count(:worker_2_port) - :pre_xact_worker_2_connections AS leaked_worker_2_connections;
leaked_worker_1_connections | leaked_worker_2_connections
-----------------------------+-----------------------------
- 0 | 0
+ -1 | 0
(1 row)
-- ROLLBACK
BEGIN;
INSERT INTO target_table SELECT * FROM source_table;
INSERT INTO target_table SELECT * FROM source_table;
ROLLBACK;
SELECT worker_connection_count(:worker_1_port) - :pre_xact_worker_1_connections AS leaked_worker_1_connections,
worker_connection_count(:worker_2_port) - :pre_xact_worker_2_connections AS leaked_worker_2_connections;
leaked_worker_1_connections | leaked_worker_2_connections
-----------------------------+-----------------------------
- 0 | 0
+ -1 | 0
(1 row)
\set VERBOSITY TERSE
-- Error on constraint failure
BEGIN;
INSERT INTO target_table SELECT * FROM source_table;
SELECT worker_connection_count(:worker_1_port) AS worker_1_connections,
worker_connection_count(:worker_2_port) AS worker_2_connections \gset
SAVEPOINT s1;
INSERT INTO target_table SELECT a, CASE WHEN a < 50 THEN b ELSE null END FROM source_table;
@@ -89,15 +89,15 @@
leaked_worker_1_connections | leaked_worker_2_connections
-----------------------------+-----------------------------
0 | 0
(1 row)
END;
SELECT worker_connection_count(:worker_1_port) - :pre_xact_worker_1_connections AS leaked_worker_1_connections,
worker_connection_count(:worker_2_port) - :pre_xact_worker_2_connections AS leaked_worker_2_connections;
leaked_worker_1_connections | leaked_worker_2_connections
-----------------------------+-----------------------------
- 0 | 0
+ -1 | 0
(1 row)
```
Source:
https://github.com/citusdata/citus/actions/runs/6718401194/attempts/1#summary-18258258387
A negative amount of leaked connectios is obviously not possible. For
some reason there was a connection open when we checked the initial
amount of connections that was closed afterwards. This could be the
from the maintenance daemon or maybe from the previous test that had not
fully closed its connections just yet.
The change in this PR doesnt't actually fix the cause of the negative
connection, but it simply considers it good as well, by changing the
result to zero for negative values.
With this fix we might sometimes miss a leak, because the negative
number can cancel out the leak and still result in a 0. But since the
negative number only occurs sometimes, we'll still find the leak often
enough.
When executing a prepared CALL, which is not pure SQL but available with
some drivers like npgsql and jpgdbc, Citus entered a code path where a
plan is not defined, while trying to increase its cost. Thus SIG11 when
plan is a NULL pointer.
Fix by only increasing plan cost when plan is not null.
However, it is a bit suspicious to get here with a NULL plan and maybe a
better change will be to not call
ShardPlacementForFunctionColocatedWithDistTable() with a NULL plan at
all (in call.c:134)
bug hit with for example:
```
CallableStatement proc = con.prepareCall("{CALL p(?)}");
proc.registerOutParameter(1, java.sql.Types.BIGINT);
proc.setInt(1, -100);
proc.execute();
```
where `p(bigint)` is a distributed "function" and the param the
distribution key (also in a distributed table), see #7242 for details
Fixes#7242
Sometimes in CI our logical_replication test fails like this:
```diff
+++ /__w/citus/citus/src/test/regress/results/logical_replication.out.modified 2023-11-01 14:15:08.562758546 +0000
@@ -40,21 +40,21 @@
SELECT count(*) from pg_publication;
count
-------
0
(1 row)
SELECT count(*) from pg_replication_slots;
count
-------
- 0
+ 1
(1 row)
SELECT count(*) FROM dist;
count
-------
```
It's hard to understand what is going on here, just based on the wrong
number. So this PR changes the test to show the name of the
subscription, publication and replication slot to make finding the cause
easier.
In passing this also fixes another flaky test in the same file that our
flaky test detection picked up. This is done by waiting for resource
cleanup after the shard move.
This is causing 404 failures due to a race condition:
https://github.com/actions/toolkit/issues/1235
It also makes the tests take unnecessarily long.
This was tested by changing a test file and seeing that the flaky test
detection was still working.
Fixes the flaky test that results in following diff:
```diff
--- /__w/citus/citus/src/test/regress/expected/multi_mx_node_metadata.out.modified 2023-11-01 14:22:12.890476575 +0000
+++ /__w/citus/citus/src/test/regress/results/multi_mx_node_metadata.out.modified 2023-11-01 14:22:12.914476657 +0000
@@ -840,24 +840,26 @@
(1 row)
\c :datname - - :master_port
SELECT datname FROM pg_stat_activity WHERE application_name LIKE 'Citus Met%';
datname
------------
db_to_drop
(1 row)
DROP DATABASE db_to_drop;
+ERROR: database "db_to_drop" is being accessed by other users
SELECT datname FROM pg_stat_activity WHERE application_name LIKE 'Citus Met%';
datname
------------
-(0 rows)
+ db_to_drop
+(1 row)
-- cleanup
DROP SEQUENCE sequence CASCADE;
NOTICE: drop cascades to default value for column a of table reference_table
```
Sometimes isolation_metadata_sync_deadlock fails in CI like this:
```diff
diff -dU10 -w /__w/citus/citus/src/test/regress/expected/isolation_metadata_sync_deadlock.out /__w/citus/citus/src/test/regress/results/isolation_metadata_sync_deadlock.out
--- /__w/citus/citus/src/test/regress/expected/isolation_metadata_sync_deadlock.out.modified 2023-11-01 16:03:15.090199229 +0000
+++ /__w/citus/citus/src/test/regress/results/isolation_metadata_sync_deadlock.out.modified 2023-11-01 16:03:15.098199312 +0000
@@ -110,10 +110,14 @@
t
(1 row)
step s2-stop-connection:
SELECT stop_session_level_connection_to_node();
stop_session_level_connection_to_node
-------------------------------------
(1 row)
+
+teardown failed: ERROR: localhost:57638 is a metadata node, but is out of sync
+HINT: If the node is up, wait until metadata gets synced to it and try again.
+CONTEXT: SQL statement "SELECT master_remove_distributed_table_metadata_from_workers(v_obj.objid, v_obj.schema_name, v_obj.object_name)"
```
Source:
https://github.com/citusdata/citus/actions/runs/6721938040/attempts/1#summary-18268946448
To fix this we now wait for the metadata to be fully synced to all
nodes at the start of the teardown steps.
Sometimes in CI citus_non_blocking_split_shard_cleanup failed like this:
```diff
--- /__w/citus/citus/src/test/regress/expected/citus_non_blocking_split_shard_cleanup.out.modified 2023-11-01 15:07:14.280551207 +0000
+++ /__w/citus/citus/src/test/regress/results/citus_non_blocking_split_shard_cleanup.out.modified 2023-11-01 15:07:14.292551358 +0000
@@ -106,21 +106,22 @@
-----------------------------------
(1 row)
\c - - - :worker_2_port
SET search_path TO "citus_split_test_schema";
-- Replication slots should be cleaned up
SELECT slot_name FROM pg_replication_slots;
slot_name
---------------------------------
-(0 rows)
+ citus_shard_split_slot_19_10_17
+(1 row)
-- Publications should be cleanedup
SELECT count(*) FROM pg_publication;
count
```
It's expected that the replication slot is sometimes not cleaned up if
we don't wait until resource cleanup completes. This PR starts doing
that here.
Normally, tests which are written non-dependent to other tests can use
minimal-tests and should use as well. However, in our test settings
base-schedule is being used which may cause unnecessary dependencies and
so unrelated errors that developers don't see in their local environment
With this change, default setting will be minimal, so that tests will be
free of unnecessary dependencies.
Sometimes failure_split_cleanup failed in CI like this:
```diff
ERROR: server closed the connection unexpectedly
CONTEXT: while executing command on localhost:9060
SELECT operation_id, object_type, object_name, node_group_id, policy_type
FROM pg_dist_cleanup where operation_id = 777 ORDER BY object_name;
operation_id | object_type | object_name | node_group_id | policy_type
--------------+-------------+-----------------------------------------------------------+---------------+-------------
777 | 1 | citus_failure_split_cleanup_schema.table_to_split_8981000 | 1 | 0
- 777 | 1 | citus_failure_split_cleanup_schema.table_to_split_8981002 | 1 | 1
777 | 1 | citus_failure_split_cleanup_schema.table_to_split_8981002 | 2 | 0
+ 777 | 1 | citus_failure_split_cleanup_schema.table_to_split_8981002 | 1 | 1
777 | 1 | citus_failure_split_cleanup_schema.table_to_split_8981003 | 2 | 1
777 | 4 | citus_shard_split_publication_1_10_777 | 2 | 0
(5 rows)
-- we need to allow connection so that we can connect to proxy
```
Source:
https://github.com/citusdata/citus/actions/runs/6717642291/attempts/1#summary-18256014949
It's the common problem where we're missing a column in the ORDER BY
clause. This fixes that by adding an node_group_id to the query in
question.
Sometimes in CI isolation_master_update_node fails like this:
```diff
------------------
(1 row)
step s2-abort: ABORT;
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
SSL connection has been closed unexpectedly
+server closed the connection unexpectedly
master_remove_node
------------------
```
This just seesm like a random error line. The only way to reasonably fix
this is by adding an extra output file. So that's what this PR does.
We want the nice looking green checkmark on our main branch too.
This PR includes running on pushes to release branches too, but that
won't come into effect until we have release branches with this
workflow file.
One of our most flaky and most anoying tests is
multi_cluster_management. It usually fails like this:
```diff
SELECT citus_disable_node('localhost', :worker_2_port);
citus_disable_node
--------------------
(1 row)
SELECT public.wait_until_metadata_sync(60000);
+WARNING: waiting for metadata sync timed out
wait_until_metadata_sync
--------------------------
(1 row)
```
This tries to address that by hardening wait_until_metadata_sync. I
believe the reason for this warning is that there is a race condition in
wait_until_metadata_sync. It's possible for the pre-check to fail, then
have the maintenance daemon send a notification. And only then have the
backend start to listen. I tried to fix it in two ways:
1. First run LISTEN, and only then read do the pre-check.
2. If we time out, check again just to make sure that we did not miss
the notification somehow. And don't show a warning if all metadata is
synced after the timeout.
It's hard to know for sure that this fixes it because the test is not
repeatable and I could not reproduce it locally. Let's just hope for the
best.
---------
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
Sometimes multi_reference_table failed in CI like this:
```diff
\c - - - :master_port
DROP INDEX reference_schema.reference_index_2;
\c - - - :worker_1_port
SELECT "Column", "Type", "Modifiers" FROM table_desc WHERE relid='reference_schema.reference_table_ddl_1250019'::regclass;
- Column | Type | Modifiers
----------------------------------------------------------------------
- value_2 | double precision | default 25.0
- value_3 | text | not null
- value_4 | timestamp without time zone |
- value_5 | double precision |
-(4 rows)
-
+ERROR: schema "citus_local_table_queries" does not exist
\di reference_schema.reference_index_2*
List of relations
Schema | Name | Type | Owner | Table
```
Source:
https://github.com/citusdata/citus/actions/runs/6707535961/attempts/2#summary-18226879513
Reading from table_desc apparantly has an issue that if the schema gets
deleted from one of the items, while it is being read that we get such
an error.
This change fixes that by not running multi_reference_table in parallel
with citus_local_tables_queries anymore.
I just enhanced the existing code to check if the relation is an index
belonging to a distributed table.
If so the shardId is appended to relation (index) name and the *_size
function are executed as before.
There is a change in an extern function:
`extern StringInfo GenerateSizeQueryOnMultiplePlacements(...)`
It's possible to create a new function and deprecate this one later if
compatibility is an issue.
Fixes https://github.com/citusdata/citus/issues/6496.
DESCRIPTION: Allows using Citus size functions on distributed tables
indexes.
---------
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
Sometimes validate constraint would fail like this:
```diff
validatable_constraint_8000016 | t
(10 rows)
DROP TABLE constrained_table;
+ERROR: deadlock detected
+DETAIL: Process 16602 waits for ShareRowExclusiveLock on relation 56258 of database 16384; blocked by process 16601.
+Process 16601 waits for AccessShareLock on relation 56120 of database 16384; blocked by process 16602.
+HINT: See server log for query details.
DROP TABLE referenced_table CASCADE;
DROP TABLE referencing_table;
DROP SCHEMA validate_constraint CASCADE;
-NOTICE: drop cascades to 3 other objects
+NOTICE: drop cascades to 4 other objects
DETAIL: drop cascades to type constraint_validity
drop cascades to view constraint_validations_in_workers
drop cascades to view constraint_validations
+drop cascades to table constrained_table
SET search_path TO DEFAULT;
```
Source:
https://github.com/citusdata/citus/actions/runs/6708383699?pr=7291
This change fixes that by not running together with the
foreign_key_to_reference_table test anymore. In passing it also
simplifies dropping of the test its resources.