The devcontainer was missing two tools used for code formatting by
`ci/fix_style.sh`. Both missing tools are Python tools, used to format our
Python scripts:
- black
- isort
This change adds both tools. It does so by keeping a `requirements.txt` in
`.devcontainer/` that contains all Python dependencies we need to install.
After installing both tools in a clean environment, we exported all installed
packages with `pip freeze` into `requirements.txt`, on the assumption that
everything installed relates to the two tools.
Since pip installs the binaries in `~/.local/bin/`, we also move some scripts
we manually install from `~/.bin/` to that same directory. At first it seemed
like VS Code's devcontainers did not have that directory on the path.
However, if the directory exists when the container starts, it does get added
to `$PATH` by `~/.profile`. This makes the whole environment a bit more
streamlined.
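For reference, a minimal sketch of how such a `requirements.txt` can be
regenerated; the temporary environment name and paths are illustrative, not
the exact commands used for this change:
```python
import subprocess
import venv

# Create a throwaway virtual environment that contains nothing but pip.
venv.EnvBuilder(with_pip=True).create(".tmp-style-env")
pip = ".tmp-style-env/bin/pip"  # Linux/macOS layout, as in the devcontainer

# Install only the two formatting tools.
subprocess.run([pip, "install", "black", "isort"], check=True)

# Export everything that got installed, assuming it all relates to these two
# tools, and store it as the devcontainer's requirements file.
frozen = subprocess.run([pip, "freeze"], check=True, capture_output=True, text=True)
with open(".devcontainer/requirements.txt", "w") as requirements:
    requirements.write(frozen.stdout)
```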
This change adds a script to programmatically group all includes in a
specific order. The script was used as a one-time invocation to group and
sort all includes throughout our formatted code. The grouping is as
follows:
- System includes (e.g. `#include <...>`)
- postgres.h (e.g. `#include "postgres.h"`)
- Toplevel imports from postgres, not contained in a directory (e.g.
  `#include "miscadmin.h"`)
- General postgres includes (e.g. `#include "nodes/..."`)
- Toplevel citus includes, not contained in a directory (e.g.
  `#include "citus_version.h"`)
- Columnar includes (e.g. `#include "columnar/..."`)
- Distributed includes (e.g. `#include "distributed/..."`)
Because it is quite hard to tell the difference between toplevel citus
includes and toplevel postgres includes, the script hardcodes the list of
toplevel citus includes. In the same manner, it treats anything not prefixed
with `columnar/` or `distributed/` as a postgres include.
The sorting/grouping is enforced by CI. Since we do so with our own script,
no changes are required in our uncrustify configuration.
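For illustration, a minimal sketch of the grouping logic described above; this
is not the actual script, and the hardcoded citus header list below is only an
example:
```python
import re

# Example of a hardcoded list of toplevel citus headers; the real script
# maintains its own list.
TOPLEVEL_CITUS_HEADERS = {"citus_version.h", "pg_version_compat.h"}


def include_group(line):
    """Return the sort group for an #include line; lower groups come first."""
    match = re.match(r'\s*#include\s*(<[^>]+>|"[^"]+")', line)
    if not match:
        return None
    header = match.group(1).strip('<>"')
    if match.group(1).startswith("<"):
        return 0  # system includes
    if header == "postgres.h":
        return 1
    if header.startswith("columnar/"):
        return 5
    if header.startswith("distributed/"):
        return 6
    if header in TOPLEVEL_CITUS_HEADERS:
        return 4  # toplevel citus includes
    if "/" in header:
        return 3  # general postgres includes, e.g. nodes/...
    return 2      # toplevel postgres includes, e.g. miscadmin.h


def sort_include_block(include_lines):
    """Sort a block of #include lines by group, then alphabetically."""
    return sorted(include_lines, key=lambda line: (include_group(line), line))
```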
DESCRIPTION: Adds support for propagating `CREATE`/`DROP` database
In this PR, support for propagating CREATE DATABASE and DROP DATABASE is added.
For CREATE DATABASE:
* "oid" option is not supported
* specifying "strategy" to be different than "wal_log" is not supported
* specifying "template" to be different than "template1" is not
supported
The last two are not supported because those options are not saved in
`pg_database`, so when activating a node we cannot know what parameters were
provided when the database was created.
The "oid" option is not supported because whether the user specified an
arbitrary oid when creating the database is not saved in `pg_database` either,
and we want to avoid oid collisions that might arise from attempting to use an
auto-assigned oid on workers.
Finally, in case of node activation, GRANTs for the database are also
propagated.
---------
Co-authored-by: Jelte Fennema-Nio <github-tech@jeltef.nl>
Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
https://app.circleci.com/pipelines/github/citusdata/citus/34550/workflows/5b802f66-2666-4623-a209-6d7799f7ee5f/jobs/1229153
```diff
VACUUM (FREEZE, PROCESS_TOAST true) local_vacuum_table;
SELECT relfrozenxid::text::integer > :frozenxid AS frozen_performed FROM pg_class
WHERE oid=:reltoastrelid::regclass;
frozen_performed
------------------
- t
+ f
(1 row)
```
The PROCESS_TOAST option of VACUUM was introduced in PG14. The failing test
was supposed to be part of `multi_utilities.sql`, but it was included in
`pg14.sql` to avoid an alternative output file for PG13. See
ba62c0a148 (diff-ed03478f693155e2fe092e9ad356bf884dc097f554e8d75eff562d52bbcf7a75L255-L272)
for reference.
However, now that we no longer support PG13, we can move this test to
`multi_utilities.sql`. Moving the test, plus inserting data before running
VACUUM FREEZE so that the freeze is more meaningful and not flaky, fixes the
flakiness problem of the test.
With the recent changes in packaging images, the Linux package installations
used to execute validate_output are no longer necessary.
In this PR, I removed them to make the pipeline more efficient.
- [x] Remove the test warning before merge
When preparing the changelog for the 12.1.1 release, I accidentally swapped
the PR numbers for the two commits. This commit fixes the changelog to point
to the correct PRs.
We propagate `SECURITY LABEL [for provider] ON ROLE rolename IS
labelname` to the worker nodes.
We also make sure to run the relevant `SecLabelStmt` commands on a
newly added node by looking at roles found in `pg_shseclabel`.
See the official docs for an explanation of how this command works:
https://www.postgresql.org/docs/current/sql-security-label.html
This command stores the role label in the `pg_shseclabel` catalog table.
This commit also fixes the regex in the
`check_gucs_are_alphabetically_sorted.sh` script so that it escapes the dot.
Previously it matched all strings starting with "citus" instead of "citus."
as it should.
To test this feature, I currently make use of a special GUC to control label
provider registration in `_PG_init` when creating the Citus extension.
While investigating replication slot leftovers in PR
https://github.com/citusdata/citus/pull/7338, I ran into the following
refactoring/cleanup opportunities in our test suite:
- Add a separate test to remove non-default nodes
- Remove coordinator removal from the `add_coordinator` test;
  use the `remove_coordinator_from_metadata` test where needed
- Don't print node ids in the `multi_multiuser_auth` and
  `multi_poolinfo_usage` tests
- Use `startswith` when checking for isolation or failure tests
- Add some dependencies accordingly in `run_test.py` for running flaky test
  schedules
Postgres got minor updates on Nov 9; this starts using the images with the
latest versions for our tests, namely 14.10, 15.5 and 16.1.
These minor updates are compatible with Citus.
Sister PR: https://github.com/citusdata/the-process/pull/152
DESCRIPTION: Adds support for issuing role management commands from worker nodes
It's unlikely to get into a distributed deadlock with role commands, so we
don't care much about that at the moment.
There were several attempts to reduce the chances of a deadlock, but we
haven't merged any of them into the main branch yet; see:
#7325, #7016, #7009
When I run this test locally, the size of the table after the DELETE
command is around 58785792. Hence, I assume the diffs suggest that the
VACUUM had no effect. The current solution is to run the VACUUM command
three times instead of once.
Example diff:
https://github.com/citusdata/citus/actions/runs/6722231142/attempts/1#summary-18269870674
```diff
insert into local_vacuum_table select i from generate_series(1,1000000) i;
delete from local_vacuum_table;
VACUUM local_vacuum_table;
SELECT CASE WHEN s BETWEEN 20000000 AND 25000000 THEN 22500000 ELSE s END
FROM pg_total_relation_size('local_vacuum_table') s ;
s
----------
- 22500000
+ 58785792
(1 row)
```
See more diff examples in the PR description
https://github.com/citusdata/citus/pull/7334
https://github.com/citusdata/citus/actions/runs/6745019678/attempts/1#summary-18336188930
```diff
insert into target_table SELECT a*2 FROM source_table RETURNING a;
-NOTICE: executing the command locally: SELECT bytes FROM fetch_intermediate_results(ARRAY['repartitioned_results_xxxxx_from_4213582_to_0','repartitioned_results_xxxxx_from_4213584_to_0']::text[],'localhost',57638) bytes
+NOTICE: executing the command locally: SELECT bytes FROM fetch_intermediate_results(ARRAY['repartitioned_results_3940758121873413_from_4213584_to_0','repartitioned_results_3940758121873413_from_4213582_to_0']::text[],'localhost',57638) bytes
```
The elements in the array passed to `fetch_intermediate_results` are the
same, but in the opposite order from what is expected.
To fix this flakiness, we can omit the `"SELECT bytes FROM
fetch_intermediate_results..."` line. From the following logs, it is clear
that the intermediate results have been fetched.
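Conceptually, the fix amounts to filtering that line out before comparing
output; the real test suite uses its own output-normalization rules, so the
helper below is only illustrative:
```python
def strip_fetch_notice(output_lines):
    """Drop the nondeterministic fetch_intermediate_results NOTICE lines."""
    return [
        line
        for line in output_lines
        if "SELECT bytes FROM fetch_intermediate_results" not in line
    ]
```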
Fix the flaky test that results in the following diff by waiting, for up to
5 seconds, until the backend that we want to terminate really terminates.
```diff
--- /__w/citus/citus/src/test/regress/expected/isolation_get_all_active_transactions.out.modified 2023-11-01 16:30:57.648749795 +0000
+++ /__w/citus/citus/src/test/regress/results/isolation_get_all_active_transactions.out.modified 2023-11-01 16:30:57.656749877 +0000
@@ -114,13 +114,13 @@
--------------------
t
(1 row)
step s3-show-activity:
SET ROLE postgres;
select count(*) from get_all_active_transactions() where process_id IN (SELECT * FROM selected_pid);
count
-----
- 0
+ 1
(1 row)
```
Sometimes multi_alter_table_statements would fail in CI like this:
```diff
-- Verify that DROP NOT NULL works
ALTER TABLE lineitem_alter ALTER COLUMN int_column2 DROP NOT NULL;
SELECT "Column", "Type", "Modifiers" FROM table_desc WHERE relid='lineitem_alter'::regclass;
- Column | Type | Modifiers
----------------------------------------------------------------------
- l_orderkey | bigint | not null
- l_partkey | integer | not null
- l_suppkey | integer | not null
- l_linenumber | integer | not null
- l_quantity | numeric(15,2) | not null
- l_extendedprice | numeric(15,2) | not null
- l_discount | numeric(15,2) | not null
- l_tax | numeric(15,2) | not null
- l_returnflag | character(1) | not null
- l_linestatus | character(1) | not null
- l_shipdate | date | not null
- l_commitdate | date | not null
- l_receiptdate | date | not null
- l_shipinstruct | character(25) | not null
- l_shipmode | character(10) | not null
- l_comment | character varying(44) | not null
- float_column | double precision | default 1
- date_column | date |
- int_column1 | integer |
- int_column2 | integer |
- null_column | integer |
-(21 rows)
-
+ERROR: schema "alter_table_add_column" does not exist
-- COPY should succeed now
SELECT master_create_empty_shard('lineitem_alter') as shardid \gset
```
Reading from table_desc apparently has an issue: if the schema of one of the
items gets deleted while it is being read, we get such an error.
This change fixes that by no longer running multi_alter_table_statements in
parallel with alter_table_add_column.
This is another instance of the same issue as in #7294
Sometimes in CI we run into this failure:
```diff
SELECT resultId, nodeport, rowcount, targetShardId, targetShardIndex
FROM partition_task_list_results('test', $$ SELECT * FROM source_table $$, 'target_table')
NATURAL JOIN pg_dist_node;
-WARNING: connection to the remote node localhost:xxxxx failed with the following error: connection not open
+ERROR: connection to the remote node localhost:9060 failed with the following error: connection not open
SELECT * FROM distributed_result_info ORDER BY resultId;
- resultid | nodeport | rowcount | targetshardid | targetshardindex
----------------------------------------------------------------------
- test_from_100800_to_0 | 9060 | 22 | 100805 | 0
- test_from_100801_to_0 | 57637 | 2 | 100805 | 0
- test_from_100801_to_1 | 57637 | 15 | 100806 | 1
- test_from_100802_to_1 | 57637 | 10 | 100806 | 1
- test_from_100802_to_2 | 57637 | 5 | 100807 | 2
- test_from_100803_to_2 | 57637 | 18 | 100807 | 2
- test_from_100803_to_3 | 57637 | 4 | 100808 | 3
- test_from_100804_to_3 | 9060 | 24 | 100808 | 3
-(8 rows)
-
+ERROR: current transaction is aborted, commands ignored until end of transaction block
-- fetch from worker 2 should fail
SAVEPOINT s1;
+ERROR: current transaction is aborted, commands ignored until end of transaction block
SELECT fetch_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[], 'localhost', :worker_2_port) > 0 AS fetched;
-ERROR: could not open file "base/pgsql_job_cache/xx_x_xxx/test_from_100802_to_1.data": No such file or directory
-CONTEXT: while executing command on localhost:xxxxx
+ERROR: current transaction is aborted, commands ignored until end of transaction block
ROLLBACK TO SAVEPOINT s1;
+ERROR: savepoint "s1" does not exist
-- fetch from worker 1 should succeed
SELECT fetch_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[], 'localhost', :worker_1_port) > 0 AS fetched;
- fetched
----------------------------------------------------------------------
- t
-(1 row)
-
+ERROR: current transaction is aborted, commands ignored until end of transaction block
-- make sure the results read are same as the previous transaction block
SELECT count(*), sum(x) FROM
read_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[],'binary') AS res (x int);
- count | sum
----------------------------------------------------------------------
- 15 | 863
-(1 row)
-
+ERROR: current transaction is aborted, commands ignored until end of transaction block
ROLLBACk;
```
As outlined in #7306, which I created, the reason for this is related to
having only a single connection open to the node. Finding and fixing the full
cause is not trivial, so instead this PR starts working around this bug by
forcing maximum parallelism. Preferably we'd want this workaround not to be
necessary, but that requires spending time on the real fix. For now, having a
less flaky CI is good enough.
Sometimes in CI insert_select_connection_leak would fail like this:
```diff
END;
SELECT worker_connection_count(:worker_1_port) - :pre_xact_worker_1_connections AS leaked_worker_1_connections,
worker_connection_count(:worker_2_port) - :pre_xact_worker_2_connections AS leaked_worker_2_connections;
leaked_worker_1_connections | leaked_worker_2_connections
-----------------------------+-----------------------------
- 0 | 0
+ -1 | 0
(1 row)
-- ROLLBACK
BEGIN;
INSERT INTO target_table SELECT * FROM source_table;
INSERT INTO target_table SELECT * FROM source_table;
ROLLBACK;
SELECT worker_connection_count(:worker_1_port) - :pre_xact_worker_1_connections AS leaked_worker_1_connections,
worker_connection_count(:worker_2_port) - :pre_xact_worker_2_connections AS leaked_worker_2_connections;
leaked_worker_1_connections | leaked_worker_2_connections
-----------------------------+-----------------------------
- 0 | 0
+ -1 | 0
(1 row)
\set VERBOSITY TERSE
-- Error on constraint failure
BEGIN;
INSERT INTO target_table SELECT * FROM source_table;
SELECT worker_connection_count(:worker_1_port) AS worker_1_connections,
worker_connection_count(:worker_2_port) AS worker_2_connections \gset
SAVEPOINT s1;
INSERT INTO target_table SELECT a, CASE WHEN a < 50 THEN b ELSE null END FROM source_table;
@@ -89,15 +89,15 @@
leaked_worker_1_connections | leaked_worker_2_connections
-----------------------------+-----------------------------
0 | 0
(1 row)
END;
SELECT worker_connection_count(:worker_1_port) - :pre_xact_worker_1_connections AS leaked_worker_1_connections,
worker_connection_count(:worker_2_port) - :pre_xact_worker_2_connections AS leaked_worker_2_connections;
leaked_worker_1_connections | leaked_worker_2_connections
-----------------------------+-----------------------------
- 0 | 0
+ -1 | 0
(1 row)
```
Source:
https://github.com/citusdata/citus/actions/runs/6718401194/attempts/1#summary-18258258387
A negative number of leaked connections is obviously not possible. For some
reason there was a connection open when we checked the initial number of
connections that was closed afterwards. This could be from the maintenance
daemon, or maybe from the previous test that had not fully closed its
connections just yet.
The change in this PR doesn't actually fix the cause of the negative count,
but it simply considers it good as well, by changing the result to zero for
negative values.
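Conceptually the adjustment is just clamping at zero (a sketch only; the
actual test does this arithmetic in SQL):
```python
def leaked_connections(connections_after, connections_before):
    # A pre-existing connection that closes mid-test can make the raw
    # difference negative; treat that the same as "no leak".
    return max(0, connections_after - connections_before)
```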
With this fix we might sometimes miss a leak, because the negative number can
cancel out the leak and still result in 0. But since the negative number only
occurs sometimes, we'll still find the leak often enough.
When executing a prepared CALL, which is not pure SQL but is available with
some drivers like npgsql and pgjdbc, Citus entered a code path where a plan is
not defined while trying to increase its cost, resulting in a SIGSEGV when the
plan is a NULL pointer.
Fix by only increasing the plan cost when the plan is not NULL.
However, it is a bit suspicious to get here with a NULL plan, and maybe a
better change would be to not call
ShardPlacementForFunctionColocatedWithDistTable() with a NULL plan at all
(in call.c:134).
The bug was hit with, for example:
```
CallableStatement proc = con.prepareCall("{CALL p(?)}");
proc.registerOutParameter(1, java.sql.Types.BIGINT);
proc.setInt(1, -100);
proc.execute();
```
where `p(bigint)` is a distributed "function" and the parameter is the
distribution key (which is also in a distributed table); see #7242 for
details.
Fixes #7242
Sometimes in CI our logical_replication test fails like this:
```diff
+++ /__w/citus/citus/src/test/regress/results/logical_replication.out.modified 2023-11-01 14:15:08.562758546 +0000
@@ -40,21 +40,21 @@
SELECT count(*) from pg_publication;
count
-------
0
(1 row)
SELECT count(*) from pg_replication_slots;
count
-------
- 0
+ 1
(1 row)
SELECT count(*) FROM dist;
count
-------
```
It's hard to understand what is going on here based only on the wrong number.
So this PR changes the test to show the names of the subscription, publication
and replication slot, to make finding the cause easier.
In passing, this also fixes another flaky test in the same file that our flaky
test detection picked up. This is done by waiting for resource cleanup after
the shard move.
This is causing 404 failures due to a race condition:
https://github.com/actions/toolkit/issues/1235
It also makes the tests take unnecessarily long.
This was tested by changing a test file and seeing that the flaky test
detection was still working.
Fixes the flaky test that results in the following diff:
```diff
--- /__w/citus/citus/src/test/regress/expected/multi_mx_node_metadata.out.modified 2023-11-01 14:22:12.890476575 +0000
+++ /__w/citus/citus/src/test/regress/results/multi_mx_node_metadata.out.modified 2023-11-01 14:22:12.914476657 +0000
@@ -840,24 +840,26 @@
(1 row)
\c :datname - - :master_port
SELECT datname FROM pg_stat_activity WHERE application_name LIKE 'Citus Met%';
datname
------------
db_to_drop
(1 row)
DROP DATABASE db_to_drop;
+ERROR: database "db_to_drop" is being accessed by other users
SELECT datname FROM pg_stat_activity WHERE application_name LIKE 'Citus Met%';
datname
------------
-(0 rows)
+ db_to_drop
+(1 row)
-- cleanup
DROP SEQUENCE sequence CASCADE;
NOTICE: drop cascades to default value for column a of table reference_table
```
Sometimes isolation_metadata_sync_deadlock fails in CI like this:
```diff
diff -dU10 -w /__w/citus/citus/src/test/regress/expected/isolation_metadata_sync_deadlock.out /__w/citus/citus/src/test/regress/results/isolation_metadata_sync_deadlock.out
--- /__w/citus/citus/src/test/regress/expected/isolation_metadata_sync_deadlock.out.modified 2023-11-01 16:03:15.090199229 +0000
+++ /__w/citus/citus/src/test/regress/results/isolation_metadata_sync_deadlock.out.modified 2023-11-01 16:03:15.098199312 +0000
@@ -110,10 +110,14 @@
t
(1 row)
step s2-stop-connection:
SELECT stop_session_level_connection_to_node();
stop_session_level_connection_to_node
-------------------------------------
(1 row)
+
+teardown failed: ERROR: localhost:57638 is a metadata node, but is out of sync
+HINT: If the node is up, wait until metadata gets synced to it and try again.
+CONTEXT: SQL statement "SELECT master_remove_distributed_table_metadata_from_workers(v_obj.objid, v_obj.schema_name, v_obj.object_name)"
```
Source:
https://github.com/citusdata/citus/actions/runs/6721938040/attempts/1#summary-18268946448
To fix this we now wait for the metadata to be fully synced to all
nodes at the start of the teardown steps.
Sometimes in CI citus_non_blocking_split_shard_cleanup failed like this:
```diff
--- /__w/citus/citus/src/test/regress/expected/citus_non_blocking_split_shard_cleanup.out.modified 2023-11-01 15:07:14.280551207 +0000
+++ /__w/citus/citus/src/test/regress/results/citus_non_blocking_split_shard_cleanup.out.modified 2023-11-01 15:07:14.292551358 +0000
@@ -106,21 +106,22 @@
-----------------------------------
(1 row)
\c - - - :worker_2_port
SET search_path TO "citus_split_test_schema";
-- Replication slots should be cleaned up
SELECT slot_name FROM pg_replication_slots;
slot_name
---------------------------------
-(0 rows)
+ citus_shard_split_slot_19_10_17
+(1 row)
-- Publications should be cleanedup
SELECT count(*) FROM pg_publication;
count
```
It's expected that the replication slot is sometimes not cleaned up if
we don't wait until resource cleanup completes. This PR starts doing
that here.
Normally, tests that are written to be independent of other tests can use the
minimal schedule, and should do so. However, in our test settings the base
schedule is being used, which may introduce unnecessary dependencies and thus
unrelated errors that developers don't see in their local environment.
With this change, the default schedule will be minimal, so that tests are
free of unnecessary dependencies.
Sometimes failure_split_cleanup failed in CI like this:
```diff
ERROR: server closed the connection unexpectedly
CONTEXT: while executing command on localhost:9060
SELECT operation_id, object_type, object_name, node_group_id, policy_type
FROM pg_dist_cleanup where operation_id = 777 ORDER BY object_name;
operation_id | object_type | object_name | node_group_id | policy_type
--------------+-------------+-----------------------------------------------------------+---------------+-------------
777 | 1 | citus_failure_split_cleanup_schema.table_to_split_8981000 | 1 | 0
- 777 | 1 | citus_failure_split_cleanup_schema.table_to_split_8981002 | 1 | 1
777 | 1 | citus_failure_split_cleanup_schema.table_to_split_8981002 | 2 | 0
+ 777 | 1 | citus_failure_split_cleanup_schema.table_to_split_8981002 | 1 | 1
777 | 1 | citus_failure_split_cleanup_schema.table_to_split_8981003 | 2 | 1
777 | 4 | citus_shard_split_publication_1_10_777 | 2 | 0
(5 rows)
-- we need to allow connection so that we can connect to proxy
```
Source:
https://github.com/citusdata/citus/actions/runs/6717642291/attempts/1#summary-18256014949
It's the common problem of a missing column in the ORDER BY clause. This
fixes that by adding node_group_id to the ORDER BY of the query in question.
Sometimes in CI isolation_master_update_node fails like this:
```diff
------------------
(1 row)
step s2-abort: ABORT;
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
SSL connection has been closed unexpectedly
+server closed the connection unexpectedly
master_remove_node
------------------
```
This just seems like a random error line. The only way to reasonably fix this
is by adding an extra output file, so that's what this PR does.