Commit Graph

4677 Commits (c98341e4ed3967219dd5ae0e264dc41a7f442f1c)

Author SHA1 Message Date
Onur Tirtir 2d8be01853 Disable 2PC recovery while executing ALTER EXTENSION cmd during Citus upgrade tests
(cherry picked from commit b6b73e2f4c)
2025-02-04 16:53:32 +03:00
Onur Tirtir b6e3f39583 Fix flaky citus upgrade test
(cherry picked from commit 4cad81d643)
2025-02-04 16:49:12 +03:00
Naisila Puka 0a6adf4ccc
EXPLAIN generic_plan NOT supported in Citus (#7825)
We thought we provided support for this in

b8c493f2c4

However, the use of parameters in SQL is not supported in Citus. Since
generic plan queries use parameters, we can't support them for now.

Relevant PG16 commit https://github.com/postgres/postgres/commit/3c05284

Fixes #7813 with proper error message
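
For illustration, a minimal sketch of the kind of query that now raises a proper error (table and parameter are illustrative):

```sql
CREATE TABLE items (id int, price int);
SELECT create_distributed_table('items', 'id');

-- EXPLAIN (GENERIC_PLAN) plans the query with unbound parameters ($1),
-- which Citus cannot execute, so it now errors out with a clear message
EXPLAIN (GENERIC_PLAN) SELECT count(*) FROM items WHERE id = $1;
```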
2025-01-02 01:00:40 +03:00
Teja Mupparti ab7c13beb5 For scenarios such as Bug 3697586 (server crashes when assigning a distributed transaction): raise an ERROR instead of crashing 2024-12-26 10:45:59 -08:00
Onur Tirtir 73411915a4
Avoid re-assigning the global pid for client backends and bg workers when the application_name changes (#7791)
DESCRIPTION: Fixes a crash that happens because of unsafe catalog access
when re-assigning the global pid after application_name changes.

When application_name changes, we don't actually need to
try re-assigning the global pid for external client backends because
application_name doesn't affect the global pid for such backends. Plus,
trying to re-assign the global pid for external client backends would
unnecessarily cause a catalog access when the cached local
node id is invalidated. However, accessing the catalog tables is
dangerous in certain situations, like when we're not in a transaction
block. And for the other types of backends, i.e., the Citus internal
backends, we need to re-assign the global pid when the application_name
changes because for such backends we simply extract the global pid
inherited from the originating backend from the application_name -that's
specified by the originating backend when opening that connection- and this
doesn't require catalog access.
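
A hedged illustration of the user-visible behavior, assuming the `citus_backend_gpid()` helper is available: changing `application_name` in an external client session should no longer trigger any catalog lookup to recompute the global pid.

```sql
-- in a regular client session connected to the coordinator
SELECT citus_backend_gpid();    -- global pid assigned at connection time

SET application_name TO 'my_client_app';
SELECT citus_backend_gpid();    -- unchanged; no catalog access is needed
```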
2024-12-23 14:01:53 +00:00
Pavel Seleznev fe6d198ab2
Remove warnings on some builds (#7680)
Co-authored-by: Pavel Seleznev <PNSeleznev@sberbank.ru>
2024-12-03 17:10:36 +03:00
Colm McHugh c52f36019f [Bug Fix] [SEGFAULT] Querying distributed tables with window partition may cause segfault #7705
In function MasterAggregateMutator(), when the original Node is a Var node use makeVar() instead
of copyObject() when constructing the Var node for the target list of the combine query.
The varnullingrels field of the original Var node is ignored because it is not relevant for the
combine query; copying it caused the problem in issue 7705, where a coordinator query had
a Var with a reference to a non-existent join relation.
2024-11-06 19:26:29 +00:00
Erik Karsten f6959715dc
fix: typo runnnig -> running (#7686)
Very small PR, no changes to behaviour. Just a typo fix :-)

Under
`src/backend/distributed/sql/udfs/citus_finalize_upgrade_to_citus11/`
the sql has a typo "runnnig", which will be displayed to the user if the
`citus_check_cluster_node_health()` fails when calling
`citus_finish_citus_upgrade();`

Co-authored-by: eaydingol <60466783+eaydingol@users.noreply.github.com>
2024-09-17 09:28:46 +03:00
Parag Jain 5bad6c6a1d
[Bug Fix] : writing incorrect data to target Merge repartition Command (#7659)
We were writing incorrect data to the target table in some cases of the MERGE command: in the repartition case, when the source query is a RELATION, we were referring to an incorrect attribute number, which resulted in
this incorrect behavior.

Example :

![image](https://github.com/user-attachments/assets/a101cb36-7976-459c-befb-96a55a5b3dc1)

![image](https://github.com/user-attachments/assets/e5c83b7b-5b8e-4d79-a927-95684dc9ba49)

I have added tests for the fix as part of this PR. Thanks.
2024-09-12 21:16:39 -07:00
Mehmet YILMAZ 4775715691
Fix race condition in citus_set_coordinator_host when adding multiple coordinator nodes concurrently (#7682)
When multiple sessions concurrently attempt to add the same coordinator
node using `citus_set_coordinator_host`, there is a potential race
condition. Both sessions may pass the initial metadata check
(`isCoordinatorInMetadata`), but only one will succeed in adding the
node. The other session will fail with an assertion error
(`Assert(!nodeAlreadyExists)`), causing the server to crash. Even though
the `AddNodeMetadata` function takes an exclusive lock, it appears that
the lock is not preventing the race condition before the initial
metadata check.
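
A hedged sketch of the offending pattern (host name is illustrative); before the fix, two sessions racing through this call could hit the `Assert(!nodeAlreadyExists)` crash, and after it the losing session simply skips the insert:

```sql
-- session 1 and session 2, concurrently:
SELECT citus_set_coordinator_host('coordinator.example.com', 5432);
```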

- **Issue**: The current logic allows concurrent sessions to pass the
check for existing coordinators, leading to an attempt to insert
duplicate nodes, which triggers the assertion failure.

- **Impact**: This race condition leads to crashes during operations
that involve concurrent coordinator additions, as seen in
https://github.com/citusdata/citus/issues/7646.

**Test Plan:**

- Isolation Test Limitation: An isolation test was added to simulate
concurrent additions of the same coordinator node, but due to the
behavior of PostgreSQL locking mechanisms, the test does not trigger the
edge case. The lock applied within the function serializes the
operations, preventing the race condition from occurring in the
isolation test environment.
While the edge case is difficult to reproduce in an isolation test, the
fix addresses the core issue by ensuring concurrency control through
proper locking.

- Existing Tests: All existing tests related to node metadata and
coordinator management have been run to ensure that no regressions were
introduced.

**After the Fix:**

- Concurrent attempts to add the same coordinator node will be
serialized. One session will succeed in adding the node, while the
others will skip the operation without crashing the server.

Co-authored-by: Mehmet YILMAZ <mehmet.yilmaz@microsoft.com>
2024-09-09 17:09:56 +03:00
eaydingol 9e1852eac7
Check if the limit is null (#7665)
DESCRIPTION: Add a check to see if the given limit is null. 

Fixes a bug by checking if the limit given in the query is null when the
actual limit is computed with respect to the given offset.
Prior to this change, null was interpreted as 0 during the limit
calculation when both limit and offset were given.
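
A hedged example of the affected shape (table name is illustrative); `LIMIT ALL` is a null limit:

```sql
CREATE TABLE events (id int);
SELECT create_distributed_table('events', 'id');
INSERT INTO events SELECT generate_series(1, 10);

-- with the fix, the null limit is detected when the effective limit is
-- computed together with the OFFSET, instead of being treated as 0
SELECT id FROM events ORDER BY id LIMIT ALL OFFSET 2;
```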

Fixes #7663
2024-07-31 14:53:38 +03:00
Parag Jain 3c467e6e02
Support MERGE command for single_shard_distributed Target (#7643)
This PR has the following changes:
1. Enable the MERGE command for single_shard_distributed targets (see the sketch below).
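
A hedged sketch of the newly enabled case, assuming single-shard distributed tables created with a NULL distribution column (table names are illustrative):

```sql
CREATE TABLE source (id int, val int);
CREATE TABLE target (id int, val int);

-- single-shard (null distribution key) distributed tables
SELECT create_distributed_table('source', null);
SELECT create_distributed_table('target', null);

-- MERGE into a single-shard distributed target is now supported
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET val = s.val
WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val);
```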
2024-07-16 08:08:44 -07:00
Nils Dijk e776a7ebbb
CI: move to github container registry (#7652)
We move the CI images to the github container registry.

Given we mostly (if not solely) run these containers on github actions
infra it makes sense to have them hosted closer to where they are
needed.

Image changes: https://github.com/citusdata/the-process/pull/157
2024-07-12 11:26:38 +03:00
Jelte Fennema-Nio 58fef24142
Update Citus Technical Documentation about the rebalancer (#7638)
The sections about the rebalancer algorithm and the background tasks were
empty.

---------

Co-authored-by: Marco Slot <marco.slot@gmail.com>
Co-authored-by: Steven Sheehy <17552371+steven-sheehy@users.noreply.github.com>
2024-06-27 16:07:38 +02:00
Jelte Fennema-Nio aaaf637a6b
Redo #7620: Fix merge command when insert value does not have source distributed column (#7627)
Related to issue #7619, #7620
The MERGE command fails when the source query is single-sharded, source and
target are co-located, and the insert is not using the distribution key of the
source.

Example
```
CREATE TABLE source (id integer);
CREATE TABLE target (id integer );

-- let's distribute both table on id field
SELECT create_distributed_table('source', 'id');
SELECT create_distributed_table('target', 'id');

MERGE INTO target t
  USING ( SELECT 1 AS somekey
          FROM source
        WHERE source.id = 1) s
  ON t.id = s.somekey
  WHEN NOT MATCHED
  THEN INSERT (id)
    VALUES (s.somekey)

ERROR:  MERGE INSERT must use the source table distribution column value
HINT:  MERGE INSERT must use the source table distribution column value
```

Author's opinion: if the join is not between the source and target distribution
columns, we should not force the user to use the source distribution column
while inserting the value of the target distribution column.

Fix: if the user is not using the distribution key of the source for insertion,
let's not push the query down to the workers and not force the user to use the
source distribution column if it is not part of the join.

This reverts commit fa4fc0b372.

Co-authored-by: paragjain <paragjain@microsoft.com>
2024-06-17 14:07:25 +00:00
Jelte Fennema-Nio fa4fc0b372
Revert rebase merge of #7620 (#7626)
Because we want to track PR numbers and to make backporting easy we
(pretty much always) use squash-merges when merging to master. We
accidentally used a rebase merge for PR #7620. This reverts those
changes so we can redo the merge using squash merge.

This reverts all commits from eedb607c to 9e71750fc.
2024-06-17 15:46:00 +02:00
paragjain 9e71750fcd fixing flakyness in test 2024-06-15 14:55:36 -07:00
paragjain e62ae64d00 some more 2024-06-15 14:55:36 -07:00
paragjain 76f68f47c4 removing flakyness from test 2024-06-15 14:55:36 -07:00
Jelte Fennema-Nio d5231c34ab Revert "Try to fix failure"
This reverts commit 89f7217660.
2024-06-15 14:55:36 -07:00
Jelte Fennema-Nio f883cfdd77 Try to fix failure 2024-06-15 14:55:36 -07:00
paragjain 7c8a366ba2 some more 2024-06-15 14:55:36 -07:00
paragjain 06e9c29950 some more 2024-06-15 14:55:36 -07:00
paragjain 493140287a fix some indent 2024-06-15 14:55:36 -07:00
paragjain ec25b433d4 adding update and delete tests 2024-06-15 14:55:36 -07:00
paragjain eedb607cd5 merge command fix 2024-06-15 14:55:36 -07:00
Jelte Fennema-Nio 8c9de08b76
Fix CI issues after Github Actions networking changes (#7624)
For some reason using localhost in our hba file doesn't have the
intended effect anymore in our Github Actions runners. Probably because
of some networking change (IPv6 maybe) or some change in the
`/etc/hosts` file.

Replacing localhost with the equivalent loopback IPv4 and IPv6 addresses
resolved this issue.
2024-06-14 16:20:23 +02:00
Gürkan İndibay 0ab42e7a80
Adds null check for node in HasRangeTableRef (#7609)
DESCRIPTION: Adds null check for node in HasRangeTableRef to prevent
errors
2024-05-28 11:03:38 +03:00
Evgeny Nechayev fcc72d8a23
Use macro wrapper to access PGPROC data, which allow to improve compa… (#7607)
DESCRIPTION: Use macro wrapper to access PGPROC data, to improve compatibility with PostgreSQL forks.
2024-05-28 00:39:13 +00:00
Jelte Fennema-Nio a0151aa31d
Greatly speed up "\d tablename" on servers with many tables (#7577)
DESCRIPTION: Fix performance issue when using "\d tablename" on a server
with many tables

We introduce a filter to every query on pg_class to automatically remove
shards. This is useful to make sure \d and PgAdmin are not cluttered
with shards. However, the way we were introducing this filter was using
`securityQuals` which can have negative impact on query performance.

On clusters with 100k+ tables this could cause a simple "\d tablename"
command to take multiple seconds, because a skipped optimization by
Postgres causes a full table scan. This changes the code to introduce
this filter in the regular `quals` list instead of in `securityQuals`,
which causes Postgres to use the intended optimization again.

For reference, this was initially reported as a Postgres issue by me:

https://www.postgresql.org/message-id/flat/4189982.1712785863%40sss.pgh.pa.us#b87421293b362d581ea8677e3bfea920
2024-04-16 17:26:12 +02:00
Xing Guo ada3ba2507
Add missing volatile qualifier. (#7570)
Variables being modified in the PG_TRY block and read in the PG_CATCH
block should be qualified with volatile.

The variable waitEventSet is modified in the PG_TRY block (line 1085)
and read in the PG_CATCH block (line 1095).

The variable relation is modified in the PG_TRY block (line 500) and
read in the PG_CATCH block (line 515).

Besides, the variable objectAddress doesn't need the volatile qualifier.

Ref: C99 7.13.2.1[^1],

> All accessible objects have values, and all other components of the
abstract machine have state, as of the time the longjmp function was
called, except that the values of objects of automatic storage duration
that are local to the function containing the invocation of the
corresponding setjmp macro that do not have volatile-qualified type and
have been changed between the setjmp invocation and longjmp call are
indeterminate.

[^1]: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf

DESCRIPTION: Correctly mark some variables as volatile

---------

Co-authored-by: Hong Yi <zouzou0208@gmail.com>
2024-04-16 15:29:14 +02:00
Karina 41e2af8ff5
Use expecteddir option in _run_pg_regress() (#7582)
Fix check-arbitrary-configs test failures with current REL_16_STABLE.
This is the same problem as described in #7573; I missed the pg_regress call
in _run_pg_regress() in that PR.

Co-authored-by: Karina Litskevich <litskevichkarina@gmail.com>
2024-04-16 08:44:47 +00:00
Jelte Fennema-Nio a263ac6f5f
Speed up GetForeignKeyOids (#7578)
DESCRIPTION: Fix performance issue in GetForeignKeyOids on systems with
many constraints

GetForeignKeyOids was showing up in CPU profiles when distributing
schemas on systems with 100k+ constraints. The reason was that this
function was doing a sequence scan of pg_constraint to get the foreign
keys that referenced the requested table.

This fixes that by finding the constraints referencing the table through
pg_depend instead of pg_constraint. We're doing this indirection,
because pg_constraint doesn't have an index that we can use, but
pg_depend does.
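
A hedged sketch of the lookup idea in plain SQL (not the exact C implementation): resolve the referencing foreign-key constraints through `pg_depend`, which has a usable index on the referenced object, instead of scanning `pg_constraint`.

```sql
SELECT DISTINCT c.conname
FROM pg_depend d
JOIN pg_constraint c ON c.oid = d.objid
WHERE d.classid = 'pg_constraint'::regclass
  AND d.refclassid = 'pg_class'::regclass
  AND d.refobjid = 'referenced_table'::regclass  -- illustrative table name
  AND c.contype = 'f';
```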
2024-04-16 08:16:40 +00:00
Jelte Fennema-Nio 110b4192b2
Fix PG upgrades when invalid rebalance strategies exist (#7580)
DESCRIPTION: Fix PG upgrades when invalid rebalance strategies exist

Without this change an upgrade of a cluster with an invalid rebalance
strategy would fail with an error like this:
```
cache lookup failed for shard_cost_function with oid 6077337
CONTEXT:  SQL statement "SELECT citus_validate_rebalance_strategy_functions(
        NEW.shard_cost_function,
        NEW.node_capacity_function,
        NEW.shard_allowed_on_node_function)"
PL/pgSQL function citus_internal.pg_dist_rebalance_strategy_trigger_func() line 5 at PERFORM
SQL statement "INSERT INTO pg_catalog.pg_dist_rebalance_strategy SELECT
        name,
        default_strategy,
        shard_cost_function::regprocedure::regproc,
        node_capacity_function::regprocedure::regproc,
        shard_allowed_on_node_function::regprocedure::regproc,
        default_threshold,
        minimum_threshold,
        improvement_threshold
    FROM public.pg_dist_rebalance_strategy"
PL/pgSQL function citus_finish_pg_upgrade() line 115 at SQL statement
```

This fixes that by disabling the trigger and simply re-inserting the
invalid rebalance strategy without checking. We could also silently
remove it, but this seems nicer.
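
A hedged sketch of the approach taken inside citus_finish_pg_upgrade() (shown with `DISABLE TRIGGER ALL` for brevity; the actual code may target the specific validation trigger):

```sql
ALTER TABLE pg_catalog.pg_dist_rebalance_strategy DISABLE TRIGGER ALL;
-- re-insert the saved rows from public.pg_dist_rebalance_strategy as-is,
-- without running citus_validate_rebalance_strategy_functions()
ALTER TABLE pg_catalog.pg_dist_rebalance_strategy ENABLE TRIGGER ALL;
```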
2024-04-15 14:26:33 +00:00
Jelte Fennema-Nio 16604a6601
Use an index to get FDWs that depend on extensions (#7574)
DESCRIPTION: Fix performance issue when distributing a table that
depends on an extension

When the database contains many objects this function would show up in
profiles because it was doing a sequence scan on pg_depend. And with
many objects pg_depend can get very large.

This starts using an index scan to only look for rows containing FDWs,
of which there are expected to be very few (often even zero).
2024-04-15 12:42:56 +00:00
Jelte Fennema-Nio cdf51da458
Speed up SequenceUsedInDistributedTable (#7579)
DESCRIPTION: Fix performance issue when creating distributed tables if
many already exist

This builds on the work to speed up EnsureSequenceTypeSupported, and now
does something similar for SequenceUsedInDistributedTable.
SequenceUsedInDistributedTable had a similar O(number of citus tables)
operation. This fixes that and speeds up creation of distributed tables
significantly when many distributed tables already exist.

Fixes #7022
2024-04-15 12:01:55 +00:00
Jelte Fennema-Nio 381f31756e
Speed up EnsureSequenceTypeSupported (#7575)
DESCRIPTION: Fix performance issue when creating distributed tables and many already exist

EnsureSequenceTypeSupported was doing an O(number of distributed tables)
operation. This can become very slow with lots of Citus tables, which
now happens much more frequently in practice due to schema based sharding.

Partially addresses #7022
2024-04-15 10:28:11 +00:00
Onur Tirtir 3586aab17a
Allow providing "host" parameter via citus.node_conninfo (#7541)
And when that is the case, directly use it as "host" parameter for the
connections between nodes and use the "hostname" provided in
pg_dist_node / pg_dist_poolinfo as "hostaddr" to avoid host name lookup.

This is to avoid relying on DNS resolution (and / or setting up DNS names
for each host in the cluster). This already works currently when using
IPs in the hostname. The only use of setting host is that you can then
use sslmode=verify-full and it will validate that the hostname matches
the certificate provided by the node you're connecting to.

It would be more flexible to make this a per-node setting, but that
requires SQL changes. And we'd like to backport this change, and
backporting such a sql change would be quite hard while backporting this
change would be very easy. And in many setups, a different hostname for
TLS validation is actually not needed. The reason for that is
query-from-any node: With query-from-any-node all nodes usually have a
certificate that is valid for the same "cluster hostname", either using
a wildcard cert or a Subject Alternative Name (SAN). Because if you load
balance across nodes you don't know which node you're connecting to, but
you still want TLS validation to do its job. So with this change you
can use this same "cluster hostname" for TLS validation within the
cluster. Obviously this means you don't validate that you're connecting
to a particular node, just that you're connecting to one of the nodes in
the cluster, but that should be fine from a security perspective (in
most cases).
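
A hedged usage sketch (hostname is illustrative): set a shared cluster hostname for certificate validation while the actual addresses still come from pg_dist_node / pg_dist_poolinfo.

```sql
ALTER SYSTEM SET citus.node_conninfo = 'sslmode=verify-full host=db.cluster.example.com';
SELECT pg_reload_conf();
```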

Note to self: This change requires updating

https://docs.citusdata.com/en/latest/develop/api_guc.html#citus-node-conninfo-text.

DESCRIPTION: Allows overwriting host name for all inter-node connections
by supporting "host" parameter in citus.node_conninfo
2024-04-15 09:51:11 +00:00
Karina 41d99249d9
Use expecteddir option when running vanilla tests (#7573)
In PostgreSQL 16 a new option expecteddir was introduced to pg_regress.
Together with fix in
[196eeb6b](https://github.com/postgres/postgres/commit/196eeb6b) it
causes check-vanilla failure if expecteddir is not specified.

Co-authored-by: Karina Litskevich <litskevichkarina@gmail.com>
2024-04-10 16:08:54 +00:00
Onur Tirtir 3929a5b2a6
Fix incorrect "VALID UNTIL" assumption made for roles in node activation (#7534)
Fixes https://github.com/citusdata/citus/issues/7533.

DESCRIPTION: Fixes incorrect `VALID UNTIL` setting assumption made for
roles when syncing them to new nodes
2024-03-20 11:38:33 +00:00
Emel Şimşek fdd658acec
Fix crash caused by some form of ALTER TABLE ADD COLUMN statements. (#7522)
DESCRIPTION: Fixes a crash caused by some form of ALTER TABLE ADD COLUMN
statements. When adding multiple columns, if one of the ADD COLUMN
statements contains a FOREIGN KEY constraint omitting the referenced
columns in the statement, a SEGFAULT occurs.

For instance, the following statement results in a crash:

```
  ALTER TABLE lt ADD COLUMN new_col1 bool,
                          ADD COLUMN new_col2 int references rt;

```                      


Fixes #7520.
2024-03-20 11:06:05 +03:00
Onur Tirtir 0acb5f6e86
Fix assertion failure in maintenance daemon during Citus upgrades (#7537)
Fixes https://github.com/citusdata/citus/issues/7536.

Note to reviewer:

Before this commit, the following results in an assertion failure when
executed locally and this won't be the case anymore:
```console
make -C src/test/regress/ check-citus-upgrade-local citus-old-version=v10.2.0
```

Note that this doesn't happen on CI as we don't enable assertions there.

---------

Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
2024-03-20 00:10:12 +00:00
Onur Tirtir d129064280
Refactor the code that supports node-wide object mgmt commands from non-main dbs (#7544)
RunPreprocessNonMainDBCommand and RunPostprocessNonMainDBCommand are
the entrypoints for this module. These functions are called from
utility_hook.c to support some of the node-wide object management
commands from non-main databases.

To add support for a new command type, one needs to define a new
NonMainDbDistributeObjectOps object and add it to
GetNonMainDbDistributeObjectOps.
2024-03-19 14:26:17 +01:00
Hanefi Onaldi bf05bf51ec
Refactor one helper function (#7562)
The code looks simpler and easier to read now.
2024-03-18 12:06:49 +00:00
eaydingol 8afa2d0386
Change the order in which the locks are acquired (#7542)
This PR changes the order in which the locks are acquired (for the
target and reference tables), when a modify request is initiated from a
worker node that is not the "FirstWorkerNode".


To prevent concurrent writes, locks are acquired on the first worker
node for the replicated tables. When the update statement originates
from the first worker node, it acquires the lock on the reference
table(s) first, followed by the target table(s). However, if the update
statement is initiated in another worker node, the lock requests are
sent to the first worker in a different order. This PR unifies the
modification order on the first worker node. With the third commit,
independent of the node that received the request, the locks are
acquired for the modified table and then the reference tables on the
first node.

The first commit shows a sample output for the test prior to the fix. 

Fixes #7477

---------

Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
2024-03-10 10:20:08 +03:00
copetol 12f56438fc
Fix segfault when using certain DO block in function (#7554)
When using a CASE WHEN expression in the body
of the function that is used in the DO block, a segmentation
fault occurred. This fixes that.
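
A heavily hedged sketch of the general shape described above (function, table, and column names are illustrative, not the exact reproducer from #7381):

```sql
CREATE TABLE dist_table (id int);
SELECT create_distributed_table('dist_table', 'id');

CREATE OR REPLACE FUNCTION bucket(x int) RETURNS int AS $$
  SELECT CASE WHEN x > 0 THEN 1 ELSE 0 END;
$$ LANGUAGE sql;

DO $$
BEGIN
  -- calling the CASE WHEN-based function from a DO block used to segfault
  PERFORM bucket(id) FROM dist_table;
END;
$$;
```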

Fixes #7381

---------

Co-authored-by: Konstantin Morozov <vzbdryn@yahoo.com>
2024-03-08 14:21:42 +01:00
Karina f0043b64a1
Fix server crash when trying to execute activate_node_snapshot() on a single-node cluster (#7552)
This fixes #7551 reported by Egor Chindyaskin

Function activate_node_snapshot() is not meant to be called on a cluster
without worker nodes. This commit adds an ERROR report for such a case to
prevent a server crash.
2024-03-07 11:08:19 +01:00
eaydingol edcdbe67b1
Fix: store the previous shard cost for order verification (#7550)
Store the previous shard cost so that the invariant checking performs as
expected.
2024-03-06 14:46:49 +03:00
sminux d59c93bc50
fix bad copy-paste rightComparisonLimit (#7547)
DESCRIPTION: change for #7543
2024-03-05 08:49:35 +01:00
Gürkan İndibay 51009d0191
Add support for alter/drop role propagation from non-main databases (#7461)
DESCRIPTION: Adds support for distributed `ALTER/DROP ROLE` commands
from the databases where Citus is not installed

---------

Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
2024-02-28 08:58:28 +00:00
Onur Tirtir f4242685e3
Add failure handling for CREATE DATABASE commands (#7483)
In the preprocess phase, we save the original database name, replace
the dbname field of CreatedbStmt with a temporary name (to let Postgres
create the database with the temporary name locally) and then
we insert a cleanup record for the temporary database name on all
nodes **(\*\*)**.

And in the postprocess phase, we first rename the temporary database
back to its original name on the local node and then return a list of
distributed DDL jobs i) to create the database with the temporary
name and then ii) to rename it back to its original name on the other
nodes. That way, if CREATE DATABASE fails on any of the nodes, the
temporary database will be cleaned up by the cleanup records that
we inserted in the preprocess phase, and we won't leak any database
with the name that the user intended to use.
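
A hedged sketch of the statements this boils down to on the local node (the temporary name is illustrative; the real one is generated internally):

```sql
-- preprocess: create under a temporary name and register a cleanup record
CREATE DATABASE citus_temp_database_1234_0;

-- postprocess: rename locally, then ship the same two steps to the other
-- nodes as distributed DDL jobs
ALTER DATABASE citus_temp_database_1234_0 RENAME TO mydb;
```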

Solves the problem documented in
https://github.com/citusdata/citus/issues/7369
for CREATE DATABASE commands.

**(\*\*):** To ensure that we insert cleanup records on all nodes,
with this PR we also start requiring the coordinator to be in the
metadata because otherwise we would skip inserting a cleanup record
for the coordinator.
2024-02-23 17:02:32 +00:00
Onur Tirtir 9ddee5d02a
Test that we check unsupported options for CREATE DATABASE from non-main dbs (#7532)
When adding CREATE/DROP DATABASE propagation in #7240, luckily
we'd added the EnsureSupportedCreateDatabaseCommand() check into the
deparser too, just to be on the safe side. That way, today CREATE
DATABASE commands from non-main dbs don't silently allow unsupported
options.

I wasn't aware of this when merging #7439 and hence wanted to add
a test so that we don't mistakenly remove that check from the deparser
in the future.
2024-02-23 10:37:11 +00:00
eaydingol 3509b7df5a
Add support for SECURITY LABEL on ROLE propagation from non-main databases (#7525)
DESCRIPTION: Adds support for distributed "SECURITY LABEL on ROLE"
commands from the databases where Citus is not installed.
2024-02-23 09:54:19 +03:00
Gürkan İndibay 211415dd4b
Removes granted by statement to fix flaky test errors (#7526)
Fix for #7519.
In the metadata sync phase, GRANT statements for roles are fetched from
catalog tables and propagated.
However, in some cases the grant .. with admin option clause executes after
the granted by statements, which causes the #7519 error.
We will fix this issue with the grantor propagation task in the project.
2024-02-21 18:37:25 +03:00
Karina 683e10ab69
Fix error in master_disable_node/citus_disable_node (#7492)
This fixes #7454: master_disable_node() has only two arguments, but
calls citus_disable_node(), which tries to read three arguments.

Co-authored-by: Karina Litskevich <litskevichkarina@gmail.com>
2024-02-21 11:35:27 +00:00
Halil Ozan Akgül 852bcc5483
Add support for create / drop database propagation from non-main databases (#7439)
DESCRIPTION: Adds support for distributed `CREATE/DROP DATABASE `
commands from the databases where Citus is not installed

---------

Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
2024-02-21 10:44:01 +00:00
Gürkan İndibay b3ef1b7e39
Add support for grant on database propagation from non-main databases (#7443)
DESCRIPTION: Adds support for distributed `GRANT .. ON DATABASE TO USER`
commands from the databases where Citus is not installed

---------

Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
2024-02-21 13:14:58 +03:00
Onur Tirtir 56e014e64e
Clarify resource-cleaner apis (#7518)
Rename InsertCleanupRecordInCurrentTransaction ->
InsertCleanupOnSuccessRecordInCurrentTransaction and hardcode policy
type as CLEANUP_DEFERRED_ON_SUCCESS.

Rename InsertCleanupRecordInSubtransaction ->
InsertCleanupRecordOutsideTransaction.
2024-02-20 08:57:08 +00:00
Gürkan İndibay 2cbfdbfa46
Adds Grant Role support from non-main db (#7404)
DESCRIPTION: Adds support for distributed role-membership management
commands from the databases where Citus is not installed (`GRANT <role>
TO <role>`)

This PR also refactors the code-path that allows executing some of the
node-wide commands so that we use send deparsed query string to other
nodes instead of the `queryString` passed into utility hook.

---------

Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
2024-02-19 17:53:27 +03:00
Gürkan İndibay 9a0cdbf5af
Fixes granted by cascade/restrict statements for revoke (#7517)
DESCRIPTION: Fixes incorrect propagating of `GRANTED BY` and
`CASCADE/RESTRICT` clauses for `REVOKE` statements

There are two issues fixed in this PR
1. granted by statement will appear for revoke statements as well
2. revoke/cascade statement will appear after granted by

Since granted by clauses did not appear in the statements, this bug
hasn't been visible until now. However, after activating the granted by
clause for revoke, an ordering problem arose, and this PR fixes the
ordering problem for cascade/restrict as well.
In summary, this PR now produces granted by clauses properly, with the
correct order of clauses.
We can verify both fixed errors with a single statement:
REVOKE dist_role_3 FROM non_dist_role_3 GRANTED BY test_admin_role
CASCADE;
2024-02-19 15:44:21 +03:00
Onur Tirtir 74b55d0546
Enforce using werkzeug 2.3.7 for failure tests and update Postgres versions to latest minors (#7491)
Let's use version 2.3.7 to fix the following error as we do in docker
images created in https://github.com/citusdata/the-process/ repo.
```
ImportError: cannot import name 'url_quote' from 'werkzeug.urls' (/home/onurctirtir/.local/share/virtualenvs/regress-ffZKpSmO/lib/python3.9/site-packages/werkzeug/urls.py)
```

And changing werkzeug version required rebuilding Pipfile.lock file in
src/test/regress. Before updating this Pipfile.lock file, we want to
make sure that versions specified there don't break any tests. And to
ensure that this is the case,
https://github.com/citusdata/the-process/pull/155 synchronizes
requirements.txt file based on new Pipfile.lock and hence this PR
updates test image suffix accordingly.

Also, while updating https://github.com/citusdata/the-process/pull/155,
I had to update Postgres versions to the latest minors to make image
builds pass again, and updating Postgres versions in the images requires
updating Postgres versions in this repo too. While doing that, we also
update the Postgres version used in the devcontainer.
2024-02-16 14:38:32 +00:00
eaydingol 15a3adebe8
Support SECURITY LABEL ON ROLE from any node (#7508)
DESCRIPTION: Propagates SECURITY LABEL ON ROLE statement from any node
2024-02-15 20:34:15 +03:00
Gürkan İndibay 59da0633bb
Fixes invalid grantor field parsing in grant role propagation (#7451)
DESCRIPTION: Resolves an issue that disrupts distributed GRANT
statements with the grantor option

This PR solves 3 issues:
1. Correcting the erroneous appending of multiple granted by clauses in the
deparser.
2. Adding support for grantor (granted by) in grant role propagation.
3. Implementing grantor (granted by) support during the metadata sync
grant role propagation phase.

Limitations: Currently, the grantor must be created prior to the
metadata sync phase. During metadata sync, both the creation of the
grantor and the grants given by that role cannot be performed, as the
grantor role is not detected during the dependency resolution phase.

---------

Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
2024-02-15 08:27:29 +00:00
Onur Tirtir 689c6897a4
Refactor CREATE / DROP database functions for better readability (#7486) 2024-02-08 01:55:50 +03:00
eaydingol f01c5f2593
Move remaining citus_internal functions (#7478)
Moves the following functions to the Citus internal schema: 

citus_internal_local_blocked_processes
citus_internal_global_blocked_processes
citus_internal_mark_node_not_synced
citus_internal_unregister_tenant_schema_globally
citus_internal_update_none_dist_table_metadata
citus_internal_update_placement_metadata
citus_internal_update_relation_colocation
citus_internal_start_replication_origin_tracking
citus_internal_stop_replication_origin_tracking
citus_internal_is_replication_origin_tracking_active


#7405

---------

Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
2024-02-07 16:58:17 +03:00
Filip Sedlák 6869b3ad10
Fail early when shard can't be safely moved to a new node (#7467)
DESCRIPTION: citus_move_shard_placement now fails early when shard
cannot be safely moved

The implementation is quite simplistic -
`citus_move_shard_placement(...)` will fail with an error if there's any
new node in the cluster that doesn't have reference tables yet.

It could have been finer-grained, i.e. erroring only when trying to move
a shard to an uninitialized node. Looking at the related functions -
`replicate_reference_tables()` or `citus_rebalance_start()`, I think
it's acceptable behaviour. These other functions also treat "any"
uninitialized node as a temporary anomaly.
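
A hedged usage sketch (shard id and node names are illustrative): replicating reference tables to the new node first clears the early error, after which the move can proceed.

```sql
-- copies reference tables to any nodes that don't have them yet
SELECT replicate_reference_tables();

SELECT citus_move_shard_placement(102008, 'worker-1', 5432, 'worker-2', 5432);
```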

Fixes #7426

---------

Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
2024-02-07 12:04:52 +00:00
Karina 9ff8436f14
Create directories and files with pg_file_create_mode and pg_dir_create_mode permissions (#7479)
Since Postgres commit da9b580d files and directories are supposed to
be created with pg_file_create_mode and pg_dir_create_mode permissions
when default permissions are expected.

This fixes a failure of one of the postgres tests:
If we create file add.conf containing
```
shared_preload_libraries='citus'
```
and run postgres tests
```
TEMP_CONFIG=/path/to/add.conf make installcheck -C src/bin/pg_ctl/
```
then 001_start_stop.pl fails with
```
.../data/base/pgsql_job_cache mode must be 0750
```
in the log.

In passing, this also stops creating directories that we haven't used
since Citus 7.4.

This change explicitly doesn't change permissions of certificates/keys
that we create.

---------

Co-authored-by: Karina Litskevich <litskevichkarina@gmail.com>
2024-02-07 12:48:31 +01:00
eaydingol 594cb6f274
Move more citus internal functions (#7473)
Moves the following functions:

 citus_internal_delete_colocation_metadata 
 citus_internal_delete_partition_metadata 
 citus_internal_delete_placement_metadata 
 citus_internal_delete_shard_metadata 
 citus_internal_delete_tenant_schema
2024-01-31 23:00:04 +03:00
eaydingol d05174093b
Move citus internal functions (#7470)
Move more functions to citus_internal schema, the list:

citus_internal_add_placement_metadata
citus_internal_add_shard_metadata
citus_internal_add_tenant_schema
citus_internal_adjust_local_clock_to_remote
citus_internal_database_command

#7405
2024-01-31 11:45:19 +00:00
Onur Tirtir 3ce731d497
Make multi_metadata_sync runnable via run_test.py (#7472) 2024-01-31 09:50:16 +00:00
Onur Tirtir 6f43d5c02f
Enhance technical README for DDL propagation (#7471) 2024-01-31 10:30:14 +01:00
Onur Tirtir 5aedec4242
Improve error message for recursive CTEs (#7407)
Fixes #2870
2024-01-30 15:12:48 +00:00
eaydingol f6ea619e27
Move citus internal functions (#7466)
Move the following functions from pg_catalog to citus_internal:

citus_internal_add_object_metadata
citus_internal_add_partition_metadata


#7405
2024-01-30 12:27:10 +03:00
eaydingol 5d673874f7
Move citus internal functions (#7456)
Move citus_internal_acquire_citus_advisory_object_class_lock and
citus_internal_add_colocation_metadata functions from pg_catalog to
citus_internal.

#7405
2024-01-26 11:46:05 +03:00
eaydingol 542212c3d8
Make citus_internal schema public (#7450)
DESCRIPTION: Makes citus_internal schema public



#7405
2024-01-24 17:11:10 +03:00
Onur Tirtir 3de5601bcc
Replace LOCAL_HOST_NAME with LocalHostName (#7449)
The only usages of LOCAL_HOST_NAME were in functions that are only used
during regression tests and in places where it was used incorrectly.
2024-01-24 13:50:39 +00:00
Onur Tirtir 1d096df7f4
Not use hardcoded LOCAL_HOST_NAME but citus.local_hostname to distinguish loopback connections (#7436)
Fixes a bug that breaks queries from non-main dbs when
citus.local_hostname is set to a value different than "localhost".

This is a very old bug that doesn't cause a problem as long as the Citus
catalog is available to FindWorkerNode(). And the catalog is always available
unless we're in a non-main database, which might be the case on main but
not on older releases, hence not adding a `DESCRIPTION`. For this
reason, I don't see a reason to backport this.

Maybe we should totally refrain from using LOCAL_HOST_NAME in all code-paths,
but I'm not doing that in this PR as the other paths don't seem to be
breaking something that is user-facing.

```c
char *
GetAuthinfo(char *hostname, int32 port, char *user)
{
	char *authinfo = NULL;
	bool isLoopback = (strncmp(LOCAL_HOST_NAME, hostname, MAX_NODE_LENGTH) == 0 &&
					   PostPortNumber == port);

	if (IsTransactionState())
	{
		int64 nodeId = WILDCARD_NODE_ID;

		/* -1 is a special value for loopback connections (task tracker) */
		if (isLoopback)
		{
			nodeId = LOCALHOST_NODE_ID;
		}
		else
		{
			WorkerNode *worker = FindWorkerNode(hostname, port);
			if (worker != NULL)
			{
				nodeId = worker->nodeId;
			}
		}

		authinfo = GetAuthinfoViaCatalog(user, nodeId);
	}

	return (authinfo != NULL) ? authinfo : "";
}
```
2024-01-24 12:58:55 +00:00
Filip Sedlák 8b48d6ab02
Log username in the failed connection message (#7432)
This patch includes the username in the reported error message.
This makes debugging easier when certain commands open connections
as users other than the user that is executing the command.

```
monitora_snapshot=# SELECT citus_move_shard_placement(102030, 'monitora.db-dev-worker-a', 6005, 'monitora.db-dev-worker-a', 6017);
ERROR:  connection to the remote node monitora_user@monitora.db-dev-worker-a:6017 failed with the following error: fe_sendauth: no password supplied
Time: 40,198 ms
```
2024-01-24 11:24:23 +00:00
Halil Ozan Akgül 1cb2e1e4e8
Fixes create user queries from Citus non-main databases with other users (#7442)
This PR makes the connections to other nodes for
`mark_object_distributed` use the same user as
`execute_command_on_remote_nodes_as_user` so they'll use the same
connection.
2024-01-24 12:57:54 +03:00
Gürkan İndibay 863713e9b7
Refactors ExtendedTaskList methods (#7372)
ExecuteTaskListIntoTupleDestWithParam and ExecuteTaskListIntoTupleDest
are nearly the same. I parameterized them and made a reusable structure
here.

---------

Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
2024-01-24 06:00:19 +00:00
Teja Mupparti 11d7c27352 Fix assertions in other PG versions too, the original fix is in PR-7379 2024-01-23 15:10:06 -08:00
Jelte Fennema-Nio 9683bef2ec
Replace more spurious strdups with pstrdups (#7441)
DESCRIPTION: Remove a few small memory leaks

In #7440 one instance of a strdup was removed. But there were a few
more. This removes the ones that are left over, or adds a comment why
strdup is on purpose.
2024-01-23 13:28:26 +01:00
Marco Slot 72fbea20c4
Replace spurious strdup with pstrdup (#7440)
Not sure why we never found this using valgrind, but using strdup will
cause memory leaks because the pointer is not tracked in a memory
context.
2024-01-23 11:55:03 +01:00
eaydingol ee11492a0e
Generate qualified relation name (#7427)
This change refactors the code by using generate_qualified_relation_name
from id instead of using a sequence of functions to generate the
relation name.


Fixes #6602
2024-01-22 17:32:49 +03:00
zhjwpku 4b295cc857
Simplify CitusNewNode (#7434)
Postgres refactored newNode() in PG 17; the main point for doing this is
that the original trick is no longer necessary for modern compilers[1].

This does the same for Citus.

This should have no backward compatibility issues since it just replaces
palloc0fast with palloc0.

This is good for forward compatibility since palloc0fast no longer
exists in PG 17.

[1]
https://www.postgresql.org/message-id/b51f1fa7-7e6a-4ecc-936d-90a8a1659e7c@iki.fi
2024-01-22 14:55:14 +01:00
Gürkan İndibay 188614512f
Adds comment on database and role propagation (#7388)
DESCRIPTION: Adds comment on database and role propagation.
Example commands are as below

comment on database <db_name> is '<comment_text>'
comment on database <db_name> is NULL
comment on role <role_name> is '<comment_text>'
comment on role <role_name> is NULL

---------

Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
2024-01-18 20:58:44 +03:00
Jelte Fennema-Nio 5ec056a172
Add pytest test example about connecting to a worker (#7386)
I noticed while reviewing #7203 that there was no example of executing
SQL on a worker in the pytest README. Since this is a pretty common
thing that people want to do, this PR adds that.
2024-01-18 15:05:24 +03:00
Jelte Fennema-Nio fcfedff8d1
Support running isolation_update_node in flaky test detection (#7425)
I noticed in #7423 that `isolation_update_node` could not be run using
flaky test detection. This fixes that.
2024-01-17 15:36:26 +00:00
Valery 6cf6cf37fd
Adds information to explain output when using citus.explain_distributed_queries=false (#7412)
Fixes https://github.com/citusdata/citus/issues/6490
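
A hedged usage sketch of the setting in question (table name is illustrative):

```sql
SET citus.explain_distributed_queries TO off;

-- the plan output now includes a note explaining that the distributed
-- query plan was not expanded because of this setting
EXPLAIN SELECT count(*) FROM dist_table;
```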
2024-01-17 15:04:42 +00:00
zhjwpku 51e607878b
remove a duplicate forward declaration and polish some comments (#7371)
remove a duplicate forward declaration and polish some comments

Signed-off-by: Zhao Junwang <zhjwpku@gmail.com>
2024-01-17 14:30:23 +00:00
Karina 21464adfec
Make isolation_update_node test system independent (#7423)
Test isolation_update_node fails on some systems with the following error:
```
-s2: WARNING:  connection to the remote node non-existent:57637 failed with the following error: could not translate host name "non-existent" to address: Name or service not known
+s2: WARNING:  connection to the remote node non-existent:57637 failed with the following error: could not translate host name "non-existent" to address: Temporary failure in name resolution
```

This slightly modifies an already existing [normalization
rule](739c6d26df/src/test/regress/bin/normalize.sed (L217-L218))
to fix it.

Co-authored-by: Karina Litskevich <litskevichkarina@gmail.com>
2024-01-17 13:39:07 +00:00
Onur Tirtir 04b374fc01
Fix upgrade tests (#7413)
Adding the upgrade_basic_before_non_mixed.sql file: while
upgrade_basic_after_non_mixed exists, its "before" variation didn't exist
as we don't have any "before" steps. However, run_test.py assumes that
all "after" files do have a "before" variation as well. So this PR adds
an empty upgrade_basic_before_non_mixed.sql file.

Also, given that we no longer have a version called 12.1devel,
change it to 12.1.1.

And finally, let CI skip flakiness testing for upgrade tests, both
because it's quite hard to get the flaky-test-detection job working for
upgrade tests and also because in the end it is not very useful to test
upgrade tests for flakiness.
2024-01-16 12:37:18 +00:00
Halil Ozan Akgül 739c6d26df
Fix inserting to pg_dist_object for queries from other nodes (#7402)
Running a query from a Citus non-main database that inserts to
pg_dist_object requires a new connection to the main database itself.
This PR adds that connection to the main database.

---------

Co-authored-by: Jelte Fennema-Nio <github-tech@jeltef.nl>
2024-01-11 16:05:14 +03:00
Teja Mupparti 00068e07c5 Fix the incorrect column count after ALTER TABLE, this fixes the bug #7378 (please read the analysis in the bug for more information) 2024-01-10 12:49:44 -08:00
LightDB Enterprise Postgres 9a91136a3d
Fix timeout when underlying socket is changed in a MultiConnection (#7377)
When there are multiple localhost entries in /etc/hosts like the
following:
```
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1   localhost
```

the multi_cluster_management check will fail:
```

@@ -857,20 +857,21 @@
 ERROR:  group 14 already has a primary node
 -- check that you can add secondaries and unavailable nodes to a group
 SELECT groupid AS worker_2_group FROM pg_dist_node WHERE nodeport = :worker_2_port \gset
 SELECT 1 FROM master_add_node('localhost', 9998, groupid => :worker_1_group, noderole => 'secondary');
  ?column?
 ----------
         1
 (1 row)

 SELECT 1 FROM master_add_node('localhost', 9997, groupid => :worker_1_group, noderole => 'unavailable');
+WARNING:  could not establish connection after 5000 ms
  ?column?
 ----------
         1
 (1 row)
```

This actually isn't just a problem in test environments, but could occur
as well during actual usage when a hostname in pg_dist_node
resolves to multiple IPs and one of those IPs is unreachable.
Postgres will then automatically continue with the next IP, but
Citus should listen for events on the new socket, not on the
old one.

Co-authored-by: chuhx43211 <chuhx43211@hundsun.com>
2024-01-10 10:49:53 +00:00
zhjwpku 8e979f7ac6
[performance improvement] remove duplicate LoadShardList call (#7380)
LoadShardList is called twice, which is not necessary, and there is no
need to sort the shard placement list since we only want to know the list
length.
2024-01-10 11:15:19 +01:00
Onur Tirtir 1d55debb98
Support CREATE / DROP database commands from any node (#7359)
DESCRIPTION: Adds support for issuing `CREATE`/`DROP` DATABASE commands
from worker nodes

With this commit, we allow issuing CREATE / DROP DATABASE commands from
worker nodes too.
As in #7278, this is not allowed when the coordinator is not added to
metadata because we don't ever sync metadata changes to coordinator
when adding coordinator to the metadata via
`SELECT citus_set_coordinator_host('<hostname>')`, or equivalently, via
`SELECT citus_add_node(<coordinator_node_name>, <coordinator_node_port>, 0)`.

We serialize database management commands by acquiring a Citus-specific
advisory lock on the first primary worker node if there are any workers in the
cluster. As opposed to what we've done in https://github.com/citusdata/citus/pull/7278
for role management commands, we try to avoid running into distributed deadlocks
as much as possible. This is because, while distributed deadlocks that can happen around
role management commands can be detected by Citus, this is not the case for database
management commands because most of them cannot be run inside a transaction block.
In that case, Citus cannot even detect the distributed deadlock because the command is not
part of a distributed transaction at all, and then the command execution might not return
control back to the user for an indefinite amount of time.
2024-01-08 16:47:49 +00:00
Karina 20dc58cf5d
Fix getting heap tuple size (#7387)
This fixes #7230. 

First of all, using HeapTupleHeaderGetDatumLength(heapTuple) is
definitely wrong; it gives a number that's 4 times smaller than the correct
tuple size (heapTuple.t_len). See

https://github.com/postgres/postgres/blob/REL_16_0/src/include/access/htup_details.h#L455-L456

https://github.com/postgres/postgres/blob/REL_16_0/src/include/varatt.h#L279

https://github.com/postgres/postgres/blob/REL_16_0/src/include/varatt.h#L225-L226

When I fixed it, the limit_intermediate_size test failed, so I tried to
understand what's going on there. In the original commit fd546cf these
queries were supposed to fail. Then in b3af63c three of the queries that
were supposed to fail suddenly worked and the tests were changed to pass
without understanding why the output had changed or how to keep the test
testing what it had to test. Even comments saying that these queries
should fail were left untouched. The commit message gives no clue about why
exactly the test has changed:

> It seems that when we use adaptive executor instead of task tracker,
we
> exceed the intermediate result size less in the test. Therefore
updated
> the tests accordingly.

Then 3fda2c3 also blindly raised the limit for one of the queries to
keep it working:


3fda2c3254 (diff-a9b7b617f9dfd345318cb8987d5897143ca1b723c87b81049bbadd94dcc86570R19)

When in fe3caf3 that HeapTupleHeaderGetDatumLength(heapTuple) call was
finally added, one of those test queries started failing again.

The other two of them are now also failing after the fix. I don't understand
how exactly the calculation of "intermediate result size" that is
limited by citus.max_intermediate_result_size had changed through
b3af63c and fe3caf3, but these numbers are now closer to what
they originally were when this limitation was added in
fd546cf. So these queries should fail, like in the original
version of the limit_intermediate_size test.

Co-authored-by: Karina Litskevich <litskevichkarina@gmail.com>
2024-01-08 17:09:30 +01:00
Onur Tirtir 968ac74cde
Fix foreign_key_to_reference_shard_rebalance test (#7400)
foreign_key_to_reference_shard_rebalance failed because the partition for
year 2024 does not exist; fixed by adding a default partition.
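
A hedged sketch of the fix idea (table names are illustrative): a default partition catches rows whose year has no dedicated partition, so the test keeps working in 2024 and beyond.

```sql
CREATE TABLE events (event_time date) PARTITION BY RANGE (event_time);
CREATE TABLE events_2023 PARTITION OF events
  FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
-- rows outside the defined ranges (e.g. 2024 dates) land here
CREATE TABLE events_default PARTITION OF events DEFAULT;
```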

Replaces https://github.com/citusdata/citus/pull/7396 by adding a rule
that allows properly testing foreign_key_to_reference_shard_rebalance
via run_test.py.

Closes #7396

Co-authored-by: chuhx <148182736+cstarc1@users.noreply.github.com>
2024-01-04 13:16:45 +01:00
Onur Tirtir d940cfa992
Do nothing if the database is not distributed (#7392)
Fixes the remaining cases reported in
https://github.com/citusdata/citus/issues/7370.
2024-01-03 17:03:06 +03:00
Gürkan İndibay c3579eef06
Adds REASSIGN OWNED BY propagation (#7319)
DESCRIPTION: Adds REASSIGN OWNED BY propagation

This pull request introduces the propagation of the "Reassign owned by"
statement. It accommodates both local and distributed roles for both the
old and new assignments. However, when the old role is a local role, it
undergoes filtering and is not propagated. On the other hand, if the new
role is a local role, the process involves first creating the role on
worker nodes before propagating the "Reassign owned" statement.
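
A hedged example of the now-propagated statement (role names are illustrative):

```sql
REASSIGN OWNED BY old_owner TO new_owner;
```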
2023-12-28 15:15:58 +03:00
Gürkan İndibay 181b8ab6d5
Adds additional alter database propagation support (#7253)
DESCRIPTION: Adds database connection limit, rename and set tablespace
propagation
In this PR, below statement propagations are added

alter database <database_name> with allow_connections = <boolean_value>;
alter database <database_name> rename to <database_name2>;
alter database <database_name> set TABLESPACE <table_space_name>

---------

Co-authored-by: Jelte Fennema-Nio <github-tech@jeltef.nl>
Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
2023-12-26 14:55:04 +03:00
Halil Ozan Akgül b877d606c7
Adds 2PC distributed commands from other databases (#7203)
DESCRIPTION: Adds support for 2PC from non-Citus main databases

This PR only adds support for `CREATE USER` queries, other queries need
to be added. But it should be simple because this PR creates the
underlying structure.

Citus main database is the database where the Citus extension is
created. A non-main database is all the other databases that are in the
same node with a Citus main database.

When a `CREATE USER` query is run on a non-main database we:

1. Run `start_management_transaction` on the main database. This
function saves the outer transaction's xid (the non-main database
query's transaction id) and marks the current query as main db command.
2. Run `execute_command_on_remote_nodes_as_user("CREATE USER
<username>", <username to run the command>)` on the main database. This
function creates the users in the rest of the cluster by running the
query on the other nodes. The user on the current node is created by the
outer, non-main db query, to make sure subsequent commands
in the same transaction can see this user.
3. Run `mark_object_distributed` on the main database. This function
adds the user to `pg_dist_object` in all of the nodes, including the
current one.

This PR also implements transaction recovery for the queries from
non-main databases.
2023-12-22 19:19:41 +03:00
Jodi-Ann Francis 6801a1ed1e
PG16 update GRANT... ADMIN | INHERIT | SET, and REVOKE
Allowing GRANT ADMIN to now also be INHERIT or SET in support of PG16

GRANT role_name [, ...] TO role_specification [, ...] [ WITH { ADMIN |
INHERIT | SET } { OPTION | TRUE | FALSE } ] [ GRANTED BY
role_specification ]

Fixes: #7148 
Related: #7138

See review changes from https://github.com/citusdata/citus/pull/7164
2023-12-13 15:57:02 -05:00
Naisila Puka dbdde111c1
Add missing order by clause in failure_split_cleanup test (#7363)
https://github.com/citusdata/citus/actions/runs/6903353045/attempts/1#summary-18781959638
```diff
         ARRAY['-100000'],
         ARRAY[:worker_1_node, :worker_2_node],
         'force_logical');
 ERROR:  server closed the connection unexpectedly
 CONTEXT:  while executing command on localhost:9060
     SELECT operation_id, object_type, object_name, node_group_id, policy_type
     FROM pg_dist_cleanup where operation_id = 777 ORDER BY object_name;
  operation_id | object_type |                        object_name                        | node_group_id | policy_type 
 --------------+-------------+-----------------------------------------------------------+---------------+-------------
           777 |           1 | citus_failure_split_cleanup_schema.table_to_split_8981000 |             1 |           0
-          777 |           1 | citus_failure_split_cleanup_schema.table_to_split_8981002 |             1 |           1
           777 |           1 | citus_failure_split_cleanup_schema.table_to_split_8981002 |             2 |           0
+          777 |           1 | citus_failure_split_cleanup_schema.table_to_split_8981002 |             1 |           1
           777 |           1 | citus_failure_split_cleanup_schema.table_to_split_8981003 |             2 |           1
           777 |           4 | citus_shard_split_publication_1_10_777                    |             2 |           0
 (5 rows)
```

Similar attempt to fix in

c9f2fc892d
There was some more missing ORDER BY stuff, so I added it.
2023-11-24 18:26:06 +03:00
Naisila Puka c019acc01b
Run wal2json cdc test for pg16 as well (#7361)
pg16 wal2json package is now available, adding the tests back. Basically
reverting
f253bb3210

Sister PR https://github.com/citusdata/the-process/pull/153
2023-11-24 14:40:23 +03:00
Nils Dijk 0620c8f9a6
Sort includes (#7326)
This change adds a script to programmatically group all includes in a
specific order. The script was used as a one-time invocation to group
and sort all includes throughout our formatted code. The grouping is as
follows:

 - System includes (eg. `#include<...>`)
 - Postgres.h (eg. `#include "postgres.h"`)
- Toplevel imports from postgres, not contained in a directory (eg.
`#include "miscadmin.h"`)
 - General postgres includes (eg . `#include "nodes/..."`)
- Toplevel citus includes, not contained in a directory (eg. `#include
"citus_version.h"`)
 - Columnar includes (eg. `#include "columnar/..."`)
 - Distributed includes (eg. `#include "distributed/..."`)

Because it is quite hard to understand the difference between toplevel
citus includes and toplevel postgres includes, it hardcodes the list of
toplevel citus includes. In the same manner it treats anything not
prefixed with `columnar/` or `distributed/` as a postgres include.

The sorting/grouping is enforced by CI. Since we do so with our own
script, there are no changes required in our uncrustify configuration.
2023-11-23 18:19:54 +01:00
Gürkan İndibay 3b556cb5ed
Adds create / drop database propagation support (#7240)
DESCRIPTION: Adds support for propagating `CREATE`/`DROP` database

In this PR, create and drop database support is added.

For CREATE DATABASE:
* "oid" option is not supported
* specifying "strategy" to be different than "wal_log" is not supported
* specifying "template" to be different than "template1" is not
supported

The last two are because those are not saved in `pg_database` and when
activating a node, we cannot assume what parameters were provided when
creating the database.

And "oid" is not supported because whether the user specified an arbitrary
oid when creating the database is not saved in pg_database, and we want
to avoid oid collisions that might arise from attempting to use an
auto-assigned oid on workers.
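
Hedged examples of the restrictions described above (database names are illustrative):

```sql
CREATE DATABASE appdb;                            -- propagated to all nodes
CREATE DATABASE appdb2 WITH OID 123456;           -- "oid" option is not supported
CREATE DATABASE appdb3 WITH STRATEGY file_copy;   -- only wal_log is supported
CREATE DATABASE appdb4 WITH TEMPLATE template0;   -- only template1 is supported
```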

Finally, in case of node activation, GRANTs for the database are also
propagated.

---------

Co-authored-by: Jelte Fennema-Nio <github-tech@jeltef.nl>
Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
2023-11-21 16:43:51 +03:00
Naisila Puka cedcc220bf
Fixes flaky VACUUM (freeze, process toast true) result (#7348)
https://app.circleci.com/pipelines/github/citusdata/citus/34550/workflows/5b802f66-2666-4623-a209-6d7799f7ee5f/jobs/1229153
```diff
VACUUM (FREEZE, PROCESS_TOAST true) local_vacuum_table;
 SELECT relfrozenxid::text::integer > :frozenxid AS frozen_performed FROM pg_class
 WHERE oid=:reltoastrelid::regclass;
  frozen_performed 
 ------------------
- t
+ f
 (1 row)
```
Process toast option in vacuum was introduced in PG14. The failing test
was supposed to be a part of `multi_utilities.sql`, but it was included
in `pg14.sql` to avoid alternative output for PG13. See
ba62c0a148 (diff-ed03478f693155e2fe092e9ad356bf884dc097f554e8d75eff562d52bbcf7a75L255-L272)
for reference.
However, now that we don't support PG13 anymore, we can move this test
to `multi_utilities.sql`. Moving the test, plus inserting data before
running vacuum freeze such that the freeze is more meaningful and not
flaky, fixes the flakiness problem of the test.
2023-11-17 18:58:06 +03:00
Naisila Puka c88bf5ff1c
Cleanup leftover replication slots in publication test (#7354) 2023-11-17 15:11:38 +03:00
Japin Li e14e8667cc
Fix redundant variable declaration (#7353)
The `$workerCount` variable is declared twice in
src/test/regress/pg_regress_multi.pl.
2023-11-17 13:01:23 +03:00
Naisila Puka 0d1f18862b
Propagates SECURITY LABEL ON ROLE stmt (#7304)
We propagate `SECURITY LABEL [for provider] ON ROLE rolename IS
labelname` to the worker nodes.
We also make sure to run the relevant `SecLabelStmt` commands on a
newly added node by looking at roles found in `pg_shseclabel`.

See the official docs for an explanation of how this command works:
https://www.postgresql.org/docs/current/sql-security-label.html
This command stores the role label in the `pg_shseclabel` catalog table.
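
A hedged usage sketch; `example_provider`, `app_admin`, and the label value
are placeholders, and an actual label provider must be loaded for the
statement to succeed:

```sql
-- Assign a label to a role; Citus propagates the statement to all workers.
SECURITY LABEL FOR example_provider ON ROLE app_admin IS 'classified';

-- The label ends up in the shared catalog on the coordinator and the workers.
SELECT provider, label
FROM pg_shseclabel
WHERE objoid = 'app_admin'::regrole;
```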

This commit also fixes the regex string in
`check_gucs_are_alphabetically_sorted.sh` script such that it escapes
the dot. Previously it was looking for all strings starting with "citus"
instead of "citus." as it should.

To test this feature, I currently make use of a special GUC to control
label provider registration in PG_init when creating the Citus extension.
2023-11-16 13:12:30 +03:00
Naisila Puka c6fbb72c02
Fix flaky multi_prepare_plsql (#7346)
The test simply needed an `ORDER BY` clause.

Ran into this twice this week already!

https://github.com/citusdata/citus/actions/runs/6849701315/attempts/1#summary-18622563506

https://github.com/citusdata/citus/actions/runs/6875051160/attempts/1#summary-18698009952

```diff
 SELECT nspname, typname FROM pg_type JOIN pg_namespace ON pg_namespace.oid = pg_type.typnamespace WHERE typname = 'prepare_ddl_type_backup';
    nspname   |         typname         
 -------------+-------------------------
- public      | prepare_ddl_type_backup
  otherschema | prepare_ddl_type_backup
+ public      | prepare_ddl_type_backup
 (2 rows)
```
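
The fix is simply making the row order deterministic; the query above with an
ORDER BY added would look roughly like this (the exact sort key is an assumption):

```sql
SELECT nspname, typname
FROM pg_type
JOIN pg_namespace ON pg_namespace.oid = pg_type.typnamespace
WHERE typname = 'prepare_ddl_type_backup'
ORDER BY nspname;
```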
2023-11-15 13:28:43 +03:00
Naisila Puka a960799dfb
Clean up leftover replication slots in tests (#7338)
This commit fixes the flakiness in `logical_replication` and
`citus_non_blocking_split_shard_cleanup` tests. The flakiness
was related to leftover replication slots.
Below is a flaky example for each test:

logical_replication https://github.com/citusdata/citus/actions/runs/6721324131/attempts/1#summary-18267030604
citus_non_blocking_split_shard_cleanup https://github.com/citusdata/citus/actions/runs/6721324131/attempts/1#summary-18267006967

```diff
 -- Replication slots should be cleaned up
 SELECT slot_name FROM pg_replication_slots;
             slot_name            
 ---------------------------------
-(0 rows)
+ citus_shard_split_slot_19_10_17
+(1 row)
```

The tests by themselves are not flaky: 32 flaky-test
schedules, each with 20 runs, ran successfully.
https://github.com/citusdata/citus/actions/runs/6822020127?pr=7338

The conclusion is that:
1. `multi_tenant_isolation_nonblocking` is the problematic test running
before `logical_replication` in the `enterprise_schedule`, so I added a
cleanup at the end of `multi_tenant_isolation_nonblocking`.
https://github.com/citusdata/citus/actions/runs/6824334614/attempts/1#summary-18560127461
2. `citus_split_shard_by_split_points_negative` is the problematic test
running before `citus_non_blocking_split_shards_cleanup` in the split
schedule. I also added a cleanup line there.

For details on the investigation of leftover replication slots,
please check the PR https://github.com/citusdata/citus/pull/7338
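
For reference, a hypothetical teardown snippet for dropping such leftover
split slots; the actual fix adds cleanup steps to the offending tests rather
than dropping slots like this:

```sql
-- Drop any leftover shard-split replication slots so that later tests
-- start from a clean slate. The slot name pattern is illustrative.
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name LIKE 'citus_shard_split_slot_%';
```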
2023-11-14 18:50:54 +03:00
Naisila Puka cdef2d5224
Random tests refactoring (#7342)
While investigating replication slots leftovers
in PR https://github.com/citusdata/citus/pull/7338,
I ran into the following refactoring/cleanup
that can be done in our test suite:

- Add a separate test to remove non-default nodes
- Remove coordinator removal from `add_coordinator` test;
  use `remove_coordinator_from_metadata` test where needed
- Don't print nodeids in `multi_multiuser_auth` and
  `multi_poolinfo_usage` tests
- Use `startswith` when checking for isolation or failure tests
- Add some dependencies accordingly in `run_test.py` for running flaky
test schedules
2023-11-14 12:49:15 +03:00
Onur Tirtir 240313e286
Support role commands from any node (#7278)
DESCRIPTION: Adds support for issuing role management commands from worker nodes

It's unlikely to get into a distributed deadlock with role commands, so we
don't worry much about that at the moment.
There were several attempts to reduce the chances of a deadlock, but none of
them have been merged into the main branch yet, see:
#7325
#7016
#7009
2023-11-10 09:58:51 +00:00
Naisila Puka 57ff762c82
Fix VACUUM flakiness in multi_utilities (#7334)
When I run this test locally, the size of the table after the DELETE
command is around 58785792 bytes. Hence, the diffs suggest that
the VACUUM had no effect. The current solution is to run the VACUUM
command three times instead of once.

Example diff:
https://github.com/citusdata/citus/actions/runs/6722231142/attempts/1#summary-18269870674
```diff
insert into local_vacuum_table select i from generate_series(1,1000000) i;
 delete from local_vacuum_table;
 VACUUM local_vacuum_table;
 SELECT CASE WHEN s BETWEEN 20000000 AND 25000000 THEN 22500000 ELSE s END
 FROM pg_total_relation_size('local_vacuum_table') s ;
     s     
 ----------
- 22500000
+ 58785792
 (1 row)
```
See more diff examples in the PR description
https://github.com/citusdata/citus/pull/7334
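
In test form, the workaround boils down to something like this sketch
(based on the snippet above, not the exact test file):

```sql
INSERT INTO local_vacuum_table SELECT i FROM generate_series(1, 1000000) i;
DELETE FROM local_vacuum_table;
-- Run VACUUM several times: a single pass sometimes cannot reclaim/truncate
-- everything, e.g. when a concurrent process briefly pins a page.
VACUUM local_vacuum_table;
VACUUM local_vacuum_table;
VACUUM local_vacuum_table;
SELECT CASE WHEN s BETWEEN 20000000 AND 25000000 THEN 22500000 ELSE s END
FROM pg_total_relation_size('local_vacuum_table') s;
```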
2023-11-09 21:00:24 +03:00
dependabot[bot] d4663212f4 Bump werkzeug from 2.3.7 to 3.0.1 in /src/test/regress
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 2.3.7 to 3.0.1.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/werkzeug/compare/2.3.7...3.0.1)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-11-09 17:14:14 +01:00
Nils Dijk 0dac63afc0
move pg_version_constants.h to toplevel include (#7335)
In preparation for sorting and grouping all includes, we wanted to move
this file to the toplevel includes for better grouping/sorting.
2023-11-09 15:09:39 +00:00
Naisila Puka 0dc41ee5a0
Fix flaky multi_mx_insert_select_repartition test (#7331)
https://github.com/citusdata/citus/actions/runs/6745019678/attempts/1#summary-18336188930
```diff
     insert into target_table SELECT a*2 FROM source_table RETURNING a;
-NOTICE:  executing the command locally: SELECT bytes FROM fetch_intermediate_results(ARRAY['repartitioned_results_xxxxx_from_4213582_to_0','repartitioned_results_xxxxx_from_4213584_to_0']::text[],'localhost',57638) bytes
+NOTICE:  executing the command locally: SELECT bytes FROM fetch_intermediate_results(ARRAY['repartitioned_results_3940758121873413_from_4213584_to_0','repartitioned_results_3940758121873413_from_4213582_to_0']::text[],'localhost',57638) bytes
```

The elements in the array passed to `fetch_intermediate_results` are the
same, but in the opposite order than expected.

To fix this flakiness, we can omit the `"SELECT bytes FROM
fetch_intermediate_results..."` line. The surrounding log lines already
make it clear that the intermediate results have been fetched.
2023-11-08 15:15:33 +03:00
Onur Tirtir 444e6cb7d6
Remove useless variables (#7327)
To fix warnings observed when using different compiler versions.
2023-11-07 16:39:08 +03:00
cvbhjkl e535f53ce5
Fix typo in local_executor.c (#7324)
Fix a typo 'remaning' -> 'remaining' in local_executor.c
2023-11-03 12:14:11 +00:00
Onur Tirtir 21646ca1e9
Fix flaky isolation_get_all_active_transactions.spec test (#7323)
Fix the flaky test that results in the following diff by waiting (up to
5 seconds) until the backend that we want to terminate really terminates.

```diff
--- /__w/citus/citus/src/test/regress/expected/isolation_get_all_active_transactions.out.modified	2023-11-01 16:30:57.648749795 +0000
+++ /__w/citus/citus/src/test/regress/results/isolation_get_all_active_transactions.out.modified	2023-11-01 16:30:57.656749877 +0000
@@ -114,13 +114,13 @@
 --------------------
 t                   
 (1 row)
 
 step s3-show-activity: 
  SET ROLE postgres;
  select count(*) from get_all_active_transactions() where process_id IN (SELECT * FROM selected_pid);
 
 count
 -----
-    0
+    1
 (1 row)
```
2023-11-03 09:00:32 +01:00
Onur Tirtir 5e2439a117
Make some more tests re-runable (#7322)
* multi_mx_create_table
* multi_mx_function_table_reference
* multi_mx_add_coordinator
* create_role_propagation
* metadata_sync_helpers
* text_search

https://github.com/citusdata/citus/pull/7278 requires this.
2023-11-02 18:32:56 +03:00
Jelte Fennema-Nio 85b997a0fb
Fix flaky multi_alter_table_statements (#7321)
Sometimes multi_alter_table_statements would fail in CI like this:

```diff
 -- Verify that DROP NOT NULL works
 ALTER TABLE lineitem_alter ALTER COLUMN int_column2 DROP NOT NULL;
 SELECT "Column", "Type", "Modifiers" FROM table_desc WHERE relid='lineitem_alter'::regclass;
-     Column      |         Type          | Modifiers
----------------------------------------------------------------------
- l_orderkey      | bigint                | not null
- l_partkey       | integer               | not null
- l_suppkey       | integer               | not null
- l_linenumber    | integer               | not null
- l_quantity      | numeric(15,2)         | not null
- l_extendedprice | numeric(15,2)         | not null
- l_discount      | numeric(15,2)         | not null
- l_tax           | numeric(15,2)         | not null
- l_returnflag    | character(1)          | not null
- l_linestatus    | character(1)          | not null
- l_shipdate      | date                  | not null
- l_commitdate    | date                  | not null
- l_receiptdate   | date                  | not null
- l_shipinstruct  | character(25)         | not null
- l_shipmode      | character(10)         | not null
- l_comment       | character varying(44) | not null
- float_column    | double precision      | default 1
- date_column     | date                  |
- int_column1     | integer               |
- int_column2     | integer               |
- null_column     | integer               |
-(21 rows)
-
+ERROR:  schema "alter_table_add_column" does not exist
 -- COPY should succeed now
 SELECT master_create_empty_shard('lineitem_alter') as shardid \gset
 ```

Reading from table_desc apparently has an issue: if the schema gets
deleted from one of the items while it is being read, we get such
an error.

This change fixes that by not running multi_alter_table_statements in parallel
with alter_table_add_column anymore.

This is another instance of the same issue as in #7294
2023-11-02 16:42:45 +03:00
Jelte Fennema-Nio f171ec98fc
Fix flaky failure_distributed_results (#7307)
Sometimes in CI we run into this failure:

```diff
   SELECT resultId, nodeport, rowcount, targetShardId, targetShardIndex
   FROM partition_task_list_results('test', $$ SELECT * FROM source_table $$, 'target_table')
           NATURAL JOIN pg_dist_node;
-WARNING:  connection to the remote node localhost:xxxxx failed with the following error: connection not open
+ERROR:  connection to the remote node localhost:9060 failed with the following error: connection not open
 SELECT * FROM distributed_result_info ORDER BY resultId;
-       resultid        | nodeport | rowcount | targetshardid | targetshardindex
----------------------------------------------------------------------
- test_from_100800_to_0 |     9060 |       22 |        100805 |                0
- test_from_100801_to_0 |    57637 |        2 |        100805 |                0
- test_from_100801_to_1 |    57637 |       15 |        100806 |                1
- test_from_100802_to_1 |    57637 |       10 |        100806 |                1
- test_from_100802_to_2 |    57637 |        5 |        100807 |                2
- test_from_100803_to_2 |    57637 |       18 |        100807 |                2
- test_from_100803_to_3 |    57637 |        4 |        100808 |                3
- test_from_100804_to_3 |     9060 |       24 |        100808 |                3
-(8 rows)
-
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
 -- fetch from worker 2 should fail
 SAVEPOINT s1;
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
 SELECT fetch_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[], 'localhost', :worker_2_port) > 0 AS fetched;
-ERROR:  could not open file "base/pgsql_job_cache/xx_x_xxx/test_from_100802_to_1.data": No such file or directory
-CONTEXT:  while executing command on localhost:xxxxx
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
 ROLLBACK TO SAVEPOINT s1;
+ERROR:  savepoint "s1" does not exist
 -- fetch from worker 1 should succeed
 SELECT fetch_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[], 'localhost', :worker_1_port) > 0 AS fetched;
- fetched
----------------------------------------------------------------------
- t
-(1 row)
-
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
 -- make sure the results read are same as the previous transaction block
 SELECT count(*), sum(x) FROM
   read_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[],'binary') AS res (x int);
- count | sum
----------------------------------------------------------------------
-    15 | 863
-(1 row)
-
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
 ROLLBACk;
```

As outlined in #7306, which I created, the reason for this is related to
only having a single connection open to the node. Finding and fixing the
full cause is not trivial, so instead this PR starts working around
this bug by forcing maximum parallelism. Preferably we'd want
this workaround not to be necessary, but that requires
spending time to fix the underlying issue. For now, having a less flaky CI is
good enough.
2023-11-02 12:31:56 +00:00
Jelte Fennema-Nio b47c8b3fb0
Fix flaky insert_select_connection_leak (#7302)
Sometimes in CI insert_select_connection_leak would fail like this:

```diff
 END;
 SELECT worker_connection_count(:worker_1_port) - :pre_xact_worker_1_connections AS leaked_worker_1_connections,
        worker_connection_count(:worker_2_port) - :pre_xact_worker_2_connections AS leaked_worker_2_connections;
  leaked_worker_1_connections | leaked_worker_2_connections
 -----------------------------+-----------------------------
-                           0 |                           0
+                          -1 |                           0
 (1 row)

 -- ROLLBACK
 BEGIN;
 INSERT INTO target_table SELECT * FROM source_table;
 INSERT INTO target_table SELECT * FROM source_table;
 ROLLBACK;
 SELECT worker_connection_count(:worker_1_port) - :pre_xact_worker_1_connections AS leaked_worker_1_connections,
        worker_connection_count(:worker_2_port) - :pre_xact_worker_2_connections AS leaked_worker_2_connections;
  leaked_worker_1_connections | leaked_worker_2_connections
 -----------------------------+-----------------------------
-                           0 |                           0
+                          -1 |                           0
 (1 row)

 \set VERBOSITY TERSE
 -- Error on constraint failure
 BEGIN;
 INSERT INTO target_table SELECT * FROM source_table;
 SELECT worker_connection_count(:worker_1_port) AS worker_1_connections,
        worker_connection_count(:worker_2_port) AS worker_2_connections \gset
 SAVEPOINT s1;
 INSERT INTO target_table SELECT a, CASE WHEN a < 50 THEN b ELSE null END  FROM source_table;
@@ -89,15 +89,15 @@
  leaked_worker_1_connections | leaked_worker_2_connections
 -----------------------------+-----------------------------
                            0 |                           0
 (1 row)

 END;
 SELECT worker_connection_count(:worker_1_port) - :pre_xact_worker_1_connections AS leaked_worker_1_connections,
        worker_connection_count(:worker_2_port) - :pre_xact_worker_2_connections AS leaked_worker_2_connections;
  leaked_worker_1_connections | leaked_worker_2_connections
 -----------------------------+-----------------------------
-                           0 |                           0
+                          -1 |                           0
 (1 row)
```

Source:
https://github.com/citusdata/citus/actions/runs/6718401194/attempts/1#summary-18258258387

A negative amount of leaked connections is obviously not possible. For
some reason there was a connection open when we checked the initial
amount of connections that was closed afterwards. This could be
from the maintenance daemon, or maybe from the previous test that had not
fully closed its connections just yet.

The change in this PR doesn't actually fix the cause of the negative
connection count, but it simply considers it good as well, by changing the
result to zero for negative values.

With this fix we might sometimes miss a leak, because the negative
number can cancel out the leak and still result in a 0. But since the
negative number only occurs sometimes, we'll still find the leak often
enough.
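
A sketch of that clamping, reusing the test's `worker_connection_count`
helper and psql variables; `GREATEST` rounds negative values up to zero:

```sql
SELECT GREATEST(0, worker_connection_count(:worker_1_port) - :pre_xact_worker_1_connections)
         AS leaked_worker_1_connections,
       GREATEST(0, worker_connection_count(:worker_2_port) - :pre_xact_worker_2_connections)
         AS leaked_worker_2_connections;
```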
2023-11-02 13:15:43 +01:00
Cédric Villemain 0678a2fd89
Fix #7242, CALL(@0) crash backend (#7288)
When executing a prepared CALL, which is not pure SQL but is available with
some drivers like npgsql and pgjdbc, Citus entered a code path where a
plan is not defined while trying to increase its cost, causing a SIGSEGV
(signal 11) when the plan is a NULL pointer.

Fix by only increasing the plan cost when the plan is not null.

However, it is a bit suspicious to get here with a NULL plan, and maybe a
better change would be to not call
ShardPlacementForFunctionColocatedWithDistTable() with a NULL plan at
all (in call.c:134).

The bug was hit with, for example:
```
CallableStatement proc = con.prepareCall("{CALL p(?)}");
proc.registerOutParameter(1, java.sql.Types.BIGINT);
proc.setInt(1, -100);
proc.execute();
```

where `p(bigint)` is a distributed "function" and the param the
distribution key (also in a distributed table), see #7242 for details

Fixes #7242
2023-11-02 13:15:24 +01:00
Jelte Fennema-Nio 5a48a1602e
Debug flaky logical_replication test (#7309)
Sometimes in CI our logical_replication test fails like this:

```diff
+++ /__w/citus/citus/src/test/regress/results/logical_replication.out.modified	2023-11-01 14:15:08.562758546 +0000
@@ -40,21 +40,21 @@

 SELECT count(*) from pg_publication;
  count
 -------
      0
 (1 row)

 SELECT count(*) from pg_replication_slots;
  count
 -------
-     0
+     1
 (1 row)

 SELECT count(*) FROM dist;
  count
 -------
```

It's hard to understand what is going on here, just based on the wrong
number. So this PR changes the test to show the name of the
subscription, publication and replication slot to make finding the cause
easier.

In passing this also fixes another flaky test in the same file that our
flaky test detection picked up. This is done by waiting for resource
cleanup after the shard move.
2023-11-02 13:15:02 +01:00
Onur Tirtir 9867c5b949
Fix flaky multi_mx_node_metadata.sql test (#7317)
Fixes the flaky test that results in following diff:
```diff
--- /__w/citus/citus/src/test/regress/expected/multi_mx_node_metadata.out.modified	2023-11-01 14:22:12.890476575 +0000
+++ /__w/citus/citus/src/test/regress/results/multi_mx_node_metadata.out.modified	2023-11-01 14:22:12.914476657 +0000
@@ -840,24 +840,26 @@
 (1 row)
 
 \c :datname - - :master_port
 SELECT datname FROM pg_stat_activity WHERE application_name LIKE 'Citus Met%';
   datname   
 ------------
  db_to_drop
 (1 row)
 
 DROP DATABASE db_to_drop;
+ERROR:  database "db_to_drop" is being accessed by other users
 SELECT datname FROM pg_stat_activity WHERE application_name LIKE 'Citus Met%';
   datname   
 ------------
-(0 rows)
+ db_to_drop
+(1 row)
 
 -- cleanup
 DROP SEQUENCE sequence CASCADE;
 NOTICE:  drop cascades to default value for column a of table reference_table
```
2023-11-02 11:02:34 +00:00
Gürkan İndibay 184c8fc1ee
Enriches statement propagation document (#7267)
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
2023-11-02 09:59:34 +00:00
Jelte Fennema-Nio a6e86884f6
Fix flaky isolation_metadata_sync_deadlock (#7312)
Sometimes isolation_metadata_sync_deadlock fails in CI like this:

```diff
diff -dU10 -w /__w/citus/citus/src/test/regress/expected/isolation_metadata_sync_deadlock.out /__w/citus/citus/src/test/regress/results/isolation_metadata_sync_deadlock.out
--- /__w/citus/citus/src/test/regress/expected/isolation_metadata_sync_deadlock.out.modified	2023-11-01 16:03:15.090199229 +0000
+++ /__w/citus/citus/src/test/regress/results/isolation_metadata_sync_deadlock.out.modified	2023-11-01 16:03:15.098199312 +0000
@@ -110,10 +110,14 @@
 t
 (1 row)

 step s2-stop-connection:
  SELECT stop_session_level_connection_to_node();

 stop_session_level_connection_to_node
 -------------------------------------

 (1 row)
+
+teardown failed: ERROR:  localhost:57638 is a metadata node, but is out of sync
+HINT:  If the node is up, wait until metadata gets synced to it and try again.
+CONTEXT:  SQL statement "SELECT master_remove_distributed_table_metadata_from_workers(v_obj.objid, v_obj.schema_name, v_obj.object_name)"
```

Source:
https://github.com/citusdata/citus/actions/runs/6721938040/attempts/1#summary-18268946448

To fix this we now wait for the metadata to be fully synced to all
nodes at the start of the teardown steps.
2023-11-02 10:39:05 +01:00
Onur Tirtir 2cf4c04023
Fix flaky global_cancel.sql test (#7316) 2023-11-01 23:59:41 +01:00
Jelte Fennema-Nio e3c93c303d
Fix flaky citus_non_blocking_split_shard_cleanup (#7311)
Sometimes in CI citus_non_blocking_split_shard_cleanup failed like this:

```diff
--- /__w/citus/citus/src/test/regress/expected/citus_non_blocking_split_shard_cleanup.out.modified	2023-11-01 15:07:14.280551207 +0000
+++ /__w/citus/citus/src/test/regress/results/citus_non_blocking_split_shard_cleanup.out.modified	2023-11-01 15:07:14.292551358 +0000
@@ -106,21 +106,22 @@
 -----------------------------------

 (1 row)

 \c - - - :worker_2_port
 SET search_path TO "citus_split_test_schema";
 -- Replication slots should be cleaned up
 SELECT slot_name FROM pg_replication_slots;
             slot_name
 ---------------------------------
-(0 rows)
+ citus_shard_split_slot_19_10_17
+(1 row)

 -- Publications should be cleanedup
 SELECT count(*) FROM pg_publication;
  count
```

It's expected that the replication slot is sometimes not cleaned up if
we don't wait until resource cleanup completes. This PR starts doing
that here.
2023-11-01 16:21:12 +00:00
Jelte Fennema-Nio c9f2fc892d
Fix flaky failure_split_cleanup (#7299)
Sometimes failure_split_cleanup failed in CI like this:

```diff
 ERROR:  server closed the connection unexpectedly
 CONTEXT:  while executing command on localhost:9060
     SELECT operation_id, object_type, object_name, node_group_id, policy_type
     FROM pg_dist_cleanup where operation_id = 777 ORDER BY object_name;
  operation_id | object_type |                        object_name                        | node_group_id | policy_type
 --------------+-------------+-----------------------------------------------------------+---------------+-------------
           777 |           1 | citus_failure_split_cleanup_schema.table_to_split_8981000 |             1 |           0
-          777 |           1 | citus_failure_split_cleanup_schema.table_to_split_8981002 |             1 |           1
           777 |           1 | citus_failure_split_cleanup_schema.table_to_split_8981002 |             2 |           0
+          777 |           1 | citus_failure_split_cleanup_schema.table_to_split_8981002 |             1 |           1
           777 |           1 | citus_failure_split_cleanup_schema.table_to_split_8981003 |             2 |           1
           777 |           4 | citus_shard_split_publication_1_10_777                    |             2 |           0
 (5 rows)

     -- we need to allow connection so that we can connect to proxy
```

Source:
https://github.com/citusdata/citus/actions/runs/6717642291/attempts/1#summary-18256014949

It's the common problem of a missing column in the ORDER BY
clause. This fixes that by adding node_group_id to the ORDER BY of the query
in question.
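
With the extra sort column, the query would look roughly like this:

```sql
SELECT operation_id, object_type, object_name, node_group_id, policy_type
FROM pg_dist_cleanup
WHERE operation_id = 777
ORDER BY object_name, node_group_id;
```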
2023-11-01 14:08:51 +00:00
Jelte Fennema-Nio c83c556702
Fix flaky isolation_master_update_node (#7303)
Sometimes in CI isolation_master_update_node fails like this:

```diff
 ------------------

 (1 row)

 step s2-abort: ABORT;
 step s1-abort: ABORT;
 FATAL:  terminating connection due to administrator command
 FATAL:  terminating connection due to administrator command
 SSL connection has been closed unexpectedly
+server closed the connection unexpectedly

 master_remove_node
 ------------------

```

This just seems like a random error line. The only way to reasonably fix
this is by adding an extra output file, so that's what this PR does.
2023-11-01 16:44:45 +03:00
Jelte Fennema-Nio 0d83ab57de
Fix flaky multi_cluster_management (#7295)
One of our flakiest and most annoying tests is
multi_cluster_management. It usually fails like this:
```diff
 SELECT citus_disable_node('localhost', :worker_2_port);
  citus_disable_node
 --------------------

 (1 row)

 SELECT public.wait_until_metadata_sync(60000);
+WARNING:  waiting for metadata sync timed out
  wait_until_metadata_sync
 --------------------------

 (1 row)

```

This tries to address that by hardening wait_until_metadata_sync. I
believe the reason for this warning is that there is a race condition in
wait_until_metadata_sync: it's possible for the pre-check to fail, then
have the maintenance daemon send a notification, and only then have the
backend start to listen. I tried to fix it in two ways:
1. First run LISTEN, and only then do the pre-check.
2. If we time out, check again (as sketched below) just to make sure that
   we did not miss the notification somehow, and don't show a warning if
   all metadata is synced after the timeout.
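
A hedged sketch of what the post-timeout re-check boils down to at the
catalog level (the real check is implemented in C inside
wait_until_metadata_sync):

```sql
-- Metadata is fully synced when every node that should have metadata
-- is marked as synced in pg_dist_node.
SELECT COALESCE(bool_and(metadatasynced), true) AS all_metadata_synced
FROM pg_dist_node
WHERE hasmetadata;
```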

It's hard to know for sure that this fixes it because the test is not
repeatable and I could not reproduce it locally. Let's just hope for the
best.

---------

Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
2023-11-01 10:46:01 +00:00
Jelte Fennema-Nio 20ae42e7fa
Fix flaky multi_reference_table test (#7294)
Sometimes multi_reference_table failed in CI like this:

```diff
 \c - - - :master_port
 DROP INDEX reference_schema.reference_index_2;
 \c - - - :worker_1_port
 SELECT "Column", "Type", "Modifiers" FROM table_desc WHERE relid='reference_schema.reference_table_ddl_1250019'::regclass;
- Column  |            Type             |  Modifiers
----------------------------------------------------------------------
- value_2 | double precision            | default 25.0
- value_3 | text                        | not null
- value_4 | timestamp without time zone |
- value_5 | double precision            |
-(4 rows)
-
+ERROR:  schema "citus_local_table_queries" does not exist
 \di reference_schema.reference_index_2*
           List of relations
  Schema | Name | Type | Owner | Table
```

Source:
https://github.com/citusdata/citus/actions/runs/6707535961/attempts/2#summary-18226879513

Reading from table_desc apparently has an issue: if the schema gets
deleted from one of the items while it is being read, we get such
an error.

This change fixes that by not running multi_reference_table in parallel
with citus_local_tables_queries anymore.
2023-11-01 10:12:06 +00:00
Cédric Villemain 37415ef8f5
Allow citus_*_size on index related to a distributed table (#7271)
I just enhanced the existing code to check if the relation is an index
belonging to a distributed table.
If so, the shardId is appended to the relation (index) name and the *_size
functions are executed as before.

There is a change in an extern function:
  `extern StringInfo GenerateSizeQueryOnMultiplePlacements(...)`
It's possible to create a new function and deprecate this one later if
compatibility is an issue.

Fixes https://github.com/citusdata/citus/issues/6496.

DESCRIPTION: Allows using Citus size functions on distributed table
indexes.
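
A hedged usage sketch; table and index names are made up:

```sql
CREATE TABLE dist_events (id bigint, payload text);
SELECT create_distributed_table('dist_events', 'id');
CREATE INDEX dist_events_id_idx ON dist_events (id);

-- These now work on the index as well, summing the sizes of its shard indexes.
SELECT citus_relation_size('dist_events_id_idx');
SELECT citus_total_relation_size('dist_events_id_idx');
```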

---------

Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
2023-11-01 09:05:51 +00:00
Jelte Fennema-Nio a76a832553
Fix flaky validate_constraint test (#7293)
Sometimes validate constraint would fail like this:

```diff
  validatable_constraint_8000016 | t
 (10 rows)

 DROP TABLE constrained_table;
+ERROR:  deadlock detected
+DETAIL:  Process 16602 waits for ShareRowExclusiveLock on relation 56258 of database 16384; blocked by process 16601.
+Process 16601 waits for AccessShareLock on relation 56120 of database 16384; blocked by process 16602.
+HINT:  See server log for query details.
 DROP TABLE referenced_table CASCADE;
 DROP TABLE referencing_table;
 DROP SCHEMA validate_constraint CASCADE;
-NOTICE:  drop cascades to 3 other objects
+NOTICE:  drop cascades to 4 other objects
 DETAIL:  drop cascades to type constraint_validity
 drop cascades to view constraint_validations_in_workers
 drop cascades to view constraint_validations
+drop cascades to table constrained_table
 SET search_path TO DEFAULT;

```

Source:
https://github.com/citusdata/citus/actions/runs/6708383699?pr=7291

This change fixes that by no longer running the test together with the
foreign_key_to_reference_table test. In passing, it also
simplifies dropping the test's resources.
2023-11-01 09:41:28 +01:00
Emel Şimşek ee8f4bb7e8
Start Maintenance Daemon for Main DB at the server start. (#7254)
DESCRIPTION: This change starts a maintenance daemon at the time of
server start if there is a designated main database.

This is the code flow (a configuration sketch follows the steps below):

1. User designates a main database:
   `ALTER SYSTEM SET citus.main_db =  "myadmindb";`

2. When postmaster starts, in _PG_Init, citus calls 
    `InitializeMaintenanceDaemonForMainDb`
  
This function registers a background worker to run
`CitusMaintenanceDaemonMain `with `databaseOid = 0 `

3. `CitusMaintenanceDaemonMain ` takes some special actions when
databaseOid is 0:
     - Gets the citus.main_db  value.
     - Connects to the  citus.main_db
     - Now the `MyDatabaseId `is available, creates a hash entry for it.
     - Then follows the same control flow as for a regular db.
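
A configuration sketch for step 1 (the database name is an example; the
setting is read at postmaster start, so a server restart is needed):

```sql
-- Designate the main database; picked up during _PG_init at the next server
-- start, which registers the background worker for the maintenance daemon.
ALTER SYSTEM SET citus.main_db = 'myadmindb';
-- Then restart the server, e.g. pg_ctl restart -D $PGDATA
```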
2023-10-30 09:44:13 +03:00
Benjamin O f9218d9780
Support replacing IPv6 Loopback in `normalize.sed` (#7269)
I had a test failure issue due to my machine using the IPv6 loopback
address. This change to the `normalize.sed` solves that issue.
2023-10-27 16:42:55 +02:00
Naisila Puka 10198b18e8
Technical readme small fixes (#7261) 2023-10-23 13:43:43 +03:00
Naisila Puka 1fe16fa746
Remove unnecessary pre-fastpath code (#7262)
This code was here because we first implemented
`fast path planner` via
[#2606](https://github.com/citusdata/citus/pull/2606)
and then later `deferred pruning` via
[#3369](https://github.com/citusdata/citus/pull/3369).
So, for some years, this code was useful.
2023-10-23 13:01:48 +03:00
zhjwpku 2d1444188c
Fix wrong comments around HasDistributionKey() (#7223)
HasDistributionKey & HasDistributionKeyCacheEntry returns true when the
corresponding table has a distribution key, the comments state the
opposite,
which should be fixed.

Signed-off-by: Zhao Junwang <zhjwpku@gmail.com>
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
2023-10-18 10:53:00 +02:00
Onur Tirtir db13afaa7b
Fix flaky columnar_create.sql test (#7266) 2023-10-17 16:58:17 +03:00
Gürkan İndibay 71a4633dad
Fixes typo and renames multi_process_utility (#7259) 2023-10-17 16:39:37 +03:00
Jelte Fennema-Nio 788e09a39a
Add a test for citus_shards where table names have spaces (#7224)
There was a bug reported for previous versions of Citus where
shard\_size was returning NULL for tables with spaces in their names. It works
fine on the main branch, but I'm still adding a test for this to
the main branch because it seems like a good test to have.
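
A sketch of such a test; the table name is made up and the citus_shards
column names are as I recall them:

```sql
CREATE TABLE "table with spaces" (a int);
SELECT create_distributed_table('"table with spaces"', 'a');

-- shard_size should be a number, not NULL, even though the name has spaces.
SELECT table_name, shard_size
FROM citus_shards
WHERE table_name = '"table with spaces"'::regclass;
```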
2023-10-16 11:38:24 +02:00
Emel Şimşek e9035f6d32
Send keepalive messages in split decoder periodically to avoid wal receiver timeouts during large shard splits. (#7229)
DESCRIPTION: Send keepalive messages during the logical replication
phase of large shard splits to avoid timeouts.

During the logical replication part of the shard split process, the split
decoder filters out the WAL records produced by the initial copy. If the
number of WAL records is big, the split decoder ends up processing for
a long time before sending out any WAL records through pgoutput. Hence
the WAL receiver may time out and restart repeatedly, causing our split
driver catch-up logic to fail.

Notes: 

1. If the wal_receiver_timeout is set to a very small number, e.g. 600ms,
it may time out before receiving the keepalives. My tests show that this
code works best when `wal_receiver_timeout` is set to 1 minute, which
is the default value (see the configuration sketch below).

2. Once a logical replication worker times out, a new one gets launched.
The new logical replication worker sets the pg_stat_subscription columns
to initial values, e.g. the latest_end_lsn is set to 0. Our driver logic
in `WaitForGroupedLogicalRepTargetsToCatchUp` cannot handle the LSN value
going backwards. This is the main reason for it to get stuck in an infinite
loop.
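
A hedged configuration sketch for keeping the receiver timeout at the
recommended default on the nodes involved:

```sql
-- A very small wal_receiver_timeout can still expire before a keepalive
-- arrives; the default of 1 minute worked well in testing.
SHOW wal_receiver_timeout;
ALTER SYSTEM SET wal_receiver_timeout = '1min';
SELECT pg_reload_conf();
```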
2023-10-09 22:33:08 +03:00
Nils Dijk 6d8725efb0
Fix leaking of memory and memory contexts in Foreign Constraint Graphs (#7236)
DESCRIPTION: Fix leaking of memory and memory contexts in Foreign
Constraint Graphs

Previously, every time we (re)created the Foreign Constraint
Relationship Graph, we created a new Memory Context while losing the
reference to the previous context. That old context could still hold
leftover memory, causing a memory leak.

With this patch we statically have one memory context that we lazily
initialize the first time we create our foreign constraint relationship
graph. On every subsequent creation, besides destroying our previous
hashmap, we also reset our memory context to remove any leftover
references.
2023-10-09 13:05:51 +02:00