Commit Graph

5843 Commits (c1f58fac6c151fb96e1281bb968545f1dfd6550c)

Jelte Fennema c1f58fac6c Don't run any isolation tests in parallel (#6212)
By running isolation tests in parallel we're just asking for flaky test
runs. The first test might temporarily block one of the commands in the
second test, which we then detect as waiting, like this:
```diff
 step s2-vacuum-analyze:
     VACUUM ANALYZE test_insert_vacuum;
-
+ <waiting ...>
 step s1-commit:
     COMMIT;

+step s2-vacuum-analyze: <... completed>
```

Debugging flaky tests is also much harder when they are run in parallel.
This PR starts running all our isolation tests sequentially.
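
For context, in pg_regress-style schedule files every test listed on a
single `test:` line runs in parallel, while separate `test:` lines run
sequentially, so the change amounts to splitting up lines in the
isolation schedule. A minimal sketch, with illustrative test names:

```diff
-test: isolation_concurrent_dml isolation_data_migration
+test: isolation_concurrent_dml
+test: isolation_data_migration
```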

The reason for opening this PR was seeing this failing test, as well as
having recently fixed a similar issue in #6122:
https://app.circleci.com/pipelines/github/citusdata/citus/26194/workflows/ff57e2cf-8ac4-40fe-bc0c-74a7f8fecb53/jobs/740454

(cherry picked from commit 85305b2773)
2022-09-07 13:27:49 +02:00
Jelte Fennema 3d67ae6497 Fix flakyness in failure_insert_select_repartition (#6202)
This fixes the failure test that most commonly fails at random. The
failing diff is as follows:

```diff
SELECT citus.mitmproxy('conn.onQuery(query="fetch_intermediate_results").kill()');
  mitmproxy
 -----------

 (1 row)

 INSERT INTO target_table SELECT * FROM source_table;
-ERROR:  connection to the remote node localhost:xxxxx failed with the following error: connection not open
+ERROR:  could not open file "base/pgsql_job_cache/10_0_40/repartitioned_results_20770193413_from_4213590_to_1.data": No such file or directory
+CONTEXT:  while executing command on localhost:9060
+while executing command on localhost:57637
 SELECT * FROM target_table ORDER BY a;
```

As far as I can tell, this is caused by a race condition: after killing
fetch_intermediate_results on worker 9060, the previously created data
file gets cleaned up. The fetch_intermediate_results call that's sent
to worker 57637 will soon be cancelled and rolled back because of the
failure on the other connection. But if that fetch_intermediate_results
call manages to connect to 9060 before it is cancelled, it won't find
the file it's looking for there anymore. So while it's not the error we
expect, it does indicate that the failure injection succeeded.

To avoid this issue instead of killing the fetch_intermediate_results
call directly, we kill the COPY command that it uses to do the fetch.
This results in stable output as can be seen here, where 227 runs of
failure_insert_select_repartition succeeded:
https://app.circleci.com/pipelines/github/citusdata/citus/26168/workflows/9c64a3b6-f46c-4725-9fb4-8f6a2d00a023/jobs/739389
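
A sketch of that change (the exact query pattern used in the fixed test
may differ):

```diff
-SELECT citus.mitmproxy('conn.onQuery(query="fetch_intermediate_results").kill()');
+SELECT citus.mitmproxy('conn.onQuery(query="COPY").kill()');
```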

To be clear, this changes the test to affect the opposite
fetch_intermediate_results call: it now kills the fetch_intermediate_results
call of worker 57637, instead of the fetch_intermediate_results call on
worker 9060.

Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26147/workflows/780e95ea-264a-4c9f-ad2e-cf11449a795e/jobs/738467

(cherry picked from commit 8ce12eb51f)
2022-09-07 13:27:49 +02:00
Önder Kalacı 173b7ed5ad Properly add / remove coordinator for isolation tests (#6181)
We used to rely on a separate session to add the coordinator.
However, that might prevent the existing sessions from getting
assigned proper gpids, which causes flaky tests.
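
A sketch of the approach, assuming the usual test coordinator port and
the `citus_add_node` UDF; the actual setup block may differ:

```sql
-- register the coordinator from the test's own setup instead of a
-- separate session, so existing sessions get assigned proper gpids
SELECT 1 FROM citus_add_node('localhost', 57636, groupid => 0);
```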

(cherry picked from commit 961fcff5db)
2022-09-07 13:27:49 +02:00
Jelte Fennema eed4d6452d Fix flakyness in columnar_first_row_number test (#6192)
When columnar_first_row_number was run in parallel with the
columnar_query test, it would sometimes fail. This bug is tracked
in #6191. For now, to make CI less flaky, we simply don't run these
tests in parallel.

Example of failed test: https://app.circleci.com/pipelines/github/citusdata/citus/26106/workflows/75d00ea9-23f8-4bff-a927-bced19e1f81b/jobs/736713

Fixes #6184

(cherry picked from commit 0a045afd3a)
2022-09-07 13:27:49 +02:00
Jelte Fennema aa879108b7 Remove the flaky rollback_to_savepoint test (#6190)
This removes a flaky test that I introduced in #3868 after I fixed the
issue described in #3622. This test sometimes fails randomly in CI.
The way it fails indicates that there might be some bug: a connection
breaks after rolling back to a savepoint.

I tried reproducing this issue locally, but I wasn't able to. I don't
understand what causes the failure.

Things that I tried were:

1. Running the test with:
   ```sql
   SET citus.force_max_query_parallelization = true;
   ```
2. Running the test with:
   ```sql
   SET citus.max_adaptive_executor_pool_size = 1;
   ```
3. Running the test in parallel with the same tests that it is run in
   parallel with in multi_schedule.

None of these allowed me to reproduce the issue locally.

So I think it's time to give up on fixing this test and simply remove
it. The regression that this test protects against seems very unlikely
to reappear, since in #3868 I also added a big comment about the need
for the newly added `UnclaimConnection` call. So the need for the test
is quite small, and removing it will make our CI less flaky.

In case the cause of the bug ever gets found, I tracked the bug in #6189

Example of a failing CI run:
https://app.circleci.com/pipelines/github/citusdata/citus/26098/workflows/f84741d9-13b1-4ae7-9155-c21ed3466951/jobs/736424

For reference, the unexpected diff is this (so both warnings and an error):
```diff
 INSERT INTO t SELECT i FROM generate_series(1, 100) i;
+WARNING:  connection to the remote node localhost:57638 failed with the following error:
+WARNING:
+CONTEXT:  while executing command on localhost:57638
+ERROR:  connection to the remote node localhost:57638 failed with the following error:
 ROLLBACK;
```

This test is also mentioned as the most failing regression test in #5975

(cherry picked from commit d16b458e2a)
2022-09-07 13:27:49 +02:00
Jelte Fennema ee887ef648 Fix flakyness in create index concurrently isolation tests (#6158)
This creates consistent test output for isolation tests that involve
`CREATE INDEX CONCURRENTLY`. `CREATE INDEX CONCURRENTLY` is sometimes
temporarily detected as blocking, even though it will complete without any other
queries needing to be run. This change makes sure that we wait until that happens,
without running any other queries in the meantime. This way we always get consistent
output. The way we do that is by using an empty step in the same
session as the `CREATE INDEX CONCURRENTLY` command. Doing so forces
the isolation tester to wait until the command is finished rather than continue
with steps from other sessions. This is [the recommended approach by Postgres][1].

There are two separate cases, which are addressed in slightly different ways
(a spec-file sketch of the second case follows this list):
1. If `CREATE INDEX CONCURRENTLY` is actually blocked on another session: Add an
    empty step right after the commit of blocking session.
    e.g. `"s2-ddl-create-index-concurrently" "s1-commit" "s2-empty"`
2. If it's not actually blocked on another session: Add [an asterisk marker][2] to make
    it look like it's blocked (because sometimes this happens randomly) and right
    after that we add an empty step to trigger waiting.
    e.g. `"s2-ddl-create-index-concurrently"(*) "s2-empty" "s1-commit"`

In passing this also enables isolation tests that were disabled due to a
bug that has already been fixed for a while.

Fixes #5993
Related to #5910 and #2966

[1]: 5f0adec253/src/test/isolation/README (L197-L204)
[2]: 5f0adec253/src/test/isolation/README (L174-L179)

Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
(cherry picked from commit fd07cc9baf)
2022-09-07 13:27:49 +02:00
Jelte Fennema a9a114145a Fix flakyness in isolation_data_migration.spec (#6122)
The tests isolation_concurrent_dml and isolation_data_migration tests
were being run in parallel, but they were interfering with each others
output. Sometimes queries from isolation_concurrent_dml were blocking
create_distributed_table in isolation_data_migration:

1. https://app.circleci.com/pipelines/github/citusdata/citus/25562/workflows/f9d0a6ff-bb7a-4b71-9fcf-1a3e46d54425/jobs/713270
2. https://app.circleci.com/pipelines/github/citusdata/citus/25562/workflows/1e22454c-1623-48a7-97fb-c6803c7959c7/jobs/713223
3. https://app.circleci.com/pipelines/github/citusdata/citus/25562/workflows/618c419e-eefb-4582-9482-322dbb9ac96d/jobs/713110

This fixes it changing the schedule to not run these tests in parallel.

(cherry picked from commit dff71abc32)
2022-09-07 13:27:49 +02:00
Jelte Fennema 86614e3555 Fix flakyness in isolation_replicate_reference_tables_to_coordinator.spec (#6123)
When the deadlock detector kills s2-update-dist-table, both sessions
finish at the same time, and the order in which they are displayed can be
swapped. To counteract this, we start using the ["marker" feature][1] of
the isolationtester framework to create consistent output.
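
For reference, a marker annotation tells the isolation tester not to
report the annotated step as completed until the named step has
completed, which pins the display order. A sketch using this test's step
names (the surrounding permutation is assumed, not taken from the
actual spec):

```
permutation "s1-begin" "s2-begin" "s1-update-ref-table" "s2-update-dist-table"("s1-commit") "s1-commit" "s2-commit"
```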

In passing, this also sets the next_shard_id to the value expected by
this test, so it can be run using `make check-isolation-base`.

Failed CI test: https://app.circleci.com/pipelines/github/citusdata/citus/25562/workflows/dfe6f88a-c306-4d91-b771-d5d1deb1798d/jobs/713417

[1]: ec62ce55a8/src/test/isolation/README (L152)

(cherry picked from commit 8bbc1a45e1)
2022-09-07 13:27:49 +02:00
Hanefi Onaldi a29b689fc9 Replace isolation tester func only once on enterprise tests (#6064)
This is a continuation of a refactor (with commit sha
2b7cf0c097) that aimed to use Citus helper
UDFs by default in iso tests.

PostgreSQL isolation test infrastructure uses some UDFs to detect
whether concurrent sessions block each other. Citus implements
alternatives to that UDF so that we are able to detect and report
distributed transactions that get blocked on the worker nodes as well.

We needed to explicitly replace PG helper functions with Citus
implementations in each isolation file. Now we replace them by default.
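
For context, this is what each spec's setup block used to run explicitly
(helper names as they appear in older Citus isolation specs):

```sql
-- swap the PG blocked-session helper for the Citus-aware one, per spec
-- file; after this change the swap happens once, by default, for all tests
SELECT citus.replace_isolation_tester_func();
SELECT citus.refresh_isolation_tester_prepared_statement();
```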

(cherry picked from commit ae58ca5783)
2022-09-07 13:27:49 +02:00
Hanefi Onaldi b90523628f Replace iso tester func only once (#5964)
Use Citus helper UDFs by default in iso tests

PostgreSQL isolation test infrastructure uses some UDFs to detect
whether concurrent sessions block each other. Citus implements
alternatives to that UDF so that we are able to detect and report
distributed transactions that get blocked on the worker nodes as well.

We needed to explicitly replace PG helper functions with Citus
implementations in each isolation file. Now we replace them by default.

(cherry picked from commit 2b7cf0c097)
2022-09-07 13:27:49 +02:00
Jelte Fennema 75cf7a748d
Define symbols required for downgrade from 11.1 (#6301)
#6300/e29db74 changed the C symbols that our bigint overrides of
pg_cancel_backend and pg_terminate_backend call, so we needed to do
something to keep these functions working after a downgrade.

Recreating the old definitions with a downgrade script is not really
possible, since people are expected to run the downgrade steps while
using the new .so file, which does not contain the old symbols.

So, the easiest way to solve it was also defining the new symbols in our
old Citus versions. Luckily our overrides haven't existed for long, so
these symbol definitions only needed to be backported to 11.0.
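
A sketch of the mechanism (the C symbol name here is hypothetical): the
SQL-level override created by 11.1 resolves a C symbol by name from
whatever citus.so is loaded, so that symbol must also exist in the old
.so for the function to keep working after a downgrade:

```sql
CREATE FUNCTION pg_catalog.pg_cancel_backend(global_pid bigint)
    RETURNS bool
    LANGUAGE C
    AS 'MODULE_PATHNAME', $$citus_internal_pg_cancel_backend$$;
```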
2022-09-07 12:18:39 +02:00
Marco Slot 5f57d77899 Allow citus_internal application_name with additional suffix (#6282)
Co-authored-by: Marco Slot <marco.slot@gmail.com>
2022-09-05 21:41:06 +02:00
Marco Slot 0a11da1291 Add an allow_unsafe_constraints flag for constraints without distribution column (#6237)
Co-authored-by: Marco Slot <marco.slot@gmail.com>
2022-08-25 16:13:07 +02:00
Gokhan Gulbiz 07143e7d12 Use the same colocation group for child and parent rels when altering a distributed table (#6225)
* Alter_distributed_table colocateWith:none bug fix for partitioned tables.

* Regression tests added for alter_distributed_table colocateWith:none for partitioned tables

* Update query comparision to be more accurate

(cherry picked from commit 69d2fcf5c0)
2022-08-25 11:47:06 +03:00
Marco Slot 006c8eacc0 Verify that we can replicate reference tables using rebalancer 2022-08-23 23:26:49 +02:00
Marco Slot 9dc6273b88 Set application_name to citus_rebalancer when copying reference tables 2022-08-23 23:26:49 +02:00
Onur Tirtir dbfdaca0f0 Add changelog entries for 11.0.6
(cherry picked from commit b6b8f198d9)
2022-08-19 11:09:46 +03:00
Onur Tirtir a5d6e841df Bump citus version to 11.0.6 2022-08-19 10:58:38 +03:00
Jelte Fennema 73e993e908
Fix flakyness in isolation_reference_table (#6193)
The newly introduced isolation_reference_table test was flaky, because
its assumption about how the arbitrary reference table gets chosen was
incorrect. This introduces a VACUUM FULL at the start of the test to
ensure the assumption actually holds.

Example of failed test: https://app.circleci.com/pipelines/github/citusdata/citus/26108/workflows/0a5cd526-006b-423e-8b67-7411b9c6be36/jobs/736802
2022-08-18 14:47:59 +02:00
Nils Dijk 08dee6fe08
Fix reference table lock contention (#6173)
DESCRIPTION: Fix reference table lock contention

Dropping and creating reference tables unintentionally blocked on each other due to the use of an ExclusiveLock both for the drop and for conditionally copying existing reference tables to (new) nodes.

The patch does the following:
 - Lower the lock level for dropping (reference) tables to `ShareLock` so they don't self-conflict
 - Treat reference tables and distributed tables equally and acquire the colocation lock when dropping any table that is in a colocation group
 - Perform the precondition check for copying reference tables twice, the first time with a lower lock that doesn't conflict with anything. It could have been NoLock; however, in preparation for dropping a colocation group, it is an `AccessShareLock`

During normal operation the first check will always pass and we don't have to escalate that lock, meaning that adding and removing reference tables no longer block each other (`ShareLock`, unlike `ExclusiveLock`, does not conflict with itself). Only after a node addition will the first `create_reference_table` still need to acquire an `ExclusiveLock` on the colocation group to perform the copy.
2022-08-18 13:22:31 +02:00
Onder Kalaci 87787dd146 Support Sequences owned by columns before distributing tables
There are 3 different ways that a sequence can interact with a
table. (1) and (2) are already supported. This commit adds
support for (3).

     (1) column DEFAULT nextval('seq'):

	The dependency is roughly like below,
	and ExpandCitusSupportedTypes() is responsible
	for finding the depending sequences.

        schema <--- table <--- column <---- default value
         ^                                     |
         |------------------ sequence <--------|

    (2) serial columns: Bigserial/small serial etc:

	The dependency is roughly like below,
	and ExpandCitusSupportedTypes() is responsible
	for finding the depending sequences.

        schema <--- table <--- column <---- default value
                                 ^             |
                                 |             |
                             sequence <--------|

   (3) Sequence OWNED BY table.column: Added support for
       this type of resolution in this commit.

       The dependency is almost like the following, and
       ExpandCitusSupportedTypes() is NOT responsible for finding
       the dependency.

        schema <--- table <--- column
                                 ^
                                 |
                             sequence
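
An example of case (3), with illustrative names; the sequence is tied to
the column via OWNED BY instead of appearing in its DEFAULT expression:

```sql
CREATE TABLE events (event_id bigint);
CREATE SEQUENCE events_id_seq OWNED BY events.event_id;
-- distributing the table must now also resolve events_id_seq
SELECT create_distributed_table('events', 'event_id');
```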

(cherry picked from commit 9ec8e627c1)
2022-08-18 11:22:25 +02:00
Marco Slot 56939f0d14
Fix relation access tracking for local only transactions on release-11.0 (#6182)
Co-authored-by: Onder Kalaci <onderkalaci@gmail.com>
2022-08-18 10:13:41 +02:00
Ahmet Gedemenli 7df8588107
Fix upgrade paths for 11.0 (#6171)
* Fix upgrade paths for 11.0
2022-08-17 21:34:23 +03:00
aykut-bozkurt e0b4455e45 sysid should be parsed as int. (#6150)
(cherry picked from commit 898801504e)
2022-08-11 11:03:41 +03:00
Onur Tirtir f8e3e8c444 Add CHANGELOG entries for 11.0.5 (#6108)
(cherry picked from commit 0a04b115aa)
2022-08-01 13:40:43 +03:00
Onur Tirtir a18f6c4e40 Bump citus version to 11.0.5 2022-08-01 10:56:31 +03:00
Jelte Fennema d6c885713e Work around flaky test related to search_path (#5894)
For some reason search_path is not always set correctly on the worker
when calling a distributed function; this shows up when calling
`insert_document` in our distributed_triggers test. The underlying
reason is currently unknown and warrants deeper investigation.

Currently this test is one of the main causes of random CI failures, so
this change sets the search_path of each function explicitly to reduce
these failures, letting other devs be more efficient while I continue
investigating the root cause of this issue.
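
A sketch of the workaround, with an illustrative function body and
schema name; the point is pinning search_path on the function itself so
it no longer depends on the session-level setting on the worker:

```sql
CREATE OR REPLACE FUNCTION insert_document(doc_id text, doc jsonb)
RETURNS void LANGUAGE plpgsql
SET search_path TO distributed_triggers
AS $$
BEGIN
    INSERT INTO documents VALUES (doc_id, doc);
END;
$$;
```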

Also changes explicit `SET citus.enable_unsafe_triggers = false` to
`RESET citus.enable_unsafe_triggers` in passing.

(cherry picked from commit 6d8c5931d6)
2022-08-01 10:17:26 +03:00
Ying Xu a8aa82a3ec Bugfix for IN clause to be considered during planner phase in Columnar (#6030)
Reported bug #5803 shows that we are currently not sending the IN clause to our planner for columnar. This PR fixes it by checking for ScalarArrayOpExpr in ExtractPushdownClause so that we do not skip it. Also added a test case for this new addition.
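
The kind of predicate this fixes, with illustrative names; the IN list
parses as a ScalarArrayOpExpr, which the pushdown-clause extraction
previously skipped:

```sql
CREATE TABLE events (event_type int) USING columnar;
-- with the fix, the IN filter is pushed down into the columnar scan
SELECT count(*) FROM events WHERE event_type IN (1, 2, 3);
```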
2022-07-29 17:56:24 +02:00
Ahmet Gedemenli 2f1719c149
Do not create truncate triggers on foreign tables (#6103) 2022-07-29 16:43:09 +03:00
Marco Slot 4eb0749369 Avoid catalog read via superuser() call in DecrementSharedConnectionCounter 2022-07-29 14:22:51 +02:00
Marco Slot 4439124b6d Fix issues with insert..select casts and column ordering 2022-07-28 13:54:04 +02:00
Jelte Fennema 1cf079581f Avoid possible information leakage about existing users (#6090)
(cherry picked from commit 0f50bef696)
2022-07-27 17:58:24 +02:00
Ahmet Gedemenli 4d01af5160 Error out for views with circular dependencies (#6051)
Adds an error check for views with circular dependencies

(cherry picked from commit 2b2a529653)
2022-07-27 17:59:49 +03:00
Marco Slot e45b6ece0d Allow WITH HOLD cursors with parameters 2022-07-27 14:08:18 +02:00
Onder Kalaci 9af736c7a6 Concurrent shard move/copy and colocated table creation fix
It turns out that create_distributed_table
and citus_move/copy_shard_placement do not
work well concurrently.

To fix that, we need to acquire a lock, which
sounds like a good use of colocation lock.

However, the current usage of the colocation lock is
limited to higher-level UDFs like rebalance_table_shards
etc. That usage of the lock is still useful, but
we cannot acquire the same lock in citus_move_shard_placement
etc., because the coordinator connects to itself to acquire
the lock. Hence, the high-level UDF blocks itself.

To fix that, we use one more colocation lock, with the placements
as the main objects to consider.

(cherry picked from commit 12fa3aaf6b)
2022-07-27 10:10:46 +02:00
Onder Kalaci a21a4e128c Optimize StringJoin() for when prefix-postfix is needed
Before this commit, we required multiple copies of the
same stringInfo if we needed to append/prepend data to
the stringInfo. Now, we optionally take a prefix/postfix.

For large string operations, this can save up to 10%
memory.

(cherry picked from commit 26fdcb68f0)
2022-07-27 10:02:32 +02:00
Onder Kalaci 2a684e426c Do not cache all the metadata during fix_all_partition_shard_index_names
(cherry picked from commit f076e81166)
2022-07-27 10:02:05 +02:00
Onder Kalaci 377375de2a Reduce memory consumption while adjust partition index names
Previously, CreateFixPartitionShardIndexNames() created all
the relevant query strings for all the shards and executed
one large query string. In terms of memory consumption,
this huge command (and the ExprContext generated while
running it) was the main bottleneck.

With this change, we reduce the total amount of memory
usage to almost 1/shard_count.

On my local machine, with a distributed partitioned table of 120
partitions, each with 32 shards, total memory consumption dropped from
~3GB to ~0.1GB, while total execution time increased from ~28 seconds
to ~30 seconds. This seems like a good trade-off.

(cherry picked from commit b8008999dc)
2022-07-27 10:02:00 +02:00
Nitish Upreti fcdf4434c6
Fix blocking shard moves failure due to constraint failure.
DESCRIPTION:
Fix bug #4949 where blocking shard moves fail if there is a foreign key between partitioned distributed tables (from child to parent). This happens because we try to create constraints before attaching child partitions to the parent, which causes a constraint failure since the parent table is still empty. The fix is to reverse the order, i.e., attach partitions before we create constraints.

TESTING:
Added a new test, 'shard_move_constraints_blocking', inspired by the existing 'shard_move_constraints', where we trigger the shard move with 'block_writes' instead of 'force_logical' to add coverage for this scenario.
2022-07-24 21:21:25 -07:00
Hanefi Onaldi 5ca792aef9
Bump Citus version to 11.0.4 2022-07-13 18:06:04 +03:00
Hanefi Onaldi 096047bfbc
Add changelog entry for 11.0.4 2022-07-13 18:05:19 +03:00
Onder Kalaci c51095c462 Add more generic read-replica tests
(cherry picked from commit 6cd7319f12)
2022-07-13 15:16:04 +02:00
Onder Kalaci 857a770b86 Add regression tests for LOCK command citus.use_secondary_nodes=always mode
(cherry picked from commit 3c343d4563)
2022-07-13 15:15:52 +02:00
Onder Kalaci 06e55df141 Make sure citus_is_coordinator works on read replicas
(cherry picked from commit b2e9a5baf1)
2022-07-13 15:15:46 +02:00
Onder Kalaci 06d6ffbb6e LOCK COMMAND does not require primaries at the start
(cherry picked from commit 8ab696f7e2)
2022-07-13 15:15:40 +02:00
Hanefi Onaldi 2bb106508a
Bump Citus version to 11.0.3 2022-07-05 13:19:10 +03:00
Hanefi Onaldi 5f46f2e9f7
Add changelog entry for 11.0.3
(cherry picked from commit c33915c3e6)
2022-07-05 13:19:10 +03:00
Ahmet Gedemenli ac7511de7d Fix matviews for citus_add_local_table_to_metadata (#6023)
(cherry picked from commit c8e1e243b8)
2022-07-04 17:01:40 +03:00
Hanefi Onaldi 0eee7fd9b8
Fix downgrade scripts from 11.0-2 to 11.0-1
(cherry picked from commit f60809a6c1)

Conflicts:
	src/test/regress/expected/multi_extension.out
	src/test/regress/sql/multi_extension.sql
2022-06-29 22:52:07 +03:00
Önder Kalacı 03a4305e06
Fixes a bug that prevents upgrades when there are no worker nodes (#6037)
(cherry picked from commit bab4c0a8c3)
2022-06-29 14:36:24 +03:00