Commit Graph

6458 Commits (7b60cdd13b9c181ca2ad31306666e00957f1c296)

Author SHA1 Message Date
Ahmet Gedemenli cfc17385e9
Some minor improvements on flakiness detection (#6585)
* Skip some exceptional test files in the flaky workflow, like
multi_extension
* Run some tests without a schedule, like single_node_enterprise
* Use minimal schedule for the tests in split and operations schedules
2022-12-28 15:08:39 +03:00
Ahmet Gedemenli eba9abeee2
Fix leftover shard copy on the target node when tx with move is aborted (#6583)
DESCRIPTION: Cleanup the shard on the target node in case of a
failed/aborted shard move

Inserts a cleanup record for the moved shard placement on the target
node. If the move operation succeeds, the record will be deleted. If
not, it will remain there to be cleaned up later.

fixes: #6580
2022-12-27 22:42:46 +03:00
Naisila Puka e937935935
Clean up normalize file (#6578) 2022-12-26 12:08:27 +03:00
Naisila Puka bc49616426
Fix link of path to flaky tests docs (#6579) 2022-12-23 12:09:22 +03:00
Naisila Puka 9c78f693f2
Fix typo in fix_style.sh (#6577)
For reference, this is the print stack trace PR
https://github.com/citusdata/citus/pull/6539
2022-12-22 16:24:26 +03:00
Ahmet Gedemenli cf93eb46c6
Fix split schedule (#6571)
* Drop enterprise_split_schedule as it's not even called in our CI
pipeline. It's actually a subset of split_schedule, except for
`citus_split_shard_by_split_points_deferred_drop`. Added that one into
split_schedule and dropped the enterprise one.
* Delete `citus_non_blocking_shard_split_cleanup.out`, as there is no
sql file for it. It seems it's renamed to some other test and the sql
file is deleted, but we forgot to delete the output file.
* 6 test files are chained to each other with dependent objects. Unified
them into one test file so that the flaky check will not fail for them
anymore.
* Some cleanup lines to prevent the flakiness check from failing.
2022-12-22 13:20:06 +03:00
Ahmet Gedemenli 96bf648d1a Unify dependent test files into one 2022-12-22 13:06:26 +03:00
Ahmet Gedemenli acf3539a90 Fix split schedule 2022-12-22 13:06:26 +03:00
Ahmet Gedemenli 497a7589d0
Use failtester for flakyness detection (#6569)
We need failtester image to run failure tests in flakiness detection
workflow.
2022-12-21 18:56:51 +03:00
Hanefi Onaldi 303db172f8
Use Citus version comparison in upgrade tests, not equality (#6568)
We have several version checks in our Citus upgrade tests. However, as
we drop support for PG versions, we need to update the Citus versions
used in our CI images. Therefore we must compare Citus versions in our
tests instead of using equality checks so that the queries are ran in
all the associated Citus versions.

For example, we have many conditionals where we early exit if the Citus
version is not equal to 9.0. However, as of today we never use version
9.0 and thus we always early exit in those tests.
2022-12-21 14:01:57 +03:00
Hanefi Onaldi d6833c877b
Add changelog entries for 11.1.5 (#6567) 2022-12-20 14:56:26 +03:00
Teja Mupparti 9a9989fc15 Support MERGE Phase – I
All the tables (target, source or any CTE present) in the SQL statement are local i.e. a merge-sql with a combination of Citus local and
Non-Citus tables (regular Postgres tables) should work and give the same result as Postgres MERGE on regular tables. Catch and throw an
exception (not-yet-supported) for all other scenarios during Citus-planning phase.
2022-12-18 20:32:15 -08:00
Emel Şimşek 5268d0a6cb
Enable PRIMARY KEY generation via ALTER TABLE even if the constraint name is not provided (#6520)
DESCRIPTION: Support ALTER TABLE .. ADD PRIMARY KEY ... command

Before processing
	> **ALTER TABLE ... ADD PRIMARY KEY ...**
command

1. 	Create a primary key name to use as the constraint name.
2. Change the **ALTER TABLE ... ADD PRIMARY KEY ...** command to into
**ALTER TABLE ... ADD CONSTRAINT \<constraint name> PRIMARY KEY ...**
form.
This is the only form we can specify a name for a primary key. If we run
ALTER TABLE .. ADD PRIMARY KEY, postgres
would create a constraint name internally in its own scheme. But the
problem is that we need to create constraint names
for shards in our own scheme which is \<constraint name>_\<shardid>.
Hence we need to create a name and send it to workers so that the
workers can append the shardid.
4. Run the changed command on the coordinator to make sure we are using
the same constraint name across the board.
5. Send the changed command to workers such that it is executed for the
main table as well as for the shards.

Fixes #6515.
2022-12-16 20:34:00 +03:00
aykut-bozkurt 9c0073ba57
remove unused boundary type (#6563)
Removes unused job boundary tag `SUBQUERY_MAP_MERGE_JOB`.

Only usage is at `BuildMapMergeJob`, which is only called when the
boundary = `JOIN_MAP_MERGE_JOB`. Hence, it should be safe to remove.
2022-12-16 18:19:22 +03:00
Önder Kalacı f7e881a4c4
Do not create additional WaitEventSet for RemoteSocketClosed checks (#6505)
Fixes #6501


Before this commit, we created an additional WaitEventSet for
checking whether the remote socket is closed per connection -
only once at the start of the execution.

However, for certain workloads, such as pgbench select-only
workloads, the creation/deletion of the additional WaitEventSet
adds ~7% CPU overhead, which is also reflected on the benchmark
results.

With this commit, we use the same WaitEventSet for the purposes
of checking the remote socket at the start of the execution.

We use "rebuildWaitEventSet" flag so that the executor can re-use
the existing WaitEventSet.

As a result, we see the following improvements on PG 15:

main			 : 120051 tps, 0.532 ms latency avg.
avoid_wes_rebuild: 127119 tps, 0.503 ms latency avg.

And, on PG 14, as expected, there is no difference

main			 : 129191 tps, 0.495 ms latency avg.
avoid_wes_rebuild: 129480 tps, 0.494 ms latency avg.

But, note that PG 15 is slightly (~1.5%) slower than PG 14.
That is probably the overhead of checking the remote socket.
2022-12-14 22:53:38 +01:00
Onder Kalaci feb5534c65 Do not create additional WaitEventSet for RemoteSocketClosed checks
Before this commit, we created an additional WaitEventSet for
checking whether the remote socket is closed per connection -
only once at the start of the execution.

However, for certain workloads, such as pgbench select-only
workloads, the creation/deletion of the additional WaitEventSet
adds ~7% CPU overhead, which is also reflected on the benchmark
results.

With this commit, we use the same WaitEventSet for the purposes
of checking the remote socket at the start of the execution.

We use "rebuildWaitEventSet" flag so that the executor can re-use
the existing WaitEventSet.

As a result, we see the following improvements on PG 15:

main			 : 120051 tps, 0.532 ms latency avg.
avoid_wes_rebuild: 127119 tps, 0.503 ms latency avg.

And, on PG 14, as expected, there is no difference

main			 : 129191 tps, 0.495 ms latency avg.
avoid_wes_rebuild: 129480 tps, 0.494 ms latency avg.

But, note that PG 15 is slightly (~1.5%) slower than PG 14.
That is probably the overhead of checking the remote socket.
2022-12-14 22:42:55 +01:00
Onder Kalaci d52da55ac0 Move WaitEvent to DistributedExecution
Prep. for caching WaitEventsSet/WaitEvents
2022-12-14 21:59:19 +01:00
Nils Dijk b5b73d78c3
add prepare and finish pg upgrade functions to 11.2-1 (#6560)
Fixes a missed include in #6315.

While adding the cluster clock we have added some extra steps to
`citus_prepare_pg_upgrade` and `citus_finish_pg_upgrade`. These changes
were not added to the citus upgrade and downgrade scripts, this allowed
for a syntax error to slip in.

This PR adds the new versions of both UDF's to the upgrade script while
adding the old version to the downgrade script. This exposed the syntax
error which is also solved.
2022-12-14 12:34:22 +01:00
Gokhan Gulbiz 556161be32
Fix make recipe mapping in test runner (#6561) 2022-12-14 12:57:13 +03:00
aykut-bozkurt 8be4ce546e
fix vanilla test status on CI (#6555)
- Because of the make command used for vanilla tests, test status is
always shown as success on CI. As a fix, I added `&& false` at the end
of the copying diff file to make the command fail when check-vanilla
fails.
```make
check-vanilla: all
	$(pg_regress_multi_check) --vanillatest || (cp $(vanilla_diffs_file) $(citus_abs_srcdir)/regression.diffs && false)
```

- I also fixed some vanilla tests that fails due to recently added clock
related operators shown up at some queries.
2022-12-13 11:15:47 +03:00
Gürkan İndibay 3f091e3493
Give nicer error message when using alter_table_set_access_method on a view (#6553)
DESCRIPTION: Fixes alter_table_set_access_method error for views.

Fixes #6001
2022-12-12 23:56:22 +03:00
aykut-bozkurt 1ad1a0a336
add citus_task_wait udf to wait on desired task status (#6475)
We already have citus_job_wait to wait until the job reaches the desired
state. That PR adds waiting on task state to allow more granular
waiting. It can be used for Citus operations. Moreover, it is also
useful for testing purposes. (wait until a task reaches specified state)

Related to #6459.
2022-12-12 22:41:03 +03:00
aykut-bozkurt 80686907a3
print stack trace from core files instead of uploading them (#6539)
Prints stack trace from core files instead of uploading them.
2022-12-12 18:53:17 +03:00
Gokhan Gulbiz d307e342a2
Use test name parameter in flakiness detection (#6559)
This PR changes test-flakyness CI job to pass the test name instead of
the file path to `run_test.py` script.
2022-12-12 17:53:25 +03:00
aykut-bozkurt 3da6e3e743
bgworkers with backend connection should handle SIGTERM properly (#6552)
Fixes task executor SIGTERM handling.

Problem:
When task executors are sent SIGTERM, their default handler
`bgworker_die`, which is set at worker startup, logs FATAL error. But
they do not release locks there before logging the error, which
sometimes causes hanging of the monitor. e.g. Monitor waits for the lock
forever at pg_stat flush after calling proc_exit.

Solution:
Because executors have connection to backend, they should handle SIGTERM
similar to normal backends. Normal backends uses `die` handler, in which
they set ProcDiePending flag and the next CHECK_FOR_INTERRUPTS call
handles it gracefully by releasing any lock before termination.
2022-12-12 16:44:36 +03:00
dependabot[bot] f6b8990fc7
Bump certifi from 2022.9.14 to 2022.12.7 in /src/test/regress (#6554)
Bumps [certifi](https://github.com/certifi/python-certifi) from
2022.9.14 to 2022.12.7.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-12 14:13:08 +01:00
Gokhan Gulbiz e2a73ad8a8
Flaky Test Detection CI Workflow (#6495)
This PR adds a new CI workflow named ```flaky-test``` to run flaky test
detection on newly introduced regression tests.

Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
2022-12-12 14:36:23 +03:00
Ahmet Gedemenli 190307e8d8
Wait for cleanup function (#6549)
Adding a testing function `wait_for_resource_cleanup` which waits until
all records in `pg_dist_cleanup` are cleaned up. The motivation is to
prevent flakiness in our tests, since the `NOTICE: cleaned up X orphaned
resources` message is not consistent in many cases. This PR replaces
`citus_cleanup_orphaned_resources` calls with
`wait_for_resource_cleanup` calls.
2022-12-08 13:19:25 +03:00
Teja Mupparti cbb33167f9 Fix the flaky test in clock.sql 2022-12-07 09:47:35 -08:00
Ahmet Gedemenli 989a3b54c9
Disable maintenance daemon for cleanup test (#6551)
Disabling the cleanup in maintenance daemon, to prevent a flaky test.
2022-12-07 20:28:55 +03:00
Onur Tirtir 1e415bb1c3
Support outer joins where the outer rel is a recurring one and the inner one is a non-recurring one (#6512)
DESCRIPTION: Adds support for outer joins having a recurring rel in the
outer side of the join (e.g., \<reference table\> LEFT JOIN
\<distributed table\>)

Closes #6219.
Closes #521

If the outer part of an outer join is a recurring rel (i.e., reference
table
or an intermediate_result injected into the query during the earlier
stages
of the recursive planning), Citus cannot run the join query if the other
side
of the join is not a recurring rel (i.e., distributed table).
See DeferredErrorIfUnsupportedRecurringTuplesJoin for the reasoning.

And to support such joins, now we start recursively planning distributed
side
of such joins so that non-recurring rel becomes an intermediate result
(and
hence a recurring rel) since Citus already knows how to compute an outer
join
between two recurring rels already. In the simplest scenario, this means
to
convert
  _"\<reference\> LEFT JOIN \<distributed\>"_ to
  _"\<reference\> LEFT JOIN  \<intermediate_result\>"_
by wrapping the distributed table into a subquery.

- [x] Add support for outer joins having a recurring rel in the outer
side and a "distributed table" (*) in the inner side of the join
- [x] Expand "distributed table" concept to "distributed rel" in first
item.
      This means that;
- [x] Currently RecursivelyPlanNonRecurringJoinNode doesn't know how to
wrap a sub join tree that constitutes a recurring rel, e.g., rhs clause
of the following join: `reference LEFT OUTER <distributed INNER JOIN
distributed>`; fix this.
- [x] Similar to previous item, currently
RecursivelyPlanNonRecurringJoinNode doesn't know how to handle
subqueries constituting a distributed rel, e.g., `SELECT * FROM ref LEFT
JOIN (SELECT * FROM dist_1) u1 ON (ref.a = u1.a);`; fix this.
- [x] Add lateral join checks for now-supported outer joins into
recursive planner
- [x] Fix regressions tests
- [x] Verified each test output file by first un-distributing Citus
tables involved in related queries and re-running the test file.
- [x] Some of the tests --that were not supposed to return any data
before but this PR adds support for-- were likely to get flaky, so added
some "ORDER BY"s to them.
- [x] Continue doing manual testing and start writing a test file for
the join clauses that this PR adds support for --not only rely on
existing tests

See https://github.com/citusdata/citus/issues/6546 for what we could do
further.
2022-12-07 18:44:00 +03:00
Onur Tirtir b177975371 Add new regression tests 2022-12-07 18:27:50 +03:00
Onur Tirtir 2803470b58 Add lateral join checks for outer joins and drop the useless ones for semi joins 2022-12-07 18:27:50 +03:00
Onur Tirtir e7e4881289 Phase - III: recursively plan non-recurring sub join trees too 2022-12-07 18:27:50 +03:00
Onur Tirtir f52381387e Phase - II: recursively plan non-recurring subqueries too 2022-12-07 18:27:50 +03:00
Onur Tirtir f339450a9d Phase - I: recursively plan non-recurring relations 2022-12-07 18:27:50 +03:00
Ahmet Gedemenli 3cc5d9842a
Remove IF EXISTS from cleanup on failure test for subscription object (#6547)
Nothing critical. Just improving a DROP SUBSCRIPTION test for a cleanup
after failure scenario.
2022-12-07 17:51:36 +03:00
Jelte Fennema 7499c3073d
Push coverage to codeclimate (#6538)
In addition to pushing coverage results to codecov, this now also pushes
them to codeclimate. This is meant so we can evaluate codeclimate.
2022-12-06 16:01:20 +01:00
Ahmet Gedemenli cb02d62369
Unique names for replication artifacts (#6529)
DESCRIPTION: Create replication artifacts with unique names

We're creating replication objects with generic names. This disallows us
to enable parallel shard moves, as two operations might use the same
objects. With this PR, we'll create below objects with operation
specific names, by appending OparationId to the names.

* Subscriptions
* Publications
* Replication Slots
* Users created for subscriptions
2022-12-06 15:48:16 +03:00
Teja Mupparti e14dc5d45d Address the issues/comments from the original PR# 6315
1) Regular users fail to use clock UDF with permission issue.
2) Clock functions were declared as STABLE, whereas by definition they are VOLATILE. By design, any clock/time
   functions will return different results for each call even within a single SQL statement.

Note: UDF citus_get_transaction_clock() is a misnomer as it internally calls the clock tick which always returns
      different results for every invocation in the same transaction.
2022-12-05 11:06:21 -08:00
aykut-bozkurt 65f256eec4
* add SIGTERM handler to gracefully terminate task executors, \ (#6473)
Adds signal handlers for graceful termination, cancellation of
task executors and detecting config updates. Related to PR #6459.

#### How to handle termination signal?
Monitor need to gracefully terminate all running task executors before
terminating. Hence, we have sigterm handler for the monitor.

#### How to handle cancellation signal?
Monitor need to gracefully cancel all running task executors before
terminating. Hence, we have sigint handler for the monitor.

#### How to detect configuration changes?
Monitor has SIGHUP handler to reflect configuration changes while
executing tasks.
2022-12-02 18:15:31 +03:00
aykut-bozkurt 6781ace3a1
find core files from correct path on CI (#6535)
Finds core files from correct path on CI. According to default core
pattern on CI, core is generated at the location relative to binary is
executed.

It can be safe to set core pattern before running binary but to change a
kernel param(in our case kernel.core_pattern), you need related
privilege in docker container. Or you have to change it at image build.
But, by default, on CI machines, kernel pattern contains a relative path
to binary + pid + process name, so we do not need to set it explicitly
for now. (Example core file name on CI machine:
`core.2559.!usr!lib!postgresql!14!bin!postgres`)
2022-12-02 18:04:29 +03:00
songjinzhou ad6450b793
fix the problem #5763 (#6519)
Co-authored-by: TsinghuaLucky912 <tsinghualucky912@foxmail.com>
Fixes https://github.com/citusdata/citus/issues/5763
2022-12-02 13:49:32 +01:00
Ahmet Gedemenli 3b24c47470
Fix flaky cleanup tests (#6530)
We are having some flakiness in our test schedule because of the objects
leftover from shard moves/splits. With this commit we prevent logging
cleanup object counts.

fixes: #6534
2022-12-02 12:39:36 +03:00
Hanefi Onaldi d4394b2e2d
Fix spacing in multiline strings (#6533)
When using multiline strings, we occasionally forget to add a single
space at the end of the first line. When this line is concatenated with
the next one, the resulting string has a missing space.
2022-12-01 23:42:47 +03:00
Fabrízio de Royes Mello 37f3dff1ca
Simplify columnar perf example (#6526)
Rewrite the plpython function to generate random words in SQL to
simplify the usage and run the example.
2022-12-01 20:05:40 +01:00
songjinzhou 29f0196fdf
Add support for SET ACCESS METHOD in altering a distributed table (#6525)
Co-authored-by: TsinghuaLucky912 <postgres@localhost.localdomain>
2022-12-01 17:45:32 +01:00
Gürkan İndibay c2193608c9
Add jobs to test builds on different distros (#6499)
With this PR, citus code will be tested in all packaging environments. 
Sometimes, there can be compile errors which blocks packaging and in
this case unplanned delays may occur.
By testing the code in packaging environments, I'm aiming to detect any
compilation errors before packaging.

Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
2022-12-01 19:11:41 +03:00
Hanefi Onaldi 1f29c16262
Fix misleading GUC description (#6532)
citus.skip_advisory_lock_permission_checks skips checks when it is set
to 'on', not 'off'
2022-12-01 15:43:02 +03:00
Ahmet Gedemenli 0e92244bfe
Cleanup for shard moves (#6472)
DESCRIPTION: Extend cleanup process for replication artifacts

This PR adds new cleanup record types for:
* Subscriptions
* Replication slots
* Publications
* Users created for subscriptions

We add records for these object types, to `pg_dist_cleanup` during
creation phase. Once the operation is done, in case of success or
failure, we iterate those records and drop the objects. With this PR we
will not be dropping any of these objects during the operation. In
short, we will always be deferring the drop.

One thing that's worth mentioning is that we sort cleanup records before
processing (dropping) them, because of dependency relations among those
objects, e.g a subscription might depend on a publication. Therefore, we
always drop subscriptions before publications.

We have some renames in this PR:
* `TryDropOrphanedShards` -> `TryDropOrphanedResources`
* `DropOrphanedShardsForCleanup` -> `DropOrphanedResourcesForCleanup`
* `run_try_drop_marked_shards` -> `run_try_drop_marked_resources`
as these functions now process replication artifacts as well.

This PR drops function `DropAllLogicalReplicationLeftovers` and its all
usages, since now we rely on the deferring drop mechanism.
2022-11-30 15:38:05 +03:00