citus

Commit Graph

Author	SHA1	Message	Date
Ahmet Gedemenli	26b170e1a8	Use %u instead of %i for naming subscriptions & roles (#6603 ) DESCRIPTION: Fix the modifier for subscription and role creation fixes: #6598 Reported by @ivyazmitinov	2023-01-06 14:38:32 +01:00
Ahmet Gedemenli	bc3383170e	Fix crash when trying to replicate a ref table that is actually dropped (#6595 ) DESCRIPTION: Fix crash when trying to replicate a ref table that is actually dropped see #6592 We should have a real solution for it.	2023-01-06 14:52:08 +03:00
Emel Şimşek	db7a70ef3e	Enable ALTER TABLE ... ADD UNIQUE and ADD EXCLUDE. (#6582 ) DESCRIPTION: Adds support for creating table constraints UNIQUE and EXCLUDE via ALTER TABLE command without client having to specify a name. ALTER TABLE ... ADD CONSTRAINT <conname> UNIQUE ... ALTER TABLE ... ADD CONSTRAINT <conname> EXCLUDE ... commands require the client to provide an explicit constraint name. However, in postgres it is possible for clients not to provide a name and let the postgres generate it using the following commands ALTER TABLE ... ADD UNIQUE ... ALTER TABLE ... ADD EXCLUDE ... This PR enables the same functionality for citus tables.	2023-01-05 18:12:32 +03:00
Emel Şimşek	135c519c62	Fix flakyness for multi_name_lengths test (#6599 ) Fix flakyness for multi_name_lengths test.	2023-01-05 17:27:16 +03:00
Önder Kalacı	eb75decbeb	Undo planner extended statistics override (#6492 )	2023-01-04 13:25:57 +01:00
Önder Kalacı	a1aa96b32c	Make the metadata syncing less resource invasive [Phase-1] (#6537 )	2023-01-04 11:36:45 +01:00
Ahmet Gedemenli	235047670d	Drop SHARD_STATE_TO_DELETE (#6494 ) DESCRIPTION: Drop `SHARD_STATE_TO_DELETE` and use the cleanup records instead Drops the shard state that is used to mark shards as orphaned. Now we insert cleanup records into `pg_dist_cleanup` so "orphaned" shards will be dropped either by maintenance daemon or internal cleanup calls. With this PR, we make the "cleanup orphaned shards" functions to be no-op, as they would not be needed anymore. This PR includes some naming changes about placement functions. We don't need functions that filter orphaned shards, as there will be no orphaned shards anymore. We will also be introducing a small script with this PR, for users with orphaned shards. We'll basically delete the orphaned shard entries from `pg_dist_placement` and insert cleanup records into `pg_dist_cleanup` for each one of them, during Citus upgrade. We also have a lot of flakiness fixes in this PR. Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>	2023-01-03 14:38:16 +03:00
Jelte Fennema	f56904fe04	Fix flakyness in isolation_insert_vs_vacuum (#6589 ) Sometimes our `isolation_insert_vs_vacuum` test would fail like this. ```diff step s2-vacuum-analyze: VACUUM ANALYZE test_insert_vacuum; - + <waiting ...> step s1-commit: COMMIT; +step s2-vacuum-analyze: <... completed> ``` The reason seems to be that VACUUM ANALYZE tries to take some locks that conflict with the other transaction, but these locks somehow get released or VACUUM ANALYZE stops waiting for them. This is somewhat expected since VACUUM has some special locking logic. To solve the flakyness we now trigger VACUUM ANALYZE to always report as blocking and after that we explicitly wait for it to complete. This is done as suggested by the flaky test tips from postgres: `c68a183990/src/test/isolation/README (L152)` I've confirmed that this fixes the issue using our flaky-test-debugging CI workflow.	2023-01-02 16:51:58 +01:00
Ahmet Gedemenli	0c74e4cc0f	Fix some flaky tests (#6587 ) Fix for some simple flakiness. All `DROP USER` and cleanup function calls.	2022-12-29 10:19:09 +03:00
Ahmet Gedemenli	f824c996b3	Remove duplicate split entry in run_test.py (#6586 ) Nothing important. Removing the duplicate `"split" in test_schedule` check	2022-12-28 18:18:24 +03:00
Ahmet Gedemenli	1b1e737e51	Drop cleanup on failure (#6584 ) DESCRIPTION: Defers cleanup after a failure in shard move or split We don't need to do a cleanup in case of failure on a shard transfer or split anymore. Because, * Maintenance daemon will clean them up anyway. * We trigger a cleanup at the beginning of shard transfers/splits. * The cleanup on failure logic also can fail sometimes and instead of the original error, we throw the error that is raised by the cleanup procedure, and it causes confusion.	2022-12-28 15:48:44 +03:00
Ahmet Gedemenli	cfc17385e9	Some minor improvements on flakiness detection (#6585 ) * Skip some exceptional test files in the flaky workflow, like multi_extension * Run some tests without a schedule, like single_node_enterprise * Use minimal schedule for the tests in split and operations schedules	2022-12-28 15:08:39 +03:00
Ahmet Gedemenli	eba9abeee2	Fix leftover shard copy on the target node when tx with move is aborted (#6583 ) DESCRIPTION: Cleanup the shard on the target node in case of a failed/aborted shard move Inserts a cleanup record for the moved shard placement on the target node. If the move operation succeeds, the record will be deleted. If not, it will remain there to be cleaned up later. fixes: #6580	2022-12-27 22:42:46 +03:00
Naisila Puka	e937935935	Clean up normalize file (#6578 )	2022-12-26 12:08:27 +03:00
Naisila Puka	bc49616426	Fix link of path to flaky tests docs (#6579 )	2022-12-23 12:09:22 +03:00
Ahmet Gedemenli	96bf648d1a	Unify dependent test files into one	2022-12-22 13:06:26 +03:00
Ahmet Gedemenli	acf3539a90	Fix split schedule	2022-12-22 13:06:26 +03:00
Hanefi Onaldi	303db172f8	Use Citus version comparison in upgrade tests, not equality (#6568 ) We have several version checks in our Citus upgrade tests. However, as we drop support for PG versions, we need to update the Citus versions used in our CI images. Therefore we must compare Citus versions in our tests instead of using equality checks so that the queries are run in all the associated Citus versions. For example, we have many conditionals where we early exit if the Citus version is not equal to 9.0. However, as of today we never use version 9.0 and thus we always early exit in those tests.	2022-12-21 14:01:57 +03:00
Teja Mupparti	9a9989fc15	Support MERGE Phase – I All the tables (target, source or any CTE present) in the SQL statement are local i.e. a merge-sql with a combination of Citus local and Non-Citus tables (regular Postgres tables) should work and give the same result as Postgres MERGE on regular tables. Catch and throw an exception (not-yet-supported) for all other scenarios during Citus-planning phase.	2022-12-18 20:32:15 -08:00
Emel Şimşek	5268d0a6cb	Enable PRIMARY KEY generation via ALTER TABLE even if the constraint name is not provided (#6520 ) DESCRIPTION: Support ALTER TABLE .. ADD PRIMARY KEY ... command Before processing > ALTER TABLE ... ADD PRIMARY KEY ... command 1. Create a primary key name to use as the constraint name. 2. Change the ALTER TABLE ... ADD PRIMARY KEY ... command into ALTER TABLE ... ADD CONSTRAINT \<constraint name> PRIMARY KEY ... form. This is the only form in which we can specify a name for a primary key. If we run ALTER TABLE .. ADD PRIMARY KEY, postgres would create a constraint name internally in its own scheme. But the problem is that we need to create constraint names for shards in our own scheme which is \<constraint name>_\<shardid>. Hence we need to create a name and send it to workers so that the workers can append the shardid. 4. Run the changed command on the coordinator to make sure we are using the same constraint name across the board. 5. Send the changed command to workers such that it is executed for the main table as well as for the shards. Fixes #6515.	2022-12-16 20:34:00 +03:00
aykut-bozkurt	9c0073ba57	remove unused boundary type (#6563 ) Removes unused job boundary tag `SUBQUERY_MAP_MERGE_JOB`. Only usage is at `BuildMapMergeJob`, which is only called when the boundary = `JOIN_MAP_MERGE_JOB`. Hence, it should be safe to remove.	2022-12-16 18:19:22 +03:00
Onder Kalaci	feb5534c65	Do not create additional WaitEventSet for RemoteSocketClosed checks Before this commit, we created an additional WaitEventSet for checking whether the remote socket is closed per connection - only once at the start of the execution. However, for certain workloads, such as pgbench select-only workloads, the creation/deletion of the additional WaitEventSet adds ~7% CPU overhead, which is also reflected on the benchmark results. With this commit, we use the same WaitEventSet for the purposes of checking the remote socket at the start of the execution. We use "rebuildWaitEventSet" flag so that the executor can re-use the existing WaitEventSet. As a result, we see the following improvements on PG 15: main : 120051 tps, 0.532 ms latency avg. avoid_wes_rebuild: 127119 tps, 0.503 ms latency avg. And, on PG 14, as expected, there is no difference main : 129191 tps, 0.495 ms latency avg. avoid_wes_rebuild: 129480 tps, 0.494 ms latency avg. But, note that PG 15 is slightly (~1.5%) slower than PG 14. That is probably the overhead of checking the remote socket.	2022-12-14 22:42:55 +01:00
Onder Kalaci	d52da55ac0	Move WaitEvent to DistributedExecution Prep. for caching WaitEventsSet/WaitEvents	2022-12-14 21:59:19 +01:00
Nils Dijk	b5b73d78c3	add prepare and finish pg upgrade functions to 11.2-1 (#6560 ) Fixes a missed include in #6315. While adding the cluster clock we have added some extra steps to `citus_prepare_pg_upgrade` and `citus_finish_pg_upgrade`. These changes were not added to the citus upgrade and downgrade scripts, this allowed for a syntax error to slip in. This PR adds the new versions of both UDF's to the upgrade script while adding the old version to the downgrade script. This exposed the syntax error which is also solved.	2022-12-14 12:34:22 +01:00
Gokhan Gulbiz	556161be32	Fix make recipe mapping in test runner (#6561 )	2022-12-14 12:57:13 +03:00
aykut-bozkurt	8be4ce546e	fix vanilla test status on CI (#6555 ) - Because of the make command used for vanilla tests, test status is always shown as success on CI. As a fix, I added `&& false` at the end of the command that copies the diff file to make the command fail when check-vanilla fails. ```make check-vanilla: all $(pg_regress_multi_check) --vanillatest || (cp $(vanilla_diffs_file) $(citus_abs_srcdir)/regression.diffs && false) ``` - I also fixed some vanilla tests that fail due to recently added clock related operators showing up in some queries.	2022-12-13 11:15:47 +03:00
Gürkan İndibay	3f091e3493	Give nicer error message when using alter_table_set_access_method on a view (#6553 ) DESCRIPTION: Fixes alter_table_set_access_method error for views. Fixes #6001	2022-12-12 23:56:22 +03:00
aykut-bozkurt	1ad1a0a336	add citus_task_wait udf to wait on desired task status (#6475 ) We already have citus_job_wait to wait until the job reaches the desired state. That PR adds waiting on task state to allow more granular waiting. It can be used for Citus operations. Moreover, it is also useful for testing purposes. (wait until a task reaches specified state) Related to #6459.	2022-12-12 22:41:03 +03:00
aykut-bozkurt	3da6e3e743	bgworkers with backend connection should handle SIGTERM properly (#6552 ) Fixes task executor SIGTERM handling. Problem: When task executors are sent SIGTERM, their default handler `bgworker_die`, which is set at worker startup, logs a FATAL error. But they do not release locks there before logging the error, which sometimes causes the monitor to hang. e.g. Monitor waits for the lock forever at pg_stat flush after calling proc_exit. Solution: Because executors have a connection to the backend, they should handle SIGTERM similar to normal backends. Normal backends use the `die` handler, in which they set the ProcDiePending flag and the next CHECK_FOR_INTERRUPTS call handles it gracefully by releasing any lock before termination.	2022-12-12 16:44:36 +03:00
dependabot[bot]	f6b8990fc7	Bump certifi from 2022.9.14 to 2022.12.7 in /src/test/regress (#6554 ) Bumps [certifi](https://github.com/certifi/python-certifi) from 2022.9.14 to 2022.12.7. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-12-12 14:13:08 +01:00
Gokhan Gulbiz	e2a73ad8a8	Flaky Test Detection CI Workflow (#6495 ) This PR adds a new CI workflow named ```flaky-test``` to run flaky test detection on newly introduced regression tests. Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>	2022-12-12 14:36:23 +03:00
Ahmet Gedemenli	190307e8d8	Wait for cleanup function (#6549 ) Adding a testing function `wait_for_resource_cleanup` which waits until all records in `pg_dist_cleanup` are cleaned up. The motivation is to prevent flakiness in our tests, since the `NOTICE: cleaned up X orphaned resources` message is not consistent in many cases. This PR replaces `citus_cleanup_orphaned_resources` calls with `wait_for_resource_cleanup` calls.	2022-12-08 13:19:25 +03:00
Teja Mupparti	cbb33167f9	Fix the flaky test in clock.sql	2022-12-07 09:47:35 -08:00
Ahmet Gedemenli	989a3b54c9	Disable maintenance daemon for cleanup test (#6551 ) Disabling the cleanup in maintenance daemon, to prevent a flaky test.	2022-12-07 20:28:55 +03:00
Onur Tirtir	b177975371	Add new regression tests	2022-12-07 18:27:50 +03:00
Onur Tirtir	2803470b58	Add lateral join checks for outer joins and drop the useless ones for semi joins	2022-12-07 18:27:50 +03:00
Onur Tirtir	e7e4881289	Phase - III: recursively plan non-recurring sub join trees too	2022-12-07 18:27:50 +03:00
Onur Tirtir	f52381387e	Phase - II: recursively plan non-recurring subqueries too	2022-12-07 18:27:50 +03:00
Onur Tirtir	f339450a9d	Phase - I: recursively plan non-recurring relations	2022-12-07 18:27:50 +03:00
Ahmet Gedemenli	3cc5d9842a	Remove IF EXISTS from cleanup on failure test for subscription object (#6547 ) Nothing critical. Just improving a DROP SUBSCRIPTION test for a cleanup after failure scenario.	2022-12-07 17:51:36 +03:00
Ahmet Gedemenli	cb02d62369	Unique names for replication artifacts (#6529 ) DESCRIPTION: Create replication artifacts with unique names We're creating replication objects with generic names. This prevents us from enabling parallel shard moves, as two operations might use the same objects. With this PR, we'll create the objects below with operation specific names, by appending OperationId to the names. * Subscriptions * Publications * Replication Slots * Users created for subscriptions	2022-12-06 15:48:16 +03:00
Teja Mupparti	e14dc5d45d	Address the issues/comments from the original PR# 6315 1) Regular users fail to use clock UDF with permission issue. 2) Clock functions were declared as STABLE, whereas by definition they are VOLATILE. By design, any clock/time functions will return different results for each call even within a single SQL statement. Note: UDF citus_get_transaction_clock() is a misnomer as it internally calls the clock tick which always returns different results for every invocation in the same transaction.	2022-12-05 11:06:21 -08:00
aykut-bozkurt	65f256eec4	* add SIGTERM handler to gracefully terminate task executors, \ (#6473 ) Adds signal handlers for graceful termination, cancellation of task executors and detecting config updates. Related to PR #6459. #### How to handle termination signal? Monitor needs to gracefully terminate all running task executors before terminating. Hence, we have sigterm handler for the monitor. #### How to handle cancellation signal? Monitor needs to gracefully cancel all running task executors before terminating. Hence, we have sigint handler for the monitor. #### How to detect configuration changes? Monitor has SIGHUP handler to reflect configuration changes while executing tasks.	2022-12-02 18:15:31 +03:00
songjinzhou	ad6450b793	fix the problem #5763 (#6519 ) Co-authored-by: TsinghuaLucky912 <tsinghualucky912@foxmail.com> Fixes https://github.com/citusdata/citus/issues/5763	2022-12-02 13:49:32 +01:00
Ahmet Gedemenli	3b24c47470	Fix flaky cleanup tests (#6530 ) We are having some flakiness in our test schedule because of the objects leftover from shard moves/splits. With this commit we prevent logging cleanup object counts. fixes: #6534	2022-12-02 12:39:36 +03:00
Hanefi Onaldi	d4394b2e2d	Fix spacing in multiline strings (#6533 ) When using multiline strings, we occasionally forget to add a single space at the end of the first line. When this line is concatenated with the next one, the resulting string has a missing space.	2022-12-01 23:42:47 +03:00
Fabrízio de Royes Mello	37f3dff1ca	Simplify columnar perf example (#6526 ) Rewrite the plpython function to generate random words in SQL to simplify the usage and run the example.	2022-12-01 20:05:40 +01:00
songjinzhou	29f0196fdf	Add support for SET ACCESS METHOD in altering a distributed table (#6525 ) Co-authored-by: TsinghuaLucky912 <postgres@localhost.localdomain>	2022-12-01 17:45:32 +01:00
Gürkan İndibay	c2193608c9	Add jobs to test builds on different distros (#6499 ) With this PR, citus code will be tested in all packaging environments. Sometimes, there can be compile errors which blocks packaging and in this case unplanned delays may occur. By testing the code in packaging environments, I'm aiming to detect any compilation errors before packaging. Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com> Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>	2022-12-01 19:11:41 +03:00
Hanefi Onaldi	1f29c16262	Fix misleading GUC description (#6532 ) citus.skip_advisory_lock_permission_checks skips checks when it is set to 'on', not 'off'	2022-12-01 15:43:02 +03:00
Ahmet Gedemenli	0e92244bfe	Cleanup for shard moves (#6472 ) DESCRIPTION: Extend cleanup process for replication artifacts This PR adds new cleanup record types for: * Subscriptions * Replication slots * Publications * Users created for subscriptions We add records for these object types, to `pg_dist_cleanup` during creation phase. Once the operation is done, in case of success or failure, we iterate those records and drop the objects. With this PR we will not be dropping any of these objects during the operation. In short, we will always be deferring the drop. One thing that's worth mentioning is that we sort cleanup records before processing (dropping) them, because of dependency relations among those objects, e.g. a subscription might depend on a publication. Therefore, we always drop subscriptions before publications. We have some renames in this PR: * `TryDropOrphanedShards` -> `TryDropOrphanedResources` * `DropOrphanedShardsForCleanup` -> `DropOrphanedResourcesForCleanup` * `run_try_drop_marked_shards` -> `run_try_drop_marked_resources` as these functions now process replication artifacts as well. This PR drops function `DropAllLogicalReplicationLeftovers` and all its usages, since now we rely on the deferring drop mechanism.	2022-11-30 15:38:05 +03:00
aykut-bozkurt	1f8675da43	nonblocking concurrent task execution via background workers (#6459 ) Improvement on our background task monitoring API (PR #6296) to support concurrent and nonblocking task execution. Mainly we have a queue monitor background process which forks task executors for `Runnable` tasks and then monitors their status by fetching messages from shared memory queue in nonblocking way.	2022-11-30 14:29:46 +03:00
aykut-bozkurt	83ef600f27	fix false full join pushdown error check (#6523 ) Problem: Currently, we error out if we detect recurring tuples in one side without checking the other side of the join. Solution: When one side of the full join consists of recurring tuples and the other side consists of nonrecurring tuples, we should not pushdown to prevent duplicate results. Otherwise, safe to pushdown.	2022-11-30 14:17:56 +03:00
Gokhan Gulbiz	bc118ee551	Change GUC propagation flag's default value to off (#6516 ) This PR changes ```citus.propagate_session_settings_for_loopback_connection``` default value to off not to expose this feature publicly at this point. See #6488 for details.	2022-11-29 13:25:53 +03:00
Jelte Fennema	e12d97def2	Fix flakyness in multi_metadata_access (#6524 ) Sometimes multi_metadata_access failed like this in CI: ```diff AND ext.extname = 'citus' AND nsp.nspname = 'pg_catalog' AND NOT has_table_privilege(pg_class.oid, 'select'); oid --------------------------- - pg_dist_authinfo pg_dist_clock_logical_seq + pg_dist_authinfo (2 rows) ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/28784/workflows/e462f118-eb64-4a3f-941a-e04115334f9b/jobs/883443 This fixes that by ordering the column.	2022-11-29 10:00:06 +01:00
Philip Dubé	cf69fc3652	Grammar: it's to its Includes an error message & one case of its to it's Also fix "to the to" typos	2022-11-28 20:43:44 +00:00
Jelte Fennema	68de2ce601	Include gpid in all internal application names (#6431 ) When debugging issues it's quite useful to see the originating gpid in the application_name of a query on a worker. This already happens for most queries, but not for queries created by the rebalancer or by run_command_on_worker. This adds a gpid to those two application_names too. Note, that if the GPID of the new application_names is different than the current GPID of the backend the backend will continue to keep the old gpid as its actual GPID. This PR is just meant to make sure that the application_name is as useful as it can be for users to look at. Updating of gpids will be done in a follow-up PR, and adding gpids to all internal connections will make this easier.	2022-11-25 11:16:33 +01:00
Teja Mupparti	edaf88e0ff	Fix the dangling pointer bug in get_merged_argument_list()	2022-11-22 09:41:10 -08:00
Onur Tirtir	80faf47ab5	Fix dangling pointer warning in AnyTableReplicated (#6504 ) DESCRIPTION: Fixes a potential dangling pointer issue Need to backport to 11.0 & 11.1 since we might want to release packages for debian/bookworm based on those branches in future.	2022-11-21 16:42:00 +03:00
Jelte Fennema	a477ffdf4b	Correctly fix OpenSSL 3.0 warnings (#6502 ) In #6038 I tried to fix OpenSSL 3.0 warnings with PG13, but I had made a mistake when doing that. This actually fixes these warnings.	2022-11-18 14:35:41 +01:00
Emel Şimşek	8e5ba45b74	Fixes a bug that causes crash when using auto_explain extension with ALTER TABLE...ADD FOREIGN KEY... queries. (#6470 ) Fixes a bug that causes crash when using auto_explain extension with ALTER TABLE...ADD FOREIGN KEY... queries. Those queries trigger a SELECT query on the citus tables as part of the foreign key constraint validation check. At the explain hook, workers try to explain this SELECT query as a distributed query causing memory corruption in the connection data structures. Hence, we will not explain ALTER TABLE...ADD FOREIGN KEY... and the triggered queries on the workers. Fixes #6424.	2022-11-15 17:53:39 +03:00
Teja Mupparti	7358b826ef	Remove the explicit-transaction requirement for the UDF citus_get_transaction_clock() as implicit transactions too use this UDF.	2022-11-10 10:54:36 -08:00
Marco Slot	77fbcfaf14	Propagate BEGIN properties to worker nodes (#6483 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-11-10 18:08:43 +01:00
Onur Tirtir	e0363470bc	Add missing targets to check-full	2022-11-08 09:59:55 +03:00
Onur Tirtir	7b3e55f903	Add missing dependencies to test targets	2022-11-08 09:59:55 +03:00
Marco Slot	fcaabfdcf3	Remove remaining master_create_distributed_table usages (#6477 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-11-04 16:30:06 +01:00
Marco Slot	666696c01c	Deprecate citus.replicate_reference_tables_on_activate, make it always off (#6474 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-11-04 16:21:10 +01:00
Naisila Puka	b8c7a9844c	Add docs on handling alternative test outputs (#6469 ) I recently cleaned up our test suite from redundant test outputs: #6111 #6140 #6214 #6140 #6434 I had to check many files manually, as they didn't have any documentation on why the alternative test output existed in the first place. Adding a section in our test docs to remind developers to add alternative test outputs with enough information/keywords.	2022-11-03 10:55:50 +03:00
Onur Tirtir	1af28b3f27	Use CommitContext for subxact mgmt and reduce memory usage in CommitContext (#6099 ) (Hopefully) Fixes #5000. If memory allocation done for `SubXactContext state` in `PushSubXact()` fails, then `PopSubXact()` might segfault, for example, when grabbing the topmost `SubXactContext` from `activeSubXactContexts` if this is the first ever subxact within the current xact, with the following stack trace: ```c citus.so!list_nth_cell(const List list, int n) (\opt\pgenv\pgsql-14.3\include\server\nodes\pg_list.h:260) citus.so!PopSubXact(SubTransactionId subId) (\home\onurctirtir\citus\src\backend\distributed\transaction\transaction_management.c:761) citus.so!CoordinatedSubTransactionCallback(SubXactEvent event, SubTransactionId subId, SubTransactionId parentSubid, void * arg) (\home\onurctirtir\citus\src\backend\distributed\transaction\transaction_management.c:673) CallSubXactCallbacks(SubXactEvent event, SubTransactionId mySubid, SubTransactionId parentSubid) (\opt\pgenv\src\postgresql-14.3\src\backend\access\transam\xact.c:3644) AbortSubTransaction() (\opt\pgenv\src\postgresql-14.3\src\backend\access\transam\xact.c:5058) AbortCurrentTransaction() (\opt\pgenv\src\postgresql-14.3\src\backend\access\transam\xact.c:3366) PostgresMain(int argc, char ** argv, const char * dbname, const char * username) (\opt\pgenv\src\postgresql-14.3\src\backend\tcop\postgres.c:4250) BackendRun(Port * port) (\opt\pgenv\src\postgresql-14.3\src\backend\postmaster\postmaster.c:4530) BackendStartup(Port * port) (\opt\pgenv\src\postgresql-14.3\src\backend\postmaster\postmaster.c:4252) ServerLoop() (\opt\pgenv\src\postgresql-14.3\src\backend\postmaster\postmaster.c:1745) PostmasterMain(int argc, char argv) (\opt\pgenv\src\postgresql-14.3\src\backend\postmaster\postmaster.c:1417) main(int argc, char argv) (\opt\pgenv\src\postgresql-14.3\src\backend\main\main.c:209) ``` For this reason, to be more defensive against memory-allocation errors that could happen at `PushSubXact()`, now 
we use our pre-allocated memory context for the objects created in `PushSubXact()`. This commit also attempts to reduce the memory allocations done under CommitContext to reduce the chances of consuming all the memory available to CommitContext. Note that it's problematic to encounter such a memory-allocation error for other objects created in `PushSubXact()` as well, so above is an example scenario that might result in a segfault. DESCRIPTION: Fixes a bug that might cause segfaults when handling deeply nested subtransactions	2022-11-03 00:57:32 +03:00
Onur Tirtir	a5f7f001b0	Make sure to disallow triggers that depend on extensions (#6399 ) DESCRIPTION: Makes sure to disallow triggers that depend on extensions We were already doing so for `ALTER trigger DEPENDS ON EXTENSION` commands. However, we also need to disallow creating Citus tables having such triggers already, so this PR fixes that.	2022-11-02 16:27:31 +03:00
Alexander Kukushkin	deeacfee04	Improve a query that terminates competing backends from citus_update_node() (#6468 ) DESCRIPTION: Improve a query that terminates competing backends from citus_update_node() 1. Use pg_blocking_pids() function instead of self join on pg_locks. It exists since 9.6 and is more accurate than pg_locks. 2. Prefix all function calls with pg_catalog schema to prevent privilege escalation by creating functions with similar names in a public schema. 3. Change logs and update comments to reflect the fact that the pg_terminate_backend() function only sends SIGTERM and does not wait for the actual backend termination.	2022-11-02 12:32:00 +01:00
Alexander Kukushkin	402a30a2b7	Allow citus_update_node() to work with nodes from different clusters (#6466 ) DESCRIPTION: Allow citus_update_node() to work with nodes from different clusters citus_update_node(), citus_nodename_for_nodeid(), and citus_nodeport_for_nodeid() functions only checked for nodes in their own clusters and hence the last two returned NULLs and the first one showed an error if the nodeId was from a different cluster. Fixes https://github.com/citusdata/citus/issues/6433	2022-11-02 10:07:01 +01:00
oohira	3f66f3d9dd	Add missing space to citus.shard_count description (#6464 ) DESCRIPTION: Add missing space to citus.shard_count description	2022-10-31 10:37:14 +01:00
Teja Mupparti	69f75af62d	Remove unused macros	2022-10-28 10:38:07 -07:00
Teja Mupparti	01103ce05d	This implements a new UDF citus_get_cluster_clock() that returns a monotonically increasing logical clock. Clock guarantees to never go back in value after restarts, and makes best attempt to keep the value close to unix epoch time in milliseconds. Also, introduces a new GUC "citus.enable_cluster_clock", when true, every distributed transaction is stamped with logical causal clock and persisted in a catalog pg_dist_commit_transaction.	2022-10-28 10:15:08 -07:00
Ahmet Gedemenli	c379ff8614	Drop defer drop gucs (#6447 ) DESCRIPTION: Drops GUC defer_drop_after_shard_split DESCRIPTION: Drops GUC defer_drop_after_shard_move Drop GUCs and related parts from the code. Delete tests that were specifically added for the GUCs. Keep tests that can be used without the GUCs. Update test output changes. The motivation for this PR is to have an "always deferring" mechanism. These two GUCs provide an option to not defer dropping objects during a shard move/split, and to drop them immediately instead. With this PR, we will be always deferring dropping orphaned shards and other types of objects. We will have a separate PR to extend the deferred cleanup operation, so that we would create records for deferred drop, for Subscriptions, Publications, Replication Slots etc. This will enable us to keep track of created objects that need to be dropped, during a shard move/split. We will have objects created specifically for the current operation; and those objects will be dropped at the end. We have an issue (a draft roadmap) for enabling parallel shard moves. For details please see: https://github.com/citusdata/citus/issues/6437	2022-10-25 16:48:34 +03:00
Hanefi Onaldi	915d1b3b38	Repartition tests for numeric types with neg scale (#6358 ) This PR adds some test cases where repartition join correctly prunes shards on two tables that have numeric columns with negative scale.	2022-10-24 20:59:05 +03:00
Jelte Fennema	20a4d742aa	Fix flakyness in failure_split_cleanup (#6450 ) Sometimes in CI our failure_split_cleanup test would fail like this: ```diff CALL pg_catalog.citus_cleanup_orphaned_resources(); -NOTICE: cleaned up 79 orphaned resources +NOTICE: cleaned up 82 orphaned resources SELECT operation_id, object_type, object_name, node_group_id, policy_type ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/28107/workflows/4ec712c9-98b5-4e90-9806-e02a37d71679/jobs/846107 The reason was that previous tests in the schedule would also create some orphaned resources. Sometimes some of those would already be cleaned up by the maintenance daemon, resulting in a different number of cleaned up resources than expected. This cleans up any previously created resources at the start of the test without logging how many exactly were cleaned up. As a bonus this now also allows running this test using check-failure-base.	2022-10-24 17:35:31 +02:00
Onur Tirtir	2d14dd85e9	Not hardcode "false" in UpdateAutoConvertedForConnectedRelations (#6452 ) This didn't cause any bugs since today we're always calling UpdateAutoConvertedForConnectedRelations with autoconverted=false, so we don't need to backport this to anywhere.	2022-10-21 18:14:20 +03:00
Onur Tirtir	dbe2749bbf	Drop unreachable code from query_pushdown_planning.c (#6451 ) Given that we cannot continue after a `RaiseDeferredErrorInternal(.., ERROR)` call.	2022-10-21 18:04:31 +03:00
Jelte Fennema	7f05ad033a	Add a section on PR descriptions to flaky test docs (#6446 ) Good PR descriptions for flaky tests are quite helpful when reviewing. Although obviously no PR description is the same, there's a few common pieces of information that are useful for all PRs that fix flaky tests.	2022-10-21 16:52:31 +02:00
aykut-bozkurt	162c8a5160	Drop worker_fetch_foreign_file/worker_repartition_cleanup only if they exist when upgrading Citus (#6441 ) We should not introduce breaking SQL changes to upgrade files after they are released. We did that for worker_fetch_foreign_file in v9.0.0 and worker_repartition_cleanup in v9.2.0. Later, when we tried to drop those UDFs, they were unexpectedly missing for some clients due to a breaking change in an old upgrade script. For that case, the fix is to add DROP IF EXISTS for those 2 UDFs in 11.0-4--11.1-1.	2022-10-21 14:32:42 +03:00
Emel Şimşek	02fd1e6c03	Fix the crash that happens when using auto_explain extension with recursive queries (#6406 ) This crash happens with recursively planned queries. For such queries, subplans are explained via the ExplainOnePlan function of PostgreSQL. This function reconstructs the query description from the plan, so it expects the ActiveSnapshot for the query to be available. This fix makes sure that the snapshot is in the stack before calling ExplainOnePlan. Fixes #2920.	2022-10-19 18:04:45 +03:00
Jelte Fennema	737e2bb1bb	Don't leak search_path to workers on DDL (#6444 ) DESCRIPTION: Don't leak search_path to workers on DDL For DDL we have to set the `search_path` on workers to the same as on the coordinator for some DDL to work. Previously this search_path would leak outside of the transaction that was used for the DDL. This fixes that by using `SET LOCAL` instead of `SET`. The only place where we still use plain `SET` is for DDL commands that are not allowed within transactions, such as `CREATE INDEX CONCURRENTLY`. This fixes this flaky test: ```diff CONTEXT: SQL statement "SELECT change_id FROM distributed_triggers.data_changes WHERE shard_key_value = NEW.shard_key_value AND object_id = NEW.object_id ORDER BY change_id DESC LIMIT 1" -PL/pgSQL function record_change() line XX at SQL statement +PL/pgSQL function distributed_triggers.record_change() line 17 at SQL statement while executing command on localhost:57638 DELETE FROM data_ref_table where shard_key_value = 'hello'; ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/27849/workflows/75ae5f1a-100b-4b7a-b991-7de069f39ee1/jobs/831429 I had tried to fix this flaky test in #5894 and then I tried implementing a better fix in #5896, where @marcocitus suggested this better fix. This change reverts the fix from #5894 and implements the fix suggested by Marco. Our multi_mx_alter_distributed_table test actually depended on the old buggy search_path leaking behavior. After fixing the bug that test would fail like this: ```diff CALL proc_0(1.0); DEBUG: pushing down the procedure -NOTICE: Res: 3 -DETAIL: from localhost:xxxxx +ERROR: relation "test_proc_colocation_0" does not exist +CONTEXT: PL/pgSQL function mx_alter_distributed_table.proc_0(double precision) line 5 at SQL statement +while executing command on localhost:57637 RESET client_min_messages; ``` I fixed this test by fully qualifying the table names used in the procedure. 
I think it's quite unlikely that actual users depend on this behavior though, since it would require first doing DDL before calling a procedure in a session where the search_path was changed after connecting.	2022-10-19 16:47:35 +02:00
Ahmet Gedemenli	cdbda9ea6b	Add failure test for shard move (#6325 ) DESCRIPTION: Adds failure test for shard move DESCRIPTION: Remove function `WaitForAllSubscriptionsToBecomeReady` and related tests Adding some failure tests for shard moves. Dropping the not-needed-anymore function `WaitForAllSubscriptionsToBecomeReady`, as the subscriptions now start as ready from the beginning because we don't use logical replication table sync workers anymore. fixes: #6260	2022-10-19 14:25:26 +02:00
Gokhan Gulbiz	56da3cf6aa	Increase node_connection_timeout to prevent flakiness in shard_rebalancer regression tests (#6445 ) In CI shard_rebalancer sometimes fails with this error: ```diff SET citus.node_connection_timeout to 60; BEGIN; SET LOCAL citus.shard_replication_factor TO 2; SET citus.log_remote_commands TO ON; SET SESSION citus.max_adaptive_executor_pool_size TO 5; SELECT replicate_table_shards('dist_table_test_2', max_shard_copies := 4, shard_transfer_mode:='block_writes'); +WARNING: could not establish connection after 60 ms ``` Source https://app.circleci.com/pipelines/github/citusdata/citus/28128/workflows/38eeacc4-4191-4366-87ed-9a628414965a/jobs/847458?invite=true#step-107-21 This PR avoids this issue by increasing ```citus.node_connection_timeout``` to 35s.	2022-10-19 13:03:14 +03:00
Onur Tirtir	5aec88d084	Not try locking relations referencing to views (#6430 ) Since there can't be such a foreign key already. This mainly fixes the error that Citus throws when trying to truncate a distributed view. Fixes #5990.	2022-10-19 11:24:22 +03:00
Jelte Fennema	f756db39c4	Add docs on how to fix flaky tests (#6438 ) I fixed a lot of flaky tests recently and I found some patterns in the type of issues and type of fixes. This adds a document that lists these types of issues and explains how to fix them.	2022-10-18 15:52:01 +02:00
Gokhan Gulbiz	e87eda6496	Introduce a new GUC to propagate local settings to new connections in rebalancer (#6396 ) DESCRIPTION: Introduce ```citus.propagate_session_settings_for_loopback_connection``` GUC to propagate local settings to new connections. Fixes: #5289	2022-10-18 12:50:30 +03:00
Jelte Fennema	60eb67b908	Increase shard move test coverage by improving advisory locks (#6429 ) To be able to test non-blocking shard moves we take an advisory lock, so we can pause the shard move at an interesting moment. Originally this was during the logical replication catch up phase. But when I added tests for the rebalancer progress I moved this lock before the initial data copy. This allowed testing of the rebalance progress, but inadvertently made our non-blocking tests not actually test if we held unintended locks during logical replication catch up. This fixes that by creating two types of advisory locks, one before the copy and one after. This causes the tests to actually test their intended scenario again. Furthermore it starts using one of these locks for blocking shard moves too. Which allowed me to reduce the complexity of the rebalance progress test suite quite a bit. It also allowed enabling some flaky tests again, because this stopped them from being flaky. And finally it allowed testing of rebalance progress for blocking shard copy operations as well. In passing it fixes a flaky test during parallel blocking shard moves by ordering the output.	2022-10-17 17:32:28 +02:00
Ahmet Gedemenli	96912d9ba1	Add status column to get_rebalance_progress() (#6403 ) DESCRIPTION: Adds status column to get_rebalance_progress() Introduces a new column named `status` for the function `get_rebalance_progress()`. For each ongoing shard move, this column will reveal information about that shard move operation's current status. For now, candidate status messages could be one of the below. * Not Started * Setting Up * Copying Data * Catching Up * Creating Constraints * Final Catchup * Creating Foreign Keys * Completing * Completed	2022-10-17 16:55:31 +03:00
Naisila Puka	8323f4f12c	Cleans up test outputs (#6434 )	2022-10-17 15:13:07 +03:00
Onur Tirtir	4152a391c2	Properly set col names for shard rels that citus_extradata_container points to (#6428 ) Deparser function set_relation_column_names() knows that it needs to re-evaluate column names based on relation's tuple descriptor when the rte belongs to a relation (RTE_RELATION). However before this commit, it didn't know about the fact that citus might wrap such an rte with an rte that points to citus_extradata_container() placeholder. And because of this, it was simply taking the column aliases (e.g., "bar" in "foo AS bar") into account and this might result in an incorrectly deparsed query as in below case: * Say, if we had view based on following query: ```sql SELECT a FROM table; ``` * And if we rename column "a" to "b", the view query normally becomes: ```sql SELECT b AS a FROM table; ``` * So before this commit, deparsing a query based on that view was resulting in such a query due to deparsing based on the column aliases, which is not correct: ```sql SELECT a FROM table; ``` Fixes #5932. DESCRIPTION: Fixes a bug that might cause failing to query the views based on tables that have renamed columns	2022-10-14 17:31:25 +03:00
Önder Kalacı	8b624b5c9d	Detect remotely closed sockets and add a single connection retry in the executor (#6404 ) PostgreSQL 15 exposes WL_SOCKET_CLOSED in WaitEventSet API, which is useful for detecting closed remote sockets. In this patch, we use this new event and try to detect closed remote sockets in the executor. When a closed socket is detected, the executor now has the ability to retry the connection establishment. Note that the executor can retry connection establishment only for a connection that has not been used. Basically, this patch is mostly useful for preventing the executor from failing if a cached connection is closed because of a worker node restart (or worker failover). In other words, the executor cannot retry connection establishment if we are in a distributed transaction AND any command has been sent over the connection. That requires more sophisticated retry mechanisms. For now, fixing the above use case is enough. Fixes #5538 Earlier discussions: #5908, #6259 and #6283 ### Summary of the current approach with regard to earlier trials As noted, we explored some alternatives before getting into this. https://github.com/citusdata/citus/pull/6283 is simple, but lacks an important property. We should be checking for `WL_SOCKET_CLOSED` _before_ sending anything over the wire. Otherwise, it becomes very tricky to understand which connection is actually safe to retry. For example, in the current patch, we can safely check `transaction->transactionState == REMOTE_TRANS_NOT_STARTED` before restarting a connection. #6259 does what we intend here (e.g., check for sending any command). However, as @marcocitus noted, it is very tricky to handle `WaitEventSets` in multiple places. And, the executor is designed such that it reacts to the events. So, adding anything `pre-executor` seemed too ugly. In the end, I converged on this patch. This patch relies on the simplicity of #6283 and also does a very limited handling of `WaitEventSets`, just for our purpose. 
Just before we add any connection to the execution, we check if the remote session has already closed. With that, we do a brief interaction of multiple wait event processing, but with different purposes. The new wait event processing we added does not even consider cancellations. We let that be handled by the main event processing loop. Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-10-14 15:08:49 +02:00
Jelte Fennema	0cee79a7ab	Actually enable improved blocked process detection (#6426 ) In #6405 I added improved blocked process detection for isolation tests. But when cleaning up unnecessary code I cleaned up a bit too much. This actually includes the new function definition in our migrations.	2022-10-13 09:50:37 +02:00
Jelte Fennema	ecc37b9028	Fix flakyness in multi_partitioning (#6427 ) In CI multi_partitioning sometimes fails with this error: ```diff SELECT citus_remove_node('localhost', :master_port); - citus_remove_node ---------------------------------------------------------------------- - -(1 row) - +ERROR: tuple concurrently deleted -- d) invalid tables for helper UDFs ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/27993/workflows/685e5b20-c923-43e5-8a0d-b932ef4c4914/jobs/839466 This PR avoids this concurrency issue by not running the multi_partitioning test in parallel with other tests.	2022-10-13 10:33:37 +03:00
Onur Tirtir	20847515fa	Hint users to call "citus_set_coordinator_host" first (#6425 ) If an operation requires having the coordinator in pg_dist_node and that is not the case, then we automatically add the coordinator into pg_dist_node if the user hasn't added any worker nodes yet. However, if the user has already added some worker nodes, we throw an error. With this commit, we improve the error thrown in that case. Closes #6423 based on the discussion made there.	2022-10-12 18:18:51 +03:00
Jelte Fennema	6277ffd69e	Reduce isolation flakyness by improving blocked process detection (#6405 ) Sometimes our CI randomly fails on a test in a way similar to this: ```diff step s2-drop: DROP TABLE cancel_table; - + <waiting ...> +step s2-drop: <... completed> starting permutation: s1-timeout s1-begin s1-sleep10000 s1-rollback s1-reset s1-drop ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26524/workflows/5415b84f-13a3-482f-bef9-648314c79a67/jobs/756377 I tried to fix that already in #6252 by disabling the maintenance daemon during isolation tests. But it seems that hasn't fixed all cases of these errors. This is another attempt at fixing these issues that seems to have better results. What it does is that it starts using the pInterestingPids parameter that citus_isolation_test_session_is_blocked receives. With this change we start filtering out block-edges that are not caused by any of these pids. In passing this change also makes it possible to run `isolation_create_distributed_table_concurrently` with `check-isolation-base`	2022-10-12 16:35:09 +02:00
Hanefi Onaldi	ec3eebbaf6	Rename a function that collides with PG15 (#6422 ) PG15 introduced a function called ReplicationSlotName that causes conflicts with our function with the same name. I solved this issue by renaming our function to ReplicationSlotNameForNodeAndOwner Relevant PG commit: `c3b5992b91`	2022-10-12 13:24:04 +03:00
Jelte Fennema	cb34adf7ac	Don't reassign global PID when already assigned (#6412 ) DESCRIPTION: Fix bug in global PID assignment for rebalancer sub-connections In CI our isolation_shard_rebalancer_progress test would sometimes fail like this: ```diff +isolationtester: canceling step s1-rebalance-c1-block-writes after 60 seconds step s1-rebalance-c1-block-writes: SELECT rebalance_table_shards('colocated1', shard_transfer_mode:='block_writes'); - <waiting ...> + +ERROR: canceling statement due to user request step s7-get-progress: ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/27855/workflows/2a7e335a-f3e8-46ed-b6bd-6920d42f7214/jobs/831710 It turned out this was an actual bug in the way our assigning of global PIDs interacts with the way we connect to ourselves as the shard rebalancer. The first command the shard rebalancer sends is a SET command to change the application_name to `citus_rebalancer`. If `StartupCitusBackend` is called after this command is processed, then it overwrites the global PID that was extracted from the previous application_name. This makes sure that we don't do that, and continue to use the original global PID. While it might seem that we only call `StartupCitusBackend` once for each query backend, this isn't actually the case. Whenever pg_dist_partition gets ANALYZEd by autovacuum we indirectly call `StartupCitusBackend` again, because we invalidate the cache then. In passing this fixes two other things as well: 1. It sets `distributedCommandOriginator` correctly in `AssignGlobalPID`, by using IsExternalClientBackend(). This doesn't matter much anymore, since AssignGlobalPID effectively becomes a no-op in this PR for any non-external client backends. 2. It passes the application_name to InitializeBackendData in StartupCitusBackend, instead of INVALID_CITUS_INTERNAL_BACKEND_GPID (which effectively got cast to NULL). 
In practice this doesn't change the behaviour of the call, since the call is a no-op for every backend except the maintenance daemon. And the behaviour of the call is the same for NULL as for the application_name of the maintenance daemon.	2022-10-11 16:41:01 +02:00
Naisila Puka	b5d70d2e11	Fix flakyness in alter_table_set_access_method (#6421 ) We decrease verbosity level here to avoid the flaky output https://app.circleci.com/pipelines/github/citusdata/citus/27936/workflows/dc63128a-1570-41a0-8722-08f3e3cfe301/jobs/836153 ```diff select alter_table_set_access_method('ref','heap'); NOTICE: creating a new table for alter_table_set_access_method.ref NOTICE: moving the data of alter_table_set_access_method.ref NOTICE: dropping the old alter_table_set_access_method.ref NOTICE: drop cascades to 2 other objects -DETAIL: drop cascades to materialized view m_ref -drop cascades to view v_ref +DETAIL: drop cascades to view v_ref +drop cascades to materialized view m_ref CONTEXT: SQL statement "DROP TABLE alter_table_set_access_method.ref CASCADE" NOTICE: renaming the new table to alter_table_set_access_method.ref alter_table_set_access_method ------------------------------- (1 row) ```	2022-10-11 16:31:24 +03:00
Naisila Puka	89aa9a015f	Fixes empty password issue (#6417 )	2022-10-11 15:56:44 +03:00
Onur Tirtir	0b81f68def	Use memcpy instead of memcpy_s to avoid pointless limits in columnar (#6419 ) DESCRIPTION: Raises memory limits in columnar from 256MB to 1GB for reads and writes This doesn't completely fix #5918 but at least increases the buffer limits that might cause an error to be thrown when reading from or writing into columnar storage. A much better approach to fix this is documented in #6420. Replacing memcpy_s with memcpy is quite safe in those places since we make sure to allocate enough memory before writing into the related buffers anyway.	2022-10-11 14:57:31 +03:00
aykut-bozkurt	442cdb2ea5	pg_regress needs the option dlpath for postgres tests to find regress.so (#6416 ) When you run vanilla tests in your local environment, some of the tests try to find the path for regress.so, which is not in the default lib path. That is why we need to specify the regress.so path as the dlpath option. Example failure: ``` LOAD :'regresslib'; +ERROR: could not access file "/home/aykutbozkurt/.pgenv/pgsql-15beta4/lib/regress.so": No such file or directory ``` It is actually in `~/.pgenv/src/postgresql-15beta4/src/test/regress/regress.so` which is found by `$regresslibdir`.	2022-10-11 14:43:06 +03:00
Hanefi Onaldi	cbe4298c5b	Remove references to optimization PG15 reverted PG15 introduced an optimization on GROUP BY keys that is now reverted on RC2. Relevant PG commit: Revert "Optimize order of GROUP BY keys". 443df6e2db932a7cd6d85ddfb67e11a43345130d	2022-10-10 21:54:08 +03:00
Onur Tirtir	517b72a9d5	Fix use-after-free in GetAlterTriggerStateCommand() (#6413 ) Fix use-after-free in GetAlterTriggerStateCommand() introduced in #6398.	2022-10-10 16:38:21 +03:00
Gokhan Gulbiz	1776bdf654	Limit citus_drain_node to drain the specified node only (#6361 ) DESCRIPTION: Fixes citus_drain_node to drain the specified worker only. Fixes #6267	2022-10-09 13:33:08 +03:00
Onur Tirtir	86e186f671	Retain trigger settings when re-creating the triggers (on shards) (#6398 ) Fixes https://github.com/citusdata/citus/issues/6394. DESCRIPTION: Fixes a bug that causes creating disabled-triggers on shards as enabled Since CREATE TRIGGER doesn't have syntax support to specify whether the trigger should be enabled/disabled, the underlying PG function (`pg_get_triggerdef()`) that we use to generate the command to create the trigger is not enough. For this reason, we append a second command to enable/disable the trigger, right after creating it. We don't retain explicit extension dependencies set by `ALTER trigger DEPENDS ON EXTENSION` commands either, but apparently the right fix for that is to throw an error as in `PreprocessAlterTriggerDependsStmt()`; so I opened a separate PR to fix that: #6399.	2022-10-06 14:51:07 +03:00
Naisila Puka	27e867afbc	Propagates column aliases (#6400 ) Propagates column aliases in the shard-level commands	2022-10-06 12:27:31 +03:00
Naisila Puka	b5cba3a3fe	Use original relation to retrieve column name because of syscache (#6387 ) During alter_distributed_table, we create a new table like the original table but with the altered options. To retrieve the name of the distribution column, we were using the attribute syscache of the new table, since we already created the new table as identical to the original table. However, the attribute syscaches of these two tables are not the same if the original table has dropped columns. The reason is that dropped columns are all still present in the cache. Hence, for example, the attnos would be different in the syscaches. So, let's use the attribute syscache of the original table.	2022-10-06 12:08:00 +03:00
Ying Xu	f21cbe68f8	[Columnar] Bugfix for Columnar: options ignored during ALTER TABLE rewrite (#6337 ) DESCRIPTION: Fixes a bug that prevents retaining columnar table options after a table-rewrite A fix for this issue: Columnar: options ignored during ALTER TABLE rewrite #5927 The OID for the temporary table created during ALTER TABLE was not the same as the original table's OID so the columnar options were not being applied during rewrite. The change is that I applied the original table's columnar options to the new table so that it has the correct options during write. I also added a test.	2022-10-05 11:42:09 -07:00
Ahmet Gedemenli	e36890ce55	Add source_lsn and target_lsn fields into get_rebalance_progress (#6364 ) DESCRIPTION: Adds source_lsn and target_lsn fields into get_rebalance_progress Adding two fields named `source_lsn` and `target_lsn` to the function `get_rebalance_progress`. Target lsn data is fetched in `GetShardStatistics`, by expanding the query sent to workers (joining with pg_subscription_rel and pg_stat_subscription). It is then put into the hashmap, for each shard. Source lsn data is fetched in `BuildWorkerShardStatististicsHash`, in the loop that iterates over each node, by sending a pg_current_wal_lsn query to each node. It is then put into the hashmap, for each node.	2022-10-05 11:12:24 +03:00
Hanefi Onaldi	11a9a3771f	Ensure no dependencies to index before drop	2022-10-04 18:56:20 +03:00
Hanefi Onaldi	5ddd4754a2	Document failing downgrades from 10.2-4 to 10.2-2	2022-10-04 18:56:20 +03:00
Hanefi Onaldi	0efd6f7829	Fix tests for missing downgrades	2022-10-04 18:56:20 +03:00
Jelte Fennema	aea4964b39	Fix flakyness in isolation_shard_rebalancer_progress (#6397 ) On our CI our isolation_shard_rebalancer_progress would sometimes randomly fail like this: ```diff table_name\|shardid\|shard_size\|sourcename\|sourceport\|source_shard_size\|targetname\|targetport\|target_shard_size\|progress\|operation_type ----------+-------+----------+----------+----------+-----------------+----------+----------+-----------------+--------+-------------- -colocated1\|1500001\| 49152\|localhost \| 57637\| 49152\|localhost \| 57638\| 73728\| 1\|move -colocated2\|1500005\| 376832\|localhost \| 57637\| 376832\|localhost \| 57638\| 401408\| 1\|move +colocated1\|1500001\| 49152\|localhost \| 57637\| 49152\|localhost \| 57638\| 81920\| 1\|move +colocated2\|1500005\| 376832\|localhost \| 57637\| 376832\|localhost \| 57638\| 409600\| 1\|move (2 rows) ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/27688/workflows/8c5ca443-5f21-4f21-b74f-0ca7bde69648/jobs/823648/parallel-runs/1 The shard sizes would be slightly larger or smaller than expected. This fixes this by fixing the output to the nearest expected shard size. To do so I used a trick described in this stack overflow answer: https://stackoverflow.com/a/33147437/2570866 When investigating I ran into one more random failure: ```diff -step s1-shard-move-c1-block-writes: <... completed> +step s4-shard-move-sep-block-writes: <... completed> citus_move_shard_placement -------------------------- (1 row) -step s4-shard-move-sep-block-writes: <... completed> +step s1-shard-move-c1-block-writes: <... completed> citus_move_shard_placement -------------------------- ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/27707/workflows/c3ff4fc7-5068-4096-ab9f-803c941ddac0/jobs/824622/parallel-runs/29?filterBy=FAILED This random failure happens, because the two parallel moves can complete at the same time. So, it's non-deterministic which one finishes first. 
To make this deterministic I used the "marker" feature from the isolation tester. And finally I ran into a third random failure: ```diff table_name\|shardid\|shard_size\|sourcename\|sourceport\|source_shard_size\|targetname\|targetport\|target_shard_size\|progress\|operation_type ----------+-------+----------+----------+----------+-----------------+----------+----------+-----------------+--------+-------------- -colocated1\|1500001\| 50000\|localhost \| 57637\| 50000\|localhost \| 57638\| 50000\| 1\|move -colocated2\|1500005\| 400000\|localhost \| 57637\| 400000\|localhost \| 57638\| 400000\| 1\|move +colocated1\|1500001\| 50000\|localhost \| 57637\| 50000\|localhost \| 57638\| 8000\| 1\|move +colocated2\|1500005\| 400000\|localhost \| 57637\| 400000\|localhost \| 57638\| 8000\| 1\|move colocated1\|1500002\| 200000\|localhost \| 57637\| 200000\|localhost \| 57638\| 0\| 0\|move colocated2\|1500006\| 8000\|localhost \| 57637\| 8000\|localhost \| 57638\| 0\| 0\|move ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/27707/workflows/c3ff4fc7-5068-4096-ab9f-803c941ddac0/jobs/824622/parallel-runs/30?filterBy=FAILED This happened in two of the tests only. For now I commented these tests out. I have some ideas on how to fix these, but these ideas require more impactful changes than I would like in this PR. One of these tests had a copy paste error too, in passing I fixed that in the commented out line.	2022-10-04 17:05:42 +02:00
Hanefi Onaldi	24f247b5a1	Cleanup multi_utility_warnings test This test used to contain some utility commands that Citus did not support. However we added support for most of the commands, and this test got outdated. We used to error out on community when user attempted to use pooler options. Now that we open sourced all enterprise features, the test can now be removed.	2022-10-04 15:27:42 +03:00
Jelte Fennema	5c64227223	Hopefully reduce flaky tests by disabling the maintenance daemon (#6252 ) Sometimes our CI randomly fails on a test in a way similar to this: ```diff step s2-drop: DROP TABLE cancel_table; - + <waiting ...> +step s2-drop: <... completed> starting permutation: s1-timeout s1-begin s1-sleep10000 s1-rollback s1-reset s1-drop ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26524/workflows/5415b84f-13a3-482f-bef9-648314c79a67/jobs/756377 Another example of a failure like this: ```diff stop_session_level_connection_to_node ------------------------------------- (1 row) step s3-display: SELECT * FROM ref_table ORDER BY id, value; SELECT * FROM dist_table ORDER BY id, value; - + <waiting ...> +step s3-display: <... completed> id\|value --+----- ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26551/workflows/91dca4b2-bb1c-4cae-b2ef-ce3f9c689ce5/jobs/757781 A step that shouldn't be blocked is detected as "waiting..." temporarily and then gets unblocked automatically immediately after. I'm not certain of the reason for this, but one explanation is that the maintenance daemon is doing something that blocks the query. In the shown case my hunch is that it could be the deferred shard deletion. This PR disables all the features of the maintenance daemon during isolation testing to try and prevent process from randomly being detected as blocking. NOTE: I'm not certain that this will actually fix this issue. If the issue persists even after this change, at least we know that it's not the maintenance daemon that's blocking it.	2022-10-04 14:33:57 +03:00
Hanefi Onaldi	813542dfa1	Fix flaky isolation_citus_dist_activity test (#6395 ) For the sake of documentation, here is a failing diff: ```diff step s2-view-dist: SELECT query, citus_nodename_for_nodeid(citus_nodeid_for_gpid(global_pid)), citus_nodeport_for_nodeid(citus_nodeid_for_gpid(global_pid)), state, wait_event_type, wait_event, usename, datname FROM citus_dist_stat_activity WHERE query NOT ILIKE ALL(VALUES('%pg_prepared_xacts%'), ('%COMMIT%'), ('%BEGIN%'), ('%pg_catalog.pg_isolation_test_session_is_blocked%'), ('%citus_add_node%')) AND backend_type = 'client backend' ORDER BY query DESC; query \|citus_nodename_for_nodeid\|citus_nodeport_for_nodeid\|state \|wait_event_type\|wait_event\|usename \|datname ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+-------------------+---------------+----------+--------+---------- ALTER TABLE test_table ADD COLUMN x INT; \|localhost \| 57636\|idle in transaction\|Client \|ClientRead\|postgres\|regression -(1 row) + + SELECT coalesce(to_jsonb(array_agg(csa_from_one_node.)), '[{}]'::JSONB) + FROM ( + SELECT global_pid, worker_query AS is_worker_query, pg_stat_activity. FROM + pg_stat_activity LEFT JOIN get_all_active_transactions() ON process_id = pid + ) AS csa_from_one_node; + \|localhost \| 57638\|active \| \| \|postgres\|regression +(2 rows) ``` This failure can be seen at [this CI run](https://app.circleci.com/pipelines/github/citusdata/citus/27653/workflows/d769701c-8f6e-4f97-a412-16f7b9b288a6/jobs/821416)	2022-10-04 13:09:09 +02:00
Hanefi Onaldi	8be8eb9d8c	Update hints on trigger rename of partitions There is a new commit in REL_15_STABLE that improves message styles. Relevant PG commit: 517484b5820e9e20057ff066b5df7d09cbb5f464	2022-09-30 16:37:56 +03:00
Ahmet Gedemenli	d0fa10a98c	Bump Citus to 11.2devel (#6385 )	2022-09-30 14:47:42 +03:00
Hanefi Onaldi	7e0edee4ec	Add tests for CREATE DATABASE with OID option (#6376 ) PG15 now allows users to specify oids when creating databases. This feature is a side effect of a bigger feature in pg_upgrade. Relevant PG Commit: pg_upgrade: Preserve database OIDs. aa01051418f10afbdfa781b8dc109615ca785ff9	2022-09-27 19:54:51 +02:00
Nils Dijk	9cad6a5324	Fix/python protobuf (#6378 ) Depends on https://github.com/citusdata/the-process/pull/92 Closes: #6371 Updates test dependencies to not rely on a known vulnerable dependency Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-09-27 14:46:27 +02:00
Naisila Puka	63e4d23722	Tests moving a shard with RLS owned by nonbypassrls & nonsuperuser (#6369 )	2022-09-27 14:53:23 +03:00
Naisila Puka	1b26d57288	Adds tests for suppressed constants in postgres_fdw queries (#6370 ) PG15 has suppressed some casts on constants when querying foreign tables. For example, we can use text to represent a type that's an enum on the remote side. A comparison on such a column will get shipped as "var = 'foo'::text". But there's no enum = text operator on the remote side. If we leave off the explicit cast, the comparison will work. Test we behave in the same way with a Citus foreign table Reminder: foreign tables cannot be distributed/reference, can only be Citus local Relevant PG commit: `f8abb0f5e1`	2022-09-27 13:40:48 +03:00
Hanefi Onaldi	30ac6f0fe9	Add tests for jsonpath changes on PG15 PostgreSQL 15 had some changes to jsonpath to conform with ECMA-262 referenced by SQL standard. This commit adds tests to make sure Citus also supports the same standards. Relevant pg commit: e26114c817b610424010cfbe91a743f591246ff1	2022-09-26 22:55:54 +03:00
Jelte Fennema	24e06af6d2	Reuse connections for Splits and Logical Replication (#6314 ) In Split, Logical replication logic and ShardCleaner we call `SendCommandListToWorkerOutsideTransaction` and `SendOptionalCommandListToWorkerOutsideTransaction` frequently. This opens a new connection for each of those calls, even though we already have a perfectly good connection lying around. This PR adds two new APIs `SendCommandListToWorkerOutsideTransactionWithConnection` and `SendOptionalCommandListToWorkerOutsideTransactionWithConnection` that allow sending a list of queries in a transaction over an existing connection. We also update the callers (Split, ShardCleaner, Logical Replication) to use these new APIs instead. Co-authored-by: Nitish Upreti <niupre@microsoft.com> Co-authored-by: Onder Kalaci <onderkalaci@gmail.com>	2022-09-26 13:37:40 +02:00
Naisila Puka	dc9723fa45	Comment about column list for fk ON DELETE SET in PG15 (#6372 ) As a part of `a868cc049a`	2022-09-26 11:45:05 +03:00
Jelte Fennema	d9a9a3263b	Revert replica identity creation order for shard moves (#6367 ) In Citus 11.1.0 we changed the order of doing the initial data copy and the replica identity creation when doing a non blocking shard move. This was done to try and increase the speed with which shard moves could be done. But after doing more extensive performance testing this change turned out to have a negative impact on the speed of moves on the setups that I tested. Looking at the resource usage metrics of the VMs the reason for this seems to be that these shard moves were bottlenecked by disk bandwidth. While creating replica identities in bulk after the initial copy will reduce CPU usage a bit, it does require an additional sequential scan of the just written data. So when a VM is bottlenecked on disk, it makes sense to spend a little bit more CPU to avoid an additional scan, since PKs are usually simple indexes that don't require lots of CPU to update, as opposed to e.g. GiST indexes. This reverts the order change to avoid a regression on shard move speed in these cases. For future releases we might consider re-evaluating our index creation order for other indexes too, and create "simple" indexes before the copy.	2022-09-23 14:55:25 +02:00
Onur Tirtir	a868cc049a	Not allow ON DELETE/UPDATE SET DEFAULT actions on columns that default to sequences (#6340 ) Given that we drop DEFAULT nextval('sequence') expressions from shard relation columns, allowing `ON DELETE/UPDATE SET DEFAULT` on such columns might cause inserting NULL values as a result of a delete/update operation. For this reason, we disallow ON DELETE/UPDATE SET DEFAULT actions on columns that default to sequences. DESCRIPTION: Disallows having ON DELETE/UPDATE SET DEFAULT actions on columns that default to sequences Fixes #6339.	2022-09-23 03:34:02 -07:00
Onur Tirtir	de24a3eda5	Not drop default col exprs from shard when adding local table to metadata (#6323 ) As we did for GENERATED STORED columns in #4613, we should not drop column default expressions that are not based on sequences from shard relation since such expressions need to exist e.g. for foreign key actions. For the column default expressions that are based on sequences we cannot do much, so we need to disallow having ON DELETE SET DEFAULT actions on such columns in a separate PR, see #6339. Fixes #6318. DESCRIPTION: Fixes a bug that might cause inserting incorrect DEFAULT values when applying foreign key actions	2022-09-23 03:05:08 -07:00
Naisila Puka	1ede0b9db3	Add tests to verify we support security invoker views (#6362 ) PG15 added support for security invoker views. Relevant PG commit: `7faa5fc84b` These views check the permissions for the underlying tables of the view invoker user, not the view definer user. When the view has underlying distributed tables, the queries to the shards are sent by opening connections with the current user, which is the view invoker, no matter what the type of the view is. This means that, for distributed views, they were always behaving like security invoker views. Check the following issue for more details: https://github.com/citusdata/citus/issues/6161 So, Citus doesn't fully support security definer views. However Citus does fully support security invoker views. We add tests to make sure we cover different cases.	2022-09-23 10:55:46 +03:00
Ahmet Gedemenli	bae4b47c2f	Fix dropping replication slot (#6359 ) DESCRIPTION: Fixes dropping replication slots As detected by a flaky test, Citus sometimes fails to drop replication slots, possibly due to a race condition, at the end of a shard split. With this PR, we retry to drop them in case of an `OBJECT_IN_USE` error, consistently for 20 seconds. fixes: #6326	2022-09-21 16:29:56 +03:00
Onder Kalaci	03ac8b4f82	Add tests for PG15 new aggregate commands Both tests include pushdown and pull to coordinator type of aggregate execution. Relevant PG commits: Add min() and max() aggregates for xid8 400fc6b6487ddf16aa82c9d76e5cfbe64d94f660 Add range_agg with multirange inputs 7ae1619bc5b1794938c7387a766b8cae34e38d8a Co-authored-by: Onder Kalaci <onderkalaci@gmail.com>	2022-09-20 17:08:17 +03:00
Nitish Upreti	e9508b2603	Shard Split : Add / Update logging (#6336 ) DESCRIPTION: Improve logging during shard split and resource cleanup ### DESCRIPTION This PR makes logging improvements to Shard Split : 1. Update confusing logging to fix #6312 2. Added new `ereport(LOG` to make debugging easier as part of telemetry review.	2022-09-16 09:39:08 -07:00
Marco Slot	8544346a78	Allow create_distributed_table_concurrently on an empty node (#6353 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-09-16 10:55:02 +02:00
Onder Kalaci	766f340ce0	Prevent failures on partitioned distributed tables with statistics objects on PG 15 Comment from the code is clear on this: /* * The statistics objects of the distributed table are not relevant * for the distributed planning, so we can override it. * * Normally, we should not need this. However, the combination of * Postgres commit 269b532aef55a579ae02a3e8e8df14101570dfd9 and * Citus function AdjustPartitioningForDistributedPlanning() * forces us to do this. The commit expects statistics objects * of partitions to have "inh" flag set properly. Whereas, the * function overrides "inh" flag. To avoid Postgres to throw error, * we override statlist such that Postgres does not try to process * any statistics objects during the standard_planner() on the * coordinator. In the end, we do not need the standard_planner() * on the coordinator to generate an optimized plan. We call * into standard_planner() for other purposes, such as generating the * relationRestrictionContext here. * * AdjustPartitioningForDistributedPlanning() is a hack that we use * to prevent Postgres' standard_planner() to expand all the partitions * for the distributed planning when a distributed partitioned table * is queried. It is required for both correctness and performance * reasons. Although we can eliminate the use of the function for * the correctness (e.g., make sure that rest of the planner can handle * partitions), its performance implication is hard to avoid. Certain * planning logic of Citus (such as router or query pushdown) relies * heavily on the relationRestrictionList. If * AdjustPartitioningForDistributedPlanning() is removed, all the * partitions show up in the relationRestrictionList, causing high * planning times for such queries. */	2022-09-15 14:36:05 +03:00
aykut-bozkurt	739b91afa6	ensure we have more active nodes than replication factor. (#6341 ) DESCRIPTION: Fixes floating point exception during create_distributed_table_concurrently. Fixes #6332. During create_distributed_table_concurrently, when there is no active primary node, it fails with a floating point exception. We added a check similar to the one in create_distributed_table. It will fail with a proper message if the number of active nodes is less than the replication factor.	2022-09-14 18:20:50 +03:00
Marco Slot	4ab415c43a	Fix escaping in sequence dependency queries (#6345 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-09-14 17:43:24 +03:00
Sameer Awasekar	4851b4e8f2	Introduce code changes to fix Issue:6303 (#6328 ) The PR introduces code changes to fix Issue [6303](https://github.com/citusdata/citus/issues/6303) `create_distributed_table_concurrently` following drop column, creates a buggy situation in split decoder. * Consider the below scenario: * Session1 : Drop column followed by create_distributed_table_concurrently * Session2 : Concurrent insert workload The child shards created by `create_distributed_table_concurrently` will have fewer columns than the source shard because some columns were dropped. The incoming tuple from session2 will have more columns as the writes happened on source shard. But now the tuple needs to be applied on child shard. So we need to format existing tuple according to child schema and skip dropped column values. The PR fixes this by reformatting the tuple according to the target child schema. Test: 1) isolation_create_distributed_concurrently_after_drop_column - Repros the issue and tests on the same.	2022-09-14 19:56:32 +05:30
Marco Slot	7a92d873b6	Fix bugs in CheckIfRelationWithSameNameExists (#6343 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-09-14 15:42:46 +02:00
Nils Dijk	da527951ca	Fix: rebalance stop non super user (#6334 ) No need for description, fixing issue introduced with new feature for 11.1 Fixes #6333 Due to Postgres' C api being 0-indexed and postgres' attributes being 1-indexed, we were reading the wrong Datum as the Task owner when cancelling. Here we add a test to show the error and fix the off-by-one error.	2022-09-13 23:19:31 +02:00
Hanefi Onaldi	f34467dcb3	Remove missing declaration warning (#6330 ) When I built Citus on PG15beta4 locally, I get a warning message. ``` utils/background_jobs.c:902:5: warning: declaration does not declare anything [-Wmissing-declarations] __attribute__((fallthrough)); ^ 1 warning generated. ``` This is a hint to the compiler that we are deliberately falling through in a switch-case block.	2022-09-13 13:48:51 +03:00
Jelte Fennema	f13b140621	Show citus_copy_shard_placement progress in get_rebalance_progress (#6322 ) DESCRIPTION: Show citus_copy_shard_placement progress in get_rebalance_progress When rebalancing to a new node that does not have reference tables yet the rebalancer will first copy the reference tables to the nodes. Depending on the size of the reference tables, this might take a long time. However, there's no indication of what's happening at this stage of the rebalance. This PR improves this situation by also showing the progress of any citus_copy_shard_placement calls when calling get_rebalance_progress.	2022-09-13 08:59:52 +00:00
Naisila Puka	76ff4ab188	Adds support for unlogged distributed sequences (#6292 ) We can now do the following: - Distribute sequence with logged/unlogged option - ALTER TABLE my_sequence SET LOGGED/UNLOGGED - ALTER SEQUENCE my_sequence SET LOGGED/UNLOGGED Relevant PG commit `344d62fb9a`	2022-09-13 10:53:39 +03:00
Hanefi Onaldi	5cfcc63308	Add warning messages for cluster commands on partitioned tables (#6306 ) PG15 introduces `CLUSTER` commands for partitioned tables. Similar to a `CLUSTER` command with no supplied table names, these commands also can not be run inside transaction blocks and therefore can not be propagated in a distributed transaction block with ease. Therefore we raise warnings. Relevant PG commit: cfdd03f45e6afc632fbe70519250ec19167d6765	2022-09-13 00:05:58 +03:00
Hanefi Onaldi	164f2fa0a6	PG15: Add support for NULLS NOT DISTINCT (#6308 ) Relevant PG commit: 94aa7cc5f707712f592885995a28e018c7c80488	2022-09-12 23:47:37 +03:00
Marco Slot	b79111527e	Avoid blocking writes in create_distributed_table_concurrently (#6324 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-09-12 12:09:37 -07:00
Nils Dijk	cda3686d86	Feature: run rebalancer in the background (#6215 ) DESCRIPTION: Add a rebalancer that uses background tasks for its execution Based on the background jobs and tasks introduced in #6296 we implement a new rebalancer on top of the primitives of background execution. This allows the user to initiate a rebalance and let Citus execute the long running steps in the background until completion. Users can invoke the new background rebalancer with `SELECT citus_rebalance_start();`. It will output information on its job id and how to track progress. Also it returns its job id for automation purposes. If you simply want to wait till the rebalance is done you can use `SELECT citus_rebalance_wait();` A running rebalance can be cancelled/stopped with `SELECT citus_rebalance_stop();`.	2022-09-12 20:46:53 +03:00
Marco Slot	48f7d6c279	Show local managed tables in citus_tables and hide tables owned by extensions (#6321 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-09-12 17:49:17 +03:00
Marco Slot	b036e44aa4	Fix bug preventing isolate_tenant_to_new_shard with text column (#6320 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-09-12 16:29:57 +02:00
naisila	47bea76c6c	Revert "Support JSON_TABLE on PG 15 (#6241 )" This reverts commit `1f4fe35512`.	2022-09-12 15:20:17 +03:00
naisila	53ffbe440a	Revert SQL/JSON features in ruleutils_15.c Reverting the following commits: `977ddaae56` `4a5cf06def` `9ae19c181f` `30447117e5` `f9c43f4332` `21dba4ed08` `262932da3e` We have to manually make changes to this file. Follow the relevant PG commit in ruleutils.c & make the exact same changes in ruleutils_15.c Relevant PG commit: 96ef3237bf741c12390003e90a4d7115c0c854b7	2022-09-12 15:20:17 +03:00
Onder Kalaci	36f8c48560	Add tests for allowing SET NULL/DEFAULT for subset of columns PG 15 added support for that (d6f96ed94e73052f99a2e545ed17a8b2fdc1fb8a). We also add support, but we already do not support ON DELETE SET NULL/DEFAULT for distribution column. So, in essence, we add support for reference tables and Citus local tables.	2022-09-12 13:56:09 +03:00
Marco Slot	2e943a64a0	Make shard moves more idempotent (#6313 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-09-09 18:21:36 +02:00
Jelte Fennema	a2d86214b2	Share more replication code between moves and splits (#6310 ) The logical replication catchup part for shard splits and shard moves is very similar. This abstracts most of that similarity away into a single function. This also improves the logic for non blocking shard splits a bit, by using faster foreign key creation. It also parallelizes index creation which shard moves were already doing, but shard splits did not.	2022-09-09 16:45:38 +02:00
Marco Slot	ba2fe3e3c4	Remove do_repair option from citus_copy_shard_placement (#6299 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-09-09 15:44:30 +02:00
Nils Dijk	00a94c7f13	Implement infrastructure to run sql jobs in the background (#6296 ) DESCRIPTION: Add infrastructure to run long running management operations in background This infrastructure introduces the primitives of jobs and tasks. A task consists of a sql statement and an owner. Tasks belong to a Job and can depend on other tasks from the same job. When there are either runnable or running tasks we would like to make sure a background task queue monitor process is running. A Task could be in running state while there is actually no monitor present due to a database restart or failover. Once the monitor starts it will reset any running task to its runnable state. To make sure only one background task queue monitor is ever running at once it will acquire an advisory lock that self conflicts. Once a task is done it will find all tasks depending on this task. After checking that the task doesn't have unmet dependencies it will transition the task from blocked to runnable state for the task to be picked up on a subsequent task start. Currently only one task can be running at a time. This can be improved upon in later releases without changes to the higher level API. The initial goal for these background tasks is to allow a rebalance to run in the background. This will be implemented in a subsequent PR.	2022-09-09 16:11:19 +03:00
Jelte Fennema	76137e967f	Create all foreign keys quickly at the end of a shard move (#6148 ) Previously we would create foreign keys to reference table in an extra fast way at the end of a shard move. This uses that same logic to also do it for foreign keys between distributed tables. Fixes #6141	2022-09-09 09:58:33 +02:00
Nils Dijk	cc0eeea4c5	remove redundant call to TerminateBackgroundWorker (#6307 ) Remove redundant call to TerminateBackgroundWorker Discussion: https://github.com/citusdata/citus/pull/6296#discussion_r965926695	2022-09-09 07:37:02 +02:00
Ahmet Gedemenli	eadc88a800	Introduce GUC citus.skip_constraint_validation (#6281 ) Introduces a new GUC named citus.skip_constraint_validation, which skips constraint validation when set to on. In the several places where we used hacks to skip the foreign key validation phase, we now use this GUC.	2022-09-08 18:13:18 +03:00
Hanefi Onaldi	a557a196aa	Add tests for numeric with scale greater than precision	2022-09-07 13:12:04 +03:00
Hanefi Onaldi	4db113496f	Add tests for new COPY features in PG15	2022-09-07 13:12:04 +03:00
Hanefi Onaldi	3e4e42253f	Add tests for new regexp sql functions	2022-09-07 13:12:04 +03:00
Jelte Fennema	e29db74a19	Don't override postgres C symbols with our own (#6300 ) When introducing our overrides of pg_cancel_backend and pg_terminate_backend we accidentally did that in such a way that we cannot call the original pg_cancel_backend and pg_terminate_backend from C anymore. This happened because we defined the exact same symbols in our shared library as postgres does in its own binary. This fixes that by using a different name for the C function than for the SQL function. Making this work in all upgrade and downgrade scenarios is not trivial though, because we actually need to remove the C function definition. Postgres errors at two different times when the symbol that a C function wants to call is not defined in the library it expects it in: 1. When creating the SQL function definition 2. When calling the SQL function Item 1 causes an issue when creating our extension for the first time. We then go execute all the migrations that we have. So if the 11.0 migration contains a SQL function definition that still references the pg_cancel_backend symbol, that migration will fail. This issue is solved by actually changing the SQL definition in the old migration. This is not enough to fix all issues though. Item 2 causes an issue after an upgrade to 11.1, because it won't have the new definition of the SQL function. This is solved by recreating the SQL functions in the migration to 11.1. That way it gets the new definition. Then finally there's the case of downgrades. To continue to make our pg_cancel_backend SQL function work after downgrading, we will need to make a patch release for 11.0 that includes the new citus_cancel_backend symbol. This is done in a separate commit.	2022-09-07 11:27:05 +02:00
Nitish Upreti	d7404a9446	'Deferred Drop' and robust 'Shard Cleanup' for Splits. (#6258 ) DESCRIPTION: This PR adds support for 'Deferred Drop' and robust 'Shard Cleanup' for Splits. Common Infrastructure This PR introduces new common infrastructure so that any operation that wants robust cleanup of resources can register with the cleaner and have the resources cleaned appropriately based on a specified policy. 'Shard Split' is the first consumer using this new infrastructure. Note : We only support adding 'shards' as resources to be cleaned-up right now but the framework will be extended to support other resources in future. Deferred Drop for Split Deferred Drop Support ensures that shards undergoing split are not dropped inline as part of operation but dropped later when no active read queries are running on shard. This helps with : Avoids any potential deadlock scenarios that can cause long running Split operation to rollback. Avoids Split operation blocking writes and then getting blocked (due to running queries on the shard) when trying to drop shards. Deferred drop is the new default behavior going forward. Shard Cleaner Extension Shard Cleaner is a background task responsible for deferred drops in case of 'Move' operations. The cleaner has been extended to ensure robust cleanup of shards (dummy shards and split children) in case of a failure based on the new infrastructure mentioned above. The cleaner also handles deferred drop for 'Splits'. TESTING: New test 'citus_split_shard_by_split_points_deferred_drop' to test deferred drop support. New test 'failure_split_cleanup' to test shard cleanup with failures in different stages. Update 'isolation_blocking_shard_split and isolation_non_blocking_shard_split' for deferred drop. Added non-deferred drop version of existing tests : 'citus_split_shard_no_deferred_drop' and 'citus_non_blocking_splits_no_deferred_drop'	2022-09-06 12:11:20 -07:00
Gokhan Gulbiz	ac96370ddf	Use IsMultiStatementTransaction for SELECT .. FOR UPDATE queries (#6288 ) * Use IsMultiStatementTransaction instead of IsTransaction for row-locking operations. * Add regression test for SELECT..FOR UPDATE statement	2022-09-06 16:38:41 +02:00
Emel Şimşek	6f06ff78cc	Throw an error if there is a RangeTblEntry that is not assigned an RTE identity. (#6295 ) * Fix issue : 6109 Segfault or (assertion failure) is possible when using a SQL function * DESCRIPTION: Ensures disallowing the usage of SQL functions referencing to a distributed table and prevents a segfault. Using a SQL function may result in segmentation fault in some cases. This change fixes the issue by throwing an error message when a SQL function cannot be handled. Fixes #6109. Co-authored-by: Emel Simsek <emel.simsek@microsoft.com>	2022-09-06 15:46:41 +02:00
aykut-bozkurt	69726648ab	verify shards if exists for insert, delete, update (#6280 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-09-06 15:29:14 +02:00
Hanefi Onaldi	85b19c851a	Disallow distributing by numeric with negative scale PG15 allows numeric scale to be negative or greater than precision. This causes issues and we may end up routing queries to a wrong shard due to differing hash results after rounding. Formerly, when specifying NUMERIC(precision, scale), the scale had to be in the range [0, precision], which was per SQL spec. PG15 extends the range of allowed scales to [-1000, 1000]. A negative scale implies rounding before the decimal point. For example, a column might be declared with a scale of -3 to round values to the nearest thousand. Note that the display scale remains non-negative, so in this case the display scale will be zero, and all digits before the decimal point will be displayed. Relevant PG commit: 085f931f52494e1f304e35571924efa6fcdc2b44	2022-09-06 12:40:56 +03:00
Naisila Puka	d7f41cacbe	Prohibit renaming child trigger on distributed partition pre PG15 (#6290 ) Pre PG15, renaming child triggers on partitions is allowed. When creating a trigger in a distributed parent partitioned table, the triggers on the shards of the partitions have the same name as the triggers on the corresponding parent shards of the parent table. Therefore, they don't have the same appended shard id as the shard id of the partition. Hence, when trying to rename a child trigger on a partition of a distributed table, we can't correctly find the triggers on the shards of the partition in order to rename them since we append a different shard id to the name of the trigger. Since we can't find the trigger we get a misleading error about a nonexistent trigger. In this commit we prohibit renaming child triggers on distributed partitions altogether.	2022-09-06 12:19:25 +03:00
Naisila Puka	fd9b3f4ae9	Add tests to make sure distributed clone trigger rename fails in PG15 (#6291 ) Relevant PG commit: 80ba4bb383538a2ee846fece6a7b8da9518b6866	2022-09-06 11:04:14 +03:00
Marco Slot	e6b1845931	Change split logic to avoid EnsureReferenceTablesExistOnAllNodesExtended (#6208 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-09-05 22:02:18 +02:00
Önder Kalacı	bd13836648	Add citus.skip_advisory_lock_permission_checks (#6293 )	2022-09-05 17:47:41 +02:00
Jelte Fennema	1c5b8588fe	Address race condition in InitializeBackendData (#6285 ) Sometimes in CI our isolation_citus_dist_activity test fails randomly like this: ```diff step s2-view-dist: SELECT query, citus_nodename_for_nodeid(citus_nodeid_for_gpid(global_pid)), citus_nodeport_for_nodeid(citus_nodeid_for_gpid(global_pid)), state, wait_event_type, wait_event, usename, datname FROM citus_dist_stat_activity WHERE query NOT ILIKE ALL(VALUES('%pg_prepared_xacts%'), ('%COMMIT%'), ('%BEGIN%'), ('%pg_catalog.pg_isolation_test_session_is_blocked%'), ('%citus_add_node%')) AND backend_type = 'client backend' ORDER BY query DESC; query \|citus_nodename_for_nodeid\|citus_nodeport_for_nodeid\|state \|wait_event_type\|wait_event\|usename \|datname ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+-------------------+---------------+----------+--------+---------- INSERT INTO test_table VALUES (100, 100); \|localhost \| 57636\|idle in transaction\|Client \|ClientRead\|postgres\|regression -(1 row) + + SELECT coalesce(to_jsonb(array_agg(csa_from_one_node.)), '[{}]'::JSONB) + FROM ( + SELECT global_pid, worker_query AS is_worker_query, pg_stat_activity. FROM + pg_stat_activity LEFT JOIN get_all_active_transactions() ON process_id = pid + ) AS csa_from_one_node; + \|localhost \| 57636\|active \| \| \|postgres\|regression +(2 rows) step s3-view-worker: ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26692/workflows/3406e4b4-b686-4667-bec6-8253ee0809b1/jobs/765119 I intended to fix this with #6263, but the fix turned out to be insufficient. 
This PR tries to address the issue by setting distributedCommandOriginator correctly in more situations. However, even with this change it's still possible to reproduce the flaky test in CI. In any case this should fix at least some instances of this issue. In passing this changes the isolation_citus_dist_activity test to allow running it multiple times in a row.	2022-09-02 14:23:47 +02:00
Ahmet Gedemenli	7c8cc7fc61	Fix flakiness for view tests (#6284 )	2022-09-02 10:12:07 +03:00
Marco Slot	432f399a5d	Allow citus_internal application_name with additional suffix (#6282 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-09-01 14:26:43 +02:00
Naisila Puka	9e2b96caa5	Add pg14->pg15 upgrade test for dist. triggers on part. tables (#6265 ) PRE PG15, Renaming the parent triggers on partitioned tables doesn't recurse to renaming the child triggers on the partitions as well. In PG15, Renaming triggers on partitioned tables recurses to renaming the triggers on the partitions as well. Add an upgrade test to make sure we are not breaking anything with distributed triggers on distributed partitioned tables. Relevant PG commit: 80ba4bb383538a2ee846fece6a7b8da9518b6866	2022-09-01 12:32:44 +03:00
Naisila Puka	317dda6af1	Use RelationGetPrimaryKeyIndex for citus catalog tables (#6262 ) pg_dist_node and pg_dist_colocation have a primary key index, not a replica identity index. Citus catalog tables are created in public schema, which has replica identity index by default as primary key index. Later the citus catalog tables are moved to pg_catalog schema. During pg_upgrade, all tables are recreated, and given that pg_dist_colocation is found in pg_catalog schema, it is recreated in that schema, and when it is recreated it doesn't have a replica identity index, because catalog tables have no replica identity. Further action: Do we even need to acquire this lock on the primary key index? Postgres doesn't acquire such locks on indexes before deleting catalog tuples. Also, catalog tuples don't have replica identities by definition.	2022-09-01 11:56:31 +03:00
Jelte Fennema	8bb082e77d	Fix reporting of progress on waiting and moved shards (#6274 ) In commit `31faa88a4e` I removed some features of the rebalance progress monitor. I did this because the plan was to remove the foreground shard rebalancer later in the PR that would add the background shard rebalancer. So, I didn't want to spend time fixing something that we would throw away anyway. As it turns out we're not removing the foreground shard rebalancer after all, so it made sense to fix the stuff that I broke. This PR does that. For the most part this commit reverts the changes in commit `31faa88a4e`. It's not a full revert though, because it keeps the improved tests and the changes to `citus_move_shard_placement`.	2022-08-31 14:55:47 +03:00
Naisila Puka	98dcbeb304	Specifies that our CustomScan providers support projections (#6244 ) Before, this was the default mode for CustomScan providers. Now, the default is to assume that they can't project. This causes performance penalties due to adding unnecessary Result nodes. Hence we use the newly added flag, CUSTOMPATH_SUPPORT_PROJECTION to get it back to how it was. In PG15 support branch we created explain functions to ignore the new Result nodes, so we undo that in this commit. Relevant PG commit: 955b3e0f9269639fb916cee3dea37aee50b82df0	2022-08-31 10:48:01 +03:00
Jelte Fennema	24e695ca27	Fix flakyness in multi_utilities (#6272 ) Sometimes in CI our multi_utilities test fails like this: ```diff VACUUM (INDEX_CLEANUP ON, PARALLEL 1) local_vacuum_table; SELECT CASE WHEN s BETWEEN 20000000 AND 25000000 THEN 22500000 ELSE s END size FROM pg_total_relation_size('local_vacuum_table') s ; size ---------- - 22500000 + 39518208 (1 row) ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26641/workflows/5caea99c-9f58-4baa-839a-805aea714628/jobs/762870 Apparently VACUUM is not as reliable in cleaning up as we thought. This increases the range of allowed values. Important to note is that the range is still completely outside of the allowed range of the initial size. So we know for sure that some data was cleaned up.	2022-08-30 14:32:34 -07:00
Jelte Fennema	f22a47981a	Fix flakyness in adaptive_executor (#6275 ) Sometimes in CI our adaptive_executor test would fail randomly with the following error: ```diff SELECT sum(result::bigint) FROM run_command_on_workers($$ SELECT count(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid() AND query LIKE '%8010090%' $$); sum ----- - 4 + 2 (1 row) END; ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26665/workflows/40665680-0044-4852-8fe4-5fd628f9fb47/jobs/764371 This means that the low slow start interval did not have any effect on the number of connections being opened. I could see two possibilities for this to happen: 1. CI was slow and actually doing the start of the second connection. I tried to solve this by doubling the time a query to the worker takes. 2. The second option is that the shards were queried in the opposite order than we expect. This would mean that the first query to the worker completes quickly because there's no sleep, because it doesn't contain any rows. I tried to solve this option by adding a row to each shard. After trying to reproduce the random failure in CI it turned out that I needed both of these fixes to resolve the random failure.	2022-08-30 23:23:30 +02:00
Jelte Fennema	8354853dec	Fix flakyness in citus_split_shard_columnar_partitioned (#6273 ) On CI our citus_split_shard_columnar_partitioned test would sometimes randomly fail like this: ```diff 8970008 \| colocated_dist_table \| -2147483648 \| 2147483647 \| localhost \| 57637 8970009 \| colocated_partitioned_table \| -2147483648 \| 2147483647 \| localhost \| 57637 8970010 \| colocated_partitioned_table_2020_01_01 \| -2147483648 \| 2147483647 \| localhost \| 57637 - 8970011 \| reference_table \| \| \| localhost \| 57637 8970011 \| reference_table \| \| \| localhost \| 57638 + 8970011 \| reference_table \| \| \| localhost \| 57637 (13 rows) ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26651/workflows/f695b4fb-ad81-46ff-b97e-0100e5d167ea/jobs/763517 This is a harmless diff due to a missing column in the order by list. This fixes that by adding the nodeport as a tiebreaker.	2022-08-30 19:54:50 +03:00
Marco Slot	6bb31c5d75	Add non-blocking variant of create_distributed_table (#6087 ) Added create_distributed_table_concurrently, which is a nonblocking variant of create_distributed_table. It is based on the split API, which takes advantage of logical replication to support nonblocking split operations. Co-authored-by: Marco Slot <marco.slot@gmail.com> Co-authored-by: aykutbozkurt <aykut.bozkurt1995@gmail.com>	2022-08-30 15:35:40 +03:00
Jelte Fennema	d68654680b	Fix flakyness in isolation_citus_dist_activity (#6263 ) Sometimes in CI our isolation_citus_dist_activity test fails randomly like this: ```diff step s2-view-dist: SELECT query, citus_nodename_for_nodeid(citus_nodeid_for_gpid(global_pid)), citus_nodeport_for_nodeid(citus_nodeid_for_gpid(global_pid)), state, wait_event_type, wait_event, usename, datname FROM citus_dist_stat_activity WHERE query NOT ILIKE ALL(VALUES('%pg_prepared_xacts%'), ('%COMMIT%'), ('%BEGIN%'), ('%pg_catalog.pg_isolation_test_session_is_blocked%'), ('%citus_add_node%')) AND backend_type = 'client backend' ORDER BY query DESC; query \|citus_nodename_for_nodeid\|citus_nodeport_for_nodeid\|state \|wait_event_type\|wait_event\|usename \|datname ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+-------------------+---------------+----------+--------+---------- INSERT INTO test_table VALUES (100, 100); \|localhost \| 57636\|idle in transaction\|Client \|ClientRead\|postgres\|regression -(1 row) + + SELECT coalesce(to_jsonb(array_agg(csa_from_one_node.)), '[{}]'::JSONB) + FROM ( + SELECT global_pid, worker_query AS is_worker_query, pg_stat_activity. FROM + pg_stat_activity LEFT JOIN get_all_active_transactions() ON process_id = pid + ) AS csa_from_one_node; + \|localhost \| 57636\|active \| \| \|postgres\|regression +(2 rows) step s3-view-worker: ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26605/workflows/56d284d2-5bb3-4e64-a0ea-7b9b1626e7cd/jobs/760633 The reason for this is that citus_dist_stat_activity sometimes shows the query that it uses itself to get the data from pg_stat_activity. 
This is actually a bug, because it's a worker query and thus shouldn't show up there. To try and solve this bug, we remove two small opportunities for a race condition. These race conditions could happen when the backenddata was marked as active, but the distributedCommandOriginator was not set correctly yet/anymore. There was an opportunity for this to happen both during connection start and shutdown.	2022-08-30 12:57:37 +03:00
Önder Kalacı	33af407ac8	Add missing orderbys (#6271 )	2022-08-30 12:49:15 +03:00
Jelte Fennema	895a484b39	Hopefully fix flakyeness in drop_partitioned_table (#6270 ) Sometimes in CI our drop_partitioned_table test would fail with the following error: ```diff NOTICE: issuing SELECT worker_drop_distributed_table('drop_partitioned_table.child1') NOTICE: issuing SELECT worker_drop_distributed_table('drop_partitioned_table.child1') NOTICE: issuing DROP TABLE IF EXISTS drop_partitioned_table.child1_727001 CASCADE -NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100047) -NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100047) +NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100046) +NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100046) ROLLBACK; NOTICE: issuing ROLLBACK NOTICE: issuing ROLLBACK ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26631/workflows/31536032-e1ba-493b-b12a-f40757f3a7d6/jobs/762170 For some reason the colocationid of the distributed partitioned table would be one less than we expected. I'm not sure why this happens, but it seems fairly harmless that it does. In an attempt to work around this flakiness I now reset the colocation id sequence right before creating the table in question. This is good practice in general, because it allows us to run the test successfully using `check-minimal` and it also allows us to rerun it multiple times.	2022-08-30 12:21:16 +03:00
Jelte Fennema	5c95604154	Always copy normalized files after a regress run (#6254 ) Our python based tests didn't always copy the normalized files after the regress run. I had the problem where running the following command would result in non-normalized files in the expected directory after running our PG upgrade tests locally: ``` cp src/test/regress/{results,expected}/upgrade_list_citus_objects.out ``` This PR fixes that by always running `copy_modified` even if the tests fail. The same was already being done for our perl based tests at the end of the `pg_regress_multi.pl` file.	2022-08-30 07:15:29 +00:00
Naisila Puka	13fe89f018	Fixes flakyness in columnar_permissions test (#6266 ) The `columnar_permissions.sql` test is flaky due to a missing `ORDER BY` clause. Added the other `ORDER BY` clauses for consistency in the test. ```diff where relation in ('no_access'::regclass, 'columnar_permissions'::regclass); relation \| chunk_group_row_limit \| stripe_row_limit \| compression \| compression_level ----------------------+-----------------------+------------------+-------------+------------------- - no_access \| 10000 \| 150000 \| zstd \| 3 columnar_permissions \| 10000 \| 2222 \| none \| 3 + no_access \| 10000 \| 150000 \| zstd \| 3 (2 rows) ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26610/workflows/79f03ef9-7674-4567-a087-02536c9ddf04/jobs/760942	2022-08-29 14:33:26 +02:00
Önder Kalacı	1df943e0d5	Use Posix locale in the tests (#6261 ) Commit `9653a0065e` has changed it to C.UTF-8 , which fails on MacOS	2022-08-29 12:52:03 +02:00
Ahmet Gedemenli	0855a9d1d4	Use SUM for calculating non partitioned table sizes (#6222 ) We currently do a `pg_relation_total_size('t1') + pg_relation_total_size('t2') + ..` on shard lists, especially when rebalancing the shards. In some cases this expression becomes huge. With this PR, we basically use a SUM over all table sizes, instead of using thousands of pluses.	2022-08-26 18:02:14 +03:00
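To make the difference between the two query shapes concrete, here is a rough sketch that builds both as strings. It is not the code from the PR: the helper names are hypothetical, the size function used is PostgreSQL's `pg_total_relation_size` (an assumption; the commit message writes the name differently), and the `VALUES` list is just one possible way to feed the shard names to `SUM`.

```python
def size_query_with_pluses(shard_names):
    # One function call per shard, chained with "+": the expression grows
    # by one operator for every extra shard.
    terms = [f"pg_total_relation_size('{name}')" for name in shard_names]
    return "SELECT " + " + ".join(terms) + ";"


def size_query_with_sum(shard_names):
    # A single aggregate over a VALUES list: the expression stays the same,
    # only the list of rows grows.
    # Sketch only: names are interpolated without quoting or escaping.
    values = ", ".join(f"('{name}')" for name in shard_names)
    return (
        "SELECT SUM(pg_total_relation_size(shard_name::regclass)) "
        f"FROM (VALUES {values}) AS shards(shard_name);"
    )


shards = ["t1", "t2", "t3"]
print(size_query_with_pluses(shards))
print(size_query_with_sum(shards))
```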
Sameer Awasekar	4df8eca77f	Add worker_split_shard_release_dsm udf to release dynamic shared memory (#6248 ) The code introduces worker_split_shard_release_dsm udf to release the dynamic shared memory segment allocated during non-blocking split workflow.	2022-08-26 18:27:32 +05:30
Jelte Fennema	77dd49fcf8	Fix flakyness in failure_online_move_shard_placement (#6250 ) Sometimes in CI failure_online_move_shard_placement fails with the following error: ```diff SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").cancel(' \|\| :pid \|\| ')'); mitmproxy ----------- (1 row) SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port); -ERROR: canceling statement due to user request +ERROR: tuple concurrently updated +CONTEXT: while executing command on localhost:9060 -- failure on polling subscription state ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26441/workflows/dd6e3475-6121-47b3-aea3-4ac92be114f4/jobs/751476/steps This error is not completely harmless, because based on the logs it means that our cleanup logic failed, which in turn means that replication slots are left around: ``` 2022-08-24 16:01:29.247 UTC [1219] ERROR: XX000: tuple concurrently updated 2022-08-24 16:01:29.247 UTC [1219] LOCATION: simple_heap_update, heapam.c:4179 2022-08-24 16:01:29.247 UTC [1219] STATEMENT: ALTER SUBSCRIPTION citus_shard_move_subscription_10 DISABLE ``` However, we have other mechanisms to clean up any leftovers in case of a failed cleanup. So it's not that big of a problem. The reason we run into this error is arguably because of a Postgres bug, so I created a patch for Postgres that fixes this. While we wait for this (or a similar) patch to be merged, this PR disables the flaky test. There's still a test that covers the case of a connection "kill" instead of a "cancel", so I don't think we lose very important coverage by disabling this test. When trying to reproduce this I only hit this issue in the cancel case, so I don't think there's a need to disable the kill case for now.	2022-08-26 12:49:45 +02:00
Jelte Fennema	2a0c0b3ba6	Fix flakyness in failure_connection_establishment (#6251 ) In CI sometimes failure_connection_establishment would fail with the following error: ```diff -- cancel all connections to this node SELECT citus.mitmproxy('conn.onAuthenticationOk().cancel(' \|\| pg_backend_pid() \|\| ')'); - mitmproxy ---------------------------------------------------------------------- - -(1 row) - +ERROR: canceling statement due to user request +CONTEXT: COPY mitmproxy_result, line 1: "" +SQL statement "COPY mitmproxy_result FROM '/home/circleci/project/src/test/regress/tmp_check/mitmproxy.fifo'" +PL/pgSQL function citus.mitmproxy(text) line 11 at EXECUTE SELECT * FROM citus_check_cluster_node_health(); ``` The reason for this is that the mitm command that was used is very broad and doesn't actually do what the comment says. What happens is that if any connection is made, the current backend is cancelled, which is not always the same as the backend that made the connection. My assessment is that likely the maintenance daemon makes a connection to the node while we are executing the mitmproxy command. The mitmproxy command goes through, and then triggers a cancel of itself due to the connection made by the maintenance daemon. This PR simply removes this test, since it doesn't seem to test what it intended to test anyway. There's also still the "kill" version of this test, which does do the intended thing. So I don't think we lose important coverage by removing this test.	2022-08-26 10:01:36 +00:00
Jelte Fennema	18015ca501	Fix flakyness in multi_transaction_recovery (#6249 ) Sometimes in CI multi_transaction_recovery would fail with the following error: ```diff SET LOCAL citus.defer_drop_after_shard_move TO OFF; SELECT citus_move_shard_placement((SELECT * FROM selected_shard), 'localhost', :worker_1_port, 'localhost', :worker_2_port, shard_transfer_mode := 'block_writes'); - citus_move_shard_placement ---------------------------------------------------------------------- - -(1 row) - +ERROR: could not find placement matching "localhost:57637" +HINT: Confirm the placement still exists and try again. COMMIT; ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26510/workflows/8269ea93-d9b4-4376-ae0e-8332a5c15fc6/jobs/755548 The reason for this was that when choosing `selected_shard` we didn't ensure that it was actually located on the node that we were moving it from. Instead we simply picked the first shard for the table that was returned by the query. To fix this issue this PR adds a filter to only choose shards that are located on the intended node.	2022-08-26 11:48:55 +02:00
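The fix described in the commit above amounts to filtering before picking. As a loose illustration only (the placement records, shard ids, and function names below are hypothetical and not taken from the test), picking the first shard regardless of node can return a shard that is not on the source node, while filtering on the node first cannot:

```python
# Hypothetical placements of one table's shards: which node holds which shard.
placements = [
    {"shardid": 1220000, "nodeport": 57638},
    {"shardid": 1220001, "nodeport": 57637},
]


def first_shard(rows):
    # Old behavior in spirit: take whatever shard comes back first.
    return rows[0]["shardid"]


def first_shard_on_node(rows, nodeport):
    # Fixed behavior in spirit: only consider shards placed on the given node.
    for row in rows:
        if row["nodeport"] == nodeport:
            return row["shardid"]
    return None


print(first_shard(placements))                 # 1220000, which lives on 57638
print(first_shard_on_node(placements, 57637))  # 1220001
```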
Jelte Fennema	9749622399	Fix flakyness in isolation_distributed_deadlock_detection (#6240 ) Our isolation_distributed_deadlock_detection test would fail randomly in CI in three different ways. The first type of failure looked like this: ```diff check_distributed_deadlocks --------------------------- t (1 row) -step s1-update-5: <... completed> step s5-update-1: <... completed> ERROR: canceling the transaction since it was involved in a distributed deadlock +step s1-update-5: <... completed> step s1-commit: ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26399/workflows/d213ee85-397a-467a-9ffb-39e4f44e6688/jobs/749533 This random change in output was harmless and happened because when the deadlock detector cancelled a query, two queries would continue: The one that was cancelled would throw an error (and thus complete), and the one that was unblocked would now complete. It was random which of the two the isolation tester would first detect as completed. To resolve this, this PR starts using the ["marker" feature][1], which allows us to make sure one of the steps won't be marked as completed until the other one has completed first. The second random failure was very similar: ```diff check_distributed_deadlocks --------------------------- t (1 row) -step s2-update-2: <... completed> -step s3-update-3: <... completed> -ERROR: canceling the transaction since it was involved in a distributed deadlock step s6-commit: COMMIT; step s5-update-6: <... completed> +step s2-update-2: <... completed> +step s3-update-3: <... completed> +ERROR: canceling the transaction since it was involved in a distributed deadlock step s5-commit: ``` Again a harmless difference in test output. In this case it's possible that the deadlock detector would not detect the unblocked processes right away, and would thus continue to the next step. This step was a commit on a session that was not blocked, and which thus could complete without issues.
To solve this I changed the order of the commits at the end of the permutation, to always have the first session that would commit be the session that would be unblocked the last. This ensures that no commit will ever be executed before completing all the queries. The third issue was different and looked like this: ```diff step s4-update-5: <... completed> step s4-commit: COMMIT; +step s1-update-4: <... completed> +isolationtester: canceling step s3-update-4 after 5 seconds step s3-update-4: <... completed> +ERROR: canceling statement due to user request +step s2-update-2: <... completed> step s3-commit: COMMIT; -step s2-update-2: <... completed> -step s1-update-4: <... completed> step s1-commit: ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26411/workflows/9089beec-4f0f-4027-b4ce-0e84889afc06/jobs/750143 The reason for this failure is not entirely clear to me, but I was able to remove the flakiness without impacting the goal of the test. What was happening was that both `s1` and `s3` were waiting for `s4` to commit and release its lock on row 4. For some reason it wasn't deterministic which of the two sessions would be granted the lock after it was released by `s4`. The test expected `s3` to be granted the lock, but sometimes it would be granted to `s1` instead, which would in turn cause `s3` to still be blocked. To solve this I simply removed `s1` completely from this test. It wasn't actually part of the cycle that the deadlock detector should detect and was an unrelated appendage: ```mermaid graph TD; s2-->s3; s3-->s4; s1-->s4; s4-->s5; s5-->s6; s6-->s5; ``` By removing `s1` completely there was no contention for the lock and `s3` could always acquire it. [1]: `a73d6c87f2/src/test/isolation/README (L163-L188)`	2022-08-26 12:03:40 +03:00
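The wait graph in the commit above can be checked mechanically. The sketch below copies the edges from the mermaid diagram and assumes each arrow means "waits for"; the function names are hypothetical and this is not how the Citus deadlock detector is implemented. A session is part of a cycle if it can reach itself by following edges.

```python
# Edges copied from the mermaid graph, read as "session -> sessions it waits for".
wait_for = {
    "s1": ["s4"],
    "s2": ["s3"],
    "s3": ["s4"],
    "s4": ["s5"],
    "s5": ["s6"],
    "s6": ["s5"],
}


def reachable_from(graph, start):
    # Iterative depth-first search over everything start transitively waits for.
    seen = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def sessions_in_cycle(graph):
    # A session is in a cycle if it can reach itself.
    return sorted(node for node in graph if node in reachable_from(graph, node))


print(sessions_in_cycle(wait_for))  # ['s5', 's6']

# Dropping s1 leaves the cycle untouched, matching the commit's reasoning.
without_s1 = {node: edges for node, edges in wait_for.items() if node != "s1"}
print(sessions_in_cycle(without_s1))  # ['s5', 's6']
```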
Jelte Fennema	b5cd1676f9	Fix flakyness in multi_utilities (#6245 ) In CI multi_utilities would sometimes fail randomly with this error: ```diff VACUUM (INDEX_CLEANUP ON, PARALLEL 1) local_vacuum_table; SELECT pg_size_pretty( pg_total_relation_size('local_vacuum_table') ); pg_size_pretty ---------------- - 21 MB + 22 MB (1 row) ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26459/workflows/da47d9b6-f70b-49fe-806f-5ebf75bf0b11/jobs/752482 This is a harmless change in output where the relation size after vacuuming was slightly more than we expected. This changes the size checks for the local_vacuum_table to allow a wider range of values. It uses the same trick as #6216 to show the actual value when it's outside this valid range, which is useful if this test ever starts failing again.	2022-08-25 22:50:47 +02:00
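The range-check trick mentioned in the commit above can be sketched as follows. This is only an illustration: the bounds and the function name are hypothetical, and the real test expresses the check in SQL rather than Python. The point is that the expected output stays constant for any value inside the range, and the actual value only appears when the check fails.

```python
def size_check(actual_mb, low_mb, high_mb):
    # Inside the accepted range: print a constant, so the expected output
    # does not change when the size wobbles by a megabyte.
    if low_mb <= actual_mb <= high_mb:
        return "in range"
    # Outside the range: surface the real value to make the failure debuggable.
    return f"{actual_mb} MB"


print(size_check(21, 20, 25))  # in range
print(size_check(22, 20, 25))  # in range
print(size_check(30, 20, 25))  # 30 MB
```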
Jelte Fennema	00485d45a6	Make multi_utilities not leak tables (#6246 ) When trying to fix #6245 I realized that multi_utilities was leaking some tables that it created during the test. This fixes that by creating all these tables in a schema that's dedicated for this test.	2022-08-25 19:33:03 +03:00
Jelte Fennema	1688bcda33	Fix errors in base_schedule (#6247 ) When running `make check-base` locally it would fail with two different errors. The first one was this: ```diff SELECT create_distributed_table('pg_class', 'relname'); -ERROR: cannot create a citus table from a catalog table +ERROR: deadlock detected +DETAIL: Process 28950 waits for ExclusiveLock on relation 16551 of database 16384; blocked by process 28951. +Process 28951 waits for RowExclusiveLock on relation 1259 of database 16384; blocked by process 28950. +HINT: See server log for query details. SELECT create_reference_table('pg_class'); ``` This happened because multi_behavioral_analytics_create_table and multi_create_table were being run in parallel. Running them separately resolved this issue. The second one was this: ```diff CREATE OR REPLACE FUNCTION wait_until_metadata_sync(timeout INTEGER DEFAULT 15000) RETURNS void LANGUAGE C STRICT AS 'citus'; +ERROR: duplicate key value violates unique constraint "pg_proc_proname_args_nsp_index" +DETAIL: Key (proname, proargtypes, pronamespace)=(wait_until_metadata_sync, 23, 2200) already exists. -- Add some helper functions for sending commands to mitmproxy ``` Which was because failure_test_helpers and multi_test_helpers were trying to create the same function at the exact same time. The easy fix here is to simply not create this function in the failure_test_helpers file. This is fine, because any test schedule that runs failure_test_helpers also runs multi_test_helpers.	2022-08-25 18:06:41 +02:00

4242 Commits (efd41e8ea55cf613fc1a0255a034ce13ceb554d3)