This didn't cause any bugs, since today we always call
UpdateAutoConvertedForConnectedRelations with autoconverted=false, so we
don't need to backport this anywhere.
Good PR descriptions for flaky tests are quite helpful when reviewing.
Although obviously no two PR descriptions are the same, there are a few
common pieces of information that are useful for all PRs that fix flaky tests.
We should not introduce breaking SQL changes to upgrade files after they
are released. We did that for worker_fetch_foreign_file in v9.0.0 and
worker_repartition_cleanup in v9.2.0. Later, when we tried to drop those
UDFs, they were unexpectedly missing for some clients due to the breaking
change in an old upgrade script. For that case, the fix is to add DROP
IF EXISTS for those two UDFs in 11.0-4--11.1-1.
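A minimal sketch of what such a defensive drop can look like in the later
upgrade script (argument lists are omitted and the pg_catalog schema is an
assumption, so treat this as illustrative rather than the literal script):
```sql
-- Hedged sketch: drop the two UDFs if an old upgrade script left them behind.
DROP FUNCTION IF EXISTS pg_catalog.worker_fetch_foreign_file;
DROP FUNCTION IF EXISTS pg_catalog.worker_repartition_cleanup;
```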
This crash happens with recursively planned queries. For such queries,
subplans are explained via PostgreSQL's ExplainOnePlan function.
This function reconstructs the query description from the plan, so it
expects the ActiveSnapshot for the query to be available. This fix makes
sure that the snapshot is on the stack before calling ExplainOnePlan.
Fixes #2920.
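For reference, here is a sketch of the kind of query that hits this code
path: the subquery has a LIMIT, so Citus plans it recursively as a subplan,
and EXPLAIN then describes that subplan via ExplainOnePlan (table and column
names are illustrative):
```sql
-- Illustrative only: dist_table stands for any distributed table.
EXPLAIN
SELECT *
FROM dist_table
WHERE key IN (SELECT key FROM dist_table ORDER BY key LIMIT 5);
```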
DESCRIPTION: Don't leak search_path to workers on DDL
For some DDL to work we have to set the `search_path` on workers to the
same value as on the coordinator. Previously this search_path would
leak outside of the transaction that was used for the DDL. This fixes
that by using `SET LOCAL` instead of `SET`. The only place where we
still use plain `SET` is for DDL commands that are not allowed within
transactions, such as `CREATE INDEX CONCURRENTLY`.
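A minimal sketch of the difference on a worker connection (schema and table
names are illustrative):
```sql
BEGIN;
-- Scoped to this transaction only, so it cannot leak into later
-- statements sent over the same cached worker connection.
SET LOCAL search_path TO my_schema;
CREATE TABLE my_table (a int);
COMMIT;
SHOW search_path;  -- back to the previous value after COMMIT
```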
This fixes this flaky test:
```diff
CONTEXT: SQL statement "SELECT change_id FROM distributed_triggers.data_changes
WHERE shard_key_value = NEW.shard_key_value AND object_id = NEW.object_id
ORDER BY change_id DESC LIMIT 1"
-PL/pgSQL function record_change() line XX at SQL statement
+PL/pgSQL function distributed_triggers.record_change() line 17 at SQL statement
while executing command on localhost:57638
DELETE FROM data_ref_table where shard_key_value = 'hello';
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/27849/workflows/75ae5f1a-100b-4b7a-b991-7de069f39ee1/jobs/831429
I had tried to fix this flaky test in #5894 and then attempted a better
fix in #5896, where @marcocitus suggested an even better approach. This
change reverts the fix from #5894 and implements the fix suggested by
Marco.
Our multi_mx_alter_distributed_table test actually depended on the old
buggy search_path leaking behavior. After fixing the bug, that test
would fail like this:
```diff
CALL proc_0(1.0);
DEBUG: pushing down the procedure
-NOTICE: Res: 3
-DETAIL: from localhost:xxxxx
+ERROR: relation "test_proc_colocation_0" does not exist
+CONTEXT: PL/pgSQL function mx_alter_distributed_table.proc_0(double precision) line 5 at SQL statement
+while executing command on localhost:57637
RESET client_min_messages;
```
I fixed this test by fully qualifying the table names used in the
procedure. I think it's quite unlikely that actual users depend on this
behavior though, since it would require first running DDL and then
calling a procedure in a session where the search_path was changed
after connecting.
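For reference, a sketch of the shape of that fix, using the names from the
failing diff above (the procedure body and the column name are illustrative,
not the exact ones from the test):
```sql
CREATE OR REPLACE PROCEDURE mx_alter_distributed_table.proc_0(dist_key double precision)
LANGUAGE plpgsql
AS $$
DECLARE
    res INT := 0;
BEGIN
    -- Fully qualified, so the lookup no longer depends on search_path.
    SELECT count(*) INTO res
    FROM mx_alter_distributed_table.test_proc_colocation_0
    WHERE id = dist_key;
    RAISE NOTICE 'Res: %', res;
END;
$$;
```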
DESCRIPTION: Adds failure test for shard move
DESCRIPTION: Remove function `WaitForAllSubscriptionsToBecomeReady` and
related tests
Adding some failure tests for shard moves.
Dropping the no-longer-needed function
`WaitForAllSubscriptionsToBecomeReady`, as the subscriptions now start
as ready from the beginning because we don't use logical replication
table sync workers anymore.
fixes: #6260
In CI shard_rebalancer sometimes fails with this error:
```diff
SET citus.node_connection_timeout to 60;
BEGIN;
SET LOCAL citus.shard_replication_factor TO 2;
SET citus.log_remote_commands TO ON;
SET SESSION citus.max_adaptive_executor_pool_size TO 5;
SELECT replicate_table_shards('dist_table_test_2', max_shard_copies := 4, shard_transfer_mode:='block_writes');
+WARNING: could not establish connection after 60 ms
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/28128/workflows/38eeacc4-4191-4366-87ed-9a628414965a/jobs/847458?invite=true#step-107-21
This PR avoids this issue by increasing `citus.node_connection_timeout`
to 35s.
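Note that the GUC takes milliseconds, which is why the `60` above means
60ms. A sketch of the equivalent setting (where exactly the test sets it may
differ):
```sql
-- 35s expressed in milliseconds, assuming it is set the same way the test
-- set the old value above.
SET citus.node_connection_timeout TO 35000;
```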
I fixed a lot of flaky tests recently and I found some patterns in the
type of issues and type of fixes. This adds a document that lists
these types of issues and explains how to fix them.
To be able to test non-blocking shard moves we take an advisory lock, so
we can pause the shard move at an interesting moment. Originally this
was during the logical replication catch up phase. But when I added
tests for the rebalancer progress I moved this lock before the initial
data copy. This allowed testing of the rebalance progress, but
inadvertently made our non-blocking tests not actually test if we held
unintended locks during logical replication catch up.
This fixes that by creating two types of advisory locks, one before the
copy and one after. This causes the tests to actually test their
intended scenario again.
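To illustrate the mechanism, here is a hypothetical sketch of how an
isolation session can pause a move on such an advisory lock (the lock key
and the exact pause point are made-up placeholders, not the real test
constants):
```sql
-- Session 1: hold the advisory lock the shard move will wait on.
SELECT pg_advisory_lock(12345);

-- Session 2: the move blocks at the chosen point (before or after the
-- initial data copy) until the lock is released.
SELECT citus_move_shard_placement(1500001,
                                  'localhost', 57637,
                                  'localhost', 57638,
                                  shard_transfer_mode := 'force_logical');

-- Session 1: inspect progress while the move is paused, then let it continue.
SELECT * FROM get_rebalance_progress();
SELECT pg_advisory_unlock(12345);
```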
Furthermore, it starts using one of these locks for blocking shard
moves too, which allowed me to reduce the complexity of the rebalance
progress test suite quite a bit. It also allowed enabling some flaky
tests again, because this change stopped them from being flaky. And
finally it allowed testing of rebalance progress for blocking shard copy
operations as well.
In passing it fixes a flaky test during parallel blocking shard moves by
ordering the output.
DESCRIPTION: Adds status column to get_rebalance_progress()
Introduces a new column named `status` for the function
`get_rebalance_progress()`. For each ongoing shard move, this column
will reveal information about that shard move operation's current
status.
For now, the candidate status messages are the ones listed below (see the
example query after the list).
* Not Started
* Setting Up
* Copying Data
* Catching Up
* Creating Constraints
* Final Catchup
* Creating Foreign Keys
* Completing
* Completed
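A sketch of how the new column can be inspected during an ongoing move; the
other columns shown here already exist in `get_rebalance_progress()`:
```sql
SELECT table_name, shardid, sourceport, targetport, progress, status
FROM get_rebalance_progress();
```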
The deparser function set_relation_column_names() knows that it needs
to re-evaluate column names based on the relation's tuple descriptor
when the RTE belongs to a relation (RTE_RELATION).
However, before this commit it didn't know that Citus might wrap such an
RTE with an RTE that points to the citus_extradata_container()
placeholder.
Because of this, it simply took the column aliases (e.g., "bar" in
"foo AS bar") into account, which could result in an incorrectly
deparsed query, as in the case below:
* Say we had a view based on the following query:
```sql
SELECT a FROM table;
```
* And if we rename column "a" to "b", the view query normally becomes:
```sql
SELECT b AS a FROM table;
```
* So before this commit, deparsing a query based on that view resulted
in the following incorrect query, because the deparsing was based on the
column aliases:
```sql
SELECT a FROM table;
```
Fixes #5932.
DESCRIPTION: Fixes a bug that might cause queries on views based on
tables with renamed columns to fail
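A sketch reproducing the scenario end to end on a distributed table (names
are illustrative):
```sql
CREATE TABLE t (a int);
SELECT create_distributed_table('t', 'a');
CREATE VIEW v AS SELECT a FROM t;
ALTER TABLE t RENAME COLUMN a TO b;  -- the view query becomes "SELECT b AS a FROM t"
SELECT * FROM v;  -- previously this could be deparsed incorrectly for the workers
```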
PostgreSQL 15 exposes WL_SOCKET_CLOSED in WaitEventSet API, which is
useful for detecting closed remote sockets. In this patch, we use this
new event and try to detect closed remote sockets in the executor.
When a closed socket is detected, the executor now has the ability to
retry the connection establishment. Note that the executor can retry
connection establishment only for connections that have not been used.
Basically, this patch is mostly useful for preventing the executor from
failing if a cached connection is closed because of a worker node
restart (or worker failover).
In other words, the executor cannot retry connection establishment if we
are in a distributed transaction AND any command has been sent over the
connection. That requires more sophisticated retry mechanisms. For now,
fixing the above use case is enough.
Fixes #5538
Earlier discussions: #5908, #6259 and #6283
### Summary of the current approach compared to the earlier attempts
As noted, we explored some alternatives before getting into this.
https://github.com/citusdata/citus/pull/6283 is simple, but lacks an
important property. We should be checking for `WL_SOCKET_CLOSED`
_before_ sending anything over the wire. Otherwise, it becomes very
tricky to understand which connection is actually safe to retry. For
example, in the current patch, we can safely check
`transaction->transactionState == REMOTE_TRANS_NOT_STARTED` before
restarting a connection.
#6259 does what we intend here (e.g., check whether any command has been
sent). However, as @marcocitus noted, it is very tricky to handle
`WaitEventSets` in multiple places. And the executor is designed such
that it reacts to events, so adding anything `pre-executor` seemed too
ugly.
In the end, I converged on this patch. It relies on the simplicity of
#6283 and also does a very limited handling of `WaitEventSets`, just for
our purpose. Just before we add any connection to the execution, we
check if the remote session has already been closed. With that, we
briefly mix in a second kind of wait event processing with a different
purpose. The new wait event processing we added does not even consider
cancellations; we leave those to the main event processing loop.
Co-authored-by: Marco Slot <marco.slot@gmail.com>
In #6405 I added improved blocked process detection for isolation
tests. But when cleaning up unnecessary code I cleaned up a bit too
much. This actually includes the new function definition in our
migrations.
In CI multi_partitioning sometimes fails with this error:
```diff
SELECT citus_remove_node('localhost', :master_port);
- citus_remove_node
----------------------------------------------------------------------
-
-(1 row)
-
+ERROR: tuple concurrently deleted
-- d) invalid tables for helper UDFs
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/27993/workflows/685e5b20-c923-43e5-8a0d-b932ef4c4914/jobs/839466
This PR avoids this concurrency issue by not running the
multi_partitioning test in parallel with other tests.
If an operation requires the coordinator to be in pg_dist_node and it is
not, then we automatically add the coordinator to pg_dist_node, but only
if the user hasn't added any worker nodes yet.
However, if the user has already added some worker nodes, we throw an
error. With this commit, we improve the error thrown in that case.
Closes #6423 based on the discussion made there.
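A sketch of the manual step the improved error points the user towards,
assuming `citus_set_coordinator_host` is used to register the coordinator
(host and port are illustrative):
```sql
SELECT citus_set_coordinator_host('coordinator-host', 5432);
```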
Sometimes our CI randomly fails on a test in a way similar to this:
```diff
step s2-drop:
DROP TABLE cancel_table;
-
+ <waiting ...>
+step s2-drop: <... completed>
starting permutation: s1-timeout s1-begin s1-sleep10000 s1-rollback s1-reset s1-drop
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/26524/workflows/5415b84f-13a3-482f-bef9-648314c79a67/jobs/756377
I tried to fix that already in #6252 by disabling the maintenance daemon
during isolation tests. But it seems that hasn't fixed all cases of
these errors. This is another attempt at fixing these issues that seems
to have better results.
What it does is start using the pInterestingPids parameter that
citus_isolation_test_session_is_blocked receives. With this change we
filter out block-edges that are not caused by any of these pids.
In passing, this change also makes it possible to run
`isolation_create_distributed_table_concurrently` with
`check-isolation-base`.
PG15 introduced a function called ReplicationSlotName that conflicts
with our function of the same name. I solved this issue by renaming our
function to ReplicationSlotNameForNodeAndOwner.
Relevant PG commit:
c3b5992b91
DESCRIPTION: Fix bug in global PID assignment for rebalancer
sub-connections
In CI our isolation_shard_rebalancer_progress test would sometimes fail
like this:
```diff
+isolationtester: canceling step s1-rebalance-c1-block-writes after 60 seconds
step s1-rebalance-c1-block-writes:
SELECT rebalance_table_shards('colocated1', shard_transfer_mode:='block_writes');
- <waiting ...>
+
+ERROR: canceling statement due to user request
step s7-get-progress:
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/27855/workflows/2a7e335a-f3e8-46ed-b6bd-6920d42f7214/jobs/831710
It turned out this was an actual bug in the way our assignment of global
PIDs interacts with the way we connect to ourselves as the shard
rebalancer. The first command the shard rebalancer sends is a SET
command to change the application_name to `citus_rebalancer`. If
`StartupCitusBackend` is called after this command is processed, then it
overwrites the global PID that was extracted from the previous
application_name. This change makes sure that we don't do that, and
continue to use the original global PID. While it might seem that we
only call `StartupCitusBackend` once for each query backend, this isn't
actually the case. Whenever pg_dist_partition gets ANALYZEd by
autovacuum, we indirectly call `StartupCitusBackend` again, because we
invalidate the cache then.
In passing this fixes two other things as well:
1. It sets `distributedCommandOriginator` correctly in
`AssignGlobalPID`, by using IsExternalClientBackend(). This doesn't
matter much anymore, since AssignGlobalPID effectively becomes a
no-op in this PR for any non-external client backends.
2. It passes the application_name to InitializeBackendData in
StartupCitusBackend, instead of INVALID_CITUS_INTERNAL_BACKEND_GPID
(which effectively got cast to NULL). In practice this doesn't
change the behaviour of the call, since the call is a no-op for every
backend except the maintenance daemon. And the behaviour of the call
is the same for NULL as for the application_name of the maintenance
daemon.
We decrease the verbosity level here to avoid the flaky output below.
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/27936/workflows/dc63128a-1570-41a0-8722-08f3e3cfe301/jobs/836153
```diff
select alter_table_set_access_method('ref','heap');
NOTICE: creating a new table for alter_table_set_access_method.ref
NOTICE: moving the data of alter_table_set_access_method.ref
NOTICE: dropping the old alter_table_set_access_method.ref
NOTICE: drop cascades to 2 other objects
-DETAIL: drop cascades to materialized view m_ref
-drop cascades to view v_ref
+DETAIL: drop cascades to view v_ref
+drop cascades to materialized view m_ref
CONTEXT: SQL statement "DROP TABLE alter_table_set_access_method.ref CASCADE"
NOTICE: renaming the new table to alter_table_set_access_method.ref
alter_table_set_access_method
-------------------------------
(1 row)
```
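A sketch of one way to silence that output, assuming `client_min_messages`
is the knob the test uses to hide the NOTICE/DETAIL lines; the call itself
is the one from the diff above:
```sql
SET client_min_messages TO WARNING;
SELECT alter_table_set_access_method('ref', 'heap');
RESET client_min_messages;
```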
DESCRIPTION: Raises memory limits in columnar from 256MB to 1GB for
reads and writes
This doesn't completely fix #5918, but it at least increases the
buffer limits that might cause an error to be thrown when reading
from or writing into columnar storage. A much better approach
to fix this is documented in #6420.
Replacing memcpy_s with memcpy is quite safe in those places,
since we make sure to allocate enough memory before writing into the
related buffers anyway.
When you run vanilla tests in your local environment, some of the tests
try to find the path of regress.so, which is not in the default lib
path. That is why we need to specify the regress.so path via the dlpath
option.
Example failure:
```
LOAD :'regresslib';
+ERROR: could not access file "/home/aykutbozkurt/.pgenv/pgsql-15beta4/lib/regress.so": No such file or directory
```
It is actually in
`~/.pgenv/src/postgresql-15beta4/src/test/regress/regress.so` which is
found by `$regresslibdir`.
When bumping to RC2, we needed to update one test. The following is the
commit message for the change:
Remove references to optimization PG15 reverted
PG15 introduced an optimization on GROUP BY keys that is now reverted
in RC2.
Relevant PG Commit:
Revert "Optimize order of GROUP BY keys".
443df6e2db932a7cd6d85ddfb67e11a43345130d
Depends on: https://github.com/citusdata/the-process/pull/94
PG15 introduced an optimization on GROUP BY keys that is now reverted
in RC2.
Relevant PG commit:
Revert "Optimize order of GROUP BY keys".
443df6e2db932a7cd6d85ddfb67e11a43345130d
Fixes https://github.com/citusdata/citus/issues/6394.
DESCRIPTION: Fixes a bug that causes disabled triggers to be created as
enabled on shards
Since CREATE TRIGGER doesn't have syntax support to specify
whether the trigger should be enabled/disabled, the underlying
PG function (`pg_get_triggerdef()`) that we use to generate the
command to create the trigger is not enough. For this reason, we
append a second command to enable/disable the trigger, right after
creating it.
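A sketch of the resulting pair of commands for a disabled trigger on a shard
(trigger, table, and shard-suffix names are illustrative):
```sql
CREATE TRIGGER my_trigger
    AFTER INSERT ON my_table_102008
    FOR EACH ROW EXECUTE FUNCTION my_trigger_func();
ALTER TABLE my_table_102008 DISABLE TRIGGER my_trigger;
```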
We also don't retain explicit extension dependencies set via
`ALTER TRIGGER ... DEPENDS ON EXTENSION` commands, but apparently the
right fix for that is to throw an error, as in
`PreprocessAlterTriggerDependsStmt()`; so I opened a separate PR to fix
that, #6399.
During alter_distributed_table, we create a new table like the
original table but with the altered options.
To retrieve the name of the distribution column, we were using
the attribute syscache of the new table, since we already created
the new table as identical to the original table.
However, the attribute syscaches of these two tables are not
the same if the original table has dropped columns. The reason
is that dropped columns are all still present in the cache.
Hence, for example, the attnos would be different in the syscaches.
So, let's use the attribute syscache of the original table.
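A sketch of the case this fixes, where a dropped column shifts the attribute
numbers between the original table and the newly created one (names are
illustrative):
```sql
CREATE TABLE t (a int, b int, dist_col int);
ALTER TABLE t DROP COLUMN b;                 -- dist_col keeps attnum 3 in the original
SELECT create_distributed_table('t', 'dist_col');
-- The new table created here has dist_col at attnum 2, so looking up the
-- distribution column via the new table's syscache went wrong before.
SELECT alter_distributed_table('t', shard_count := 8);
```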
DESCRIPTION: Fixes a bug that prevents retaining columnar table options after a table-rewrite
A fix for this issue: "Columnar: options ignored during ALTER TABLE
rewrite" (#5927).
The OID of the temporary table created during ALTER TABLE was not the
same as the original table's OID, so the columnar options were not being
applied during the rewrite.
The change is that I apply the original table's columnar options to
the new table so that it has the correct options during the rewrite. I
also added a test.
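A sketch of the scenario from #5927; `alter_columnar_table_set` with a
`compression` option is an assumption about how the options were changed,
not necessarily what the new test does:
```sql
CREATE TABLE c (a int) USING columnar;
SELECT alter_columnar_table_set('c', compression => 'pglz');
ALTER TABLE c ALTER COLUMN a TYPE bigint;  -- forces a table rewrite
-- Before the fix, the rewritten table silently fell back to default options.
```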
DESCRIPTION: Adds source_lsn and target_lsn fields into
get_rebalance_progress
Adding two fields named `source_lsn` and `target_lsn` to the function
`get_rebalance_progress`.
Target LSN data is fetched in `GetShardStatistics` by expanding the
query sent to workers (joining with pg_subscription_rel and
pg_stat_subscription), and is then put into the hash map for each shard.
Source LSN data is fetched in `BuildWorkerShardStatististicsHash`, in
the loop that iterates over each node, by sending a pg_current_wal_lsn
query to each node, and is then put into the hash map for each node.
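A sketch of the kind of information being collected; the real queries are
built inside those functions, so this is only illustrative:
```sql
-- On the source node: the current WAL write position.
SELECT pg_current_wal_lsn();

-- On the target node: how far each subscribed shard has caught up.
SELECT srrelid::regclass, latest_end_lsn
FROM pg_subscription_rel
JOIN pg_stat_subscription ON srsubid = subid;
```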
DESCRIPTION: Fixes a bug in `ALTER EXTENSION citus UPDATE`
We had a series of changes on columnar that made it impossible for a
Citus user to downgrade from 10.2-4 to 10.2-2. Since we test downgrades
to immediate previous versions, we did not capture this in our tests.
Here is the series of changes:
- `10.2-1` introduced a btree index named
`columnar.stripe_first_row_number_idx`
- `10.2-3` had a unique index with the same name. To accomplish that,
we dropped the btree index and created a unique index with the same
name.
- `10.2-4` introduced `columnar_ensure_am_depends_catalog()`, which adds
pg_depend entries so that the columnar access method depends on objects
such as `stripe_first_row_number_idx`.
If a user upgrades to `>=10.2-4`, we create a dependency record, and
this prevents users from downgrading to a version earlier than `10.2-3`,
since the downgrade file `columnar--10.2-3--10.2-2.sql` wants to drop
the unique index and create a btree index instead. However, this caused
an error because the columnar access method depended on that index.
We do not usually like to update earlier migration versions, but there
is no other solution that I could think of.
## Notes to reviewer:
Consider reviewing the commits one by one.
- Commit #1 aims to improve downgrade scripts overall.
- Commit #2 documents the failure.
- Commit #3 fixes the problem by updating all the files that attempted
to drop the `stripe_first_row_number_idx` index.
Related: #6041
In our CI the isolation_shard_rebalancer_progress test would sometimes
randomly fail like this:
```diff
table_name|shardid|shard_size|sourcename|sourceport|source_shard_size|targetname|targetport|target_shard_size|progress|operation_type
----------+-------+----------+----------+----------+-----------------+----------+----------+-----------------+--------+--------------
-colocated1|1500001| 49152|localhost | 57637| 49152|localhost | 57638| 73728| 1|move
-colocated2|1500005| 376832|localhost | 57637| 376832|localhost | 57638| 401408| 1|move
+colocated1|1500001| 49152|localhost | 57637| 49152|localhost | 57638| 81920| 1|move
+colocated2|1500005| 376832|localhost | 57637| 376832|localhost | 57638| 409600| 1|move
(2 rows)
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/27688/workflows/8c5ca443-5f21-4f21-b74f-0ca7bde69648/jobs/823648/parallel-runs/1
The shard sizes would be slightly larger or smaller than expected. This
PR fixes that by rounding the output to the nearest expected shard size.
To do so I used a trick described in this Stack Overflow answer:
https://stackoverflow.com/a/33147437/2570866
When investigating I ran into one more random failure:
```diff
-step s1-shard-move-c1-block-writes: <... completed>
+step s4-shard-move-sep-block-writes: <... completed>
citus_move_shard_placement
--------------------------
(1 row)
-step s4-shard-move-sep-block-writes: <... completed>
+step s1-shard-move-c1-block-writes: <... completed>
citus_move_shard_placement
--------------------------
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/27707/workflows/c3ff4fc7-5068-4096-ab9f-803c941ddac0/jobs/824622/parallel-runs/29?filterBy=FAILED
This random failure happens because the two parallel moves can complete
at the same time, so it's non-deterministic which one finishes first. To
make this deterministic I used the "marker" feature from the isolation
tester.
And finally I ran into a third random failure:
```diff
table_name|shardid|shard_size|sourcename|sourceport|source_shard_size|targetname|targetport|target_shard_size|progress|operation_type
----------+-------+----------+----------+----------+-----------------+----------+----------+-----------------+--------+--------------
-colocated1|1500001| 50000|localhost | 57637| 50000|localhost | 57638| 50000| 1|move
-colocated2|1500005| 400000|localhost | 57637| 400000|localhost | 57638| 400000| 1|move
+colocated1|1500001| 50000|localhost | 57637| 50000|localhost | 57638| 8000| 1|move
+colocated2|1500005| 400000|localhost | 57637| 400000|localhost | 57638| 8000| 1|move
colocated1|1500002| 200000|localhost | 57637| 200000|localhost | 57638| 0| 0|move
colocated2|1500006| 8000|localhost | 57637| 8000|localhost | 57638| 0| 0|move
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/27707/workflows/c3ff4fc7-5068-4096-ab9f-803c941ddac0/jobs/824622/parallel-runs/30?filterBy=FAILED
This happened in only two of the tests. For now I commented these tests
out. I have some ideas on how to fix them, but those ideas require more
impactful changes than I would like in this PR. One of these tests also
had a copy-paste error; in passing I fixed that in the commented-out
line.
This test used to contain some utility commands that Citus did not
support. However, we have since added support for most of those
commands, and the test became outdated.
We used to error out in the community edition when a user attempted to
use pooler options. Now that we have open sourced all enterprise
features, the test can be removed.
Sometimes our CI randomly fails on a test in a way similar to this:
```diff
step s2-drop:
DROP TABLE cancel_table;
-
+ <waiting ...>
+step s2-drop: <... completed>
starting permutation: s1-timeout s1-begin s1-sleep10000 s1-rollback s1-reset s1-drop
```
Source:
https://app.circleci.com/pipelines/github/citusdata/citus/26524/workflows/5415b84f-13a3-482f-bef9-648314c79a67/jobs/756377
Another example of a failure like this:
```diff
stop_session_level_connection_to_node
-------------------------------------
(1 row)
step s3-display:
SELECT * FROM ref_table ORDER BY id, value;
SELECT * FROM dist_table ORDER BY id, value;
-
+ <waiting ...>
+step s3-display: <... completed>
id|value
--+-----
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26551/workflows/91dca4b2-bb1c-4cae-b2ef-ce3f9c689ce5/jobs/757781
A step that shouldn't be blocked is temporarily detected as
"waiting ..." and then gets unblocked automatically immediately after.
I'm not certain of the reason for this, but one explanation is that the
maintenance daemon is doing something that blocks the query. In the case
shown, my hunch is that it could be the deferred shard deletion.
This PR disables all the features of the maintenance daemon during
isolation testing to try to prevent processes from randomly being
detected as blocking.
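A hypothetical sketch of the kind of settings this touches; the exact GUCs
and values used by the PR may differ:
```sql
-- Assumed knobs: negative values disable the corresponding background work.
ALTER SYSTEM SET citus.distributed_deadlock_detection_factor = -1;
ALTER SYSTEM SET citus.recover_2pc_interval = -1;
ALTER SYSTEM SET citus.defer_shard_delete_interval = -1;
SELECT pg_reload_conf();
```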
NOTE: I'm not certain that this will actually fix this issue. If the
issue persists even after this change, at least we know that it's not
the maintenance daemon that's blocking it.
For the sake of documentation, here is a failing diff:
```diff
step s2-view-dist:
SELECT query, citus_nodename_for_nodeid(citus_nodeid_for_gpid(global_pid)), citus_nodeport_for_nodeid(citus_nodeid_for_gpid(global_pid)), state, wait_event_type, wait_event, usename, datname FROM citus_dist_stat_activity WHERE query NOT ILIKE ALL(VALUES('%pg_prepared_xacts%'), ('%COMMIT%'), ('%BEGIN%'), ('%pg_catalog.pg_isolation_test_session_is_blocked%'), ('%citus_add_node%')) AND backend_type = 'client backend' ORDER BY query DESC;
query |citus_nodename_for_nodeid|citus_nodeport_for_nodeid|state |wait_event_type|wait_event|usename |datname
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+-------------------+---------------+----------+--------+----------
ALTER TABLE test_table ADD COLUMN x INT;
|localhost | 57636|idle in transaction|Client |ClientRead|postgres|regression
-(1 row)
+
+ SELECT coalesce(to_jsonb(array_agg(csa_from_one_node.*)), '[{}]'::JSONB)
+ FROM (
+ SELECT global_pid, worker_query AS is_worker_query, pg_stat_activity.* FROM
+ pg_stat_activity LEFT JOIN get_all_active_transactions() ON process_id = pid
+ ) AS csa_from_one_node;
+ |localhost | 57638|active | | |postgres|regression
+(2 rows)
```
This failure can be seen at [this CI
run](https://app.circleci.com/pipelines/github/citusdata/citus/27653/workflows/d769701c-8f6e-4f97-a412-16f7b9b288a6/jobs/821416)