citus

Commit Graph

Author	SHA1	Message	Date
Naisila Puka	fd9b3f4ae9	Add tests to make sure distributed clone trigger rename fails in PG15 (#6291 ) Relevant PG commit: 80ba4bb383538a2ee846fece6a7b8da9518b6866	2022-09-06 11:04:14 +03:00
Marco Slot	e6b1845931	Change split logic to avoid EnsureReferenceTablesExistOnAllNodesExtended (#6208 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-09-05 22:02:18 +02:00
Önder Kalacı	bd13836648	Add citus.skip_advisory_lock_permission_checks (#6293 )	2022-09-05 17:47:41 +02:00
Jelte Fennema	1c5b8588fe	Address race condition in InitializeBackendData (#6285 ) Sometimes in CI our isolation_citus_dist_activity test fails randomly like this: ```diff step s2-view-dist: SELECT query, citus_nodename_for_nodeid(citus_nodeid_for_gpid(global_pid)), citus_nodeport_for_nodeid(citus_nodeid_for_gpid(global_pid)), state, wait_event_type, wait_event, usename, datname FROM citus_dist_stat_activity WHERE query NOT ILIKE ALL(VALUES('%pg_prepared_xacts%'), ('%COMMIT%'), ('%BEGIN%'), ('%pg_catalog.pg_isolation_test_session_is_blocked%'), ('%citus_add_node%')) AND backend_type = 'client backend' ORDER BY query DESC; query \|citus_nodename_for_nodeid\|citus_nodeport_for_nodeid\|state \|wait_event_type\|wait_event\|usename \|datname ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+-------------------+---------------+----------+--------+---------- INSERT INTO test_table VALUES (100, 100); \|localhost \| 57636\|idle in transaction\|Client \|ClientRead\|postgres\|regression -(1 row) + + SELECT coalesce(to_jsonb(array_agg(csa_from_one_node.)), '[{}]'::JSONB) + FROM ( + SELECT global_pid, worker_query AS is_worker_query, pg_stat_activity. FROM + pg_stat_activity LEFT JOIN get_all_active_transactions() ON process_id = pid + ) AS csa_from_one_node; + \|localhost \| 57636\|active \| \| \|postgres\|regression +(2 rows) step s3-view-worker: ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26692/workflows/3406e4b4-b686-4667-bec6-8253ee0809b1/jobs/765119 I intended to fix this with #6263, but the fix turned out to be insufficient. This PR tries to address the issue by setting distributedCommandOriginator correctly in more situations. However, even with this change it's still possible to reproduce the flaky test in CI. In any case this should fix at least some instances of this issue. In passing this changes the isolation_citus_dist_activity test to allow running it multiple times in a row.	2022-09-02 14:23:47 +02:00
Ahmet Gedemenli	7c8cc7fc61	Fix flakiness for view tests (#6284 )	2022-09-02 10:12:07 +03:00
Marco Slot	432f399a5d	Allow citus_internal application_name with additional suffix (#6282 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-09-01 14:26:43 +02:00
Naisila Puka	9e2b96caa5	Add pg14->pg15 upgrade test for dist. triggers on part. tables (#6265 ) PRE PG15, Renaming the parent triggers on partitioned tables doesn't recurse to renaming the child triggers on the partitions as well. In PG15, Renaming triggers on partitioned tables recurses to renaming the triggers on the partitions as well. Add an upgrade test to make sure we are not breaking anything with distributed triggers on distributed partitioned tables. Relevant PG commit: 80ba4bb383538a2ee846fece6a7b8da9518b6866	2022-09-01 12:32:44 +03:00
Naisila Puka	317dda6af1	Use RelationGetPrimaryKeyIndex for citus catalog tables (#6262 ) pg_dist_node and pg_dist_colocation have a primary key index, not a replica identity index. Citus catalog tables are created in public schema, which has replica identity index by default as primary key index. Later the citus catalog tables are moved to pg_catalog schema. During pg_upgrade, all tables are recreated, and given that pg_dist_colocation is found in pg_catalog schema, it is recreated in that schema, and when it is recreated it doesn't have a replica identity index, because catalog tables have no replica identity. Further action: Do we even need to acquire this lock on the primary key index? Postgres doesn't acquire such locks on indexes before deleting catalog tuples. Also, catalog tuples don't have replica identities by definition.	2022-09-01 11:56:31 +03:00
Jelte Fennema	8bb082e77d	Fix reporting of progress on waiting and moved shards (#6274 ) In commit `31faa88a4e` I removed some features of the rebalance progress monitor. I did this because the plan was to remove the foreground shard rebalancer later in the PR that would add the background shard rebalancer. So, I didn't want to spend time fixing something that we would throw away anyway. As it turns out we're not removing the foreground shard rebalancer after all, so it made sens to fix the stuff that I broke. This PR does that. For the most part this commit reverts the changes in commit `31faa88a4e`. It's not a full revert though, because it keeps the improved tests and the changes to `citus_move_shard_placement`.	2022-08-31 14:55:47 +03:00
Naisila Puka	98dcbeb304	Specifies that our CustomScan providers support projections (#6244 ) Before, this was the default mode for CustomScan providers. Now, the default is to assume that they can't project. This causes performance penalties due to adding unnecessary Result nodes. Hence we use the newly added flag, CUSTOMPATH_SUPPORT_PROJECTION to get it back to how it was. In PG15 support branch we created explain functions to ignore the new Result nodes, so we undo that in this commit. Relevant PG commit: 955b3e0f9269639fb916cee3dea37aee50b82df0	2022-08-31 10:48:01 +03:00
Jelte Fennema	24e695ca27	Fix flakyness in multi_utilities (#6272 ) Sometimes in CI our multi_utilities test fails like this: ```diff VACUUM (INDEX_CLEANUP ON, PARALLEL 1) local_vacuum_table; SELECT CASE WHEN s BETWEEN 20000000 AND 25000000 THEN 22500000 ELSE s END size FROM pg_total_relation_size('local_vacuum_table') s ; size ---------- - 22500000 + 39518208 (1 row) ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26641/workflows/5caea99c-9f58-4baa-839a-805aea714628/jobs/762870 Apparently VACUUM is not as reliable in cleaning up as we thought. This increases the range of allowed values. Important to note is that the range is still completely outside of the allowed range of the initial size. So we know for sure that some data was cleaned up.	2022-08-30 14:32:34 -07:00
Jelte Fennema	f22a47981a	Fix flakyness in adaptive_executor (#6275 ) Sometimes in CI our adaptive_executor test would fail randomly with the following error: ```diff SELECT sum(result::bigint) FROM run_command_on_workers($$ SELECT count(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid() AND query LIKE '%8010090%' $$); sum ----- - 4 + 2 (1 row) END; ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26665/workflows/40665680-0044-4852-8fe4-5fd628f9fb47/jobs/764371 This means that the low slow start interval did not have any effect on the number of connections being opened. I could see two possibilities for this to happen: 1. CI was slow and actually doing the start of the second connection. I tried to solve this by doubling the time a query to the worker takes. 2. The second option is that the shards were queried in the oposite order than we expect. This would mean that the first query to the worker completes quickly because there's no, sleep because it doesn't contain any rows. I tried to solve this option by adding a row to each shard. After trying to reproduce the random failure in CI it turned out that I needed both of these fixes to resolve the random failure.	2022-08-30 23:23:30 +02:00
Jelte Fennema	8354853dec	Fix flakyness in citus_split_shard_columnar_partitioned (#6273 ) On CI our citus_split_shard_columnar_partitioned test would sometimes randomly fail like this: ```diff 8970008 \| colocated_dist_table \| -2147483648 \| 2147483647 \| localhost \| 57637 8970009 \| colocated_partitioned_table \| -2147483648 \| 2147483647 \| localhost \| 57637 8970010 \| colocated_partitioned_table_2020_01_01 \| -2147483648 \| 2147483647 \| localhost \| 57637 - 8970011 \| reference_table \| \| \| localhost \| 57637 8970011 \| reference_table \| \| \| localhost \| 57638 + 8970011 \| reference_table \| \| \| localhost \| 57637 (13 rows) ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26651/workflows/f695b4fb-ad81-46ff-b97e-0100e5d167ea/jobs/763517 This is a harmless diff due to a missing column in the order by list. This fixes that by adding the nodeport as a tiebreaker.	2022-08-30 19:54:50 +03:00
Marco Slot	6bb31c5d75	Add non-blocking variant of create_distributed_table (#6087 ) Added create_distributed_table_concurrently which is nonblocking variant of create_distributed_table. It bases on the split API which takes advantage of logical replication to support nonblocking split operations. Co-authored-by: Marco Slot <marco.slot@gmail.com> Co-authored-by: aykutbozkurt <aykut.bozkurt1995@gmail.com>	2022-08-30 15:35:40 +03:00
Jelte Fennema	d68654680b	Fix flakyness in isolation_citus_dist_activity (#6263 ) Sometimes in CI our isolation_citus_dist_activity test fails randomly like this: ```diff step s2-view-dist: SELECT query, citus_nodename_for_nodeid(citus_nodeid_for_gpid(global_pid)), citus_nodeport_for_nodeid(citus_nodeid_for_gpid(global_pid)), state, wait_event_type, wait_event, usename, datname FROM citus_dist_stat_activity WHERE query NOT ILIKE ALL(VALUES('%pg_prepared_xacts%'), ('%COMMIT%'), ('%BEGIN%'), ('%pg_catalog.pg_isolation_test_session_is_blocked%'), ('%citus_add_node%')) AND backend_type = 'client backend' ORDER BY query DESC; query \|citus_nodename_for_nodeid\|citus_nodeport_for_nodeid\|state \|wait_event_type\|wait_event\|usename \|datname ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+-------------------+---------------+----------+--------+---------- INSERT INTO test_table VALUES (100, 100); \|localhost \| 57636\|idle in transaction\|Client \|ClientRead\|postgres\|regression -(1 row) + + SELECT coalesce(to_jsonb(array_agg(csa_from_one_node.)), '[{}]'::JSONB) + FROM ( + SELECT global_pid, worker_query AS is_worker_query, pg_stat_activity. FROM + pg_stat_activity LEFT JOIN get_all_active_transactions() ON process_id = pid + ) AS csa_from_one_node; + \|localhost \| 57636\|active \| \| \|postgres\|regression +(2 rows) step s3-view-worker: ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26605/workflows/56d284d2-5bb3-4e64-a0ea-7b9b1626e7cd/jobs/760633 The reason for this is that citus_dist_stat_activity sometimes shows the query that it uses itself to get the data from pg_stat_activity. This is actually a bug, because it's a worker query and thus shouldn't show up there. To try and solve this bug, we remove two small opportunities for a race condition. These race conditions could happen when the backenddata was marked as active, but the distributedCommandOriginator was not set correctly yet/anymore. There was an opportunity for this to happen both during connection start and shutdown.	2022-08-30 12:57:37 +03:00
Önder Kalacı	33af407ac8	Add missing orderbys (#6271 )	2022-08-30 12:49:15 +03:00
Jelte Fennema	895a484b39	Hopefully fix flakyeness in drop_partitioned_table (#6270 ) Sometimes in CI our drop_partitioned_talbe test would fail with the following error: ```diff NOTICE: issuing SELECT worker_drop_distributed_table('drop_partitioned_table.child1') NOTICE: issuing SELECT worker_drop_distributed_table('drop_partitioned_table.child1') NOTICE: issuing DROP TABLE IF EXISTS drop_partitioned_table.child1_727001 CASCADE -NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100047) -NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100047) +NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100046) +NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100046) ROLLBACK; NOTICE: issuing ROLLBACK NOTICE: issuing ROLLBACK ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26631/workflows/31536032-e1ba-493b-b12a-f40757f3a7d6/jobs/762170 For some reason the colocationid of the distributed partitioned table would be one less than we expected. Why this happens I'm not sure, but it seems fairly harmless that it does. In an attempt to work around this flakyness I now reset the colocation id sequence right before creating the table in question. This is good practice in general, because it allows us to run the test successfully using `check-minimal` and it also allows us to rerun it multiple times.	2022-08-30 12:21:16 +03:00
Jelte Fennema	5c95604154	Always copy normalized files after a regress run (#6254 ) Our python based tests didn't always copy the normalized files after the regress run. I had the problem where running the following command would result in non-normalized files in the expected directory after running our PG upgrade tests locally: ``` cp src/test/regress/{results,expected}/upgrade_list_citus_objects.out ``` This PR fixes that by always running `copy_modified` even if the tests fail. The same was already being done for our perl based tests at the end of the `pg_regress_multi.pl` file.	2022-08-30 07:15:29 +00:00
Naisila Puka	13fe89f018	Fixes flakyness in columnar_permissions test (#6266 ) `columnar_permissions.sql` test is flaky due to a missing `ORDER BY` clauses. Added the other `ORDER BY` clauses for consistency in the test. ```diff where relation in ('no_access'::regclass, 'columnar_permissions'::regclass); relation \| chunk_group_row_limit \| stripe_row_limit \| compression \| compression_level ----------------------+-----------------------+------------------+-------------+------------------- - no_access \| 10000 \| 150000 \| zstd \| 3 columnar_permissions \| 10000 \| 2222 \| none \| 3 + no_access \| 10000 \| 150000 \| zstd \| 3 (2 rows) ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26610/workflows/79f03ef9-7674-4567-a087-02536c9ddf04/jobs/760942	2022-08-29 14:33:26 +02:00
Önder Kalacı	1df943e0d5	Use Posix locale in the tests (#6261 ) Commit `9653a0065e` has changed it to C.UTF-8 , which fails on MacOS	2022-08-29 12:52:03 +02:00
Ahmet Gedemenli	0855a9d1d4	Use SUM for calculating non partitioned table sizes (#6222 ) We currently do a `pg_relation_total_size('t1') + pg_relation_total_size('t2') + ..` on shard lists, especially when rebalancing the shards. This in some cases goes huge. With this PR, we basically use a SUM for all table sizes, instead of using thousands of pluses.	2022-08-26 18:02:14 +03:00
Sameer Awasekar	4df8eca77f	Add worker_split_shard_release_dsm udf to release dynamic shared memory (#6248 ) The code introduces worker_split_shard_release_dsm udf to release the dynamic shared memory segment allocated during non-blocking split workflow.	2022-08-26 18:27:32 +05:30
Jelte Fennema	77dd49fcf8	Fix flakyness in failure_online_move_shard_placement (#6250 ) Sometimes in CI failure_online_move_shard_placement fails with the following error: ```diff SELECT citus.mitmproxy('conn.onQuery(query="^ALTER SUBSCRIPTION .* ENABLE").cancel(' \|\| :pid \|\| ')'); mitmproxy ----------- (1 row) SELECT master_move_shard_placement(101, 'localhost', :worker_1_port, 'localhost', :worker_2_proxy_port); -ERROR: canceling statement due to user request +ERROR: tuple concurrently updated +CONTEXT: while executing command on localhost:9060 -- failure on polling subscription state ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26441/workflows/dd6e3475-6121-47b3-aea3-4ac92be114f4/jobs/751476/steps This error is not completely harmless, because based on the logs it mean that our cleanup logic failed, which in turn means that replication slots are left around: ``` 2022-08-24 16:01:29.247 UTC [1219] ERROR: XX000: tuple concurrently updated 2022-08-24 16:01:29.247 UTC [1219] LOCATION: simple_heap_update, heapam.c:4179 2022-08-24 16:01:29.247 UTC [1219] STATEMENT: ALTER SUBSCRIPTION citus_shard_move_subscription_10 DISABLE ``` However, we have other mechanisms to clean up any leftovers in case of a failed cleanup. So it's not that big of a problem. The reason we run into this error is arguably because of a Postgres bug, so I created a patch for Postgres that fixes this. While we wait for this (or a similar) patch to be merged, this PR disables the flaky test. There's still a test that tests in case of a connection "kill" instead of a "cancel", so I don't think we lose very important coverage by disabling this test. When trying to reproduce this I only hit this issue in the cancel case, so I don't think there's a need to disable the kill case for now.	2022-08-26 12:49:45 +02:00
Jelte Fennema	2a0c0b3ba6	Fix flakyness in failure_connection_establishment (#6251 ) In CI sometimes failure_connection_establishment would fail with the following error: ```diff -- cancel all connections to this node SELECT citus.mitmproxy('conn.onAuthenticationOk().cancel(' \|\| pg_backend_pid() \|\| ')'); - mitmproxy ---------------------------------------------------------------------- - -(1 row) - +ERROR: canceling statement due to user request +CONTEXT: COPY mitmproxy_result, line 1: "" +SQL statement "COPY mitmproxy_result FROM '/home/circleci/project/src/test/regress/tmp_check/mitmproxy.fifo'" +PL/pgSQL function citus.mitmproxy(text) line 11 at EXECUTE SELECT * FROM citus_check_cluster_node_health(); ``` The reason for this is that the mitm command that was used is very broad and doesn't actually do what the comment says. What happens is that if any connection is made, the current backend is cancelled, which is not the always the same as the backend that made the connection. My assessment is that likely the maintenance daemon makes a connection to the node while we are executing the mitmproxy command. The mitmproxy command goes through, and then triggers a cancel of itself due to the connection made by the maintenance daemon. This PR simply removes this test, since it doesn't seem to test what it intended to test anyway. There's also still the "kill" version of this test, which does do the intended thing. So I don't think we lose important coverage by removing this test.	2022-08-26 10:01:36 +00:00
Jelte Fennema	18015ca501	Fix flakyness in multi_transaction_recovery (#6249 ) Sometimes in CI multi_transaction_recovery would fail with the following error: ```diff SET LOCAL citus.defer_drop_after_shard_move TO OFF; SELECT citus_move_shard_placement((SELECT * FROM selected_shard), 'localhost', :worker_1_port, 'localhost', :worker_2_port, shard_transfer_mode := 'block_writes'); - citus_move_shard_placement ---------------------------------------------------------------------- - -(1 row) - +ERROR: could not find placement matching "localhost:57637" +HINT: Confirm the placement still exists and try again. COMMIT; ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26510/workflows/8269ea93-d9b4-4376-ae0e-8332a5c15fc6/jobs/755548 The reason for this was that when choosing `selected_shard` we didn't ensure that it was actually located on the node that we were moving it from. Instead we simply picked the first shard for the table that was returned by the query. To fix this issue this PR adds a filter to only choose shards that are located on the intended node.	2022-08-26 11:48:55 +02:00
Jelte Fennema	9749622399	Fix flakyness in isolation_distributed_deadlock_detection (#6240 ) Our isolation_distributed_deadlock_detection test would fail randomly in CI in three different ways. The first type of failure looked like this: ```diff check_distributed_deadlocks --------------------------- t (1 row) -step s1-update-5: <... completed> step s5-update-1: <... completed> ERROR: canceling the transaction since it was involved in a distributed deadlock +step s1-update-5: <... completed> step s1-commit: ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26399/workflows/d213ee85-397a-467a-9ffb-39e4f44e6688/jobs/749533 This random change in output was harmless and happened because when the deadlock detector cancelled a query, two queries would continue: The one that was cancelled would throw an error (and thus complete), and the one that was unblocked would now complete. It was random which of the two the isolation tester would first detect as completed. To resolve this PR starts using the ["marker" feature][1], this allows us to make sure one of the steps won't be marked as completed until the other one completed first. The second random failure was very similar: ```diff check_distributed_deadlocks --------------------------- t (1 row) -step s2-update-2: <... completed> -step s3-update-3: <... completed> -ERROR: canceling the transaction since it was involved in a distributed deadlock step s6-commit: COMMIT; step s5-update-6: <... completed> +step s2-update-2: <... completed> +step s3-update-3: <... completed> +ERROR: canceling the transaction since it was involved in a distributed deadlock step s5-commit: ``` Again a harmless difference in test output. In this case it's possible that the deadlock detector would not detect the unblocked processes right away, and would thus continue with to the next step. This step was a commit on a session that was not blocked, and which thus could complete without issues. To solve this I changed the order of the commits at the end of the permutation, to always have the first session that would commit be the session that would be unblocked the last. This ensures that no commit will ever be executed before completing all the queries. The third issue was different and looked like this: ```diff step s4-update-5: <... completed> step s4-commit: COMMIT; +step s1-update-4: <... completed> +isolationtester: canceling step s3-update-4 after 5 seconds step s3-update-4: <... completed> +ERROR: canceling statement due to user request +step s2-update-2: <... completed> step s3-commit: COMMIT; -step s2-update-2: <... completed> -step s1-update-4: <... completed> step s1-commit: ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26411/workflows/9089beec-4f0f-4027-b4ce-0e84889afc06/jobs/750143 The reason for this failure is not entirely clear to me, but I was able to remove the flakyness without impacting the goal of the test. What was happening was that both `s1` and `s3` were waiting for `s4` to commit and release it's lock on the row 4. For some reason it wasn't deterministic which of the two sessions would be granted the lock after it was released by row 4. The test expected `s3` to be granted the lock, but sometimes it would be granted to `s1` instead. Which would in turn cause `s3` to still be blocked. To solve this I simply removed `s1` completely from this test. It wasn't actually part of the cycle that the deadlock detector should detect and was an unrelated appendage: ```mermaid graph TD; s2-->s3; s3-->s4; s1-->s4; s4-->s5; s5-->s6; s6-->s5; ``` By removing `s1` completely there was no contention for the lock and `s3` could always acquire it. [1]: `a73d6c87f2/src/test/isolation/README (L163-L188)`	2022-08-26 12:03:40 +03:00
Jelte Fennema	b5cd1676f9	Fix flakyness in multi_utilities (#6245 ) In CI multi_utilities would sometimes fail randomly with this error: ```diff VACUUM (INDEX_CLEANUP ON, PARALLEL 1) local_vacuum_table; SELECT pg_size_pretty( pg_total_relation_size('local_vacuum_table') ); pg_size_pretty ---------------- - 21 MB + 22 MB (1 row) ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26459/workflows/da47d9b6-f70b-49fe-806f-5ebf75bf0b11/jobs/752482 This is a harmless change in output where the relation size after vacuuming was slightly more than we expected. This changes the size checks for the local_vacuum_table to allow a wider range of values. It uses the same trick as #6216 to show the actual value when it's outside this valid range, which is useful if this test ever starts failing again.	2022-08-25 22:50:47 +02:00
Jelte Fennema	00485d45a6	Make multi_utilities not leak tables (#6246 ) When trying to fix #6245 I realized that multi_utilities was leaking some tables that it created during the test. This fixes that by creating all these tables in a schema that's dedicated for this test.	2022-08-25 19:33:03 +03:00
Jelte Fennema	1688bcda33	Fix errors in base_schedule (#6247 ) When running `make check-base` locally it would fail with two different errors. The first one was this: ```diff SELECT create_distributed_table('pg_class', 'relname'); -ERROR: cannot create a citus table from a catalog table +ERROR: deadlock detected +DETAIL: Process 28950 waits for ExclusiveLock on relation 16551 of database 16384; blocked by process 28951. +Process 28951 waits for RowExclusiveLock on relation 1259 of database 16384; blocked by process 28950. +HINT: See server log for query details. SELECT create_reference_table('pg_class'); ``` This happened because multi_behavioral_analytics_create_table and multi_create_table were being run in parallel. Running them separately resolved this issue. The second one was this: ```diff CREATE OR REPLACE FUNCTION wait_until_metadata_sync(timeout INTEGER DEFAULT 15000) RETURNS void LANGUAGE C STRICT AS 'citus'; +ERROR: duplicate key value violates unique constraint "pg_proc_proname_args_nsp_index" +DETAIL: Key (proname, proargtypes, pronamespace)=(wait_until_metadata_sync, 23, 2200) already exists. -- Add some helper functions for sending commands to mitmproxy ``` Which was because failure_test_helpers and multi_test_helpers were trying to create the same function at the exact same time. The easy fix here is to simply not create this function in the failure_test_helpers file. This is fine, because any test schedule that runs failure_test_helpers also runs multi_test_helpers.	2022-08-25 18:06:41 +02:00
Jelte Fennema	ee5af1ab90	Use C.UTF-8 locale in tests (#6242 ) I upgraded my OS to Ubuntu 22.04 a while back and since then some tests order output slightly differently. I think it might be because of the glibc upgrade that changed ordering for things like underscores and spaces. Changing the locale to C.UTF-8 solves this issue.	2022-08-25 13:10:49 +02:00
Önder Kalacı	3ed6fea1cf	Prevent Merge command on distributed tables [PG 15] (#6238 )	2022-08-25 13:27:08 +03:00
Marco Slot	9bf3c3dd5c	Add an allow_unsafe_constraints flag for constraints without distribution column (#6237 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-08-25 11:37:50 +03:00
Gokhan Gulbiz	69d2fcf5c0	Use the same colocation group for child and parent rels when altering a distributed table (#6225 ) * Alter_distributed_table colocateWith:none bug fix for partitioned tables. * Regression tests added for alter_distributed_table colocateWith:none for partitioned tables * Update query comparision to be more accurate	2022-08-25 11:23:59 +03:00
Marco Slot	ac07d33a29	Remove unused reduceQuery from physical planning (#6221 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-08-24 17:24:27 +00:00
Naisila Puka	1f4fe35512	Support JSON_TABLE on PG 15 (#6241 ) Postgres supports JSON_TABLE feature on PG 15. We treat JSON_TABLE the same as correlated functions (e.g., recurring tuples). In the end, for multi-shard JSON_TABLE commands, we apply the same restrictions as reference tables (e.g., cannot be in the outer part of an outer join etc.) Co-authored-by: Onder Kalaci <onderkalaci@gmail.com>	2022-08-24 19:11:18 +03:00
Naisila Puka	35b4ddc355	Pg15 support (#6085 ) * Adjust configure script to allow PG15 * Adds copy of ruleutils_14.c as ruleutils_15.c * Uses get_namespace_name_or_temp in ruleutils_15.c Relevant PG commit: 48c5c9068211e0a04fd9553c8714b2821ed3ad17 * Clean up code using "(expr) ? true : false" in ruleutils_15.c Relevant PG commit: fd0625c7a9c679c0c1e896014b8f49a489c3a245 * Change varno from Index (unsigned int) to int in ruleutils_15.c Relevant PG commit: e3ec3c00d85bd2844ffddee83df2bd67c4f8297f * Adds find_recursive_union to ruleutils_15.c Relevant PG commit: 3f50b82639637c9908afa2087de7588450aa866b * Fix display of SQL-std func's args in INSERT/SELECT in ruleutils_15.c Relevant PG commit: a8d8445a7b2f80f6d0bfe97b19f90bd2cbef8759 * Fix ruleutils_15.c's dumping of whole-row Vars in more contexts Relevant PG commit: 43c2175121c829c8591fc5117b725f1f22bfb670 * Fix assorted missing logic for GroupingFunc nodes in ruleutils_15.c Relevant PG commit: 2591ee8ec44d8cbc8e1226550337a64c684746e4 * Adds grammar support for SQL/JSON clauses in ruleutils_15.c Relevant PG commit: f79b803dcc98d707450e158db3638dc67ff8380b * Adds SQL/JSON constructors to ruleutils_15.c Relevant PG commits: f4fb45d15c59d7add2e1b81a9d477d0119a9691a cc7401d5ca498a84d9b47fd2e01cebd8e830e558 * Adds support for MERGE in ruleutils_15.c Relevant PG commit: 7103ebb7aae8ab8076b7e85f335ceb8fe799097c * Add IS JSON predicate to ruleutils_15.c Relevant PG commit: 33a377608fc29cdd1f6b63be561eab0aee5c81f0 * Add SQL/JSON query functions to ruleutils_15.c Relevant PG commit: 1a36bc9dba8eae90963a586d37b6457b32b2fed4 * Adds three different SQL/JSON values to ruleutils_15.c Relevant PG commits: 606948b058dc16bce494270eea577011a602810e 49082c2cc3d8167cca70cfe697afb064710828ca * Adds JSON table functions in ruleutils_15.c Relevant PG commit: 4e34747c88a03ede6e9d731727815e37273d4bc9 * Add PLAN function for JSON table in ruleutils_15.c Relevant PG commit: fadb48b00e02ccfd152baa80942de30205ab3c4f * Remove extra blank lines before block-closing braces ruleutils_15.c Relevant PG commit: 24d2b2680a8d0e01b30ce8a41c4eb3b47aca5031 * set_deparse_plan: Reuse variable to appease Coverity ruleutils_15.c Relevant PG commit: e70813fbc4aaca35ec012d5a426706bd54e4acab * Mechanical code beautification ruleutils_15.c Relevant PG commit: 23e7b38bfe396f919fdb66057174d29e17086418 * Rename value_type to item_type in ruleutils_15.c Relevant PG commit: 3ab9a63cb638a1fd99475668e2da9c237495aeda * Show 'AS "?column?"' explicitly when it's important in ruleutils_15.c Relevant PG commit: c7461fc25558832dd347a9c8150b0f1ed85e36e8 * Fix ruleutils_15.c issues with dropped cols in funcs-returning-composite Relevant PG commit: c1d1e8469c77ce6b8e5310955580b4a3eee7fe96 * Change comment regarding functions returning composite in ruleutils_15.c Relevant PG commit: c2fa113ddb1117b1f03e91960f65d5d7d8a90270 * Replace int nodes with bool nodes where needed In PG15, Boolean nodes are added. Pre PG15, internal Boolean values in Create Role commands were represented by Integer nodes. This commit replaces int nodes logic with bool nodes logic where needed. Mostly there are CREATE ROLE logic changes. Relevant PG commit: 941460fcf731a32e6a90691508d5cfa3d1f8eeaf * Handle new option colliculocale in CREATE COLLATION logic In PG15, there is an added option to use ICU as global locale provider. pg_collation has three locale-related fields: collcollate and collctype, which are libc-related fields, and a new one colliculocale, which is the ICU-related field. Only the libc-related fields or the ICU-related field is set, never both. Relevant PG commits: f2553d43060edb210b36c63187d52a632448e1d2 54637508f87bd5f07fb9406bac6b08240283be3b * Add PG15 tests to CI using test images that have 15beta2 (#6093) * Change warning message in pg_signal_backend() Relevant PG commit: 7fa945b857cc1b2964799411f1633468826861ff * Revert "Add missing ifdef for PG 15" This reverts commit `c7b51025ab`. * Fixes tests for ALTER TRIGGER RENAME consistency for part. tables Relevant PG commit: 80ba4bb383538a2ee846fece6a7b8da9518b6866 * Prevent creating child triggers on partitions when adding new node Pre PG15, tgisinternal is true for a "child" trigger on a partition cloned from the trigger on the parent. In PG15, tgisinternal is false in that case. However, we don't want to create this trigger on the partition since it will create a conflict when we try to attach the partition to the parent table: ERROR: trigger "..." for relation "{partition_name}" already exists Relevant PG commit: f4566345cf40b068368cb5617e61318da60676ec * Fix tests for generated columns dependency changes In PG15, For GENERATED columns, all dependencies of the generation expression are recorded as NORMAL dependencies of the column itself. This requires CASCADE to drop generated cols with the original col. PRE PG15, dependencies were recorded as AUTO, with which generated columns are silently dropped with the original column. Relevant PG commit: cb02fcb4c95bae08adaca1202c2081cfc81a28b5 * Explicitly cast catalog "char" column to text before concatenation Relevant PG commit: 07eee5a0dc642d26f44d65c4e6263304208e8583 * Remove 'AS "?column?"' from test outputs There were some instances in the following tst outputs in planning debug outputs where AS "?column?" is added. We add a normalization rule to remove it as it is not important. cte_inline.out recursive_relation_planning_restriction_pushdown.out Relevant PG commit: c7461fc25558832dd347a9c8150b0f1ed85e36e8 * Use pg_backup_stop(PG15) instead of pg_stop_backup(PG<15) Add an alternative test output because of the change in the backup modes of Postgres. Specifically here, there is a renaming issue: pg_stop_backup PRE PG15 vs pg_backup_stop PG15+ The alternative output can be deleted when we drop support for PG14 Relevant PG commit: 39969e2a1e4d7f5a37f3ef37d53bbfe171e7d77a * Adds citus.mitmfifo GUC Previously we setting this configuration parameter in the fly for failure tests schedule. However, PG15 doesn't allow that anymore: reserved prefixes like "citus" cannot be used to set non-existing GUCs. Relevant PG commit: 88103567cb8fa5be46dc9fac3e3b8774951a2be7 * Handles EXPLAIN output diffs in PG15 - Extra result lines To handle extra "Result" lines in explain outputs, we add explain method to multi_test_helpers.sql file - plan_without_result_lines() is added for cases where we want the whole explain output with only "Result" lines removed * Handles EXPLAIN output diffs in PG15, Hash Agg/Join leverage To handle differences in usage of GroupAggregate vs HashAggregate or Merge Join vs Hash join in cases where this detail doesn't seem to matter, we use coordinator_plan(). - coordinator_plan() is updated to remove "Result" lines There are some cases where we have subplans so we add a new function that prints all Task Count lines as well - coordinator_plan_with_subplans() Still not sure of the relevant PG commit Could be db0d67db2401eb6238ccc04c6407a4fd4f985832 but disabling enable_group_by_reordering didn't help. * Handles EXPLAIN output diffs in PG15: enable_group_by_reordering Relevant PG commit db0d67db2401eb6238ccc04c6407a4fd4f985832 * Normalizes Memory Usage, Buckets, Batches for PG15 explain diffs We create a new function in multi_test_helpers, which is similar to explain_merge function in PG15. This explain helper function normalies Memory Usage, Buckets and Batches, and we use it in the tests which give a different output for PG15. * Bump test images to 15beta3 (#6172) * Omit namespace in post-copy errmsg Relevant PG commit: 069d33d0c5a021601245e44df77a0423ddd69359 * Handles EXPLAIN output diffs in PG15: extra arrows&result lines To handle extra "->" arrows resulting from extra Result lines in explain outputs, we add the following explain method to multi_test_helpers.sql file - plan_without_arrows() is added for cases where we want the whole explain output without arrows and without Result lines * Alters public schema's owner to pg_database_owner in PG15 In PG15, public schema is owned by pg_database_owner role. In multi_extension, we drop and recreate the ppublic schema, hence its owner become the default user in our tests, postgres. Change that to pg_database_owner for PG15 consistency. This results in alternative test output for public schema grants in the following test: grant_on_schema_propagation.sql Relevant PG commit: b073c3ccd06e4cb845e121387a43faa8c68a7b62 * Add alternative test outputs for change in Insert Select display citus_local_tables_queries.sql coordinator_shouldhaveshards.sql cte_inline.sql insert_select_repartition.sql intermediate_result_pruning.sql local_shard_execution.sql local_shard_execution_replicated.sql multi_deparse_shard_query.sql multi_insert_select.sql multi_insert_select_conflict.sql multi_mx_insert_select_repartition.sql mx_coordinator_shouldhaveshards.sql single_node.sql Relevant PG commit: a8d8445a7b2f80f6d0bfe97b19f90bd2cbef8759 * Fixes columnar tap tests for PG15 In PG15, Perl test modules have been moved to a new namespace. Also, postgres node new() and get_new_node() methods have been unified to one method: new() We create separate tap tests for PG13/14 and PG15+ and update the Makefiles accordingly. Relevant PG commits: 201a76183e2056c2217129e12d68c25ec9c559c8 b3b4d8e68ae83f432f43f035c7eb481ef93e1583 * Handles EXPLAIN output diffs in PG15: HashAgg Leverage,alt. output Still not sure of the relevant PG commit Could be db0d67db2401eb6238ccc04c6407a4fd4f985832 but disabling enable_group_by_reordering didn't help.	2022-08-24 17:59:17 +02:00
Naisila Puka	ddbd10d2e7	Rename server version checks in tests (#6239 )	2022-08-24 16:31:52 +03:00
Jelte Fennema	5c0205ce10	Fix flakyness in multi_replicate_reference_table (#6235 ) In CI multi_replicate_reference_table would sometimes fail like this: ```diff -- detects correctly that referecence table doesn't have replica identity SELECT replicate_reference_tables(); -ERROR: cannot use logical replication to transfer shards of the relation initially_not_replicated_reference_table since it doesn't have a REPLICA IDENTITY or PRIMARY KEY +ERROR: cannot use logical replication to transfer shards of the relation ref_table since it doesn't have a REPLICA IDENTITY or PRIMARY KEY DETAIL: UPDATE and DELETE commands on the shard will error out during logical replication unless there is a REPLICA IDENTITY or PRIMARY KEY. HINT: If you wish to continue without a replica identity set the shard_transfer_mode to 'force_logical' or 'block_writes'. ``` Because `CitusTableTypeIdList` returns tables in heap order so it's a bit random which one is first in the list. And the test contained multiple tables that didn't have a primary key or replica identity. So it made sense that the error could be for either one of these tables. This PR makes the test output consistent by changing one of the tables to have a primary key. Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26387/workflows/fc3196e7-ddf2-4000-a70b-5ac71c836321/jobs/748940	2022-08-24 13:34:10 +03:00
aykut-bozkurt	041f88d7bf	Revert "Revert "Creates new colocation for colocate_with:='none' too"" (#6227 ) This reverts commit `d171a736ab`.	2022-08-24 10:54:04 +03:00
Marco Slot	bad8196da3	Verify that we can replicate reference tables using rebalancer (#6232 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-08-24 00:34:21 +02:00
Jelte Fennema	1dd775fae8	Speed up logical replication tests to fix flakyness (#6229 ) The isolation_tenant_isolation_nonblocking test would sometimes randomly fail in CI, because we have a limit of runtime limit of 2 minutes per test. ``` test isolation_tenant_isolation_nonblocking ... make: *** [Makefile:171: check-enterprise-isolation] Terminated Too long with no output (exceeded 2m0s): context deadline exceeded ``` One solution would obviously be to increase the timeout, but instead I spent some time to increase the speed of our tests by tweaking some timings. On my local machine the time it took to run the isolation_tenant_isolation_nonblocking test went from 75s to 15s. So now we should easily stay within the 2 minute per test limit. I also checked if the new settings improved other logical replication tests, but the impect differs wildly per test. One other example of a test that runs much quicker due to the change is isolation_non_blocking_shard_split_fkey. But the shard move tests I tried are impacted much less. Example of failed tests: https://app.circleci.com/pipelines/github/citusdata/citus/26373/workflows/4fa660e4-63c8-4844-bef8-70a7bea902b7/jobs/748199	2022-08-23 17:37:31 +02:00
Jelte Fennema	21780b4f65	Fix flakyness in ch_benchmarks_1 (#6228 ) One of our arbitrary config tests would sometimes fail like this in CI: ```diff su_nationkey, cust_nation, l_year; - supp_nation \| cust_nation \| l_year \| revenue ---------------------------------------------------------------------- - 9 \| C \| 2008 \| 3.00 -(1 row) - +ERROR: cannot connect to localhost:10212 to fetch intermediate results +CONTEXT: while executing command on localhost:10211 ``` When looking at the logs it seems like we were running out of connections: ``` 2022-08-23 14:03:52.856 UTC [28122] FATAL: sorry, too many clients already 2022-08-23 14:03:52.860 UTC [21027] ERROR: cannot connect to localhost:10212 to fetch intermediate results ``` This happened with `CitusThreeWorkersManyShards` config. This test on purpose tries to push the limits of Citus quite far. And the `ch_benchmarks_1` test is also run in parallel with a few more ones. So it's not too weird that it ran out of connections. This doubles the connection limit in the arbitrary config tests to hopefully not hit this error again. Example of failed test: https://app.circleci.com/pipelines/github/citusdata/citus/26365/workflows/7a1b5688-85cc-4bc3-ade5-9bd1d83cd0ed/jobs/747908/parallel-runs/1	2022-08-23 17:24:27 +02:00
Jelte Fennema	e0ada050aa	Enable binary logical replication for shard moves (#6017 ) Using binary encoding can save a lot of CPU cycles, both on the sender and on the receiver. Since the walsender and walreceiver processes are single threaded, this can matter a lot for the throughput if they are bottlenecked on CPU. This feature is only available in PG14, not PG13. It should be safe to always enable because it's only used for types that support binary encoding according to the PG docs: > Even when this option is enabled, only data types that have binary > send and receive functions will be transferred in binary. But in case it causes problems, it can still be disabled by setting `citus.enable_binary_protocol` to `false`.	2022-08-23 16:38:00 +02:00
aykut-bozkurt	07cfba461a	ensuring reference tables on nodes should not create colocation entry. (#6224 ) We create colocation entry in create_reference_table.	2022-08-23 16:17:59 +03:00
Jelte Fennema	cc7e93a56a	Fix flakyness in failure_connection_establishment (#6226 ) In CI our failure_connection_establishment sometimes failed randomly with the following error: ```diff -- verify a connection attempt was made to the intercepted node, this would have cause the -- connection to have been delayed and thus caused a timeout SELECT * FROM citus.dump_network_traffic() WHERE conn=0; conn \| source \| message ------+--------+--------- - 0 \| coordinator \| [initial message] -(1 row) +(0 rows) SELECT citus.mitmproxy('conn.allow()'); ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26318/workflows/d3354024-9a67-4b01-9416-5cf79aec6bd8/jobs/745558 The way I fixed this was by removing the dump_network_traffic call. This might sound simple, but doing this while continuing to let the test serve its intended purpose required quite some more changes. This dump_network_traffic call was there because we didn't want to show warnings in the queries above, because the exact warnings were not reliable. The main reason this error was not reliable was because we were using round-robin task assignment. We did the same query twice, so that it would hit the node with the intercepted connection in one of those connections. Instead of doing that I'm now using the "first-replica" policy and do the queries only once. This works, because the first placements by placementid for each of the used tables are on the second node, so first-replica will cause the first connection to go there. This solved most of the flakyness, but when confirming that the flakyness was fixed I found some additional errors: ```diff -- show that INSERT failed SELECT citus.mitmproxy('conn.allow()'); mitmproxy ----------- (1 row) SELECT count(*) FROM single_replicatated WHERE key = 100; - count ---------------------------------------------------------------------- - 0 -(1 row) - +ERROR: could not establish any connections to the node localhost:9060 after 400 ms RESET client_min_messages; ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26321/workflows/fd5f4622-400c-465e-8d82-83f5f55a87ec/jobs/745666 I addressed this with a combination of two things: 1. Only change citus.node_connection_timeout for the queries that we want to test timeout behaviour for. When those queries are done I reset the value to the default again. 2. Change our mitm framework to only delay the initial connection packet instead of all packets. I think sometimes a follow on packet of a previous connection attempt was causing the next connection attempt to be delayed even if `conn.allow()` was already called. For our tests we only care about connection timeouts, so there's no reason to delay any other packets than the initial connection packet. Then there was some last flakyness in the exact error that was given: ```diff -- tests for connectivity checks SELECT name FROM r1 WHERE id = 2; WARNING: could not establish any connections to the node localhost:9060 after 900 ms +WARNING: connection to the remote node localhost:9060 failed with the following error: name ------ bar (1 row) ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/26338/workflows/9610941c-4d01-4f62-84dc-b91abc56c252/jobs/746467 I don't have a good explaination for this slight change in error message, but given that it is missing the actual error message I expected this to be related to some small difference in timing: e.g. the server responding to the connection attempt right after the coordinator determined that the connection timed out. To solve this last flakyness I increased the connection timeouts and made the difference between the timeout and the delay a bit bigger. With these tweaks I wasn't able to reproduce this error on CI anymore. Finally, I made most of the same changes to failure_failover_to_local_execution, since it was using the `conn.delay()` mitm method too. The only change that I left out was the timing increase, since it might not be strictly necessary and increases time it takes to run the test. If this test ever becomes flaky the first thing we should try is increase its timeout.	2022-08-23 15:04:20 +03:00
Jelte Fennema	506c16efdf	Fix flakyness in failure_single_select (#6223 ) The failure_single_select test would sometimes fail with an error that's similar to this: ```diff -- cancel after first SELECT; txn should fail and nothing should be marked as invalid SELECT citus.mitmproxy('conn.onQuery(query="^SELECT").cancel(' \|\| pg_backend_pid() \|\| ')'); - mitmproxy ---------------------------------------------------------------------- - -(1 row) - +ERROR: canceling statement due to user request +CONTEXT: COPY mitmproxy_result, line 1: "" +SQL statement "COPY mitmproxy_result FROM '/home/circleci/project/src/test/regress/tmp_check/mitmproxy.fifo'" +PL/pgSQL function citus.mitmproxy(text) line 11 at EXECUTE BEGIN; ``` This error looked very to the one from #6217 and indeed the cause turned out to be similar. Because we were canceling all SELECT queries, we would actually sometimes cancel our mitmproxy SELECT queries itself. This puts some additional restrictions on the queries that we cancel, most importantly it should contain the name of the table that we're selecting from. I was able to reproduce the original issue locally pretty reliably. With the changes in this PR it didn't happen again. In passing this also changes one other failure test that was cancelling all selects and puts similar additional restrictions on those cancellations. Example of failed test in CI: https://app.circleci.com/pipelines/github/citusdata/citus/26305/workflows/4d942b91-f83c-453c-8d9a-ae22d608e756/jobs/745071	2022-08-22 20:06:33 +02:00
Hanefi Onaldi	616b1758c2	Add more normalization rules	2022-08-22 17:16:52 +03:00
Hanefi Onaldi	e33ba7da9e	Decrease min messages for normalization	2022-08-22 17:16:52 +03:00
Hanefi Onaldi	9ec9209fd9	Bump PG versions in CI configs	2022-08-22 17:16:52 +03:00
Marco Slot	639588bee0	Remove unused functions (#6220 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-08-22 11:53:25 +03:00
Jelte Fennema	e2a24b921e	Fix flakyness in failure_create_distributed_table_non_empty (#6217 ) The failure_create_distributed_table_non_empty test would sometimes fail like this: ```diff -- in the first test, cancel the first connection we sent from the coordinator SELECT citus.mitmproxy('conn.cancel(' \|\| pg_backend_pid() \|\| ')'); - mitmproxy ---------------------------------------------------------------------- - -(1 row) - +ERROR: canceling statement due to user request +CONTEXT: COPY mitmproxy_result, line 1: "" +SQL statement "COPY mitmproxy_result FROM '/home/circleci/project/src/test/regress/tmp_check/mitmproxy.fifo'" +PL/pgSQL function citus.mitmproxy(text) line 11 at EXECUTE SELECT create_distributed_table('test_table', 'id'); ``` Because the cancel command had no filter it would actually sometimes cancel the mitmproxy cancel command itself. This PR addresses that by filtering on CREATE TABLE, which is one of the command that create_distributed_table will send to the workers. Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26252/workflows/1b7e5464-cca4-4ec1-99b3-48ddf25c29fa/jobs/742829	2022-08-20 01:23:25 +03:00
Jelte Fennema	4ce17f015b	Fix flakyness in columnar_memory test (#6216 ) Sometimes in CI the columnar_memory test was using slightly more memory than expected. ```diff SELECT CASE WHEN 1.0 * TopMemoryContext / :top_post BETWEEN 0.98 AND 1.02 THEN 1 ELSE 1.0 * TopMemoryContext / :top_post END AS top_growth FROM columnar_test_helpers.columnar_store_memory_stats(); --[ RECORD 1 ]- -top_growth \| 1 +-[ RECORD 1 ]------------------ +top_growth \| 1.0206132116232119 -- before this change, max mem usage while executing inserts was 28MB and ``` This PR changes the expectation to be slightly higher, such that this random increase in memory usage doesn't cause a flaky test. Failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26256/workflows/c0870f66-3346-4f8d-a1d3-36dfd7c98289/jobs/743028	2022-08-19 23:46:28 +02:00
Jelte Fennema	de475feb69	Actually connect to the right database in logical_replication test (#6211 ) In the logical_replication test we test that the cleanup logic at the start of a shard move works as expected. To do so we create a subscription and publication slot manually. This changes the test to make that subscription actually connect to the database that the publication is in. Useful for #5987 #6085	2022-08-20 00:09:50 +03:00
Jelte Fennema	dfa6c26d7d	Increase isolation timeout because of shards splits (#6213 ) Recently isolation tests involving shard splits have been randomly failing in CI with timeouts. It's possible that there's an actual bug here, but it's also quite likely that our timeout is just slightly too low for the combination of shard splits and the CI VM having a bad day. Increasing the timeout is fairly low cost and allows us to find out if there's an actual bug or if its simply slowness. So that's what this PR does. If it turns out to be an actual bug, we can decrease the timeout again when we fix it. Examples of failed tests: 1. https://app.circleci.com/pipelines/github/citusdata/citus/26241/workflows/9e0bb721-d798-481b-907c-914236b63e38/jobs/742409 2. https://app.circleci.com/pipelines/github/citusdata/citus/26171/workflows/8f352e3b-e6e4-4f7f-b0d0-2543f62a0209/jobs/739470	2022-08-19 22:37:45 +03:00
Naisila Puka	9cfadd7965	Deletes unnecessary test outputs pt2 (#6214 )	2022-08-19 18:21:13 +03:00
Jelte Fennema	85305b2773	Don't run any isolation tests in parallel (#6212 ) By running isolation tests in parallel we're just asking for flaky tasks. The first test might temporarily block one of the commands in the second test, which we then detect as waiting like this: ```diff step s2-vacuum-analyze: VACUUM ANALYZE test_insert_vacuum; - + <waiting ...> step s1-commit: COMMIT; +step s2-vacuum-analyze: <... completed> ``` Debugging flaky tests is also much harder when they are run in parallel. This PR starts running all our isolation tests sequentially. The reason for opening this PR was me seeing this failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26194/workflows/ff57e2cf-8ac4-40fe-bc0c-74a7f8fecb53/jobs/740454 As well as having fixed a similar issue recently in #6122	2022-08-19 17:05:36 +02:00
Önder Kalacı	616ff2a3fe	Adjust some isolation test for the recent PG commits (#6210 ) * Adjust some isolation test for the recent PG commits In `3f32395612`, Postgres starts any isolation session with `set application_name`. However, one of the tests we had expected that it is exactly the first command in the session. The test tries to show that even if a gpid has not been assigned, we can show it in the citus_lock_waits graph. Now that, it is literally not possible to have such test as gpid would be assigned after `set application_name` command. Still, it is good to have a test where a command is blocked on the parser	2022-08-19 17:06:34 +03:00
Jelte Fennema	e6a1a86db0	Improve debugability for columnar_memory flakyness (#6203 ) Sometimes the columnar_memory test fails in CI with the following error: ```diff SELECT 1.0 * TopMemoryContext / :top_post BETWEEN 0.98 AND 1.02 AS top_growth_ok FROM columnar_test_helpers.columnar_store_memory_stats(); -[ RECORD 1 ]-+-- -top_growth_ok \| t +top_growth_ok \| f -- before this change, max mem usage while executing inserts was 28MB and ``` This is almost certainly a harmless failure that simply requires bumping the margin a little bit. However, it's impossible to say with the current output. I was unable to reproduce this on-demand on my local machine or even in CI. So this changes the test to include the actual value difference in the size of TopMemoryContext when it's outside the expected range. Then next time it fails we at least have some information about why. Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/25966/workflows/d472a57b-419a-4f33-b8bc-2e174a98d4d6/jobs/730576	2022-08-19 15:41:16 +02:00
Jelte Fennema	3f4440ff69	Improve debugability of failures in isolation_ref2ref_foreign_keys (#6197 ) As shown in #6196 the output of s1-view-locks is sometimes not as expected. However, because it's output is very minimal it's hard to understand the reason for that. This adds some more columns and aggregates less, so we can more easily see what locks are unexpectedly held or released. In passing this also fixes the following flaky part of this test by excluding locks taken by the maintenance daemon. After running it with this more detailed output for s1-view-locks it became obvious that that was the problem here. ```diff diff -dU10 -w /home/jelte/work/citus/src/test/regress/expected/isolation_ref2ref_foreign_keys.out /home/jelte/work/citus/src/test/regress/results/isolation_ref2ref_foreign_keys.out --- /home/jelte/work/citus/src/test/regress/expected/isolation_ref2ref_foreign_keys.out.modified 2022-08-18 15:42:08.689525233 +0200 +++ /home/jelte/work/citus/src/test/regress/results/isolation_ref2ref_foreign_keys.out.modified 2022-08-18 15:42:08.729525233 +0200 @@ -288,21 +288,22 @@ step s1-view-locks: SELECT mode, count(*) FROM pg_locks WHERE locktype='advisory' GROUP BY mode ORDER BY 1, 2; mode \|count ------------------------+----- -(0 rows) +ShareUpdateExclusiveLock\| 1 +(1 row) starting permutation: s2-begin s2-insert-table-3 s1-view-locks s2-rollback s1-view-locks step s2-begin: BEGIN; step s2-insert-table-3: INSERT INTO ref_table_3 VALUES (7, 5); step s1-view-locks: ```	2022-08-19 15:12:09 +02:00
Jelte Fennema	25e5cf2e50	Fix flakyness in failure_setup (#6205 ) In CI sometimes failure_setup will fail with the following error: ```diff SELECT master_add_node('localhost', :worker_2_proxy_port); -- an mitmproxy which forwards to the second worker - master_add_node ---------------------------------------------------------------------- - 2 -(1 row) - +ERROR: connection to the remote node localhost:9060 failed with the following error: could not connect to server: Connection refused + Is the server running on host "localhost" (127.0.0.1) and accepting + TCP/IP connections on port 9060? +could not connect to server: Connection refused + Is the server running on host "localhost" (127.0.0.1) and accepting + TCP/IP connections on port 9060? +could not connect to server: Cannot assign requested address + Is the server running on host "localhost" (::1) and accepting + TCP/IP connections on port 9060? diff -dU10 -w /home/circleci/project/src/test/regress/expected/failure_online_move_shard_placement.out /home/circleci/project/src/test/regress/results/failure_online_move_shard_placement.out ``` This then breaks all the tests run after it as well, because we're missing one worker node. Locally I was able to reproduce this error by sleeping for 10 seconds in the forked process sleep before actually starting mitmproxy. So I'm expecting what's happening in CI is that due to limited resources, mitmproxy is not up yet when we try to add its port as a workernode. This PR fixes this by waiting until mitmproxy is listening on its socket before actually starting to run our tests. This fixed it locally for me when I made the forked process sleep for 10 seconds before starting mitmproxy. In passing it also improves the detection and errors that we already had for the case where something was already listening on the mitmproxy port. Because both @gledis69 and me were changing things in our CI images at the same time this also includes a bump of the style checker tools. Closes #6200	2022-08-19 13:03:08 +00:00
Jelte Fennema	3fadb98380	Fix compilation warning on PG13 + OpenSSL 3.0 (#6038 ) This removes some warnings that are present when building on Ubuntu 22.04. It removes warnings on PG13 + OpenSSL 3.0. OpenSSL 3.0 has marked some functions that we use as deprecated, but we want to continue support OpenSSL 1.0.1 for the time being too. This indicates that to OpenSSL 3.0, so it doesn't show warnings.	2022-08-19 05:51:47 -07:00
Jelte Fennema	fe1668e43f	Fix flakyness in multi_utilities (#6204 ) Sometimes this multi_utilities would fail with the following error: ```diff SET citus.log_remote_commands TO ON; -- should propagate to all workers because no table is specified ANALYZE; NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07'); DETAIL: on server postgres@localhost:57637 connectionId: 1 NOTICE: issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07'); DETAIL: on server postgres@localhost:57638 connectionId: 2 NOTICE: issuing SET citus.enable_ddl_propagation TO 'off' DETAIL: on server postgres@localhost:57637 connectionId: 1 -NOTICE: issuing SET citus.enable_ddl_propagation TO 'off' -DETAIL: on server postgres@localhost:xxxxx connectionId: xxxxxxx NOTICE: issuing ANALYZE DETAIL: on server postgres@localhost:57637 connectionId: 1 +NOTICE: issuing SET citus.enable_ddl_propagation TO 'off' +DETAIL: on server postgres@localhost:57638 connectionId: 2 NOTICE: issuing ANALYZE DETAIL: on server postgres@localhost:57638 connectionId: 2 ``` This is simply a harmless change in output due to some timing differences. This PR makes the test output consistent by only logging the remote ANALYZE commands, not the SET commands.	2022-08-19 12:38:55 +02:00
Marco Slot	5160cafa82	Do not propagate GRANT ON SCHEMA from CREATE EXTENSION (#6175 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-08-19 13:23:47 +03:00
Jelte Fennema	8ce12eb51f	Fix flakyness in failure_insert_select_repartition (#6202 ) This fixes our most commonly randomly failing failure test. The failing diff is as follows: ```diff SELECT citus.mitmproxy('conn.onQuery(query="fetch_intermediate_results").kill()'); mitmproxy ----------- (1 row) INSERT INTO target_table SELECT * FROM source_table; -ERROR: connection to the remote node localhost:xxxxx failed with the following error: connection not open +ERROR: could not open file "base/pgsql_job_cache/10_0_40/repartitioned_results_20770193413_from_4213590_to_1.data": No such file or directory +CONTEXT: while executing command on localhost:9060 +while executing command on localhost:57637 SELECT * FROM target_table ORDER BY a; ``` As far as I can tell this is the cause of a race condition: After killing fetch_intermediate_results on worker 9060, the previously created data file gets cleaned up. The fetch_intermediate_results call that's sent to worker 57637 will be cancelled and rolled back soon because of the failure on the other connection. But if that fetch_intermediate_results call is able to connect to 9060 before it is cancelled, it won't find the file it's looking for there anymore. So while it's not the error we expect, it does indicate that we succeeded. To avoid this issue instead of killing the fetch_intermediate_results call directly, we kill the COPY command that it uses to do the fetch. This results in stable output as can be seen here, where 227 runs of failure_insert_select_repartition succeeded: https://app.circleci.com/pipelines/github/citusdata/citus/26168/workflows/9c64a3b6-f46c-4725-9fb4-8f6a2d00a023/jobs/739389 To be clear this changes the test to affects the opposite fetch_intermediate_results call. This kills the fetch_intermediate_results call of worker 57637, instead of killing the fetch_intermediate_results call on worker 9060. Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26147/workflows/780e95ea-264a-4c9f-ad2e-cf11449a795e/jobs/738467	2022-08-19 09:11:07 +00:00
Naisila Puka	5a9fdc221b	Add explicit alias to avoid debug output diff in pg15 (#6183 )	2022-08-19 11:39:18 +03:00
Jelte Fennema	31faa88a4e	Track rebalance progress at the shard move level (#6187 ) We're in the processes of totally changing the shard rebalancer experience and infrastructure. Soon the shard rebalancer will include retries, crash recovery and support for running in the background. These improvements come at a cost though, the way the get_rebalance_progress UDF currently works is very hard to replicate with this new structure. This is mostly because the old behaviour doesn't really make sense anymore with this new infrastructure. A new and better way to track the progress will be included as part of the new infrastructure. This PR is in preparation of the new code rebalancer experience. It changes the get_rebalance_progress UDF to only display the moves that are in progress at the moment, not the ones that happened in the past or that are planned in the future. Another option would have been to completely remove the current get_rebalance_progress functionality and point people to the new way of tracking progress. But old blogposts still reference the old UDF and users might have some automation on top of it. Showing the progress of the current moves is fairly simple to achieve, even with the new infrastructure. So this PR is a kind of compromise: It doesn't have complete feature parity with the old get_rebalance_progress, but the most common use cases will still work. There's also an advantage of the change: You can now see progress of shard moves that were triggered by calling citus_move_shard_placement manually. Instead of only being able to see progress of moves that were initiated using get_rebalance_table_shards.	2022-08-18 18:57:04 +02:00
Önder Kalacı	961fcff5db	Properly add / remove coordinator for isolation tests (#6181 ) We used to rely on a seperate session to add the coordinator. However, that might prevent the existing sessions to get assigned proper gpids, which causes flaky tests.	2022-08-18 17:32:12 +03:00
Jelte Fennema	7dca028391	Fix flakyness in isolation_reference_table (#6193 ) The newly introduced isolation_reference_table test had some flakyness, because the assumption on how the arbitrary reference table gets chosen was incorrect. This introduces a VACUUM FULL at the start of the test to ensure the assumption actually holds. Example of failed test: https://app.circleci.com/pipelines/github/citusdata/citus/26108/workflows/0a5cd526-006b-423e-8b67-7411b9c6be36/jobs/736802	2022-08-18 15:47:28 +03:00
Jelte Fennema	0a045afd3a	Fix flakyness in columnar_first_row_number test (#6192 ) When running columnar_first_row_number in parallel with the columnar_query test sometimes it would fail. This bug is tracked in #6191. For now to make CI less flaky we simply don't run these tests in parallel. Example of failed test: https://app.circleci.com/pipelines/github/citusdata/citus/26106/workflows/75d00ea9-23f8-4bff-a927-bced19e1f81b/jobs/736713 Fixes #6184	2022-08-18 15:32:57 +03:00
Jelte Fennema	d16b458e2a	Remove the flaky rollback_to_savepoint test (#6190 ) This removes a flaky test that I introduced in #3868 after I fixed the issue described in #3622. This test is sometimes fails randomly in CI. The way it fails indicates that there might be some bug: A connection breaks after rolling back to a savepoint. I tried reproducing this issue locally, but I wasn't able to. I don't understand what causes the failure. Things that I tried were: 1. Running the test with: ```sql SET citus.force_max_query_parallelization = true; ``` 2. Running the test with: ```sql SET citus.max_adaptive_executor_pool_size = 1; ``` 3. Running the test in parallel with the same tests that it is run in parallel with in multi_schedule. None of these allowed me to reproduce the issue locally. So I think it's time to give on fixing this test and simply remove the test. The regression that this test protects against seems very unlikely to reappear, since in #3868 I also added a big comment about the need for the newly added `UnclaimConnection` call. So, I think the need for the test is quite small, and removing it will make our CI less flaky. In case the cause of the bug ever gets found, I tracked the bug in #6189 Example of a failing CI run: https://app.circleci.com/pipelines/github/citusdata/citus/26098/workflows/f84741d9-13b1-4ae7-9155-c21ed3466951/jobs/736424 For reference the unexpected diff is this (so both warnings and an error): ```diff INSERT INTO t SELECT i FROM generate_series(1, 100) i; +WARNING: connection to the remote node localhost:57638 failed with the following error: +WARNING: +CONTEXT: while executing command on localhost:57638 +ERROR: connection to the remote node localhost:57638 failed with the following error: ROLLBACK; ``` This test is also mentioned as the most failing regression test in #5975	2022-08-18 15:14:16 +03:00
Onder Kalaci	9ec8e627c1	Support Sequences owned by columns before distributing tables There are 3 different ways that a sequence can be interacting with tables. (1) and (2) are already supported. This commit adds support for (3). (1) column DEFAULT nextval('seq'): The dependency is roughly like below, and ExpandCitusSupportedTypes() is responsible for finding the depending sequences. schema <--- table <--- column <---- default value ^ \| \|------------------ sequence <--------\| (2) serial columns: Bigserial/small serial etc: The dependency is roughly like below, and ExpandCitusSupportedTypes() is responsible for finding the depending sequences. schema <--- table <--- column <---- default value ^ \| \| \| sequence <--------\| (3) Sequence OWNED BY table.column: Added support for this type of resolution in this commit. The dependency is almost like the following, and ExpandCitusSupportedTypes() is NOT responsible for finding the dependency. schema <--- table <--- column ^ \| sequence	2022-08-18 10:29:40 +02:00
Naisila Puka	69ffdbf0e3	Uses object name in cannot distribute object error (#6186 ) Object type ids have changed in PG15 because of at least two added objects in the list: OBJECT_PARAMETER_ACL, OBJECT_PUBLICATION_NAMESPACE To avoid different output between pg versions, let's use the object name in the error, and put the object id in the error detail. Relevant PG commits: a0ffa885e478f5eeacc4e250e35ce25a4740c487 5a2832465fd8984d089e8c44c094e6900d987fcd	2022-08-18 11:05:17 +03:00
Ying Xu	91473635db	[Columnar] Check for existence of Citus before creating Citus_Columnar (#6178 ) * Added a check to see if Citus has already been loaded before creating citus_columnar * added tests	2022-08-17 15:12:42 -07:00
Nils Dijk	a9d47a96f6	Fix reference table lock contention (#6173 ) DESCRIPTION: Fix reference table lock contention Dropping and creating reference tables unintentionally blocked on each other due to the use of an ExclusiveLock for both the Drop and conditionally copying existing reference tables to (new) nodes. The patch does the following: - Lower lock lever for dropping (reference) tables to `ShareLock` so they don't self conflict - Treat reference tables and distributed tables equally and acquire the colocation lock when dropping any table that is in a colocation group - Perform the precondition check for copying reference tables twice, first time with a lower lock that doesn't conflict with anything. Could have been a NoLock, however, in preparation for dropping a colocation group, it is an `AccessShareLock` During normal operation the first check will always pass and we don't have to escalate that lock. Making it that we won't be blocked on adding and remove reference tables. Only after a node addition the first `create_reference_table` will still need to acquire an `ExclusiveLock` on the colocation group to perform the copy.	2022-08-17 18:19:28 +02:00
Ahmet Gedemenli	0631e1998b	Fix upgrade paths for #6100 (#6176 ) * Fix upgrade paths for #6100 Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>	2022-08-17 18:56:53 +03:00
Naisila Puka	20a0e0ed39	Grant create on public to some users where necessary (for PG15) (#6180 )	2022-08-17 17:35:10 +03:00
Jelte Fennema	3f6ce889eb	Use CreateSimpleHash (and variants) whenever possible (#6177 ) This is a refactoring PR that starts using our new hash table creation helper function. It adds a few more macros for ease of use, because C doesn't have default arguments. It also adds a macro to check if a struct contains automatic padding bytes. No struct that is hashed using tag_hash should have automatic padding bytes, because those bytes are undefined and thus using them to create a hash will result in undefined behaviour (usually a random hash).	2022-08-17 13:01:59 +03:00
aykut-bozkurt	52efe08642	default mode for shard splitting is set to auto. (#6179 )	2022-08-17 12:18:47 +03:00
aykut-bozkurt	be06d65721	Nonblocking tenant isolation is supported by using split api. (#6167 )	2022-08-17 11:13:07 +03:00
Jelte Fennema	78a5013e24	Support changing CPU priorities for backends and shard moves (#6126 ) Intro This adds support to Citus to change the CPU priority values of backends. This is created with two main usecases in mind: 1. Users might want to run the logical replication part of the shard moves or shard splits at a higher speed than they would do by themselves. This might cause some small loss of DB performance for their regular queries, but this is often worth it. During high load it's very possible that the logical replication WAL sender is not able to keep up with the WAL that is generated. This is especially a big problem when the machine is close to running out of disk when doing a rebalance. 2. Users might have certain long running queries that they don't impact their regular workload too much. Be very careful!!! Using CPU priorities to control scheduling can be helpful in some cases to control which processes are getting more CPU time than others. However, due to an issue called "[priority inversion][1]" it's possible that using CPU priorities together with the many locks that are used within Postgres cause the exact opposite behavior of what you intended. This is why this PR only allows the PG superuser to change the CPU priority of its own processes. Currently it's not recommended to set `citus.cpu_priority` directly. Currently the only recommended interface for users is the setting called `citus.cpu_priority_for_logical_replication_senders`. This setting controls CPU priority for a very limited set of processes (the logical replication senders). So, the dangers of priority inversion are also limited with when using it for this usecase. Background Before reading the rest it's important to understand some basic background regarding process CPU priorities, because they are a bit counter intuitive. A lower priority value, means that the process will be scheduled more and whatever it's doing will thus complete faster. The default priority for processes is 0. Valid values are from -20 to 19 inclusive. On Linux a larger difference between values of two processes will result in a bigger difference in percentage of scheduling. Handling the usecases Usecase 1 can be achieved by setting `citus.cpu_priority_for_logical_replication_senders` to the priority value that you want it to have. It's necessary to set this both on the workers and the coordinator. Example: ``` citus.cpu_priority_for_logical_replication_senders = -10 ``` Usecase 2 can with this PR be achieved by running the following as superuser. Note that this is only possible as superuser currently due to the dangers mentioned in the "Be very carefull!!!" section. And although this is possible it's NOT recommended: ```sql ALTER USER background_job_user SET citus.cpu_priority = 5; ``` OS configuration To actually make these settings work well it's important to run Postgres with more a more permissive value for the 'nice' resource limit than Linux will do by default. By default Linux will not allow a process to set its priority lower than it currently is, even if it was lower when the process originally started. This capability is necessary to reset the CPU priority to its original value after a transaction finishes. Depending on how you run Postgres this needs to be done in one of two ways: If you use systemd to start Postgres all you have to do is add a line like this to the systemd service file: ```conf LimitNice=+0 # the + is important, otherwise its interpreted incorrectly as 20 ``` If that's not the case you'll have to configure `/etc/security/limits.conf` like so, assuming that you are running Postgres as the `postgres` OS user: ``` postgres soft nice 0 postgres hard nice 0 ``` Finally you'd have add the following line to `/etc/pam.d/common-session` ``` session required pam_limits.so ``` These settings would allow to change the priority back after setting it to a higher value. However, to actually allow you to set priorities even lower than the default priority value you would need to change the values in the config to something lower than 0. So for example: ```conf LimitNice=-10 ``` or ``` postgres soft nice -10 postgres hard nice -10 ``` If you use WSL2 you'll likely have to do another thing. You have to open a new shell, because when PAM is only used during login, and WSL2 doesn't actually log you in. You can force a login like this: ``` sudo su $USER --shell /bin/bash ``` Source: https://stackoverflow.com/a/68322992/2570866 [1]: https://en.wikipedia.org/wiki/Priority_inversion	2022-08-16 13:07:17 +03:00
Jelte Fennema	1a01c896f0	Fix description of citus.distributed_deadlock_detection_factor (#5860 ) The long description of the `citus.distributed_deadlock_detection_factor` setting was incorrectly stating that 1000 would disable it. Instead -1 is the value that disables distributed deadlock detection.	2022-08-16 01:19:49 +03:00
Jelte Fennema	43c2a1e88b	Share more code between splits and moves (#6152 ) When introducing non-blocking shard split functionality it was based heavily on the non-blocking shard moves. However, differences between usage was slightly to big to be able to reuse the existing functions easily. So, most logical replication code was simply copied to dedicated shard split functions and modified for that purpose. This PR tries to create a more generic logical replication infrastructure that can be used by both shard splits and shard moves. There's probably more code sharing possible in the future, but I believe this is at least a good start and addresses the lowest hanging fruit. This also adds a CreateSimpleHash function that makes creating the most common type of hashmap common.	2022-08-15 20:21:51 +03:00
Marco Slot	6c73576606	Fix HTAB memory leaks	2022-08-15 16:10:24 +02:00
yxu2162	e1322ec905	Change for PG15 test because hash_mem_multiplier was changed to 2 as a default instead of 1 which was what PG13/14 have	2022-08-11 09:49:56 -07:00
Teja Mupparti	e962113c63	Remove the GUC mention in the error message as this config is meant for advanced users	2022-08-11 09:43:14 -07:00
Önder Kalacı	627feb6326	Merge branch 'main' into add_missing_schema	2022-08-11 13:02:50 +02:00
aykut-bozkurt	ccf1e0f584	Pg vanilla tests can be run with citus created. (#6018 )	2022-08-11 12:53:22 +03:00
Önder Kalacı	73fcbdf12c	Merge branch 'main' into add_missing_schema	2022-08-11 11:28:41 +02:00
Jelte Fennema	fd07cc9baf	Fix flakyness in create index concurrently isolation tests (#6158 ) This creates consistent test output for isolation tests that involve `CREATE INDEX CONCURRENTLY`. `CREATE INDEX CONCURRENTLY` is sometimes temporarily detected as blocking, even though it will complete without any other queries needing to be run. This change makes sure that we wait until that happens without running any other queries in the meantime. This way we always get consistent output. The way we do that is addressed by using an empty step in the same session as the `CREATE INDEX CONCURRENLTY` command. Doing so forces the isolation tester to wait until the command is finished and not continue with steps from other sessions. This is [the recommended approach by Postgres][1]. There's two separate cases which are addressed in slightly different ways: 1. If `CREATE INDEX CONCURRENTLY` is actually blocked on another session: Add an empty step right after the commit of blocking session. e.g. `"s2-ddl-create-index-concurrently" "s1-commit" "s2-empty"` 2. If it's not actually blocked on another session: Add [an asterisk marker][2] to make it look like it's blocked (because sometimes this happens randomly) and right after that we add an empty step to trigger waiting. e.g. `"s2-ddl-create-index-concurrently"(*) "s2-empty" "s1-commit"` In passing this also enables isolation tests that were disabled due to a bug that has already been fixed for a while. Fixes #5993 Related to #5910 and #2966 [1]: `5f0adec253/src/test/isolation/README (L197-L204)` [2]: `5f0adec253/src/test/isolation/README (L174-L179)` Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>	2022-08-11 10:29:11 +02:00
aykut-bozkurt	898801504e	sysid should be parsed as int. (#6150 )	2022-08-11 10:44:46 +03:00
Hanefi Onaldi	294400b2eb	Fix typos in tests that fail on PG15	2022-08-10 22:45:28 +03:00
Onder Kalaci	00ce7235cb	Set missing search_path in the tests On PG 15, public schema requires explicit GRANT, so lets avoid the conflict helpful for #6085	2022-08-10 18:04:10 +02:00
Onder Kalaci	44947d5634	This is not supported in PG15 so fix earlier	2022-08-10 17:44:03 +02:00
naisila	ea209bd11d	Rename remaining regclass to relation in columnar.options	2022-08-10 15:38:53 +02:00
aykut-bozkurt	166272963a	log NOTICE createdb only if EnableUnsupportedFeatureMessages GUC is enabled. (#6151 )	2022-08-09 21:21:22 +03:00
aykut-bozkurt	cc694b6bcf	we consider stat object as invalid if it is not owned by current user (#6130 )	2022-08-09 20:59:30 +03:00
Hanefi Onaldi	6ef96ac560	Use client side \copy when accessing test files	2022-08-09 15:00:42 +03:00
Hanefi Onaldi	a58523f1d8	Remove all references to .source files	2022-08-09 14:15:52 +03:00
Hanefi Onaldi	9f52fa7610	Remove dynamic translation of regression test scripts, step 2. This commit is inspired by a commit dc9c3b0ff21465fa89d71eecf5e6cc956d647eca from PostgreSQL 15 that shares the same header. I also removed some gitignore rules so that I can add some files to git worktree. We used to ignore the generated files, that are no longer generated after this commit. -------------------- Below is the commit message from PostgreSQL 15 commit dc9c3b0ff21465fa89d71eecf5e6cc956d647eca : "git mv" all the input/.source and output/.source files into the corresponding sql/ and expected/ directories. Then remove the pg_regress and Makefile infrastructure associated with dynamic translation. Discussion: https://postgr.es/m/1655733.1639871614@sss.pgh.pa.us	2022-08-09 14:15:52 +03:00
Hanefi Onaldi	b6bd9ab87b	Remove dynamic translation of regression test scripts, step 1. This commit is inspired by a commit d1029bb5a26cb84b116b0dee4dde312291359f2a from PostgreSQL 15 that shares the same header. -------------------- Below is the commit message from PostgreSQL 15 commit d1029bb5a26cb84b116b0dee4dde312291359f2a : pg_regress has long had provisions for dynamically substituting path names into regression test scripts and result files, but use of that feature has always been a serious pain in the neck, mainly because updating the result files requires tedious manual editing. Let's get rid of that in favor of passing down the paths in environment variables. In addition to being easier to maintain, this way is capable of dealing with path names that require escaping at runtime, for example paths containing single-quote marks. (There are other stumbling blocks in the way of actually building in a path that looks like that, but removing this one seems like a good thing to do.) The key coding rule that makes that possible is to concatenate pieces of a dynamically-variable string using psql's \set command, and then use the :'variable' notation to quote and escape the string for the next level of interpretation. In hopes of making this change more transparent to "git blame", I've split it into two steps. This commit adds the necessary pg_regress.c support and changes all the *.source files in-place so that they no longer require any dynamic translation. The next commit will just "git mv" them into the regular sql/ and expected/ directories. Discussion: https://postgr.es/m/1655733.1639871614@sss.pgh.pa.us	2022-08-09 14:15:52 +03:00
Hanefi Onaldi	4185543910	Pass source directory in env to regression tests PostgreSQL 15 dropped usage of .source files that are used to generate .sql and .out files by replacing some placeholders with the actual values before test runs. Instead, the information is passed from pg_regress to the .sql and .out files directly via env variables. Those variables are read via \getenv psql command in relevant test files. PostgreSQL 15 commit d1029bb5a26cb84b116b0dee4dde312291359f2a introduced some changes to pg_regress binary that allowed this to happen. However this change is not backported to earlier versions of PG, and thus we come up with a similar mechanism in pg_regress_multi that works in all available PG versions.	2022-08-09 14:15:51 +03:00
Jelte Fennema	8017693b2f	Allow specifying the shard_transfer_mode when replicating reference tables (#6070 ) When using `citus.replicate_reference_tables_on_activate = off`, reference tables need to be replicated later. This can be done using the `replicate_reference_tables()` UDF. However, this function only allowed blocking replication. This changes the function to default to logical replication instead, and allows choosing any of our existing shard transfer modes.	2022-08-09 13:21:31 +03:00
Jelte Fennema	a645cb4b94	Better test failure debugging for arbitrary-configs (#5861 ) This improves debugging of arbitrary configs in two ways: 1. Enable logging of distributed deadlock detection 2. Show output of `psql` commands	2022-08-09 12:25:20 +03:00
Marco Slot	3b57ff2867	Fix crash in citus_copy_shard_placement	2022-08-09 09:31:05 +02:00
naisila	796d90d293	Explain w/out costs in ch_bench to avoid PG15 output diff	2022-08-09 07:53:27 +03:00
Naisila Puka	bcbba99c96	Clean up large_table_shard_count guc leftovers (#6144 )	2022-08-09 06:31:57 +03:00
Naisila Puka	3806f6f6a9	Add ORDER BY in pg_locks to avoid output order diffs (#6145 )	2022-08-09 06:02:07 +03:00
Naisila Puka	ce944c3c0f	Remove bogus guc citus.compression (#6142 )	2022-08-09 05:21:32 +03:00
Jelte Fennema	dd548ee3c7	Use faster custom copy logic for non-blocking shard moves (#6119 ) DESCRIPTION: Use faster custom copy logic for non-blocking shard moves Non-blocking shard moves consist of two main phases: 1. Initial data copy 2. Catchup phase This changes the first of these phases significantly. Previously we used the copy logic provided by postgres subscriptions. This meant we didn't have to implement it ourselves, but it came with the downside of little control. When implementing shard splits we needed more control to even make it work, so we implemented our own logic for copying data between nodes. This PR starts using that logic for non-blocking shard moves. Doing so has four main advantages: 1. It uses COPY in binary format when possible, which is cheaper to encode and decode. Furthermore it very often results in less data that needs to be sent over the network. 2. It allows us to create the primary key (or other replica identity) after doing the initial data copy. This should give some speed up over the total run, because creating an index is bulk is much faster than incrementally building it. 3. It doesn't require a replication slot per parallel copy. Increasing the maximum number of replication slots uses resources in postgres, even if they are not used. So reducing the number of replication slots that shard moves need is nice. 4. Logical replication table_sync workers are slow to start up, so if lots of shards need to be copied that can make it quite slow. This can happen easily when combining Postgres partitioning with Citus.	2022-08-08 17:09:43 +02:00
Marco Slot	6aee8f35a6	Fix tenant isolation failure tests	2022-08-08 13:33:23 +02:00
Marco Slot	ead9d28835	Avoid deadlocks on split failure by closing connections	2022-08-08 13:33:23 +02:00
Marco Slot	044dd26e40	Reimplement tenant isolation on top of block shard split	2022-08-08 13:33:23 +02:00
Naisila Puka	3401b31c13	Deletes unnecessary test outputs (#6140 )	2022-08-08 11:19:14 +03:00
Naisila Puka	9eedf6dcf8	Reduce log level to avoid alternative output for PG15 (#6139 )	2022-08-07 16:07:58 +03:00
Teja Mupparti	430c201d03	get_current_transaction_id() UDF is not printing the timestamp of the current transaction on the coordinator even when non-null	2022-08-05 10:12:07 -07:00
Naisila Puka	73f515f651	Add another expr to ORDER BY clause for consistency (#6136 )	2022-08-05 15:42:25 +03:00
aykut-bozkurt	4992533e33	support grant statement propagation for aggregates (#6132 )	2022-08-05 14:47:33 +03:00
Ahmet Gedemenli	8b68b0b5bb	Fix pg upgrade script for foreign tables (#6100 ) Fixes unexpected error for foreign tables when upgrading pg	2022-08-05 13:35:17 +03:00
Sameer Awasekar	e236711eea	Introduce Non-Blocking Shard Split Workflow	2022-08-04 16:32:38 +02:00
aykut-bozkurt	b67abdd28c	we should not log error in preprocess if attached partition is missing. (#6131 )	2022-08-04 15:49:14 +03:00
Naisila Puka	a1c630a16e	Reduce shard_count to reduce drain_node execution time (#6128 ) master_drain_node in distributed_triggers.sql test file takes too long to execute. It is directly dependent on the shard count. Hence I reduced shard count from 32 to 4 (default in tests), since this doesn't affect the validity of the tests.	2022-08-04 15:34:13 +03:00
aykut-bozkurt	3ddc089651	stop distributing views with no distributed dependency if GUC DistributeLocalViews is set false. (#6083 )	2022-08-04 12:34:40 +03:00
aykut-bozkurt	4ffe436bf9	we validate constraint as well if the statement is alter domain drop constraint (#6125 )	2022-08-03 23:06:33 +03:00
Jelte Fennema	dff71abc32	Fix flakyness in isolation_data_migration.spec (#6122 ) The tests isolation_concurrent_dml and isolation_data_migration tests were being run in parallel, but they were interfering with each others output. Sometimes queries from isolation_concurrent_dml were blocking create_distributed_table in isolation_data_migration: 1. https://app.circleci.com/pipelines/github/citusdata/citus/25562/workflows/f9d0a6ff-bb7a-4b71-9fcf-1a3e46d54425/jobs/713270 2. https://app.circleci.com/pipelines/github/citusdata/citus/25562/workflows/1e22454c-1623-48a7-97fb-c6803c7959c7/jobs/713223 3. https://app.circleci.com/pipelines/github/citusdata/citus/25562/workflows/618c419e-eefb-4582-9482-322dbb9ac96d/jobs/713110 This fixes it changing the schedule to not run these tests in parallel.	2022-08-03 17:56:49 +03:00
aykut-bozkurt	a662331668	qualify text dict and conf respect missingok (#6120 )	2022-08-03 13:13:53 +03:00
Jelte Fennema	8bbc1a45e1	Fix flakyness in isolation_replicate_reference_tables_to_coordinator.spec (#6123 ) When the deadlock detector kills s2-update-dist-table both sessions finish at the same time. The order in which they are displayed can be swapped. To counteract this we start using the ["marker" feature][1] of the isolationtester framework to create consistent output. In passing this also sets the next_shard_id to the expected value by this test so it can be run using `make check-isolation-base`. Failed CI test: https://app.circleci.com/pipelines/github/citusdata/citus/25562/workflows/dfe6f88a-c306-4d91-b771-d5d1deb1798d/jobs/713417 [1]: `ec62ce55a8/src/test/isolation/README (L152)`	2022-08-03 12:00:30 +02:00
aykutbozkurt	7387c7ed3d	address method should take parameter isPostprocess	2022-08-02 21:00:23 +03:00
aykutbozkurt	c98a68662a	introduces operation type for dist ops	2022-08-02 20:42:32 +03:00
aykutbozkurt	57ce4cf8c4	use address method to decide if we should run preprocess and postprocess steps for a distributed object	2022-08-02 20:42:32 +03:00
Jelte Fennema	8866d9ac32	Reduce setup time of check-minimal and check-minimal-mx (#6117 ) This change reduces the setup time of our minimal schedules in two ways: 1. Don't run `multi_cluster_managament`, but instead run a much smaller sql file with almost the same results. `multi_cluster_management` adds and removes lots of nodes and tests all kinds of failure scenarios. This is not needed for the minimal schedules. The only reason we were using it there was to get a working cluster of the layout that the tests expected. The new `minimal_cluster_management` test achieves this with much less work, going from ~2s to ~0.5s. 2. Parallelize a bit more of the helper tests.	2022-08-02 17:58:59 +03:00
Naisila Puka	28e22c4abf	Reduce log level to avoid alternative output for PG15 (#6118 ) We are reducing the log level here to avoid alternative test output in PG15 because of the change in the display of SQL-standard function's arguments in INSERT/SELECT in PG15. The log level changes can be reverted when we drop support for PG14 Relevant PG commit: a8d8445a7b2f80f6d0bfe97b19f90bd2cbef8759	2022-08-02 11:56:28 +03:00
Onder Kalaci	c7b51025ab	Add missing ifdef for PG 15	2022-08-02 09:46:53 +02:00
Jelte Fennema	abffa6c3b9	Use shard split copy code for blocking shard moves (#6098 ) The new shard copy code that was created for shard splits has some advantages over the old shard copy code. The old code was using worker_append_table_to_shard, which wrote to disk twice. And it also didn't use binary copy when that was possible. Both of these issues were fixed in the new copy code. This PR starts using this new copy logic also for shard moves, not just for shard splits. On my local machine I created a single shard table like this. ```sql set citus.shard_count = 1; create table t(id bigint, a bigint); select create_distributed_table('t', 'id'); INSERT into t(id, a) SELECT i, i from generate_series(1, 100000000) i; ``` I then turned `fsync` off to make sure I wasn't bottlenecked by disk. Finally I moved this shard between nodes with `citus_move_shard_placement` with `block_writes`. Before this PR a move took ~127s, after this PR it took only ~38s. So for this small test this resulted in spending ~70% less time. And I also tried the same test for a table that contained large strings: ```sql set citus.shard_count = 1; create table t(id bigint, a bigint, content text); select create_distributed_table('t', 'id'); INSERT into t(id, a, content) SELECT i, i, 'aunethautnehoautnheaotnuhetnohueoutnehotnuhetncouhaeohuaeochgrhgd.athbetndairgexdbuhaobulrhdbaetoausnetohuracehousncaoehuesousnaceohuenacouhancoexdaseohusnaetobuetnoduhasneouhaceohusnaoetcuhmsnaetohuacoeuhebtokteaoshetouhsanetouhaoug.lcuahesonuthaseauhcoerhuaoecuh.lg;rcydabsnetabuesabhenth' from generate_series(1, 20000000) i; ```	2022-08-01 20:10:36 +03:00
Naisila Puka	5060d0ab17	Remove leftover PG version_above_11 checks from tests (#6112 )	2022-08-01 15:38:19 +03:00
Naisila Puka	85324f3acc	Clean up multi_shard_commit_protocol guc leftovers (#6110 )	2022-08-01 15:22:02 +03:00
Naisila Puka	f9b02946b1	Delete PG version_above_ten alternative test outputs (#6111 )	2022-08-01 14:32:36 +03:00
aykut-bozkurt	f372e93d22	we supress notice log during looking up function oid to not break pg vanilla tests. (#6082 )	2022-08-01 10:14:35 +03:00
Önder Kalacı	cbdc2b3019	Merge branch 'main' into fix_relation_acess_2	2022-07-29 16:45:02 +02:00
Marco Slot	6d6e44166f	Avoid catalog read via superuser() call in DecrementSharedConnectionCounter	2022-07-29 14:05:41 +02:00
Onder Kalaci	bdaeb40b51	Add missing relation access record for local utility command While testing `5670dffd33`, I realized that we have a missing RecordNonDistTableAccessesForTask() for local utility commands. Although we don't have to record the relation access for local only cases, we really want to keep the behaviour for scale-out be the same with single node on all aspects. We wouldn't want any single node complex transaction to work on single machine, but not on multi node cluster. Hence, we apply the same restrictions. For example, on a distributed cluster, the following errors, and after this commit this errors locally as well ```SQL CREATE TABLE ref(a int primary key); INSERT INTO ref VALUES (1); CREATE TABLE dist(a int REFERENCES ref(a)); SELECT create_reference_table('ref'); SELECT create_distributed_table('dist', 'a'); BEGIN; SELECT * FROM dist; TRUNCATE ref CASCADE; ERROR: cannot execute DDL on table "ref" because there was a parallel SELECT access to distributed table "dist" in the same transaction HINT: Try re-running the transaction with "SET LOCAL citus.multi_shard_modify_mode TO 'sequential';" COMMIT; ``` We also add the comprehensive test suite and run the same locally.	2022-07-29 11:36:33 +02:00
Onder Kalaci	149771792b	Remove useless version compats most likely leftover from earlier versions	2022-07-29 10:31:55 +02:00
Ying Xu	7c1a93b26b	Removed USE_PGXS snippet in Makefile that was blocking citus build when flag is set (#6101 ) Code snippet in Makefile was blocking Citus build when USE_PGXS flag was set. This was included for port to FSPG but is not needed for Citus engine and can be safely removed.	2022-07-28 14:15:45 -07:00
aykut-bozkurt	a218198e8f	reindex object address should return invalid addresses for unsepported object types in reindex stmt (#6096 )	2022-07-28 15:31:49 +03:00
Marco Slot	cff013a057	Fix issues with insert..select casts and column ordering	2022-07-28 13:23:57 +02:00
aykut-bozkurt	789d5b9ef9	null check for server in GetObjectAddressByServerName (#6095 )	2022-07-28 13:13:28 +03:00
Onder Kalaci	b41c3fd30d	Add tests	2022-07-28 11:27:59 +02:00
Onder Kalaci	0a5112964d	Call relation access hash clean-up irrespective of remote transaction state Mainly because local-only transactions should be cleaned up	2022-07-28 11:27:59 +02:00
Onder Kalaci	d67cf907a2	Detach relation access tracking from connection management	2022-07-28 11:27:59 +02:00
Ying Xu	fdf090758b	Bugfix for IN clause to be considered during planner phase in Columnar (#6030 ) Reported bug #5803 shows that we are currently not sending the IN clause to our planner for columnar. This PR fixes it by checking for ScalarArrayOpExpr in ExtractPushdownClause so that we do not skip it. Also added a test case for this new addition.	2022-07-27 11:06:49 -07:00
Jelte Fennema	0f50bef696	Avoid possible information leakage about existing users (#6090 )	2022-07-27 17:46:32 +02:00

1 2 3 4 5 ...

4021 Commits (ab6064d3c3210817adb31e5515ab034b0bc5fafa)