One of our arbitrary config tests would sometimes fail like this in CI:
```diff
su_nationkey,
cust_nation,
l_year;
- supp_nation | cust_nation | l_year | revenue
----------------------------------------------------------------------
- 9 | C | 2008 | 3.00
-(1 row)
-
+ERROR: cannot connect to localhost:10212 to fetch intermediate results
+CONTEXT: while executing command on localhost:10211
```
When looking at the logs it seems like we were running out of
connections:
```
2022-08-23 14:03:52.856 UTC [28122] FATAL: sorry, too many clients already
2022-08-23 14:03:52.860 UTC [21027] ERROR: cannot connect to localhost:10212 to fetch intermediate results
```
This happened with the `CitusThreeWorkersManyShards` config. This test
intentionally pushes the limits of Citus quite far, and the
`ch_benchmarks_1` test is also run in parallel with a few other tests. So
it's not too surprising that it ran out of connections. This doubles the
connection limit in the arbitrary config tests, to hopefully not hit this
error again.
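The limit in question is PostgreSQL's `max_connections` setting on the test nodes; conceptually the change amounts to something like the following sketch (the value and the way the test framework applies the setting are not shown here, so treat this purely as an illustration):
```sql
-- Illustration only: raise the server's connection limit. The value is a
-- placeholder, not the one used by the test framework, and max_connections
-- only takes effect after a server restart.
ALTER SYSTEM SET max_connections = 600;
```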
Example of failed test: https://app.circleci.com/pipelines/github/citusdata/citus/26365/workflows/7a1b5688-85cc-4bc3-ade5-9bd1d83cd0ed/jobs/747908/parallel-runs/1
(cherry picked from commit 21780b4f65)
This improves debugging of arbitrary configs in two ways:
1. Enable logging of distributed deadlock detection
2. Show output of `psql` commands
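For the first item, a minimal sketch of the GUC involved, shown as a cluster-wide setting purely for illustration (how the test harness actually enables it is not shown here):
```sql
-- Log distributed deadlock detection rounds so that deadlock-related test
-- failures leave a trace in the server logs.
ALTER SYSTEM SET citus.log_distributed_deadlock_detection TO on;
SELECT pg_reload_conf();
```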
(cherry picked from commit a645cb4b94)
In CI our failure_connection_establishment test sometimes failed randomly
with the following error:
```diff
-- verify a connection attempt was made to the intercepted node, this would have cause the
-- connection to have been delayed and thus caused a timeout
SELECT * FROM citus.dump_network_traffic() WHERE conn=0;
conn | source | message
------+--------+---------
- 0 | coordinator | [initial message]
-(1 row)
+(0 rows)
SELECT citus.mitmproxy('conn.allow()');
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26318/workflows/d3354024-9a67-4b01-9416-5cf79aec6bd8/jobs/745558
The way I fixed this was by removing the dump_network_traffic call. This
might sound simple, but doing so while continuing to let the test
serve its intended purpose required quite a few more changes.
This dump_network_traffic call was there because we didn't want to show
warnings in the queries above, since the exact warnings were not
reliable. The main reason they were not reliable was that we were using
round-robin task assignment: we ran the same query twice, so that one of
those executions would hit the node with the intercepted connection.
Instead of doing that, I'm now using the "first-replica" task assignment
policy and run the queries only once. This works because the first
placements (by placementid) of each of the used tables are on the second
node, so first-replica causes the first connection to go there.
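Concretely, the relevant setting is `citus.task_assignment_policy`; a minimal sketch of the new approach, using the `r1` table from this test:
```sql
-- With first-replica assignment the first connection deterministically goes
-- to the placement with the lowest placementid, which for these tables is on
-- the node behind the intercepting proxy. So a single execution suffices.
SET citus.task_assignment_policy TO 'first-replica';
SELECT name FROM r1 WHERE id = 2;
```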
This solved most of the flakiness, but while confirming that the
flakiness was fixed I found some additional errors:
```diff
-- show that INSERT failed
SELECT citus.mitmproxy('conn.allow()');
mitmproxy
-----------
(1 row)
SELECT count(*) FROM single_replicatated WHERE key = 100;
- count
----------------------------------------------------------------------
- 0
-(1 row)
-
+ERROR: could not establish any connections to the node localhost:9060 after 400 ms
RESET client_min_messages;
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26321/workflows/fd5f4622-400c-465e-8d82-83f5f55a87ec/jobs/745666
I addressed this with a combination of two things:
1. Only change citus.node_connection_timeout for the queries that we
want to test timeout behaviour for. When those queries are done I
reset the value to the default again (as sketched below).
2. Change our mitm framework to only delay the initial connection packet
instead of all packets. I think sometimes a follow-up packet of a previous
connection attempt was causing the next connection attempt to be delayed,
even if `conn.allow()` was already called. For our tests we only care about
connection timeouts, so there's no reason to delay any packets other than
the initial connection packet.
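For the first point, the pattern is roughly the following (the timeout value is a placeholder; the diff above happens to show 400 ms):
```sql
-- Lower the connection timeout only around the queries whose timeout
-- behaviour we are testing, then go back to the default immediately.
SET citus.node_connection_timeout TO 400;
-- ... queries that are expected to hit the connection timeout ...
RESET citus.node_connection_timeout;
```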
Then there was some last flakiness in the exact error that was given:
```diff
-- tests for connectivity checks
SELECT name FROM r1 WHERE id = 2;
WARNING: could not establish any connections to the node localhost:9060 after 900 ms
+WARNING: connection to the remote node localhost:9060 failed with the following error:
name
------
bar
(1 row)
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26338/workflows/9610941c-4d01-4f62-84dc-b91abc56c252/jobs/746467
I don't have a good explanation for this slight change in error message, but
given that it is missing the actual error message, I expect this to be related
to some small difference in timing: e.g. the server responding to the connection
attempt right after the coordinator determined that the connection timed out.
To solve this last bit of flakiness I increased the connection timeouts and made
the difference between the timeout and the delay a bit bigger. With these tweaks
I wasn't able to reproduce this error on CI anymore.
Finally, I made most of the same changes to failure_failover_to_local_execution,
since it was using the `conn.delay()` mitm method too. The only change that
I left out was the timing increase, since it might not be strictly necessary and
it increases the time it takes to run the test. If this test ever becomes flaky,
the first thing we should try is increasing its timeouts.
(cherry picked from commit cc7e93a56a)
The failure_single_select test would sometimes fail with an error that's
similar to this:
```diff
-- cancel after first SELECT; txn should fail and nothing should be marked as invalid
SELECT citus.mitmproxy('conn.onQuery(query="^SELECT").cancel(' || pg_backend_pid() || ')');
- mitmproxy
----------------------------------------------------------------------
-
-(1 row)
-
+ERROR: canceling statement due to user request
+CONTEXT: COPY mitmproxy_result, line 1: ""
+SQL statement "COPY mitmproxy_result FROM '/home/circleci/project/src/test/regress/tmp_check/mitmproxy.fifo'"
+PL/pgSQL function citus.mitmproxy(text) line 11 at EXECUTE
BEGIN;
```
This error looked very similar to the one from #6217 and indeed the cause
turned out to be similar: because we were canceling all SELECT queries, we
would sometimes cancel our mitmproxy SELECT query itself.
This change puts some additional restrictions on the queries that we cancel,
most importantly that they should contain the name of the table that we're
selecting from.
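A hedged sketch of the more restrictive cancel, with `select_test` standing in for the name of the table the test selects from (the exact regex in the test is not shown here):
```sql
-- Only cancel SELECTs that mention the test table, so the SELECT that the
-- citus.mitmproxy() helper runs internally can never match.
SELECT citus.mitmproxy('conn.onQuery(query="SELECT.*select_test").cancel(' || pg_backend_pid() || ')');
```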
I was able to reproduce the original issue locally pretty reliably. With
the changes in this PR it didn't happen again.
In passing this also changes one other failure test that was cancelling
all selects and puts similar additional restrictions on those
cancellations.
Example of failed test in CI: https://app.circleci.com/pipelines/github/citusdata/citus/26305/workflows/4d942b91-f83c-453c-8d9a-ae22d608e756/jobs/745071
(cherry picked from commit 506c16efdf)
The failure_create_distributed_table_non_empty test would sometimes fail
like this:
```diff
-- in the first test, cancel the first connection we sent from the coordinator
SELECT citus.mitmproxy('conn.cancel(' || pg_backend_pid() || ')');
- mitmproxy
----------------------------------------------------------------------
-
-(1 row)
-
+ERROR: canceling statement due to user request
+CONTEXT: COPY mitmproxy_result, line 1: ""
+SQL statement "COPY mitmproxy_result FROM '/home/circleci/project/src/test/regress/tmp_check/mitmproxy.fifo'"
+PL/pgSQL function citus.mitmproxy(text) line 11 at EXECUTE
SELECT create_distributed_table('test_table', 'id');
```
Because the cancel command had no filter, it would sometimes cancel the
mitmproxy cancel command itself. This PR addresses that by filtering on
CREATE TABLE, which is one of the commands that create_distributed_table
sends to the workers.
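In other words, the cancellation is now scoped roughly like this (a sketch based on the mitmproxy call shown above; the exact regex in the test is not shown here):
```sql
-- Only cancel once the worker receives the CREATE TABLE command that
-- create_distributed_table sends, not the mitmproxy helper's own queries.
SELECT citus.mitmproxy('conn.onQuery(query="CREATE TABLE").cancel(' || pg_backend_pid() || ')');
```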
Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26252/workflows/1b7e5464-cca4-4ec1-99b3-48ddf25c29fa/jobs/742829
(cherry picked from commit e2a24b921e)
Sometimes in CI the columnar_memory test was using slightly more memory
than expected.
```diff
SELECT CASE WHEN 1.0 * TopMemoryContext / :top_post BETWEEN 0.98 AND 1.02 THEN 1 ELSE 1.0 * TopMemoryContext / :top_post END AS top_growth
FROM columnar_test_helpers.columnar_store_memory_stats();
--[ RECORD 1 ]-
-top_growth | 1
+-[ RECORD 1 ]------------------
+top_growth | 1.0206132116232119
-- before this change, max mem usage while executing inserts was 28MB and
```
This PR changes the expectation to be slightly higher, such that this
random increase in memory usage doesn't cause a flaky test.
Failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26256/workflows/c0870f66-3346-4f8d-a1d3-36dfd7c98289/jobs/743028
(cherry picked from commit 4ce17f015b)
Sometimes the columnar_memory test fails in CI with the following error:
```diff
SELECT 1.0 * TopMemoryContext / :top_post BETWEEN 0.98 AND 1.02 AS top_growth_ok
FROM columnar_test_helpers.columnar_store_memory_stats();
-[ RECORD 1 ]-+--
-top_growth_ok | t
+top_growth_ok | f
-- before this change, max mem usage while executing inserts was 28MB and
```
This is almost certainly a harmless failure that simply requires bumping
the margin a little bit. However, it's impossible to say with the
current output. I was unable to reproduce this on-demand on my local
machine or even in CI. So this changes the test to include the actual
value difference in the size of TopMemoryContext when it's outside the
expected range. Then next time it fails we at least have some
information about why.
Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/25966/workflows/d472a57b-419a-4f33-b8bc-2e174a98d4d6/jobs/730576
(cherry picked from commit e6a1a86db0)
By running isolation tests in parallel we're just asking for flaky
tests. The first test might temporarily block one of the commands in the
second test, which we then detect as waiting like this:
```diff
step s2-vacuum-analyze:
VACUUM ANALYZE test_insert_vacuum;
-
+ <waiting ...>
step s1-commit:
COMMIT;
+step s2-vacuum-analyze: <... completed>
```
Debugging flaky tests is also much harder when they are run in parallel.
This PR starts running all our isolation tests sequentially.
The reason for opening this PR was me seeing this failing test:
https://app.circleci.com/pipelines/github/citusdata/citus/26194/workflows/ff57e2cf-8ac4-40fe-bc0c-74a7f8fecb53/jobs/740454
As well as having fixed a similar issue recently in #6122
(cherry picked from commit 85305b2773)
This fixes our most commonly randomly failing failure test. The failing
diff is as follows:
```diff
SELECT citus.mitmproxy('conn.onQuery(query="fetch_intermediate_results").kill()');
mitmproxy
-----------
(1 row)
INSERT INTO target_table SELECT * FROM source_table;
-ERROR: connection to the remote node localhost:xxxxx failed with the following error: connection not open
+ERROR: could not open file "base/pgsql_job_cache/10_0_40/repartitioned_results_20770193413_from_4213590_to_1.data": No such file or directory
+CONTEXT: while executing command on localhost:9060
+while executing command on localhost:57637
SELECT * FROM target_table ORDER BY a;
```
As far as I can tell this is caused by a race condition: after killing
fetch_intermediate_results on worker 9060, the previously created data
file gets cleaned up. The fetch_intermediate_results call that's sent
to worker 57637 will be cancelled and rolled back soon because of the
failure on the other connection. But if that fetch_intermediate_results
call is able to connect to 9060 before it is cancelled, it won't find
the file it's looking for there anymore. So while it's not the error we
expect, it does indicate that we succeeded.
To avoid this issue instead of killing the fetch_intermediate_results
call directly, we kill the COPY command that it uses to do the fetch.
This results in stable output as can be seen here, where 227 runs of
failure_insert_select_repartition succeeded:
https://app.circleci.com/pipelines/github/citusdata/citus/26168/workflows/9c64a3b6-f46c-4725-9fb4-8f6a2d00a023/jobs/739389
To be clear, this changes the test to affect the opposite
fetch_intermediate_results call: it now kills the fetch_intermediate_results
call of worker 57637, instead of the one on worker 9060.
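A hedged sketch of the kind of filter this implies, assuming the fetch is done with a COPY command as described above (the exact regex used in the test is not shown here):
```sql
-- Kill the COPY that fetch_intermediate_results uses to transfer the data,
-- rather than the fetch_intermediate_results call itself.
SELECT citus.mitmproxy('conn.onQuery(query="COPY").kill()');
```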
Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26147/workflows/780e95ea-264a-4c9f-ad2e-cf11449a795e/jobs/738467
(cherry picked from commit 8ce12eb51f)
We used to rely on a separate session to add the coordinator.
However, that might prevent the existing sessions from getting
assigned proper gpids, which causes flaky tests.
(cherry picked from commit 961fcff5db)
This removes a flaky test that I introduced in #3868 after I fixed the
issue described in #3622. This test sometimes fails randomly in CI.
The way it fails indicates that there might be some bug: A connection
breaks after rolling back to a savepoint.
I tried reproducing this issue locally, but I wasn't able to. I don't
understand what causes the failure.
Things that I tried were:
1. Running the test with:
```sql
SET citus.force_max_query_parallelization = true;
```
2. Running the test with:
```sql
SET citus.max_adaptive_executor_pool_size = 1;
```
3. Running the test in parallel with the same tests that it is run in
parallel with in multi_schedule.
None of these allowed me to reproduce the issue locally.
So I think it's time to give up on fixing this test and simply remove
it. The regression that this test protects against seems very unlikely
to reappear, since in #3868 I also added a big comment about the need
for the newly added `UnclaimConnection` call. So, I think the need for
the test is quite small, and removing it will make our CI less flaky.
In case the cause of the bug ever gets found, I tracked the bug in #6189
Example of a failing CI run:
https://app.circleci.com/pipelines/github/citusdata/citus/26098/workflows/f84741d9-13b1-4ae7-9155-c21ed3466951/jobs/736424
For reference the unexpected diff is this (so both warnings and an error):
```diff
INSERT INTO t SELECT i FROM generate_series(1, 100) i;
+WARNING: connection to the remote node localhost:57638 failed with the following error:
+WARNING:
+CONTEXT: while executing command on localhost:57638
+ERROR: connection to the remote node localhost:57638 failed with the following error:
ROLLBACK;
```
This test is also mentioned as the most frequently failing regression test in #5975
(cherry picked from commit d16b458e2a)
This creates consistent test output for isolation tests that involve
`CREATE INDEX CONCURRENTLY`. `CREATE INDEX CONCURRENTLY` is sometimes
temporarily detected as blocking, even though it will complete without any other
queries needing to be run. This change makes sure that we wait until that happens
without running any other queries in the meantime. This way we always get consistent
output. We do that by using an empty step in the same
session as the `CREATE INDEX CONCURRENTLY` command. Doing so forces
the isolation tester to wait until the command is finished and not continue with
steps from other sessions. This is [the recommended approach by Postgres][1].
There are two separate cases, which are addressed in slightly different ways:
1. If `CREATE INDEX CONCURRENTLY` is actually blocked on another session: Add an
empty step right after the commit of blocking session.
e.g. `"s2-ddl-create-index-concurrently" "s1-commit" "s2-empty"`
2. If it's not actually blocked on another session: Add [an asterisk marker][2] to make
it look like it's blocked (because sometimes this happens randomly) and right
after that we add an empty step to trigger waiting.
e.g. `"s2-ddl-create-index-concurrently"(*) "s2-empty" "s1-commit"`
In passing this also enables isolation tests that were disabled due to a
bug that has already been fixed for a while.
Fixes #5993
Related to #5910 and #2966
[1]: 5f0adec253/src/test/isolation/README (L197-L204)
[2]: 5f0adec253/src/test/isolation/README (L174-L179)
Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
(cherry picked from commit fd07cc9baf)
This is a continuation of a refactor (with commit sha
2b7cf0c097) that aimed to use Citus helper
UDFs by default in iso tests.
PostgreSQL isolation test infrastructure uses some UDFs to detect
whether concurrent sessions block each other. Citus implements
alternatives to that UDF so that we are able to detect and report
distributed transactions that get blocked on the worker nodes as well.
We needed to explicitly replace PG helper functions with Citus
implementations in each isolation file. Now we replace them by default.
(cherry picked from commit ae58ca5783)
Use Citus helper UDFs by default in iso tests
PostgreSQL isolation test infrastructure uses some UDFs to detect
whether concurrent sessions block each other. Citus implements
alternatives to that UDF so that we are able to detect and report
distributed transactions that get blocked on the worker nodes as well.
We needed to explicitly replace PG helper functions with Citus
implementations in each isolation file. Now we replace them by default.
(cherry picked from commit 2b7cf0c097)
#6300/e29db74 changed the C symbol that our bigint overrides of
pg_cancel_backend and pg_terminate_backend call, so we needed to do
something to keep these functions working after downgrading.
Recreating the old definition with a downgrade script is not really
possible, since people are expected to run the downgrade steps while
using the new .so file, which does not contain the old symbols.
So, the easiest way to solve it was to also define the new symbols in our
old Citus versions. Luckily our overrides haven't existed for long, so
these symbol definitions only needed to be backported to 11.0.
* Alter_distributed_table colocateWith:none bug fix for partitioned tables.
* Regression tests added for alter_distributed_table colocateWith:none for partitioned tables
* Update query comparison to be more accurate
(cherry picked from commit 69d2fcf5c0)
DESCRIPTION: Fix reference table lock contention
Dropping and creating reference tables unintentionally blocked on each other due to the use of an ExclusiveLock for both the drop and the conditional copying of existing reference tables to (new) nodes.
The patch does the following:
- Lower the lock level for dropping (reference) tables to `ShareLock` so they don't self-conflict
- Treat reference tables and distributed tables equally and acquire the colocation lock when dropping any table that is in a colocation group
- Perform the precondition check for copying reference tables twice, the first time with a lower lock that doesn't conflict with anything. It could have been a NoLock; however, in preparation for dropping a colocation group, it is an `AccessShareLock`
During normal operation the first check will always pass and we don't have to escalate that lock, which means we won't be blocked on adding and removing reference tables. Only after a node addition will the first `create_reference_table` still need to acquire an `ExclusiveLock` on the colocation group to perform the copy.
There are 3 different ways that a sequence can interact with
tables. (1) and (2) are already supported. This commit adds
support for (3).
(1) column DEFAULT nextval('seq'):
The dependency is roughly like below,
and ExpandCitusSupportedTypes() is responsible
for finding the depending sequences.

    schema <--- table <--- column <---- default value
      ^                                      |
      |------------------ sequence <---------|

(2) serial columns (bigserial/smallserial etc.):
The dependency is roughly like below,
and ExpandCitusSupportedTypes() is responsible
for finding the depending sequences.

    schema <--- table <--- column <---- default value
      ^                                      |
      |                                      |
      sequence <-----------------------------|

(3) Sequence OWNED BY table.column: Added support for
this type of resolution in this commit.
The dependency is almost like the following, and
ExpandCitusSupportedTypes() is NOT responsible for finding
the dependency.

    schema <--- table <--- column
      ^
      |
    sequence

(cherry picked from commit 9ec8e627c1)
For some reason search_path is not always set correctly on the worker
when calling a distributed function; this shows up when calling
`insert_document` in our distributed_triggers test. The underlying
reason is currently unknown and warrants deeper investigation.
Currently this test is one of the main causes of random CI failures. So
this change sets the search_path of each function explicitly to reduce
these failures, so other devs can be more efficient while I continue
investigating the root cause of this issue.
Also changes explicit `SET citus.enable_unsafe_triggers = false` to
`RESET citus.enable_unsafe_triggers` in passing.
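A minimal sketch of the search_path workaround described above; the function body, signature, and schema name here are illustrative, the point is the explicit `SET search_path` clause on the function definition:
```sql
-- Pin the function's search_path so it no longer depends on the session-level
-- search_path on the worker (body, signature, and schema name are illustrative).
CREATE OR REPLACE FUNCTION insert_document(doc_id bigint, doc jsonb)
RETURNS void
LANGUAGE plpgsql
SET search_path TO distributed_triggers
AS $$
BEGIN
    INSERT INTO documents VALUES (doc_id, doc);
END;
$$;
```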
(cherry picked from commit 6d8c5931d6)
Reported bug #5803 shows that we are currently not sending the IN clause to our planner for columnar. This PR fixes it by checking for ScalarArrayOpExpr in ExtractPushdownClause so that we do not skip it. Also added a test case for this new addition.
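A hedged example of the kind of query that now benefits (table and values are illustrative):
```sql
CREATE TABLE events (event_id bigint, event_type text) USING columnar;
-- The IN list is a ScalarArrayOpExpr; with this fix ExtractPushdownClause no
-- longer skips it, so the filter can be pushed down to the columnar scan.
SELECT count(*) FROM events WHERE event_type IN ('push', 'fork', 'watch');
```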
It turns out that create_distributed_table
and citus_move/copy_shard_placement do not
work well concurrently.
To fix that, we need to acquire a lock, which
sounds like a good use of the colocation lock.
However, the current usage of the colocation lock is
limited to higher-level UDFs like rebalance_table_shards
etc. That usage of the lock is still useful, but
we cannot acquire the same lock in citus_move_shard_placement
etc., because the coordinator connects to itself to acquire
the lock; hence, the high-level UDF would block itself.
To fix that, we use one more colocation lock, with the placements
as the main objects to consider.
(cherry picked from commit 12fa3aaf6b)
Before this commit, we required multiple copies of the
same stringInfo if we needed to append/prepend data to
the stringInfo. Now, we optionally take a prefix/postfix.
For large string operations, this can save up to 10%
memory.
(cherry picked from commit 26fdcb68f0)
Previously, CreateFixPartitionShardIndexNames() created all
the relevant query strings for all the shards and executed
them as one large query string. In terms of memory consumption,
this huge command (and the ExprContext generated while running
it) was the main bottleneck.
With this change, we reduce the total memory
usage to roughly 1/shard_count of what it was.
On my local machine, for a distributed partitioned table with 120 partitions
and 32 shards each, the total memory consumption dropped from ~3GB
to ~0.1GB, while the total execution time increased from ~28 seconds
to ~30 seconds. This seems like a good trade-off.
(cherry picked from commit b8008999dc)
DESCRIPTION:
Fix bug #4949 where blocking shard moves fail if there is a foreign key between partitioned distributed tables (from child to parent). This is because we try to create constraints before attaching child partitions to the parent, which causes a constraint failure since the parent table will be empty. The fix is to reverse the order, i.e. attach partitions before we create constraints.
TESTING:
Added a new test 'shard_move_constraints_blocking', inspired by the existing 'shard_move_constraints', where we trigger the shard move with 'block_writes' instead of 'force_logical' to add coverage for this scenario.
This PR makes all of the features open source that were previously only
available in Citus Enterprise.
Features that this adds:
1. Non blocking shard moves/shard rebalancer
(`citus.logical_replication_timeout`)
2. Propagation of CREATE/DROP/ALTER ROLE statements
3. Propagation of GRANT statements
4. Propagation of CLUSTER statements
5. Propagation of ALTER DATABASE ... OWNER TO ...
6. Optimization for COPY when loading JSON to avoid double parsing of
the JSON object (`citus.skip_jsonb_validation_in_copy`)
7. Support for row level security
8. Support for `pg_dist_authinfo`, which allows storing different
authentication options for different users, e.g. you can store
passwords or certificates here.
9. Support for `pg_dist_poolinfo`, which allows using connection poolers
in between coordinator and workers
10. Tracking distributed query execution times using
citus_stat_statements (`citus.stat_statements_max`,
`citus.stat_statements_purge_interval`,
`citus.stat_statements_track`). This is disabled by default.
11. Blocking tenant_isolation
12. Support for `sslkey` and `sslcert` in `citus.node_conninfo`
We already have tests relying on citus_finalize_upgrade_to_citus11().
Now, adjust those to rely on citus_finish_citus_upgrade() and
always call citus_finish_citus_upgrade().
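For reference, a minimal sketch of the call the adjusted tests now rely on (assuming the procedure form, invoked with CALL):
```sql
-- Run the remaining upgrade steps after ALTER EXTENSION citus UPDATE.
CALL citus_finish_citus_upgrade();
```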
We remove `<waiting ...>` and `<... completed>` outputs for some CREATE
INDEX CONCURRENTLY commands since they can cause flakiness in some scenarios.
Postgres calls WaitForOlderSnapshots() and this can cause CREATE INDEX
CONCURRENTLY commands for shards to get blocked by each other for brief
periods of time. The extra waits can pop up, or they can get completed
at different lines in the output files. To remedy that, we rename those
indexes so that they are captured by the new normalization rule.
(cherry picked from commit 52541c5802)
The error occurs because the jsonb datum in pg_dist_node_metadata.metadata is 0 in some scenarios. This is likely due to not copying the data when receiving a datum from a tuple, with PG then deallocating that memory when the table that the tuple came from is closed.
Also fix another place in the code that might have been susceptible to this issue.
I tested on both multi-vg and multi-1-vg and the tests were successful.
(cherry picked from commit beef392f5a)
The general rule is:
- If the data is used within the bounds of table_open ... table_close -> no need to copy
- If the data is required for use even after the table is closed -> copy
(cherry picked from commit dc9da7630f)
To be able to alter a view's owner without enforcing sequential mode when
altering the distributed table, the ALTER VIEW processing functions have
been updated to use the metadata connection.
Do not obtain AccessShareLock before acquiring the distributed locks.
Acquiring an AccessShareLock ensures that the relations which we are trying to get a distributed lock on will not be dropped in the time between when the LOCK command is issued and when the LOCK commands are sent to the workers. However, this also leads to distributed deadlocks in scenarios like the following:
```sql
-- for dist lock acquiring order coor, w1, w2
-- on w2
LOCK t1 IN ACCESS EXCLUSIVE MODE;
-- acquire AccessShareLock locally on t1 to ensure it is not dropped while we get ready to distribute the lock
-- concurrently on w1
LOCK t1 IN ACCESS EXCLUSIVE MODE;
-- acquire AccessShareLock locally on t1 to ensure it is not dropped while we get ready to distribute the lock
-- acquire dist lock on coor, w1, gets blocked on local AccessShareLock on w2
-- on w2 continuation of the execution above
-- starts to acquire dist locks and gets blocked on the coor by the lock acquired by w1
-- distributed deadlock
```
We opt to avoid such deadlocks, at the cost of possibly running into errors when the relations we are trying to acquire locks on get dropped.
(cherry picked from commit 27ddb4fc8e)
It is often useful to be able to sync the metadata in parallel
across nodes.
Also citus_finalize_upgrade_to_citus11() uses
start_metadata_sync_to_primary_nodes() after this commit.
Note that this commit does not parallelize all pieces of node
activation or metadata syncing. Instead, it tries to parallelize
potentially large parts of the metadata, which are the objects and
distributed tables (in general, Citus tables).
In the future, it would be nice to sync the reference tables
in parallel across nodes.
Create ~720 distributed tables / ~23450 shards
```SQL
-- declaratively partitioned table
CREATE TABLE github_events_looooooooooooooong_name (
event_id bigint,
event_type text,
event_public boolean,
repo_id bigint,
payload jsonb,
repo jsonb,
actor jsonb,
org jsonb,
created_at timestamp
) PARTITION BY RANGE (created_at);
SELECT create_time_partitions(
table_name := 'github_events_looooooooooooooong_name',
partition_interval := '1 day',
end_at := now() + '24 months'
);
CREATE INDEX ON github_events_looooooooooooooong_name USING btree (event_id, event_type, event_public, repo_id);
SELECT create_distributed_table('github_events_looooooooooooooong_name', 'repo_id');
SET client_min_messages TO ERROR;
```
across 1 node: almost the same, as expected
```SQL
SELECT start_metadata_sync_to_primary_nodes();
Time: 15664.418 ms (00:15.664)
select start_metadata_sync_to_node(nodename,nodeport) from pg_dist_node;
Time: 14284.069 ms (00:14.284)
```
across 7 nodes: ~3.5x improvement
```SQL
SELECT start_metadata_sync_to_primary_nodes();
┌──────────────────────────────────────┐
│ start_metadata_sync_to_primary_nodes │
├──────────────────────────────────────┤
│ t │
└──────────────────────────────────────┘
(1 row)
Time: 25711.192 ms (00:25.711)
-- across 7 nodes
select start_metadata_sync_to_node(nodename,nodeport) from pg_dist_node;
Time: 82126.075 ms (01:22.126)
```
(cherry picked from commit dd02e1755f)
There are two problems in this area. First, when there are expressions
in the index, we should call `transformIndexExpression()` before
generating the index name. That is what Postgres does.
Second, because of 40c24bfef9,
PG 13 and PG 14 generate different names for indexes with function calls, even for local PG tables.
Assume we have:
```SQL
create table t(id int);
select create_distributed_table('t', 'id');
create index ON t (my_very_boring_function(id));
```
On PG 13, the name of the index is `t_expr_idx`
```SQL
\d t
Table "public.t"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ id │ integer │ │ │ │
└────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
"t_expr_idx" btree (my_very_boring_function(id::bigint))
```
On PG 14, the name of the index is `t_my_very_boring_function_idx`
```SQL
\d t
Table "public.t"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │ Type │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ id │ integer │ │ │ │
└────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
"t_my_very_boring_function_idx" btree (my_very_boring_function(id::bigint))
```
The second issue is not very critical. The important part is that
we adjust regression tests to drop all the indexes, which ensures
the index names are sane on any version.
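One version-agnostic way to express that drop is sketched below, using the `t` table from the example above; the actual tests may simply drop the indexes by name, so treat this as an illustration:
```sql
-- Drop every index on table t, whatever name PG 13 or PG 14 generated for it.
DO $$
DECLARE
    idx record;
BEGIN
    FOR idx IN
        SELECT schemaname, indexname FROM pg_indexes WHERE tablename = 't'
    LOOP
        EXECUTE format('DROP INDEX %I.%I', idx.schemaname, idx.indexname);
    END LOOP;
END;
$$;
```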
(cherry picked from commit 2cc4053fc1)
We have a mechanism which ensures that newly distributed
objects are recorded during `alter extension citus update`.
However, the logic did not cover views. With this commit, we make
sure that existing views are also marked as distributed during
upgrade.
(cherry picked from commit ee45e7bfbf)
Breaking down #5899 into smaller PRs.
This particular PR changes the way TRUNCATE acquires distributed locks on the relations it is truncating to use the LOCK command instead of lock_relation_if_exists. This has the benefit of using pg's recursive locking logic that it implements for the LOCK command, instead of us having to resolve relation dependencies and lock them explicitly. While this does not directly affect TRUNCATE, it will allow us to generalize this locking logic to lock different relations where the pg recursive locking will become useful (e.g. locking views).
This implementation is a bit more complex than it needs to be due to pg not supporting locking foreign tables. We can, however, still lock foreign tables with lock_relation_if_exists. So for a command:
TRUNCATE dist_table_1, dist_table_2, foreign_table_1, foreign_table_2, dist_table_3;
We generate and send the following command to all the workers in metadata:
```sql
SET citus.enable_ddl_propagation TO FALSE;
LOCK dist_table_1, dist_table_2 IN ACCESS EXCLUSIVE MODE;
SELECT lock_relation_if_exists('foreign_table_1', 'ACCESS EXCLUSIVE');
SELECT lock_relation_if_exists('foreign_table_2', 'ACCESS EXCLUSIVE');
LOCK dist_table_3 IN ACCESS EXCLUSIVE MODE;
SET citus.enable_ddl_propagation TO TRUE;
```
Note that we need to alternate between the LOCK command and lock_relation_if_exists in order to preserve the TRUNCATE order of relations.
When pg supports locking foreign tables, we will be able to massively simplify this logic and send a single LOCK command.
(cherry picked from commit 4c6f62efc6)
Adds support for propagating ALTER VIEW commands to
- Change owner of view
- SET/RESET option
- Rename view and view's column name
- Change schema of the view
Since PG also supports targeting views with ALTER TABLE
commands, related code was also added to redirect such ALTER TABLE
commands to ALTER VIEW commands when sending them to the workers.
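A hedged set of examples of the commands this covers, on a hypothetical distributed view `v` (role and schema names are illustrative):
```sql
ALTER VIEW v OWNER TO view_owner_role;         -- change owner
ALTER VIEW v SET (security_barrier = true);    -- SET/RESET a view option
ALTER VIEW v RENAME COLUMN a TO b;             -- rename a column
ALTER VIEW v RENAME TO v_renamed;              -- rename the view
ALTER VIEW v_renamed SET SCHEMA other_schema;  -- change schema
-- PG also accepts ALTER TABLE against a view; such commands are redirected to
-- the ALTER VIEW propagation path before being sent to the workers.
ALTER TABLE v_renamed OWNER TO view_owner_role;
```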
Adds support for propagating CREATE/DROP VIEW commands, and for propagating
views to worker nodes while scaling out the cluster. Since views are dropped
while converting the table type, the metadata connection is used when
propagating view commands, so that we don't switch to sequential mode.
With Citus MX enabled, when a reference table is modified, some
operations are performed on the first worker node (e.g., acquiring locks).
If node metadata is locked (via add node or create restore point),
changes to the reference tables should be blocked.