Commit Graph

6086 Commits (background-job-details)

Author SHA1 Message Date
Jelte Fennema 37d1c78e7d Initial background job rebalance details 2022-08-23 15:24:05 +02:00
Jelte Fennema 8b672421a2 Fix warnings 2022-08-23 15:24:05 +02:00
Nils Dijk 780d9da1c6 wip update job state 2022-08-23 15:24:05 +02:00
Nils Dijk a23d340fbe add jobs tracking table 2022-08-23 15:24:05 +02:00
Nils Dijk 624e29e9e9 fix memory context creation 2022-08-23 15:24:05 +02:00
Nils Dijk 160385ca2f copy postgres' error_severiry implementation 2022-08-23 15:24:05 +02:00
Nils Dijk 414dbf9070 capture output of backend running background task 2022-08-23 15:24:05 +02:00
Nils Dijk 3db49b43c2 add lock for background task monitor to make sure there is only ever one running per database at the same time 2022-08-23 15:24:05 +02:00
Nils Dijk 1b66403cb9 refactor updating of task tuple to inmemory representation of said tuple 2022-08-23 15:24:05 +02:00
Nils Dijk c09d0feb85 big refactoring 2022-08-23 15:24:05 +02:00
Nils Dijk 88ea7508f9 big naming refactor 2022-08-23 15:24:05 +02:00
Nils Dijk 1be6d30588 s/import/include 2022-08-23 15:24:05 +02:00
Nils Dijk ec292cde48 WIP refactor background job execution 2022-08-23 15:24:05 +02:00
Nils Dijk 80e2edf1a5 refactor inner loop of background worker thing 2022-08-23 15:24:05 +02:00
Nils Dijk 5a2ec73475 unschedule dependant jobs on failure 2022-08-23 15:24:05 +02:00
Nils Dijk aee5dba634 add dependency tracking between rebalance jobs 2022-08-23 15:24:05 +02:00
Nils Dijk 6fa08619cf reset job status' on restart of background worker 2022-08-23 15:24:05 +02:00
Nils Dijk 2fbcba6c2a No need for a FormData due to many null values in the desired column order 2022-08-23 15:24:05 +02:00
Nils Dijk 836950ff07 Add running state to rebalance job with pid reported 2022-08-23 15:24:05 +02:00
Nils Dijk 3cf14ee816 let rebalancer schedule moves in background 2022-08-23 15:24:05 +02:00
Nils Dijk 5e7a4edd64 add simple test for background job 2022-08-23 15:24:05 +02:00
Nils Dijk 12aec202d8 add error state + wait for job udf 2022-08-23 15:24:05 +02:00
Nils Dijk 12b0063c31 run task based on sql command, log messages and simple retry 2022-08-23 15:24:04 +02:00
Nils Dijk 3365771162 execute task based on dedicated compsite type 2022-08-23 15:24:04 +02:00
Nils Dijk a7428c4ce6 basic infrastructure for running jobs in a background worker from the maintenance daemon 2022-08-23 15:24:04 +02:00
Hanefi Onaldi 28b04dc9f4
Merge pull request #6078 from citusdata/bump-pg-versions 2022-08-22 17:52:56 +03:00
Hanefi Onaldi a570dfead2
Remove outdated defaults for PG versions in CI configs 2022-08-22 17:16:52 +03:00
Hanefi Onaldi 616b1758c2
Add more normalization rules 2022-08-22 17:16:52 +03:00
Hanefi Onaldi e33ba7da9e
Decrease min messages for normalization 2022-08-22 17:16:52 +03:00
Hanefi Onaldi 9ec9209fd9
Bump PG versions in CI configs 2022-08-22 17:16:52 +03:00
Marco Slot 639588bee0
Remove unused functions (#6220)
Co-authored-by: Marco Slot <marco.slot@gmail.com>
2022-08-22 11:53:25 +03:00
Jelte Fennema e2a24b921e
Fix flakyness in failure_create_distributed_table_non_empty (#6217)
The failure_create_distributed_table_non_empty test would sometimes fail
like this:
```diff
 -- in the first test, cancel the first connection we sent from the coordinator
 SELECT citus.mitmproxy('conn.cancel(' ||  pg_backend_pid() || ')');
- mitmproxy
----------------------------------------------------------------------
-
-(1 row)
-
+ERROR:  canceling statement due to user request
+CONTEXT:  COPY mitmproxy_result, line 1: ""
+SQL statement "COPY mitmproxy_result FROM '/home/circleci/project/src/test/regress/tmp_check/mitmproxy.fifo'"
+PL/pgSQL function citus.mitmproxy(text) line 11 at EXECUTE
 SELECT create_distributed_table('test_table', 'id');
```

Because the cancel command had no filter it would actually sometimes
cancel the mitmproxy cancel command itself. This PR addresses that by
filtering on CREATE TABLE, which is one of the command that
create_distributed_table will send to the workers.

Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26252/workflows/1b7e5464-cca4-4ec1-99b3-48ddf25c29fa/jobs/742829
2022-08-20 01:23:25 +03:00
Jelte Fennema 4ce17f015b
Fix flakyness in columnar_memory test (#6216)
Sometimes in CI the columnar_memory test was using slightly more memory
than expected.
```diff
 SELECT CASE WHEN 1.0 * TopMemoryContext / :top_post BETWEEN 0.98 AND 1.02 THEN 1 ELSE 1.0 * TopMemoryContext / :top_post END AS top_growth
 FROM columnar_test_helpers.columnar_store_memory_stats();
--[ RECORD 1 ]-
-top_growth | 1
+-[ RECORD 1 ]------------------
+top_growth | 1.0206132116232119

 -- before this change, max mem usage while executing inserts was 28MB and
```

This PR changes the expectation to be slightly higher, such that this
random increase in memory usage doesn't cause a flaky test.

Failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26256/workflows/c0870f66-3346-4f8d-a1d3-36dfd7c98289/jobs/743028
2022-08-19 23:46:28 +02:00
Jelte Fennema de475feb69
Actually connect to the right database in logical_replication test (#6211)
In the logical_replication test we test that the cleanup logic at the
start of a shard move works as expected. To do so we create a
subscription and publication slot manually. This changes the test to
make that subscription actually connect to the database that the
publication is in.

Useful for #5987 #6085
2022-08-20 00:09:50 +03:00
Jelte Fennema dfa6c26d7d
Increase isolation timeout because of shards splits (#6213)
Recently isolation tests involving shard splits have been randomly
failing in CI with timeouts. It's possible that there's an actual bug
here, but it's also quite likely that our timeout is just slightly too
low for the combination of shard splits and the CI VM having a bad day.

Increasing the timeout is fairly low cost and allows us to find out if
there's an actual bug or if its simply slowness. So that's what this PR
does. If it turns out to be an actual bug, we can decrease the timeout
again when we fix it.

Examples of failed tests:
1. https://app.circleci.com/pipelines/github/citusdata/citus/26241/workflows/9e0bb721-d798-481b-907c-914236b63e38/jobs/742409
2. https://app.circleci.com/pipelines/github/citusdata/citus/26171/workflows/8f352e3b-e6e4-4f7f-b0d0-2543f62a0209/jobs/739470
2022-08-19 22:37:45 +03:00
Naisila Puka 9cfadd7965
Deletes unnecessary test outputs pt2 (#6214) 2022-08-19 18:21:13 +03:00
Jelte Fennema 85305b2773
Don't run any isolation tests in parallel (#6212)
By running isolation tests in parallel we're just asking for flaky
tasks. The first test might temporarily block one of the commands in the
second test, which we then detect as waiting like this:
```diff
 step s2-vacuum-analyze:
     VACUUM ANALYZE test_insert_vacuum;
-
+ <waiting ...>
 step s1-commit:
     COMMIT;

+step s2-vacuum-analyze: <... completed>
```

Debugging flaky tests is also much harder when they are run in parallel.
This PR starts running all our isolation tests sequentially.

The reason for opening this PR was me seeing this failing test:
https://app.circleci.com/pipelines/github/citusdata/citus/26194/workflows/ff57e2cf-8ac4-40fe-bc0c-74a7f8fecb53/jobs/740454

As well as having fixed a similar issue recently in #6122
2022-08-19 17:05:36 +02:00
Önder Kalacı 616ff2a3fe
Adjust some isolation test for the recent PG commits (#6210)
* Adjust some isolation test for the recent PG commits

In 3f32395612,
Postgres starts any isolation session with `set application_name`.

However, one of the tests we had expected that it is exactly the first
command in the session. The test tries to show that even if a gpid
has not been assigned, we can show it in the citus_lock_waits graph.

Now that, it is literally not possible to have such test as gpid
would be assigned after `set application_name` command. Still,
it is good to have a test where a command is blocked on the parser
2022-08-19 17:06:34 +03:00
Jelte Fennema e6a1a86db0
Improve debugability for columnar_memory flakyness (#6203)
Sometimes the columnar_memory test fails in CI with the following error:
```diff
 SELECT 1.0 * TopMemoryContext / :top_post BETWEEN 0.98 AND 1.02 AS top_growth_ok
 FROM columnar_test_helpers.columnar_store_memory_stats();
 -[ RECORD 1 ]-+--
-top_growth_ok | t
+top_growth_ok | f

 -- before this change, max mem usage while executing inserts was 28MB and
```

This is almost certainly a harmless failure that simply requires bumping
the margin a little bit. However, it's impossible to say with the
current output. I was unable to reproduce this on-demand on my local
machine or even in CI. So this changes the test to include the actual
value difference in the size of TopMemoryContext when it's outside the
expected range. Then next time it fails we at least have some
information about why.

Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/25966/workflows/d472a57b-419a-4f33-b8bc-2e174a98d4d6/jobs/730576
2022-08-19 15:41:16 +02:00
Jelte Fennema 3f4440ff69
Improve debugability of failures in isolation_ref2ref_foreign_keys (#6197)
As shown in #6196 the output of s1-view-locks is sometimes not as
expected. However, because it's output is very minimal it's hard to
understand the reason for that. This adds some more columns and
aggregates less, so we can more easily see what locks are unexpectedly
held or released.

In passing this also fixes the following flaky part of this test by excluding
locks taken by the maintenance daemon. After running it with this more
detailed output for s1-view-locks it became obvious that that was the
problem here.
```diff
diff -dU10 -w /home/jelte/work/citus/src/test/regress/expected/isolation_ref2ref_foreign_keys.out /home/jelte/work/citus/src/test/regress/results/isolation_ref2ref_foreign_keys.out
--- /home/jelte/work/citus/src/test/regress/expected/isolation_ref2ref_foreign_keys.out.modified	2022-08-18 15:42:08.689525233 +0200
+++ /home/jelte/work/citus/src/test/regress/results/isolation_ref2ref_foreign_keys.out.modified	2022-08-18 15:42:08.729525233 +0200
@@ -288,21 +288,22 @@
 
 step s1-view-locks: 
     SELECT mode, count(*)
     FROM pg_locks
     WHERE locktype='advisory'
     GROUP BY mode
     ORDER BY 1, 2;
 
 mode                    |count
 ------------------------+-----
-(0 rows)
+ShareUpdateExclusiveLock|    1
+(1 row)
 
 
 starting permutation: s2-begin s2-insert-table-3 s1-view-locks s2-rollback s1-view-locks
 step s2-begin: 
  BEGIN;
 
 step s2-insert-table-3: 
     INSERT INTO ref_table_3 VALUES (7, 5);
 
 step s1-view-locks: 
 ```
2022-08-19 15:12:09 +02:00
Jelte Fennema 25e5cf2e50
Fix flakyness in failure_setup (#6205)
In CI sometimes failure_setup will fail with the following error:
```diff
 SELECT master_add_node('localhost', :worker_2_proxy_port);  -- an mitmproxy which forwards to the second worker
- master_add_node
----------------------------------------------------------------------
-               2
-(1 row)
-
+ERROR:  connection to the remote node localhost:9060 failed with the following error: could not connect to server: Connection refused
+	Is the server running on host "localhost" (127.0.0.1) and accepting
+	TCP/IP connections on port 9060?
+could not connect to server: Connection refused
+	Is the server running on host "localhost" (127.0.0.1) and accepting
+	TCP/IP connections on port 9060?
+could not connect to server: Cannot assign requested address
+	Is the server running on host "localhost" (::1) and accepting
+	TCP/IP connections on port 9060?
diff -dU10 -w /home/circleci/project/src/test/regress/expected/failure_online_move_shard_placement.out /home/circleci/project/src/test/regress/results/failure_online_move_shard_placement.out
```

This then breaks all the tests run after it as well, because we're
missing one worker node.

Locally I was able to reproduce this error by sleeping for 10 seconds in
the forked process sleep before actually starting mitmproxy. So I'm
expecting what's happening in CI is that due to limited resources,
mitmproxy is not up yet when we try to add its port as a workernode.

This PR fixes this by waiting until mitmproxy is listening on its socket
before actually starting to run our tests. This fixed it locally for me
when I made the forked process sleep for 10 seconds before starting
mitmproxy.

In passing it also improves the detection and errors that we already
had for the case where something was already listening on the 
mitmproxy port.

Because both @gledis69 and me were changing things in our CI images
at the same time this also includes a bump of the style checker tools.
Closes #6200
2022-08-19 13:03:08 +00:00
Jelte Fennema 3fadb98380
Fix compilation warning on PG13 + OpenSSL 3.0 (#6038)
This removes some warnings that are present when building on Ubuntu 22.04. 
It removes warnings on PG13 + OpenSSL 3.0. OpenSSL 3.0 has marked some 
functions that we use as deprecated, but we want to continue support OpenSSL
1.0.1 for the time being too. This indicates that to OpenSSL 3.0, so it doesn't 
show warnings.
2022-08-19 05:51:47 -07:00
Onur Tirtir d83fa7af34
Add a changelog entry for 10.2.8 (#6207) 2022-08-19 15:40:30 +03:00
Jelte Fennema fe1668e43f
Fix flakyness in multi_utilities (#6204)
Sometimes this multi_utilities would fail with the following error:

```diff
SET citus.log_remote_commands TO ON;
 -- should propagate to all workers because no table is specified
 ANALYZE;
 NOTICE:  issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
 DETAIL:  on server postgres@localhost:57637 connectionId: 1
 NOTICE:  issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 3461, '2022-08-19 01:56:06.35816-07');
 DETAIL:  on server postgres@localhost:57638 connectionId: 2
 NOTICE:  issuing SET citus.enable_ddl_propagation TO 'off'
 DETAIL:  on server postgres@localhost:57637 connectionId: 1
-NOTICE:  issuing SET citus.enable_ddl_propagation TO 'off'
-DETAIL:  on server postgres@localhost:xxxxx connectionId: xxxxxxx
 NOTICE:  issuing ANALYZE
 DETAIL:  on server postgres@localhost:57637 connectionId: 1
+NOTICE:  issuing SET citus.enable_ddl_propagation TO 'off'
+DETAIL:  on server postgres@localhost:57638 connectionId: 2
 NOTICE:  issuing ANALYZE
 DETAIL:  on server postgres@localhost:57638 connectionId: 2
```

This is simply a harmless change in output due to some timing
differences. This PR makes the test output consistent by only logging
the remote ANALYZE commands, not the SET commands.
2022-08-19 12:38:55 +02:00
Marco Slot 5160cafa82
Do not propagate GRANT ON SCHEMA from CREATE EXTENSION (#6175)
Co-authored-by: Marco Slot <marco.slot@gmail.com>
2022-08-19 13:23:47 +03:00
Jelte Fennema 8ce12eb51f
Fix flakyness in failure_insert_select_repartition (#6202)
This fixes our most commonly randomly failing failure test. The failing
diff is as follows:

```diff
SELECT citus.mitmproxy('conn.onQuery(query="fetch_intermediate_results").kill()');
  mitmproxy
 -----------

 (1 row)

 INSERT INTO target_table SELECT * FROM source_table;
-ERROR:  connection to the remote node localhost:xxxxx failed with the following error: connection not open
+ERROR:  could not open file "base/pgsql_job_cache/10_0_40/repartitioned_results_20770193413_from_4213590_to_1.data": No such file or directory
+CONTEXT:  while executing command on localhost:9060
+while executing command on localhost:57637
 SELECT * FROM target_table ORDER BY a;
```

As far as I can tell this is the cause of a race condition: After killing
fetch_intermediate_results on worker 9060, the previously created data
file gets cleaned up. The fetch_intermediate_results call that's sent
to worker 57637 will be cancelled and rolled back soon because of the
failure on the other connection. But if that fetch_intermediate_results
call is able to connect to 9060 before it is cancelled, it won't find
the file it's looking for there anymore. So while it's not the error we
expect, it does indicate that we succeeded.

To avoid this issue instead of killing the fetch_intermediate_results
call directly, we kill the COPY command that it uses to do the fetch.
This results in stable output as can be seen here, where 227 runs of
failure_insert_select_repartition succeeded:
https://app.circleci.com/pipelines/github/citusdata/citus/26168/workflows/9c64a3b6-f46c-4725-9fb4-8f6a2d00a023/jobs/739389

To be clear this changes the test to affects the opposite
fetch_intermediate_results call. This kills the fetch_intermediate_results
call of worker 57637, instead of killing the fetch_intermediate_results call
on worker 9060.

Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26147/workflows/780e95ea-264a-4c9f-ad2e-cf11449a795e/jobs/738467
2022-08-19 09:11:07 +00:00
Naisila Puka 5a9fdc221b
Add explicit alias to avoid debug output diff in pg15 (#6183) 2022-08-19 11:39:18 +03:00
Onur Tirtir 1f1b02b64b
Merge pull request #6199 from citusdata/cl-10.2.6-11.0.6
Add changelog entries for 10.2.7 & 11.0.6
2022-08-19 10:57:42 +03:00
Onur Tirtir b6b8f198d9 Add changelog entries for 11.0.6 2022-08-19 10:41:39 +03:00
Onur Tirtir efb0d5f48e Add changelog entries for 10.2.7 2022-08-19 10:41:26 +03:00