citus

Commit Graph

Author	SHA1	Message	Date
Ahmet Gedemenli	e36890ce55	Add source_lsn and target_lsn fields into get_rebalance_progress (#6364) DESCRIPTION: Adds source_lsn and target_lsn fields into get_rebalance_progress Adds two fields named `source_lsn` and `target_lsn` to the function `get_rebalance_progress`. Target LSN data is fetched in `GetShardStatistics`, by expanding the query sent to workers (joining with pg_subscription_rel and pg_stat_subscription). It is then put into the hashmap, for each shard. Source LSN data is fetched in `BuildWorkerShardStatististicsHash`, in the loop that iterates over each node, by sending a pg_current_wal_lsn query to each node. It is then put into the hashmap, for each node.	2022-10-05 11:12:24 +03:00
Jelte Fennema	aea4964b39	Fix flakiness in isolation_shard_rebalancer_progress (#6397) On our CI, the isolation_shard_rebalancer_progress test would sometimes randomly fail like this: ```diff table_name\|shardid\|shard_size\|sourcename\|sourceport\|source_shard_size\|targetname\|targetport\|target_shard_size\|progress\|operation_type ----------+-------+----------+----------+----------+-----------------+----------+----------+-----------------+--------+-------------- -colocated1\|1500001\| 49152\|localhost \| 57637\| 49152\|localhost \| 57638\| 73728\| 1\|move -colocated2\|1500005\| 376832\|localhost \| 57637\| 376832\|localhost \| 57638\| 401408\| 1\|move +colocated1\|1500001\| 49152\|localhost \| 57637\| 49152\|localhost \| 57638\| 81920\| 1\|move +colocated2\|1500005\| 376832\|localhost \| 57637\| 376832\|localhost \| 57638\| 409600\| 1\|move (2 rows) ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/27688/workflows/8c5ca443-5f21-4f21-b74f-0ca7bde69648/jobs/823648/parallel-runs/1 The shard sizes would be slightly larger or smaller than expected. This PR fixes that by snapping the output to the nearest expected shard size. To do so I used a trick described in this Stack Overflow answer: https://stackoverflow.com/a/33147437/2570866 When investigating I ran into one more random failure: ```diff -step s1-shard-move-c1-block-writes: <... completed> +step s4-shard-move-sep-block-writes: <... completed> citus_move_shard_placement -------------------------- (1 row) -step s4-shard-move-sep-block-writes: <... completed> +step s1-shard-move-c1-block-writes: <... completed> citus_move_shard_placement -------------------------- ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/27707/workflows/c3ff4fc7-5068-4096-ab9f-803c941ddac0/jobs/824622/parallel-runs/29?filterBy=FAILED This random failure happens because the two parallel moves can complete at the same time, so it's non-deterministic which one finishes first. To make this deterministic I used the "marker" feature from the isolation tester. And finally I ran into a third random failure: ```diff table_name\|shardid\|shard_size\|sourcename\|sourceport\|source_shard_size\|targetname\|targetport\|target_shard_size\|progress\|operation_type ----------+-------+----------+----------+----------+-----------------+----------+----------+-----------------+--------+-------------- -colocated1\|1500001\| 50000\|localhost \| 57637\| 50000\|localhost \| 57638\| 50000\| 1\|move -colocated2\|1500005\| 400000\|localhost \| 57637\| 400000\|localhost \| 57638\| 400000\| 1\|move +colocated1\|1500001\| 50000\|localhost \| 57637\| 50000\|localhost \| 57638\| 8000\| 1\|move +colocated2\|1500005\| 400000\|localhost \| 57637\| 400000\|localhost \| 57638\| 8000\| 1\|move colocated1\|1500002\| 200000\|localhost \| 57637\| 200000\|localhost \| 57638\| 0\| 0\|move colocated2\|1500006\| 8000\|localhost \| 57637\| 8000\|localhost \| 57638\| 0\| 0\|move ``` Source: https://app.circleci.com/pipelines/github/citusdata/citus/27707/workflows/c3ff4fc7-5068-4096-ab9f-803c941ddac0/jobs/824622/parallel-runs/30?filterBy=FAILED This happened in only two of the tests. For now I commented these tests out. I have some ideas on how to fix these, but these ideas require more impactful changes than I would like in this PR. One of these tests had a copy-paste error too; in passing I fixed that in the commented-out line.	2022-10-04 17:05:42 +02:00
Jelte Fennema	f13b140621	Show citus_copy_shard_placement progress in get_rebalance_progress (#6322) DESCRIPTION: Show citus_copy_shard_placement progress in get_rebalance_progress When rebalancing to a new node that does not have reference tables yet, the rebalancer will first copy the reference tables to that node. Depending on the size of the reference tables, this might take a long time. However, there's no indication of what's happening at this stage of the rebalance. This PR improves this situation by also showing the progress of any citus_copy_shard_placement calls when calling get_rebalance_progress.	2022-09-13 08:59:52 +00:00
Jelte Fennema	8bb082e77d	Fix reporting of progress on waiting and moved shards (#6274) In commit `31faa88a4e` I removed some features of the rebalance progress monitor. I did this because the plan was to remove the foreground shard rebalancer later in the PR that would add the background shard rebalancer. So, I didn't want to spend time fixing something that we would throw away anyway. As it turns out we're not removing the foreground shard rebalancer after all, so it made sense to fix the stuff that I broke. This PR does that. For the most part this commit reverts the changes in commit `31faa88a4e`. It's not a full revert though, because it keeps the improved tests and the changes to `citus_move_shard_placement`.	2022-08-31 14:55:47 +03:00
Jelte Fennema	31faa88a4e	Track rebalance progress at the shard move level (#6187) We're in the process of totally changing the shard rebalancer experience and infrastructure. Soon the shard rebalancer will include retries, crash recovery and support for running in the background. These improvements come at a cost though: the way the get_rebalance_progress UDF currently works is very hard to replicate with this new structure. This is mostly because the old behaviour doesn't really make sense anymore with this new infrastructure. A new and better way to track the progress will be included as part of the new infrastructure. This PR is in preparation for the new rebalancer experience. It changes the get_rebalance_progress UDF to only display the moves that are in progress at the moment, not the ones that happened in the past or that are planned in the future. Another option would have been to completely remove the current get_rebalance_progress functionality and point people to the new way of tracking progress. But old blog posts still reference the old UDF and users might have some automation on top of it. Showing the progress of the current moves is fairly simple to achieve, even with the new infrastructure. So this PR is a kind of compromise: it doesn't have complete feature parity with the old get_rebalance_progress, but the most common use cases will still work. There's also an advantage to the change: you can now see progress of shard moves that were triggered by calling citus_move_shard_placement manually, instead of only being able to see progress of moves that were initiated using get_rebalance_table_shards.	2022-08-18 18:57:04 +02:00
Hanefi Onaldi	2b7cf0c097	Replace iso tester func only once (#5964) Use Citus helper UDFs by default in iso tests. The PostgreSQL isolation test infrastructure uses some UDFs to detect whether concurrent sessions block each other. Citus implements alternatives to those UDFs so that we are able to detect and report distributed transactions that get blocked on the worker nodes as well. Previously we needed to explicitly replace the PG helper functions with the Citus implementations in each isolation file. Now we replace them by default.	2022-07-06 11:04:31 +03:00
SaitTalhaNisanci	b923d51fc6	Bump pg12 and pg13 images to pg12.8 and pg13.8 (#5208) In our testing infrastructure, even though we use pinned versions of postgres, the auxiliary libraries might pull in newer versions. This is for example the case for libpq, which will now use the libpq libraries from 14beta3. Many of the changes in this PR are due to the libpq changes. We have also changed the citus version that is used as a base for the citus upgrades, from 10.0 to 10.1. This caused columnar to enforce some extra limits on the settings, which conflicted with our upgrade tests. The changes in the failure tests are due to the libpq changes. There are also a lot of changes in the isolation test outputs, hence we updated all of them. Co-authored-by: Nils Dijk <nils@citusdata.com>	2021-08-25 16:04:57 +03:00
Jelte Fennema	2aa67421a7	Fix showing target shard size in the rebalance progress monitor (#5136) The progress monitor wouldn't actually update the size of the shard on the target node when using "block_writes" as the `shard_transfer_mode`. The reason for this is that the CREATE TABLE part of the shard creation would only be committed once all data was moved as well. This caused our size calculation to always return 0, since the table did not exist yet in the session that the progress monitor used. This is fixed by first committing creation of the table, and only then starting the actual data copy. The test output changes slightly. Apparently splitting this up into two transactions instead of one increases the table size after the copy by about 40kB. The additional size used doesn't increase when the amount of data in the table is larger (it stays ~40kB per shard). So this small change in test output is not considered an actual problem.	2021-07-23 16:37:00 +02:00
SaitTalhaNisanci	82f34a8d88	Enable citus.defer_drop_after_shard_move by default (#4961) Enable citus.defer_drop_after_shard_move by default	2021-05-21 10:48:32 +03:00
Jelte Fennema	10f06ad753	Fetch shard size on the fly for the rebalance monitor Without this change the rebalancer progress monitor gets the shard sizes from the `shardlength` column in `pg_dist_placement`. This column needs to be updated manually by calling `citus_update_table_statistics`. However, `citus_update_table_statistics` could lead to distributed deadlocks while database traffic is on-going (see #4752). To work around this we don't use the `shardlength` column anymore. Instead for every rebalance we now fetch all shard sizes on the fly. Three additional things this does are: 1. It adds tests for the rebalance progress function. 2. If a shard move cannot be done because a source or target node is unreachable, then we error and stop the rebalance, instead of showing a warning and continuing. When using the by_disk_size rebalance strategy it's not safe to continue with other moves if a specific move failed. It's possible that the failed move would have made space for the next move, and because the failed move never happened this space now does not exist. 3. It adds two new columns to the result of `get_rebalance_progress` which show the size of the shard on the source and target node. Fixes #4930	2021-05-20 16:38:17 +02:00
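
Commit aea4964b39 describes stabilising test output by fixing each reported shard size to the nearest expected size. The commit does this inside the isolation test's SQL; the snippet below is only a minimal Python sketch of the same idea. The function name `snap_to_expected` and the candidate list are hypothetical, with the candidate values taken from the sizes that appear in the third diff of that commit message.

```python
from typing import Sequence


def snap_to_expected(size: int, expected_sizes: Sequence[int]) -> int:
    """Return the candidate in expected_sizes closest to the measured size.

    Hypothetical helper for illustration; on a tie the earlier candidate wins.
    """
    if not expected_sizes:
        raise ValueError("expected_sizes must not be empty")
    return min(expected_sizes, key=lambda candidate: abs(candidate - size))


# Assumed candidate list, based on the sizes shown in the test output.
EXPECTED = [0, 8000, 50000, 200000, 400000]

# Both flaky readings of the same target shard map to one stable value.
stable_a = snap_to_expected(73728, EXPECTED)
stable_b = snap_to_expected(81920, EXPECTED)
```

With this approach, small variations in on-disk size no longer change the printed result, at the cost of hiding real size differences smaller than the gap between candidates.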

10 Commits (e36890ce558cb267659fa77274014785a47c04f2)
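
Commit e36890ce55 adds `source_lsn` and `target_lsn` to `get_rebalance_progress`. One way a client might use the two values is to estimate how many bytes of WAL the target still has to catch up on. The sketch below assumes the values arrive in PostgreSQL's textual `pg_lsn` form (two hexadecimal numbers separated by a slash) and that `target_lsn` may be missing for moves that do not use logical replication; the helper names are hypothetical and not part of Citus.

```python
from typing import Optional


def parse_lsn(lsn: str) -> int:
    """Convert a textual pg_lsn such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(source_lsn: Optional[str], target_lsn: Optional[str]) -> Optional[int]:
    """Bytes of WAL between source and target, or None if either is unknown."""
    if source_lsn is None or target_lsn is None:
        return None
    return parse_lsn(source_lsn) - parse_lsn(target_lsn)


# Example values are made up for illustration.
remaining = lag_bytes("16/B374D848", "16/B374D800")
```

A result close to zero would suggest that the target has nearly caught up with the source, though how to interpret the number for a given move is left to the caller.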