We had #6737 fix the same issue on the main branch, but we need to fix it on
release branches as well. As the patch does not apply cleanly to earlier
release branches, we just added the fix manually.
(cherry picked from commit 4f9a344085)
DESCRIPTION: Correctly report shard size in citus_shards view
When looking at citus_shards, people are interested in the actual size
that all the data related to the shard takes up on disk.
`pg_total_relation_size` is the function to use for that purpose. The
previously used `pg_relation_size` does not include indexes or TOAST.
Especially the missing TOAST data can have an enormous impact on the size of
the shown data.
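For illustration, a minimal sketch of the difference (the shard name here is
made up):
```sql
-- pg_relation_size only counts the main fork of the relation itself,
-- while pg_total_relation_size also includes indexes and TOAST data.
SELECT pg_size_pretty(pg_relation_size('dist_table_102008'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('dist_table_102008')) AS total_size;
```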
(cherry picked from commit b489d763e1)
We should not omit freeing the PGresult when we receive a single-tuple result
from an internal backend.
Single-tuple results are normally freed by our ReceiveResults for the
`tupleDescriptor != NULL` flow, but not for those with `tupleDescriptor
== NULL`. See PR #6722 for details.
DESCRIPTION: Fixes a memory leak issue with query results that return a
single row.
(cherry picked from commit 9e69dd0e7f)
In #6598 it was noticed that Citus could generate syntactically invalid
statements during logical replication. With #6603 we resolved the direct
issue, by only generating valid subscription names. But there was also
the underlying problem that we did not escape certain identifier
strings. While in theory this should be okay since we should only
generate names that are valid, this issue reiterated that we should not
take this for granted. As an extra line of defense this quotes all
identifiers we use during logical replication setup.
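The actual commands are generated in C; as a SQL-level sketch of the quoting
behaviour (subscription, connection, and publication values are illustrative):
```sql
-- format's %I applies quote_ident(), so the identifier gets double quotes
-- whenever it needs them and cannot break the generated statement;
-- %L quotes the connection string as a literal.
SELECT format('CREATE SUBSCRIPTION %I CONNECTION %L PUBLICATION %I',
              'citus_shard_move_subscription_10',
              'host=10.0.0.1 port=5432 dbname=postgres',
              'citus_shard_move_publication_10');
```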
(cherry picked from commit c2b4087ff0)
Pyenv is installed in our container images, but I found out that pyenv is
not being activated, since it is activated from the ~/.bashrc script and in
GitHub Actions (GHA) this script is not executed.
Since pyenv is not activated, the default Python version that comes with the
Docker images is used, and in this case we get errors for Python version
3.11.
Additionally, the $HOME directory is /github/home for containers executed
under GHA, while our pyenv installation is under the /root directory, which is
normally the home directory for our packaging containers.
This PR activates usage of pyenv and additionally uses the pyenv virtualenv
feature to execute the validate_output function in isolation.
---------
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
(cherry picked from commit d919506076)
Fixes #6570.
In the past, having columnar tables in the cluster was causing pg
upgrades to fail when attempting to access columnar metadata. This is
because pg_dump doesn't see the objects that we use for columnar-am related
bookkeeping as dependencies of the tables using columnar-am.
To fix that, in #5456 we inserted some "normal dependency" edges (from
those objects to columnar-am) into pg_depend.
This helped us ensure the existence of a class of metadata objects
--such as columnar.storageid_seq-- and helped fix #5437.
However, the normal-dependency edges that we added for indexes on
columnar metadata tables --such as columnar.stripe_pkey-- didn't help at
all because they were indeed causing dependency loops (#5510) and
pg_dump was not able to take those dependency edges into account.
For this reason, instead of inserting such dependency edges from indexes
to columnar-am, we allow columnar metadata accessors to fall back to
sequential scan during pg upgrades.
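A rough way to inspect the normal-dependency edges mentioned above, assuming
the columnar access method is installed:
```sql
-- List pg_depend edges that point at the columnar access method.
SELECT dep.classid::regclass AS dependent_catalog,
       dep.objid,
       dep.deptype
FROM pg_depend dep
JOIN pg_am am ON am.oid = dep.refobjid
WHERE dep.refclassid = 'pg_am'::regclass
  AND am.amname = 'columnar';
```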
(cherry picked from commit 1c51ddae49)
We should disallow dropping the table_name option if the foreign table is in
metadata. Otherwise, we get a "table not found" error which contains the
shard id.
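For example, with a hypothetical foreign table that is in metadata, this is
the kind of statement that is now rejected:
```sql
-- Dropping table_name would make the shard name unresolvable on the workers,
-- so Citus now errors out here instead of failing later with a shard-level
-- "table not found" error.
ALTER FOREIGN TABLE foreign_table_in_metadata
    OPTIONS (DROP table_name);
```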
DESCRIPTION: Fixes an unexpected foreign table error by disallowing dropping the table_name option.
Fixes #6663
(cherry picked from commit 8a9bb272e4)
DESCRIPTION: Fix crash when trying to replicate a ref table that is actually dropped
see #6592
We should have a real solution for it.
(cherry picked from commit bc3383170e)
(cherry picked from commit 9e32e34313)
With this PR, Citus code will be tested in all packaging environments.
Sometimes there can be compile errors which block packaging, and in
this case unplanned delays may occur.
By testing the code in packaging environments, I'm aiming to detect any
compilation errors before packaging.
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
In #6038 I tried to fix OpenSSL 3.0 warnings with PG13, but I had made a
mistake when doing that. This actually fixes these warnings.
(cherry picked from commit a477ffdf4b)
This removes some warnings that are present when building on Ubuntu 22.04.
It removes warnings on PG13 + OpenSSL 3.0. OpenSSL 3.0 has marked some
functions that we use as deprecated, but we want to continue supporting OpenSSL
1.0.1 for the time being too. This indicates that to OpenSSL 3.0, so it doesn't
show warnings.
(cherry picked from commit 3fadb98380)
DESCRIPTION: Fixes a potential dangling pointer issue
Need to backport to 11.0 & 11.1 since we might want to release packages
for debian/bookworm based on those branches in the future.
(cherry picked from commit 80faf47ab5)
DESCRIPTION: Raises memory limits in columnar from 256MB to 1GB for
reads and writes
This doesn't completely fix #5918 but at least increases the
buffer limits that might cause an error to be thrown when reading
from or writing into columnar storage. A way better approach
to fix this is documented in #6420.
Replacing memcpy_s with memcpy is quite safe in those places
since we make sure to allocate a sufficient amount of memory
before writing into the related buffers anyway.
(cherry picked from commit 0b81f68def)
Fixes https://github.com/citusdata/citus/issues/6394.
DESCRIPTION: Fixes a bug that causes disabled triggers to be created on
shards as enabled
Since CREATE TRIGGER doesn't have syntax support to specify
whether the trigger should be enabled/disabled, the underlying
PG function (`pg_get_triggerdef()`) that we use to generate the
command to create the trigger is not enough. For this reason, we
append a second command to enable/disable the trigger, right after
creating it.
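Roughly, the shard-level commands end up looking like this (trigger, function,
and shard names are illustrative):
```sql
-- pg_get_triggerdef() can only express an enabled trigger ...
CREATE TRIGGER my_trigger
    BEFORE INSERT ON dist_table_102008
    FOR EACH ROW EXECUTE FUNCTION my_trigger_func();
-- ... so a second command is appended to restore the original state.
ALTER TABLE dist_table_102008 DISABLE TRIGGER my_trigger;
```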
We also don't retain explicit extension dependencies set via
`ALTER TRIGGER ... DEPENDS ON EXTENSION` commands, but apparently the
right fix for that is to throw an error as in
`PreprocessAlterTriggerDependsStmt()`; so I opened a separate PR
to fix that, #6399.
(cherry picked from commit 86e186f671)
DESCRIPTION: Fixes a bug that prevents retaining columnar table options
after a table rewrite
A fix for this issue: "Columnar: options ignored during ALTER TABLE
rewrite" (#5927)
The OID for the temporary table created during ALTER TABLE was not the
same as the original table's OID so the columnar options were not being
applied during rewrite.
The change is that I applied the original table's columnar options to
the new table so that it has the correct options during write. I also
added a test.
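As a sketch of the behaviour being tested (the function and view names below
are as I recall them from the columnar extension and may differ by version):
```sql
-- Set a non-default option, force a table rewrite, and verify that the
-- option survives the rewrite.
SELECT alter_columnar_table_set('columnar_table', compression => 'pglz');
ALTER TABLE columnar_table ALTER COLUMN a TYPE bigint;  -- forces a rewrite
SELECT * FROM columnar.options;
```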
(cherry picked from commit f21cbe68f8)
During alter_distributed_table, we create a new table like the
original table but with the altered options.
To retrieve the name of the distribution column, we were using
the attribute syscache of the new table, since we already created
the new table as identical to the original table.
However, the attribute syscaches of these two tables are not
the same if the original table has dropped columns. The reason
is that dropped columns are all still present in the cache.
Hence, for example, the attnos would be different in the syscaches.
So, let's use the attribute syscache of the original table.
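A sketch of why the attribute numbers diverge (table names are hypothetical):
```sql
CREATE TABLE original (a int, to_be_dropped int, dist_col int);
ALTER TABLE original DROP COLUMN to_be_dropped;
-- The dropped column is still present in pg_attribute / the syscache,
-- so dist_col keeps attnum 3 here ...
SELECT attname, attnum, attisdropped
FROM pg_attribute WHERE attrelid = 'original'::regclass AND attnum > 0;
-- ... while a freshly created copy of the table has dist_col at attnum 2.
CREATE TABLE fresh_copy (a int, dist_col int);
SELECT attname, attnum
FROM pg_attribute WHERE attrelid = 'fresh_copy'::regclass AND attnum > 0;
```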
Given that we drop DEFAULT nextval('sequence') expressions from
shard relation columns, allowing `ON DELETE/UPDATE SET DEFAULT`
on such columns might cause inserting NULL values as a result
of a delete/update operation.
For this reason, we disallow ON DELETE/UPDATE SET DEFAULT actions
on columns that default to sequences.
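For example, a definition like the following is now rejected once the tables
are distributed (names are illustrative):
```sql
CREATE TABLE referenced (id bigserial PRIMARY KEY);
CREATE TABLE referencing (
    -- ref_id defaults to a sequence; the shard relations drop that DEFAULT,
    -- so ON DELETE SET DEFAULT would end up inserting NULL on the shards
    ref_id bigint DEFAULT nextval('referenced_id_seq')
        REFERENCES referenced (id) ON DELETE SET DEFAULT
);
```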
DESCRIPTION: Disallows having ON DELETE/UPDATE SET DEFAULT actions on
columns that default to sequences
Fixes #6339.
(cherry picked from commit a868cc049a)
Conflicts (dropped those changes since pg15 is not supported on 11.0):
src/test/regress/expected/pg15.out
src/test/regress/sql/pg15.sql
On Citus versions <= 11.0, IntegerArrayTypeToList() doesn't exist and
its helpers (DeconstructArrayObject() & ArrayObjectCount()) are defined
in worker_protocol.h. (See 9476f377b5).
So we add IntegerArrayTypeToList() into worker_protocol.c and pull its
declaration from worker_protocol.h instead of array_type.h
in foreign_constraint.c.
This is needed to backport a868cc049a into
this (release-11.0) branch, see the next commit.
As we did for GENERATED STORED columns in #4613, we should not drop
column default expressions that are not based on sequences from the shard
relation, since such expressions need to exist, e.g., for foreign key actions.
For the column default expressions that are based on sequences we cannot
do much, so we need to disallow having ON DELETE SET DEFAULT actions on
such columns in a separate PR, see #6339.
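A sketch of the case this fixes (names are illustrative): a literal DEFAULT
has to stay on the shard so the foreign key action can use it.
```sql
CREATE TABLE referenced (id int PRIMARY KEY);
CREATE TABLE referencing (
    -- a constant DEFAULT is not sequence-based, so it is now kept on the
    -- shard relations; before this fix the foreign key action would set
    -- the column to NULL instead of 42
    ref_id int DEFAULT 42 REFERENCES referenced (id) ON DELETE SET DEFAULT
);
```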
Fixes #6318.
DESCRIPTION: Fixes a bug that might cause inserting incorrect DEFAULT
values when applying foreign key actions
(cherry picked from commit de24a3eda5)
Conflicts (dropped those changes since pg15 is not supported on 11.0):
src/test/regress/expected/pg15.out
src/test/regress/sql/pg15.sql
pg_dist_node and pg_dist_colocation have a primary key index, not a replica identity index.
Citus catalog tables are created in the public schema, where the primary key
index serves as the replica identity index by default. Later the Citus catalog
tables are moved to the pg_catalog schema.
During pg_upgrade, all tables are recreated, and given that pg_dist_colocation is found in
the pg_catalog schema, it is recreated in that schema; when it is recreated it doesn't
have a replica identity index, because catalog tables have no replica identity.
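One way to inspect this after an upgrade (shown as a sketch):
```sql
-- relreplident 'd' means the default replica identity, i.e. the primary key;
-- indisreplident marks an index explicitly used as the replica identity.
SELECT relreplident FROM pg_class WHERE oid = 'pg_dist_colocation'::regclass;
SELECT indexrelid::regclass, indisreplident
FROM pg_index WHERE indrelid = 'pg_dist_colocation'::regclass;
```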
Further action:
Do we even need to acquire this lock on the primary key index?
Postgres doesn't acquire such locks on indexes before deleting catalog tuples.
Also, catalog tuples don't have replica identities by definition.
Historically we have been testing with the 'latest' version of libpq
when the CI images were built. This has the downside that rebuilding the
images often breaks our tests due to different errors returned from
libpq.
With this change we will actually test with a stable version of libpq
that is based on the postgres minor version that we test against.
This will make it easier to maintain postgres images over time, as well
as running _all_ tests locally, where we change libpq in sync with the
postgres server version.
(cherry picked from commit f944f97d01)
Here is a flaky test output that is quite hard to fix:
```diff
diff -dU10 -w /home/circleci/project/src/test/regress/expected/isolation_master_update_node_1.out /home/circleci/project/src/test/regress/results/isolation_master_update_node.out
--- /home/circleci/project/src/test/regress/expected/isolation_master_update_node_1.out.modified 2022-03-21 19:03:54.237042562 +0000
+++ /home/circleci/project/src/test/regress/results/isolation_master_update_node.out.modified 2022-03-21 19:03:54.257043084 +0000
@@ -49,18 +49,20 @@
<waiting ...>
step s2-update-node-1-force: <... completed>
master_update_node
------------------
(1 row)
step s2-abort: ABORT;
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
-SSL connection has been closed unexpectedly
+server closed the connection unexpectedly
+ This probably means the server terminated abnormally
+ before or while processing the request.
```
I could not come up with a solution that would decrease the flakiness in the test outputs. We already have 3 output files for the same test and now I introduced a 4th one.
I can also add complex regular expressions that span multiple lines, and normalize these error messages. Feel free to suggest a normalized error message in a comment here.
## Current alternative file contents
`isolation_master_update_node.out`
```
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
SSL connection has been closed unexpectedly
```
`isolation_master_update_node_0.out`
```
step s1-abort: ABORT;
WARNING: this step had a leftover error message
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```
`isolation_master_update_node_1.out`
```
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
SSL connection has been closed unexpectedly
```
new file: `isolation_master_update_node_2.out`
```
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```
(cherry picked from commit 518fb0873e)
In CI sometimes failure_setup will fail with the following error:
```diff
SELECT master_add_node('localhost', :worker_2_proxy_port); -- an mitmproxy which forwards to the second worker
- master_add_node
----------------------------------------------------------------------
- 2
-(1 row)
-
+ERROR: connection to the remote node localhost:9060 failed with the following error: could not connect to server: Connection refused
+ Is the server running on host "localhost" (127.0.0.1) and accepting
+ TCP/IP connections on port 9060?
+could not connect to server: Connection refused
+ Is the server running on host "localhost" (127.0.0.1) and accepting
+ TCP/IP connections on port 9060?
+could not connect to server: Cannot assign requested address
+ Is the server running on host "localhost" (::1) and accepting
+ TCP/IP connections on port 9060?
diff -dU10 -w /home/circleci/project/src/test/regress/expected/failure_online_move_shard_placement.out /home/circleci/project/src/test/regress/results/failure_online_move_shard_placement.out
```
This then breaks all the tests run after it as well, because we're
missing one worker node.
Locally I was able to reproduce this error by sleeping for 10 seconds in
the forked process before actually starting mitmproxy. So I'm
expecting that what's happening in CI is that, due to limited resources,
mitmproxy is not up yet when we try to add its port as a worker node.
This PR fixes this by waiting until mitmproxy is listening on its socket
before actually starting to run our tests. This fixed it locally for me
when I made the forked process sleep for 10 seconds before starting
mitmproxy.
In passing it also improves the detection and errors that we already
had for the case where something was already listening on the
mitmproxy port.
Because both @gledis69 and I were changing things in our CI images
at the same time, this also includes a bump of the style checker tools.
Closes #6200
(cherry picked from commit 25e5cf2e50)
Sometimes in CI our adaptive_executor test would fail randomly with the
following error:
```diff
SELECT sum(result::bigint) FROM run_command_on_workers($$
SELECT count(*) FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND query LIKE '%8010090%'
$$);
sum
-----
- 4
+ 2
(1 row)
END;
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26665/workflows/40665680-0044-4852-8fe4-5fd628f9fb47/jobs/764371
This means that the low slow start interval did not have any effect on
the number of connections being opened. I could see two possibilities
for this to happen:
1. CI was slow and was actually still starting the second connection. I
tried to solve this by doubling the time a query to the worker takes.
2. The second option is that the shards were queried in the opposite
order than we expect. This would mean that the first query to the
worker completes quickly because there's no sleep, since it doesn't
contain any rows. I tried to solve this option by adding a row to
each shard.
After trying to reproduce the random failure in CI it turned out that I
needed both of these fixes to resolve the random failure.
(cherry picked from commit f22a47981a)
Sometimes in CI our drop_partitioned_table test would fail with the
following error:
```diff
NOTICE: issuing SELECT worker_drop_distributed_table('drop_partitioned_table.child1')
NOTICE: issuing SELECT worker_drop_distributed_table('drop_partitioned_table.child1')
NOTICE: issuing DROP TABLE IF EXISTS drop_partitioned_table.child1_727001 CASCADE
-NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100047)
-NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100047)
+NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100046)
+NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100046)
ROLLBACK;
NOTICE: issuing ROLLBACK
NOTICE: issuing ROLLBACK
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26631/workflows/31536032-e1ba-493b-b12a-f40757f3a7d6/jobs/762170
For some reason the colocationid of the distributed partitioned table
would be one less than we expected. Why this happens I'm not sure, but
it seems fairly harmless that it does.
In an attempt to work around this flakiness I now reset the colocation
id sequence right before creating the table in question. This is good
practice in general, because it allows us to run the test successfully
using `check-minimal` and it also allows us to rerun it multiple times.
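The reset looks roughly like this (the sequence name is as I recall it from
other Citus tests; the restart value is the colocation id the test expects):
```sql
-- Pin the colocation id that the upcoming create_distributed_table() call
-- will get, so the test output no longer depends on earlier tests.
ALTER SEQUENCE pg_catalog.pg_dist_colocationid_seq RESTART 100047;
```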
(cherry picked from commit 895a484b39)
In CI sometimes failure_connection_establishment would fail with the
following error:
```diff
-- cancel all connections to this node
SELECT citus.mitmproxy('conn.onAuthenticationOk().cancel(' || pg_backend_pid() || ')');
- mitmproxy
----------------------------------------------------------------------
-
-(1 row)
-
+ERROR: canceling statement due to user request
+CONTEXT: COPY mitmproxy_result, line 1: ""
+SQL statement "COPY mitmproxy_result FROM '/home/circleci/project/src/test/regress/tmp_check/mitmproxy.fifo'"
+PL/pgSQL function citus.mitmproxy(text) line 11 at EXECUTE
SELECT * FROM citus_check_cluster_node_health();
```
The reason for this is that the mitm command that was used is very
broad and doesn't actually do what the comment says. What happens is
that if any connection is made, the current backend is cancelled, which
is not always the same as the backend that made the connection. My
assessment is that likely the maintenance daemon makes a connection to
the node while we are executing the mitmproxy command. The mitmproxy
command goes through, and then triggers a cancel of itself due to the
connection made by the maintenance daemon.
This PR simply removes this test, since it doesn't seem to test what it
intended to test anyway. There's also still the "kill" version of this
test, which does do the intended thing. So I don't think we lose
important coverage by removing this test.
(cherry picked from commit 2a0c0b3ba6)
Sometimes in CI multi_transaction_recovery would fail with the following
error:
```diff
SET LOCAL citus.defer_drop_after_shard_move TO OFF;
SELECT citus_move_shard_placement((SELECT * FROM selected_shard), 'localhost', :worker_1_port, 'localhost', :worker_2_port, shard_transfer_mode := 'block_writes');
- citus_move_shard_placement
----------------------------------------------------------------------
-
-(1 row)
-
+ERROR: could not find placement matching "localhost:57637"
+HINT: Confirm the placement still exists and try again.
COMMIT;
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26510/workflows/8269ea93-d9b4-4376-ae0e-8332a5c15fc6/jobs/755548
The reason for this was that when choosing `selected_shard` we didn't
ensure that it was actually located on the node that we were moving it
from. Instead we simply picked the first shard for the table that was
returned by the query.
To fix this issue this PR adds a filter to only choose shards that are
located on the intended node.
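The selection now filters on the source node, roughly like this (the table
name below is hypothetical; the port comes from the test's psql variable):
```sql
-- Only pick a shard whose placement is actually on worker 1,
-- since that is the node we move the shard away from.
SELECT shardid INTO selected_shard
FROM pg_dist_shard_placement
WHERE shardid IN (SELECT shardid FROM pg_dist_shard
                  WHERE logicalrelid = 'test_table'::regclass)
  AND nodeport = :worker_1_port
LIMIT 1;
```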
(cherry picked from commit 18015ca501)
Our isolation_distributed_deadlock_detection test would fail randomly in
CI in three different ways.
The first type of failure looked like this:
```diff
check_distributed_deadlocks
---------------------------
t
(1 row)
-step s1-update-5: <... completed>
step s5-update-1: <... completed>
ERROR: canceling the transaction since it was involved in a distributed deadlock
+step s1-update-5: <... completed>
step s1-commit:
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26399/workflows/d213ee85-397a-467a-9ffb-39e4f44e6688/jobs/749533
This random change in output was harmless and happened because when the
deadlock detector cancelled a query, two queries would continue: The one
that was cancelled would throw an error (and thus complete), and the one
that was unblocked would now complete.
It was random which of the two the isolation tester would first detect
as completed. To resolve this PR starts using the ["marker" feature][1],
this allows us to make sure one of the steps won't be marked as
completed until the other one completed first.
The second random failure was very similar:
```diff
check_distributed_deadlocks
---------------------------
t
(1 row)
-step s2-update-2: <... completed>
-step s3-update-3: <... completed>
-ERROR: canceling the transaction since it was involved in a distributed deadlock
step s6-commit:
COMMIT;
step s5-update-6: <... completed>
+step s2-update-2: <... completed>
+step s3-update-3: <... completed>
+ERROR: canceling the transaction since it was involved in a distributed deadlock
step s5-commit:
```
Again a harmless difference in test output. In this case it's possible
that the deadlock detector would not detect the unblocked processes
right away, and would thus continue to the next step. This step was
a commit on a session that was not blocked, and which thus could
complete without issues.
To solve this I changed the order of the commits at the end of the
permutation, to always have the first session that would commit be the
session that would be unblocked last. This ensures that no commit
will ever be executed before completing all the queries.
The third issue was different and looked like this:
```diff
step s4-update-5: <... completed>
step s4-commit:
COMMIT;
+step s1-update-4: <... completed>
+isolationtester: canceling step s3-update-4 after 5 seconds
step s3-update-4: <... completed>
+ERROR: canceling statement due to user request
+step s2-update-2: <... completed>
step s3-commit:
COMMIT;
-step s2-update-2: <... completed>
-step s1-update-4: <... completed>
step s1-commit:
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26411/workflows/9089beec-4f0f-4027-b4ce-0e84889afc06/jobs/750143
The reason for this failure is not entirely clear to me, but I was able
to remove the flakyness without impacting the goal of the test. What was
happening was that both `s1` and `s3` were waiting for `s4` to commit
and release its lock on row 4. For some reason it wasn't
deterministic which of the two sessions would be granted the lock after
it was released by `s4`. The test expected `s3` to be granted the lock,
but sometimes it would be granted to `s1` instead. Which would in turn
cause `s3` to still be blocked.
To solve this I simply removed `s1` completely from this test. It wasn't
actually part of the cycle that the deadlock detector should detect and
was an unrelated appendage:
```mermaid
graph TD;
s2-->s3;
s3-->s4;
s1-->s4;
s4-->s5;
s5-->s6;
s6-->s5;
```
By removing `s1` completely there was no contention for the lock and
`s3` could always acquire it.
[1]: a73d6c87f2/src/test/isolation/README (L163-L188)
(cherry picked from commit 9749622399)
In CI multi_replicate_reference_table would sometimes fail like this:
```diff
-- detects correctly that referecence table doesn't have replica identity
SELECT replicate_reference_tables();
-ERROR: cannot use logical replication to transfer shards of the relation initially_not_replicated_reference_table since it doesn't have a REPLICA IDENTITY or PRIMARY KEY
+ERROR: cannot use logical replication to transfer shards of the relation ref_table since it doesn't have a REPLICA IDENTITY or PRIMARY KEY
DETAIL: UPDATE and DELETE commands on the shard will error out during logical replication unless there is a REPLICA IDENTITY or PRIMARY KEY.
HINT: If you wish to continue without a replica identity set the shard_transfer_mode to 'force_logical' or 'block_writes'.
```
Because `CitusTableTypeIdList` returns tables in heap order, it's
a bit random which one is first in the list. And the test contained
multiple tables that didn't have a primary key or replica identity. So
it made sense that the error could be for either one of these tables.
This PR makes the test output consistent by changing one of the tables
to have a primary key.
Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26387/workflows/fc3196e7-ddf2-4000-a70b-5ac71c836321/jobs/748940
(cherry picked from commit 5c0205ce10)
One of our arbitrary config tests would sometimes fail like this in CI:
```diff
su_nationkey,
cust_nation,
l_year;
- supp_nation | cust_nation | l_year | revenue
----------------------------------------------------------------------
- 9 | C | 2008 | 3.00
-(1 row)
-
+ERROR: cannot connect to localhost:10212 to fetch intermediate results
+CONTEXT: while executing command on localhost:10211
```
When looking at the logs it seems like we were running out of
connections:
```
2022-08-23 14:03:52.856 UTC [28122] FATAL: sorry, too many clients already
2022-08-23 14:03:52.860 UTC [21027] ERROR: cannot connect to localhost:10212 to fetch intermediate results
```
This happened with `CitusThreeWorkersManyShards` config. This test on
purpose tries to push the limits of Citus quite far. And the
`ch_benchmarks_1` test is also run in parallel with a few others. So
it's not too weird that it ran out of connections. This doubles the
connection limit in the arbitrary config tests to hopefully not hit this
error again.
Example of failed test: https://app.circleci.com/pipelines/github/citusdata/citus/26365/workflows/7a1b5688-85cc-4bc3-ade5-9bd1d83cd0ed/jobs/747908/parallel-runs/1
(cherry picked from commit 21780b4f65)
This improves debugging of arbitrary configs in two ways:
1. Enable logging of distributed deadlock detection
2. Show output of `psql` commands
(cherry picked from commit a645cb4b94)
In CI our failure_connection_establishment sometimes failed randomly
with the following error:
```diff
-- verify a connection attempt was made to the intercepted node, this would have cause the
-- connection to have been delayed and thus caused a timeout
SELECT * FROM citus.dump_network_traffic() WHERE conn=0;
conn | source | message
------+--------+---------
- 0 | coordinator | [initial message]
-(1 row)
+(0 rows)
SELECT citus.mitmproxy('conn.allow()');
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26318/workflows/d3354024-9a67-4b01-9416-5cf79aec6bd8/jobs/745558
The way I fixed this was by removing the dump_network_traffic call. This
might sound simple, but doing this while continuing to let the test
serve its intended purpose required quite some more changes.
This dump_network_traffic call was there because we didn't want to show
warnings in the queries above, because the exact warnings were not
reliable. The main reason this error was not reliable was because we
were using round-robin task assignment. We did the same query twice, so
that it would hit the node with the intercepted connection in one of
those connections. Instead of doing that I'm now using the
"first-replica" policy and do the queries only once. This works, because
the first placements by placementid for each of the used tables are on
the second node, so first-replica will cause the first connection to go
there.
This solved most of the flakiness, but when confirming that the
flakiness was fixed I found some additional errors:
```diff
-- show that INSERT failed
SELECT citus.mitmproxy('conn.allow()');
mitmproxy
-----------
(1 row)
SELECT count(*) FROM single_replicatated WHERE key = 100;
- count
----------------------------------------------------------------------
- 0
-(1 row)
-
+ERROR: could not establish any connections to the node localhost:9060 after 400 ms
RESET client_min_messages;
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26321/workflows/fd5f4622-400c-465e-8d82-83f5f55a87ec/jobs/745666
I addressed this with a combination of two things:
1. Only change citus.node_connection_timeout for the queries that we
want to test timeout behaviour for. When those queries are done I
reset the value to the default again.
2. Change our mitm framework to only delay the initial connection packet
instead of all packets. I think sometimes a follow-on packet of a previous
connection attempt was causing the next connection attempt to be delayed
even if `conn.allow()` was already called. For our tests we only care about
connection timeouts, so there's no reason to delay any other packets than
the initial connection packet.
Then there was some last flakiness in the exact error that was given:
```diff
-- tests for connectivity checks
SELECT name FROM r1 WHERE id = 2;
WARNING: could not establish any connections to the node localhost:9060 after 900 ms
+WARNING: connection to the remote node localhost:9060 failed with the following error:
name
------
bar
(1 row)
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26338/workflows/9610941c-4d01-4f62-84dc-b91abc56c252/jobs/746467
I don't have a good explanation for this slight change in error message, but
given that it is missing the actual error message I expect this to be related
to some small difference in timing: e.g. the server responding to the connection
attempt right after the coordinator determined that the connection timed out.
To solve this last flakiness I increased the connection timeouts and made the
difference between the timeout and the delay a bit bigger. With these tweaks
I wasn't able to reproduce this error on CI anymore.
Finally, I made most of the same changes to failure_failover_to_local_execution,
since it was using the `conn.delay()` mitm method too. The only change that
I left out was the timing increase, since it might not be strictly necessary and
it increases the time it takes to run the test. If this test ever becomes flaky the first
thing we should try is increasing its timeout.
(cherry picked from commit cc7e93a56a)
The failure_single_select test would sometimes fail with an error that's
similar to this:
```diff
-- cancel after first SELECT; txn should fail and nothing should be marked as invalid
SELECT citus.mitmproxy('conn.onQuery(query="^SELECT").cancel(' || pg_backend_pid() || ')');
- mitmproxy
----------------------------------------------------------------------
-
-(1 row)
-
+ERROR: canceling statement due to user request
+CONTEXT: COPY mitmproxy_result, line 1: ""
+SQL statement "COPY mitmproxy_result FROM '/home/circleci/project/src/test/regress/tmp_check/mitmproxy.fifo'"
+PL/pgSQL function citus.mitmproxy(text) line 11 at EXECUTE
BEGIN;
```
This error looked very similar to the one from #6217 and indeed the cause turned
out to be similar. Because we were canceling all SELECT queries, we
would actually sometimes cancel our mitmproxy SELECT queries themselves.
This puts some additional restrictions on the queries that we cancel,
most importantly that they should contain the name of the table that we're
selecting from.
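The cancellation trigger then looks roughly like this (the table name below is
hypothetical; the actual test uses the name of the table it selects from):
```sql
-- Only cancel when the intercepted query is a SELECT on that table,
-- not any SELECT (which could be mitmproxy's own bookkeeping query).
SELECT citus.mitmproxy('conn.onQuery(query="^SELECT.*FROM.*products").cancel(' || pg_backend_pid() || ')');
```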
I was able to reproduce the original issue locally pretty reliably. With
the changes in this PR it didn't happen again.
In passing this also changes one other failure test that was cancelling
all selects and puts similar additional restrictions on those
cancellations.
Example of failed test in CI: https://app.circleci.com/pipelines/github/citusdata/citus/26305/workflows/4d942b91-f83c-453c-8d9a-ae22d608e756/jobs/745071
(cherry picked from commit 506c16efdf)
The failure_create_distributed_table_non_empty test would sometimes fail
like this:
```diff
-- in the first test, cancel the first connection we sent from the coordinator
SELECT citus.mitmproxy('conn.cancel(' || pg_backend_pid() || ')');
- mitmproxy
----------------------------------------------------------------------
-
-(1 row)
-
+ERROR: canceling statement due to user request
+CONTEXT: COPY mitmproxy_result, line 1: ""
+SQL statement "COPY mitmproxy_result FROM '/home/circleci/project/src/test/regress/tmp_check/mitmproxy.fifo'"
+PL/pgSQL function citus.mitmproxy(text) line 11 at EXECUTE
SELECT create_distributed_table('test_table', 'id');
```
Because the cancel command had no filter it would actually sometimes
cancel the mitmproxy cancel command itself. This PR addresses that by
filtering on CREATE TABLE, which is one of the commands that
create_distributed_table will send to the workers.
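So the command becomes something like the following sketch:
```sql
-- Only cancel on the CREATE TABLE that create_distributed_table sends to
-- the workers, instead of on the very first query on any connection.
SELECT citus.mitmproxy('conn.onQuery(query="CREATE TABLE").cancel(' || pg_backend_pid() || ')');
```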
Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26252/workflows/1b7e5464-cca4-4ec1-99b3-48ddf25c29fa/jobs/742829
(cherry picked from commit e2a24b921e)
Sometimes in CI the columnar_memory test was using slightly more memory
than expected.
```diff
SELECT CASE WHEN 1.0 * TopMemoryContext / :top_post BETWEEN 0.98 AND 1.02 THEN 1 ELSE 1.0 * TopMemoryContext / :top_post END AS top_growth
FROM columnar_test_helpers.columnar_store_memory_stats();
--[ RECORD 1 ]-
-top_growth | 1
+-[ RECORD 1 ]------------------
+top_growth | 1.0206132116232119
-- before this change, max mem usage while executing inserts was 28MB and
```
This PR changes the expectation to be slightly higher, such that this
random increase in memory usage doesn't cause a flaky test.
Failing test: https://app.circleci.com/pipelines/github/citusdata/citus/26256/workflows/c0870f66-3346-4f8d-a1d3-36dfd7c98289/jobs/743028
(cherry picked from commit 4ce17f015b)
Sometimes the columnar_memory test fails in CI with the following error:
```diff
SELECT 1.0 * TopMemoryContext / :top_post BETWEEN 0.98 AND 1.02 AS top_growth_ok
FROM columnar_test_helpers.columnar_store_memory_stats();
-[ RECORD 1 ]-+--
-top_growth_ok | t
+top_growth_ok | f
-- before this change, max mem usage while executing inserts was 28MB and
```
This is almost certainly a harmless failure that simply requires bumping
the margin a little bit. However, it's impossible to say with the
current output. I was unable to reproduce this on-demand on my local
machine or even in CI. So this changes the test to include the actual
value difference in the size of TopMemoryContext when it's outside the
expected range. Then next time it fails we at least have some
information about why.
Example of failing test: https://app.circleci.com/pipelines/github/citusdata/citus/25966/workflows/d472a57b-419a-4f33-b8bc-2e174a98d4d6/jobs/730576
(cherry picked from commit e6a1a86db0)