DESCRIPTION: PR description that will go into the change log, up to 78
characters
---------
Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
DESCRIPTION: Fix leaking of memory and memory contexts in Foreign
Constraint Graphs
Previously, every time we (re)created the Foreign Constraint
Relationship Graph, we created a new Memory Context while loosing a
reference to the previous context. This old context could still have
left over memory in there causing a memory leak.
With this patch we statically have one memory context that we lazily
initialize the first time we create our foreign constraint relationship
graph. On every subsequent creation, beside destroying our previous
hashmap we also reset our memory context to remove any left over
references.
When braking a colocation, we need to create a new colocation group
record in pg_dist_colocation for the relation. It is not sufficient to
have a new colocationid value in pg_dist_partition only.
This patch also fixes a bug when deleting a colocation group if no
tables are left in it. Previously we passed a relation id as a parameter
to DeleteColocationGroupIfNoTablesBelong function, where we should have
passed a colocation id.
(cherry picked from commit c22547d221)
Prior to this commit, the code would skip processing the
errors happened for local commands.
Prior to https://github.com/citusdata/citus/pull/5379, it might
make sense to allow the execution continue. But, as of today,
if a modification fails on any placement, we can safely fail
the execution.
(cherry picked from commit b4008bc872)
We had #6737 fix the same issue on main branch, but we need to fix it on
release branches as well. As the patch does not really apply to earlier
release branches, we just added the fix manually.
(cherry picked from commit 4f9a344085)
DESCRIPTION: Correctly report shard size in citus_shards view
When looking at citus_shards, people are interested in the actual size
that all the data related to the shard takes up on disk.
`pg_total_relation_size` is the function to use for that purpose. The
previously used `pg_relation_size` does not include indexes or TOAST.
Especially the missing toast can have enormous impact on the size of the
shown data.
(cherry picked from commit b489d763e1)
We should not omit to free PGResult when we receive single tuple result
from an internal backend.
Single tuple results are normally freed by our ReceiveResults for
`tupleDescriptor != NULL` flow but not for those with `tupleDescriptor
== NULL`. See PR #6722 for details.
DESCRIPTION: Fixes memory leak issue with query results that returns
single row.
(cherry picked from commit 9e69dd0e7f)
In #6598 it was noticed that Citus could generate syntactically invalid
statements during logical replication. With #6603 we resolved the direct
issue, by only generating valid subscription names. But there was also
the underlying problem that we did not escape certain identifier
strings. While in theory this should be okay since we should only
generate names that are valid, this issue reiterated that we should not
take this for granted. As an extra line of defense this quotes all
identifiers we use during logical replication setup.
(cherry picked from commit c2b4087ff0)
Pyenv is installed in our container images but I found out that pyenv is
not being activated since it is activated from ~/bashrc script and in
GitHub Actions (GHA) this script is not being executed
Since pyenv is not activated, default python versions comes from docker
images is being used and in this case we get errors for python version
3.11.
Additionally, $HOME directory is /github/home for containers executed
under GHA and our pyenv installation is under /root directory which is
normally home directory for our packaging containers
This PR activates usage of pyenv and additionally uses pyenv virtualenv
feature to execute validate_output function in isolation
---------
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
(cherry picked from commit d919506076)
Fixes#6570.
In the past, having columnar tables in the cluster was causing pg
upgrades to fail when attempting to access columnar metadata. This is
because, pg_dump doesn't see objects that we use for columnar-am related
booking as the dependencies of the tables using columnar-am.
To fix that; in #5456, we inserted some "normal dependency" edges (from
those objects to columnar-am) into pg_depend.
This helped us ensuring the existency of a class of metadata objects
--such as columnar.storageid_seq-- and helped fixing #5437.
However, the normal-dependency edges that we added for indexes on
columnar metadata tables --such columnar.stripe_pkey-- didn't help at
all because they were indeed causing dependency loops (#5510) and
pg_dump was not able to take those dependency edges into the account.
For this reason, instead of inserting such dependency edges from indexes
to columnar-am, we allow columnar metadata accessors to fall-back to
sequential scan during pg upgrades.
(cherry picked from commit 1c51ddae49)
We should disallow dropping table_name option if foreign table is in
metadata. Otherwise, we get table not found error which contains
shardid.
DESCRIPTION: Fixes an unexpected foreign table error by disallowing to drop the table_name option.
Fixes#6663
(cherry picked from commit 8a9bb272e4)
DESCRIPTION: Fix crash when trying to replicate a ref table that is actually dropped
see #6592
We should have a real solution for it.
(cherry picked from commit bc3383170e)
(cherry picked from commit 9e32e34313)
With this PR, citus code will be tested in all packaging environments.
Sometimes, there can be compile errors which blocks packaging and in
this case unplanned delays may occur.
By testing the code in packaging environments, I'm aiming to detect any
compilation errors before packaging.
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
DESCRIPTION: PR description that will go into the change log, up to 78
characters
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
Co-authored-by: Hanefi Onaldi <Hanefi.Onaldi@microsoft.com>
In #6038 I tried to fix OpenSSL 3.0 warnings with PG13, but I had made a
mistake when doing that. This actually fixes these warnings.
(cherry picked from commit a477ffdf4b)
This removes some warnings that are present when building on Ubuntu 22.04.
It removes warnings on PG13 + OpenSSL 3.0. OpenSSL 3.0 has marked some
functions that we use as deprecated, but we want to continue support OpenSSL
1.0.1 for the time being too. This indicates that to OpenSSL 3.0, so it doesn't
show warnings.
(cherry picked from commit 3fadb98380)
DESCRIPTION: Fixes a potential dangling pointer issue
Need to backport to 11.0 & 11.1 since we might want to release packages
for debian/bookworm based on those branches in future.
(cherry picked from commit 80faf47ab5)
DESCRIPTION: Raises memory limits in columnar from 256MB to 1GB for
reads and writes
This doesn't completely fix#5918 but at least increases the
buffer limits that might cause throwing an error when reading
from or writing into into columnar storage. A way better approach
to fix this is documented in #6420.
Replacing memcpy_s with memcpy is quite safe in those places
since we anyway make sure to allocate enough amount of memory
before writing into related buffers.
(cherry picked from commit 0b81f68def)
Fixes https://github.com/citusdata/citus/issues/6394.
DESCRIPTION: Fixes a bug that causes creating disabled-triggers on
shards as enabled
Since CREATE TRIGGER doesn't have syntax support to specify
whether the trigger should be enabled/disabled, the underlying
PG function (`pg_get_triggerdef()`) that we use to generate the
command to create the trigger is not enough. For this reason, we
append a second command to enable/disable trigger, right after
creating it.
We don't retain explicit extension dependencies set by using
`ALTER trigger DEPENDS ON EXTENSION` commands too, but apparently
right fix for that is to throw an error as in
`PreprocessAlterTriggerDependsStmt()`; so, opened a separate PR
to fix that #6399.
(cherry picked from commit 86e186f671)
DESCRIPTION: Fixes a bug that prevents retaining columnar table options
after a table-rewrite A fix for this issue: Columnar: options ignored
during ALTER TABLE rewrite #5927
The OID for the temporary table created during ALTER TABLE was not the
same as the original table's OID so the columnar options were not being
applied during rewrite.
The change is that I applied the original table's columnar options to
the new table so that it has the correct options during write. I also
added a test.
cherry-pick from commit f21cbe68f8
During alter_distributed_table, we create a new table like the
original table but with the altered options.
To retrieve the name of the distribution column, we were using
the attribute syscache of the new table, since we already created
the new table as identical to the original table.
However, the attribute syscaches of these two tables are not
the same if the original table has dropped columns. The reason
is that dropped columns are all still present in the cache.
Hence, for example, the attnos would be different in the syscaches.
So, let's use the attribute syscache of the original table.
Given that we drop DEFAULT nextval('sequence') expressions from
shard relation columns, allowing `ON DELETE/UPDATE SET DEFAULT`
on such columns might cause inserting NULL values as a result
of a delete/update operation.
For this reason, we disallow ON DELETE/UPDATE SET DEFAULT actions
on columns that default to sequences.
DESCRIPTION: Disallows having ON DELETE/UPDATE SET DEFAULT actions on
columns that default to sequences
Fixes#6339.
(cherry picked from commit a868cc049a)
Conflicts (dropped those changes since pg15 is not supported on 11.0):
src/test/regress/expected/pg15.out
src/test/regress/sql/pg15.sql
On Citus versions <= 11.0, IntegerArrayTypeToList() doesn't exist and
its helpers (DeconstructArrayObject() & ArrayObjectCount()) are defined
in worker_protocol.h. (See 9476f377b5).
So we add IntegerArrayTypeToList() into worker_protocol.c and include
IntegerArrayTypeToList from worker_protocol.h instead of array_type.h
in foreign_constraint.c.
This is needed to backport a868cc049a into
this (release-11.0) branch, see the next commit.
As we did for GENERATED STORED columns in #4613, we should not drop
column
default expressions that are not based on sequences from shard relation
since
such expressions need to exist e.g. for foreign key actions.
For the column default expressions that are based on sequences we cannot
do much, so we need to disallow having ON DELETE SET DEFAULT actions on
such columns in a separate PR, see #6339.
Fixes#6318.
DESCRIPTION: Fixes a bug that might cause inserting incorrect DEFAULT
values when applying foreign key actions
(cherry picked from commit de24a3eda5)
Conflicts (dropped those changes since pg15 is not supported on 11.0):
src/test/regress/expected/pg15.out
src/test/regress/sql/pg15.sql
pg_dist_node and pg_dist_colocation have a primary key index, not a replica identity index.
Citus catalog tables are created in public schema, which has replica identity index by default
as primary key index. Later the citus catalog tables are moved to pg_catalog schema.
During pg_upgrade, all tables are recreated, and given that pg_dist_colocation is found in
pg_catalog schema, it is recreated in that schema, and when it is recreated it doesn't
have a replica identity index, because catalog tables have no replica identity.
Further action:
Do we even need to acquire this lock on the primary key index?
Postgres doesn't acquire such locks on indexes before deleting catalog tuples.
Also, catalog tuples don't have replica identities by definition.
Historically we have been testing with the 'latest' version of libpq
when the CI images were build. This has the downside that rebuilding the
images often break our tests due to different errors returned from
libpq.
With this change we will actually test with a stable version of libpq
that is based on the postgres minor version that we test against.
This will make it easier to maintain postgres images over time, as well
as running _all_ tests locally, where we change libpq in sync with the
postgres server version.
(cherry picked from commit f944f97d01)
Here is a flaky test output that is quite hard to fix:
```diff
diff -dU10 -w /home/circleci/project/src/test/regress/expected/isolation_master_update_node_1.out /home/circleci/project/src/test/regress/results/isolation_master_update_node.out
--- /home/circleci/project/src/test/regress/expected/isolation_master_update_node_1.out.modified 2022-03-21 19:03:54.237042562 +0000
+++ /home/circleci/project/src/test/regress/results/isolation_master_update_node.out.modified 2022-03-21 19:03:54.257043084 +0000
@@ -49,18 +49,20 @@
<waiting ...>
step s2-update-node-1-force: <... completed>
master_update_node
------------------
(1 row)
step s2-abort: ABORT;
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
-SSL connection has been closed unexpectedly
+server closed the connection unexpectedly
+ This probably means the server terminated abnormally
+ before or while processing the request.
```
I could not come up with a solution that would decrease the flakiness in the test outputs. We already have 3 output files for the same test and now I introduced a 4th one.
I can also add complex regular expressions that span multiple lines, and normalize these error messages. Feel free to suggest a normalized error message in a comment here.
## Current alternative file contents
`isolation_master_update_node.out`
```
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
SSL connection has been closed unexpectedly
```
`isolation_master_update_node_0.out`
```
step s1-abort: ABORT;
WARNING: this step had a leftover error message
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```
`isolation_master_update_node_1.out`
```
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
SSL connection has been closed unexpectedly
```
new file: `isolation_master_update_node_2.out`
```
step s1-abort: ABORT;
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```
(cherry picked from commit 518fb0873e)
In CI sometimes failure_setup will fail with the following error:
```diff
SELECT master_add_node('localhost', :worker_2_proxy_port); -- an mitmproxy which forwards to the second worker
- master_add_node
----------------------------------------------------------------------
- 2
-(1 row)
-
+ERROR: connection to the remote node localhost:9060 failed with the following error: could not connect to server: Connection refused
+ Is the server running on host "localhost" (127.0.0.1) and accepting
+ TCP/IP connections on port 9060?
+could not connect to server: Connection refused
+ Is the server running on host "localhost" (127.0.0.1) and accepting
+ TCP/IP connections on port 9060?
+could not connect to server: Cannot assign requested address
+ Is the server running on host "localhost" (::1) and accepting
+ TCP/IP connections on port 9060?
diff -dU10 -w /home/circleci/project/src/test/regress/expected/failure_online_move_shard_placement.out /home/circleci/project/src/test/regress/results/failure_online_move_shard_placement.out
```
This then breaks all the tests run after it as well, because we're
missing one worker node.
Locally I was able to reproduce this error by sleeping for 10 seconds in
the forked process sleep before actually starting mitmproxy. So I'm
expecting what's happening in CI is that due to limited resources,
mitmproxy is not up yet when we try to add its port as a workernode.
This PR fixes this by waiting until mitmproxy is listening on its socket
before actually starting to run our tests. This fixed it locally for me
when I made the forked process sleep for 10 seconds before starting
mitmproxy.
In passing it also improves the detection and errors that we already
had for the case where something was already listening on the
mitmproxy port.
Because both @gledis69 and me were changing things in our CI images
at the same time this also includes a bump of the style checker tools.
Closes#6200
(cherry picked from commit 25e5cf2e50)
Sometimes in CI our adaptive_executor test would fail randomly with the
following error:
```diff
SELECT sum(result::bigint) FROM run_command_on_workers($$
SELECT count(*) FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND query LIKE '%8010090%'
$$);
sum
-----
- 4
+ 2
(1 row)
END;
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26665/workflows/40665680-0044-4852-8fe4-5fd628f9fb47/jobs/764371
This means that the low slow start interval did not have any effect on
the number of connections being opened. I could see two possibilities
for this to happen:
1. CI was slow and actually doing the start of the second connection. I
tried to solve this by doubling the time a query to the worker takes.
2. The second option is that the shards were queried in the oposite
order than we expect. This would mean that the first query to the
worker completes quickly because there's no, sleep because it doesn't
contain any rows. I tried to solve this option by adding a row to
each shard.
After trying to reproduce the random failure in CI it turned out that I
needed both of these fixes to resolve the random failure.
(cherry picked from commit f22a47981a)
Sometimes in CI our drop_partitioned_talbe test would fail with the
following error:
```diff
NOTICE: issuing SELECT worker_drop_distributed_table('drop_partitioned_table.child1')
NOTICE: issuing SELECT worker_drop_distributed_table('drop_partitioned_table.child1')
NOTICE: issuing DROP TABLE IF EXISTS drop_partitioned_table.child1_727001 CASCADE
-NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100047)
-NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100047)
+NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100046)
+NOTICE: issuing SELECT pg_catalog.citus_internal_delete_colocation_metadata(100046)
ROLLBACK;
NOTICE: issuing ROLLBACK
NOTICE: issuing ROLLBACK
```
Source: https://app.circleci.com/pipelines/github/citusdata/citus/26631/workflows/31536032-e1ba-493b-b12a-f40757f3a7d6/jobs/762170
For some reason the colocationid of the distributed partitioned table
would be one less than we expected. Why this happens I'm not sure, but
it seems fairly harmless that it does.
In an attempt to work around this flakyness I now reset the colocation
id sequence right before creating the table in question. This is good
practice in general, because it allows us to run the test successfully
using `check-minimal` and it also allows us to rerun it multiple times.
(cherry picked from commit 895a484b39)