Bump PG versions to the latest minors 14.15, 15.10, 16.6
There is a libpq symlink issue when the images are built remotely
https://github.com/citusdata/citus/actions/runs/12583502447/job/35071296238
Hence, we use the commit sha of a local build of the images, pushed.
This is temporary, until we find the underlying cause of the symlink
issue.
---------
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
We thought we provided support for this in
b8c493f2c4
However the use of parameters in SQL is not supported in Citus. Since
generic plan queries use parameters, we can't support for now.
Relevant PG16 commit https://github.com/postgres/postgres/commit/3c05284Fixes#7813 with proper error message
DESCRIPTION: Fixes a crash that happens because of unsafe catalog access
when re-assigning the global pid after application_name changes.
When application_name changes, we don't actually need to
try re-assigning the global pid for external client backends because
application_name doesn't affect the global pid for such backends. Plus,
trying to re-assign the global pid for external client backends would
unnecessarily cause performing a catalog access when the cached local
node id is invalidated. However, accessing to the catalog tables is
dangerous in certain situations like when we're not in a transaction
block. And for the other types of backends, i.e., the Citus internal
backends, we need to re-assign the global pid when the application_name
changes because for such backends we simply extract the global pid
inherited from the originating backend from the application_name -that's
specified by originating backend when openning that connection- and this
doesn't require catalog access.
This PR is a proposed fix for issue
[7705](https://github.com/citusdata/citus/issues/7705). The following is
the background and rationale for the fix (please refer to
[7705](https://github.com/citusdata/citus/issues/7705) for context);
The `varnullingrels `field was introduced to the Var node struct
definition in Postgres 16. Its purpose is to associate a variable with
the set of outer join relations that can cause the variable to be NULL.
The `varnullingrels ` for the variable
`"gianluca_camp_test"."start_timestamp"` in the problem query is 3,
because the variable "gianluca_camp_test"."start_timestamp" is coming
from the inner (nullable) side of an outer join and 3 is the RT index
(aka relid) of that outer join. The problem occurs when the Postgres
planner attempts to plan the combine query. The format of a combine
query is:
```
SELECT <targets>
FROM pg_catalog.citus_extradata_container();
```
There is only one relation in a combine query, so no outer joins are
present, but the non-empty `varnullingrels `field causes the Postgres
planner to access structures for a non-existent relation. The source of
the problem is that, when creating the target list for the combine
query, function MasterAggregateMutator() uses copyObject() to construct
a Var node before setting the master table ID, and this copies over the
non-empty varnullingrels field in the case of the
`"gianluca_camp_test"."start_timestamp"` var. The proposed solution is
to have MasterAggregateMutator() use makeVar() instead of copyObject(),
and only set the fields that make sense for the combine query; var type,
collation and type modifier. The `varnullingrels `field can be left
empty because there is only one relation in the combine query.
A new regress test issue_7705.sql is added to exercise the fix. The
issue is not specific to window functions, any target expression that
cannot be pushed down and contains at least one column from the inner
side of a left outer join (so has a non-empty varnullingrels field) can
cause the same issue.
More about Citus combine queries
[here](https://github.com/citusdata/citus/tree/main/src/backend/distributed#combine-query-planner).
More about Postgres varnullingrels
[here](https://github.com/postgres/postgres/blob/master/src/backend/optimizer/README).
In function MasterAggregateMutator(), when the original Node is a Var node use makeVar() instead
of copyObject() when constructing the Var node for the target list of the combine query.
The varnullingrels field of the original Var node is ignored because it is not relevant for the
combine query; copying this cause the problem in issue 7705, where a coordinator query had
a Var with a reference to a non-existent join relation.
Very small PR, no changes to behaviour. Just a typo fix :-)
Under
`src/backend/distributed/sql/udfs/citus_finalize_upgrade_to_citus11/`
the sql has a typo "runnnig", which will be displayed to the user if the
`citus_check_cluster_node_health()` fails when calling
`citus_finish_citus_upgrade();`
Co-authored-by: eaydingol <60466783+eaydingol@users.noreply.github.com>
When multiple sessions concurrently attempt to add the same coordinator
node using `citus_set_coordinator_host`, there is a potential race
condition. Both sessions may pass the initial metadata check
(`isCoordinatorInMetadata`), but only one will succeed in adding the
node. The other session will fail with an assertion error
(`Assert(!nodeAlreadyExists)`), causing the server to crash. Even though
the `AddNodeMetadata` function takes an exclusive lock, it appears that
the lock is not preventing the race condition before the initial
metadata check.
- **Issue**: The current logic allows concurrent sessions to pass the
check for existing coordinators, leading to an attempt to insert
duplicate nodes, which triggers the assertion failure.
- **Impact**: This race condition leads to crashes during operations
that involve concurrent coordinator additions, as seen in
https://github.com/citusdata/citus/issues/7646.
**Test Plan:**
- Isolation Test Limitation: An isolation test was added to simulate
concurrent additions of the same coordinator node, but due to the
behavior of PostgreSQL locking mechanisms, the test does not trigger the
edge case. The lock applied within the function serializes the
operations, preventing the race condition from occurring in the
isolation test environment.
While the edge case is difficult to reproduce in an isolation test, the
fix addresses the core issue by ensuring concurrency control through
proper locking.
- Existing Tests: All existing tests related to node metadata and
coordinator management have been run to ensure that no regressions were
introduced.
**After the Fix:**
- Concurrent attempts to add the same coordinator node will be
serialized. One session will succeed in adding the node, while the
others will skip the operation without crashing the server.
Co-authored-by: Mehmet YILMAZ <mehmet.yilmaz@microsoft.com>
**Description:**
This PR adds a section to CONTRIBUTING.md that explains how to set up
debugging in the devcontainer using VS Code.
**Changes:**
- **New Debugging Section**: Clear instructions on starting the
debugger, selecting the appropriate PostgreSQL process, and setting
breakpoints for easier troubleshooting.
**Purpose:**
- **Improved Contributor Workflow**: Enables contributors to debug the
Citus extension within the devcontainer, enhancing productivity and
making it easier to resolve issues.
---------
Co-authored-by: Mehmet YILMAZ <mehmet.yilmaz@microsoft.com>
DESCRIPTION: Add a check to see if the given limit is null.
Fixes a bug by checking if the limit given in the query is null when the
actual limit is computed with respect to the given offset.
Prior to this change, null is interpreted as 0 during the limit
calculation when both limit and offset are given.
Fixes#7663
Removes el/7 and ol/7 as runners and update checkout action to v4
We use EL/7 and OL/7 runners to test packaging for these distributions.
However, for the past two weeks, we've encountered errors during the
checkout step in the pipelines. The error message is as follows:
```
/__e/node20/bin/node: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib64/libc.so.6: version `GLIBC_2.25' not found (required by /__e/node20/bin/node)
```
The GCC version within the EL/7 and OL/7 Docker images is 2.17, and we
cannot upgrade it. Therefore, we need to remove these images from the
packaging test pipelines. Consequently, we will no longer verify if the
code builds for EL/7 and OL/7.
However, we are not using these packaging images as runners within the
packaging infrastructure, so we can continue to use these images for
packaging.
Additional Info: I learned that Marlin team fully dropped the el/7
support so we will drop in further releases as well
We move the CI images to the github container registry.
Given we mostly (if not solely) run these containers on github actions
infra it makes sense to have them hosted closer to where they are
needed.
Image changes: https://github.com/citusdata/the-process/pull/157
The sections about the rebalancer algorithm and the backround tasks were
empty.
---------
Co-authored-by: Marco Slot <marco.slot@gmail.com>
Co-authored-by: Steven Sheehy <17552371+steven-sheehy@users.noreply.github.com>
Related to issue #7619, #7620
Merge command fails when source query is single sharded and source and
target are co-located and insert is not using distribution key of
source.
Example
```
CREATE TABLE source (id integer);
CREATE TABLE target (id integer );
-- let's distribute both table on id field
SELECT create_distributed_table('source', 'id');
SELECT create_distributed_table('target', 'id');
MERGE INTO target t
USING ( SELECT 1 AS somekey
FROM source
WHERE source.id = 1) s
ON t.id = s.somekey
WHEN NOT MATCHED
THEN INSERT (id)
VALUES (s.somekey)
ERROR: MERGE INSERT must use the source table distribution column value
HINT: MERGE INSERT must use the source table distribution column value
```
Author's Opinion: If join is not between source and target distributed
column, we should not force user to use source distributed column while
inserting value of target distributed column.
Fix: If user is not using distributed key of source for insertion let's
not push down query to workers and don't force user to use source
distributed column if it is not part of join.
This reverts commit fa4fc0b372.
Co-authored-by: paragjain <paragjain@microsoft.com>
Because we want to track PR numbers and to make backporting easy we
(pretty much always) use squash-merges when merging to master. We
accidentally used a rebase merge for PR #7620. This reverts those
changes so we can redo the merge using squash merge.
This reverts all commits from eedb607c to 9e71750fc.
For some reason using localhost in our hba file doesn't have the
intended effect anymore in our Github Actions runners. Probably because
of some networking change (IPv6 maybe) or some change in the
`/etc/hosts` file.
Replacing localhost with the equivalent loopback IPv4 and IPv6 addresses
resolved this issue.
Updates checkout plugin for github actions to v4. Can not update the
version for check-sql-snapshots since new plugin causes below error in
the docker image this step is using . Please refer to:
https://github.com/citusdata/citus/actions/runs/9286197994/job/25552373953
Error:
```
/__e/node20/bin/node: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.27' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.28' not found (required by /__e/node20/bin/node)
/__e/node20/bin/node: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.25' not found (required by /__e/node20/bin/node)
```
DESCRIPTION: Fix performance issue when using "\d tablename" on a server
with many tables
We introduce a filter to every query on pg_class to automatically remove
shards. This is useful to make sure \d and PgAdmin are not cluttered
with shards. However, the way we were introducing this filter was using
`securityQuals` which can have negative impact on query performance.
On clusters with 100k+ tables this could cause a simple "\d tablename"
command to take multiple seconds, because a skipped optimization by
Postgres causes a full table scan. This changes the code to introduce
this filter in the regular `quals` list instead of in `securityQuals`.
Which causes Postgres to use the intended optimization again.
For reference, this was initially reported as a Postgres issue by me:
https://www.postgresql.org/message-id/flat/4189982.1712785863%40sss.pgh.pa.us#b87421293b362d581ea8677e3bfea920
Variables being modified in the PG_TRY block and read in the PG_CATCH
block should be qualified with volatile.
The variable waitEventSet is modified in the PG_TRY block (line 1085)
and read in the PG_CATCH block (line 1095).
The variable relation is modified in the PG_TRY block (line 500) and
read in the PG_CATCH block (line 515).
Besides, the variable objectAddress doesn't need the volatile qualifier.
Ref: C99 7.13.2.1[^1],
> All accessible objects have values, and all other components of the
abstract machine have state, as of the time the longjmp function was
called, except that the values of objects of automatic storage duration
that are local to the function containing the invocation of the
corresponding setjmp macro that do not have volatile-qualified type and
have been changed between the setjmp invocation and longjmp call are
indeterminate.
[^1]: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
DESCRIPTION: Correctly mark some variables as volatile
---------
Co-authored-by: Hong Yi <zouzou0208@gmail.com>
Fix check-arbitrary-configs tests failure with current REL_16_STABLE.
This is the same problem as described in #7573. I missed pg_regress call
in _run_pg_regress() in that PR.
Co-authored-by: Karina Litskevich <litskevichkarina@gmail.com>
DESCRIPTION: Fix performance issue in GetForeignKeyOids on systems with
many constraints
GetForeignKeyOids was showing up in CPU profiles when distributing
schemas on systems with 100k+ constraints. The reason was that this
function was doing a sequence scan of pg_constraint to get the foreign
keys that referenced the requested table.
This fixes that by finding the constraints referencing the table through
pg_depend instead of pg_constraint. We're doing this indirection,
because pg_constraint doesn't have an index that we can use, but
pg_depend does.
DESCRIPTION: Fix PG upgrades when invalid rebalance strategies exist
Without this change an upgrade of a cluster with an invalid rebalance
strategy would fail with an error like this:
```
cache lookup failed for shard_cost_function with oid 6077337
CONTEXT: SQL statement "SELECT citus_validate_rebalance_strategy_functions(
NEW.shard_cost_function,
NEW.node_capacity_function,
NEW.shard_allowed_on_node_function)"
PL/pgSQL function citus_internal.pg_dist_rebalance_strategy_trigger_func() line 5 at PERFORM
SQL statement "INSERT INTO pg_catalog.pg_dist_rebalance_strategy SELECT
name,
default_strategy,
shard_cost_function::regprocedure::regproc,
node_capacity_function::regprocedure::regproc,
shard_allowed_on_node_function::regprocedure::regproc,
default_threshold,
minimum_threshold,
improvement_threshold
FROM public.pg_dist_rebalance_strategy"
PL/pgSQL function citus_finish_pg_upgrade() line 115 at SQL statement
```
This fixes that by disabling the trigger and simply re-inserting the
invalid rebalance strategy without checking. We could also silently
remove it, but this seems nicer.