With https://github.com/citusdata/citus/pull/5493 we introduced
metadata specific connections.
With this connection we guarantee that there is a single metadata connection.
But note that this connection can be used for any other operation.
In other words, this connection is not only reserved for metadata
operations.
However, as https://github.com/citusdata/citus-enterprise/issues/715 showed
us that the logic has a flaw. We allowed ineligible connections to be
picked as metadata connections: such as exclusively claimed connections
or not fully initialized connections.
With this commit, we make sure that we only consider eligable connections
for metadata operations.
We prefer the background daemon to only sync node metadata. That's
why we move placement metadata changes from disable node to
activate node. With that, we can make sure that disable node
only changes node metadata, whereas activate node syncs all
the metadata changes. In essence, we already expect all
nodes to be up when a node is activated. So, this does not change
the behavior much.
Dropping sequences means we need to recreate
and hence losing the sequence.
With this commit, we keep the existing sequences
such that resyncing wouldn't drop the sequence.
We do that by breaking the dependency of the sequence
from the table.
Split distributed/version_compat.h into dependency-free
pg_version_compat.h, and the original which still has
dependencies. The original doesn't have much purpose, but until other
files have better discipline about including the correct header files,
then it's still needed.
Also make distributed/listutils.h dependency-free. Should be moved
outside of 'distributed' subdirectory, but that will cause significant
code churn, so leave for another cleanup patch.
Now both files can be included in columnar without creating a
dependency on citus.
Before this commit, Citus was triggering metadata syncing
in the background when a function is distributed. However,
with Citus 11, we expect all clusters to have metadata synced
enabled. So, we do not expect any nodes not to have the metadata.
This change:
(a) pro: simplifies the code and opens up possibilities
to simplify futher by reducing the scope of
bg worker to only sync node metadata
(b) pro: explicitly asks users to sync the metadata such that
any unforseen impact can be easily detected
(c) con: For distributed functions without distribution
argument, we do not necessarily require the metadata
sycned. However, for completeness and simplicity, we
do so.
With Citus 11, the default behavior is to sync the metadata.
However, partitioned tables created pre-Citus 11 might have
index names that are not compatiable with metadata syncing.
See https://github.com/citusdata/citus/issues/4962 for the
details.
With this commit, we record the existence of partitioned tables
such that we can fix it later if any exists.
With this commit, fix_partition_shard_index_names()
works significantly faster.
For example,
32 shards, 365 partitions, 5 indexes drop from ~120 seconds to ~44 seconds
32 shards, 1095 partitions, 5 indexes drop from ~600 seconds to ~265 seconds
`queryStringList` can be really long, because it may contain #partitions * #indexes entries.
Before this change, we were actually going through the executor where each command
in the query string triggers 1 round trip per entry in queryStringList.
The aim of this commit is to avoid the round-trips by creating a single query string.
I first simply tried sending `q1;q2;..;qn` . However, the executor is designed to
handle `q1;q2;..;qn` type of query executions via the infrastructure mentioned
above (e.g., by tracking the query indexes in the list and doing 1 statement
per round trip).
One another option could have been to change the executor such that only track
the query index when `queryStringList` is provided not with queryString
including multiple `;`s . That is (a) more work (b) could cause weird edge
cases with failure handling (c) felt like coding a special case in to the executor
(cherry picked from commit 90928cfd74)
Fix function signature generation
Fix comment typo
Add test for worker_create_or_replace_object
Add test for recreating distributed functions with OUT/TABLE params
Add test for recreating distributed function that returns setof int
Fix test output
Fix comment
Simply applies
```SQL
SELECT textlike(command, citus.grep_remote_commands)
```
And, if returns true, the command is logged. Else, the log is ignored.
When citus.grep_remote_commands is empty string, all commands are
logged.
This UDF coordinates connectivity checks accross the whole cluster.
This UDF gets the list of active readable nodes in the cluster, and
coordinates all connectivity checks in sequential order.
The algorithm is:
for sourceNode in activeReadableWorkerList:
c = connectToNode(sourceNode)
for targetNode in activeReadableWorkerList:
result = c.execute(
"SELECT citus_check_connection_to_node(targetNode.name,
targetNode.port")
emit sourceNode.name,
sourceNode.port,
targetNode.name,
targetNode.port,
result
- result -> true -> connection attempt from source to target succeeded
- result -> false -> connection attempt from source to target failed
- result -> NULL -> connection attempt from the current node to source node failed
I suggest you use the following query to get an overview on the connectivity:
SELECT bool_and(COALESCE(result, false))
FROM citus_check_cluster_node_health();
Whenever this query returns false, there is a connectivity issue, check in detail.
PostgreSQL does not need calling this function since 7.4 release, and it
is a NOOP.
For more details, check PostgreSQL commit below :
commit dd04e958c8b03c0f0512497651678c7816af3198
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun Mar 9 03:34:10 2003 +0000
tuplestore_donestoring() isn't needed anymore, but provide a no-op
macro definition so as not to create compatibility problems.
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index b46babacd1..76fe9fb428 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -17,7 +17,7 @@
* Portions Copyright (c) 1996-2002, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * $Id: tuplestore.h,v 1.8 2003/03/09 02:19:13 tgl Exp $
+ * $Id: tuplestore.h,v 1.9 2003/03/09 03:34:10 tgl Exp $
*
*-------------------------------------------------------------------------
*/
@@ -41,6 +41,9 @@ extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
extern void tuplestore_puttuple(Tuplestorestate *state, void *tuple);
+/* tuplestore_donestoring() used to be required, but is no longer used */
+#define tuplestore_donestoring(state) ((void) 0)
+
/* backwards scan is only allowed if randomAccess was specified 'true' */
extern void *tuplestore_gettuple(Tuplestorestate *state, bool forward,
bool *should_free);
Since sequences are not marked as distributed while creating table if no
metadata worker node exists, we are marking all sequences distributed
while syncing metadata explicitly.
We've both allowed delegating functions and procedures from worker nodes
and also prevented delegation if a function/procedure has already been
propagated from another node.
Before that PR we were updating citus.pg_dist_object metadata, which keeps
the metadata related to objects on Citus, only on the coordinator node. In
order to allow using those object from worker nodes (or erroring out with
proper error message) we've started to propagate that metedata to worker
nodes as well.
citus_check_connection_to_node runs a simple query on a remote node and
reports whether this attempt was successful.
This UDF will be used to make sure each worker node can connect to all
the worker nodes in the cluster.
parameters:
nodename: required
nodeport: optional (default: 5432)
return value:
boolean success
As of master branch, Citus does all the modifications to replicated tables
(e.g., reference tables and distributed tables with replication factor > 1),
via 2PC and avoids any shardstate=3. As a side-effect of those changes,
handling node failures for replicated tables change.
With this PR, when one (or multiple) node failures happen, the users would
see query errors on modifications. If the problem is intermitant, that's OK,
once the node failure(s) recover by themselves, the modification queries would
succeed. If the node failure(s) are permenant, the users should call
`SELECT citus_disable_node(...)` to disable the node. As soon as the node is
disabled, modification would start to succeed. However, now the old node gets
behind. It means that, when the node is up again, the placements should be
re-created on the node. First, use `SELECT citus_activate_node()`. Then, use
`SELECT replicate_table_shards(...)` to replicate the missing placements on
the re-activated node.
With this commit, we make sure to use a dedicated connection per
node for all the metadata operations within the same transaction.
This is needed because the same metadata (e.g., metadata includes
the distributed table on the workers) can be modified accross
multiple connections.
With this connection we guarantee that there is a single metadata connection.
But note that this connection can be used for any other operation.
In other words, this connection is not only reserved for metadata
operations.
The checks for preventing to remove a node are very much reference
table centric. We are soon going to add the same checks for replicated
tables. So, make the checks generic such that:
(a) replicated tables fit naturally
(b) we can the same checks in `citus_disable_node`.
We re-define the meaning of active shard placement. It used
to only be defined via shardstate == SHARD_STATE_ACTIVE.
Now, we also add one more check. The worker node that the
placement is on should be active as well.
This is a preparation for supporting citus_disable_node()
for MX with multiple failures at the same time.
With this change, the maintanince daemon only needs to
sync the "node metadata" (e.g., pg_dist_node), not the
shard metadata.
Before this commit, we acquire the metadata locks on the reference
tables while removing/disabling a node on all the MX nodes.
Although it has some marginal benefits, such as a concurrent
modification during remove/disable node blocks, instead of erroring
out, the drawbacks seems worse. Both citus_remove_node and citus_disable_node
are not tolerant to multiple node failures.
With this commit, we relax the locks. The implication is that while
a node is removed/disabled, users might see query errors. On the
other hand, this change becomes removing/disabling nodes more
tolerant to multiple node failures.
During pg upgrades, we have seen that it is not guaranteed that a
columnar table will be created after metadata objects got created.
Prior to changes done in this commit, we had such a dependency
relationship in `pg_depend`:
```
columnar_table ----> columnarAM ----> citus extension
^ ^
| |
columnar.storage_id_seq -------------------- |
|
columnar.stripe -------------------------------
```
Since `pg_upgrade` just knows to follow topological sort of the objects
when creating database dump, above dependency graph doesn't imply that
`columnar_table` should be created before metadata objects such as
`columnar.storage_id_seq` and `columnar.stripe` are created.
For this reason, with this commit we add new records to `pg_depend` to
make columnarAM depending on all rel objects living in `columnar`
schema. That way, `pg_upgrade` will know it needs to create those before
creating `columnarAM`, and similarly, before creating any tables using
`columnarAM`.
Note that in addition to inserting those records via installation script,
we also do the same in `citus_finish_pg_upgrade()`. This is because,
`pg_upgrade` rebuilds catalog tables in the new cluster and that means,
we must insert them in the new cluster too.
- [x] Add some more regression test coverage
- [x] Make sure returning works fine in case of
local execution + remote execution
(task->partiallyLocalOrRemote works as expected, already added tests)
- [x] Implement locking properly (and add isolation tests)
- [x] We do #shardcount round-trips on `SerializeNonCommutativeWrites`.
We made it a single round-trip.
- [x] Acquire locks for subselects on the workers & add isolation tests
- [x] Add a GUC to prevent modification from the workers, hence increase the
coordinator-only throughput
- The performance slightly drops (~%15), unless
`citus.allow_modifications_from_workers_to_replicated_tables`
is set to false
We recently introduced a set of patches to 10.2, and introduced 10.2-4
migration version. This migration version only resides on `release-10.2`
branch, and is missing on our default branch. This creates a problem
because we do not have a valid migration path from 10.2 to latest 11.0.
To remedy this issue, I copied the relevant migration files from
`release-10.2` branch, and renamed some of our migration files on
default branch to make sure we have a linear upgrade path.
Before this commit, we required the user to be owner of the shard/table
in order to call lock_shard_resources.
However, that is too restrictive. We can have users with GRANTS
to the table who are not owners of the tables/shards.
With this commit, we allow such patterns.
This change creates a slightly higher abstraction of the `PartitionedResultDestReceiver` where it decouples the partitioning from writing it to a file. This allows for easier reuse for other `DestReceiver`'s that would like to route different tuples to different `DestReceiver`'s.
Originally there was a lot of state kept in `PartitionedResultDestReceiver` to be able to lazily create `FileDestReceivers` when the first tuple arrived for that target. This convoluted the implementation of the processing of tuples with where they should go.
This refactor changes that where it makes the `PartitionedResultDestReceiver` completely agnostic of what kind of Receivers it is writing to. When constructed you pass it a list of `DestReceiver` compatible pointers with the length of `partitionCount`. Internally the `PartitionedResultDestReceiver` keeps track of which `DestReceiver`'s have been started or not, and start them when they first receive a tuple.
Alternatively, if the instantiating code of the `PartitionedResultDestReceiver` wants, the startup can be turned from lazily to eagerly. When the startup is eager (not lazy) all `rStartup` functions on the list of `DestReceiver`'s are called during the startup of the `PartitionedResultDestReceiver` and marked as such.
A downside of this approach is the following. On highly partitioned destinations we now need to allocate a `FileDestReceiver` for every target, _always_. When the data passed into the `PartitionedResultDestReceiver` is highly skewed to a small set of `FileDestReceiver`'s this will waste some memory. Given the small size of a `FileDestReceiver`, and the fact that actual file handles are only created during the processing of the startup of the `FileDestReceiver` I think this memory waste is not a problem. If this would become a problem we could refactor the source list into some kind of generator object which can generate the `DestReceiver`'s on the fly.
* Refactor some checks in citus local tables
* all existing citus local tables are auto converted after upgrade
* Update warning messages in CreateCitusLocalTable
* Hide notice msg for auto converting local tables
* Hide hint msg
Co-authored-by: Ahmet Gedemenli <afgedemenli@gmail.com>
Clang 13 complains about a suspicious string concatenation. It thinks we
might have missed a comma. This adds parentheses to make it clear that
concatenation is indeed what we meant.
It seems like the decision for 2PC is more complicated than
it should be.
With this change, we do one behavioral change. In essense,
before this commit, when a SELECT task with replication factor > 1
is executed, the executor was triggering 2PC. And, in fact,
the transaction manager (`ConnectionModifiedPlacement()`) was
able to understand not to trigger 2PC when no modification happens.
However, for transaction blocks like:
BEGIN;
-- a command that triggers 2PC
-- A SELECT command on replication > 1
..
COMMIT;
The SELECT was used to be qualified as required 2PC. And, as a side-effect
the executor was setting `xactProperties.errorOnAnyFailure = true;`
So, the commands was failing at the time of execution. Now, they fail at
the end of the transaction.
In the past, we allowed users to manually switch to 1PC
(e.g., one phase commit). However, with this commit, we
don't. All multi-shard modifications are done via 2PC.
With Citus 9.0, we introduced `citus.single_shard_commit_protocol` which
defaults to 2PC.
With this commit, we prevent any user to set it to 1PC and drop support
for `citus.single_shard_commit_protocol`.
Although this might add some overhead for users, it is already the default
behaviour (so less likely) and marking placements as INVALID is much
worse.
- citus_get_all_dependencies_for_object: emulate what Citus
would qualify as
dependency when adding
a new node
- citus_get_dependencies_for_object: emulate what Citus would qualify
as dependency when creating an
object
Example use:
```SQL
-- find all the depedencies of table test
SELECT
pg_identify_object(t.classid, t.objid, t.objsubid)
FROM
(SELECT * FROM pg_get_object_address('table', '{test}', '{}')) as addr
JOIN LATERAL
citus_get_all_dependencies_for_object(addr.classid, addr.objid, addr.objsubid) as t(classid oid, objid oid, objsubid int)
ON TRUE
ORDER BY 1;
```
Add/fix tests
Fix creating partitions
Add test for mx - partition creating case
Enable cascading to partitioned tables
Fix mx partition adding test
Fix cascading through fkeys
Style
Disable converting with non-inherited fkeys
Fix detach bug
Early return in case of cascade & Add tests
Style
Fix undistribute_table bug & Fix test outputs
Remove RemovePartitionRelationIds
Test with undistribute_table
Add test for mx+convert+undistribute
Remove redundant usage of CreatePartitionedCitusLocalTable
Add some comments
Introduce bulk functions for generating attach/detach partition commands
Fix: Convert partitioned tables after adding fkey
Change the error message for partitions
Introduce function ErrorIfPartitionTableAddedToMetadata
Polish attach/detach command generation functions
Use time_partitions for testing
Move mx tests to citus_local_tables_mx
Add new partitioned table to cascade test
Add test with time series management UDFs
Fix test output
Fix: Assertion fail on relation access tracking
Style
Refactor creating partitioned citus local tables
Remove CreatePartitionedCitusLocalTable
Style
Error out if converting multi-level table
Revert some old tests
Error out adding partitioned partition
Polish
Polish/address
Fix create table partition of case
Use CascadeOperationForRelationIdList if no cascade needed
Fix create partition bug
Revert / Add new tests to mx
Style
Fix dropping fkey bug
Add test with IF NOT EXISTS
Convert to CLT when doing ATTACH PARTITION
Add comments
Add more tests with time series management
Edit the error message for converting the child
Use OR instead of AND in ErrorIfUnsupportedAlterTableStmt
Edit/improve tests
Disable ddl prop when dropping default column definitions
Disable/enable ddl prop just before/after the command
Add comment
Add sequence test
Add trigger test
Remove NeedCascadeViaForeignKeys
Add one more insert to sequence test
Add comment
Style
Fix test output shard ids
Update comments
Disable creating fkey on partitions
Move partition check to CreateCitusLocalTable
Add comment
Add check for attachingmulti-level partition
Add test for pg_constraint
Check pg_dist_partition in tests
Add test inserting on the worker
* Add udf to include shardId in broken partition shard index names
* Address reviews: rename index such that operations can be done on it
* More comprehensive index tests
* Final touches and formatting
* Make (columnar.stripe) first_row_number index a unique constraint
Since stripe_first_row_number_idx is required to scan a columnar
table, we need to make sure that it is created before doing anything
with columnar tables during pg upgrades.
However, a plain btree index is not a dependency of a table, so
pg_upgrade cannot guarantee that stripe_first_row_number_idx gets
created when creating columnar.stripe, unless we make it a unique
"constraint".
To do that, drop stripe_first_row_number_idx and create a unique
constraint with the same name to keep the code change at minimum.
* Add more pg upgrade tests for columnar
* Fix a logic error in uprade_columnar_after test
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
Since PG14 we can now use binary encoding for arrays and composite types
that contain user defined types. This was fixed in this commit in
Postgres: 670c0a1d47
This change starts using that knowledge, by not necessarily falling back
to text encoding anymore for those types.
While doing this and testing a bit more I found various cases where
binary encoding would fail that our checks didn't cover. This fixes
those cases and adds tests for those. It also fixes EXPLAIN ANALYZE
never using binary encoding, which was a leftover of workaround that
was not necessary anymore.
Finally, it changes the default for both `citus.enable_binary_protocol`
and `citus.binary_worker_copy_format` to `true` for PG14 and up. In our
cloud offering `binary_worker_copy_format` already was true by default.
`enable_binary_protocol` had some bug with MX and user defined types,
this bug was fixed by the above mentioned fixes.
- get_missing_time_partition_ranges: Gets the ranges of missing partitions for the given table, interval and range unless any existing partition conflicts with calculated missing ranges.
- create_time_partitions: Creates partitions by getting range values from get_missing_time_partition_ranges.
- drop_old_time_partitions: Drops partitions of the table older than given threshold.
In PG 14, procedures can have OUT parameters. In Citus' procedure
delegation framework, we need to adjust the function expression
to get the outargs parameters.
Releven PG change:
e56bce5d43
Simply call Postgres' function to report the progress on
each row recieved.
Note that we currently do not support "COPY dist/ref TO .." progress
report nicely. Citus has some specialized logic to support
"COPY dist/ref TO .." such that it either converts the underlying
command into "COPY (SELECT * FROM dist/ref ) ..." or sends COPY
command to shards directly. In the former case, "tuples_processed"
is only updated when the executor returns all the tuples, so the
progress is not accurate. In the latter case, Citus can actually
implement the progress report. But, for the sake of consistency,
we prefer to not implement at all.
Added to PG 14 with https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8a4f618e7ae3cb11b0b37d0f06f05c8ff905833f
Postgres changed stats expression types as of PG14. Hence we needed to
write the AppendColumnNames method. Also they removed the error on PG
side so we remove it as well.
Relevant commits on pg14:
a4d75c86bf15220df22de0a92c819ecef9db3849
388e75ad33489b77cfb9a8590a91e9287d8fb960
When queryId is not 0 and verbose is true, the query identifier is
emitted to the explain output. This is breaking Postgres outputs.
We disable de query identifier calculation in the tests.
Commit on PG that introduced the query identifier in the explain output:
4f0b0966c866ae9f0e15d7cc73ccf7ce4e1af84b
Relevant PG commit:
9e38c2bb5093ceb0c04d6315ccd8975bd17add66
fix array_cat_agg for pg upgrades
array_cat_agg now needs to take anycompatiblearray instead of anyarray
because array_cat changed its type from anyarray to anycompatiblearray
with pg14.
To handle upgrades correctly, we drop the aggregate in
citus_pg_prepare_upgrade. To be able to drop it, we first remove the
dependency from pg_depend.
Then we create the right aggregate in citus_finish_pg_upgrade and we
also add the dependency back to pg_depend.
Postgres doesn't accept NULL for queryStrings in explain plans anymore.
Internally, there are some places in Postgres where they modified the
NULLS to ""(the empty string). So we do the same on citus side.
Commit on Postgres:
1111b2668d89bfcb6f502789158b1233ab4217a6
Postgres expects to set the HASH_STRINGS explicitly in case of the
default behaivor for string hash function.
Postgres Commit
b3817f5f774663d55931dd4fab9c5a94a15ae7ab
get_partition_parent and RelationGetPartitionDesc functions now have new parameters to also include detached partitions
Thess new macros give us the ability to use these new parameter for PG14 and they don't give the parameters for previous versions
Existing parameters are set to not accept detached partitions
Relevant PG commit:
71f4c8c6f74ba021e55d35b1128d22fb8c6e1629
In two commits vacuumFlags in PGXACT is moved and then renamed to status flags
This macro uses the appropriate version of the flag
Relevant PG commits:
5788e258bb26495fab65ff3aa486268d1c50b123
cd9c1b3e197a9b53b840dcc87eb41b04d601a5f9
SetTuplestoreDestReceiverParams function now has two new parameters
This new macro give us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
Existing parameters are set to NULL to keep previous behavior
Relevant PG commit:
2f48ede080f42b97b594fb14102c82ca1001b80c
Some Copy related functions copied from Postgres had support for both old and new protocols
Postgres removed support for old version so we remove it too
Relevant PG commit:
3174d69fb96a66173224e60ec7053b988d5ed4d9
New macros: standard_ProcessUtility_compat, ProcessUtility_compat, ColumnarProcessUtility_compat, PrevProcessUtilityHook_compat
The functions now have a new bool parameter: readOnlyTree
These new macros give us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
In multi_ProcessUtility and ColumnarProcessUtility, before doing anything else, we check if readOnlyTree parameter is true and create a copy of pstmt
Existing readOnlyTree parameters are set to false since we already handle the read only case at multi_ProcessUtility and ColumnarProcessUtility
Relevant PG commit:
7c337b6b527b7052e6a751f966d5734c56f668b5
This function was copied from Postgres but it is not static at PG14
So we keep the definition only for previous versions
Relevant PG commit:
c532d15dddff14b01fe9ef1d465013cb8ef186df
CopyState struct is divided into parts and one of them is CopyFromState
This macro uses the appropriate one for PG versions
Relevant PG commit:
c532d15dddff14b01fe9ef1d465013cb8ef186df
In ReindexStmt concurrent field is moved to options and then options are converted to params list.
This macro uses previous fields for previous versions and the new params list with a new function named IsReindexWithParam for PG14
Relevant PG commits:
844c05abc3f1c1703bf17cf44ab66351ed9711d2
b5913f6120792465f4394b93c15c2e2ac0c08376
VacOptTernaryValue enum is renamed to VacOptValue.
In the enum there were three values, VACOPT_TERNARY_DEFAULT, VACOPT_TERNARY_DISABLED, and VACOPT_TERNARY_ENABLED
Now there are four values VACOPTVALUE_UNSPECIFIED, VACOPTVALUE_AUTO, VACOPTVALUE_DISABLED, and VACOPTVALUE_ENABLED
New macros are VacOptValue_compat, VACOPTVALUE_UNSPECIFIED_COMPAT, VACOPTVALUE_DISABLED_COMPAT, and VACOPTVALUE_ENABLED_COMPAT
The VACOPTVALUE_UNSPECIFIED_COMPAT matches VACOPT_TERNARY_DEFAULT and VACOPTVALUE_UNSPECIFIED. And there are no macro for VACOPTVALUE_AUTO.
Relevant PG commit:
3499df0dee8c4ea51d264a674df5b5e31991319a
New macros: FuncnameGetCandidates_compat and expand_function_arguments_compat
The functions (the ones without _compat) now have a new bool include_out_arguments parameter
These new macros give us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
Existing include_out_arguments parameters are set to 'false' to keep current behavior
Relevant PG commit:
e56bce5d43789cce95d099554ae9593ada92b3b7
stats function now have a new bool print_to_stderr parameter
This new macro gives us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
Existing print_to_stderr parameter is set to true to keep current behavior
Relevant PG commit:
43620e328617c1f41a2a54c8cee01723064e3ffa
getObjectTypeDescription and getObjectIdentity functions now have a new bool missing_ok parameter
These new macros give us the ability to use this new parameter for PG14 and they don't give the parameter for previous versions
Currently all missing_ok parameters are set to false to keep current behavior
Relevant PG commit:
2a10fdc4307a667883f7a3369cb93a721ade9680
The STATUS_WAITING define is removed and an enum with PROC_WAIT_STATUS_WAITING is added instead
This macro uses appropriate one
Relevant PG commit:
a513f1dfbf2c29a51b0f7cbd5913ce2d2ee452c5
AlterTableStmt's relkind field is changed into objtype
New AlterTableStmtObjType macro uses the appropriate one
Relevant PG commit:
cc35d8933a211d9965eb1c1d2749a903d5735db2
The logging of the amount of ignored moves crashed when no distributed
tables existed in a cluster. This also fixes in passing that the logging
of ignored moves logs the correct number of ignored moves if there
exist multiple colocation groups and all are rebalanced at the same time.
`tcp_user_timeout` is the awesome relatively unknown big brother of the
TCP keepalive related options. Instead of depending on keepalives being
sent, this determines that a socket is dead by waiting at most N seconds
for an ack of data that it has sent. It's exposed in libpq starting from
PG12.
DESCRIPTION: Fix a segfault caused by use after free in ConnectionsPlacementHash
Fix a segfault caused by retaining data in any of the hashmaps making up the Placement Connection Management.
We have seen production systems segfault due to random data referenced from ConnectionPlacementHash.
On investigation we found that the backends segfaulting on this had OOM errors closely prior to the segfault.
It has shown there are at least 15 places where an allocation can OOM that would cause ConnectionPlacementHash to retain pointers to memory from contexts that are subsequently freed. This would reproduce the segfault we have observed in production.
Conditions for these allocations are:
- allocated after first call to `AssociatePlacementWithShard`: https://github.com/citusdata/citus/blob/v10.0.3/src/backend/distributed/connection/placement_connection.c#L880-L881
- allocated before `StartNodeUserDatabaseConnection`: https://github.com/citusdata/citus/blob/v10.0.3/src/backend/distributed/connection/connection_management.c#L291
At least 15 points of memory allocation (which could fail) are between the callsites of both in a primary key lookup on a reference table - where we have seen an OOM cause a segfault moments later.
Instead of leaving any references in ConnectionPlacementHash, ConnectionShardHash and ColocatedPlacementsHash that could retain any pointers that are freed due to the TopTransactionContext being reset we clear all these hashes irregardless of the state of CurrentCoordinatedTransactionState.
Downside is that on any transaction abort we will now iterate through 4 hashmaps and clear their contents. Given that they are either already empty, which should cause a quick iteration, or non-empty, causing segfaults in subsequent executions, this overhead seems reasonable.
A better solution would be to move the creation of these hashmaps so they would live in the TopTransactionContext themself, assuming their contents would never outlive a transaction. This needs more investigation and is an involved refactor Hence fixing this quickly here.
- Add support for CRETE INDEX ... ON ONLY: Before that commit we were not sending "ONLY" option to the worker nodes at all. With this commit, "ONLY" parameter will be sent to the worker nodes if it is necessary. (#4938)
- Add support for ALTER INDEX ... ATTACH PARTITION: Attach child_index to parent_index by creating same inheritance on shard level in addition to table level. (#4980)
* Synchronize hasmetadata flag on mx workers
* Switch to sequential execution
* Add test
* Use SetWorkerColumn
* Add test for stop_sync
* Remove usage of UpdateHasmetadataOnWorkersWithMetadata
* Remove MarkNodeMetadataSynced
* Fix test for metadatasynced
* Remove MarkNodeMetadataSynced
* Style
* Remove MarkNodeHasMetadata
* Remove UpdateDistNodeBoolAttr
* Refactor SetWorkerColumn
* Use SetWorkerColumnLocalOnly when setting up dependencies
* Use SetWorkerColumnLocalOnly in TriggerSyncMetadataToPrimaryNodes
* Style
* Make update command generator functions static
* Set metadatasynced before syncing
* Call SetWorkerColumn only if the sync is successful
* Try to sync all nodes
* Fix indexno
* Update metadatasynced locally first
* Break if a node fails to sync metadata
* Send worker commands optional
* Style & Rebase
* Add raiseOnError param to SetWorkerColumn
* Style
* Set metadatasynced for all metadata nodes
* Style
* Introduce SetWorkerColumnOptional
* Polish
* Style
* Dont send set command to not synced metadata nodes
* Style
* Polish
* Add test for stop_sync
* Add test for shouldhaveshards
* Add test for isactive flag
* Sort by placementid in the function verify_metadata
* Cover edge cases for failing nodes
* Add comments
* Add nodeport to isactive test
* Add warning if metadata out of sync
* Update warning message
In short, add wrappers around Postgres' AddWaitEventToSet() and
ModifyWaitEvent().
AddWaitEventToSet()/ModifyWaitEvent*() may throw hard errors. For
example, when the underlying socket for a connection is closed by
the remote server and already reflected by the OS, however
Citus hasn't had a chance to get this information. In that case,
if replication factor is >1, Citus can failover to other nodes
for executing the query. Even if replication factor = 1, Citus
can give much nicer errors.
So CitusAddWaitEventSetToSet()/CitusModifyWaitEvent() simply puts
AddWaitEventToSet()/ModifyWaitEvent() into a PG_TRY/PG_CATCH block
in order to catch any hard errors, and returns this information to
the caller.
As we use the current user to sync the metadata to the nodes
with #5105 (and many other PRs), there is no reason that
prevents us to use the coordinated transaction for metadata syncing.
This commit also renames few functions to reflect their actual
implementation.
Before this commit, creating a partition after a DROP column
on the parent (position before dist. key) was leading to
partition to have the wrong distribution column.
update_distributed_table_colocation can be called by the relation
owner, and internally it updates pg_dist_partition. With this
commit, update_distributed_table_colocation uses an internal
UDF to access pg_dist_partition.
As a result, this operation can now be done by regular users
on MX.
* Fix UNION not being pushdown
Postgres optimizes column fields that are not needed in the output. We
were relying on these fields to understand if it is safe to push down a
union query.
This fix looks at the parse query, which has the original column fields
to detect if it is safe to push down a union query.
* Add more tests
* Simplify code and make it more robust
* Process varlevelsup > 0 in FindReferencedTableColumn
* Only look for outers vars in union path
* Add more comments
* Remove UNION ALL specific logic for pulling up childvars
The progress monitor wouldn't actually update the size of the shard on
the target node when using "block_writes" as the `shard_transfer_mode`.
The reason for this is that the CREATE TABLE part of the shard creation
would only be committed once all data was moved as well. This caused
our size calculation to always return 0, since the table did not exist
yet in the session that the progress monitor used.
This is fixed by first committing creation of the table, and only then
starting the actual data copy.
The test output changes slightly. Apparently splitting this up in two
transactions instead of one, increases the table size after the copy by
about 40kB. The additional size used doesn't increase when with the
amount of data in the table is larger (it stays ~40kB per shard). So
this small change in test output is not considered an actual problem.
These two options were not included when creating the sequences on the
workers as part of metadata syncing.
The missing `data_type` part of the definition made finding the cause
of #5126 harder than necessary, because of confusing errors.
Before this commit, we always synced the metadata with superuser.
However, that creates various edge cases such as visibility errors
or self distributed deadlocks or complicates user access checks.
Instead, with this commit, we use the current user to sync the metadata.
Note that, `start_metadata_sync_to_node` still requires super user
because accessing certain metadata (like pg_dist_node) always require
superuser (e.g., the current user should be a superuser).
However, metadata syncing operations regarding the distributed
tables can now be done with regular users, as long as the user
is the owner of the table. A table owner can still insert non-sense
metadata, however it'd only affect its own table. So, we cannot do
anything about that.
This happens only when we have a "<" or "<=" filter on distribution
column of a range distributed table and that filter falls in between
two shards.
When the filter falls in between two shards:
If the filter is ">" or ">=", then UpperShardBoundary was
returning "upperBoundIndex - 1", where upperBoundIndex is
exclusive shard index used during binary seach.
This is expected since upperBoundIndex is an exclusive
index.
If the filter is "<" or "<=", then LowerShardBoundary was
returning "lowerBoundIndex + 1", where lowerBoundIndex is
inclusive shard index used during binary seach.
On the other hand, since lowerBoundIndex is an inclusive
index, we should just return lowerBoundIndex instead of
doing "+ 1". Before this commit, we were missing leftmost
shard in such queries.
* Remove useless conditional branches
The branch that we delete from UpperShardBoundary was obviously useless.
The other one in LowerShardBoundary became useless after we remove "+ 1"
from there.
This indeed is another proof of what & how we are fixing with this pr.
* Improve comments and add more
* Add some tests for upper bound calculation too
* Add parameter to cleanup metadata
* Set clear metadata default to true
* Add test for clearing metadata
* Separate test file for start/stop metadata syncing
* Fix stop_sync bug for secondary nodes
* Use PreventInTransactionBlock
* DRemovedebuggiing logs
* Remove relation not found logs from mx test
* Revert localGroupId when doing stop_sync
* Move metadata sync test to mx schedule
* Add test with name that needs to be quoted
* Add test for views and matviews
* Add test for distributed table with custom type
* Add comments to test
* Add test with stats, indexes and constraints
* Fix matview test
* Add test for dropped column
* Add notice messages to stop_metadata_sync
* Add coordinator check to stop metadat sync
* Revert local_group_id only if clearMetadata is true
* Add a final check to see the metadata is sane
* Remove the drop verbosity in test
* Remove table description tests from sync test
* Add stop sync to coordinator test
* Change the order in stop_sync
* Add test for hybrid (columnar+heap) partitioned table
* Change error to notice for stop sync to coordinator
* Sync at the end of the test to prevent any failures
* Add test case in a transaction block
* Remove relation not found tests
Ignore orphaned shards in more places
Only use active shard placements in RouterInsertTaskList
Use IncludingOrphanedPlacements in some more places
Fix comment
Add tests
The name and comment of this function did not indicate that it only
really could detect locally accessible citus local tables. This fixes
that, while also cleaning up the function a bit.
* Alter seq type when we first use the seq in a dist table
* Don't allow type changes when seq is used in dist table
* ALTER SEQUENCE propagation
* Tests for ALTER SEQUENCE propagation
* Relocate AlterSequenceType and ensure dependencies for sequence
* Support for citus local tables, and other fixes
* Final formatting
With the previous version of this check we would disallow distributed
tables that did not have a colocationid, to have a foreign key to a
reference table. This fixes that, since there's no reason to disallow
that.
Originally ReplicateShardToNode was meant for
`upgrade_to_reference_table`, which required handling of existing inactive
placements. These days `upgrade_to_reference_table` is deprecated and
cannot be used anymore. Now that we have SHARD_STATE_TO_DELETE too, this
left over code seemed error prone. So this removes support for
activating inactive reference table placemements, since these should not
be possible. If it finds a non active reference table placement anyway
it now errors out.
This also removes a few outdated comments related to `upgrade_to_refeference_table`.
Moving shards of reference tables was possible in at least one case:
```sql
select citus_disable_node('localhost', 9702);
create table r(x int);
select create_reference_table('r');
set citus.replicate_reference_tables_on_activate = off;
select citus_activate_node('localhost', 9702);
select citus_move_shard_placement(102008, 'localhost', 9701, 'localhost', 9702);
```
This would then remove the reference table shard on the source, causing
all kinds of issues. This fixes that by disallowing all shard moves
except for shards of distributed tables.
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
The first and main issue was that we were putting absolute pointers into
shared memory for the `steps` field of the `ProgressMonitorData`. This
pointer was being overwritten every time a process requested the monitor
steps, which is the only reason why this even worked in the first place.
To quote a part of a relevant stack overflow answer:
> First of all, putting absolute pointers in shared memory segments is
> terrible terible idea - those pointers would only be valid in the
> process that filled in their values. Shared memory segments are not
> guaranteed to attach at the same virtual address in every process.
> On the contrary - they attach where the system deems it possible when
> `shmaddr == NULL` is specified on call to `shmat()`
Source: https://stackoverflow.com/a/10781921/2570866
In this case a race condition occurred when a second process overwrote
the pointer in between the first process its write and read of the steps
field.
This issue is fixed by not storing the pointer in shared memory anymore.
Instead we now calculate it's position every time we need it.
The second race condition I have not been able to trigger, but I found
it while investigating this. This issue was that we published the handle
of the shared memory segment, before we initialized the data in the
steps. This means that during initialization of the data, a call to
`get_rebalance_progress()` could read partial data in an unsynchronized
manner.
With local query caching, we try to avoid deparse/parse stages as the
operation is too costly.
However, we can do deparse/parse operations once per cached queries, right
before we put the plan into the cache. With that, we avoid edge
cases like (4239) or (5038).
In a sense, we are making the local plan caching behave similar for non-cached
local/remote queries, by forcing to deparse the query once.
A shard move would fail if there was an orphaned version of the shard on
the target node. With this change before actually fail, we try to clean
up orphaned shards to see if that fixes the issue.
Sometimes the background daemon doesn't cleanup orphaned shards quickly
enough. It's useful to have a UDF to trigger this removal when needed.
We already had a UDF like this but it was only used during testing. This
exposes that UDF to users. As a safety measure it cannot be run in a
transaction, because that would cause the background daemon to stop
cleaning up shards while this transaction is running.
* Add user-defined sequence support for MX
* Remove default part when propagating to workers
* Fix ALTER TABLE with sequences for mx tables
* Clean up and add tests
* Propagate DROP SEQUENCE
* Removing function parts
* Propagate ALTER SEQUENCE
* Change sequence type before propagation & cleanup
* Revert "Propagate ALTER SEQUENCE"
This reverts commit 2bef64c5a29f4e7224a7f43b43b88e0133c65159.
* Ensure sequence is not used in a different column with different type
* Insert select tests
* Propagate rename sequence stmt
* Fix issue with group ID cache invalidation
* Add ALTER TABLE ALTER COLUMN TYPE .. precaution
* Fix attnum inconsistency and add various tests
* Add ALTER SEQUENCE precaution
* Remove Citus hook
* More tests
Co-authored-by: Marco Slot <marco.slot@gmail.com>
InvalidateForeignKeyGraph sends an invalidation via shared memory to all
backends, including the current one.
However, we might not call AcceptInvalidationMessages before reading
from the cache below. It would be better to also add a call to
AcceptInvalidationMessages in IsForeignConstraintRelationshipGraphValid.
Previously this was usually done after argument parsing. This can cause
SEGFAULTs if the number or type of arguments changes in a new version.
By checking that Citus version is correct before doing any argument
parsing we protect against these types of issues. Issues like this have
occurred in pg_auto_failover, so it's not just a theoretical issue.
The main reason why these calls were not at the top of functions is
really just historical. It was because in the past we didn't allow
statements before declarations. Thus having this check before the
argument parsing would have only been possible if we first declared all
variables.
In addition to moving existing CheckCitusVersion calls it also adds
these calls to rebalancer related functions (they were missing there).
To be able to report progress of the rebalancer, the rebalancer updates
the state of a shard move in a shared memory segment. To then fetch the
progress, `get_rebalance_progress` can be called which reads this shared
memory.
Without this change it did so without using any synchronization
primitives, allowing for data races. This fixes that by using atomic
operations to update and read from the parts of the shared memory that
can be changed after initialization.
DESCRIPTION: fix shared dependencies that are not resident in a database
eg. databases depend on users (their owners) that both don’t have a
database they reside in. These dependencies are recorded in pg_shdepend
with a `dbid` of `InvalidOid` When we fetch our shared dependencies we don’t take
these links in account.
With this patch we use logic inspired by `classIdGetDbId` to decide when to use `MyDatabaseId` vs `InvalidOid` to correctly resolve dependencies between shared objects.
Without this change the rebalancer progress monitor gets the shard sizes
from the `shardlength` column in `pg_dist_placement`. This column needs to
be updated manually by calling `citus_update_table_statistics`.
However, `citus_update_table_statistics` could lead to distributed
deadlocks while database traffic is on-going (see #4752).
To work around this we don't use `shardlength` column anymore. Instead
for every rebalance we now fetch all shard sizes on the fly.
Two additional things this does are:
1. It adds tests for the rebalance progress function.
2. If a shard move cannot be done because a source or target node is
unreachable, then we error in stop the rebalance, instead of showing
a warning and continuing. When using the by_disk_size rebalance
strategy it's not safe to continue with other moves if a specific
move failed. It's possible that the failed move made space for the
next move, and because the failed move never happened this space now
does not exist.
3. Adds two new columns to the result of `get_rebalancer_progress` which
shows the size of the shard on the source and target node.
Fixes#4930
DESCRIPTION: Add support for ALTER DATABASE OWNER
This adds support for changing the database owner. It achieves this by marking the database as a distributed object. By marking the database as a distributed object it will look for its dependencies and order the user creation commands (enterprise only) before the alter of the database owner. This is mostly important when adding new nodes.
By having the database marked as a distributed object it can easily understand for which `ALTER DATABASE ... OWNER TO ...` commands to propagate by resolving the object address of the database and verifying it is a distributed object, and hence should propagate changes of owner ship to all workers.
Given the ownership of the database might have implications on subsequent commands in transactions we force sequential mode for transactions that have a `ALTER DATABASE ... OWNER TO ...` command in them. This will fail the transaction with meaningful help when the transaction already executed parallel statements.
By default the feature is turned off since roles are not automatically propagated, having it turned on would cause hard to understand errors for the user. It can be turned on by the user via setting the `citus.enable_alter_database_owner`.
Comment from the code:
/*
* Iterate until all the tasks are finished. Once all the tasks
* are finished, ensure that that all the connection initializations
* are also finished. Otherwise, those connections are terminated
* abruptly before they are established (or failed). Instead, we let
* the ConnectionStateMachine() to properly handle them.
*
* Note that we could have the connections that are not established
* as a side effect of slow-start algorithm. At the time the algorithm
* decides to establish new connections, the execution might have tasks
* to finish. But, the execution might finish before the new connections
* are established.
*/
Note that the abruptly terminated connections lead to the following errors:
2020-11-16 21:09:09.800 CET [16633] LOG: could not accept SSL connection: Connection reset by peer
2020-11-16 21:09:09.872 CET [16657] LOG: could not accept SSL connection: Undefined error: 0
2020-11-16 21:09:09.894 CET [16667] LOG: could not accept SSL connection: Connection reset by peer
To easily reproduce the issue:
- Create a single node Citus
- Add the coordinator to the metadata
- Create a distributed table with shards on the coordinator
- f.sql: select count(*) from test;
- pgbench -f /tmp/f.sql postgres -T 12 -c 40 -P 1 or pgbench -f /tmp/f.sql postgres -T 12 -c 40 -P 1 -C
With this commit, the executor becomes smarter about refrain to open
new connections. The very basic example is that, if the connection
establishments take 1000ms and task executions as 5 msecs, the executor
becomes smart enough to not establish new connections.
It was possible to block maintenance daemon by taking an SHARE ROW
EXCLUSIVE lock on pg_dist_placement. Until the lock is released
maintenance daemon would be blocked.
We should not block the maintenance daemon under any case hence now we
try to get the pg_dist_placement lock without waiting, if we cannot get
it then we don't try to drop the old placements.
DESCRIPTION: introduce `citus.local_hostname` GUC for connections to the current node
Citus once in a while needs to connect to itself for some systems operations. This used to be hardcoded to `localhost`. The hardcoded hostname causes some issues, for example in environments where `sslmode=verify-full` is required. It is not always desirable or even feasible to get `localhost` as an alt name on the certificate.
By introducing a GUC to use when connecting to the current instance the user has more control what network path is used and what hostname is required to be present in the server certificate.
Every move in the rebalancer algorithm results in an improvement in the
balance. However, even if the improvement in the balance was very small
the move was still chosen. This is especially problematic if the shard
itself is very big and the move will take a long time.
This changes the rebalancer algorithm to take the relative size of the
balance improvement into account when choosing moves. By default a move
will not be chosen if it improves the balance by less than half of the
size of the shard. An extra argument is added to the rebalancer
functions so that the user can decide to lower the default threshold if
the ignored move is wanted anyway.
* Columnar: introduce columnar storage API.
This new API is responsible for the low-level storage details of
columnar; translating large reads and writes into individual block
reads and writes that respect the page headers and emit WAL. It's also
responsible for the columnar metapage, resource reservations (stripe
IDs, row numbers, and data), and truncation.
This new API is not used yet, but will be used in subsequent
forthcoming commits.
* Columnar: add columnar_storage_info() for debugging purposes.
* Columnar: expose ColumnarMetadataNewStorageId().
* Columnar: always initialize metapage at creation time.
This avoids the complexity of dealing with tables where the metapage
has not yet been initialized.
* Columnar: columnar storage upgrade/downgrade UDFs.
Necessary upgrade/downgrade step so that new code doesn't see an old
metapage.
* Columnar: improve metadata.c comment.
* Columnar: make ColumnarMetapage internal to the storage API.
Callers should not have or need direct access to the metapage.
* Columnar: perform resource reservation using storage API.
* Columnar: implement truncate using storage API.
* Columnar: implement read/write paths with storage API.
* Columnar: add storage tests.
* Revert "Columnar: don't include stripe reservation locks in lock graph."
This reverts commit c3dcd6b9f8.
No longer needed because the columnar storage API takes care of
concurrency for resource reservation.
* Columnar: remove unnecessary lock when reserving.
No longer necessary because the columnar storage API takes care of
concurrent resource reservation.
* Add simple upgrade tests for storage/ branch
* fix multi_extension.out
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
* When moving a shard to a new node ensure there is enough space
* Add WairForMiliseconds time utility
* Add more tests and increase readability
* Remove the retry loop and use a single udf for disk stats
* Address review
* address review
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
The comment of DropMarkedShards described the behaviour that after a
failure we would continue trying to drop other shards. However the code
did not do this and would stop after the first failure. Instead of
simply fixing the comment I fixed the code, because the described
behaviour is more useful. Now a single shard that cannot be removed yet
does not block others from being removed.
We decrease memory usage by:
- Freeing temporary buffers
- Using separate memory context for blocks that uses "small" amount of
memory but can be repeated many times such as loops
As long as the VALUES clause contains constant values, we should not
recursively plan the queries/CTEs.
This is a follow-up work of #1805. So, we can easily apply OUTER join
checks as if VALUES clause is a reference table/immutable function.
* Fix problews with concurrent calls of DropMarkedShards
When trying to enable `citus.defer_drop_after_shard_move` by default it
turned out that DropMarkedShards was not safe to call concurrently.
This could especially cause big problems when also moving shards at the
same time. During tests it was possible to trigger a state where a shard
that was moved would not be available on any of the nodes anymore after
the move.
Currently DropMarkedShards is only called in production by the
maintenaince deamon. Since this is only a single process triggering such
a race is currently impossible in production settings. In future changes
we will want to call DropMarkedShards from other places too though.
* Add some isolation tests
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
This commit adds support for long partition names for distributed tables:
- ALTER TABLE dist_table ATTACH PARTITION ..
- CREATE TABLE .. PARTITION OF dist_table ..
Note: create_distributed_table UDF does not support long table and
partition names, and is not covered in this commit
* Introduce 3 partitioned size udfs
* Add tests for new partition size udfs
* Fix type incompatibilities
* Convert UDFs into pure sql functions
* Fix function comment
ConnParams(AuthInfo and PoolInfo) gets a snapshot, which will block the
remote connectinos to localhost. And the release of snapshot will be
blocked by the snapshot. This leads to a deadlock.
We warm up the conn params hash before starting a new transaction so
that the entries will already be there when we start a new transaction.
Hence GetConnParams will not get a snapshot.
With https://github.com/citusdata/citus/pull/4806 we enabled
2PC for any non-read-only local task. However, if the execution
is a single task, enabling 2PC (CoordinatedTransactionShouldUse2PC)
hits an assertion as we are not in a coordinated transaction.
There is no downside of using a coordinated transaction for single
task local queries.
Because setting the flag doesn't necessarily mean that we'll
use 2PC. If connections are read-only, we will not use 2PC.
In other words, we'll use 2PC only for connections that modified
any placements.
Before this commit, Citus used 2PC no matter what kind of
local query execution happens.
For example, if the coordinator has shards (and the workers as well),
even a simple SELECT query could start 2PC:
```SQL
WITH cte_1 AS (SELECT * FROM test LIMIT 10) SELECT count(*) FROM cte_1;
```
In this query, the local execution of the shards (and also intermediate
result reads) triggers the 2PC.
To prevent that, Citus now distinguishes local reads and local writes.
And, Citus switches to 2PC only if a modification happens. This may
still lead to unnecessary 2PCs when there is a local modification
and remote SELECTs only. Though, we handle that separately
via #4587.
Postgres keeps AFTER trigger state for each transaction, because we can have deferred AFTER triggers which will be fired at the end of a transaction. Postgres cleans up this state at the end of transaction.
Postgres processes ON COMMIT triggers after cleaning-up the AFTER trigger states. So if we fire any triggers in ON COMMIT, the AFTER trigger state won't be cleaned-up properly and the transaction state will be left in an inconsistent state, which might result in assertion failure.
So with this commit, we remove foreign keys between columnar metadata tables and enforce constraints between them manually when dropping columnar tables.
* Skip 2PC for readonly connections in a transaction
* Use ConnectionModifiedPlacement() function
* Remove the second check of ConnectionModifiedPlacement()
* Add order by to prevent flaky output
* Test using pg_dist_transaction
With this commit, we make sure to prevent infinite recursion for queries
in the format: [subquery with a UNION ALL] JOIN [table or subquery]
Also, fixes a bug where we pushdown UNION ALL below a JOIN even if the
UNION ALL is not safe to pushdown.
* Reimplement citus_update_table_statistics
* Update stats for the given table not colocation group
* Add tests for reimplemented citus_update_table_statistics
* Use coordinated transaction, merge with citus_shard_sizes functions
* Update the old master_update_table_statistics as well
* Use translated vars in postgres 13 as well
Postgres 13 removed translated vars with pg 13 so we had a special logic
for pg 13. However it had some bug, so now we copy the translated vars
before postgres deletes it. This also simplifies the logic.
* fix rtoffset with pg >= 13
/*
* The physical planner assumes that all worker queries would have
* target list entries based on the fact that at least the column
* on the JOINs have to be on the target list. However, there is
* an exception to that if there is a cartesian product join and
* there is no additional target list entries belong to one side
* of the JOIN. Once we support cartesian product join, we should
* remove this error.
*/
When we use PROCESS_UTILITY_TOPLEVEL it causes some problems when
combined with other extensions such as pg_audit. With this commit we use
PROCESS_UTILITY_QUERY in the codebase to fix those problems.
When executing alter_table / undistribute_table udf's, we should not try
to change sequence dependencies on MX workers if new table wouldn't
require syncing metadata.
Previously, we were checking that for input table. But in some cases, the
fact that input table requires syncing metadata doesn't imply the same
for resulting table (e.g when undistributing a Citus table).
Even more, doing that was giving an unexpected error when undistributing
a Citus table so this commit actually fixes that.
It seems that we need to consider only pseudo constants while doing some
shortcuts in planning. For example there could be a false clause but it
can contribute to the result in which case it will not be a pseudo
constant.
We would exclude tables without relationRestriction from conversion
candidates in local-distributed table joins. This could leave a leftover
local table which should have been converted to a subquery.
Ideally I would expect that in each call to CreateDistributedPlan we
would pass a new plan id, but that seems like a bigger change.
/*
* Colocated intermediate results are just files and not required to use
* the same connections with their co-located shards. So, we are free to
* use any connection we can get.
*
* Also, the current connection re-use logic does not know how to handle
* intermediate results as the intermediate results always truncates the
* existing files. That's why, we use one connection per intermediate
* result.
*/
We do not include dummy column if original task didn't return any
columns.
Otherwise, number of columns that original task returned wouldn't
match number of columns returned by worker_save_query_explain_analyze.
When COPY is used for copying into co-located files, it was
not allowed to use local execution. The primary reason was
Citus treating co-located intermediate results as co-located
shards, and COPY into the distributed table was done via
"format result". And, local execution of such COPY commands
was not implemented.
With this change, we implement support for local execution with
"format result". To do that, we use the buffer for every file
on shardState->copyOutState, similar to how local copy on
shards are implemented. In fact, the logic is similar to
local copy on shards, but instead of writing to the shards,
Citus writes the results to a file.
The logic relies on LOCAL_COPY_FLUSH_THRESHOLD, and flushes
only when the size exceeds the threshold. But, unlike local
copy on shards, in this case we write the headers and footers
just once.
* Sort results in citus_shards and give raw size
Sort results so that it is consistent and also similar to citus_tables.
Use raw size in the output so that doing operations on the size is
easier.
* Change column ordering
With #4338, the executor is smart enough to failover to
local node if there is not enough space in max_connections
for remote connections.
For COPY, the logic is different. With #4034, we made COPY
work with the adaptive connection management slightly
differently. The cause of the difference is that COPY doesn't
know which placements are going to be accessed hence requires
to get connections up-front.
Similarly, COPY decides to use local execution up-front.
With this commit, we change the logic for COPY on local nodes:
Try to reserve a connection to local host. This logic follows
the same logic (e.g., citus.local_shared_pool_size) as the
executor because COPY also relies on TryToIncrementSharedConnectionCounter().
If reservation to local node fails, switch to local execution
Apart from this, if local execution is disabled, we follow the
exact same logic for multi-node Citus. It means that if we are
out of the connection, we'd give an error.
It seems that we were not considering the case where coordinator was
added to the cluster as a worker in the optimization of intermediate
results.
This could lead to errors when coordinator was added as a worker.
pg_get_tableschemadef_string doesn't know how to deparse identity
columns so we cannot reflect those columns when creating table
from scratch. For this reason, we don't allow using alter_table udfs
with tables having any identity cols.
pg_get_tableschemadef_string doesn't know how to deparse identity
columns so we cannot reflect those columns when creating shell
relation.
For this reason, we don't allow adding local tables -having identity cols-
to metadata.
Postgres doesn't allow inserting into columns having GENERATED ALWAYS
AS (...) STORED expressions.
For this reason, when executing undistribute_table or an alter_* udf,
we should skip copying such columns.
This is not bad since Postgres would already generate such columns.
When finding columns owning sequences, we shouldn't rely on atthasdef
since it might be true when column has GENERATED ALWAYS AS (...)
STORED expression.
Since create_citus_local_table doesn't specify cascadeViaForeignKeys
option, we can't directly call citus_add_local_table_to_metadata
from create_citus_local_table.
Instead, implement an internal method and call it from deprecated udf
too.
* Fix partition column index issue
We send column names to worker_hash/range_partition_table methods, and
in these methods we check the column name index from tuple descriptor.
Then this index is used to decide the bucket that the current row will
be sent for the repartition.
This becomes a problem when there are the same column names in the
tupleDescriptor. Then we can choose the wrong index. Hence the
partitioned data will be put to wrong workers. Then the result could
miss some data because workers might contain different range of data.
An example:
TupleDescriptor contains "trip_id", "car_id", "car_id" for one table.
It contains only "car_id" for the other table. And assuming that the
tables will be partitioned by car_id, it is not certain what should be
used for deciding the bucket number for the first table. Assuming value
2 goes to bucket 2 and value 3 goes to bucket 3, it is not certain which
bucket "1 2 3" (trip_id, car_id, car_id) row will go to.
As a solution we send the index of partition column in targetList
instead of the column name.
The old API is kept so that if workers upgrade work, it still works
(though it will have the same bug)
* Use the same method so that backporting is easier
* Make undistribute_table() and citus_create_local_table() work with columnar
* Rename and use LocallyExecuteUtilityTask for UDF check
* Remove 'local' references in ExecuteUtilityCommand
As described in the comment, we have observed crashes in production
due to a segfault caused by the dereference of a NULL pointer in our
connection statemachine.
As a mitigation, preventing system crashes, we provide an error with
a small explanation of the issue. Unfortunately the case is not
reliably reproduced yet, hence the inability to add tests.
DESCRIPTION: Prevent segfaults when SAVEPOINT handling cannot recover from connection failures
Currently we choose an arbitrary colocation id from all the matches for
a colocation id. This could mean that 2 distributed tables, which have
the same scheme could go into different colocation groups. This fix
makes sure that the same match will go to the same colocation group.
/*
* Creating Citus local tables relies on functions that accesses
* shards locally (e.g., ExecuteAndLogDDLCommand()). As long as
* we don't teach those functions to access shards remotely, we
* cannot relax this check.
*/
The reason behind skipping postgres tables is that we support
foreign keys between postgres tables and reference tables
(without converting postgres tables to citus local tables)
when enable_local_reference_table_foreign_keys is false or
when coordinator is not added to metadata.
For certaion purposes, we drop and recreate the foreign
keys. As we acquire exclusive locks on the tables in between
drop and re-create, we can safely skip validation phase of
the foreign keys. The reason is purely being performance as
foreign key validation could take a long value.
When enabled any foreign keys between local tables and reference
tables supported by converting the local table to a citus local
table.
When the coordinator is not in the metadata, the logic is disabled
as foreign keys are not allowed in this configuration.
Because master_add_node(or others) might acquire ExclusiveLock
and their initiated sessions may call CoordinatorAddedAsWorkerNode().
With this we prevent potential deadlocks.
If relation is not involved in any foreign key relationships,
foreign key graph would not return any relations for given
relationId as expected.
But even if it's the case, we should still undistribute the table
itself.
With citus shard helper view, we can easily see:
- where each shard is, which node, which port
- what kind of table it belongs to
- its size
With such a view, we can see shards that have a size bigger than some
value, which could be useful. Also debugging can be easier in production
as well with this view.
Fetch shards in one go per node
The previous implementation was slow because it would do a lot of round
trips, one per shard to be exact. Hence it is improved so that we fetch
all the shard_name, shard-size pairs per node in one go.
Construct shards_names, sizes query on coordinator
* Replace master_add_node with citus_add_node
* Replace master_activate_node with citus_activate_node
* Replace master_add_inactive_node with citus_add_inactive_node
* Use master udfs in old scripts
* Replace master_add_secondary_node with citus_add_secondary_node
* Replace master_disable_node with citus_disable_node
* Replace master_drain_node with citus_drain_node
* Replace master_remove_node with citus_remove_node
* Replace master_set_node_property with citus_set_node_property
* Replace master_unmark_object_distributed with citus_unmark_object_distributed
* Replace master_update_node with citus_update_node
* Replace master_update_shard_statistics with citus_update_shard_statistics
* Replace master_update_table_statistics with citus_update_table_statistics
* Rename master_conninfo_cache_invalidate to citus_conninfo_cache_invalidate
Rename master_dist_local_group_cache_invalidate to citus_dist_local_group_cache_invalidate
* Replace master_copy_shard_placement with citus_copy_shard_placement
* Replace master_move_shard_placement with citus_move_shard_placement
* Rename master_dist_node_cache_invalidate to citus_dist_node_cache_invalidate
* Rename master_dist_object_cache_invalidate to citus_dist_object_cache_invalidate
* Rename master_dist_partition_cache_invalidate to citus_dist_partition_cache_invalidate
* Rename master_dist_placement_cache_invalidate to citus_dist_placement_cache_invalidate
* Rename master_dist_shard_cache_invalidate to citus_dist_shard_cache_invalidate
* Drop master_modify_multiple_shards
* Rename master_drop_all_shards to citus_drop_all_shards
* Drop master_create_distributed_table
* Drop master_create_worker_shards
* Revert old function definitions
* Add missing revoke statement for citus_disable_node
We used to need WarnAboutLeakedPreparedTransaction()
as we didn't have auto 2PC recovery. But, we long have
2PC recovery by https://github.com/citusdata/citus/pull/1574
So, we don't need anymore.
* Rethrow original concurrent index creation failure message
* Alter test outputs for concurrent index creation
* Detect duplicate table failure in concurrent index creation
* Add test for conc. index creation w/out duplicates
* Prevent deadlock for long named partitioned index creation on single node
* Create IsSingleNodeCluster function
* Use both local and sequential execution
On top of our foreign key graph, implement the infrastructure to get
list of relations that are connected to input relation via a foreign key
graph.
We need this to support cascading create_citus_local_table &
undistribute_table operations.
Also add regression tests to see what our foreign key graph is able to
capture currently.
With this commit, we remove visited flags from ForeignConstraintRelationshipNode
struct since keeping local state in global object is both dangerous and
meaningless.
Also to improve readability, this commit also converts needless recursion to
iterative DFS to avoid passing local hash-map as another parameter to
GetConnectedListHelper function.
Attribute number in a subquery RTE and relation RTE means different
things. In a relation attribute number will point to the column number
in the table definition including the dropped columns as well however in
subquery, it means the index in the target list. When we convert a
relation RTE to subquery RTE we should either correct all the relevant
attribute numbers or we can just add a dummy column for the dropped
columns. We choose the latter in this commit because it is practically
too vulnerable to update all the vars in a query.
Another thing this commit fixes is that in case a join restriction
clause list contains a false clause, we should just returns a false
clause instead of the whole list, because the whole list will contain
restrictions from other RTEs as well and this breaks the query, which
can be seen from the output changes, now it is much simpler.
Also instead of adding single tests for dropped columns, we choose to
run the whole mixed queries with tables with dropped columns, this
revealed some bugs already, which are fixed in this commit.
It seems that there are only very few cases where that is useful, and
for now we prefer not having that check. This means that we might
perform some unnecessary checks, but that should be rare and not
performance critical.
Instead of sending NULL's over a network, we now convert the subqueries
in the form of:
SELECT t.a, NULL, NULL FROM (SELECT a FROM table)t;
And we recursively plan the inner part so that we don't send the NULL's
over network. We still need the NULLs in the outer subquery because we
currently don't have an easy way of updating all the necessary places in
the query.
Add some documentation for how the conversion is done
Baseinfo also has pushed down filters etc, so it makes more sense to use
BaseRestrictInfo to determine what columns have constant equality
filters.
Also RteIdentity is used for removing conversion candidates instead of
rteIndex.
It seems that most of the updates were broken, we weren't aware of it
because there wasn't any data in the tables. They are broken mostly
because local tables do not have a shard id and some code paths should
be updated with that information, currently when there is an invalid
shard id, it is assumed to be pruned.
Consider local tables in router planner
In case there is a local table, the shard id will not be valid and there
are some checks that rely on shard id, we should skip these in case of
local tables, which is handled with a dummy placement.
Add citus local table dist table join tests
add local-dist table mixed joins tests
AllDataLocallyAccessible and ContainsLocalTableSubqueryJoin are removed.
We can possibly remove ModifiesLocalTableWithRemoteCitusLocalTable as
well. Though this removal has a side effect that now when all the data
is locally available, we could still wrap a relation into a subquery, I
guess that should be resolved in the router planner itself.
Add more tests
When we wrap an RTE to subquery we are updating the variables varno's as
1, however we should also update the varno's of vars in quals.
Also some other small code quality improvements are done.
The previous algorithm was not consistent and it could convert different
RTEs based on the table orders in the query. Now we convert local tables
if there is a distributed table which doesn't have a unique index. So if
there are 4 tables, local1, local2, dist1, dist2_with_pkey then we will
convert local1 and local2 in `auto` mode. Converting a distributed table
is not that logical because as there is a distributed table without a
unique index, we will need to convert the local tables anyway. So
converting the distributed table with pkey is redundant.
Remove FillLocalAndDistributedRTECandidates and use
ShouldConvertLocalTableJoinsToSubqueries, which simplifies things as we
rely on a single function to decide whether we should continue
converting RTE to subquery.
We should not recursively plan an already routable plannable query. An
example of this is (SELECT * FROM local JOIN (SELECT * FROM dist) d1
USING(a));
So we let the recursive planner do all of its work and at the end we
convert the final query to to handle unsupported joins. While doing each
conversion, we check if it is router plannable, if so we stop.
Only consider range table entries that are in jointree
If a range table is not in jointree then there is no point in
considering that because we are trying to convert range table entries to
subqueries for join use case.
Check equality in quals
We want to recursively plan distributed tables only if they have an
equality filter on a unique column. So '>' and '<' operators will not
trigger recursive planning of distributed tables in local-distributed
table joins.
Recursively plan distributed table only if the filter is constant
If the filter is not a constant then the join might return multiple rows
and there is a chance that the distributed table will return huge data.
Hence if the filter is not constant we choose to recursively plan the
local table.
When doing local-distributed table joins we convert one of them to
subquery. The current policy is that we convert distributed tables to
subquery if it has a unique index on a column that has unique
index(primary key also has a unique index).
The logical planner cannot handle joins between local and distributed table.
Instead, we can recursively plan one side of the join and let the logical
planner handle the rest.
Our algorithm is a little smart, trying not to recursively plan distributed
tables, but favors local tables.
A utility function is added so that each caller can implement a handler
for each index on a given table. This means that the caller doesn't need
to worry about how to access each index, the only thing that it needs to
do each to implement a function to which each index on the table is
passed iteratively.
When Citus needs to parallelize queries on the local node (e.g., the node
executing the distributed query and the shards are the same), we need to
be mindful about the connection management. The reason is that the client
backends that are running distributed queries are competing with the client
backends that Citus initiates to parallelize the queries in order to get
a slot on the max_connections.
In that regard, we implemented a "failover" mechanism where if the distributed
queries cannot get a connection, the execution failovers the tasks to the local
execution.
The failover logic is follows:
- As the connection manager if it is OK to get a connection
- If yes, we are good.
- If no, we fail the workerPool and the failure triggers
the failover of the tasks to local execution queue
The decision of getting a connection is follows:
/*
* For local nodes, solely relying on citus.max_shared_pool_size or
* max_connections might not be sufficient. The former gives us
* a preview of the future (e.g., we let the new connections to establish,
* but they are not established yet). The latter gives us the close to
* precise view of the past (e.g., the active number of client backends).
*
* Overall, we want to limit both of the metrics. The former limit typically
* kics in under regular loads, where the load of the database increases in
* a reasonable pace. The latter limit typically kicks in when the database
* is issued lots of concurrent sessions at the same time, such as benchmarks.
*/
When distributing a columnar table, as well as changing options on a distributed columnar table, this patch will forward the settings from the coordinator to the workers.
For propagating options changes on an already distributed table this change is pretty straight forward. Before applying the change in options locally we will create a `DDLJob` that contains a call to `alter_columnar_table_set(...)` for every shard placement with all settings of the current table. This goes both for setting an option as well as resetting. This will reset the values to the defaults configured on the coordinator. Having the effect that the coordinator is authoritative on the settings and makes sure the shards have the same settings set as the table on the coordinator.
When a columnar table is distributed it is using the `TableDDLCommand` infra structure to create a new kind of `TableDDLCommand`. This new type, called a `TableDDLCommandFunction` contains a context and 2 function pointers to execute. One function returns the command as applied on the table, the second function will return the sql command to apply to a shard with a given shard id. The schema name is ignored as it will use the fully qualified name of the shard in the same schema as the base table.
Multi-row execution already uses sequential execution. When shards
are local, using local execution is profitable as it avoids
an extra connection establishment to the local node.
If MemoryContextAlloc errors out -e.g. during an OOM-, ConnectionHashEntry->connections
stays as NULL.
With this commit, we add isValid flag to ConnectionHashEntry that should be set to true
right after we allocate & initialize ConnectionHashEntry->connections list properly, and we
check it before accesing to ConnectionHashEntry->connections.
The name of the function is different than the implemantation. Because
the function is designed to only consider SELECT queries. Also this
changes the assert with an error.
Refactor internals on how Citus creates the SQL commands it sends to recreate shards.
Before Citus collected solely ddl commands as `char *`'s to recreate a table. If they were used to create a shard they were wrapped with `worker_apply_shard_ddl_command` and send to the workers. On the workers the UDF wrapping the ddl command would rewrite the parsetree to replace tables names with their shard name equivalent.
This worked well, but poses an issue when adding columnar. Due to limitations in Postgres on creating custom options on table access methods we need to fall back on a UDF to set columnar specific options. Now, to recreate the table, we can not longer rely on having solely DDL statements to recreate a table.
A prototype was made to run this UDF wrapped in `worker_apply_shard_ddl_command`. This became pretty messy, hard to understand and subsequently hard to maintain.
This PR proposes a refactor of the internal representation of table ddl commands into a `TableDDLCommand` structure. The current implementation only supports a `char *` as its contents. Based on the use of the DDL statement (eg. creating the table -mx- or creating a shard) one of two different functions can be called to get the statement to send to the worker:
- `GetTableDDLCommand(TableDDLCommand *command)`: This function returns that ddl command to create the table. In this implementation it will just return the `char *`. This has the same functionality as getting the old list and not wrapping it.
- `GetShardedTableDDLCommand(TableDDLCommand *command, uint64 shardId, char *schemaName)`: This function returns the ddl command wrapped in `worker_apply_shard_ddl_command` with the `shardId` as an argument. Due to backwards compatibility it also accepts a. `schemaName`. The exact purpose is not directly clear. Ideally new implementations would work with fully qualified statements and ignore the `schemaName`.
A future implementation could accept 2.function pointers and a `void *` for context to let the two pointers work on. This gives greater flexibility in controlling what commands get send in which situations. Also, in a future, we could implement the intermediate step of creating the `parsetree` datastructure of statements based on the contents in the catalog with a corresponding deparser. For sharded queries a mutator could be ran over the parsetree to rewrite the tablenames to the names with the shard identifier. This will completely omit the requirement for `worker_apply_shard_ddl_command`.
Considering the adaptive connection management
improvements that we plan to roll soon, it makes it
very helpful to know the number of active client
backends.
We are doing this addition to simplify yhe adaptive connection
management for single node Citus. In single node Citus, both the
client backends and Citus parallel queries would compete to get
slots on Postgres' `max_connections` on the same Citus database.
With adaptive connection management, we have the counters for
Citus parallel queries. That helps us to adaptively decide
on the remote executions pool size (e.g., throttle connections
if necessary).
However, we do not have any counters for the total number of
client backends on the database. For single node Citus, we
should consider all the client backends, not only the remote
connections that Citus does.
Of course Postgres internally knows how many client
backends are active. However, to get that number Postgres
iterates over all the backends. For examaple, see [pg_stat_get_db_numbackends](8e90ec5580/src/backend/utils/adt/pgstatfuncs.c (L1240))
where Postgres iterates over all the backends.
For our purpuses, we need this information on every connection
establishment. That's why we cannot affort to do this kind of
iterattion.
CitusTableTypeIdList() function iterates on all the entries of pg_dist_partition
and loads all the metadata in to the cache. This can be quite memory intensive
especially when there are lots of distributed tables.
When partitioned tables are used, it is common to have many distributed tables
given that each partition also becomes a distributed table.
CitusTableTypeIdList() is used on every CREATE TABLE .. PARTITION OF.. command
as well. It means that, anytime a partition is created, Citus loads all the
metadata to the cache. Note that Citus typically only loads the accessed table's
metadata to the cache.
* Move local execution after the remote execution
Before this commit, when both local and remote tasks
exist, the executor was starting the execution with
local execution. There is no strict requirements on
this.
Especially considering the adaptive connection management
improvements that we plan to roll soon, moving the local
execution after to the remote execution makes more sense.
The adaptive connection management for single node Citus
would look roughly as follows:
- Try to connect back to the coordinator for running
parallel queries.
- If succeeds, go on and execute tasks in parallel
- If fails, fallback to the local execution
So, we'll use local execution as a fallback mechanism. And,
moving it after to the remote execution allows us to implement
such further scenarios.
The adaptive executor emulates the TCP's slow start algorithm.
Whenever the executor needs new connections, it doubles the number
of connections established in the previous iteration.
This approach is powerful. When the remote queries are very short
(like index lookup with < 1ms), even a single connection is sufficent
most of the time. When the remote queries are long, the executor
can quickly establish necessary number of connections.
One missing piece on our implementation seems that the executor
keeps doubling the number of connections even if the previous
connection attempts have been finalized. Instead, we should
wait until all the attempts are finalized. This is how TCP's
slow-start works. Plus, it decreases the unnecessary pressure
on the remote nodes.
Before this commit, we let AdaptiveExecutorPreExecutorRun()
to be effective multiple times on every FETCH on cursors.
That does not affect the correctness of the query results,
but adds significant overhead.
It seems that we forgot to pass the revelant
flag to enable Postgres' parallel query
capabilities on the shards when user does
EXPLAIN ANALYZE on a distributed table.
If one wishes to iterate through a List and insert list elements in
PG13, it is not safe to use for_each_ptr as the List representation
in PostgreSQL no longer linked lists, but arrays, and it is possible
that the whole array is repalloc'ed if ther is not sufficient space
available.
See postgres commit 1cff1b95ab6ddae32faa3efe0d95a820dbfdc164 for more
information
When a relation is used on an OUTER JOIN with FALSE filters,
set_rel_pathlist_hook may not be called for the table.
There might be other cases as well, so do not rely on the hook
for classification of the tables.
RemoveDuplicateJoinRestrictions() function was introduced with the aim of decrasing the overall planning times by eliminating the duplicate JOIN restriction entries (#1989). However, it turns out that the function itself is so CPU intensive with a very high algorithmic complexity, it hurts a lot more than it helps. The function is a clear example of premature optimization.
The table below shows the difference clearly:
"distributed query planning
time master" RemoveDuplicateJoinRestrictions() execution time on master "Remove the function RemoveDuplicateJoinRestrictions()
this PR"
5 table INNER JOIN 9 msec 2msec 7 msec
10 table INNER JOIN 227 msec 194 msec 29 msec
20 table INNER JOIN 1 sec 235 msec 1 sec 139 msec 90 msecs
50 table INNER JOIN 24 seconds 21 seconds 1.5 seconds
100 table INNER JOIN 2 minutes 16 secods 1 minute 53 seconds 23 seconds
250 table INNER JOIN Bottleneck on JoinClauseList 18 minutes 52 seconds Bottleneck on JoinClauseList
5 table INNER JOIN in subquery 9 msec 0 msec 6 msec
10 table INNER JOIN subquery 33 msec 10 msec 32 msec
20 table INNER JOIN subquery 132 msec 67 msec 123 msec
50 table INNER JOIN subquery 1.2 seconds 900 msec 500 msec
100 table INNER JOIN subquery 6 seconds 5 seconds 2 seconds
250 table INNER JOIN subquery 54 seconds 37 seconds 20 seconds
5 table LEFT JOIN 5 msec 0 msec 5 msec
10 table LEFT JOIN 11 msec 0 msec 13 msec
20 table LEFT JOIN 26 msec 2 msec 30 msec
50 table LEFT JOIN 150 msec 15 msec 193 msec
100 table LEFT JOIN 757 msec 71 msec 722 msec
250 table LEFT JOIN 8 seconds 600 msec 8 seconds
5 JOINs among 2 table JOINs 37 msec 11 msec 25 msec
10 JOINs among 2 table JOINs 536 msec 306 msec 352 msec
20 JOINs among 2 table JOINs 794 msec 181 msec 640 msec
50 JOINs among 2 table JOINs 25 seconds 2 seconds 22 seconds
100 JOINs among 2 table JOINs Bottleneck on JoinClauseList 9 seconds Bottleneck on JoinClauseList
150 JOINs among 2 table JOINs Bottleneck on JoinClauseList 46 seconds Bottleneck on JoinClauseList
On top of the performance penalty, the function had a critical bug #4255, and with #4254 we hit one more important bug. It should be fixed by adding the followig check to the ContextCoversJoinRestriction():
```
static bool
JoinRelIdsSame(JoinRestriction *leftRestriction, JoinRestriction *rightRestriction)
{
Relids leftInnerRelIds = leftRestriction->innerrel->relids;
Relids rightInnerRelIds = rightRestriction->innerrel->relids;
if (!bms_equal(leftInnerRelIds, rightInnerRelIds))
{
return false;
}
Relids leftOuterRelIds = leftRestriction->outerrel->relids;
Relids rightOuterRelIds = rightRestriction->outerrel->relids;
if (!bms_equal(leftOuterRelIds, rightOuterRelIds))
{
return false;
}
return true;
}
```
However, adding this eliminates all the benefits tha RemoveDuplicateJoinRestrictions() brings.
I've used the commands here to generate the JOINs mentioned in the PR: https://gist.github.com/onderkalaci/fe8654f9df5916c7af4c7c5eb892561e#file-gistfile1-txt
Inner and outer JOINs behave roughly the same, to simplify the table only added INNER joins.
* Fix incorrect join related fields
Ruleutils expect to give the original index of join columns hence we
should consider the dropped columns while setting the fields in
SetJoinRelatedFieldsCompat.
* add some more tests for joins
* Move tests to join.sql and create a utility function
Disallow `ON TRUE` outer joins with reference & distributed tables
when reference table is outer relation by fixing the logic bug made
when calling `LeftListIsSubset` function.
Also, be more defensive when removing duplicate join restrictions
when join clause is empty for non-inner joins as they might still
contain useful information for non-inner joins.
It seems like Postgres could call set_rel_pathlist() for
the same relation multiple times. This breaks the logic
where we assume relationCount eqauls to the number of
entries in relationRestrictionList.
In summary, relationRestrictionList may contain duplicate
entries.
With this commit, we make sure that local execution adds the
intermediate result size as the distributed execution adds. Plus,
it enforces the citus.max_intermediate_result_size value.
We should not access CurrentLocalExecutionStatus directly because that
would mean that we could also set it directly, which we shouldn't
because we have checks to see if the new state is possible, otherwise we
error.
Before this commit, the logic was:
- As long as the outer side of the JOIN is not a JOIN (e.g., relation
or subquery etc.), we check for the existence of any recurring
tuples. There were two implications of this decision.
First, even if a subquery which is on the outer side contains
distributed table JOIN reference table, Citus would unnecessarily throw
an error. Note that, the JOIN inside the subquery would already
be going to be tested recursively. But, as long as that check
passes, there is no reason for the upper JOIN to fail. An example, which
used to fail and now works:
SELECT * FROM (SELECT * FROM dist JOIN ref) as foo LEFT JOIN dist;
Second, certain JOINs, especially with ON (true) conditions were not
represented as Citus expects the JOINs to be in the format
DeferredErrorIfUnsupportedRecurringTuplesJoin().
Use short lived per-tuple context in citus_evaluate_expr like
(pg) evaluate_expr does.
We should not use planState->ExprContext when evaluating expressions
as it might lead to freeing the same executor twice (first one happens
in citus_evaluate_expr itself and the other one happens when postgres
doing clean-up for the top level executor state), which in turn might
cause seg.faults.
However, now as we don't have necessary planState info to evaluate
prepared statements, we also add planState->es_param_list_info to
per-tuple ExprContext.
With postgres 13, there is a global lock that prevents multiple VACUUMs
happening in the current database. This global lock is taken for a short
time but this creates a problem because of the following:
- We execute the VACUUM for the shell table through the standard process
utility. In this step the global lock is taken for the current database.
- If the current node has shard placements then it tries to execute
VACUUM over a connection to localhost with ExecuteUtilityTaskList.
- the VACUUM on shard placements cannot proceed because it is waiting
for the global lock for the current database to be released.
- The acquired lock from the VACUUM for shell table will not be released
until the transaction is committed.
- So there is a deadlock.
As a solution, we commit the current transaction in case of VACUUM after
the VACUUM is executed for the shell table. Executing the VACUUM on a
shell table is not important because the data there will probably be
truncated. PostprocessVacuumStmt takes the necessary locks on the shell
table so we don't need to take any extra locks after we commit the
current transaction.
The prepare for upgrade script creates the `'public.pg_dist_rebalance_strategy` table which is not dropped when the upgrade is finished. This may block future upgrades.
Multi-row & router INSERT's were crashing with local execution if at
least one of the DEFAULT columns were not specified in VALUES list.
This was because, the changes we make on query->values_lists and
query->targetList was sufficient for deparsing given INSERT for remote
execution but not sufficient for local execution.
With this commit, DEFAULT value normalization for multi-row & router
INSERT's is fixed by adding dummy column references for unspecified
DEFAULT columns.
Citus has the logic to truncate the long shard names to prevent
various issues, including self-deadlocks. However, for partitioned
tables, when index is created on the parent table, the index names
on the partitions are auto-generated by Postgres. We use the same
Postgres function to generate the index names on the shards of the
partitions. If the length exceeds the limit, we switch to sequential
execution mode.
In postmasters execution of _PG_init, IsUnderPostmaster will be false and
we want to do the cleanup at that time only, otherwise there is a chance that
there will be parallel queries and we might do a cleanup for things that are
already in use.
After the connection timeout, we fail the session/pool. However, the
underlying connection can still be trying to connect. That is dangerous
because the new placement executions have already been in place. The
executor cannot handle the situation where multiple of
EXECUTION_ORDER_ANY task executions succeeds.
Adding a regression test doesn't seem easily doable. To reproduce the issue
- Add 2 worker nodes
- create a reference table
- set citus.node_connection_timeout to 1ms (requires code change)
- Continiously execute `SELECT count(*) FROM ref_table`
- Sometime later, you hit an out-of-array access in
`ScheduleNextPlacementExecution()` hence crashing.
- The reason for that is sometimes the first connection
successfully established while the executor is already
trying to execute the query on the second node.
We currently do not support volatile functions in update/delete statements
because the function evaluation logic does not know how to distinguish
volatile functions (that need to be evaluated per row) from stable functions
(that need to be evaluated per query), and it is also not safe to push the
volatile functions down on replicated tables.
Add sort method parameter for regression tests
Fix check-style
Change sorting method parameters to enum
Polish
Add task fields to OutTask
Add test into multi_explain
Fix isolation test
As the previous versions of Citus don't know how to handle citus local
tables, we should prevent downgrading from 9.5 to older versions if any
citus local tables exists.
Pushing down the CALLs to the node that the CALL is executed is
dangerous and could lead to infinite recursion.
When the coordinator added as worker, Citus was by chance preventing
this. The coordinator was marked as "not metadatasynced" node
in pg_dist_node, which prevented CALL/function delegation to happen.
With this commit, we do the following:
- Fix metadatasynced column for the coordinator on pg_dist_node
- Prevent pushdown of function/procedure to the same node that
the function/procedure is being executed. Today, we do not sync
pg_dist_object (e.g., distributed functions metadata) to the
worker nodes. But, even if we do it now, the function call delegation
would prevent the infinite recursion.
* Not take ShareUpdateExlusiveLock on pg_dist_transaction
We were taking ShareUpdateExlusiveLock on pg_dist_transaction during
recovery to prevent multiple recoveries happening concurrenly. VACUUM(
not FULL) also takes ShareUpdateExclusiveLock, and they can conflict. It
seems that VACUUM will skip the table if there is a conflicting lock
already taken unless it is doing the vacuum to prevent id wraparound, in
which case there can be a deadlock. I guess the deadlock happens if:
- VACUUM takes a lock on pg_dist_transaction and is done for id
wraparound problem
- The transaction in the maintenance tries to take a lock but
cannot as that conflicts with the lock acquired by VACUUM
- The transaction in the maintenance daemon has a very old xid hence
VACUUM cannot proceed.
If we take a row exclusive lock in transaction recovery then it wouldn't
conflict with VACUUM hence it could proceed so the deadlock would be
resolved. To prevent concurrent transaction recoveries happening, an
advisory lock is taken with ShareUpdateExlusiveLock as before.
* Use CITUS_OPERATIONS tag
* Not allow removing a single node with ref tables
We should not allow removing a node if it is the only node in the
cluster and there is a data on it. We have this check for distributed
tables but we didn't have it for reference tables.
* Update src/test/regress/expected/single_node.out
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
* Update src/test/regress/sql/single_node.sql
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
This commit brings following features:
Foreign key support from citus local tables to reference tables
* Foreign key support from reference tables to citus local tables
(only with RESTRICT & NO ACTION behavior)
* ALTER TABLE ENABLE/DISABLE trigger command support
* CREATE/DROP/ALTER trigger command support
and disallows:
* ALTER TABLE ATTACH/DETACH PARTITION commands
* CREATE TABLE <postgres table> ATTACH PARTITION <citus local table>
commands
* Foreign keys from postgres tables to citus local tables
(the other way was already disallowed)
for citus local tables.
Introduce table entry utility functions
Citus table cache entry utilities are introduced so that we can easily
extend existing functionality with minimum changes, specifically changes
to these functions. For example IsNonDistributedTableCacheEntry can be
extended for citus local tables without the need to scan the whole
codebase and update each relevant part.
* Introduce utility functions to find the type of tables
A table type can be a reference table, a hash/range/append distributed
table. Utility methods are created so that we don't have to worry about
how a table is considered as a reference table etc. This also makes it
easy to extend the table types.
* Add IsCitusTableType utilities
* Rename IsCacheEntryCitusTableType -> IsCitusTableTypeCacheEntry
* Change citus table types in some checks
create_distributed_function(function_name,
distribution_arg_name,
colocate_with text)
This UDF did not allow colocate_with parameters when there were no
disttribution_arg_name supplied. This commit changes the behaviour to
allow missing distribution_arg_name parameters when the function should
be colocated with a reference table.
* Hide citus.subquery_pushdown flag
This flag is dangerous and could likely to let queries
return wrong results.
The flag has a very specific purpose for a very specific
data distribution and query structure. In those cases, when
the flag is set, the user can skip recursive planning altogether
*at their own risk*.
The meaning of the flag is that "I know what I'm doing such that
the query structure/data distribution is on my control, so Citus
can skip many correctness checks".
For regular users, enabling this flag is discouraged. We have to
keep the support only for backward compatibility for some users.
In addition to that, give a NOTICE to discourage new users to
use it.
RemoveCoordinatorPlacement does not do what it says. It removes the
coordinator placement only if there are other placements, so it is not a
single node, and only if the coordinator has a placement.
AllTargetExpressionsAreColumnReferences would return false if a query
had an entry that is referencing the outer query. It seems safe to not
have this for non-distributed tables, such as reference tables. We
already have separate checks for other cases such as having limits.
FindNodeCheck is not clear about what the function is doing. They are
renamed to FindNodeMatchingCheckFunctionXXX. Also for choosing elements in these
functions, CheckNodeFunc type is introduced.
It seems that currently we process even postgres tables in explain
commands. This is because we register a hook for explain and we don't
have any check to see if the query has any citus table.
With this commit, we now send the buffer usage as well to the relevant
API. There is some duplicate in the code but it is because of the
existing structure, we can refactor this separately.
The codebase is updated to use varattnosync and varnosyn and we defined
the macros for older versions. This way we can just remove the macros
when we drop an older version.
CMDTAG_SELECT exists in PG12 hence defining a MACRO such as
CMDTAG_SELECT -> "SELECT" is not possible. I chose CMDTAG_SELECT_COMPAT
because with the COMPAT suffix it is explicit that it maps to different
things in different versions and also has a less chance of mapping
something irrevelant. For example if we used SELECT as a macro, then it
would map every SELECT to whatever it is mapping to, which might have
unexpected/undesired behaviour.
The error message when index has opclassopts is improved and the commit
from postgres side is also included for future reference.
Also some minor style related changes are applied.
Error out if index has opclassopts.
Changelog entry on PG13:
Allow CREATE INDEX to specify the GiST signature length and maximum number of integer ranges (Nikita Glukhov)
Postgres 13 added a new VACUUM option, PARALLEL. It is now supported
in our code as well.
Relevant changelog message on postgres:
Allow VACUUM to process indexes in parallel (Masahiko Sawada, Amit Kapila)
This commit mostly adds pg_get_triggerdef_command to our ruleutils_13.
This doesn't add anything extra for ruleutils 13 so it is basically a copy
of the change on ruleutils_12
When there is a join alias, var->varnosync will point to the alias and
var->varno will point to the table itself, but we need to use the alias
when deparsing the query. Hence a workaround is introduced to solve this
problem in ruleutils. Normally this case can be understood with
dpns->plan == NULL check but in our case, dpns->plan is always NULL. We
should sync our ruleutils at some point with postgres ruleutils. This
could be a wrong solution as well but the tests pass.
Rte index is increased by range table index offset in pg >= 13. The
offset is removed with the pg >= 13.
Currently pushdown for union all is disabled because translatedVars is
set to nil on postgres side, and we were using translatedVars to
figure out if partition key has the same index in both sides of union
all. This should be fixed.
Commit on postgres side:
6ef77cf46e81f45716ec981cb08781d426181378
fix union all pushdown logic for pg13
Before pg 13, there was a field, translatedVars, and we were using that
to understand if the partition key has the same index on both sides of
the union all. With pg13 there is a parent_colnos field in appendRelInfo
and we can use that to get the attribute numbers(varattnos) in union all
vars. We make use of parent_colnos instead of translatedVars in pg >=13.
For joins 3 new fields are added, joinleftcols, joinrightcols, and
joinmergedcols. We are not interested in joinmergedcols because we
always expand the column used in joins. There joinmergedcols is always 0
in our case.
For filling joinleftcols and joinrightcols we basically construct the
lists with sequences so either list is of the form: [1 2 3 4 .... n]
Ruleutils is not completed synced with postgres ruleutils and the most
important part is identify_join_columns function change, which now uses
joinleftcols and joinrightcols.
Commit on postgres side:
9ce77d75c5ab094637cc4a446296dc3be6e3c221
A useful email thread:
https://www.postgresql.org/message-id/flat/7115.1577986646%40sss.pgh.pa.us#0ae1d66feeb400013fbaa67a7cccd6ca
PG13 uses joinmergedcols, joinleftcols and joinrightcols for finding
join order now. There relevant fields are set on citus side.
Postgres side commit:
9ce77d75c5ab094637cc4a446296dc3be6e3c221
Postgres changed some join related fields and therefore they also
changed ruleutils, this commit applies those changes to our copy of
ruleutils.
Related commit on postgres side:
9ce77d75c5ab094637cc4a446296dc3be6e3c221
Postgres introduced QueryCompletion struct. Hence a compat utility is
added to finish query completion for older versions and pg >= 13.
The commit on Postgres side:
2f9661311b83dc481fc19f6e3bda015392010a40