Clang 13 complains about a suspicious string concatenation. It thinks we
might have missed a comma. This adds parentheses to make it clear that
concatenation is indeed what we meant.
It seems like the decision for 2PC is more complicated than
it should be.
With this change, we do one behavioral change. In essense,
before this commit, when a SELECT task with replication factor > 1
is executed, the executor was triggering 2PC. And, in fact,
the transaction manager (`ConnectionModifiedPlacement()`) was
able to understand not to trigger 2PC when no modification happens.
However, for transaction blocks like:
BEGIN;
-- a command that triggers 2PC
-- A SELECT command on replication > 1
..
COMMIT;
The SELECT was used to be qualified as required 2PC. And, as a side-effect
the executor was setting `xactProperties.errorOnAnyFailure = true;`
So, the commands was failing at the time of execution. Now, they fail at
the end of the transaction.
In the past, we allowed users to manually switch to 1PC
(e.g., one phase commit). However, with this commit, we
don't. All multi-shard modifications are done via 2PC.
With Citus 9.0, we introduced `citus.single_shard_commit_protocol` which
defaults to 2PC.
With this commit, we prevent any user to set it to 1PC and drop support
for `citus.single_shard_commit_protocol`.
Although this might add some overhead for users, it is already the default
behaviour (so less likely) and marking placements as INVALID is much
worse.
- citus_get_all_dependencies_for_object: emulate what Citus
would qualify as
dependency when adding
a new node
- citus_get_dependencies_for_object: emulate what Citus would qualify
as dependency when creating an
object
Example use:
```SQL
-- find all the depedencies of table test
SELECT
pg_identify_object(t.classid, t.objid, t.objsubid)
FROM
(SELECT * FROM pg_get_object_address('table', '{test}', '{}')) as addr
JOIN LATERAL
citus_get_all_dependencies_for_object(addr.classid, addr.objid, addr.objsubid) as t(classid oid, objid oid, objsubid int)
ON TRUE
ORDER BY 1;
```
Add/fix tests
Fix creating partitions
Add test for mx - partition creating case
Enable cascading to partitioned tables
Fix mx partition adding test
Fix cascading through fkeys
Style
Disable converting with non-inherited fkeys
Fix detach bug
Early return in case of cascade & Add tests
Style
Fix undistribute_table bug & Fix test outputs
Remove RemovePartitionRelationIds
Test with undistribute_table
Add test for mx+convert+undistribute
Remove redundant usage of CreatePartitionedCitusLocalTable
Add some comments
Introduce bulk functions for generating attach/detach partition commands
Fix: Convert partitioned tables after adding fkey
Change the error message for partitions
Introduce function ErrorIfPartitionTableAddedToMetadata
Polish attach/detach command generation functions
Use time_partitions for testing
Move mx tests to citus_local_tables_mx
Add new partitioned table to cascade test
Add test with time series management UDFs
Fix test output
Fix: Assertion fail on relation access tracking
Style
Refactor creating partitioned citus local tables
Remove CreatePartitionedCitusLocalTable
Style
Error out if converting multi-level table
Revert some old tests
Error out adding partitioned partition
Polish
Polish/address
Fix create table partition of case
Use CascadeOperationForRelationIdList if no cascade needed
Fix create partition bug
Revert / Add new tests to mx
Style
Fix dropping fkey bug
Add test with IF NOT EXISTS
Convert to CLT when doing ATTACH PARTITION
Add comments
Add more tests with time series management
Edit the error message for converting the child
Use OR instead of AND in ErrorIfUnsupportedAlterTableStmt
Edit/improve tests
Disable ddl prop when dropping default column definitions
Disable/enable ddl prop just before/after the command
Add comment
Add sequence test
Add trigger test
Remove NeedCascadeViaForeignKeys
Add one more insert to sequence test
Add comment
Style
Fix test output shard ids
Update comments
Disable creating fkey on partitions
Move partition check to CreateCitusLocalTable
Add comment
Add check for attachingmulti-level partition
Add test for pg_constraint
Check pg_dist_partition in tests
Add test inserting on the worker
* Add udf to include shardId in broken partition shard index names
* Address reviews: rename index such that operations can be done on it
* More comprehensive index tests
* Final touches and formatting
Under high write concurrency, we were sometimes reading columnar
metapage as all zeros.
In `WriteToBlock()`, if `clear == true`, then it will clear the page before
writing the new one, rather than just adding data to the page. That
means any concurrent connection that is holding only a pin will be
able to see the all-zero state between the `InitPage()` and the
`memcpy_s()`.
Moreover, postgres/storage/buffer/README states that:
> Buffer access rules:
>
> 1. To scan a page for tuples, one must hold a pin and either shared or
> exclusive content lock. To examine the commit status (XIDs and status bits)
> of a tuple in a shared buffer, one must likewise hold a pin and either shared
> or exclusive lock.
For those reasons, we have to make sure to never keep a pin on the
page without (at least) the shared lock, to avoid having such problems.
A write operation might trigger index deletion if index already had
dead entries for the key we are about to insert.
There are two ways of index deletion:
a) simple deletion
b) bottom-up deletion (>= pg14)
Since columnar_index_fetch_tuple never sets all_dead to true,
columnarAM doesn't ever expect to receive simple deletion requests
(columnar_index_delete_tuples) as we don't mark any index entries
as dead.
However, since columnarAM doesn't delete any dead entries via simple
deletion, postgres might ask for a more comprehensive deletion
(i.e.: bottom-up) at some point when pg >= 14.
So with this commit, we start gracefully ignoring bottom-up deletion
requests made to columnar_index_delete_tuples.
Given that users can anyway "VACUUM FULL" their columnar tables,
we don't see any problem in ignoring deletion requests.
* Make (columnar.stripe) first_row_number index a unique constraint
Since stripe_first_row_number_idx is required to scan a columnar
table, we need to make sure that it is created before doing anything
with columnar tables during pg upgrades.
However, a plain btree index is not a dependency of a table, so
pg_upgrade cannot guarantee that stripe_first_row_number_idx gets
created when creating columnar.stripe, unless we make it a unique
"constraint".
To do that, drop stripe_first_row_number_idx and create a unique
constraint with the same name to keep the code change at minimum.
* Add more pg upgrade tests for columnar
* Fix a logic error in uprade_columnar_after test
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
BuildStripeMetadata() calls HeapTupleHeaderGetXmin(), which must only
be called on a proper heap tuple with MVCC information. Make sure the
caller passes the heap tuple, and not a datum tuple.
Fixes#5318.
Considering all code-paths that we might interact with a columnar table,
add `CheckCitusVersion` calls to tableAM callbacks:
- initializing table scan (`columnar_beginscan` & `columnar_index_fetch_begin`)
- setting a new filenode for a relation (storage initializiation or a table rewrite)
- truncating the storage
- inserting tuple (single and multi)
Also add `CheckCitusVersion` call to:
- drop hook (`ColumnarTableDropHook`)
- `alter_columnar_table_set` & `alter_columnar_table_reset` UDFs
* Columnar: separate plain and exec quals.
Make a clear separation between plain quals, which contain constants
or extern params; and exec quals, which contain exec params and can't
be evaluated until a rescan.
Fixes#5258.
* more vanilla tests
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
When performing a partition-wise join, the planner will adjust paths
parameterized by the parent rel to instead parameterize by the child
rel directly. When this reparameterization happens, we also need to
adjust the join quals to reference the child rather than the parent.
Fixes#5257.
Not flush pending writes if given tid belongs to a "flushed" or
"aborted" stripe write, or to an "in-progress" stripe write of
another backend.
That way, we would reduce the cases where we flush single-tuple
stripes during index scan.
To do that, we follow below steps for index look-up's:
- Do not flush any pending writes and do stripe metadata look-up for
given tid.
If tuple with tid is found, then no need to do another look-up
since we already found the tuple without needing to flush pending
writes.
- If tuple is not found without flushing pending writes, then we have two
scenarios:
- If given tid belongs to a pending write of my backend, then do stripe
metadata look-up for given tid. But this time first **flush any pending
writes**.
- Otherwise, just return false from `index_fetch_tuple` since flushing
pending writes wouldn't help.
With 5825c44d5f, we made the changes to
skip aborted writes when scanning a columnar table.
However, looks like we forgot to handle such cases for the very first
call made to columnar_getnextslot. That means, that commit only
considered the intermediate stripe read operations.
However, functions called by columnar_getnextslot to find first stripe
to read (ColumnarBeginRead & ColumnarRescan) were not caring about
those aborted writes.
To fix that, we teach AdvanceStripeRead to find the very first stripe
to read, and then start using it where were blindly calling
FindNextStripeByRowNumber.
Previously, even when `EXPLAIN` output tells that we will do
index-only scan, it was never the case since columnar tables
don't have the visibility fork that postgres is looking for.
For this reason, visibility check done in
`IndexOnlyNext->VM_ALL_VISIBLE`
code-path was always returning false and postgres was reading
the tuple from the columnar relation itself.
Previously, for regular table scans, we were setting `RelOptInfo->partial_pathlist`
to `NIL` via `set_rel_pathlist_hook` to discard scan `Path`s that need to use any
parallel workers, this was working nicely.
However, when building indexes, this hook doesn't get called so we were not
able to prevent spawning parallel workers when building an index. For this
reason, 9b4dc2f804 added basic
implementation for `columnar_parallelscan_*` callbacks but also made some
changes to skip using those workers when building the index.
However, now that we are doing stripe reservation in two stages, we call
`heap_inplace_update` at some point to complete stripe reservation.
However, postgres throws an error if we call `heap_inplace_update` during
a parallel operation, even if we don't actually make use of those workers.
For this reason, with this pr, we make sure to not generate scan `Path`s that
need to use any parallel workers by using `get_relation_info_hook`.
This is indeed useful to prevent spawning parallel workers during index builds.
If it is certain that we will not use any `parallel_worker`s for a columnar table,
then stripe entries inserted by aborted transactions become visible to
`SnapshotAny` and that causes `REINDEX` to fail by throwing a duplicate key
error.
To fix that:
* consider three states for a stripe write operation:
"flushed", "aborted", or "in-progress",
* make sure to have a clear separation between them, and
* act according to those three states when reading from a columnar table
Since PG14 we can now use binary encoding for arrays and composite types
that contain user defined types. This was fixed in this commit in
Postgres: 670c0a1d47
This change starts using that knowledge, by not necessarily falling back
to text encoding anymore for those types.
While doing this and testing a bit more I found various cases where
binary encoding would fail that our checks didn't cover. This fixes
those cases and adds tests for those. It also fixes EXPLAIN ANALYZE
never using binary encoding, which was a leftover of workaround that
was not necessary anymore.
Finally, it changes the default for both `citus.enable_binary_protocol`
and `citus.binary_worker_copy_format` to `true` for PG14 and up. In our
cloud offering `binary_worker_copy_format` already was true by default.
`enable_binary_protocol` had some bug with MX and user defined types,
this bug was fixed by the above mentioned fixes.
- get_missing_time_partition_ranges: Gets the ranges of missing partitions for the given table, interval and range unless any existing partition conflicts with calculated missing ranges.
- create_time_partitions: Creates partitions by getting range values from get_missing_time_partition_ranges.
- drop_old_time_partitions: Drops partitions of the table older than given threshold.
* Rename RecostColumnarPaths to CostColumnarPaths
* Rename RecostColumnarIndexPath to CostColumnarIndexPath
* Reorder args of CostColumnarScan to align with other two costing functions
* Not adjust index scan start-up cost
* Rename ColumnarIndexScanAddTotalCost to ColumnarIndexScanAdditionalCost
* Reflect that index scan will at least read one stripe in totalCost calculation
* Organize declarations in columnar_customscan.c
In PG 14, procedures can have OUT parameters. In Citus' procedure
delegation framework, we need to adjust the function expression
to get the outargs parameters.
Releven PG change:
e56bce5d43
Simply call Postgres' function to report the progress on
each row recieved.
Note that we currently do not support "COPY dist/ref TO .." progress
report nicely. Citus has some specialized logic to support
"COPY dist/ref TO .." such that it either converts the underlying
command into "COPY (SELECT * FROM dist/ref ) ..." or sends COPY
command to shards directly. In the former case, "tuples_processed"
is only updated when the executor returns all the tuples, so the
progress is not accurate. In the latter case, Citus can actually
implement the progress report. But, for the sake of consistency,
we prefer to not implement at all.
Added to PG 14 with https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8a4f618e7ae3cb11b0b37d0f06f05c8ff905833f
Postgres changed stats expression types as of PG14. Hence we needed to
write the AppendColumnNames method. Also they removed the error on PG
side so we remove it as well.
Relevant commits on pg14:
a4d75c86bf15220df22de0a92c819ecef9db3849
388e75ad33489b77cfb9a8590a91e9287d8fb960
When queryId is not 0 and verbose is true, the query identifier is
emitted to the explain output. This is breaking Postgres outputs.
We disable de query identifier calculation in the tests.
Commit on PG that introduced the query identifier in the explain output:
4f0b0966c866ae9f0e15d7cc73ccf7ce4e1af84b
These changes were removed in commit: Introduces ExecSimpleRelationInsert_compat and modifyStateResultRelInfo macros
We shouldn't have removed them but instead kept them for before PG14
Relevant PG commit:
9e38c2bb5093ceb0c04d6315ccd8975bd17add66
fix array_cat_agg for pg upgrades
array_cat_agg now needs to take anycompatiblearray instead of anyarray
because array_cat changed its type from anyarray to anycompatiblearray
with pg14.
To handle upgrades correctly, we drop the aggregate in
citus_pg_prepare_upgrade. To be able to drop it, we first remove the
dependency from pg_depend.
Then we create the right aggregate in citus_finish_pg_upgrade and we
also add the dependency back to pg_depend.
Postgres doesn't accept NULL for queryStrings in explain plans anymore.
Internally, there are some places in Postgres where they modified the
NULLS to ""(the empty string). So we do the same on citus side.
Commit on Postgres:
1111b2668d89bfcb6f502789158b1233ab4217a6
Postgres expects to set the HASH_STRINGS explicitly in case of the
default behaivor for string hash function.
Postgres Commit
b3817f5f774663d55931dd4fab9c5a94a15ae7ab
index_insert function now has a new parameter, indexUnchanged
This new macro give us the ability to use these new parameter for PG14 and they don't give the parameters for previous versions
Existing parameter is set to false
Relevant PG commit:
9dc718bdf2b1a574481a45624d42b674332e2903
es_result_relation_info is removed from Estate. In this commit we make some changes to handle that.
resultRelationInfo filed is added to ModifyState to support the removed field.
Relevant PG commits:
1375422c7826a2bf387be29895e961614f69de4b
a04daa97a4339c38e304cd6164d37da540d665a8
GetOldestXmin function is removed so we use GetOldestNonRemovableTransactionId functions instead
GetOldestNonRemovableTransactionId_compat picks the appropriate one
Relevant PG commit:
dc7420c2c9274a283779ec19718d2d16323640c0
get_partition_parent and RelationGetPartitionDesc functions now have new parameters to also include detached partitions
Thess new macros give us the ability to use these new parameter for PG14 and they don't give the parameters for previous versions
Existing parameters are set to not accept detached partitions
Relevant PG commit:
71f4c8c6f74ba021e55d35b1128d22fb8c6e1629
In two commits vacuumFlags in PGXACT is moved and then renamed to status flags
This macro uses the appropriate version of the flag
Relevant PG commits:
5788e258bb26495fab65ff3aa486268d1c50b123
cd9c1b3e197a9b53b840dcc87eb41b04d601a5f9
SetTuplestoreDestReceiverParams function now has two new parameters
This new macro give us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
Existing parameters are set to NULL to keep previous behavior
Relevant PG commit:
2f48ede080f42b97b594fb14102c82ca1001b80c
Some Copy related functions copied from Postgres had support for both old and new protocols
Postgres removed support for old version so we remove it too
Relevant PG commit:
3174d69fb96a66173224e60ec7053b988d5ed4d9
New macros: standard_ProcessUtility_compat, ProcessUtility_compat, ColumnarProcessUtility_compat, PrevProcessUtilityHook_compat
The functions now have a new bool parameter: readOnlyTree
These new macros give us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
In multi_ProcessUtility and ColumnarProcessUtility, before doing anything else, we check if readOnlyTree parameter is true and create a copy of pstmt
Existing readOnlyTree parameters are set to false since we already handle the read only case at multi_ProcessUtility and ColumnarProcessUtility
Relevant PG commit:
7c337b6b527b7052e6a751f966d5734c56f668b5
This function was copied from Postgres but it is not static at PG14
So we keep the definition only for previous versions
Relevant PG commit:
c532d15dddff14b01fe9ef1d465013cb8ef186df
CopyState struct is divided into parts and one of them is CopyFromState
This macro uses the appropriate one for PG versions
Relevant PG commit:
c532d15dddff14b01fe9ef1d465013cb8ef186df
In ReindexStmt concurrent field is moved to options and then options are converted to params list.
This macro uses previous fields for previous versions and the new params list with a new function named IsReindexWithParam for PG14
Relevant PG commits:
844c05abc3f1c1703bf17cf44ab66351ed9711d2
b5913f6120792465f4394b93c15c2e2ac0c08376
VacOptTernaryValue enum is renamed to VacOptValue.
In the enum there were three values, VACOPT_TERNARY_DEFAULT, VACOPT_TERNARY_DISABLED, and VACOPT_TERNARY_ENABLED
Now there are four values VACOPTVALUE_UNSPECIFIED, VACOPTVALUE_AUTO, VACOPTVALUE_DISABLED, and VACOPTVALUE_ENABLED
New macros are VacOptValue_compat, VACOPTVALUE_UNSPECIFIED_COMPAT, VACOPTVALUE_DISABLED_COMPAT, and VACOPTVALUE_ENABLED_COMPAT
The VACOPTVALUE_UNSPECIFIED_COMPAT matches VACOPT_TERNARY_DEFAULT and VACOPTVALUE_UNSPECIFIED. And there are no macro for VACOPTVALUE_AUTO.
Relevant PG commit:
3499df0dee8c4ea51d264a674df5b5e31991319a
New macros: FuncnameGetCandidates_compat and expand_function_arguments_compat
The functions (the ones without _compat) now have a new bool include_out_arguments parameter
These new macros give us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
Existing include_out_arguments parameters are set to 'false' to keep current behavior
Relevant PG commit:
e56bce5d43789cce95d099554ae9593ada92b3b7
stats function now have a new bool print_to_stderr parameter
This new macro gives us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
Existing print_to_stderr parameter is set to true to keep current behavior
Relevant PG commit:
43620e328617c1f41a2a54c8cee01723064e3ffa
getObjectTypeDescription and getObjectIdentity functions now have a new bool missing_ok parameter
These new macros give us the ability to use this new parameter for PG14 and they don't give the parameter for previous versions
Currently all missing_ok parameters are set to false to keep current behavior
Relevant PG commit:
2a10fdc4307a667883f7a3369cb93a721ade9680
The STATUS_WAITING define is removed and an enum with PROC_WAIT_STATUS_WAITING is added instead
This macro uses appropriate one
Relevant PG commit:
a513f1dfbf2c29a51b0f7cbd5913ce2d2ee452c5
AlterTableStmt's relkind field is changed into objtype
New AlterTableStmtObjType macro uses the appropriate one
Relevant PG commit:
cc35d8933a211d9965eb1c1d2749a903d5735db2
Allow ColumnarScans to push down join quals by generating
parameterized paths. This significantly expands the utility of chunk
group filtering, making a ColumnarScan behave similar to an index when
on the inner of a nested loop join.
Also, evaluate all parameters on beginscan/rescan, which also works
for external parameters.
Fixes#4488.
Previously, we were doing `first_row_number` reservation for the first
row written to current `WriteState` but were doing `stripe_id`
reservation when flushing the `WriteState` and were inserting the
related record to `columnar.stripe` at that time as well.
However, inserting `columnar.stripe` record at flush-time is
problematic. This is because, as told in #5160, if relation has
any index-based constraints and if there are two concurrent
writes that are inserting conflicting key values for that constraint,
then postgres relies on `tableAM->fetch_index_tuple`
(=`columnar_fetch_index_tuple`) callback to return `true` when
indexAM is checking against possible constraint violations.
However, pending writes of other backends are not visible to concurrent
sessions in columnar since we were not inserting the stripe metadata
record until flushing the stripe.
With this commit, we split stripe reservation into two phases:
i) Reserve `stripe_id` and insert a "dummy" record to `columnar.stripe`
at the very same time we reserve `first_row_number`, i.e. when writing
the first row to the current `WriteState`.
ii) At flush time, do the storage level allocation and complete the
missing fields of the dummy record inserted into `columnar.stripe`
during i).
That way, any concurrent writes would be able to check against possible
constraint violations by using `SnapshotDirty` when scanning
`columnar.stripe`.
Note that `columnar_fetch_index_tuple` still wouldn't be able to fill
the output tupleslot for the requested tid but it would at least return
`true` for such index look-up's and we believe this should be sufficient
for the caller indexAM callback to make the concurrent writer block on
prior one.
That is how we fix#5160.
Only downside of reserving `stripe_id` at the same time we reserve
`first_row_number` is that now any aborted writes would also waste
some amount of `stripe_id` as in the case of `first_row_number` but
we are just wasting them one-by-one.
Considering the fact that we waste `first_row_number` by the amount
stripe row limit (=150k by default) in such cases, this shouldn't be
important at all.
Before starting to scan a columnar table, we always flush the pending
writes to disk.
However, we increment command counter after modifying metadata tables.
On the other hand, now that we _don't always use_ xact snapshot to scan
a columnar table, writes that we just flushed might not be visible to
the query that just flushed pending writes to disk since curcid of
provided snapshot would become smaller than the command id being used
when modifying metadata tables.
To give an example, before this change, below was a possible scenario
due to the changes that we made to use the correct snapshot.
```sql
CREATE TABLE t(a int, b int) USING columnar;
BEGIN;
INSERT INTO t VALUES (5, 10);
SELECT * FROM t;
┌───┬───┐
│ a │ b │
├───┼───┤
└───┴───┘
(0 rows)
SELECT * FROM t;
┌───┬────┐
│ a │ b │
├───┼────┤
│ 5 │ 10 │
└───┴────┘
(1 row)
```
In next commit, we will adjust curcid of the snapshot being used when
scanning the columnar table.
However, for index scan, snapshot is provided not when beginning scan
but within fetch-tuple call.
For this reason, start flushing pending writes in init_columnar_read_state
since this seem to be a prerequisite step that needs to be done before
scanning a columnar table regardless of the scan method being used.
Seems that we always increment the command counter right after
finishing metadata table modification.
For this reason, it makes sense to call CommandCounterIncrement
within FinishModifyRelation.
The logging of the amount of ignored moves crashed when no distributed
tables existed in a cluster. This also fixes in passing that the logging
of ignored moves logs the correct number of ignored moves if there
exist multiple colocation groups and all are rebalanced at the same time.
`tcp_user_timeout` is the awesome relatively unknown big brother of the
TCP keepalive related options. Instead of depending on keepalives being
sent, this determines that a socket is dead by waiting at most N seconds
for an ack of data that it has sent. It's exposed in libpq starting from
PG12.
DESCRIPTION: Fix a segfault caused by use after free in ConnectionsPlacementHash
Fix a segfault caused by retaining data in any of the hashmaps making up the Placement Connection Management.
We have seen production systems segfault due to random data referenced from ConnectionPlacementHash.
On investigation we found that the backends segfaulting on this had OOM errors closely prior to the segfault.
It has shown there are at least 15 places where an allocation can OOM that would cause ConnectionPlacementHash to retain pointers to memory from contexts that are subsequently freed. This would reproduce the segfault we have observed in production.
Conditions for these allocations are:
- allocated after first call to `AssociatePlacementWithShard`: https://github.com/citusdata/citus/blob/v10.0.3/src/backend/distributed/connection/placement_connection.c#L880-L881
- allocated before `StartNodeUserDatabaseConnection`: https://github.com/citusdata/citus/blob/v10.0.3/src/backend/distributed/connection/connection_management.c#L291
At least 15 points of memory allocation (which could fail) are between the callsites of both in a primary key lookup on a reference table - where we have seen an OOM cause a segfault moments later.
Instead of leaving any references in ConnectionPlacementHash, ConnectionShardHash and ColocatedPlacementsHash that could retain any pointers that are freed due to the TopTransactionContext being reset we clear all these hashes irregardless of the state of CurrentCoordinatedTransactionState.
Downside is that on any transaction abort we will now iterate through 4 hashmaps and clear their contents. Given that they are either already empty, which should cause a quick iteration, or non-empty, causing segfaults in subsequent executions, this overhead seems reasonable.
A better solution would be to move the creation of these hashmaps so they would live in the TopTransactionContext themself, assuming their contents would never outlive a transaction. This needs more investigation and is an involved refactor Hence fixing this quickly here.
All the callers except columnar_relation_copy_for_cluster were already
switching to right memory context when creating ColumnarReadState.
With this commit, we embed that logic into init_columnar_read_state
to avoid further such bugs.
That way, we start using the right memory context for
columnar_relation_copy_for_cluster too.
- Add support for CRETE INDEX ... ON ONLY: Before that commit we were not sending "ONLY" option to the worker nodes at all. With this commit, "ONLY" parameter will be sent to the worker nodes if it is necessary. (#4938)
- Add support for ALTER INDEX ... ATTACH PARTITION: Attach child_index to parent_index by creating same inheritance on shard level in addition to table level. (#4980)
* Synchronize hasmetadata flag on mx workers
* Switch to sequential execution
* Add test
* Use SetWorkerColumn
* Add test for stop_sync
* Remove usage of UpdateHasmetadataOnWorkersWithMetadata
* Remove MarkNodeMetadataSynced
* Fix test for metadatasynced
* Remove MarkNodeMetadataSynced
* Style
* Remove MarkNodeHasMetadata
* Remove UpdateDistNodeBoolAttr
* Refactor SetWorkerColumn
* Use SetWorkerColumnLocalOnly when setting up dependencies
* Use SetWorkerColumnLocalOnly in TriggerSyncMetadataToPrimaryNodes
* Style
* Make update command generator functions static
* Set metadatasynced before syncing
* Call SetWorkerColumn only if the sync is successful
* Try to sync all nodes
* Fix indexno
* Update metadatasynced locally first
* Break if a node fails to sync metadata
* Send worker commands optional
* Style & Rebase
* Add raiseOnError param to SetWorkerColumn
* Style
* Set metadatasynced for all metadata nodes
* Style
* Introduce SetWorkerColumnOptional
* Polish
* Style
* Dont send set command to not synced metadata nodes
* Style
* Polish
* Add test for stop_sync
* Add test for shouldhaveshards
* Add test for isactive flag
* Sort by placementid in the function verify_metadata
* Cover edge cases for failing nodes
* Add comments
* Add nodeport to isactive test
* Add warning if metadata out of sync
* Update warning message
In short, add wrappers around Postgres' AddWaitEventToSet() and
ModifyWaitEvent().
AddWaitEventToSet()/ModifyWaitEvent*() may throw hard errors. For
example, when the underlying socket for a connection is closed by
the remote server and already reflected by the OS, however
Citus hasn't had a chance to get this information. In that case,
if replication factor is >1, Citus can failover to other nodes
for executing the query. Even if replication factor = 1, Citus
can give much nicer errors.
So CitusAddWaitEventSetToSet()/CitusModifyWaitEvent() simply puts
AddWaitEventToSet()/ModifyWaitEvent() into a PG_TRY/PG_CATCH block
in order to catch any hard errors, and returns this information to
the caller.
As we use the current user to sync the metadata to the nodes
with #5105 (and many other PRs), there is no reason that
prevents us to use the coordinated transaction for metadata syncing.
This commit also renames few functions to reflect their actual
implementation.
Before this commit, creating a partition after a DROP column
on the parent (position before dist. key) was leading to
partition to have the wrong distribution column.
update_distributed_table_colocation can be called by the relation
owner, and internally it updates pg_dist_partition. With this
commit, update_distributed_table_colocation uses an internal
UDF to access pg_dist_partition.
As a result, this operation can now be done by regular users
on MX.
Instead of setting stripeReadState to NULL, call ColumnarResetRead
before re-scanning a columnar table since this function is already
designed for doing the necessary clean up when finishing a stripe
read.
Note that this change shouldn't have a great effect on memory usage
since AdvanceStripe was already doing the clean-up for all the
stripes except the last one.
Previously, we were only using chunk group reader for sequential scan.
However, to support index scans on columnar tables, now we use very
same low level functions for index scan too.
Since those low-level functions were only used for sequential scan, it
was guaranteed that we would never read the same chunk group more than
once, so we were freeing chunk buffers after deserializing them into a
separate buffer.
Now that we use those low level functions for index scan, we cannot
free chunk buffers since it's possible to read the same chunk group
again, such that:
- read chunk group 1 of stripe 5
- read chunk group 2 of stripe 5
- read chunk group 1 of stripe 5 again
Here, when we decide to read chunk group 1 for a second time,
chunk group 1 is not cached. Plus, before this commit, we were
freeing the chunk buffers for chunk group 1 after the first
read and then we were getting segfault or errors from low-level
de-compression APIs.
* Fix UNION not being pushdown
Postgres optimizes column fields that are not needed in the output. We
were relying on these fields to understand if it is safe to push down a
union query.
This fix looks at the parse query, which has the original column fields
to detect if it is safe to push down a union query.
* Add more tests
* Simplify code and make it more robust
* Process varlevelsup > 0 in FindReferencedTableColumn
* Only look for outers vars in union path
* Add more comments
* Remove UNION ALL specific logic for pulling up childvars
The progress monitor wouldn't actually update the size of the shard on
the target node when using "block_writes" as the `shard_transfer_mode`.
The reason for this is that the CREATE TABLE part of the shard creation
would only be committed once all data was moved as well. This caused
our size calculation to always return 0, since the table did not exist
yet in the session that the progress monitor used.
This is fixed by first committing creation of the table, and only then
starting the actual data copy.
The test output changes slightly. Apparently splitting this up in two
transactions instead of one, increases the table size after the copy by
about 40kB. The additional size used doesn't increase when with the
amount of data in the table is larger (it stays ~40kB per shard). So
this small change in test output is not considered an actual problem.
These two options were not included when creating the sequences on the
workers as part of metadata syncing.
The missing `data_type` part of the definition made finding the cause
of #5126 harder than necessary, because of confusing errors.
Before this commit, we always synced the metadata with superuser.
However, that creates various edge cases such as visibility errors
or self distributed deadlocks or complicates user access checks.
Instead, with this commit, we use the current user to sync the metadata.
Note that, `start_metadata_sync_to_node` still requires super user
because accessing certain metadata (like pg_dist_node) always require
superuser (e.g., the current user should be a superuser).
However, metadata syncing operations regarding the distributed
tables can now be done with regular users, as long as the user
is the owner of the table. A table owner can still insert non-sense
metadata, however it'd only affect its own table. So, we cannot do
anything about that.
With this commit, we add (`CREATE INDEX` / `REINDEX`) `CONCURRENTLY` support for columnar tables.
For that, we implement `columnar_index_validate_scan` callback.
The reasoning behind the implementation is as follows:
* Postgres function `validate_index` provides all the TIDs that are currently in the
index to `columnar_index_validate_scan` callback via a `tupleSort` object..
* We start scanning the table by using `columnar_getnextslot` as usual.
Before moving forward, note that `columnar_getnextslot` guarantees
to return tuples in the order of their TIDs.
* For us to use during table scan, postgres provides a snapshot guaranteeing
that any tuples that are valid according to that snapshot but are not in the
index must be added to the index.
* Then for each tuple that we read from our table, we continue iterating
given `tupleSort` to find the first TID that is greater than or equal to our
tuple's TID.
If both TID's are equal to each other, then we skip the tuple since it's already
indexed.
If the TID that we read from tupleSort is greater then our tuple's TID, then
we decide to insert this tuple into index.
systable_getnext already uses ForwardScanDirection if relation has any
open indexes, but let's be more explicit doing ordered scan on columnar
catalog tables.
This happens only when we have a "<" or "<=" filter on distribution
column of a range distributed table and that filter falls in between
two shards.
When the filter falls in between two shards:
If the filter is ">" or ">=", then UpperShardBoundary was
returning "upperBoundIndex - 1", where upperBoundIndex is
exclusive shard index used during binary seach.
This is expected since upperBoundIndex is an exclusive
index.
If the filter is "<" or "<=", then LowerShardBoundary was
returning "lowerBoundIndex + 1", where lowerBoundIndex is
inclusive shard index used during binary seach.
On the other hand, since lowerBoundIndex is an inclusive
index, we should just return lowerBoundIndex instead of
doing "+ 1". Before this commit, we were missing leftmost
shard in such queries.
* Remove useless conditional branches
The branch that we delete from UpperShardBoundary was obviously useless.
The other one in LowerShardBoundary became useless after we remove "+ 1"
from there.
This indeed is another proof of what & how we are fixing with this pr.
* Improve comments and add more
* Add some tests for upper bound calculation too
* Add parameter to cleanup metadata
* Set clear metadata default to true
* Add test for clearing metadata
* Separate test file for start/stop metadata syncing
* Fix stop_sync bug for secondary nodes
* Use PreventInTransactionBlock
* DRemovedebuggiing logs
* Remove relation not found logs from mx test
* Revert localGroupId when doing stop_sync
* Move metadata sync test to mx schedule
* Add test with name that needs to be quoted
* Add test for views and matviews
* Add test for distributed table with custom type
* Add comments to test
* Add test with stats, indexes and constraints
* Fix matview test
* Add test for dropped column
* Add notice messages to stop_metadata_sync
* Add coordinator check to stop metadat sync
* Revert local_group_id only if clearMetadata is true
* Add a final check to see the metadata is sane
* Remove the drop verbosity in test
* Remove table description tests from sync test
* Add stop sync to coordinator test
* Change the order in stop_sync
* Add test for hybrid (columnar+heap) partitioned table
* Change error to notice for stop sync to coordinator
* Sync at the end of the test to prevent any failures
* Add test case in a transaction block
* Remove relation not found tests
Ignore orphaned shards in more places
Only use active shard placements in RouterInsertTaskList
Use IncludingOrphanedPlacements in some more places
Fix comment
Add tests
The name and comment of this function did not indicate that it only
really could detect locally accessible citus local tables. This fixes
that, while also cleaning up the function a bit.
* Alter seq type when we first use the seq in a dist table
* Don't allow type changes when seq is used in dist table
* ALTER SEQUENCE propagation
* Tests for ALTER SEQUENCE propagation
* Relocate AlterSequenceType and ensure dependencies for sequence
* Support for citus local tables, and other fixes
* Final formatting
With the previous version of this check we would disallow distributed
tables that did not have a colocationid, to have a foreign key to a
reference table. This fixes that, since there's no reason to disallow
that.
Originally ReplicateShardToNode was meant for
`upgrade_to_reference_table`, which required handling of existing inactive
placements. These days `upgrade_to_reference_table` is deprecated and
cannot be used anymore. Now that we have SHARD_STATE_TO_DELETE too, this
left over code seemed error prone. So this removes support for
activating inactive reference table placemements, since these should not
be possible. If it finds a non active reference table placement anyway
it now errors out.
This also removes a few outdated comments related to `upgrade_to_refeference_table`.
Moving shards of reference tables was possible in at least one case:
```sql
select citus_disable_node('localhost', 9702);
create table r(x int);
select create_reference_table('r');
set citus.replicate_reference_tables_on_activate = off;
select citus_activate_node('localhost', 9702);
select citus_move_shard_placement(102008, 'localhost', 9701, 'localhost', 9702);
```
This would then remove the reference table shard on the source, causing
all kinds of issues. This fixes that by disallowing all shard moves
except for shards of distributed tables.
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
The first and main issue was that we were putting absolute pointers into
shared memory for the `steps` field of the `ProgressMonitorData`. This
pointer was being overwritten every time a process requested the monitor
steps, which is the only reason why this even worked in the first place.
To quote a part of a relevant stack overflow answer:
> First of all, putting absolute pointers in shared memory segments is
> terrible terible idea - those pointers would only be valid in the
> process that filled in their values. Shared memory segments are not
> guaranteed to attach at the same virtual address in every process.
> On the contrary - they attach where the system deems it possible when
> `shmaddr == NULL` is specified on call to `shmat()`
Source: https://stackoverflow.com/a/10781921/2570866
In this case a race condition occurred when a second process overwrote
the pointer in between the first process its write and read of the steps
field.
This issue is fixed by not storing the pointer in shared memory anymore.
Instead we now calculate it's position every time we need it.
The second race condition I have not been able to trigger, but I found
it while investigating this. This issue was that we published the handle
of the shared memory segment, before we initialized the data in the
steps. This means that during initialization of the data, a call to
`get_rebalance_progress()` could read partial data in an unsynchronized
manner.
With local query caching, we try to avoid deparse/parse stages as the
operation is too costly.
However, we can do deparse/parse operations once per cached queries, right
before we put the plan into the cache. With that, we avoid edge
cases like (4239) or (5038).
In a sense, we are making the local plan caching behave similar for non-cached
local/remote queries, by forcing to deparse the query once.
A shard move would fail if there was an orphaned version of the shard on
the target node. With this change before actually fail, we try to clean
up orphaned shards to see if that fixes the issue.
Sometimes the background daemon doesn't cleanup orphaned shards quickly
enough. It's useful to have a UDF to trigger this removal when needed.
We already had a UDF like this but it was only used during testing. This
exposes that UDF to users. As a safety measure it cannot be run in a
transaction, because that would cause the background daemon to stop
cleaning up shards while this transaction is running.
* Add user-defined sequence support for MX
* Remove default part when propagating to workers
* Fix ALTER TABLE with sequences for mx tables
* Clean up and add tests
* Propagate DROP SEQUENCE
* Removing function parts
* Propagate ALTER SEQUENCE
* Change sequence type before propagation & cleanup
* Revert "Propagate ALTER SEQUENCE"
This reverts commit 2bef64c5a29f4e7224a7f43b43b88e0133c65159.
* Ensure sequence is not used in a different column with different type
* Insert select tests
* Propagate rename sequence stmt
* Fix issue with group ID cache invalidation
* Add ALTER TABLE ALTER COLUMN TYPE .. precaution
* Fix attnum inconsistency and add various tests
* Add ALTER SEQUENCE precaution
* Remove Citus hook
* More tests
Co-authored-by: Marco Slot <marco.slot@gmail.com>
InvalidateForeignKeyGraph sends an invalidation via shared memory to all
backends, including the current one.
However, we might not call AcceptInvalidationMessages before reading
from the cache below. It would be better to also add a call to
AcceptInvalidationMessages in IsForeignConstraintRelationshipGraphValid.
Previously this was usually done after argument parsing. This can cause
SEGFAULTs if the number or type of arguments changes in a new version.
By checking that Citus version is correct before doing any argument
parsing we protect against these types of issues. Issues like this have
occurred in pg_auto_failover, so it's not just a theoretical issue.
The main reason why these calls were not at the top of functions is
really just historical. It was because in the past we didn't allow
statements before declarations. Thus having this check before the
argument parsing would have only been possible if we first declared all
variables.
In addition to moving existing CheckCitusVersion calls it also adds
these calls to rebalancer related functions (they were missing there).
To be able to report progress of the rebalancer, the rebalancer updates
the state of a shard move in a shared memory segment. To then fetch the
progress, `get_rebalance_progress` can be called which reads this shared
memory.
Without this change it did so without using any synchronization
primitives, allowing for data races. This fixes that by using atomic
operations to update and read from the parts of the shared memory that
can be changed after initialization.
DESCRIPTION: fix shared dependencies that are not resident in a database
eg. databases depend on users (their owners) that both don’t have a
database they reside in. These dependencies are recorded in pg_shdepend
with a `dbid` of `InvalidOid` When we fetch our shared dependencies we don’t take
these links in account.
With this patch we use logic inspired by `classIdGetDbId` to decide when to use `MyDatabaseId` vs `InvalidOid` to correctly resolve dependencies between shared objects.
Without this change the rebalancer progress monitor gets the shard sizes
from the `shardlength` column in `pg_dist_placement`. This column needs to
be updated manually by calling `citus_update_table_statistics`.
However, `citus_update_table_statistics` could lead to distributed
deadlocks while database traffic is on-going (see #4752).
To work around this we don't use `shardlength` column anymore. Instead
for every rebalance we now fetch all shard sizes on the fly.
Two additional things this does are:
1. It adds tests for the rebalance progress function.
2. If a shard move cannot be done because a source or target node is
unreachable, then we error in stop the rebalance, instead of showing
a warning and continuing. When using the by_disk_size rebalance
strategy it's not safe to continue with other moves if a specific
move failed. It's possible that the failed move made space for the
next move, and because the failed move never happened this space now
does not exist.
3. Adds two new columns to the result of `get_rebalancer_progress` which
shows the size of the shard on the source and target node.
Fixes#4930
DESCRIPTION: Add support for ALTER DATABASE OWNER
This adds support for changing the database owner. It achieves this by marking the database as a distributed object. By marking the database as a distributed object it will look for its dependencies and order the user creation commands (enterprise only) before the alter of the database owner. This is mostly important when adding new nodes.
By having the database marked as a distributed object it can easily understand for which `ALTER DATABASE ... OWNER TO ...` commands to propagate by resolving the object address of the database and verifying it is a distributed object, and hence should propagate changes of owner ship to all workers.
Given the ownership of the database might have implications on subsequent commands in transactions we force sequential mode for transactions that have a `ALTER DATABASE ... OWNER TO ...` command in them. This will fail the transaction with meaningful help when the transaction already executed parallel statements.
By default the feature is turned off since roles are not automatically propagated, having it turned on would cause hard to understand errors for the user. It can be turned on by the user via setting the `citus.enable_alter_database_owner`.
Comment from the code:
/*
* Iterate until all the tasks are finished. Once all the tasks
* are finished, ensure that that all the connection initializations
* are also finished. Otherwise, those connections are terminated
* abruptly before they are established (or failed). Instead, we let
* the ConnectionStateMachine() to properly handle them.
*
* Note that we could have the connections that are not established
* as a side effect of slow-start algorithm. At the time the algorithm
* decides to establish new connections, the execution might have tasks
* to finish. But, the execution might finish before the new connections
* are established.
*/
Note that the abruptly terminated connections lead to the following errors:
2020-11-16 21:09:09.800 CET [16633] LOG: could not accept SSL connection: Connection reset by peer
2020-11-16 21:09:09.872 CET [16657] LOG: could not accept SSL connection: Undefined error: 0
2020-11-16 21:09:09.894 CET [16667] LOG: could not accept SSL connection: Connection reset by peer
To easily reproduce the issue:
- Create a single node Citus
- Add the coordinator to the metadata
- Create a distributed table with shards on the coordinator
- f.sql: select count(*) from test;
- pgbench -f /tmp/f.sql postgres -T 12 -c 40 -P 1 or pgbench -f /tmp/f.sql postgres -T 12 -c 40 -P 1 -C
With this commit, the executor becomes smarter about refrain to open
new connections. The very basic example is that, if the connection
establishments take 1000ms and task executions as 5 msecs, the executor
becomes smart enough to not establish new connections.
It was possible to block maintenance daemon by taking an SHARE ROW
EXCLUSIVE lock on pg_dist_placement. Until the lock is released
maintenance daemon would be blocked.
We should not block the maintenance daemon under any case hence now we
try to get the pg_dist_placement lock without waiting, if we cannot get
it then we don't try to drop the old placements.
DESCRIPTION: introduce `citus.local_hostname` GUC for connections to the current node
Citus once in a while needs to connect to itself for some systems operations. This used to be hardcoded to `localhost`. The hardcoded hostname causes some issues, for example in environments where `sslmode=verify-full` is required. It is not always desirable or even feasible to get `localhost` as an alt name on the certificate.
By introducing a GUC to use when connecting to the current instance the user has more control what network path is used and what hostname is required to be present in the server certificate.
Every move in the rebalancer algorithm results in an improvement in the
balance. However, even if the improvement in the balance was very small
the move was still chosen. This is especially problematic if the shard
itself is very big and the move will take a long time.
This changes the rebalancer algorithm to take the relative size of the
balance improvement into account when choosing moves. By default a move
will not be chosen if it improves the balance by less than half of the
size of the shard. An extra argument is added to the rebalancer
functions so that the user can decide to lower the default threshold if
the ignored move is wanted anyway.
* Make VACUUM hint for upgrade scenario actually work
* Suggest using VACUUM if metapage doesn't exist
Plus, suggest upgrading sql version as another option.
* Always force read metapage block
* Fix two typos
* Columnar: introduce columnar storage API.
This new API is responsible for the low-level storage details of
columnar; translating large reads and writes into individual block
reads and writes that respect the page headers and emit WAL. It's also
responsible for the columnar metapage, resource reservations (stripe
IDs, row numbers, and data), and truncation.
This new API is not used yet, but will be used in subsequent
forthcoming commits.
* Columnar: add columnar_storage_info() for debugging purposes.
* Columnar: expose ColumnarMetadataNewStorageId().
* Columnar: always initialize metapage at creation time.
This avoids the complexity of dealing with tables where the metapage
has not yet been initialized.
* Columnar: columnar storage upgrade/downgrade UDFs.
Necessary upgrade/downgrade step so that new code doesn't see an old
metapage.
* Columnar: improve metadata.c comment.
* Columnar: make ColumnarMetapage internal to the storage API.
Callers should not have or need direct access to the metapage.
* Columnar: perform resource reservation using storage API.
* Columnar: implement truncate using storage API.
* Columnar: implement read/write paths with storage API.
* Columnar: add storage tests.
* Revert "Columnar: don't include stripe reservation locks in lock graph."
This reverts commit c3dcd6b9f8.
No longer needed because the columnar storage API takes care of
concurrency for resource reservation.
* Columnar: remove unnecessary lock when reserving.
No longer necessary because the columnar storage API takes care of
concurrent resource reservation.
* Add simple upgrade tests for storage/ branch
* fix multi_extension.out
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
* When moving a shard to a new node ensure there is enough space
* Add WairForMiliseconds time utility
* Add more tests and increase readability
* Remove the retry loop and use a single udf for disk stats
* Address review
* address review
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
The comment of DropMarkedShards described the behaviour that after a
failure we would continue trying to drop other shards. However the code
did not do this and would stop after the first failure. Instead of
simply fixing the comment I fixed the code, because the described
behaviour is more useful. Now a single shard that cannot be removed yet
does not block others from being removed.
We decrease memory usage by:
- Freeing temporary buffers
- Using separate memory context for blocks that uses "small" amount of
memory but can be repeated many times such as loops
As long as the VALUES clause contains constant values, we should not
recursively plan the queries/CTEs.
This is a follow-up work of #1805. So, we can easily apply OUTER join
checks as if VALUES clause is a reference table/immutable function.
* Fix problews with concurrent calls of DropMarkedShards
When trying to enable `citus.defer_drop_after_shard_move` by default it
turned out that DropMarkedShards was not safe to call concurrently.
This could especially cause big problems when also moving shards at the
same time. During tests it was possible to trigger a state where a shard
that was moved would not be available on any of the nodes anymore after
the move.
Currently DropMarkedShards is only called in production by the
maintenaince deamon. Since this is only a single process triggering such
a race is currently impossible in production settings. In future changes
we will want to call DropMarkedShards from other places too though.
* Add some isolation tests
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
This commit adds support for long partition names for distributed tables:
- ALTER TABLE dist_table ATTACH PARTITION ..
- CREATE TABLE .. PARTITION OF dist_table ..
Note: create_distributed_table UDF does not support long table and
partition names, and is not covered in this commit
* Introduce 3 partitioned size udfs
* Add tests for new partition size udfs
* Fix type incompatibilities
* Convert UDFs into pure sql functions
* Fix function comment
ConnParams(AuthInfo and PoolInfo) gets a snapshot, which will block the
remote connectinos to localhost. And the release of snapshot will be
blocked by the snapshot. This leads to a deadlock.
We warm up the conn params hash before starting a new transaction so
that the entries will already be there when we start a new transaction.
Hence GetConnParams will not get a snapshot.
* Columnar: use clause Vars for chunk group filtering.
This solves #4780 and also provides a cleaner separation between chunk
group filtering and projection pushdown.
* Columnar: sort and deduplicate Vars pulled from clauses.
* Columnar: cleanup variable names.
* Columnar: remove alternate test output.
* Columnar: do not recurse when looking for whereClauseVars.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
With https://github.com/citusdata/citus/pull/4806 we enabled
2PC for any non-read-only local task. However, if the execution
is a single task, enabling 2PC (CoordinatedTransactionShouldUse2PC)
hits an assertion as we are not in a coordinated transaction.
There is no downside of using a coordinated transaction for single
task local queries.
* Columnar: fix misnamed file.
* Columnar: make compression not dependent on columnar.h.
* Columnar: rename columnar_metadata_tables.c to columnar_metadata.c.
* Columnar: make customscan not depend on columnar.h.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
Because setting the flag doesn't necessarily mean that we'll
use 2PC. If connections are read-only, we will not use 2PC.
In other words, we'll use 2PC only for connections that modified
any placements.
Before this commit, Citus used 2PC no matter what kind of
local query execution happens.
For example, if the coordinator has shards (and the workers as well),
even a simple SELECT query could start 2PC:
```SQL
WITH cte_1 AS (SELECT * FROM test LIMIT 10) SELECT count(*) FROM cte_1;
```
In this query, the local execution of the shards (and also intermediate
result reads) triggers the 2PC.
To prevent that, Citus now distinguishes local reads and local writes.
And, Citus switches to 2PC only if a modification happens. This may
still lead to unnecessary 2PCs when there is a local modification
and remote SELECTs only. Though, we handle that separately
via #4587.
Postgres keeps AFTER trigger state for each transaction, because we can have deferred AFTER triggers which will be fired at the end of a transaction. Postgres cleans up this state at the end of transaction.
Postgres processes ON COMMIT triggers after cleaning-up the AFTER trigger states. So if we fire any triggers in ON COMMIT, the AFTER trigger state won't be cleaned-up properly and the transaction state will be left in an inconsistent state, which might result in assertion failure.
So with this commit, we remove foreign keys between columnar metadata tables and enforce constraints between them manually when dropping columnar tables.
* Skip 2PC for readonly connections in a transaction
* Use ConnectionModifiedPlacement() function
* Remove the second check of ConnectionModifiedPlacement()
* Add order by to prevent flaky output
* Test using pg_dist_transaction
With this commit, we make sure to prevent infinite recursion for queries
in the format: [subquery with a UNION ALL] JOIN [table or subquery]
Also, fixes a bug where we pushdown UNION ALL below a JOIN even if the
UNION ALL is not safe to pushdown.
* Reimplement citus_update_table_statistics
* Update stats for the given table not colocation group
* Add tests for reimplemented citus_update_table_statistics
* Use coordinated transaction, merge with citus_shard_sizes functions
* Update the old master_update_table_statistics as well
* Use translated vars in postgres 13 as well
Postgres 13 removed translated vars with pg 13 so we had a special logic
for pg 13. However it had some bug, so now we copy the translated vars
before postgres deletes it. This also simplifies the logic.
* fix rtoffset with pg >= 13
/*
* The physical planner assumes that all worker queries would have
* target list entries based on the fact that at least the column
* on the JOINs have to be on the target list. However, there is
* an exception to that if there is a cartesian product join and
* there is no additional target list entries belong to one side
* of the JOIN. Once we support cartesian product join, we should
* remove this error.
*/
When we use PROCESS_UTILITY_TOPLEVEL it causes some problems when
combined with other extensions such as pg_audit. With this commit we use
PROCESS_UTILITY_QUERY in the codebase to fix those problems.
When executing alter_table / undistribute_table udf's, we should not try
to change sequence dependencies on MX workers if new table wouldn't
require syncing metadata.
Previously, we were checking that for input table. But in some cases, the
fact that input table requires syncing metadata doesn't imply the same
for resulting table (e.g when undistributing a Citus table).
Even more, doing that was giving an unexpected error when undistributing
a Citus table so this commit actually fixes that.
It seems that we need to consider only pseudo constants while doing some
shortcuts in planning. For example there could be a false clause but it
can contribute to the result in which case it will not be a pseudo
constant.
We would exclude tables without relationRestriction from conversion
candidates in local-distributed table joins. This could leave a leftover
local table which should have been converted to a subquery.
Ideally I would expect that in each call to CreateDistributedPlan we
would pass a new plan id, but that seems like a bigger change.
/*
* Colocated intermediate results are just files and not required to use
* the same connections with their co-located shards. So, we are free to
* use any connection we can get.
*
* Also, the current connection re-use logic does not know how to handle
* intermediate results as the intermediate results always truncates the
* existing files. That's why, we use one connection per intermediate
* result.
*/
We do not include dummy column if original task didn't return any
columns.
Otherwise, number of columns that original task returned wouldn't
match number of columns returned by worker_save_query_explain_analyze.
When COPY is used for copying into co-located files, it was
not allowed to use local execution. The primary reason was
Citus treating co-located intermediate results as co-located
shards, and COPY into the distributed table was done via
"format result". And, local execution of such COPY commands
was not implemented.
With this change, we implement support for local execution with
"format result". To do that, we use the buffer for every file
on shardState->copyOutState, similar to how local copy on
shards are implemented. In fact, the logic is similar to
local copy on shards, but instead of writing to the shards,
Citus writes the results to a file.
The logic relies on LOCAL_COPY_FLUSH_THRESHOLD, and flushes
only when the size exceeds the threshold. But, unlike local
copy on shards, in this case we write the headers and footers
just once.
* Sort results in citus_shards and give raw size
Sort results so that it is consistent and also similar to citus_tables.
Use raw size in the output so that doing operations on the size is
easier.
* Change column ordering
With #4338, the executor is smart enough to failover to
local node if there is not enough space in max_connections
for remote connections.
For COPY, the logic is different. With #4034, we made COPY
work with the adaptive connection management slightly
differently. The cause of the difference is that COPY doesn't
know which placements are going to be accessed hence requires
to get connections up-front.
Similarly, COPY decides to use local execution up-front.
With this commit, we change the logic for COPY on local nodes:
Try to reserve a connection to local host. This logic follows
the same logic (e.g., citus.local_shared_pool_size) as the
executor because COPY also relies on TryToIncrementSharedConnectionCounter().
If reservation to local node fails, switch to local execution
Apart from this, if local execution is disabled, we follow the
exact same logic for multi-node Citus. It means that if we are
out of the connection, we'd give an error.
It seems that we were not considering the case where coordinator was
added to the cluster as a worker in the optimization of intermediate
results.
This could lead to errors when coordinator was added as a worker.
pg_get_tableschemadef_string doesn't know how to deparse identity
columns so we cannot reflect those columns when creating table
from scratch. For this reason, we don't allow using alter_table udfs
with tables having any identity cols.
pg_get_tableschemadef_string doesn't know how to deparse identity
columns so we cannot reflect those columns when creating shell
relation.
For this reason, we don't allow adding local tables -having identity cols-
to metadata.
Postgres doesn't allow inserting into columns having GENERATED ALWAYS
AS (...) STORED expressions.
For this reason, when executing undistribute_table or an alter_* udf,
we should skip copying such columns.
This is not bad since Postgres would already generate such columns.
Enables an overall plan to be parallel (e.g. over a partition
hierarchy), even though an individual ColumnarScan is not
parallel-aware.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
Previously, if columnar.enable_custom_scan was false, parallel paths
could remain, leading to an unexpected error.
Also, ensure that cheapest_parameterized_paths is cleared if a custom
scan is used.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
When finding columns owning sequences, we shouldn't rely on atthasdef
since it might be true when column has GENERATED ALWAYS AS (...)
STORED expression.
Since create_citus_local_table doesn't specify cascadeViaForeignKeys
option, we can't directly call citus_add_local_table_to_metadata
from create_citus_local_table.
Instead, implement an internal method and call it from deprecated udf
too.
* Fix partition column index issue
We send column names to worker_hash/range_partition_table methods, and
in these methods we check the column name index from tuple descriptor.
Then this index is used to decide the bucket that the current row will
be sent for the repartition.
This becomes a problem when there are the same column names in the
tupleDescriptor. Then we can choose the wrong index. Hence the
partitioned data will be put to wrong workers. Then the result could
miss some data because workers might contain different range of data.
An example:
TupleDescriptor contains "trip_id", "car_id", "car_id" for one table.
It contains only "car_id" for the other table. And assuming that the
tables will be partitioned by car_id, it is not certain what should be
used for deciding the bucket number for the first table. Assuming value
2 goes to bucket 2 and value 3 goes to bucket 3, it is not certain which
bucket "1 2 3" (trip_id, car_id, car_id) row will go to.
As a solution we send the index of partition column in targetList
instead of the column name.
The old API is kept so that if workers upgrade work, it still works
(though it will have the same bug)
* Use the same method so that backporting is easier
Fixing a division by zero in the cost calculations for scanning a columnar table.
Due to how the columns in a columnar table are counted an empty table would result in a division by zero. Instead this patch keeps the column selection ratio on zero when this happens, resulting in an accurate cost of zero pages to scan a columnar table.
fixes#4589
In pg11.8 it seemingly tries to parse the full sql file creating the extension,
since we use syntax introduced in postgres 12 this fails.
This patch rewrites the statement not recognized by pg11.8 to be dynamically
executed from a string literal via `EXECUTE`.
* Make undistribute_table() and citus_create_local_table() work with columnar
* Rename and use LocallyExecuteUtilityTask for UDF check
* Remove 'local' references in ExecuteUtilityCommand
As described in the comment, we have observed crashes in production
due to a segfault caused by the dereference of a NULL pointer in our
connection statemachine.
As a mitigation, preventing system crashes, we provide an error with
a small explanation of the issue. Unfortunately the case is not
reliably reproduced yet, hence the inability to add tests.
DESCRIPTION: Prevent segfaults when SAVEPOINT handling cannot recover from connection failures
Currently we choose an arbitrary colocation id from all the matches for
a colocation id. This could mean that 2 distributed tables, which have
the same scheme could go into different colocation groups. This fix
makes sure that the same match will go to the same colocation group.
/*
* Creating Citus local tables relies on functions that accesses
* shards locally (e.g., ExecuteAndLogDDLCommand()). As long as
* we don't teach those functions to access shards remotely, we
* cannot relax this check.
*/
The reason behind skipping postgres tables is that we support
foreign keys between postgres tables and reference tables
(without converting postgres tables to citus local tables)
when enable_local_reference_table_foreign_keys is false or
when coordinator is not added to metadata.
For certaion purposes, we drop and recreate the foreign
keys. As we acquire exclusive locks on the tables in between
drop and re-create, we can safely skip validation phase of
the foreign keys. The reason is purely being performance as
foreign key validation could take a long value.
When enabled any foreign keys between local tables and reference
tables supported by converting the local table to a citus local
table.
When the coordinator is not in the metadata, the logic is disabled
as foreign keys are not allowed in this configuration.
Because master_add_node(or others) might acquire ExclusiveLock
and their initiated sessions may call CoordinatorAddedAsWorkerNode().
With this we prevent potential deadlocks.
If relation is not involved in any foreign key relationships,
foreign key graph would not return any relations for given
relationId as expected.
But even if it's the case, we should still undistribute the table
itself.
* Stronger check for triggers on columnar tables (#4493).
Previously, we used a simple ProcessUtility_hook. Change to use an
object_access_hook instead.
* Replace alter_table_set_access_method test on partition with foreign key
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
Co-authored-by: Marco Slot <marco.slot@gmail.com>
With citus shard helper view, we can easily see:
- where each shard is, which node, which port
- what kind of table it belongs to
- its size
With such a view, we can see shards that have a size bigger than some
value, which could be useful. Also debugging can be easier in production
as well with this view.
Fetch shards in one go per node
The previous implementation was slow because it would do a lot of round
trips, one per shard to be exact. Hence it is improved so that we fetch
all the shard_name, shard-size pairs per node in one go.
Construct shards_names, sizes query on coordinator
* Replace master_add_node with citus_add_node
* Replace master_activate_node with citus_activate_node
* Replace master_add_inactive_node with citus_add_inactive_node
* Use master udfs in old scripts
* Replace master_add_secondary_node with citus_add_secondary_node
* Replace master_disable_node with citus_disable_node
* Replace master_drain_node with citus_drain_node
* Replace master_remove_node with citus_remove_node
* Replace master_set_node_property with citus_set_node_property
* Replace master_unmark_object_distributed with citus_unmark_object_distributed
* Replace master_update_node with citus_update_node
* Replace master_update_shard_statistics with citus_update_shard_statistics
* Replace master_update_table_statistics with citus_update_table_statistics
* Rename master_conninfo_cache_invalidate to citus_conninfo_cache_invalidate
Rename master_dist_local_group_cache_invalidate to citus_dist_local_group_cache_invalidate
* Replace master_copy_shard_placement with citus_copy_shard_placement
* Replace master_move_shard_placement with citus_move_shard_placement
* Rename master_dist_node_cache_invalidate to citus_dist_node_cache_invalidate
* Rename master_dist_object_cache_invalidate to citus_dist_object_cache_invalidate
* Rename master_dist_partition_cache_invalidate to citus_dist_partition_cache_invalidate
* Rename master_dist_placement_cache_invalidate to citus_dist_placement_cache_invalidate
* Rename master_dist_shard_cache_invalidate to citus_dist_shard_cache_invalidate
* Drop master_modify_multiple_shards
* Rename master_drop_all_shards to citus_drop_all_shards
* Drop master_create_distributed_table
* Drop master_create_worker_shards
* Revert old function definitions
* Add missing revoke statement for citus_disable_node
We used to need WarnAboutLeakedPreparedTransaction()
as we didn't have auto 2PC recovery. But, we long have
2PC recovery by https://github.com/citusdata/citus/pull/1574
So, we don't need anymore.
* Rethrow original concurrent index creation failure message
* Alter test outputs for concurrent index creation
* Detect duplicate table failure in concurrent index creation
* Add test for conc. index creation w/out duplicates
* Prevent deadlock for long named partitioned index creation on single node
* Create IsSingleNodeCluster function
* Use both local and sequential execution
On top of our foreign key graph, implement the infrastructure to get
list of relations that are connected to input relation via a foreign key
graph.
We need this to support cascading create_citus_local_table &
undistribute_table operations.
Also add regression tests to see what our foreign key graph is able to
capture currently.
With this commit, we remove visited flags from ForeignConstraintRelationshipNode
struct since keeping local state in global object is both dangerous and
meaningless.
Also to improve readability, this commit also converts needless recursion to
iterative DFS to avoid passing local hash-map as another parameter to
GetConnectedListHelper function.
Attribute number in a subquery RTE and relation RTE means different
things. In a relation attribute number will point to the column number
in the table definition including the dropped columns as well however in
subquery, it means the index in the target list. When we convert a
relation RTE to subquery RTE we should either correct all the relevant
attribute numbers or we can just add a dummy column for the dropped
columns. We choose the latter in this commit because it is practically
too vulnerable to update all the vars in a query.
Another thing this commit fixes is that in case a join restriction
clause list contains a false clause, we should just returns a false
clause instead of the whole list, because the whole list will contain
restrictions from other RTEs as well and this breaks the query, which
can be seen from the output changes, now it is much simpler.
Also instead of adding single tests for dropped columns, we choose to
run the whole mixed queries with tables with dropped columns, this
revealed some bugs already, which are fixed in this commit.
It seems that there are only very few cases where that is useful, and
for now we prefer not having that check. This means that we might
perform some unnecessary checks, but that should be rare and not
performance critical.
Instead of sending NULL's over a network, we now convert the subqueries
in the form of:
SELECT t.a, NULL, NULL FROM (SELECT a FROM table)t;
And we recursively plan the inner part so that we don't send the NULL's
over network. We still need the NULLs in the outer subquery because we
currently don't have an easy way of updating all the necessary places in
the query.
Add some documentation for how the conversion is done
Baseinfo also has pushed down filters etc, so it makes more sense to use
BaseRestrictInfo to determine what columns have constant equality
filters.
Also RteIdentity is used for removing conversion candidates instead of
rteIndex.
It seems that most of the updates were broken, we weren't aware of it
because there wasn't any data in the tables. They are broken mostly
because local tables do not have a shard id and some code paths should
be updated with that information, currently when there is an invalid
shard id, it is assumed to be pruned.
Consider local tables in router planner
In case there is a local table, the shard id will not be valid and there
are some checks that rely on shard id, we should skip these in case of
local tables, which is handled with a dummy placement.
Add citus local table dist table join tests
add local-dist table mixed joins tests
AllDataLocallyAccessible and ContainsLocalTableSubqueryJoin are removed.
We can possibly remove ModifiesLocalTableWithRemoteCitusLocalTable as
well. Though this removal has a side effect that now when all the data
is locally available, we could still wrap a relation into a subquery, I
guess that should be resolved in the router planner itself.
Add more tests
When we wrap an RTE to subquery we are updating the variables varno's as
1, however we should also update the varno's of vars in quals.
Also some other small code quality improvements are done.
The previous algorithm was not consistent and it could convert different
RTEs based on the table orders in the query. Now we convert local tables
if there is a distributed table which doesn't have a unique index. So if
there are 4 tables, local1, local2, dist1, dist2_with_pkey then we will
convert local1 and local2 in `auto` mode. Converting a distributed table
is not that logical because as there is a distributed table without a
unique index, we will need to convert the local tables anyway. So
converting the distributed table with pkey is redundant.
Remove FillLocalAndDistributedRTECandidates and use
ShouldConvertLocalTableJoinsToSubqueries, which simplifies things as we
rely on a single function to decide whether we should continue
converting RTE to subquery.
We should not recursively plan an already routable plannable query. An
example of this is (SELECT * FROM local JOIN (SELECT * FROM dist) d1
USING(a));
So we let the recursive planner do all of its work and at the end we
convert the final query to to handle unsupported joins. While doing each
conversion, we check if it is router plannable, if so we stop.
Only consider range table entries that are in jointree
If a range table is not in jointree then there is no point in
considering that because we are trying to convert range table entries to
subqueries for join use case.
Check equality in quals
We want to recursively plan distributed tables only if they have an
equality filter on a unique column. So '>' and '<' operators will not
trigger recursive planning of distributed tables in local-distributed
table joins.
Recursively plan distributed table only if the filter is constant
If the filter is not a constant then the join might return multiple rows
and there is a chance that the distributed table will return huge data.
Hence if the filter is not constant we choose to recursively plan the
local table.
When doing local-distributed table joins we convert one of them to
subquery. The current policy is that we convert distributed tables to
subquery if it has a unique index on a column that has unique
index(primary key also has a unique index).
The logical planner cannot handle joins between local and distributed table.
Instead, we can recursively plan one side of the join and let the logical
planner handle the rest.
Our algorithm is a little smart, trying not to recursively plan distributed
tables, but favors local tables.
UPDATEs on partitioned tables that affect only row partitions should
succeed, the rest should fail.
Also rename CStoreScan to ColumnarScan to make the error message more
relevant.
A utility function is added so that each caller can implement a handler
for each index on a given table. This means that the caller doesn't need
to worry about how to access each index, the only thing that it needs to
do each to implement a function to which each index on the table is
passed iteratively.
When Citus needs to parallelize queries on the local node (e.g., the node
executing the distributed query and the shards are the same), we need to
be mindful about the connection management. The reason is that the client
backends that are running distributed queries are competing with the client
backends that Citus initiates to parallelize the queries in order to get
a slot on the max_connections.
In that regard, we implemented a "failover" mechanism where if the distributed
queries cannot get a connection, the execution failovers the tasks to the local
execution.
The failover logic is follows:
- As the connection manager if it is OK to get a connection
- If yes, we are good.
- If no, we fail the workerPool and the failure triggers
the failover of the tasks to local execution queue
The decision of getting a connection is follows:
/*
* For local nodes, solely relying on citus.max_shared_pool_size or
* max_connections might not be sufficient. The former gives us
* a preview of the future (e.g., we let the new connections to establish,
* but they are not established yet). The latter gives us the close to
* precise view of the past (e.g., the active number of client backends).
*
* Overall, we want to limit both of the metrics. The former limit typically
* kics in under regular loads, where the load of the database increases in
* a reasonable pace. The latter limit typically kicks in when the database
* is issued lots of concurrent sessions at the same time, such as benchmarks.
*/
When distributing a columnar table, as well as changing options on a distributed columnar table, this patch will forward the settings from the coordinator to the workers.
For propagating options changes on an already distributed table this change is pretty straight forward. Before applying the change in options locally we will create a `DDLJob` that contains a call to `alter_columnar_table_set(...)` for every shard placement with all settings of the current table. This goes both for setting an option as well as resetting. This will reset the values to the defaults configured on the coordinator. Having the effect that the coordinator is authoritative on the settings and makes sure the shards have the same settings set as the table on the coordinator.
When a columnar table is distributed it is using the `TableDDLCommand` infra structure to create a new kind of `TableDDLCommand`. This new type, called a `TableDDLCommandFunction` contains a context and 2 function pointers to execute. One function returns the command as applied on the table, the second function will return the sql command to apply to a shard with a given shard id. The schema name is ignored as it will use the fully qualified name of the shard in the same schema as the base table.
Multi-row execution already uses sequential execution. When shards
are local, using local execution is profitable as it avoids
an extra connection establishment to the local node.
If MemoryContextAlloc errors out -e.g. during an OOM-, ConnectionHashEntry->connections
stays as NULL.
With this commit, we add isValid flag to ConnectionHashEntry that should be set to true
right after we allocate & initialize ConnectionHashEntry->connections list properly, and we
check it before accesing to ConnectionHashEntry->connections.
Columnar options were by accident linked to the relfilenode instead of the regclass/relation oid. This PR moves everything related to columnar options to their own catalog table.
The name of the function is different than the implemantation. Because
the function is designed to only consider SELECT queries. Also this
changes the assert with an error.
Refactor internals on how Citus creates the SQL commands it sends to recreate shards.
Before Citus collected solely ddl commands as `char *`'s to recreate a table. If they were used to create a shard they were wrapped with `worker_apply_shard_ddl_command` and send to the workers. On the workers the UDF wrapping the ddl command would rewrite the parsetree to replace tables names with their shard name equivalent.
This worked well, but poses an issue when adding columnar. Due to limitations in Postgres on creating custom options on table access methods we need to fall back on a UDF to set columnar specific options. Now, to recreate the table, we can not longer rely on having solely DDL statements to recreate a table.
A prototype was made to run this UDF wrapped in `worker_apply_shard_ddl_command`. This became pretty messy, hard to understand and subsequently hard to maintain.
This PR proposes a refactor of the internal representation of table ddl commands into a `TableDDLCommand` structure. The current implementation only supports a `char *` as its contents. Based on the use of the DDL statement (eg. creating the table -mx- or creating a shard) one of two different functions can be called to get the statement to send to the worker:
- `GetTableDDLCommand(TableDDLCommand *command)`: This function returns that ddl command to create the table. In this implementation it will just return the `char *`. This has the same functionality as getting the old list and not wrapping it.
- `GetShardedTableDDLCommand(TableDDLCommand *command, uint64 shardId, char *schemaName)`: This function returns the ddl command wrapped in `worker_apply_shard_ddl_command` with the `shardId` as an argument. Due to backwards compatibility it also accepts a. `schemaName`. The exact purpose is not directly clear. Ideally new implementations would work with fully qualified statements and ignore the `schemaName`.
A future implementation could accept 2.function pointers and a `void *` for context to let the two pointers work on. This gives greater flexibility in controlling what commands get send in which situations. Also, in a future, we could implement the intermediate step of creating the `parsetree` datastructure of statements based on the contents in the catalog with a corresponding deparser. For sharded queries a mutator could be ran over the parsetree to rewrite the tablenames to the names with the shard identifier. This will completely omit the requirement for `worker_apply_shard_ddl_command`.
Considering the adaptive connection management
improvements that we plan to roll soon, it makes it
very helpful to know the number of active client
backends.
We are doing this addition to simplify yhe adaptive connection
management for single node Citus. In single node Citus, both the
client backends and Citus parallel queries would compete to get
slots on Postgres' `max_connections` on the same Citus database.
With adaptive connection management, we have the counters for
Citus parallel queries. That helps us to adaptively decide
on the remote executions pool size (e.g., throttle connections
if necessary).
However, we do not have any counters for the total number of
client backends on the database. For single node Citus, we
should consider all the client backends, not only the remote
connections that Citus does.
Of course Postgres internally knows how many client
backends are active. However, to get that number Postgres
iterates over all the backends. For examaple, see [pg_stat_get_db_numbackends](8e90ec5580/src/backend/utils/adt/pgstatfuncs.c (L1240))
where Postgres iterates over all the backends.
For our purpuses, we need this information on every connection
establishment. That's why we cannot affort to do this kind of
iterattion.
CitusTableTypeIdList() function iterates on all the entries of pg_dist_partition
and loads all the metadata in to the cache. This can be quite memory intensive
especially when there are lots of distributed tables.
When partitioned tables are used, it is common to have many distributed tables
given that each partition also becomes a distributed table.
CitusTableTypeIdList() is used on every CREATE TABLE .. PARTITION OF.. command
as well. It means that, anytime a partition is created, Citus loads all the
metadata to the cache. Note that Citus typically only loads the accessed table's
metadata to the cache.
* Move local execution after the remote execution
Before this commit, when both local and remote tasks
exist, the executor was starting the execution with
local execution. There is no strict requirements on
this.
Especially considering the adaptive connection management
improvements that we plan to roll soon, moving the local
execution after to the remote execution makes more sense.
The adaptive connection management for single node Citus
would look roughly as follows:
- Try to connect back to the coordinator for running
parallel queries.
- If succeeds, go on and execute tasks in parallel
- If fails, fallback to the local execution
So, we'll use local execution as a fallback mechanism. And,
moving it after to the remote execution allows us to implement
such further scenarios.
The adaptive executor emulates the TCP's slow start algorithm.
Whenever the executor needs new connections, it doubles the number
of connections established in the previous iteration.
This approach is powerful. When the remote queries are very short
(like index lookup with < 1ms), even a single connection is sufficent
most of the time. When the remote queries are long, the executor
can quickly establish necessary number of connections.
One missing piece on our implementation seems that the executor
keeps doubling the number of connections even if the previous
connection attempts have been finalized. Instead, we should
wait until all the attempts are finalized. This is how TCP's
slow-start works. Plus, it decreases the unnecessary pressure
on the remote nodes.
I got this warning when compiling citus:
```
../columnar/write_state_management.c: In function ‘PendingWritesInUpperTransactions’:
../columnar/write_state_management.c:364:20: warning: ‘entry’ may be used uninitialized in this function [-Wmaybe-uninitialized]
if (found && entry->writeStateStack != NULL)
~~~~~^~~~~~~~~~~~~~~~
```
I fixed this by checking by always initializing entry, by using an early
return if `WriteStateMap` didn't exist. Instead of using the `found`
variable to check for existence of the key, I now simply check the
`entry` variable itself.
To quote the postgres comment on the hash_enter function:
> If foundPtr isn't NULL, then *foundPtr is set true if we found an
> existing entry in the table, false otherwise. This is needed in the
> HASH_ENTER case, but is redundant with the return value otherwise.
Before this commit, we let AdaptiveExecutorPreExecutorRun()
to be effective multiple times on every FETCH on cursors.
That does not affect the correctness of the query results,
but adds significant overhead.
It seems that we forgot to pass the revelant
flag to enable Postgres' parallel query
capabilities on the shards when user does
EXPLAIN ANALYZE on a distributed table.
TableAM API doesn't allow us to pass around a state variable along all of the tuple inserts belonging to the same command. We require this in columnar store, since we batch them, and when we have enough rows we flush them as stripes.
To do that, we keep a (relfilenode) -> stack of (subxact id, TableWriteState) global mapping.
**Inserts**
Whenever we want to insert a tuple, we look up for the relation's relfilenode in this mapping. If top of the stack matches current subtransaction, we us the existing TableWriteState. Otherwise, we allocate a new TableWriteState and push it on top of stack.
**(Sub)Transaction Commit/Aborts**
When the subtransaction or transaction is committed, we flush and pop all entries matching current SubTransactionId.
When the subtransaction or transaction is committed, we pop all entries matching current SubTransactionId and discard them without flushing.
**Reads**
Since we might have unwritten rows which needs to be read by a table scan, we flush write states on SELECTs. Since flushing the write state of upper transactions in a subtransaction will cause metadata being written in wrong subtransaction, we ERROR out if any of the upper subtransactions have unflushed rows.
**Table Drops**
We record in which subtransaction the table was dropped. When committing a subtransaction in which table was dropped, we propagate the drop to upper transaction. When aborting a subtransaction in which table was dropped, we mark table as not deleted.
If one wishes to iterate through a List and insert list elements in
PG13, it is not safe to use for_each_ptr as the List representation
in PostgreSQL no longer linked lists, but arrays, and it is possible
that the whole array is repalloc'ed if ther is not sufficient space
available.
See postgres commit 1cff1b95ab6ddae32faa3efe0d95a820dbfdc164 for more
information
When a relation is used on an OUTER JOIN with FALSE filters,
set_rel_pathlist_hook may not be called for the table.
There might be other cases as well, so do not rely on the hook
for classification of the tables.
RemoveDuplicateJoinRestrictions() function was introduced with the aim of decrasing the overall planning times by eliminating the duplicate JOIN restriction entries (#1989). However, it turns out that the function itself is so CPU intensive with a very high algorithmic complexity, it hurts a lot more than it helps. The function is a clear example of premature optimization.
The table below shows the difference clearly:
"distributed query planning
time master" RemoveDuplicateJoinRestrictions() execution time on master "Remove the function RemoveDuplicateJoinRestrictions()
this PR"
5 table INNER JOIN 9 msec 2msec 7 msec
10 table INNER JOIN 227 msec 194 msec 29 msec
20 table INNER JOIN 1 sec 235 msec 1 sec 139 msec 90 msecs
50 table INNER JOIN 24 seconds 21 seconds 1.5 seconds
100 table INNER JOIN 2 minutes 16 secods 1 minute 53 seconds 23 seconds
250 table INNER JOIN Bottleneck on JoinClauseList 18 minutes 52 seconds Bottleneck on JoinClauseList
5 table INNER JOIN in subquery 9 msec 0 msec 6 msec
10 table INNER JOIN subquery 33 msec 10 msec 32 msec
20 table INNER JOIN subquery 132 msec 67 msec 123 msec
50 table INNER JOIN subquery 1.2 seconds 900 msec 500 msec
100 table INNER JOIN subquery 6 seconds 5 seconds 2 seconds
250 table INNER JOIN subquery 54 seconds 37 seconds 20 seconds
5 table LEFT JOIN 5 msec 0 msec 5 msec
10 table LEFT JOIN 11 msec 0 msec 13 msec
20 table LEFT JOIN 26 msec 2 msec 30 msec
50 table LEFT JOIN 150 msec 15 msec 193 msec
100 table LEFT JOIN 757 msec 71 msec 722 msec
250 table LEFT JOIN 8 seconds 600 msec 8 seconds
5 JOINs among 2 table JOINs 37 msec 11 msec 25 msec
10 JOINs among 2 table JOINs 536 msec 306 msec 352 msec
20 JOINs among 2 table JOINs 794 msec 181 msec 640 msec
50 JOINs among 2 table JOINs 25 seconds 2 seconds 22 seconds
100 JOINs among 2 table JOINs Bottleneck on JoinClauseList 9 seconds Bottleneck on JoinClauseList
150 JOINs among 2 table JOINs Bottleneck on JoinClauseList 46 seconds Bottleneck on JoinClauseList
On top of the performance penalty, the function had a critical bug #4255, and with #4254 we hit one more important bug. It should be fixed by adding the followig check to the ContextCoversJoinRestriction():
```
static bool
JoinRelIdsSame(JoinRestriction *leftRestriction, JoinRestriction *rightRestriction)
{
Relids leftInnerRelIds = leftRestriction->innerrel->relids;
Relids rightInnerRelIds = rightRestriction->innerrel->relids;
if (!bms_equal(leftInnerRelIds, rightInnerRelIds))
{
return false;
}
Relids leftOuterRelIds = leftRestriction->outerrel->relids;
Relids rightOuterRelIds = rightRestriction->outerrel->relids;
if (!bms_equal(leftOuterRelIds, rightOuterRelIds))
{
return false;
}
return true;
}
```
However, adding this eliminates all the benefits tha RemoveDuplicateJoinRestrictions() brings.
I've used the commands here to generate the JOINs mentioned in the PR: https://gist.github.com/onderkalaci/fe8654f9df5916c7af4c7c5eb892561e#file-gistfile1-txt
Inner and outer JOINs behave roughly the same, to simplify the table only added INNER joins.
* Fix incorrect join related fields
Ruleutils expect to give the original index of join columns hence we
should consider the dropped columns while setting the fields in
SetJoinRelatedFieldsCompat.
* add some more tests for joins
* Move tests to join.sql and create a utility function
Disallow `ON TRUE` outer joins with reference & distributed tables
when reference table is outer relation by fixing the logic bug made
when calling `LeftListIsSubset` function.
Also, be more defensive when removing duplicate join restrictions
when join clause is empty for non-inner joins as they might still
contain useful information for non-inner joins.
It seems like Postgres could call set_rel_pathlist() for
the same relation multiple times. This breaks the logic
where we assume relationCount eqauls to the number of
entries in relationRestrictionList.
In summary, relationRestrictionList may contain duplicate
entries.
With this commit, we make sure that local execution adds the
intermediate result size as the distributed execution adds. Plus,
it enforces the citus.max_intermediate_result_size value.
We should not access CurrentLocalExecutionStatus directly because that
would mean that we could also set it directly, which we shouldn't
because we have checks to see if the new state is possible, otherwise we
error.
Before this commit, the logic was:
- As long as the outer side of the JOIN is not a JOIN (e.g., relation
or subquery etc.), we check for the existence of any recurring
tuples. There were two implications of this decision.
First, even if a subquery which is on the outer side contains
distributed table JOIN reference table, Citus would unnecessarily throw
an error. Note that, the JOIN inside the subquery would already
be going to be tested recursively. But, as long as that check
passes, there is no reason for the upper JOIN to fail. An example, which
used to fail and now works:
SELECT * FROM (SELECT * FROM dist JOIN ref) as foo LEFT JOIN dist;
Second, certain JOINs, especially with ON (true) conditions were not
represented as Citus expects the JOINs to be in the format
DeferredErrorIfUnsupportedRecurringTuplesJoin().
Use short lived per-tuple context in citus_evaluate_expr like
(pg) evaluate_expr does.
We should not use planState->ExprContext when evaluating expressions
as it might lead to freeing the same executor twice (first one happens
in citus_evaluate_expr itself and the other one happens when postgres
doing clean-up for the top level executor state), which in turn might
cause seg.faults.
However, now as we don't have necessary planState info to evaluate
prepared statements, we also add planState->es_param_list_info to
per-tuple ExprContext.
With postgres 13, there is a global lock that prevents multiple VACUUMs
happening in the current database. This global lock is taken for a short
time but this creates a problem because of the following:
- We execute the VACUUM for the shell table through the standard process
utility. In this step the global lock is taken for the current database.
- If the current node has shard placements then it tries to execute
VACUUM over a connection to localhost with ExecuteUtilityTaskList.
- the VACUUM on shard placements cannot proceed because it is waiting
for the global lock for the current database to be released.
- The acquired lock from the VACUUM for shell table will not be released
until the transaction is committed.
- So there is a deadlock.
As a solution, we commit the current transaction in case of VACUUM after
the VACUUM is executed for the shell table. Executing the VACUUM on a
shell table is not important because the data there will probably be
truncated. PostprocessVacuumStmt takes the necessary locks on the shell
table so we don't need to take any extra locks after we commit the
current transaction.
The prepare for upgrade script creates the `'public.pg_dist_rebalance_strategy` table which is not dropped when the upgrade is finished. This may block future upgrades.
Multi-row & router INSERT's were crashing with local execution if at
least one of the DEFAULT columns were not specified in VALUES list.
This was because, the changes we make on query->values_lists and
query->targetList was sufficient for deparsing given INSERT for remote
execution but not sufficient for local execution.
With this commit, DEFAULT value normalization for multi-row & router
INSERT's is fixed by adding dummy column references for unspecified
DEFAULT columns.
Citus has the logic to truncate the long shard names to prevent
various issues, including self-deadlocks. However, for partitioned
tables, when index is created on the parent table, the index names
on the partitions are auto-generated by Postgres. We use the same
Postgres function to generate the index names on the shards of the
partitions. If the length exceeds the limit, we switch to sequential
execution mode.
In postmasters execution of _PG_init, IsUnderPostmaster will be false and
we want to do the cleanup at that time only, otherwise there is a chance that
there will be parallel queries and we might do a cleanup for things that are
already in use.
After the connection timeout, we fail the session/pool. However, the
underlying connection can still be trying to connect. That is dangerous
because the new placement executions have already been in place. The
executor cannot handle the situation where multiple of
EXECUTION_ORDER_ANY task executions succeeds.
Adding a regression test doesn't seem easily doable. To reproduce the issue
- Add 2 worker nodes
- create a reference table
- set citus.node_connection_timeout to 1ms (requires code change)
- Continiously execute `SELECT count(*) FROM ref_table`
- Sometime later, you hit an out-of-array access in
`ScheduleNextPlacementExecution()` hence crashing.
- The reason for that is sometimes the first connection
successfully established while the executor is already
trying to execute the query on the second node.
We currently do not support volatile functions in update/delete statements
because the function evaluation logic does not know how to distinguish
volatile functions (that need to be evaluated per row) from stable functions
(that need to be evaluated per query), and it is also not safe to push the
volatile functions down on replicated tables.
Add sort method parameter for regression tests
Fix check-style
Change sorting method parameters to enum
Polish
Add task fields to OutTask
Add test into multi_explain
Fix isolation test
As the previous versions of Citus don't know how to handle citus local
tables, we should prevent downgrading from 9.5 to older versions if any
citus local tables exists.
Pushing down the CALLs to the node that the CALL is executed is
dangerous and could lead to infinite recursion.
When the coordinator added as worker, Citus was by chance preventing
this. The coordinator was marked as "not metadatasynced" node
in pg_dist_node, which prevented CALL/function delegation to happen.
With this commit, we do the following:
- Fix metadatasynced column for the coordinator on pg_dist_node
- Prevent pushdown of function/procedure to the same node that
the function/procedure is being executed. Today, we do not sync
pg_dist_object (e.g., distributed functions metadata) to the
worker nodes. But, even if we do it now, the function call delegation
would prevent the infinite recursion.
* Not take ShareUpdateExlusiveLock on pg_dist_transaction
We were taking ShareUpdateExlusiveLock on pg_dist_transaction during
recovery to prevent multiple recoveries happening concurrenly. VACUUM(
not FULL) also takes ShareUpdateExclusiveLock, and they can conflict. It
seems that VACUUM will skip the table if there is a conflicting lock
already taken unless it is doing the vacuum to prevent id wraparound, in
which case there can be a deadlock. I guess the deadlock happens if:
- VACUUM takes a lock on pg_dist_transaction and is done for id
wraparound problem
- The transaction in the maintenance tries to take a lock but
cannot as that conflicts with the lock acquired by VACUUM
- The transaction in the maintenance daemon has a very old xid hence
VACUUM cannot proceed.
If we take a row exclusive lock in transaction recovery then it wouldn't
conflict with VACUUM hence it could proceed so the deadlock would be
resolved. To prevent concurrent transaction recoveries happening, an
advisory lock is taken with ShareUpdateExlusiveLock as before.
* Use CITUS_OPERATIONS tag
* Not allow removing a single node with ref tables
We should not allow removing a node if it is the only node in the
cluster and there is a data on it. We have this check for distributed
tables but we didn't have it for reference tables.
* Update src/test/regress/expected/single_node.out
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
* Update src/test/regress/sql/single_node.sql
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
This commit brings following features:
Foreign key support from citus local tables to reference tables
* Foreign key support from reference tables to citus local tables
(only with RESTRICT & NO ACTION behavior)
* ALTER TABLE ENABLE/DISABLE trigger command support
* CREATE/DROP/ALTER trigger command support
and disallows:
* ALTER TABLE ATTACH/DETACH PARTITION commands
* CREATE TABLE <postgres table> ATTACH PARTITION <citus local table>
commands
* Foreign keys from postgres tables to citus local tables
(the other way was already disallowed)
for citus local tables.
Introduce table entry utility functions
Citus table cache entry utilities are introduced so that we can easily
extend existing functionality with minimum changes, specifically changes
to these functions. For example IsNonDistributedTableCacheEntry can be
extended for citus local tables without the need to scan the whole
codebase and update each relevant part.
* Introduce utility functions to find the type of tables
A table type can be a reference table, a hash/range/append distributed
table. Utility methods are created so that we don't have to worry about
how a table is considered as a reference table etc. This also makes it
easy to extend the table types.
* Add IsCitusTableType utilities
* Rename IsCacheEntryCitusTableType -> IsCitusTableTypeCacheEntry
* Change citus table types in some checks
create_distributed_function(function_name,
distribution_arg_name,
colocate_with text)
This UDF did not allow colocate_with parameters when there were no
disttribution_arg_name supplied. This commit changes the behaviour to
allow missing distribution_arg_name parameters when the function should
be colocated with a reference table.
* Hide citus.subquery_pushdown flag
This flag is dangerous and could likely to let queries
return wrong results.
The flag has a very specific purpose for a very specific
data distribution and query structure. In those cases, when
the flag is set, the user can skip recursive planning altogether
*at their own risk*.
The meaning of the flag is that "I know what I'm doing such that
the query structure/data distribution is on my control, so Citus
can skip many correctness checks".
For regular users, enabling this flag is discouraged. We have to
keep the support only for backward compatibility for some users.
In addition to that, give a NOTICE to discourage new users to
use it.
RemoveCoordinatorPlacement does not do what it says. It removes the
coordinator placement only if there are other placements, so it is not a
single node, and only if the coordinator has a placement.
AllTargetExpressionsAreColumnReferences would return false if a query
had an entry that is referencing the outer query. It seems safe to not
have this for non-distributed tables, such as reference tables. We
already have separate checks for other cases such as having limits.
FindNodeCheck is not clear about what the function is doing. They are
renamed to FindNodeMatchingCheckFunctionXXX. Also for choosing elements in these
functions, CheckNodeFunc type is introduced.
It seems that currently we process even postgres tables in explain
commands. This is because we register a hook for explain and we don't
have any check to see if the query has any citus table.
With this commit, we now send the buffer usage as well to the relevant
API. There is some duplicate in the code but it is because of the
existing structure, we can refactor this separately.
The codebase is updated to use varattnosync and varnosyn and we defined
the macros for older versions. This way we can just remove the macros
when we drop an older version.
CMDTAG_SELECT exists in PG12 hence defining a MACRO such as
CMDTAG_SELECT -> "SELECT" is not possible. I chose CMDTAG_SELECT_COMPAT
because with the COMPAT suffix it is explicit that it maps to different
things in different versions and also has a less chance of mapping
something irrevelant. For example if we used SELECT as a macro, then it
would map every SELECT to whatever it is mapping to, which might have
unexpected/undesired behaviour.
The error message when index has opclassopts is improved and the commit
from postgres side is also included for future reference.
Also some minor style related changes are applied.
Error out if index has opclassopts.
Changelog entry on PG13:
Allow CREATE INDEX to specify the GiST signature length and maximum number of integer ranges (Nikita Glukhov)
Postgres 13 added a new VACUUM option, PARALLEL. It is now supported
in our code as well.
Relevant changelog message on postgres:
Allow VACUUM to process indexes in parallel (Masahiko Sawada, Amit Kapila)
This commit mostly adds pg_get_triggerdef_command to our ruleutils_13.
This doesn't add anything extra for ruleutils 13 so it is basically a copy
of the change on ruleutils_12
When there is a join alias, var->varnosync will point to the alias and
var->varno will point to the table itself, but we need to use the alias
when deparsing the query. Hence a workaround is introduced to solve this
problem in ruleutils. Normally this case can be understood with
dpns->plan == NULL check but in our case, dpns->plan is always NULL. We
should sync our ruleutils at some point with postgres ruleutils. This
could be a wrong solution as well but the tests pass.
Rte index is increased by range table index offset in pg >= 13. The
offset is removed with the pg >= 13.
Currently pushdown for union all is disabled because translatedVars is
set to nil on postgres side, and we were using translatedVars to
figure out if partition key has the same index in both sides of union
all. This should be fixed.
Commit on postgres side:
6ef77cf46e81f45716ec981cb08781d426181378
fix union all pushdown logic for pg13
Before pg 13, there was a field, translatedVars, and we were using that
to understand if the partition key has the same index on both sides of
the union all. With pg13 there is a parent_colnos field in appendRelInfo
and we can use that to get the attribute numbers(varattnos) in union all
vars. We make use of parent_colnos instead of translatedVars in pg >=13.
For joins 3 new fields are added, joinleftcols, joinrightcols, and
joinmergedcols. We are not interested in joinmergedcols because we
always expand the column used in joins. There joinmergedcols is always 0
in our case.
For filling joinleftcols and joinrightcols we basically construct the
lists with sequences so either list is of the form: [1 2 3 4 .... n]
Ruleutils is not completed synced with postgres ruleutils and the most
important part is identify_join_columns function change, which now uses
joinleftcols and joinrightcols.
Commit on postgres side:
9ce77d75c5ab094637cc4a446296dc3be6e3c221
A useful email thread:
https://www.postgresql.org/message-id/flat/7115.1577986646%40sss.pgh.pa.us#0ae1d66feeb400013fbaa67a7cccd6ca
PG13 uses joinmergedcols, joinleftcols and joinrightcols for finding
join order now. There relevant fields are set on citus side.
Postgres side commit:
9ce77d75c5ab094637cc4a446296dc3be6e3c221
Postgres changed some join related fields and therefore they also
changed ruleutils, this commit applies those changes to our copy of
ruleutils.
Related commit on postgres side:
9ce77d75c5ab094637cc4a446296dc3be6e3c221
Postgres introduced QueryCompletion struct. Hence a compat utility is
added to finish query completion for older versions and pg >= 13.
The commit on Postgres side:
2f9661311b83dc481fc19f6e3bda015392010a40
addRangeTableEntryXXX methods return a ParseNamespaceItem with pg >= 13.
RangeTableEntryFromNSItem macro is added so that we return the range
table entry from the ParseNamespaceItem in pg>=13 and for pg < 13 rte
would already be returned with addRangeTableEntryXXX methods.
Commit on Postgres side:
5815696bc66b3092f6361f53e0394909647042c8
Since PG13 changed the list, a listcell doesn't contain data anymore.
Therefore Set_ptr_value macro is created, so that depending on the
version it will either use cell->data.ptr_value or cell->ptr_value.
Commit on Postgres side:
1cff1b95ab6ddae32faa3efe0d95a820dbfdc164
Since ExplainOnePlan expects BufferUsage as well with PG >= 13,
ExplainOnePlanCompat is added.
Commit on Postgres side:
ed7a5095716ee498ecc406e1b8d5ab92c7662d10
Commit on postgres side:
05d8449e73694585b59f8b03aaa087f04cc4679a
Command on postgres side:
git log --all --grep="hashutils"
include common/hashfn.h for pg >= 13
tag_hash was moved from hsearch.h to hashutils.h then to hashfn.h
Commits on Postgres side:
9341c783cc42ffae5860c86bdc713bd47d734ffd
PortalDefineQuery doesn't accept char* for command tag anymore with PG
>= 13. We are currently only using it with Select, therefore a Portal
define query compat for select is created.
Commit on PG side:
2f9661311b83dc481fc19f6e3bda015392010a40
As the new planner and pg_plan_query_compat methods expect the query
string as well, macros are defined to be compatible in different
versions of postgres.
Relevant commit on Postgres:
6aba63ef3e606db71beb596210dd95fa73c44ce2
Command on Postgres:
git log --all --grep="pg_plan_query"
With PG13 heap_* (heap_open, heap_close etc) are replaced with table_*
(table_open, table_close etc).
It is better to use the new table access methods in the codebase and
define the macros for the previous versions as we can easily remove the
macro without having to change the codebase when we drop the support for
the old version.
Commits that introduced this change on Postgres:
f25968c49697db673f6cd2a07b3f7626779f1827
e0c4ec07284db817e1f8d9adfb3fffc952252db0
4b21acf522d751ba5b6679df391d5121b6c4a35f
Command to see relevant commits on Postgres side:
git log --all --grep="heap_open"
Pass the list to lnext API
lnext API now expects the list as well.
The commit on Postgres that introduced the change: 1cff1b95ab6ddae32faa3efe0d95a820dbfdc164
lnext_compat and list_delete_cell_compat macros are introduced so that
we can use these macros in the codebase without having to use #if
directives in the codebase.
Related commit on postgres:
1cff1b95ab6ddae32faa3efe0d95a820dbfdc164
Command to search in postgres:
git log --all --grep="list_delete_cell"
add ListCellAndListWrapper
When iterating a list in separate function calls, we need both the list
and the current cell starting from PG13, therefore
ListCellAndListWrapper is added to store both as a wrapper.
Use ListCellAndListWrapper in foreign key test udfs
As we iterate a list in these udfs using a functionContext, we need to
use the wrapper to be able to access both the list and the current cell.
With this patch, we introduce `locally_reserved_shared_connections.c/h` files
which are responsible for reserving some space in shared memory counters
upfront.
We sometimes need to reserve connections, but not necessarily
establish them. For example:
- COPY command should reserve connections as it cannot know which
connections it needs in which order. COPY establishes connections
as any input data hits the workers. For example, for router COPY
command, it only establishes 1 connection.
As discussed here (https://github.com/citusdata/citus/pull/3849#pullrequestreview-431792473),
COPY needs to reserve connections up-front, otherwise we can end
up with resource starvation/un-detected deadlocks.
* ensure propagation of CHECK statements to workers with parantheses & adjust regression test outputs
* add tests for distributing tables with simple CHECK constraints
* added test for CHECK on bool variable
Enable custom aggregates with multiple parameters to be executed on workers.
#2921 introduces distributed execution of custom aggregates. One of the limitations of this feature is that only aggregate functions with a single aggregation parameter can be pushed to worker nodes. Aim of this change is to remove that limitation and support handling of multi-parameter aggregates.
Resolves: #3997
See also: #2921
* Use CalculateUniformHashRangeIndex in HashPartitionId
INT32_MIN definition can change among different platforms hence it is
possible to get overflow, we would see crashes because of this in debian
distros. We have already solved a similar problem with introducing
CalculateUniformHashRangeIndex method, hence to solve it we can use the
same method, this also removes some duplication and has a single place
to decide that.
* Use PG_INT32_XX instead of INT32_XX to be safer
With adaptive connection management, we might have some connections
which are not fully initialized. Those connections should not be
qualified as available.
1) Rename CONNECTION_PER_PLACEMENT to REQUIRE_CLEAN_CONNECTION. This is
mostly to make things clear as the new name reveals more.
2) We also make sure that mark all the copy connections critical,
even if they are accessed earlier in the transction
The executor relies on WorkerPool, and many other places rely on WorkerNode.
With this commit, we make sure that they are sorted via the same function/logic.
Some GUCs support a list of values which is indicated by GUC_LIST_INPUT flag.
When an ALTER ROLE .. SET statement is executed, the new configuration
default for affected users and databases are stored in the
setconfig(text[]) column in a pg_db_role_setting record.
If a GUC that supports a list of values is used in an ALTER ROLE .. SET
statement, we need to split the text into items delimited by commas.
* use adaptive executor even if task-tracker is set
* Update check-multi-mx tests for adaptive executor
Basically repartition joins are enabled where necessary. For parallel
tests max adaptive executor pool size is decresed to 2, otherwise we
would get too many clients error.
* Update limit_intermediate_size test
It seems that when we use adaptive executor instead of task tracker, we
exceed the intermediate result size less in the test. Therefore updated
the tests accordingly.
* Update multi_router_planner
It seems that there is one problem with multi_router_planner when we use
adaptive executor, we should fix the following error:
+ERROR: relation "authors_range_840010" does not exist
+CONTEXT: while executing command on localhost:57637
* update repartition join tests for check-multi
* update isolation tests for repartitioning
* Error out if shard_replication_factor > 1 with repartitioning
As we are removing the task tracker, we cannot switch to it if
shard_replication_factor > 1. In that case, we simply error out.
* Remove MULTI_EXECUTOR_TASK_TRACKER
* Remove multi_task_tracker_executor
Some utility methods are moved to task_execution_utils.c.
* Remove task tracker protocol methods
* Remove task_tracker.c methods
* remove unused methods from multi_server_executor
* fix style
* remove task tracker specific tests from worker_schedule
* comment out task tracker udf calls in tests
We were using task tracker udfs to test permissions in
multi_multiuser.sql. We should find some other way to test them, then we
should remove the commented out task tracker calls.
* remove task tracker test from follower schedule
* remove task tracker tests from multi mx schedule
* Remove task-tracker specific functions from worker functions
* remove multi task tracker extra schedule
* Remove unused methods from multi physical planner
* remove task_executor_type related things in tests
* remove LoadTuplesIntoTupleStore
* Do initial cleanup for repartition leftovers
During startup, task tracker would call TrackerCleanupJobDirectories and
TrackerCleanupJobSchemas to clean up leftover directories and job
schemas. With adaptive executor, while doing repartitions it is possible
to leak these things as well. We don't retry cleanups, so it is possible
to have leftover in case of errors.
TrackerCleanupJobDirectories is renamed as
RepartitionCleanupJobDirectories since it is repartition specific now,
however TrackerCleanupJobSchemas cannot be used currently because it is
task tracker specific. The thing is that this function is a no-op
currently.
We should add cleaning up intermediate schemas to DoInitialCleanup
method when that problem is solved(We might want to solve it in this PR
as well)
* Revert "remove task tracker tests from multi mx schedule"
This reverts commit 03ecc0a681.
* update multi mx repartition parallel tests
* not error with task_tracker_conninfo_cache_invalidate
* not run 4 repartition queries in parallel
It seems that when we run 4 repartition queries in parallel we get too
many clients error on CI even though we don't get it locally. Our guess
is that, it is because we open/close many connections without doing some
work and postgres has some delay to close the connections. Hence even
though connections are removed from the pg_stat_activity, they might
still not be closed. If the above assumption is correct, it is unlikely
for it to happen in practice because:
- There is some network latency in clusters, so this leaves some times
for connections to be able to close
- Repartition joins return some data and that also leaves some time for
connections to be fully closed.
As we don't get this error in our local, we currently assume that it is
not a bug. Ideally this wouldn't happen when we get rid of the
task-tracker repartition methods because they don't do any pruning and
might be opening more connections than necessary.
If this still gives us "too many clients" error, we can try to increase
the max_connections in our test suite(which is 100 by default).
Also there are different places where this error is given in postgres,
but adding some backtrace it seems that we get this from
ProcessStartupPacket. The backtraces can be found in this link:
https://circleci.com/gh/citusdata/citus/138702
* Set distributePlan->relationIdList when it is needed
It seems that we were setting the distributedPlan->relationIdList after
JobExecutorType is called, which would choose task-tracker if
replication factor > 1 and there is a repartition query. However, it
uses relationIdList to decide if the query has a repartition query, and
since it was not set yet, it would always think it is not a repartition
query and would choose adaptive executor when it should choose
task-tracker.
* use adaptive executor even with shard_replication_factor > 1
It seems that we were already using adaptive executor when
replication_factor > 1. So this commit removes the check.
* remove multi_resowner.c and deprecate some settings
* remove TaskExecution related leftovers
* change deprecated API error message
* not recursively plan single relatition repartition subquery
* recursively plan single relation repartition subquery
* test depreceated task tracker functions
* fix overlapping shard intervals in range-distributed test
* fix error message for citus_metadata_container
* drop task-tracker deprecated functions
* put the implemantation back to worker_cleanup_job_schema_cachesince citus cloud uses it
* drop some functions, add downgrade script
Some deprecated functions are dropped.
Downgrade script is added.
Some gucs are deprecated.
A new guc for repartition joins bucket size is added.
* order by a test to fix flappiness
Here are the updated make targets:
- install: install everything except downgrade scripts.
- install-downgrades: build and install only the downgrade migration scripts.
- install-all: install everything along with the downgrade migration scripts.
The reason we should use ActiveReadableNodeList instead of ActiveReadableNonCoordinatorNodeList is that if coordinator is added to cluster as a worker, it should be counted as well. Otherwise if there is only coordinator in the cluster, the count will be 0, hence we get a warning.
In MultiTaskTrackerExecute, we should connect to coordinator if it is
added to the cluster because it will also be assigned tasks.
ActiveReadableWorkerNodeList doesn't include coordinator, however if
coordinator is added as a worker, we should also include that while
planning. The current methods are very easily misusable and this
requires a refactoring to make the distinction between methods that
include coordinator and that don't very explicit as they can introduce
subtle/major bugs pretty easily.
We were using ALL_WORKERS TargetWorkerSet while sending temporary schema
creation and cleanup. We(well mostly I) thought that ALL_WORKERS would also include coordinator when it is added as a worker. It turns out that it was FILTERING OUT the coordinator even if it is added as a worker to the cluster.
So to have some context here, in repartitions, for each jobId we create
(at least we were supposed to) a schema in each worker node in the cluster. Then we partition each shard table into some intermediate files, which is called the PARTITION step. So after this partition step each node has some intermediate files having tuples in those nodes. Then we fetch the partition files to necessary worker nodes, which is called the FETCH step. Then from the files we create intermediate tables in the temporarily created schemas, which is called a MERGE step. Then after evaluating the result, we remove the temporary schemas(one for each job ID in each node) and files.
If node 1 has file1, and node 2 has file2 after PARTITION step, it is
enough to either move file1 from node1 to node2 or vice versa. So we
prune one of them.
In the MERGE step, if the schema for a given jobID doesn't exist, the
node tries to use the `public` schema if it is a superuser, which is
actually added for testing in the past.
So when we were not sending schema creation comands for each job ID to
the coordinator(because we were using ALL_WORKERS flag, and it doesn't
include the coordinator), we would basically not have any schemas for
repartitions in the coordinator. The PARTITION step would be executed on
the coordinator (because the tasks are generated in the planner part)
and it wouldn't give us any error because it doesn't have anything to do
with the temporary schemas(that we didn't create). But later two things
would happen:
- If by chance the fetch is pruned on the coordinator side, we the other
nodes would fetch the partitioned files from the coordinator and execute
the query as expected, because it has all the information.
- If the fetch tasks are not pruned in the coordinator, in the MERGE
step, the coordinator would either error out saying that the necessary
schema doesn't exist, or it would try to create the temporary tables
under public schema ( if it is a superuser). But then if we had the same
task ID with different jobID it would fail saying that the table already
exists, which is an error we were getting.
In the first case, the query would work okay, but it would still not do
the cleanup, hence we would leave the partitioned files from the
PARTITION step there. Hence ensure_no_intermediate_data_leak would fail.
To make things more explicit and prevent such bugs in the future,
ALL_WORKERS is named as ALL_NON_COORD_WORKERS. And a new flag to return
all the active nodes is added as ALL_DATA_NODES. For repartition case,
we don't use the only-reference table nodes but this version makes the
code simpler and there shouldn't be any significant performance issue
with that.
DESCRIPTION: Force aliases in deparsing for queries with anonymous column references
Fixes: #3985
The root cause has todo with discrepancies in the query tree we create. I think in the future we should spend some time on categorising all changes we made to ruleutils and see if we can change the data structure `query` we pass to the deparser to have an actual valid postgres query for the deparser to render.
For now the fix is to keep track, besides changing the names of the entries in the target list, also if we have a reference to an anonymous columns. If there are anonymous columns we set the `printaliases` flag to true which forces the deparser to add the aliases.
Rename TargetWorkerSet enums to make them more explicit about what they
mean. Ideally it would be good to treat everything as a node without the
'worker' concept because it makes things complicated. Another
improvement could be to rename TargetWorkerSet as TargetNodeSet but it
goes to renaming many occurrences of Worker, which is probably too big
for this PR.
Static analysis found some issues where we used the result from
ExtractResultRelationRTE, without checking that it wasn't NULL. It seems
like in all these cases it can never actually be NULL, since we have checked
before that it isn't a SELECT query. So, this PR is mostly to make static
analysis happy (and protect a bit against future changes of the code).
Static analysis found an issue where we could dereference `NULL`, because
`CreateDummyPlacement` could return `NULL` when there were no workers. This
PR changes it so that it never returns `NULL`, which was intended by
@marcocitus when doing this change: https://github.com/citusdata/citus/pull/3887/files#r438136433
While adding tests for citus on a single node I also added some more basic
tests and it turns out we error out on repartition joins. This has been
present since `shouldhaveshards` was introduced and is not trivial to fix.
So I created a separate issue for this: https://github.com/citusdata/citus/issues/3996