* Separate build of citus.so and citus_columnar.so.
Because columnar code is statically-linked to both modules, it doesn't
make sense to load them both at once.
A subsequent commit will make the modules entirely separate and allow
loading them both simultaneously.
Author: Yanwen Jin
* Separate citus and citus_columnar modules.
Now the modules are independent. Columnar can be loaded by itself, or
along with citus.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
We've had custom versions of Postgres's `foreach` macro, with a
hidden ListCell, for quite some time now. People like these custom
macros because they are easier to use and require less boilerplate.
This adds similar custom versions of Postgres's `forboth` macro. Now
you don't need ListCells anymore when looping over two lists at the same
time.
Before this commit, we erroneously converted the sequence
type to the type of the column it is used in. However, it is possible
that the sequence is used in an expression which is then converted
to a type that cannot be a sequence type, such as text.
With this commit, we only try this conversion if the column
type is a supported sequence type (e.g., smallint, int or bigint).
Note that we do this conversion because if the column type is a
bigint and the sequence is NOT a bigint, users would be in trouble
because the sequence could generate values that are out of the range
of the column. (The other combinations, such as an int column with a
bigint sequence, are already not supported and would fail on the worker.)
In other words, with this commit, we scope this optimization only
to cases where the target column type is a supported sequence type.
Otherwise, we let users use sequences more freely.
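For illustration, a minimal sketch of both cases (table, column, and sequence names are hypothetical):
```SQL
-- the sequence feeds an int column directly, so its type can safely be
-- aligned with the column's type (a supported sequence type)
CREATE SEQUENCE order_id_seq;
CREATE TABLE orders (
    id int DEFAULT nextval('order_id_seq'),
    item text
);
SELECT create_distributed_table('orders', 'id');

-- here the sequence value only appears inside an expression whose result
-- type is text, so no conversion of the sequence type is attempted
CREATE TABLE events (
    id text DEFAULT ('event-' || nextval('order_id_seq')::text),
    payload jsonb
);
```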
TEXT SEARCH DICTIONARY objects depend on TEXT SEARCH TEMPLATE objects.
Since we do not yet support distributed TS TEMPLATE objects, we skip
dependency checks for text search templates, similar to what we do for
roles.
The user is expected to manually create the TEXT SEARCH TEMPLATE objects
before a) adding new nodes, b) creating TEXT SEARCH DICTIONARY objects.
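A hedged sketch of that manual step, using the built-in `simple` template functions (object names are illustrative):
```SQL
-- create the template on every node first (e.g. via run_command_on_workers),
-- since Citus does not distribute TEXT SEARCH TEMPLATE objects yet
CREATE TEXT SEARCH TEMPLATE my_template (
    INIT = dsimple_init,
    LEXIZE = dsimple_lexize
);

-- only then create the dictionary that depends on it
CREATE TEXT SEARCH DICTIONARY my_dict (
    TEMPLATE = my_template,
    STOPWORDS = english
);
```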
If a worker node is being added, a command is sent to get the server_id of the worker from the pg_dist_node_metadata table. If the worker's id is the same as the node executing the code, we will know the node is trying to add itself. If the node tries to add itself without specifying `groupid:=0` the operation will result in an error.
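A short sketch of the two cases (host and port values are illustrative):
```SQL
-- a node adding itself (e.g. the coordinator) must pass groupid := 0
SELECT citus_add_node('coordinator-host', 5432, groupid := 0);

-- without groupid := 0, a node that tries to add itself errors out
SELECT citus_add_node('coordinator-host', 5432);
```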
1) Remove useless columns
2) Show backends that are blocked on a DDL even before
gpid is assigned
3) One minor bugfix, where we clear distributedCommandOriginator
properly.
DESCRIPTION: Move pg_dist_object to pg_catalog
Historically `pg_dist_object` had been created in the `citus` schema as an experiment to understand if we could move our catalog tables to a branded schema. We quickly realised that this interfered with the UX on our managed services and other environments, where users connected via a user with the name of `citus`.
By default Postgres puts the username on the search_path. To be able to read the catalog in the `citus` schema we would need to grant access permissions on the schema. This caused newly created objects, such as tables, to default to this schema for creation, which then failed due to missing write permissions on that schema.
With this change we move the `pg_dist_object` catalog table to the `pg_catalog` schema, where our other catalog tables are also located. This makes the catalog table visible and readable by any user, like our other catalog tables, for debugging purposes.
Note: due to the change of schema, we had to disable 1 test that was running into a discrepancy between the schema and binary. Secondly, we needed to make the lookup functions for the `pg_dist_object` relation and its indexes less strict on the fallback of the naming, because another test, due to an unfortunate cache invalidation, needed to look up the relation again. As a result, we won't default to _only_ resolving from `pg_catalog` outside of upgrades.
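A short sketch of the effect, with no special grants needed anymore:
```SQL
-- any user can now inspect the distributed object catalog directly
SELECT pg_identify_object(classid, objid, objsubid)
FROM pg_catalog.pg_dist_object;
```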
Clusters created pre-Citus 11 mostly didn't have metadata sync enabled.
For those clusters, we add a utility UDF which fixes some minor issues
and syncs the necessary objects to the workers.
* [Columnar] Build columnar.so and let citus depend on it
Co-authored-by: Yanwen Jin <yanwjin@microsoft.com>
Co-authored-by: Ying Xu <32597660+yxu2162@users.noreply.github.com>
Co-authored-by: jeff-davis <Jeffrey.Davis@microsoft.com>
DESCRIPTION: Add GUC to control ddl creation behaviour in transactions
Historically we would _not_ propagate objects when we are in a transaction block. Creation of distributed tables did not always work in sequential mode, hence distributing a table in the same transaction as creating an object it depends on wouldn't work. The upside was that the user could still benefit from parallelism.
Now that the creation of distributed tables is supported in sequential mode, it makes sense to let users force transactional consistency of DDL commands for distributed tables. A transaction could switch more aggressively to sequential mode when creating new objects in a transaction.
We don't change the default behaviour just yet.
Also, many objects would not even propagate their creation when the transaction was already set to sequential, leaving the possibility of a self deadlock. The new policy checks solve this discrepancy between objects as well.
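A minimal sketch of what this could look like, assuming the GUC introduced here is `citus.create_object_propagation` with an `immediate` setting (the GUC and value names are an assumption, not confirmed by the text above):
```SQL
BEGIN;
-- eagerly switch to sequential mode so the type created below is
-- propagated before the table that depends on it is distributed
SET LOCAL citus.create_object_propagation TO 'immediate';
CREATE TYPE order_status AS ENUM ('new', 'shipped');
CREATE TABLE orders (id bigint, status order_status);
SELECT create_distributed_table('orders', 'id');
COMMIT;
```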
The issue in question is caused when rebalance / replication call `FullShardPlacementList` which returns all shard placements (including those in disabled nodes with `citus_disable_node`). Eventually, `FindFillStateForPlacement` looks for the state across active workers and fails to find a state for the placements which are in the disabled workers causing a seg fault shortly after.
Approach:
* `ActivePlacementHash` was not using the status of the shard placement's node to determine whether the placement is active. Initially, I just fixed that.
* Additionally, I refactored the code which handles active shards in replication / rebalance to:
* use a single function to determine if a shard placement is active.
* do the active shard placement filtering before calling `RebalancePlacementUpdates` and `ReplicationPlacementUpdates`, so test methods like `shard_placement_rebalance_array` and `shard_placement_replication_array`, which have different shard placement activity requirements, can do their own filtering while using the same rebalance / replicate logic that `rebalance_table_shards` and `replicate_table_shards` use.
Fixes #5664
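A hedged sketch of the kind of sequence that could hit the crash described above (host and port are illustrative):
```SQL
-- placements left on a disabled node used to make the rebalancer
-- fail to find a fill state and crash
SELECT citus_disable_node('localhost', 9702);
SELECT rebalance_table_shards();
```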
CitusInitiatedBackend was a premature implementation of the whole
GlobalPID infrastructure. We used it to track whether any individual
query is triggered by Citus or not.
As of now, after GlobalPID is already in place, we don't need
CitusInitiatedBackend, in fact it could even be wrong.
Before this commit, dumping wait edges could only be used for
distributed deadlock detection purposes. With this commit,
we open up the possibility of using it for any backend.
CREATE FUNCTION command together with its dependencies.
If the function depends on any non-distributable object,
the function will be created only locally. The parameterless
version of create_distributed_function becomes obsolete
with this change; it will be deprecated in a subsequent PR.
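A minimal sketch of the new behaviour (type and function names are illustrative):
```SQL
-- the CREATE FUNCTION command below, together with the type it depends on,
-- is now propagated to the workers automatically
CREATE TYPE currency AS ENUM ('usd', 'eur');

CREATE FUNCTION format_price(amount numeric, c currency)
RETURNS text
LANGUAGE sql
AS $$ SELECT amount::text || ' ' || c::text; $$;
```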
* When a worker tried to create a collation which had a dependency in the same worker node,
it would cause a deadlock; now it throws the correct "not a coordinator" error.
DESCRIPTION: Implement TEXT SEARCH CONFIGURATION propagation
The change adds support to Citus for propagating TEXT SEARCH CONFIGURATION objects. TSConfig objects cannot always be created in one create statement, and instead require a create statement followed by many alter statements to get turned into the object they should represent.
To support this we add functionality to the worker to create or replace objects based on a list of statements. When the lists of the local object and the remote object correspond 1:1 we skip the creation of the object and simply mark it distributed. This is especially important for TSConfig objects as initdb pre-populates databases with a dozen configurations (for many different languages).
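A hedged sketch of such a create-then-alter statement list (the configuration name is illustrative):
```SQL
-- a TEXT SEARCH CONFIGURATION is typically created as a copy and then
-- shaped through ALTER statements; Citus propagates that list of statements
CREATE TEXT SEARCH CONFIGURATION my_english ( COPY = english );
ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING FOR asciiword WITH english_stem, simple;
```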
When the user creates a new TSConfig based on the copy of an existing configuration there is no direct link to the object copied from. Since there is no link we can't simply rely on propagating the dependencies to the worker and send a qualified
The low-level StoreAllActiveTransactions() function filters out
backends that exited.
Before this commit, if you ran pgbench, afterwards you'd still
see the exited backends show up:
```SQL
select count(*) from get_global_active_transactions();
┌───────┐
│ count │
├───────┤
│ 538 │
└───────┘
```
After this patch, only active backends show-up:
```SQL
select count(*) from get_global_active_transactions();
┌───────┐
│ count │
├───────┤
│ 72 │
└───────┘
```
Removed the dependency on EnsureTableOwner. Also removed pg_fini() and columnar_tableam_finish(). We still need to remove the CheckCitusVersion dependency to make columnar_tableam.h free of Citus dependencies.
Replaces citus.enable_object_propagation with citus.enable_metadata_sync
Also, within the Citus 11 release cycle, we added citus.enable_metadata_sync_by_default,
which is also replaced with citus.enable_metadata_sync.
In essence, when citus.enable_metadata_sync is set to true, all the objects
and the metadata are sent to the remote node.
We strongly advise that users never change the value of
this GUC.
With this commit, rebalancer backends are identified by application_name = citus_rebalancer
and the regular internal backends are identified by application_name = citus_internal
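A hedged way to tell these backends apart from a monitoring session:
```SQL
SELECT pid, application_name, state
FROM pg_stat_activity
WHERE application_name IN ('citus_rebalancer', 'citus_internal');
```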
With this commit we've started to propagate sequences and shell
tables within the object dependency resolution. So, ensuring any
dependencies for any object will consider shell tables and sequences
as well. Separate logics for both shell tables and sequences have
been removed.
Since the logic for both shell tables and sequences was previously
implemented as part of metadata handling, we were propagating
them while syncing table metadata. With this commit we've divided
the metadata (which hereafter means anything except shards) syncing
logic into multiple parts and implemented it as part of
ActivateNode. You can check the functions called in ActivateNode
to see how the different pieces of metadata are handled.
Definitions of start_metadata_sync_to_node and citus_activate_node
have also been updated. citus_activate_node will basically create
an active node with all metadata and reference table shards.
start_metadata_sync_to_node will be the same as citus_activate_node
except that it does not replicate reference tables. stop_metadata_sync_to_node
will remove all the metadata. All of those UDFs need to be called
by a superuser.
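A short sketch of the updated UDFs (host and port are illustrative):
```SQL
-- sync all metadata and replicate reference table shards
SELECT citus_activate_node('worker-1', 5432);

-- same, but without replicating reference table shards
SELECT start_metadata_sync_to_node('worker-1', 5432);

-- remove all metadata from the node again
SELECT stop_metadata_sync_to_node('worker-1', 5432);
```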
BEGIN/COMMIT transaction block or in a UDF calling another UDF.
(2) Prohibit/Limit the delegated function not to do a 2PC (or any work on a
remote connection).
(3) Have a safety net to ensure (2), i.e. we should block the connections
from the delegated procedure or make sure that no 2PC happens on the node.
(4) Such delegated functions are restricted to use only the distributed argument
value.
Note: To limit the scope of the project we are considering only Functions (not
procedures) for the initial work.
DESCRIPTION: Introduce a new flag "force_delegation" in create_distributed_function(),
which will allow a function to be delegated in an explicit transaction block.
Fixes #3265
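A sketch of using the new flag, assuming the existing `create_distributed_function` signature (the function and table names are illustrative):
```SQL
-- delegate the whole function call to the worker owning the shard for the
-- given distribution argument, even inside an explicit transaction block
SELECT create_distributed_function(
    'insert_order(bigint)',
    distribution_arg_name := '$1',
    colocate_with := 'orders',
    force_delegation := true
);
```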
Once the function is delegated to the worker, on that node during planning:
distributed_planner()
    TryToDelegateFunctionCall()
        CheckDelegatedFunctionExecution()
            EnableInForceDelegatedFuncExecution()
                Save the distribution argument (Constant)
ExecutorStart()
    CitusBeginScan()
        IsShardKeyValueAllowed()
            Ensure to not use a non-distribution argument.
ExecutorRun()
    AdaptiveExecutor()
        StartDistributedExecution()
            EnsureNoRemoteExecutionFromWorkers()
                Ensure all the shards in the remoteTaskList are local to the node.
NonPushableInsertSelectExecScan()
    InitializeCopyShardState()
        EnsureNoRemoteExecutionFromWorkers()
            Ensure all the shards in the placementList are local to the node.
This also fixes a minor issue: Properly handle expressions+parameters in distribution arguments
In addition to starting a new transaction, we also need to tell other
backends --including the ones spawned for connections opened to
localhost to build indexes on shards of this relation-- that concurrent
index builds can safely ignore us.
Normally, DefineIndex() only does that if the index doesn't have any
predicates (i.e., a WHERE clause) and no index expressions at all.
However, now that we have already called standard process utility, the index
build on the shell table is finished anyway.
The reason behind doing so is that we cannot guarantee not grabbing any
snapshots via adaptive executor, and the backends creating indexes on
local shards (if any) might block on waiting for current xact of the
current backend to finish, which would cause self deadlocks that are not
detectable.
We prefer the background daemon to only sync node metadata. That's
why we move placement metadata changes from disable node to
activate node. With that, we can make sure that disable node
only changes node metadata, whereas activate node syncs all
the metadata changes. In essence, we already expect all
nodes to be up when a node is activated. So, this does not change
the behavior much.
Dropping sequences means we need to recreate them,
and hence we lose the sequence state.
With this commit, we keep the existing sequences
such that resyncing won't drop the sequence.
We do that by breaking the dependency of the sequence
on the table.
Split distributed/version_compat.h into dependency-free
pg_version_compat.h, and the original which still has
dependencies. The original doesn't have much purpose, but until other
files have better discipline about including the correct header files,
then it's still needed.
Also make distributed/listutils.h dependency-free. Should be moved
outside of 'distributed' subdirectory, but that will cause significant
code churn, so leave for another cleanup patch.
Now both files can be included in columnar without creating a
dependency on citus.
Simply applies
```SQL
SELECT textlike(command, citus.grep_remote_commands)
```
And, if it returns true, the command is logged; otherwise it is not.
When citus.grep_remote_commands is an empty string, all commands are
logged.
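A small usage sketch (the pattern is illustrative):
```SQL
SET citus.log_remote_commands TO on;
-- only log the remote commands that match the pattern
SET citus.grep_remote_commands TO '%CREATE INDEX%';
```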
Before that PR we were updating citus.pg_dist_object metadata, which keeps
the metadata related to objects on Citus, only on the coordinator node. In
order to allow using those objects from worker nodes (or erroring out with
a proper error message) we've started to propagate that metadata to worker
nodes as well.
As of master branch, Citus does all the modifications to replicated tables
(e.g., reference tables and distributed tables with replication factor > 1),
via 2PC and avoids any shardstate=3. As a side-effect of those changes,
handling node failures for replicated tables changes.
With this PR, when one (or multiple) node failures happen, the users would
see query errors on modifications. If the problem is intermittent, that's OK;
once the node failure(s) recover by themselves, the modification queries would
succeed. If the node failure(s) are permanent, the users should call
`SELECT citus_disable_node(...)` to disable the node. As soon as the node is
disabled, modifications would start to succeed. However, the old node now falls
behind. It means that, when the node is up again, the placements should be
re-created on the node. First, use `SELECT citus_activate_node()`. Then, use
`SELECT replicate_table_shards(...)` to replicate the missing placements on
the re-activated node.
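A sketch of that recovery flow (host, port, and table name are illustrative):
```SQL
-- a permanently failed node keeps breaking modifications, so disable it
SELECT citus_disable_node('worker-1', 5432);

-- ...once the node is healthy again, re-activate it and re-create the
-- placements it missed while it was disabled
SELECT citus_activate_node('worker-1', 5432);
SELECT replicate_table_shards('my_replicated_table'::regclass);
```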
With this commit, we make sure to use a dedicated connection per
node for all the metadata operations within the same transaction.
This is needed because the same metadata (e.g., metadata includes
the distributed table on the workers) can be modified across
multiple connections.
With this connection we guarantee that there is a single metadata connection.
But note that this connection can be used for any other operation.
In other words, this connection is not only reserved for metadata
operations.
The checks preventing a node from being removed are very much reference
table centric. We are soon going to add the same checks for replicated
tables. So, make the checks generic such that:
(a) replicated tables fit naturally
(b) we can use the same checks in `citus_disable_node`.
- [x] Add some more regression test coverage
- [x] Make sure returning works fine in case of
local execution + remote execution
(task->partiallyLocalOrRemote works as expected, already added tests)
- [x] Implement locking properly (and add isolation tests)
- [x] We do #shardcount round-trips on `SerializeNonCommutativeWrites`.
We made it a single round-trip.
- [x] Acquire locks for subselects on the workers & add isolation tests
- [x] Add a GUC to prevent modification from the workers, hence increase the
coordinator-only throughput
- The performance slightly drops (~15%), unless
`citus.allow_modifications_from_workers_to_replicated_tables`
is set to false
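A sketch of restoring the coordinator-only throughput with the GUC named above:
```SQL
-- disallow modifications of replicated tables from worker nodes,
-- avoiding the extra locking round-trips
ALTER SYSTEM SET citus.allow_modifications_from_workers_to_replicated_tables TO off;
SELECT pg_reload_conf();
```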
In the past, we allowed users to manually switch to 1PC
(i.e., one-phase commit). However, with this commit, we
don't. All multi-shard modifications are done via 2PC.
With Citus 9.0, we introduced `citus.single_shard_commit_protocol` which
defaults to 2PC.
With this commit, we prevent users from setting it to 1PC and drop support
for `citus.single_shard_commit_protocol`.
Although this might add some overhead for users, 2PC is already the default
behaviour (so the impact is limited) and marking placements as INVALID is much
worse.
- citus_get_all_dependencies_for_object: emulate what Citus would qualify as dependency when adding a new node
- citus_get_dependencies_for_object: emulate what Citus would qualify as dependency when creating an object
Example use:
```SQL
-- find all the dependencies of table test
SELECT
pg_identify_object(t.classid, t.objid, t.objsubid)
FROM
(SELECT * FROM pg_get_object_address('table', '{test}', '{}')) as addr
JOIN LATERAL
citus_get_all_dependencies_for_object(addr.classid, addr.objid, addr.objsubid) as t(classid oid, objid oid, objsubid int)
ON TRUE
ORDER BY 1;
```
Add/fix tests
Fix creating partitions
Add test for mx - partition creating case
Enable cascading to partitioned tables
Fix mx partition adding test
Fix cascading through fkeys
Style
Disable converting with non-inherited fkeys
Fix detach bug
Early return in case of cascade & Add tests
Style
Fix undistribute_table bug & Fix test outputs
Remove RemovePartitionRelationIds
Test with undistribute_table
Add test for mx+convert+undistribute
Remove redundant usage of CreatePartitionedCitusLocalTable
Add some comments
Introduce bulk functions for generating attach/detach partition commands
Fix: Convert partitioned tables after adding fkey
Change the error message for partitions
Introduce function ErrorIfPartitionTableAddedToMetadata
Polish attach/detach command generation functions
Use time_partitions for testing
Move mx tests to citus_local_tables_mx
Add new partitioned table to cascade test
Add test with time series management UDFs
Fix test output
Fix: Assertion fail on relation access tracking
Style
Refactor creating partitioned citus local tables
Remove CreatePartitionedCitusLocalTable
Style
Error out if converting multi-level table
Revert some old tests
Error out adding partitioned partition
Polish
Polish/address
Fix create table partition of case
Use CascadeOperationForRelationIdList if no cascade needed
Fix create partition bug
Revert / Add new tests to mx
Style
Fix dropping fkey bug
Add test with IF NOT EXISTS
Convert to CLT when doing ATTACH PARTITION
Add comments
Add more tests with time series management
Edit the error message for converting the child
Use OR instead of AND in ErrorIfUnsupportedAlterTableStmt
Edit/improve tests
Disable ddl prop when dropping default column definitions
Disable/enable ddl prop just before/after the command
Add comment
Add sequence test
Add trigger test
Remove NeedCascadeViaForeignKeys
Add one more insert to sequence test
Add comment
Style
Fix test output shard ids
Update comments
Disable creating fkey on partitions
Move partition check to CreateCitusLocalTable
Add comment
Add check for attaching multi-level partition
Add test for pg_constraint
Check pg_dist_partition in tests
Add test inserting on the worker
Don't flush pending writes if the given tid belongs to a "flushed" or
"aborted" stripe write, or to an "in-progress" stripe write of
another backend.
That way, we reduce the cases where we flush single-tuple
stripes during index scans.
To do that, we follow the steps below for index look-ups:
- Do not flush any pending writes and do a stripe metadata look-up for the
given tid.
If the tuple with that tid is found, then there is no need to do another
look-up since we already found the tuple without needing to flush pending
writes.
- If the tuple is not found without flushing pending writes, then we have two
scenarios:
- If the given tid belongs to a pending write of my backend, then do the
stripe metadata look-up for the given tid, but this time first **flush any
pending writes**.
- Otherwise, just return false from `index_fetch_tuple` since flushing
pending writes wouldn't help.
Recently there have been some warnings during the compilation of Citus.
Part of the warnings are due to the `columnar_tableam.h` header not being properly guarded with defines and ifndef's.
This PR fixes these warnings.
If it is certain that we will not use any `parallel_worker`s for a columnar table,
then stripe entries inserted by aborted transactions become visible to
`SnapshotAny` and that causes `REINDEX` to fail by throwing a duplicate key
error.
To fix that:
* consider three states for a stripe write operation:
"flushed", "aborted", or "in-progress",
* make sure to have a clear separation between them, and
* act according to those three states when reading from a columnar table
make_simple_restrictinfo and pull_varnos functions now have a new parameter
These new macros give us the ability to use this new parameter for PG14 and they don't give the parameter for previous versions
Relevant PG commit:
55dc86eca70b1dc18a79c141b3567efed910329d
index_insert function now has a new parameter, indexUnchanged
This new macro gives us the ability to use this new parameter for PG14 and it doesn't pass the parameter for previous versions
Existing parameter is set to false
Relevant PG commit:
9dc718bdf2b1a574481a45624d42b674332e2903
es_result_relation_info is removed from EState. In this commit we make some changes to handle that.
A resultRelationInfo field is added to ModifyState to support the removed field.
Relevant PG commits:
1375422c7826a2bf387be29895e961614f69de4b
a04daa97a4339c38e304cd6164d37da540d665a8
GetOldestXmin function is removed so we use the GetOldestNonRemovableTransactionId function instead
GetOldestNonRemovableTransactionId_compat picks the appropriate one
Relevant PG commit:
dc7420c2c9274a283779ec19718d2d16323640c0
get_partition_parent and RelationGetPartitionDesc functions now have new parameters to also include detached partitions
These new macros give us the ability to use these new parameters for PG14 and they don't pass the parameters for previous versions
Existing parameters are set to not accept detached partitions
Relevant PG commit:
71f4c8c6f74ba021e55d35b1128d22fb8c6e1629
In two commits, vacuumFlags in PGXACT was moved and then renamed to statusFlags
This macro uses the appropriate version of the flag
Relevant PG commits:
5788e258bb26495fab65ff3aa486268d1c50b123
cd9c1b3e197a9b53b840dcc87eb41b04d601a5f9
SetTuplestoreDestReceiverParams function now has two new parameters
This new macro gives us the ability to use these new parameters for PG14 and it doesn't pass them for previous versions
Existing parameters are set to NULL to keep previous behavior
Relevant PG commit:
2f48ede080f42b97b594fb14102c82ca1001b80c
Some Copy-related functions copied from Postgres had support for both the old and new protocols
Postgres removed support for the old protocol version, so we remove it too
Relevant PG commit:
3174d69fb96a66173224e60ec7053b988d5ed4d9
New macros: standard_ProcessUtility_compat, ProcessUtility_compat, ColumnarProcessUtility_compat, PrevProcessUtilityHook_compat
The functions now have a new bool parameter: readOnlyTree
These new macros give us the ability to use this new parameter for PG14 and they don't pass the parameter for previous versions
In multi_ProcessUtility and ColumnarProcessUtility, before doing anything else, we check if the readOnlyTree parameter is true and create a copy of pstmt
Existing readOnlyTree parameters are set to false since we already handle the read-only case in multi_ProcessUtility and ColumnarProcessUtility
Relevant PG commit:
7c337b6b527b7052e6a751f966d5734c56f668b5
BeginCopyFrom function now has a new whereClause parameter.
In the function this parameter is assigned to the whereClause field of the CopyFromState returned
Currently in Postgres there is only one place where this argument isn't NULL, and in previous PG versions the whereClause field of the copy state is set right after the function call
Since we don't have such a case, all current whereClause parameters are set to NULL
Relevant PG commit:
c532d15dddff14b01fe9ef1d465013cb8ef186df
CopyState struct is divided into parts and one of them is CopyFromState
This macro uses the appropriate one depending on the PG version
Relevant PG commit:
c532d15dddff14b01fe9ef1d465013cb8ef186df
In ReindexStmt concurrent field is moved to options and then options are converted to params list.
This macro uses previous fields for previous versions and the new params list with a new function named IsReindexWithParam for PG14
Relevant PG commits:
844c05abc3f1c1703bf17cf44ab66351ed9711d2
b5913f6120792465f4394b93c15c2e2ac0c08376
VacOptTernaryValue enum is renamed to VacOptValue.
In the enum there were three values, VACOPT_TERNARY_DEFAULT, VACOPT_TERNARY_DISABLED, and VACOPT_TERNARY_ENABLED
Now there are four values VACOPTVALUE_UNSPECIFIED, VACOPTVALUE_AUTO, VACOPTVALUE_DISABLED, and VACOPTVALUE_ENABLED
New macros are VacOptValue_compat, VACOPTVALUE_UNSPECIFIED_COMPAT, VACOPTVALUE_DISABLED_COMPAT, and VACOPTVALUE_ENABLED_COMPAT
The VACOPTVALUE_UNSPECIFIED_COMPAT macro matches VACOPT_TERNARY_DEFAULT and VACOPTVALUE_UNSPECIFIED. And there is no macro for VACOPTVALUE_AUTO.
Relevant PG commit:
3499df0dee8c4ea51d264a674df5b5e31991319a
New macros: FuncnameGetCandidates_compat and expand_function_arguments_compat
The functions (the ones without _compat) now have a new bool include_out_arguments parameter
These new macros give us the ability to use this new parameter for PG14 and they don't pass the parameter for previous versions
Existing include_out_arguments parameters are set to 'false' to keep current behavior
Relevant PG commit:
e56bce5d43789cce95d099554ae9593ada92b3b7
The stats function now has a new bool print_to_stderr parameter
This new macro gives us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
Existing print_to_stderr parameter is set to true to keep current behavior
Relevant PG commit:
43620e328617c1f41a2a54c8cee01723064e3ffa
getObjectTypeDescription and getObjectIdentity functions now have a new bool missing_ok parameter
These new macros give us the ability to use this new parameter for PG14 and they don't give the parameter for previous versions
Currently all missing_ok parameters are set to false to keep current behavior
Relevant PG commit:
2a10fdc4307a667883f7a3369cb93a721ade9680
The STATUS_WAITING define is removed and an enum with PROC_WAIT_STATUS_WAITING is added instead
This macro uses the appropriate one
Relevant PG commit:
a513f1dfbf2c29a51b0f7cbd5913ce2d2ee452c5
AlterTableStmt's relkind field is changed into objtype
New AlterTableStmtObjType macro uses the appropriate one
Relevant PG commit:
cc35d8933a211d9965eb1c1d2749a903d5735db2
Allow ColumnarScans to push down join quals by generating
parameterized paths. This significantly expands the utility of chunk
group filtering, making a ColumnarScan behave similar to an index when
on the inner of a nested loop join.
Also, evaluate all parameters on beginscan/rescan, which also works
for external parameters.
Fixes #4488.
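A hedged sketch of the kind of plan this enables (table names are illustrative):
```SQL
CREATE TABLE devices (device_id bigint PRIMARY KEY);
CREATE TABLE events (event_id bigint, device_id bigint, payload text) USING columnar;

-- with parameterized paths, the join qual on device_id can drive chunk
-- group filtering inside the ColumnarScan on the inner of a nested loop
EXPLAIN (COSTS OFF)
SELECT *
FROM devices d
JOIN events e ON e.device_id = d.device_id
WHERE d.device_id = 42;
```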
Previously, we were doing `first_row_number` reservation for the first
row written to current `WriteState` but were doing `stripe_id`
reservation when flushing the `WriteState` and were inserting the
related record to `columnar.stripe` at that time as well.
However, inserting `columnar.stripe` record at flush-time is
problematic. This is because, as told in #5160, if relation has
any index-based constraints and if there are two concurrent
writes that are inserting conflicting key values for that constraint,
then postgres relies on `tableAM->fetch_index_tuple`
(=`columnar_fetch_index_tuple`) callback to return `true` when
indexAM is checking against possible constraint violations.
However, pending writes of other backends are not visible to concurrent
sessions in columnar since we were not inserting the stripe metadata
record until flushing the stripe.
With this commit, we split stripe reservation into two phases:
i) Reserve `stripe_id` and insert a "dummy" record to `columnar.stripe`
at the very same time we reserve `first_row_number`, i.e. when writing
the first row to the current `WriteState`.
ii) At flush time, do the storage level allocation and complete the
missing fields of the dummy record inserted into `columnar.stripe`
during i).
That way, any concurrent writes would be able to check against possible
constraint violations by using `SnapshotDirty` when scanning
`columnar.stripe`.
Note that `columnar_fetch_index_tuple` still wouldn't be able to fill
the output tupleslot for the requested tid but it would at least return
`true` for such index look-ups, and we believe this should be sufficient
for the caller indexAM callback to make the concurrent writer block on
prior one.
That is how we fix #5160.
The only downside of reserving `stripe_id` at the same time we reserve
`first_row_number` is that now any aborted writes would also waste
some amount of `stripe_id` as in the case of `first_row_number` but
we are just wasting them one-by-one.
Considering the fact that we waste `first_row_number` by the amount
stripe row limit (=150k by default) in such cases, this shouldn't be
important at all.
Before starting to scan a columnar table, we always flush the pending
writes to disk.
However, we increment the command counter after modifying metadata tables.
On the other hand, now that we _don't always use_ xact snapshot to scan
a columnar table, writes that we just flushed might not be visible to
the query that just flushed pending writes to disk since curcid of
provided snapshot would become smaller than the command id being used
when modifying metadata tables.
To give an example, before this change, below was a possible scenario
due to the changes that we made to use the correct snapshot.
```sql
CREATE TABLE t(a int, b int) USING columnar;
BEGIN;
INSERT INTO t VALUES (5, 10);
SELECT * FROM t;
┌───┬───┐
│ a │ b │
├───┼───┤
└───┴───┘
(0 rows)
SELECT * FROM t;
┌───┬────┐
│ a │ b │
├───┼────┤
│ 5 │ 10 │
└───┴────┘
(1 row)
```
* Synchronize hasmetadata flag on mx workers
* Switch to sequential execution
* Add test
* Use SetWorkerColumn
* Add test for stop_sync
* Remove usage of UpdateHasmetadataOnWorkersWithMetadata
* Remove MarkNodeMetadataSynced
* Fix test for metadatasynced
* Remove MarkNodeMetadataSynced
* Style
* Remove MarkNodeHasMetadata
* Remove UpdateDistNodeBoolAttr
* Refactor SetWorkerColumn
* Use SetWorkerColumnLocalOnly when setting up dependencies
* Use SetWorkerColumnLocalOnly in TriggerSyncMetadataToPrimaryNodes
* Style
* Make update command generator functions static
* Set metadatasynced before syncing
* Call SetWorkerColumn only if the sync is successful
* Try to sync all nodes
* Fix indexno
* Update metadatasynced locally first
* Break if a node fails to sync metadata
* Send worker commands optional
* Style & Rebase
* Add raiseOnError param to SetWorkerColumn
* Style
* Set metadatasynced for all metadata nodes
* Style
* Introduce SetWorkerColumnOptional
* Polish
* Style
* Don't send set command to not-synced metadata nodes
* Style
* Polish
* Add test for stop_sync
* Add test for shouldhaveshards
* Add test for isactive flag
* Sort by placementid in the function verify_metadata
* Cover edge cases for failing nodes
* Add comments
* Add nodeport to isactive test
* Add warning if metadata out of sync
* Update warning message
As we use the current user to sync the metadata to the nodes
with #5105 (and many other PRs), there is no reason that
prevents us from using the coordinated transaction for metadata syncing.
This commit also renames a few functions to reflect their actual
implementation.
Before this commit, creating a partition after a DROP COLUMN
on the parent (for a column positioned before the distribution key) was
causing the partition to have the wrong distribution column.
update_distributed_table_colocation can be called by the relation
owner, and internally it updates pg_dist_partition. With this
commit, update_distributed_table_colocation uses an internal
UDF to access pg_dist_partition.
As a result, this operation can now be done by regular users
on MX.
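For example (table names are illustrative), a regular table owner can now run this from an MX worker as well:
```SQL
SELECT update_distributed_table_colocation('orders', colocate_with => 'customers');
```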
* Fix UNION not being pushed down
Postgres optimizes column fields that are not needed in the output. We
were relying on these fields to understand if it is safe to push down a
union query.
This fix looks at the parse query, which has the original column fields
to detect if it is safe to push down a union query.
* Add more tests
* Simplify code and make it more robust
* Process varlevelsup > 0 in FindReferencedTableColumn
* Only look for outer vars in union path
* Add more comments
* Remove UNION ALL specific logic for pulling up childvars
These two options were not included when creating the sequences on the
workers as part of metadata syncing.
The missing `data_type` part of the definition made finding the cause
of #5126 harder than necessary, because of confusing errors.
Before this commit, we always synced the metadata with superuser.
However, that creates various edge cases such as visibility errors
or self distributed deadlocks or complicates user access checks.
Instead, with this commit, we use the current user to sync the metadata.
Note that `start_metadata_sync_to_node` still requires superuser
because accessing certain metadata (like pg_dist_node) always requires
superuser (i.e., the current user should be a superuser).
However, metadata syncing operations regarding the distributed
tables can now be done with regular users, as long as the user
is the owner of the table. A table owner can still insert nonsense
metadata; however, it'd only affect their own table. So, we cannot do
anything about that.
Ignore orphaned shards in more places
Only use active shard placements in RouterInsertTaskList
Use IncludingOrphanedPlacements in some more places
Fix comment
Add tests
* Alter seq type when we first use the seq in a dist table
* Don't allow type changes when seq is used in dist table
* ALTER SEQUENCE propagation
* Tests for ALTER SEQUENCE propagation
* Relocate AlterSequenceType and ensure dependencies for sequence
* Support for citus local tables, and other fixes
* Final formatting
Moving shards of reference tables was possible in at least one case:
```sql
select citus_disable_node('localhost', 9702);
create table r(x int);
select create_reference_table('r');
set citus.replicate_reference_tables_on_activate = off;
select citus_activate_node('localhost', 9702);
select citus_move_shard_placement(102008, 'localhost', 9701, 'localhost', 9702);
```
This would then remove the reference table shard on the source, causing
all kinds of issues. This fixes that by disallowing all shard moves
except for shards of distributed tables.
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
The first and main issue was that we were putting absolute pointers into
shared memory for the `steps` field of the `ProgressMonitorData`. This
pointer was being overwritten every time a process requested the monitor
steps, which is the only reason why this even worked in the first place.
To quote a part of a relevant stack overflow answer:
> First of all, putting absolute pointers in shared memory segments is
> terrible terrible idea - those pointers would only be valid in the
> process that filled in their values. Shared memory segments are not
> guaranteed to attach at the same virtual address in every process.
> On the contrary - they attach where the system deems it possible when
> `shmaddr == NULL` is specified on call to `shmat()`
Source: https://stackoverflow.com/a/10781921/2570866
In this case a race condition occurred when a second process overwrote
the pointer in between the first process's write and read of the steps
field.
This issue is fixed by not storing the pointer in shared memory anymore.
Instead, we now calculate its position every time we need it.
The second race condition I have not been able to trigger, but I found
it while investigating this. This issue was that we published the handle
of the shared memory segment, before we initialized the data in the
steps. This means that during initialization of the data, a call to
`get_rebalance_progress()` could read partial data in an unsynchronized
manner.
With local query caching, we try to avoid deparse/parse stages as the
operation is too costly.
However, we can do the deparse/parse operations once per cached query, right
before we put the plan into the cache. With that, we avoid edge
cases like (4239) or (5038).
In a sense, we are making local plan caching behave similarly to non-cached
local/remote queries, by forcing the query to be deparsed once.