* Replace master_add_node with citus_add_node
* Replace master_activate_node with citus_activate_node
* Replace master_add_inactive_node with citus_add_inactive_node
* Use master udfs in old scripts
* Replace master_add_secondary_node with citus_add_secondary_node
* Replace master_disable_node with citus_disable_node
* Replace master_drain_node with citus_drain_node
* Replace master_remove_node with citus_remove_node
* Replace master_set_node_property with citus_set_node_property
* Replace master_unmark_object_distributed with citus_unmark_object_distributed
* Replace master_update_node with citus_update_node
* Replace master_update_shard_statistics with citus_update_shard_statistics
* Replace master_update_table_statistics with citus_update_table_statistics
* Rename master_conninfo_cache_invalidate to citus_conninfo_cache_invalidate
* Rename master_dist_local_group_cache_invalidate to citus_dist_local_group_cache_invalidate
* Replace master_copy_shard_placement with citus_copy_shard_placement
* Replace master_move_shard_placement with citus_move_shard_placement
* Rename master_dist_node_cache_invalidate to citus_dist_node_cache_invalidate
* Rename master_dist_object_cache_invalidate to citus_dist_object_cache_invalidate
* Rename master_dist_partition_cache_invalidate to citus_dist_partition_cache_invalidate
* Rename master_dist_placement_cache_invalidate to citus_dist_placement_cache_invalidate
* Rename master_dist_shard_cache_invalidate to citus_dist_shard_cache_invalidate
* Drop master_modify_multiple_shards
* Rename master_drop_all_shards to citus_drop_all_shards
* Drop master_create_distributed_table
* Drop master_create_worker_shards
* Revert old function definitions
* Add missing revoke statement for citus_disable_node
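For reference, calling a few of the renamed node-management UDFs under their new citus_* names looks like this (a minimal sketch; host names and ports are placeholders):

```sql
-- new citus_* names; previously master_add_node, master_activate_node, etc.
SELECT citus_add_node('worker-1', 5432);
SELECT citus_add_inactive_node('worker-2', 5432);
SELECT citus_activate_node('worker-2', 5432);
SELECT citus_disable_node('worker-2', 5432);
SELECT citus_remove_node('worker-2', 5432);
```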
CREATE TABLE does not invalidate the foreign key graph, but some other
DDL commands do.
Previously, since multi_foreign_key & multi_foreign_key_relation_graph
ran in parallel, multi_foreign_key could invalidate the foreign key
graph via DDL commands, making the CREATE TABLE test in
multi_foreign_key_relation_graph flaky.
So we no longer run those two tests in parallel.
We used to need WarnAboutLeakedPreparedTransaction()
because we didn't have automatic 2PC recovery. We have had
2PC recovery since https://github.com/citusdata/citus/pull/1574,
so the function is no longer needed.
* Rethrow original concurrent index creation failure message
* Alter test outputs for concurrent index creation
* Detect duplicate table failure in concurrent index creation
* Add test for conc. index creation w/out duplicates
* Prevent deadlock for long named partitioned index creation on single node
* Create IsSingleNodeCluster function
* Use both local and sequential execution
On top of our foreign key graph, implement the infrastructure to get the
list of relations that are connected to the input relation via the foreign
key graph.
We need this to support cascading create_citus_local_table &
undistribute_table operations.
Also add regression tests to see what our foreign key graph is currently
able to capture.
With this commit, we remove the visited flags from the ForeignConstraintRelationshipNode
struct, since keeping local state in a global object is both dangerous and
meaningless.
To improve readability, this commit also converts the needless recursion
into an iterative DFS, which avoids passing a local hash map as another
parameter to the GetConnectedListHelper function.
Attribute numbers in a subquery RTE and in a relation RTE mean different
things. In a relation RTE, the attribute number points to the column number
in the table definition, including dropped columns; in a subquery RTE, it is
the index into the target list. When we convert a relation RTE to a subquery
RTE, we should either correct all the relevant attribute numbers or add a
dummy column for each dropped column. We choose the latter in this commit
because updating all the Vars in the query is too error-prone in practice.
Another thing this commit fixes is that when a join restriction clause list
contains a false clause, we should return just that false clause instead of
the whole list, because the whole list also contains restrictions from other
RTEs and that breaks the query; this can be seen from the output changes,
which are now much simpler.
Also, instead of adding individual tests for dropped columns, we run the
whole set of mixed queries against tables with dropped columns; this already
revealed some bugs, which are fixed in this commit.
It seems that there are only very few cases where that is useful, and
for now we prefer not having that check. This means that we might
perform some unnecessary checks, but that should be rare and not
performance critical.
Instead of sending NULLs over the network, we now convert the subqueries
into the form:
SELECT t.a, NULL, NULL FROM (SELECT a FROM table) t;
and recursively plan the inner part so that the NULLs are not sent over
the network. We still need the NULLs in the outer subquery because we
currently don't have an easy way of updating all the necessary places in
the query.
Add some documentation for how the conversion is done
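A hedged sketch of that conversion (table and column names are illustrative): suppose the relation being wrapped had its second column dropped; rather than shipping that column over the network, the RTE is rewritten roughly as:

```sql
-- conceptual rewrite of a relation RTE whose second column was dropped:
-- the dummy NULL keeps the outer attribute numbers aligned, while only
-- the inner subquery is recursively planned and shipped
SELECT t.a, NULL AS dropped_col
FROM (SELECT a FROM some_table) t;
```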
BaseRestrictInfo also contains the pushed-down filters, so it makes more
sense to use it to determine which columns have constant equality
filters.
Also, RteIdentity is now used for removing conversion candidates instead
of rteIndex.
It seems that most of the UPDATEs were broken; we weren't aware of it
because there wasn't any data in the tables. They were broken mostly
because local tables do not have a shard id, and some code paths need to
account for that; currently, when there is an invalid shard id, the shard
is assumed to be pruned.
Consider local tables in the router planner
When there is a local table, the shard id will not be valid, and some
checks rely on the shard id; we should skip these for local tables, which
is handled with a dummy placement.
Add citus local table dist table join tests
Add local-distributed table mixed join tests
AllDataLocallyAccessible and ContainsLocalTableSubqueryJoin are removed.
We can possibly remove ModifiesLocalTableWithRemoteCitusLocalTable as
well. This removal has a side effect: now, even when all the data is
locally available, we could still wrap a relation into a subquery; that
should probably be resolved in the router planner itself.
Add more tests
When we wrap an RTE into a subquery, we update the varnos of its Vars to
1; however, we should also update the varnos of the Vars in the quals.
Some other small code quality improvements are also included.
The previous algorithm was not consistent and could convert different
RTEs depending on the order of the tables in the query. Now we convert
the local tables if there is a distributed table which doesn't have a
unique index. So if there are four tables, local1, local2, dist1 and
dist2_with_pkey, then we will convert local1 and local2 in `auto` mode.
Converting a distributed table would not make sense here: because there
is a distributed table without a unique index, we need to convert the
local tables anyway, so also converting the distributed table with the
pkey would be redundant.
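A hedged illustration with the four tables named above (the schema and the `citus.local_table_join_policy` GUC name are assumptions for this sketch):

```sql
-- local1, local2: plain Postgres tables; dist1: distributed, no unique index;
-- dist2_with_pkey: distributed with a primary key
SET citus.local_table_join_policy TO 'auto';  -- assumed GUC behind the "auto" mode above

SELECT *
FROM local1
JOIN local2 USING (a)
JOIN dist1 USING (a)
JOIN dist2_with_pkey USING (a);
-- in auto mode, local1 and local2 are converted to subqueries; since dist1
-- has no unique index they must be converted anyway, so also converting
-- dist2_with_pkey would be redundant
```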
Remove FillLocalAndDistributedRTECandidates and use
ShouldConvertLocalTableJoinsToSubqueries, which simplifies things since we
rely on a single function to decide whether we should continue converting
RTEs to subqueries.
We should not recursively plan an already router-plannable query. An
example of this is (SELECT * FROM local JOIN (SELECT * FROM dist) d1
USING(a));
So we let the recursive planner do all of its work, and at the end we
convert the final query to handle the unsupported joins. After each
conversion, we check whether the query is router plannable; if so, we stop.
Only consider range table entries that are in the join tree
If a range table entry is not in the join tree, then there is no point in
considering it, because we are converting range table entries to
subqueries for the join use case.
Check equality in quals
We want to recursively plan distributed tables only if they have an
equality filter on a unique column. So '>' and '<' operators will not
trigger recursive planning of distributed tables in local-distributed
table joins.
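For example (names are illustrative), only the first of these filters lets the distributed side be recursively planned:

```sql
-- dist_pk has a unique index (primary key) on id; local_t is a plain Postgres table
SELECT * FROM local_t JOIN dist_pk USING (id)
WHERE dist_pk.id = 42;   -- equality on a unique column: dist_pk may be recursively planned

SELECT * FROM local_t JOIN dist_pk USING (id)
WHERE dist_pk.id > 42;   -- range filter: dist_pk could return many rows, so it is not
                         -- recursively planned; the local table is converted instead
```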
Recursively plan the distributed table only if the filter is constant
If the filter is not a constant, then the join might return multiple rows
and there is a chance that the distributed table will return a huge amount
of data. Hence, if the filter is not constant, we choose to recursively
plan the local table.
When doing local-distributed table joins, we convert one side to a
subquery. The current policy is that we convert the distributed table to
a subquery if it has a filter on a column with a unique index (a primary
key also implies a unique index).
The logical planner cannot handle joins between local and distributed tables.
Instead, we can recursively plan one side of the join and let the logical
planner handle the rest.
Our algorithm is a bit smart: it tries not to recursively plan distributed
tables and favors local tables instead.
UPDATEs on partitioned tables that affect only row partitions should
succeed; the rest should fail.
Also rename CStoreScan to ColumnarScan to make the error message more
relevant.
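A minimal sketch of the intended behavior (names are made up; assumes the `columnar` access method is available):

```sql
CREATE TABLE events (id int, at date, payload text) PARTITION BY RANGE (at);
CREATE TABLE events_2020 PARTITION OF events
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01') USING columnar;
CREATE TABLE events_2021 PARTITION OF events
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');  -- regular row (heap) partition

-- touches only the row partition: should succeed
UPDATE events SET payload = 'x' WHERE at >= '2021-06-01';

-- would touch the columnar partition: should fail, with ColumnarScan in the error message
UPDATE events SET payload = 'x' WHERE at < '2020-06-01';
```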
A utility function is added so that each caller can implement a handler
for each index on a given table. This means that the caller doesn't need
to worry about how to access each index; the only thing it needs to do is
implement a function to which each index on the table is passed
iteratively.
When Citus needs to parallelize queries on the local node (e.g., the node
executing the distributed query and the shards are the same), we need to
be mindful about connection management. The reason is that the client
backends running distributed queries compete with the client backends that
Citus initiates to parallelize the queries for a slot on max_connections.
In that regard, we implemented a "failover" mechanism where, if the
distributed queries cannot get a connection, the execution fails over the
tasks to local execution.
The failover logic is as follows:
- Ask the connection manager whether it is OK to get a connection
- If yes, we are good.
- If no, we fail the workerPool and the failure triggers
the failover of the tasks to the local execution queue
The decision on whether to get a connection is as follows:
/*
* For local nodes, solely relying on citus.max_shared_pool_size or
* max_connections might not be sufficient. The former gives us
* a preview of the future (e.g., we let the new connections to establish,
* but they are not established yet). The latter gives us the close to
* precise view of the past (e.g., the active number of client backends).
*
* Overall, we want to limit both of the metrics. The former limit typically
* kicks in under regular loads, where the load of the database increases in
* a reasonable pace. The latter limit typically kicks in when the database
* is issued lots of concurrent sessions at the same time, such as benchmarks.
*/
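The two limits mentioned in the comment above can be inspected like this (a sketch; values depend on the configuration):

```sql
SHOW citus.max_shared_pool_size;  -- cap on connections Citus itself opens (the "future" view)
SHOW max_connections;             -- cap on all backends on the node (the "past" view)
```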
When distributing a columnar table, as well as changing options on a distributed columnar table, this patch will forward the settings from the coordinator to the workers.
For propagating options changes on an already distributed table this change is pretty straightforward. Before applying the change in options locally we create a `DDLJob` that contains a call to `alter_columnar_table_set(...)` for every shard placement with all settings of the current table. This applies both to setting an option and to resetting one; resetting restores the values to the defaults configured on the coordinator. The effect is that the coordinator is authoritative on the settings and makes sure the shards have the same settings set as the table on the coordinator.
When a columnar table is distributed it uses the `TableDDLCommand` infrastructure to create a new kind of `TableDDLCommand`. This new type, called a `TableDDLCommandFunction`, contains a context and 2 function pointers to execute. One function returns the command as applied on the table; the second function returns the sql command to apply to a shard with a given shard id. The schema name is ignored as it will use the fully qualified name of the shard in the same schema as the base table.
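For example, a settings change issued on the coordinator is now propagated to every shard placement (a sketch; `alter_columnar_table_set`'s option names may differ between Citus versions):

```sql
-- executed on the coordinator against a distributed columnar table;
-- Citus creates a DDLJob that applies the coordinator's full set of
-- columnar options to every shard placement
SELECT alter_columnar_table_set('columnar_events', compression => 'pglz');

-- resetting works the same way and restores the coordinator's defaults on the shards
SELECT alter_columnar_table_reset('columnar_events', compression => true);
```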
Multi-row execution already uses sequential execution. When shards
are local, using local execution is profitable as it avoids
an extra connection establishment to the local node.
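For instance, a multi-row INSERT like the one below (table name is illustrative) now runs through local execution when its shards live on the local node, instead of opening an extra connection back to the same node:

```sql
INSERT INTO dist_table (id, value) VALUES (1, 'a'), (2, 'b'), (3, 'c');
```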
The join test gets the "too many clients" error too frequently, hence we
should not run anything concurrently with it. Hopefully this will fix the
flakiness of the test.
If MemoryContextAlloc errors out (e.g., during an OOM), ConnectionHashEntry->connections
stays NULL.
With this commit, we add an isValid flag to ConnectionHashEntry that is set to true
right after we allocate & initialize the ConnectionHashEntry->connections list properly,
and we check it before accessing ConnectionHashEntry->connections.
This is to avoid flaky changes like the following in test outputs:
-CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
+CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s.
Columnar options were accidentally linked to the relfilenode instead of the regclass/relation oid. This PR moves everything related to columnar options to their own catalog table.
The name of the function does not match its implementation, because the
function is designed to only consider SELECT queries. This commit also
changes the assert into an error.
Refactor internals on how Citus creates the SQL commands it sends to recreate shards.
Before, Citus collected solely DDL commands as `char *`'s to recreate a table. If they were used to create a shard they were wrapped with `worker_apply_shard_ddl_command` and sent to the workers. On the workers the UDF wrapping the ddl command would rewrite the parsetree to replace table names with their shard name equivalents.
This worked well, but poses an issue when adding columnar. Due to limitations in Postgres on creating custom options on table access methods we need to fall back on a UDF to set columnar-specific options. Now, to recreate the table, we can no longer rely on having solely DDL statements.
A prototype was made to run this UDF wrapped in `worker_apply_shard_ddl_command`. This became pretty messy, hard to understand and subsequently hard to maintain.
This PR proposes a refactor of the internal representation of table ddl commands into a `TableDDLCommand` structure. The current implementation only supports a `char *` as its contents. Based on the use of the DDL statement (e.g. creating the table -mx- or creating a shard) one of two different functions can be called to get the statement to send to the worker:
- `GetTableDDLCommand(TableDDLCommand *command)`: This function returns the ddl command to create the table. In this implementation it will just return the `char *`. This has the same functionality as getting the old list and not wrapping it.
- `GetShardedTableDDLCommand(TableDDLCommand *command, uint64 shardId, char *schemaName)`: This function returns the ddl command wrapped in `worker_apply_shard_ddl_command` with the `shardId` as an argument. For backwards compatibility it also accepts a `schemaName`; its exact purpose is not directly clear. Ideally new implementations would work with fully qualified statements and ignore the `schemaName`.
A future implementation could accept two function pointers and a `void *` context for those pointers to work on. This gives greater flexibility in controlling which commands get sent in which situations. Also, in the future, we could implement the intermediate step of creating the `parsetree` datastructure of statements based on the contents of the catalog with a corresponding deparser. For sharded queries a mutator could be run over the parsetree to rewrite the table names to the names with the shard identifier. This would completely remove the need for `worker_apply_shard_ddl_command`.
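For context, the sharded variant conceptually produces commands of this shape (the shard id, schema, and DDL text below are made up for illustration):

```sql
-- roughly what GetShardedTableDDLCommand() yields for a plain DDL string
SELECT worker_apply_shard_ddl_command (102008, 'public', 'CREATE TABLE public.my_table (a int, b text)');
```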
Considering the adaptive connection management
improvements that we plan to roll out soon, it is
very helpful to know the number of active client
backends.
We are making this addition to simplify the adaptive connection
management for single-node Citus. In single-node Citus, both the
client backends and Citus parallel queries compete to get
slots on Postgres' `max_connections` on the same Citus database.
With adaptive connection management, we have the counters for
Citus parallel queries. That helps us to adaptively decide
on the remote execution pool size (e.g., throttle connections
if necessary).
However, we do not have any counters for the total number of
client backends on the database. For single node Citus, we
should consider all the client backends, not only the remote
connections that Citus makes.
Of course Postgres internally knows how many client
backends are active. However, to get that number Postgres
iterates over all the backends. For example, see [pg_stat_get_db_numbackends](8e90ec5580/src/backend/utils/adt/pgstatfuncs.c (L1240))
where Postgres iterates over all the backends.
For our purposes, we need this information on every connection
establishment. That's why we cannot afford to do this kind of
iteration.
CitusTableTypeIdList() function iterates on all the entries of pg_dist_partition
and loads all the metadata into the cache. This can be quite memory intensive,
especially when there are lots of distributed tables.
When partitioned tables are used, it is common to have many distributed tables
given that each partition also becomes a distributed table.
CitusTableTypeIdList() is used on every CREATE TABLE .. PARTITION OF.. command
as well. This means that, any time a partition is created, Citus loads all the
metadata into the cache. Note that Citus typically loads only the accessed table's
metadata into the cache.
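For example, every partition-creation command like the one below (names are illustrative) used to trigger a CitusTableTypeIdList() call and therefore a full metadata load:

```sql
CREATE TABLE measurements_2021_01
    PARTITION OF measurements            -- a distributed, partitioned table
    FOR VALUES FROM ('2021-01-01') TO ('2021-02-01');
```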
* Move local execution after the remote execution
Before this commit, when both local and remote tasks
exist, the executor was starting the execution with
local execution. There are no strict requirements on
this.
Especially considering the adaptive connection management
improvements that we plan to roll out soon, moving the local
execution after the remote execution makes more sense.
The adaptive connection management for single node Citus
would look roughly as follows:
- Try to connect back to the coordinator for running
parallel queries.
- If it succeeds, go on and execute the tasks in parallel
- If it fails, fall back to local execution
So, we'll use local execution as a fallback mechanism. And,
moving it after the remote execution allows us to implement
such further scenarios.
The adaptive executor emulates the TCP's slow start algorithm.
Whenever the executor needs new connections, it doubles the number
of connections established in the previous iteration.
This approach is powerful. When the remote queries are very short
(like index lookups with < 1ms), even a single connection is sufficient
most of the time. When the remote queries are long, the executor
can quickly establish the necessary number of connections.
One missing piece in our implementation seems to be that the executor
keeps doubling the number of connections even if the previous
connection attempts have not been finalized yet. Instead, we should
wait until all the attempts are finalized. This is how TCP's
slow start works, and it also decreases the unnecessary pressure
on the remote nodes.
I got this warning when compiling citus:
```
../columnar/write_state_management.c: In function ‘PendingWritesInUpperTransactions’:
../columnar/write_state_management.c:364:20: warning: ‘entry’ may be used uninitialized in this function [-Wmaybe-uninitialized]
if (found && entry->writeStateStack != NULL)
~~~~~^~~~~~~~~~~~~~~~
```
I fixed this by always initializing `entry`, using an early
return if `WriteStateMap` didn't exist. Instead of using the `found`
variable to check for existence of the key, I now simply check the
`entry` variable itself.
To quote the postgres comment on hash_search (for the HASH_ENTER case):
> If foundPtr isn't NULL, then *foundPtr is set true if we found an
> existing entry in the table, false otherwise. This is needed in the
> HASH_ENTER case, but is redundant with the return value otherwise.
Before this commit, AdaptiveExecutorPreExecutorRun()
could take effect multiple times, on every FETCH on a cursor.
That does not affect the correctness of the query results,
but adds significant overhead.
It seems that we forgot to pass the relevant
flag to enable Postgres' parallel query
capabilities on the shards when the user runs
EXPLAIN ANALYZE on a distributed table.
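With the flag passed along, shard-level plans under a command like the following (query is illustrative) can again use Postgres' parallel workers:

```sql
EXPLAIN (ANALYZE, VERBOSE) SELECT count(*) FROM dist_table;
```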
TableAM API doesn't allow us to pass around a state variable along all of the tuple inserts belonging to the same command. We require this in columnar store, since we batch them, and when we have enough rows we flush them as stripes.
To do that, we keep a (relfilenode) -> stack of (subxact id, TableWriteState) global mapping.
**Inserts**
Whenever we want to insert a tuple, we look up the relation's relfilenode in this mapping. If the top of the stack matches the current subtransaction, we use the existing TableWriteState. Otherwise, we allocate a new TableWriteState and push it on top of the stack.
**(Sub)Transaction Commit/Aborts**
When the subtransaction or transaction is committed, we flush and pop all entries matching the current SubTransactionId.
When the subtransaction or transaction is aborted, we pop all entries matching the current SubTransactionId and discard them without flushing.
**Reads**
Since we might have unwritten rows which need to be read by a table scan, we flush write states on SELECTs. Since flushing the write state of upper transactions in a subtransaction would cause metadata to be written in the wrong subtransaction, we ERROR out if any of the upper subtransactions have unflushed rows.
**Table Drops**
We record in which subtransaction the table was dropped. When committing a subtransaction in which the table was dropped, we propagate the drop to the upper transaction. When aborting a subtransaction in which the table was dropped, we mark the table as not deleted.
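A hedged sketch of the read restriction described above (table name is made up):

```sql
CREATE TABLE t (a int) USING columnar;
BEGIN;
INSERT INTO t VALUES (1);   -- buffered in the outer transaction's write state
SAVEPOINT s1;
SELECT * FROM t;            -- the outer level still has unflushed rows: ERRORs out
ROLLBACK;
```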
* Update failure test dependencies
There was a security alert for cryptography. The vulnerability was fixed
in 3.2.0. The vulnerability:
"RSA decryption was vulnerable to Bleichenbacher timing vulnerabilities,
which would impact people using RSA decryption in online scenarios."
The fix:
58494b41d6
It wasn't enough to only update cryptography because mitmproxy was
incompatible with the new version, so mitmproxy was also upgraded.
The steps to run locally:
python -m pip install -U cryptography
python -m pip install -U mitmproxy
If one wishes to iterate through a List and insert list elements in
PG13, it is not safe to use for_each_ptr, as the List representation
in PostgreSQL is no longer a linked list, but an array, and it is
possible that the whole array is repalloc'ed if there is not sufficient
space available.
See postgres commit 1cff1b95ab6ddae32faa3efe0d95a820dbfdc164 for more
information.
When a relation is used in an OUTER JOIN with FALSE filters,
set_rel_pathlist_hook may not be called for the table.
There might be other cases as well, so do not rely on the hook
for classification of the tables.
Aliases that postgres chooses for partitioned tables in explain output
might change in different pg versions, so normalize them and remove
the alternative test output