Before this commit, we let AdaptiveExecutorPreExecutorRun()
take effect multiple times, once per FETCH on a cursor.
That does not affect the correctness of the query results,
but it adds significant overhead.
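A minimal sketch of the guard this implies; the finishedPreScan field is
illustrative, not necessarily the real member name:

```c
/* sketch: run the pre-execution work only once per scan */
void
AdaptiveExecutorPreExecutorRun(CitusScanState *scanState)
{
	if (scanState->finishedPreScan)
	{
		/* subsequent FETCHes on the cursor re-enter here; nothing to do */
		return;
	}

	/* ... original pre-execution work (subplans, locks, etc.) ... */

	scanState->finishedPreScan = true;
}
```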
(cherry picked from commit c433c66f2b)
* Do not take ShareUpdateExclusiveLock on pg_dist_transaction
We were taking ShareUpdateExclusiveLock on pg_dist_transaction during
recovery to prevent multiple recoveries from happening concurrently.
VACUUM (not FULL) also takes ShareUpdateExclusiveLock, and the two can
conflict. It seems that VACUUM will skip the table if there is a
conflicting lock already taken, unless it is vacuuming to prevent
transaction id wraparound, in which case there can be a deadlock. I
guess the deadlock happens if:
- VACUUM takes a lock on pg_dist_transaction and is run to prevent id
wraparound
- The transaction in the maintenance daemon tries to take a lock but
cannot, as that conflicts with the lock acquired by VACUUM
- The transaction in the maintenance daemon has a very old xid, hence
VACUUM cannot proceed.
If we take a RowExclusiveLock in transaction recovery instead, it
doesn't conflict with VACUUM, so VACUUM can proceed and the deadlock is
resolved. To prevent concurrent transaction recoveries, an advisory
lock is taken with ShareUpdateExclusiveLock as before.
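A rough sketch of that locking; DistTransactionRelationId(),
SET_LOCKTAG_CITUS_OPERATION and CITUS_TRANSACTION_RECOVERY are
Citus-side names used here as assumptions:

```c
/*
 * Sketch: lock pg_dist_transaction with RowExclusiveLock so VACUUM can
 * still process it, and serialize recoveries through an advisory lock.
 */
static void
LockTransactionRecoverySketch(void)
{
	Relation pgDistTransaction = table_open(DistTransactionRelationId(),
											RowExclusiveLock);

	LOCKTAG tag;
	SET_LOCKTAG_CITUS_OPERATION(tag, CITUS_TRANSACTION_RECOVERY);

	/* conflicts only with other backends running transaction recovery */
	(void) LockAcquire(&tag, ShareUpdateExclusiveLock, false, false);

	/* ... recover prepared transactions, then close the relation ... */
	table_close(pgDistTransaction, NoLock);
}
```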
* Use CITUS_OPERATIONS tag
(cherry picked from commit e7cd1ed0ee)
Conflicts:
src/backend/distributed/transaction/transaction_recovery.c
* Use CalculateUniformHashRangeIndex in HashPartitionId
The INT32_MIN definition can differ among platforms, so it is possible
to get an overflow; we would see crashes because of this on Debian
distros. We had already solved a similar problem by introducing the
CalculateUniformHashRangeIndex method, so we use the same method here
as well; this also removes some duplication and leaves a single place
that makes that decision.
* Use PG_INT32_XX instead of INT32_XX to be safer
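For reference, a minimal sketch of mapping a hash value onto a uniform
range using 64-bit arithmetic, so the subtraction of PG_INT32_MIN cannot
overflow (a simplified version, not the exact Citus implementation):

```c
#include "postgres.h"	/* PG_INT32_MIN, PG_UINT32_MAX, int32/int64/uint64 */

/*
 * Map a signed 32-bit hash value onto one of hashRangeCount uniform
 * ranges covering the full int32 space. Assumes hashRangeCount > 0.
 */
static int
CalculateUniformHashRangeIndexSketch(int32 hashedValue, int hashRangeCount)
{
	uint64 hashTokenCount = ((uint64) PG_UINT32_MAX) + 1;	/* 2^32 values */
	uint64 rangeSize = hashTokenCount / (uint64) hashRangeCount;

	/* shift into [0, 2^32) in 64-bit arithmetic, no int32 overflow possible */
	uint64 shiftedValue = (uint64) ((int64) hashedValue - PG_INT32_MIN);

	int rangeIndex = (int) (shiftedValue / rangeSize);

	/* the last range absorbs the remainder when 2^32 isn't evenly divisible */
	if (rangeIndex >= hashRangeCount)
	{
		rangeIndex = hashRangeCount - 1;
	}

	return rangeIndex;
}
```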
(cherry picked from commit ef841115de)
We were using the ALL_WORKERS TargetWorkerSet while sending temporary
schema creation and cleanup commands. We (well, mostly I) thought that
ALL_WORKERS would also include the coordinator when it is added as a
worker. It turns out that it was FILTERING OUT the coordinator even if
it is added as a worker to the cluster.
To give some context here, in repartitions, for each jobId we create
(at least we were supposed to) a schema on each worker node in the
cluster. Then we partition each shard table into some intermediate
files, which is called the PARTITION step. After this partition step
each node has some intermediate files holding tuples. Then we fetch the
partition files to the necessary worker nodes, which is called the
FETCH step. Then from those files we create intermediate tables in the
temporarily created schemas, which is called the MERGE step. After
evaluating the result, we remove the temporary schemas (one for each
job ID on each node) and the files.
If node 1 has file1, and node 2 has file2 after PARTITION step, it is
enough to either move file1 from node1 to node2 or vice versa. So we
prune one of them.
In the MERGE step, if the schema for a given jobID doesn't exist, the
node tries to use the `public` schema if it is a superuser, a fallback
that was actually added for testing in the past.
So when we were not sending schema creation commands for each job ID to
the coordinator (because we were using the ALL_WORKERS flag, and it
doesn't include the coordinator), we would basically not have any
schemas for repartitions on the coordinator. The PARTITION step would
be executed on the coordinator (because the tasks are generated in the
planner part) and it wouldn't give us any error because it doesn't have
anything to do with the temporary schemas (that we didn't create). But
later two things would happen:
- If by chance the fetch is pruned on the coordinator side, the other
nodes would fetch the partitioned files from the coordinator and
execute the query as expected, because they have all the information.
- If the fetch tasks are not pruned on the coordinator, in the MERGE
step the coordinator would either error out saying that the necessary
schema doesn't exist, or it would try to create the temporary tables
under the public schema (if it is a superuser). But then if we had the
same task ID with a different jobID it would fail saying that the table
already exists, which is an error we were getting.
In the first case, the query would work okay, but it would still not do
the cleanup, hence we would leave the partitioned files from the
PARTITION step behind, and ensure_no_intermediate_data_leak would fail.
To make things more explicit and prevent such bugs in the future,
ALL_WORKERS is renamed to ALL_NON_COORD_WORKERS, and a new flag,
ALL_DATA_NODES, is added to return all the active nodes. For the
repartition case we don't strictly need the nodes that only hold
reference tables, but this version makes the code simpler and there
shouldn't be any significant performance issue with that.
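Roughly, the distinction looks like the sketch below; only the two flag
names come from this change, the helper functions are hypothetical:

```c
typedef enum TargetWorkerSet
{
	ALL_NON_COORD_WORKERS,	/* previously ALL_WORKERS; filters out the coordinator */
	ALL_DATA_NODES			/* every active node, including an added coordinator */
} TargetWorkerSet;

static List *
TargetNodeListSketch(TargetWorkerSet targetWorkerSet)
{
	if (targetWorkerSet == ALL_DATA_NODES)
	{
		/* repartition schema creation/cleanup must reach the coordinator too */
		return AllActivePrimaryNodeList();			/* hypothetical helper */
	}

	return ActiveNonCoordinatorWorkerNodeList();	/* hypothetical helper */
}
```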
(cherry picked from commit 6532506f4b92b1316eea0812b2bcedb818d3b25c)
The names were not explicit about what they do, and we have many
misuses in the codebase, so they are renamed to be more explicit.
(cherry picked from commit 09962a7e2ff340705b6b193bbfececa2d48e0855)
Rename the TargetWorkerSet enums to make them more explicit about what
they mean. Ideally it would be good to treat everything as a node,
without the 'worker' concept, because it makes things complicated.
Another improvement could be to rename TargetWorkerSet to
TargetNodeSet, but that would mean renaming many occurrences of Worker,
which is probably too big a change for this PR.
(cherry picked from commit de4b9569359e4f10d4ebf3fbcf7159ee6e2328db)
Static analysis found some issues where we used the result from
ExtractResultRelationRTE, without checking that it wasn't NULL. It seems
like in all these cases it can never actually be NULL, since we have checked
before that it isn't a SELECT query. So, this PR is mostly to make static
analysis happy (and protect a bit against future changes of the code).
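The added checks are along these lines (a sketch, not the exact diff):

```c
static Oid
ResultRelationOidSketch(Query *query)
{
	RangeTblEntry *resultRte = ExtractResultRelationRTE(query);

	/*
	 * Should not happen: we already verified this is not a SELECT query.
	 * The check keeps static analysis happy and guards future changes.
	 */
	if (resultRte == NULL)
	{
		ereport(ERROR, (errmsg("query does not contain a result relation")));
	}

	return resultRte->relid;
}
```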
(cherry picked from commit 759e628dd5)
Some GUCs support a list of values, which is indicated by the
GUC_LIST_INPUT flag.
When an ALTER ROLE .. SET statement is executed, the new configuration
default for the affected users and databases is stored in the
setconfig(text[]) column of a pg_db_role_setting record.
If a GUC that supports a list of values is used in an ALTER ROLE .. SET
statement, we need to split the text into items delimited by commas.
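One way to do that split is PostgreSQL's SplitGUCList() helper, which
respects quoting; whether that exact helper is used here is an
assumption, and the surrounding names are illustrative:

```c
#include "postgres.h"
#include "utils/varlena.h"		/* SplitGUCList() */

/*
 * Split a list-valued GUC setting such as
 *   search_path = "\"$user\", public"
 * into its comma-separated items.
 */
static List *
SplitListGucValueSketch(const char *gucValue)
{
	List *valueList = NIL;
	char *valueCopy = pstrdup(gucValue);	/* SplitGUCList modifies its input */

	if (!SplitGUCList(valueCopy, ',', &valueList))
	{
		ereport(ERROR, (errmsg("invalid list syntax in GUC value \"%s\"",
							   gucValue)));
	}

	return valueList;
}
```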
(cherry picked from commit e534dbae4a)
#3866 removed the shard ID hash in metadata_cache.c to simplify cache
management, but we observed a significant performance regression. In our
benchmarks it was masked by the performance improvement provided by
#3654, but #3654 only applies to specific workloads.
This PR brings back the shard ID cache as it existed before #3866 with some extra
measures to handle invalidation. When we load a table entry, we overwrite
ShardIdCacheEntry->tableEntry pointers for all the shards in that table, though
it's possible that the table no longer contains the old shard ID or the table
entry is never reloaded, which would leave a dangling pointer once the table
entry is freed. To handle that case, we remove all shard ID cache entries that
point exactly to that table entry when the table entry is freed (at the
end of the transaction or on any call to
CitusTableCacheFlushInvalidatedEntries).
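A sketch of that invalidation pass; the struct layout and hash table
name are assumptions based on the description above:

```c
/* sketch: per-shard cache entry pointing back at the owning table entry */
typedef struct ShardIdCacheEntrySketch
{
	int64 shardId;				/* hash key */
	CitusTableCacheEntry *tableEntry;
	int shardIndex;
} ShardIdCacheEntrySketch;

/*
 * When a table cache entry is freed, remove every shard ID entry that
 * still points at it, so no dangling tableEntry pointers survive.
 */
static void
RemoveStaleShardIdCacheEntriesSketch(HTAB *shardIdCacheHash,
									 CitusTableCacheEntry *freedTableEntry)
{
	HASH_SEQ_STATUS status;
	ShardIdCacheEntrySketch *entry = NULL;

	hash_seq_init(&status, shardIdCacheHash);
	while ((entry = hash_seq_search(&status)) != NULL)
	{
		if (entry->tableEntry == freedTableEntry)
		{
			/* deleting the current element is safe during a dynahash scan */
			hash_search(shardIdCacheHash, &entry->shardId, HASH_REMOVE, NULL);
		}
	}
}
```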
Co-authored-by: SaitTalhaNisanci <s.talhanisanci@gmail.com>
Co-authored-by: Marco Slot <marco.slot@gmail.com>
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
This is so we don't need to calculate it twice in
insert_select_executor.c and multi_explain.c, which could cause a
discrepancy if an update in one of them is not reflected in the other.
* Do not set TaskExecution with the adaptive executor
The adaptive executor uses a utility method from the task tracker for
repartition joins; however, the adaptive executor doesn't need
taskExecution. It is only used by the task tracker. This causes a
problem when EXPLAIN ANALYZE is used, because what taskExecution points
to might be random.
We solve this by not setting taskExecution from the adaptive executor,
so it will stay NULL as set by CreateTask.
* Use the same memory context as the task for taskExecution
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
In #3901 the "Data received from worker(s)" sections were added to EXPLAIN
ANALYZE. After merging, @pykello posted some review comments. This addresses
those comments as well as fixing a few other issues that I found while
addressing them. The things this does:
1. Fix `EXPLAIN ANALYZE EXECUTE p1` to not increase received data on every
execution
2. Fix `EXPLAIN ANALYZE EXECUTE p1(1)` to not always return 0 bytes as
received data.
3. Move `EXPLAIN ANALYZE` specific logic to `multi_explain.c` from
`adaptive_executor.c`
4. Change naming of new explain sections to `Tuple data received from node(s)`.
Firstly because a task can reference the coordinator too, so "worker(s)" was
incorrect. Secondly to indicate that this is tuple data and not all network
traffic that was performed.
5. Rename `totalReceivedData` in our codebase to `totalReceivedTupleData` to
make it clearer that it's a tuple data counter, not all network traffic.
6. Actually add `binary_protocol` test to `multi_schedule` (woops)
7. Fix a randomly failing test in `local_shard_execution.sql`.
Shard id to index mapping is stored in the cache entry, as there may now
be multiple entries alive for a given relation.
insert_select_executor: revert copying the cache entry, which was a hack
added to avoid memory safety issues.
Sadly this does not actually work yet for binary protocol data, because
when doing EXPLAIN ANALYZE we send two commands at the same time. This
means we cannot use `SendRemoteCommandParams`, and thus cannot use the
binary protocol. This can still be useful though when using the text
protocol, to find out that a lot of data is being sent.
* Insert select with master query
* Use relid to set custom_scan_tlist varno
* Reviews
* Fixes null check
Co-authored-by: Marco Slot <marco.slot@gmail.com>
This can save a lot of data from being sent in some cases, thus
improving performance for workloads where inter-query bandwidth is the
bottleneck.
There are some issues with enabling this as the default, so that's
currently not done.
DESCRIPTION: Adds support to partially push down tdigest aggregates
tdigest extension: https://github.com/tvondra/tdigest
This PR implements the partial pushdown of tdigest calculations when
possible. The extension adds a tdigest type, and multiple tdigests can
be combined into a single one. There are several aggregate functions
that can be used to get:
- a quantile
- a list of quantiles
- the quantile of a hypothetical value
- a list of quantiles for a list of hypothetical values
These functions can work on both plain values and tdigest types.
Since we can create tdigest values either by combining existing ones or
from a group of values, we can rewrite the aggregates in such a way
that most of the computation gets delegated to the shards. This both
speeds up the percentile calculations, because the values don't have to
be sorted, and makes the transfer size from the shards to the
coordinator significantly smaller.
We still recursively plan some cases, e.g.:
- INSERTs
- SELECT FOR UPDATE when there are reference tables in the query
- everything must be on the same single shard & have the same
replication model
We wrap worker tasks in worker_save_query_explain_analyze() so we can
fetch their explain output later via a call to
worker_last_saved_explain_analyze().
Fixes #3519 Fixes #2347 Fixes #2613 Fixes #621
Implements worker_save_query_explain_analyze and worker_last_saved_explain_analyze.
worker_save_query_explain_analyze executes a query and returns its
results while saving its EXPLAIN ANALYZE output to be fetched later.
worker_last_saved_explain_analyze returns the saved EXPLAIN ANALYZE result.
If we want to get the necessary lock mode for a relation RangeVar
within a query, we can get the lock mode easily from the RangeVar
itself (if the PG version is >= 12).
However, if we want to decide the lock mode appropriate for the
"query", we can derive this information by using GetQueryLockMode,
following the code comment on RangeTblEntry->rellockmode.
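A sketch of that derivation, following the rellockmode comment in
parsenodes.h (not necessarily the exact body of GetQueryLockMode):

```c
#include "postgres.h"
#include "nodes/parsenodes.h"	/* Query, CMD_SELECT */
#include "storage/lockdefs.h"	/* LOCKMODE, RowExclusiveLock, ... */

/* derive the lock mode appropriate for a whole query */
static LOCKMODE
QueryLockModeSketch(Query *query)
{
	if (query->commandType != CMD_SELECT)
	{
		/* target relation of INSERT/UPDATE/DELETE */
		return RowExclusiveLock;
	}
	else if (query->hasForUpdate)
	{
		/* SELECT ... FOR UPDATE/SHARE */
		return RowShareLock;
	}
	else
	{
		return AccessShareLock;
	}
}
```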
Implements a new `TupleDestination` interface to allow custom tuple
processing per task.
This can be especially useful if a task contains multiple queries. An
example of this is EXPLAIN ANALYZE, where we need to add some UDF calls
to the query to fetch the explain output from the worker after fetching
the actual query results.
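Roughly, such an interface can look like the sketch below; the member
names are illustrative, based on this description:

```c
/*
 * Sketch: a per-task tuple sink, so each query of a task can be routed
 * to its own handler (e.g. query results vs. saved EXPLAIN output).
 */
typedef struct TupleDestinationSketch TupleDestinationSketch;

struct TupleDestinationSketch
{
	/* called for every tuple produced by queryNumber of the given task */
	void (*putTuple)(TupleDestinationSketch *self, Task *task,
					 int placementIndex, int queryNumber,
					 HeapTuple heapTuple);

	/* expected tuple descriptor for queryNumber of the given task */
	TupleDesc (*tupleDescForQuery)(TupleDestinationSketch *self,
								   int queryNumber);
};
```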
We should check the remove type in IsDropCitusStmt because if the
remove type is not OBJECT_EXTENSION, then the objects stored in
dropStmt->objects may not be of type Value. This was crashing on PG 13.
Also rename the method to IsDropCitusExtensionStmt.
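A simplified sketch of the resulting check; the real function may differ
in details:

```c
#include "postgres.h"
#include "nodes/parsenodes.h"	/* DropStmt, OBJECT_EXTENSION */
#include "nodes/value.h"		/* strVal */

/* return true only for DROP EXTENSION statements that drop citus */
static bool
IsDropCitusExtensionStmtSketch(Node *parseTree)
{
	if (!IsA(parseTree, DropStmt))
	{
		return false;
	}

	DropStmt *dropStmt = (DropStmt *) parseTree;

	/* for other remove types, objects does not hold plain Value strings */
	if (dropStmt->removeType != OBJECT_EXTENSION)
	{
		return false;
	}

	ListCell *objectCell = NULL;
	foreach(objectCell, dropStmt->objects)
	{
		Node *object = (Node *) lfirst(objectCell);
		if (strcmp(strVal(object), "citus") == 0)
		{
			return true;
		}
	}

	return false;
}
```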
To reduce code duplication, implement a function that pushes a NIL
search_path and sets addCatalog to true so that all objects outside of
pg_catalog will be schema-prefixed.
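A sketch of such a helper on top of PostgreSQL's override search path
API; the function name is illustrative:

```c
#include "postgres.h"
#include "catalog/namespace.h"	/* GetOverrideSearchPath, PushOverrideSearchPath */

/*
 * Push an empty search_path with addCatalog set, so everything outside
 * pg_catalog gets schema-qualified while it is in effect. Callers must
 * pair this with PopOverrideSearchPath().
 */
static void
PushEmptySearchPathSketch(void)
{
	OverrideSearchPath *overridePath = GetOverrideSearchPath(CurrentMemoryContext);
	overridePath->schemas = NIL;
	overridePath->addCatalog = true;

	PushOverrideSearchPath(overridePath);
}
```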
SELECT_TASK is renamed to READ_TASK as a SELECT with modifying CTEs will be a MODIFYING_TASK
RouterInsertJob: Assert originalQuery->commandType == CMD_INSERT
CreateModifyPlan: Assert originalQuery->commandType != CMD_SELECT
Remove unused function IsModifyDistributedPlan
DistributedExecution, ExecutionParams, DistributedPlan: Rename hasReturning to expectResults
SELECTs set expectResults to true
Rename CreateSingleTaskRouterPlan to CreateSingleTaskRouterSelectPlan
This PR removes ExecuteUtilityTaskListWithoutResults and uses the same
path for local execution via ExecuteTaskListExtended.
ExecuteUtilityTaskList is added. ExecuteLocalTaskListExtended now has a
parameter for utility commands so that it can call the right method. In
order not to change the existing calls,
ExecuteTaskListExtendedInternal is added, which is the main method that
runs the execution, via local and remote execution.
The physical planner doesn't support parameters. If the parameters have
already been resolved by the time the physical planner handles the
queries, mark it. The reason is that the executor is unaware of this
and sends the parameters along with the worker queries, which fails for
composite types.
(See `DissuadePlannerFromUsingPlan()` for the details of parameter
resolution.)
As reported in #3787, we were having issues while building citus with
"GCC Red Hat 10" (and maybe some other versions of gcc as well).
Fixes the "multiple definition of 'CitusNodeTagNames'" error by
explicitly specifying the storage of CitusNodeTagNames as extern.