Since PG14 we can now use binary encoding for arrays and composite types
that contain user defined types. This was fixed in this commit in
Postgres: 670c0a1d47
This change starts using that knowledge, by not necessarily falling back
to text encoding anymore for those types.
While doing this and testing a bit more I found various cases where
binary encoding would fail that our checks didn't cover. This fixes
those cases and adds tests for those. It also fixes EXPLAIN ANALYZE
never using binary encoding, which was a leftover of workaround that
was not necessary anymore.
Finally, it changes the default for both `citus.enable_binary_protocol`
and `citus.binary_worker_copy_format` to `true` for PG14 and up. In our
cloud offering `binary_worker_copy_format` already was true by default.
`enable_binary_protocol` had some bug with MX and user defined types,
this bug was fixed by the above mentioned fixes.
CopyState struct is divided into parts and one of them is CopyFromState
This macro uses the appropriate one for PG versions
Relevant PG commit:
c532d15dddff14b01fe9ef1d465013cb8ef186df
In short, add wrappers around Postgres' AddWaitEventToSet() and
ModifyWaitEvent().
AddWaitEventToSet()/ModifyWaitEvent*() may throw hard errors. For
example, when the underlying socket for a connection is closed by
the remote server and already reflected by the OS, however
Citus hasn't had a chance to get this information. In that case,
if replication factor is >1, Citus can failover to other nodes
for executing the query. Even if replication factor = 1, Citus
can give much nicer errors.
So CitusAddWaitEventSetToSet()/CitusModifyWaitEvent() simply puts
AddWaitEventToSet()/ModifyWaitEvent() into a PG_TRY/PG_CATCH block
in order to catch any hard errors, and returns this information to
the caller.
Ignore orphaned shards in more places
Only use active shard placements in RouterInsertTaskList
Use IncludingOrphanedPlacements in some more places
Fix comment
Add tests
With local query caching, we try to avoid deparse/parse stages as the
operation is too costly.
However, we can do deparse/parse operations once per cached queries, right
before we put the plan into the cache. With that, we avoid edge
cases like (4239) or (5038).
In a sense, we are making the local plan caching behave similar for non-cached
local/remote queries, by forcing to deparse the query once.
Previously this was usually done after argument parsing. This can cause
SEGFAULTs if the number or type of arguments changes in a new version.
By checking that Citus version is correct before doing any argument
parsing we protect against these types of issues. Issues like this have
occurred in pg_auto_failover, so it's not just a theoretical issue.
The main reason why these calls were not at the top of functions is
really just historical. It was because in the past we didn't allow
statements before declarations. Thus having this check before the
argument parsing would have only been possible if we first declared all
variables.
In addition to moving existing CheckCitusVersion calls it also adds
these calls to rebalancer related functions (they were missing there).
Comment from the code:
/*
* Iterate until all the tasks are finished. Once all the tasks
* are finished, ensure that that all the connection initializations
* are also finished. Otherwise, those connections are terminated
* abruptly before they are established (or failed). Instead, we let
* the ConnectionStateMachine() to properly handle them.
*
* Note that we could have the connections that are not established
* as a side effect of slow-start algorithm. At the time the algorithm
* decides to establish new connections, the execution might have tasks
* to finish. But, the execution might finish before the new connections
* are established.
*/
Note that the abruptly terminated connections lead to the following errors:
2020-11-16 21:09:09.800 CET [16633] LOG: could not accept SSL connection: Connection reset by peer
2020-11-16 21:09:09.872 CET [16657] LOG: could not accept SSL connection: Undefined error: 0
2020-11-16 21:09:09.894 CET [16667] LOG: could not accept SSL connection: Connection reset by peer
To easily reproduce the issue:
- Create a single node Citus
- Add the coordinator to the metadata
- Create a distributed table with shards on the coordinator
- f.sql: select count(*) from test;
- pgbench -f /tmp/f.sql postgres -T 12 -c 40 -P 1 or pgbench -f /tmp/f.sql postgres -T 12 -c 40 -P 1 -C
With this commit, the executor becomes smarter about refrain to open
new connections. The very basic example is that, if the connection
establishments take 1000ms and task executions as 5 msecs, the executor
becomes smart enough to not establish new connections.
With https://github.com/citusdata/citus/pull/4806 we enabled
2PC for any non-read-only local task. However, if the execution
is a single task, enabling 2PC (CoordinatedTransactionShouldUse2PC)
hits an assertion as we are not in a coordinated transaction.
There is no downside of using a coordinated transaction for single
task local queries.
Because setting the flag doesn't necessarily mean that we'll
use 2PC. If connections are read-only, we will not use 2PC.
In other words, we'll use 2PC only for connections that modified
any placements.
Before this commit, Citus used 2PC no matter what kind of
local query execution happens.
For example, if the coordinator has shards (and the workers as well),
even a simple SELECT query could start 2PC:
```SQL
WITH cte_1 AS (SELECT * FROM test LIMIT 10) SELECT count(*) FROM cte_1;
```
In this query, the local execution of the shards (and also intermediate
result reads) triggers the 2PC.
To prevent that, Citus now distinguishes local reads and local writes.
And, Citus switches to 2PC only if a modification happens. This may
still lead to unnecessary 2PCs when there is a local modification
and remote SELECTs only. Though, we handle that separately
via #4587.
When we use PROCESS_UTILITY_TOPLEVEL it causes some problems when
combined with other extensions such as pg_audit. With this commit we use
PROCESS_UTILITY_QUERY in the codebase to fix those problems.
When COPY is used for copying into co-located files, it was
not allowed to use local execution. The primary reason was
Citus treating co-located intermediate results as co-located
shards, and COPY into the distributed table was done via
"format result". And, local execution of such COPY commands
was not implemented.
With this change, we implement support for local execution with
"format result". To do that, we use the buffer for every file
on shardState->copyOutState, similar to how local copy on
shards are implemented. In fact, the logic is similar to
local copy on shards, but instead of writing to the shards,
Citus writes the results to a file.
The logic relies on LOCAL_COPY_FLUSH_THRESHOLD, and flushes
only when the size exceeds the threshold. But, unlike local
copy on shards, in this case we write the headers and footers
just once.
* Make undistribute_table() and citus_create_local_table() work with columnar
* Rename and use LocallyExecuteUtilityTask for UDF check
* Remove 'local' references in ExecuteUtilityCommand
As described in the comment, we have observed crashes in production
due to a segfault caused by the dereference of a NULL pointer in our
connection statemachine.
As a mitigation, preventing system crashes, we provide an error with
a small explanation of the issue. Unfortunately the case is not
reliably reproduced yet, hence the inability to add tests.
DESCRIPTION: Prevent segfaults when SAVEPOINT handling cannot recover from connection failures
Baseinfo also has pushed down filters etc, so it makes more sense to use
BaseRestrictInfo to determine what columns have constant equality
filters.
Also RteIdentity is used for removing conversion candidates instead of
rteIndex.
It seems that most of the updates were broken, we weren't aware of it
because there wasn't any data in the tables. They are broken mostly
because local tables do not have a shard id and some code paths should
be updated with that information, currently when there is an invalid
shard id, it is assumed to be pruned.
Consider local tables in router planner
In case there is a local table, the shard id will not be valid and there
are some checks that rely on shard id, we should skip these in case of
local tables, which is handled with a dummy placement.
Add citus local table dist table join tests
add local-dist table mixed joins tests
When Citus needs to parallelize queries on the local node (e.g., the node
executing the distributed query and the shards are the same), we need to
be mindful about the connection management. The reason is that the client
backends that are running distributed queries are competing with the client
backends that Citus initiates to parallelize the queries in order to get
a slot on the max_connections.
In that regard, we implemented a "failover" mechanism where if the distributed
queries cannot get a connection, the execution failovers the tasks to the local
execution.
The failover logic is follows:
- As the connection manager if it is OK to get a connection
- If yes, we are good.
- If no, we fail the workerPool and the failure triggers
the failover of the tasks to local execution queue
The decision of getting a connection is follows:
/*
* For local nodes, solely relying on citus.max_shared_pool_size or
* max_connections might not be sufficient. The former gives us
* a preview of the future (e.g., we let the new connections to establish,
* but they are not established yet). The latter gives us the close to
* precise view of the past (e.g., the active number of client backends).
*
* Overall, we want to limit both of the metrics. The former limit typically
* kics in under regular loads, where the load of the database increases in
* a reasonable pace. The latter limit typically kicks in when the database
* is issued lots of concurrent sessions at the same time, such as benchmarks.
*/
Multi-row execution already uses sequential execution. When shards
are local, using local execution is profitable as it avoids
an extra connection establishment to the local node.
* Move local execution after the remote execution
Before this commit, when both local and remote tasks
exist, the executor was starting the execution with
local execution. There is no strict requirements on
this.
Especially considering the adaptive connection management
improvements that we plan to roll soon, moving the local
execution after to the remote execution makes more sense.
The adaptive connection management for single node Citus
would look roughly as follows:
- Try to connect back to the coordinator for running
parallel queries.
- If succeeds, go on and execute tasks in parallel
- If fails, fallback to the local execution
So, we'll use local execution as a fallback mechanism. And,
moving it after to the remote execution allows us to implement
such further scenarios.
The adaptive executor emulates the TCP's slow start algorithm.
Whenever the executor needs new connections, it doubles the number
of connections established in the previous iteration.
This approach is powerful. When the remote queries are very short
(like index lookup with < 1ms), even a single connection is sufficent
most of the time. When the remote queries are long, the executor
can quickly establish necessary number of connections.
One missing piece on our implementation seems that the executor
keeps doubling the number of connections even if the previous
connection attempts have been finalized. Instead, we should
wait until all the attempts are finalized. This is how TCP's
slow-start works. Plus, it decreases the unnecessary pressure
on the remote nodes.
Before this commit, we let AdaptiveExecutorPreExecutorRun()
to be effective multiple times on every FETCH on cursors.
That does not affect the correctness of the query results,
but adds significant overhead.
With this commit, we make sure that local execution adds the
intermediate result size as the distributed execution adds. Plus,
it enforces the citus.max_intermediate_result_size value.
We should not access CurrentLocalExecutionStatus directly because that
would mean that we could also set it directly, which we shouldn't
because we have checks to see if the new state is possible, otherwise we
error.
After the connection timeout, we fail the session/pool. However, the
underlying connection can still be trying to connect. That is dangerous
because the new placement executions have already been in place. The
executor cannot handle the situation where multiple of
EXECUTION_ORDER_ANY task executions succeeds.
Adding a regression test doesn't seem easily doable. To reproduce the issue
- Add 2 worker nodes
- create a reference table
- set citus.node_connection_timeout to 1ms (requires code change)
- Continiously execute `SELECT count(*) FROM ref_table`
- Sometime later, you hit an out-of-array access in
`ScheduleNextPlacementExecution()` hence crashing.
- The reason for that is sometimes the first connection
successfully established while the executor is already
trying to execute the query on the second node.
Introduce table entry utility functions
Citus table cache entry utilities are introduced so that we can easily
extend existing functionality with minimum changes, specifically changes
to these functions. For example IsNonDistributedTableCacheEntry can be
extended for citus local tables without the need to scan the whole
codebase and update each relevant part.
* Introduce utility functions to find the type of tables
A table type can be a reference table, a hash/range/append distributed
table. Utility methods are created so that we don't have to worry about
how a table is considered as a reference table etc. This also makes it
easy to extend the table types.
* Add IsCitusTableType utilities
* Rename IsCacheEntryCitusTableType -> IsCitusTableTypeCacheEntry
* Change citus table types in some checks
CMDTAG_SELECT exists in PG12 hence defining a MACRO such as
CMDTAG_SELECT -> "SELECT" is not possible. I chose CMDTAG_SELECT_COMPAT
because with the COMPAT suffix it is explicit that it maps to different
things in different versions and also has a less chance of mapping
something irrevelant. For example if we used SELECT as a macro, then it
would map every SELECT to whatever it is mapping to, which might have
unexpected/undesired behaviour.
The error message when index has opclassopts is improved and the commit
from postgres side is also included for future reference.
Also some minor style related changes are applied.
This commit mostly adds pg_get_triggerdef_command to our ruleutils_13.
This doesn't add anything extra for ruleutils 13 so it is basically a copy
of the change on ruleutils_12
addRangeTableEntryXXX methods return a ParseNamespaceItem with pg >= 13.
RangeTableEntryFromNSItem macro is added so that we return the range
table entry from the ParseNamespaceItem in pg>=13 and for pg < 13 rte
would already be returned with addRangeTableEntryXXX methods.
Commit on Postgres side:
5815696bc66b3092f6361f53e0394909647042c8
PortalDefineQuery doesn't accept char* for command tag anymore with PG
>= 13. We are currently only using it with Select, therefore a Portal
define query compat for select is created.
Commit on PG side:
2f9661311b83dc481fc19f6e3bda015392010a40
As the new planner and pg_plan_query_compat methods expect the query
string as well, macros are defined to be compatible in different
versions of postgres.
Relevant commit on Postgres:
6aba63ef3e606db71beb596210dd95fa73c44ce2
Command on Postgres:
git log --all --grep="pg_plan_query"
With PG13 heap_* (heap_open, heap_close etc) are replaced with table_*
(table_open, table_close etc).
It is better to use the new table access methods in the codebase and
define the macros for the previous versions as we can easily remove the
macro without having to change the codebase when we drop the support for
the old version.
Commits that introduced this change on Postgres:
f25968c49697db673f6cd2a07b3f7626779f1827
e0c4ec07284db817e1f8d9adfb3fffc952252db0
4b21acf522d751ba5b6679df391d5121b6c4a35f
Command to see relevant commits on Postgres side:
git log --all --grep="heap_open"
With this patch, we introduce `locally_reserved_shared_connections.c/h` files
which are responsible for reserving some space in shared memory counters
upfront.
We sometimes need to reserve connections, but not necessarily
establish them. For example:
- COPY command should reserve connections as it cannot know which
connections it needs in which order. COPY establishes connections
as any input data hits the workers. For example, for router COPY
command, it only establishes 1 connection.
As discussed here (https://github.com/citusdata/citus/pull/3849#pullrequestreview-431792473),
COPY needs to reserve connections up-front, otherwise we can end
up with resource starvation/un-detected deadlocks.
The executor relies on WorkerPool, and many other places rely on WorkerNode.
With this commit, we make sure that they are sorted via the same function/logic.
* use adaptive executor even if task-tracker is set
* Update check-multi-mx tests for adaptive executor
Basically repartition joins are enabled where necessary. For parallel
tests max adaptive executor pool size is decresed to 2, otherwise we
would get too many clients error.
* Update limit_intermediate_size test
It seems that when we use adaptive executor instead of task tracker, we
exceed the intermediate result size less in the test. Therefore updated
the tests accordingly.
* Update multi_router_planner
It seems that there is one problem with multi_router_planner when we use
adaptive executor, we should fix the following error:
+ERROR: relation "authors_range_840010" does not exist
+CONTEXT: while executing command on localhost:57637
* update repartition join tests for check-multi
* update isolation tests for repartitioning
* Error out if shard_replication_factor > 1 with repartitioning
As we are removing the task tracker, we cannot switch to it if
shard_replication_factor > 1. In that case, we simply error out.
* Remove MULTI_EXECUTOR_TASK_TRACKER
* Remove multi_task_tracker_executor
Some utility methods are moved to task_execution_utils.c.
* Remove task tracker protocol methods
* Remove task_tracker.c methods
* remove unused methods from multi_server_executor
* fix style
* remove task tracker specific tests from worker_schedule
* comment out task tracker udf calls in tests
We were using task tracker udfs to test permissions in
multi_multiuser.sql. We should find some other way to test them, then we
should remove the commented out task tracker calls.
* remove task tracker test from follower schedule
* remove task tracker tests from multi mx schedule
* Remove task-tracker specific functions from worker functions
* remove multi task tracker extra schedule
* Remove unused methods from multi physical planner
* remove task_executor_type related things in tests
* remove LoadTuplesIntoTupleStore
* Do initial cleanup for repartition leftovers
During startup, task tracker would call TrackerCleanupJobDirectories and
TrackerCleanupJobSchemas to clean up leftover directories and job
schemas. With adaptive executor, while doing repartitions it is possible
to leak these things as well. We don't retry cleanups, so it is possible
to have leftover in case of errors.
TrackerCleanupJobDirectories is renamed as
RepartitionCleanupJobDirectories since it is repartition specific now,
however TrackerCleanupJobSchemas cannot be used currently because it is
task tracker specific. The thing is that this function is a no-op
currently.
We should add cleaning up intermediate schemas to DoInitialCleanup
method when that problem is solved(We might want to solve it in this PR
as well)
* Revert "remove task tracker tests from multi mx schedule"
This reverts commit 03ecc0a681.
* update multi mx repartition parallel tests
* not error with task_tracker_conninfo_cache_invalidate
* not run 4 repartition queries in parallel
It seems that when we run 4 repartition queries in parallel we get too
many clients error on CI even though we don't get it locally. Our guess
is that, it is because we open/close many connections without doing some
work and postgres has some delay to close the connections. Hence even
though connections are removed from the pg_stat_activity, they might
still not be closed. If the above assumption is correct, it is unlikely
for it to happen in practice because:
- There is some network latency in clusters, so this leaves some times
for connections to be able to close
- Repartition joins return some data and that also leaves some time for
connections to be fully closed.
As we don't get this error in our local, we currently assume that it is
not a bug. Ideally this wouldn't happen when we get rid of the
task-tracker repartition methods because they don't do any pruning and
might be opening more connections than necessary.
If this still gives us "too many clients" error, we can try to increase
the max_connections in our test suite(which is 100 by default).
Also there are different places where this error is given in postgres,
but adding some backtrace it seems that we get this from
ProcessStartupPacket. The backtraces can be found in this link:
https://circleci.com/gh/citusdata/citus/138702
* Set distributePlan->relationIdList when it is needed
It seems that we were setting the distributedPlan->relationIdList after
JobExecutorType is called, which would choose task-tracker if
replication factor > 1 and there is a repartition query. However, it
uses relationIdList to decide if the query has a repartition query, and
since it was not set yet, it would always think it is not a repartition
query and would choose adaptive executor when it should choose
task-tracker.
* use adaptive executor even with shard_replication_factor > 1
It seems that we were already using adaptive executor when
replication_factor > 1. So this commit removes the check.
* remove multi_resowner.c and deprecate some settings
* remove TaskExecution related leftovers
* change deprecated API error message
* not recursively plan single relatition repartition subquery
* recursively plan single relation repartition subquery
* test depreceated task tracker functions
* fix overlapping shard intervals in range-distributed test
* fix error message for citus_metadata_container
* drop task-tracker deprecated functions
* put the implemantation back to worker_cleanup_job_schema_cachesince citus cloud uses it
* drop some functions, add downgrade script
Some deprecated functions are dropped.
Downgrade script is added.
Some gucs are deprecated.
A new guc for repartition joins bucket size is added.
* order by a test to fix flappiness
The reason we should use ActiveReadableNodeList instead of ActiveReadableNonCoordinatorNodeList is that if coordinator is added to cluster as a worker, it should be counted as well. Otherwise if there is only coordinator in the cluster, the count will be 0, hence we get a warning.
In MultiTaskTrackerExecute, we should connect to coordinator if it is
added to the cluster because it will also be assigned tasks.
We were using ALL_WORKERS TargetWorkerSet while sending temporary schema
creation and cleanup. We(well mostly I) thought that ALL_WORKERS would also include coordinator when it is added as a worker. It turns out that it was FILTERING OUT the coordinator even if it is added as a worker to the cluster.
So to have some context here, in repartitions, for each jobId we create
(at least we were supposed to) a schema in each worker node in the cluster. Then we partition each shard table into some intermediate files, which is called the PARTITION step. So after this partition step each node has some intermediate files having tuples in those nodes. Then we fetch the partition files to necessary worker nodes, which is called the FETCH step. Then from the files we create intermediate tables in the temporarily created schemas, which is called a MERGE step. Then after evaluating the result, we remove the temporary schemas(one for each job ID in each node) and files.
If node 1 has file1, and node 2 has file2 after PARTITION step, it is
enough to either move file1 from node1 to node2 or vice versa. So we
prune one of them.
In the MERGE step, if the schema for a given jobID doesn't exist, the
node tries to use the `public` schema if it is a superuser, which is
actually added for testing in the past.
So when we were not sending schema creation comands for each job ID to
the coordinator(because we were using ALL_WORKERS flag, and it doesn't
include the coordinator), we would basically not have any schemas for
repartitions in the coordinator. The PARTITION step would be executed on
the coordinator (because the tasks are generated in the planner part)
and it wouldn't give us any error because it doesn't have anything to do
with the temporary schemas(that we didn't create). But later two things
would happen:
- If by chance the fetch is pruned on the coordinator side, we the other
nodes would fetch the partitioned files from the coordinator and execute
the query as expected, because it has all the information.
- If the fetch tasks are not pruned in the coordinator, in the MERGE
step, the coordinator would either error out saying that the necessary
schema doesn't exist, or it would try to create the temporary tables
under public schema ( if it is a superuser). But then if we had the same
task ID with different jobID it would fail saying that the table already
exists, which is an error we were getting.
In the first case, the query would work okay, but it would still not do
the cleanup, hence we would leave the partitioned files from the
PARTITION step there. Hence ensure_no_intermediate_data_leak would fail.
To make things more explicit and prevent such bugs in the future,
ALL_WORKERS is named as ALL_NON_COORD_WORKERS. And a new flag to return
all the active nodes is added as ALL_DATA_NODES. For repartition case,
we don't use the only-reference table nodes but this version makes the
code simpler and there shouldn't be any significant performance issue
with that.
Rename TargetWorkerSet enums to make them more explicit about what they
mean. Ideally it would be good to treat everything as a node without the
'worker' concept because it makes things complicated. Another
improvement could be to rename TargetWorkerSet as TargetNodeSet but it
goes to renaming many occurrences of Worker, which is probably too big
for this PR.
#3866 removed the shard ID hash in metadata_cache.c to simplify cache management,
but we observed a significant performance regression that was being masked by the
performance improvement provided by #3654 in our benchmarks, but #3654 only
applies to specific workloads.
This PR brings back the shard ID cache as it existed before #3866 with some extra
measures to handle invalidation. When we load a table entry, we overwrite
ShardIdCacheEntry->tableEntry pointers for all the shards in that table, though
it's possible that the table no longer contains the old shard ID or the table
entry is never reloaded, which would leave a dangling pointer once the table
entry is freed. To handle that case, we remove all shard ID cache entries that
point exactly to that table entry when a table is freed (at the end of the
transaction or any call to CitusTableCacheFlushInvalidatedEntries).
Co-authored-by: SaitTalhaNisanci <s.talhanisanci@gmail.com>
Co-authored-by: Marco Slot <marco.slot@gmail.com>
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
This is so we don't need to calculate it twice in
insert_select_executor.c and multi_explain.c, which can
cause discrepancy if an update in one of them is not
reflected in the other site.
* Not set TaskExecution with adaptive executor
Adaptive executor is using a utility method from task tracker for
repartition joins, however adaptive executor doesn't need taskExecution.
It is only used by task tracker. This causes a problem when explain
analyze is used because what taskExecution is pointing to might be
random.
We solve this by not setting taskExecution from adaptive executor. So it
will stay NULL as set by CreateTask.
* use same memory context as task for taskExecution
Co-authored-by: Jelte Fennema <github-tech@jeltef.nl>
As suggested by @marcocitus in https://github.com/citusdata/citus/pull/3911#issuecomment-643978531, there was
a regression in #3893. If another backend would write a file during deletion of
the intermediate results directory, this file would not necessarily be deleted.
The approach used in `CitusRemoveDirectory` is to try recursive removal of the
directory again if it has failed. This does not work here, since when a file
can not be removed for other reasons (e.g. `EPERM`) it will not throw an error
anymore. So then we would get into an infinite removal loop. Instead I now
`rename` the directory before removing it. That way other backends will not
write files to it anymore.
We sort the workerList because adaptive connection management
(e.g., OPTIONAL_CONNECTION) requires any concurrent executions
to wait for the connections in the same order to prevent any
starvation. If we don't sort, we might end up with:
Execution 1: Get connection for worker 1, wait for worker 2
Execution 2: Get connection for worker 2, wait for worker 1
and, none could proceed. Instead, we enforce every execution establish
the required connections to workers in the same order.
In #3901 the "Data received from worker(s)" sections were added to EXPLAIN
ANALYZE. After merging @pykello posted some review comments. This addresses
those comments as well as fixing a other issues that I found while addressing
them. The things this does:
1. Fix `EXPLAIN ANALYZE EXECUTE p1` to not increase received data on every
execution
2. Fix `EXPLAIN ANALYZE EXECUTE p1(1)` to not return 0 bytes as received data
allways.
3. Move `EXPLAIN ANALYZE` specific logic to `multi_explain.c` from
`adaptive_executor.c`
4. Change naming of new explain sections to `Tuple data received from node(s)`.
Firstly because a task can reference the coordinator too, so "worker(s)" was
incorrect. Secondly to indicate that this is tuple data and not all network
traffic that was performed.
5. Rename `totalReceivedData` in our codebase to `totalReceivedTupleData` to
make it clearer that it's a tuple data counter, not all network traffic.
6. Actually add `binary_protocol` test to `multi_schedule` (woops)
7. Fix a randomly failing test in `local_shard_execution.sql`.
Shard id to index mapping stored in cache entry as there may now be multiple entries alive for a given relation
insert_select_executor: revert copying cache entry, which was a hack added to avoid memory safety issues
Sadly this does not actually work yet for binary protocol data, because
when doing EXPLAIN ANALYZE we send two commands at the same time. This
means we cannot use `SendRemoteCommandParams`, and thus cannot use the
binary protocol. This can still be useful though when using the text
protocol, to find out that a lot of data is being sent.
This can save a lot of data to be sent in some cases, thus improving
performance for which inter query bandwidth is the bottleneck.
There's some issues with enabling this as default, so that's currently not done.
We still recursively plan some cases, eg:
- INSERTs
- SELECT FOR UPDATE when reference tables in query
- Everything must be same single shard & replication model
We wrap worker tasks in worker_save_query_explain_analyze() so we can fetch
their explain output later by a call worker_last_saved_explain_analyze().
Fixes#3519Fixes#2347Fixes#2613Fixes#621
This is a different version of #3634. It also removes SwallowErrors, but
instead of modifying our own functions to not throw errors, it uses the
postgres built in `PathNameDeleteTemporaryDir` function. This function
does not throw errors.
Since this change is for a bugfix, I tried to minimize the changes.
PRs with the following changes would be good to do separately from this
PR:
1. Use PathName(Create|Open|Delete)Temporary(File|Dir) to open and
remove all files/dirs instead of our own custom file functions.
2. Prefix our outmost files/directories with `PG_TEMP_FILE_PREFIX` so
that they are identified by Postgres as temporary files, which will be
removed at postmaster start. This way we do not have to do this cleanup
ourselves.
3. Store the files in the temporary table space if it exists.
Fixes#3634Fixes#3618
If we want to get necessary lockmode for a relation RangeVar within
a query, we can get the lockmode easily from the RangeVar itself (if
pg version >= 12).
However, if we want to decide the lockmode appropriate for the
"query", we can derive this information by using GetQueryLockMode
according to the code comment from RangeTblEntry->rellockmode.