When COPY is used for copying into co-located files, it was
not allowed to use local execution. The primary reason was
Citus treating co-located intermediate results as co-located
shards, and COPY into the distributed table was done via
"format result". And, local execution of such COPY commands
was not implemented.
With this change, we implement support for local execution with
"format result". To do that, we use the buffer for every file
on shardState->copyOutState, similar to how local copy on
shards are implemented. In fact, the logic is similar to
local copy on shards, but instead of writing to the shards,
Citus writes the results to a file.
The logic relies on LOCAL_COPY_FLUSH_THRESHOLD, and flushes
only when the size exceeds the threshold. But, unlike local
copy on shards, in this case we write the headers and footers
just once.
As suggested by @marcocitus in https://github.com/citusdata/citus/pull/3911#issuecomment-643978531, there was
a regression in #3893. If another backend would write a file during deletion of
the intermediate results directory, this file would not necessarily be deleted.
The approach used in `CitusRemoveDirectory` is to try recursive removal of the
directory again if it has failed. This does not work here, since when a file
can not be removed for other reasons (e.g. `EPERM`) it will not throw an error
anymore. So then we would get into an infinite removal loop. Instead I now
`rename` the directory before removing it. That way other backends will not
write files to it anymore.
This is a different version of #3634. It also removes SwallowErrors, but
instead of modifying our own functions to not throw errors, it uses the
postgres built in `PathNameDeleteTemporaryDir` function. This function
does not throw errors.
Since this change is for a bugfix, I tried to minimize the changes.
PRs with the following changes would be good to do separately from this
PR:
1. Use PathName(Create|Open|Delete)Temporary(File|Dir) to open and
remove all files/dirs instead of our own custom file functions.
2. Prefix our outmost files/directories with `PG_TEMP_FILE_PREFIX` so
that they are identified by Postgres as temporary files, which will be
removed at postmaster start. This way we do not have to do this cleanup
ourselves.
3. Store the files in the temporary table space if it exists.
Fixes#3634Fixes#3618
When the file does not exist, it could mean two different things.
First -- and a lot more common -- case is that a failure happened
in a concurrent backend on the same distributed transaction. And,
one of the backends in that transaction has already been roll
backed, which has already removed the file. If we throw an error
here, the user might see this error instead of the actual error
message. Instead, we prefer to WARN the user and pretend that the
file has no data in it. In the end, the user would see the actual
error message for the failure.
Second, in case of any bugs in intermediate result broadcasts,
we could try to read a non-existing file. That is most likely
to happen during development. Thus, when asserts enabled, we throw
an error instead of WARNING so that the developers cannot miss.
When we have a query like the following:
```SQL
WITH a AS (SELECT * FROM foo LIMIT 10) SELECT max(x) FROM a JOIN bar 2 USING (y);
```
Citus currently opens side channels for doing the
`COPY "1_1"` FROM STDIN (format 'result')
before starting the execution of
`SELECT * FROM foo LIMIT 10`
Since we need at least 1 connection per worker to do
`SELECT * FROM foo LIMIT 10`
We need to have 2 connections to worker in order to broadcast the results.
However, we don't actually send a single row over the side channel until the
execution of `SELECT * FROM foo LIMIT 10` is completely done (and connections
unclaimed) and the results are written to a tuple store. We could actually
reuse the same connection for doing the `COPY "1_1"` FROM STDIN (format 'result').
This also fixes the issue that Citus doesn't obey `citus.max_adaptive_executor_pool_size`
when the query includes an intermediate result.
We don't need any side channel connections. That is actually
problematic in the sense that it creates extra connections.
Say, citus.max_adaptive_executor_pool_size equals to 1, Citus
ends up using one extra connection for the intermediate results.
Thus, not obeying citus.max_adaptive_executor_pool_size.
In this PR, we remove the following entities from the codebase
to allow further commits to implement not requiring extra connection
for the intermediate results:
- The connection flag REQUIRE_SIDECHANNEL
- The function GivePurposeToConnection
- The ConnectionPurpose struct and related fields
Sometimes during errors workers will create files while we're deleting intermediate directories
example:
DEBUG: could not remove file "base/pgsql_job_cache/10_0_431": Directory not empty
DETAIL: WARNING from localhost:57637
Areas for further optimization:
- Don't save subquery results to a local file on the coordinator when the subquery is not in the having clause
- Push the the HAVING with subquery to the workers if there's a group by on the distribution column
- Don't push down the results to the workers when we don't push down the HAVING clause, only the coordinator needs it
Fixes#520Fixes#756Closes#2047
* Add tuplestore helpers
* More detailed error messages in tuplestore
* Add CreateTupleDescCopy to SetupTuplestore
* Use new SetupTuplestore helper function
* Remove unnecessary copy
* Remove comment about undefined behaviour
The file handling the utility functions (DDL) for citus organically grew over time and became unreasonably large. This refactor takes that file and refactored the functionality into separate files per command. Initially modeled after the directory and file layout that can be found in postgres.
Although the size of the change is quite big there are barely any code changes. Only one two functions have been added for readability purposes:
- PostProcessIndexStmt which is extracted from PostProcessUtility
- PostProcessAlterTableStmt which is extracted from multi_ProcessUtility
A README.md has been added to `src/backend/distributed/commands` describing the contents of the module and every file in the module.
We need more documentation around the overloading of the COPY command, for now the boilerplate has been added for people with better knowledge to fill out.
Make sure that intermediate results use a connection that is
not associated with any placement. That is useful in two ways:
- More complex queries can be executed with CTEs
- Safely use the same connections when there is a foreign key
to reference table from a distributed table, which needs to
use the same connection for modifications since the reference
table might cascade to the distributed table.