DESCRIPTION: Implement TEXT SEARCH CONFIGURATION propagation
The change adds support to Citus for propagating TEXT SEARCH CONFIGURATION objects. TSConfig objects cannot always be created in one create statement, and instead require a create statement followed by many alter statements to get turned into the object they should represent.
To support this we add functionality to the worker to create or replace objects based on a list of statements. When the lists of the local object and the remote object correspond 1:1 we skip the creation of the object and simply mark it distributed. This is especially important for TSConfig objects as initdb pre-populates databases with a dozen configurations (for many different languages).
When the user creates a new TSConfig based on the copy of an existing configuration there is no direct link to the object copied from. Since there is no link we can't simply rely on propagating the dependencies to the worker and send a qualified
With this commit, rebalancer backends are identified by application_name = citus_rebalancer
and the regular internal backends are identified by application_name = citus_internal
We fall back to local execution if we cannot establish any more
connections to local node. However, we should not do that for the
commands that we don't know how to execute locally (or we know we
shouldn't execute locally). To fix that, we take localExecutionSupported
take into account in CanFailoverPlacementExecutionToLocalExecution too.
Moreover, we also prompt a more accurate hint message to inform user
about whether the execution is failed because local execution is
disabled by them, or because local execution wasn't possible for given
command.
BEGIN/COMMIT transaction block or in a UDF calling another UDF.
(2) Prohibit/Limit the delegated function not to do a 2PC (or any work on a
remote connection).
(3) Have a safety net to ensure the (2) i.e. we should block the connections
from the delegated procedure or make sure that no 2PC happens on the node.
(4) Such delegated functions are restricted to use only the distributed argument
value.
Note: To limit the scope of the project we are considering only Functions(not
procedures) for the initial work.
DESCRIPTION: Introduce a new flag "force_delegation" in create_distributed_function(),
which will allow a function to be delegated in an explicit transaction block.
Fixes#3265
Once the function is delegated to the worker, on that node during the planning
distributed_planner()
TryToDelegateFunctionCall()
CheckDelegatedFunctionExecution()
EnableInForceDelegatedFuncExecution()
Save the distribution argument (Constant)
ExecutorStart()
CitusBeginScan()
IsShardKeyValueAllowed()
Ensure to not use non-distribution argument.
ExecutorRun()
AdaptiveExecutor()
StartDistributedExecution()
EnsureNoRemoteExecutionFromWorkers()
Ensure all the shards are local to the node in the remoteTaskList.
NonPushableInsertSelectExecScan()
InitializeCopyShardState()
EnsureNoRemoteExecutionFromWorkers()
Ensure all the shards are local to the node in the placementList.
This also fixes a minor issue: Properly handle expressions+parameters in distribution arguments
Simply applies
```SQL
SELECT textlike(command, citus.grep_remote_commands)
```
And, if returns true, the command is logged. Else, the log is ignored.
When citus.grep_remote_commands is empty string, all commands are
logged.
PostgreSQL does not need calling this function since 7.4 release, and it
is a NOOP.
For more details, check PostgreSQL commit below :
commit dd04e958c8b03c0f0512497651678c7816af3198
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun Mar 9 03:34:10 2003 +0000
tuplestore_donestoring() isn't needed anymore, but provide a no-op
macro definition so as not to create compatibility problems.
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index b46babacd1..76fe9fb428 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -17,7 +17,7 @@
* Portions Copyright (c) 1996-2002, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * $Id: tuplestore.h,v 1.8 2003/03/09 02:19:13 tgl Exp $
+ * $Id: tuplestore.h,v 1.9 2003/03/09 03:34:10 tgl Exp $
*
*-------------------------------------------------------------------------
*/
@@ -41,6 +41,9 @@ extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
extern void tuplestore_puttuple(Tuplestorestate *state, void *tuple);
+/* tuplestore_donestoring() used to be required, but is no longer used */
+#define tuplestore_donestoring(state) ((void) 0)
+
/* backwards scan is only allowed if randomAccess was specified 'true' */
extern void *tuplestore_gettuple(Tuplestorestate *state, bool forward,
bool *should_free);
- [x] Add some more regression test coverage
- [x] Make sure returning works fine in case of
local execution + remote execution
(task->partiallyLocalOrRemote works as expected, already added tests)
- [x] Implement locking properly (and add isolation tests)
- [x] We do #shardcount round-trips on `SerializeNonCommutativeWrites`.
We made it a single round-trip.
- [x] Acquire locks for subselects on the workers & add isolation tests
- [x] Add a GUC to prevent modification from the workers, hence increase the
coordinator-only throughput
- The performance slightly drops (~%15), unless
`citus.allow_modifications_from_workers_to_replicated_tables`
is set to false
This change creates a slightly higher abstraction of the `PartitionedResultDestReceiver` where it decouples the partitioning from writing it to a file. This allows for easier reuse for other `DestReceiver`'s that would like to route different tuples to different `DestReceiver`'s.
Originally there was a lot of state kept in `PartitionedResultDestReceiver` to be able to lazily create `FileDestReceivers` when the first tuple arrived for that target. This convoluted the implementation of the processing of tuples with where they should go.
This refactor changes that where it makes the `PartitionedResultDestReceiver` completely agnostic of what kind of Receivers it is writing to. When constructed you pass it a list of `DestReceiver` compatible pointers with the length of `partitionCount`. Internally the `PartitionedResultDestReceiver` keeps track of which `DestReceiver`'s have been started or not, and start them when they first receive a tuple.
Alternatively, if the instantiating code of the `PartitionedResultDestReceiver` wants, the startup can be turned from lazily to eagerly. When the startup is eager (not lazy) all `rStartup` functions on the list of `DestReceiver`'s are called during the startup of the `PartitionedResultDestReceiver` and marked as such.
A downside of this approach is the following. On highly partitioned destinations we now need to allocate a `FileDestReceiver` for every target, _always_. When the data passed into the `PartitionedResultDestReceiver` is highly skewed to a small set of `FileDestReceiver`'s this will waste some memory. Given the small size of a `FileDestReceiver`, and the fact that actual file handles are only created during the processing of the startup of the `FileDestReceiver` I think this memory waste is not a problem. If this would become a problem we could refactor the source list into some kind of generator object which can generate the `DestReceiver`'s on the fly.
It seems like the decision for 2PC is more complicated than
it should be.
With this change, we do one behavioral change. In essense,
before this commit, when a SELECT task with replication factor > 1
is executed, the executor was triggering 2PC. And, in fact,
the transaction manager (`ConnectionModifiedPlacement()`) was
able to understand not to trigger 2PC when no modification happens.
However, for transaction blocks like:
BEGIN;
-- a command that triggers 2PC
-- A SELECT command on replication > 1
..
COMMIT;
The SELECT was used to be qualified as required 2PC. And, as a side-effect
the executor was setting `xactProperties.errorOnAnyFailure = true;`
So, the commands was failing at the time of execution. Now, they fail at
the end of the transaction.
In the past, we allowed users to manually switch to 1PC
(e.g., one phase commit). However, with this commit, we
don't. All multi-shard modifications are done via 2PC.
With Citus 9.0, we introduced `citus.single_shard_commit_protocol` which
defaults to 2PC.
With this commit, we prevent any user to set it to 1PC and drop support
for `citus.single_shard_commit_protocol`.
Although this might add some overhead for users, it is already the default
behaviour (so less likely) and marking placements as INVALID is much
worse.
Since PG14 we can now use binary encoding for arrays and composite types
that contain user defined types. This was fixed in this commit in
Postgres: 670c0a1d47
This change starts using that knowledge, by not necessarily falling back
to text encoding anymore for those types.
While doing this and testing a bit more I found various cases where
binary encoding would fail that our checks didn't cover. This fixes
those cases and adds tests for those. It also fixes EXPLAIN ANALYZE
never using binary encoding, which was a leftover of workaround that
was not necessary anymore.
Finally, it changes the default for both `citus.enable_binary_protocol`
and `citus.binary_worker_copy_format` to `true` for PG14 and up. In our
cloud offering `binary_worker_copy_format` already was true by default.
`enable_binary_protocol` had some bug with MX and user defined types,
this bug was fixed by the above mentioned fixes.
CopyState struct is divided into parts and one of them is CopyFromState
This macro uses the appropriate one for PG versions
Relevant PG commit:
c532d15dddff14b01fe9ef1d465013cb8ef186df
In short, add wrappers around Postgres' AddWaitEventToSet() and
ModifyWaitEvent().
AddWaitEventToSet()/ModifyWaitEvent*() may throw hard errors. For
example, when the underlying socket for a connection is closed by
the remote server and already reflected by the OS, however
Citus hasn't had a chance to get this information. In that case,
if replication factor is >1, Citus can failover to other nodes
for executing the query. Even if replication factor = 1, Citus
can give much nicer errors.
So CitusAddWaitEventSetToSet()/CitusModifyWaitEvent() simply puts
AddWaitEventToSet()/ModifyWaitEvent() into a PG_TRY/PG_CATCH block
in order to catch any hard errors, and returns this information to
the caller.
Ignore orphaned shards in more places
Only use active shard placements in RouterInsertTaskList
Use IncludingOrphanedPlacements in some more places
Fix comment
Add tests
With local query caching, we try to avoid deparse/parse stages as the
operation is too costly.
However, we can do deparse/parse operations once per cached queries, right
before we put the plan into the cache. With that, we avoid edge
cases like (4239) or (5038).
In a sense, we are making the local plan caching behave similar for non-cached
local/remote queries, by forcing to deparse the query once.
Previously this was usually done after argument parsing. This can cause
SEGFAULTs if the number or type of arguments changes in a new version.
By checking that Citus version is correct before doing any argument
parsing we protect against these types of issues. Issues like this have
occurred in pg_auto_failover, so it's not just a theoretical issue.
The main reason why these calls were not at the top of functions is
really just historical. It was because in the past we didn't allow
statements before declarations. Thus having this check before the
argument parsing would have only been possible if we first declared all
variables.
In addition to moving existing CheckCitusVersion calls it also adds
these calls to rebalancer related functions (they were missing there).
Comment from the code:
/*
* Iterate until all the tasks are finished. Once all the tasks
* are finished, ensure that that all the connection initializations
* are also finished. Otherwise, those connections are terminated
* abruptly before they are established (or failed). Instead, we let
* the ConnectionStateMachine() to properly handle them.
*
* Note that we could have the connections that are not established
* as a side effect of slow-start algorithm. At the time the algorithm
* decides to establish new connections, the execution might have tasks
* to finish. But, the execution might finish before the new connections
* are established.
*/
Note that the abruptly terminated connections lead to the following errors:
2020-11-16 21:09:09.800 CET [16633] LOG: could not accept SSL connection: Connection reset by peer
2020-11-16 21:09:09.872 CET [16657] LOG: could not accept SSL connection: Undefined error: 0
2020-11-16 21:09:09.894 CET [16667] LOG: could not accept SSL connection: Connection reset by peer
To easily reproduce the issue:
- Create a single node Citus
- Add the coordinator to the metadata
- Create a distributed table with shards on the coordinator
- f.sql: select count(*) from test;
- pgbench -f /tmp/f.sql postgres -T 12 -c 40 -P 1 or pgbench -f /tmp/f.sql postgres -T 12 -c 40 -P 1 -C
With this commit, the executor becomes smarter about refrain to open
new connections. The very basic example is that, if the connection
establishments take 1000ms and task executions as 5 msecs, the executor
becomes smart enough to not establish new connections.
With https://github.com/citusdata/citus/pull/4806 we enabled
2PC for any non-read-only local task. However, if the execution
is a single task, enabling 2PC (CoordinatedTransactionShouldUse2PC)
hits an assertion as we are not in a coordinated transaction.
There is no downside of using a coordinated transaction for single
task local queries.
Because setting the flag doesn't necessarily mean that we'll
use 2PC. If connections are read-only, we will not use 2PC.
In other words, we'll use 2PC only for connections that modified
any placements.
Before this commit, Citus used 2PC no matter what kind of
local query execution happens.
For example, if the coordinator has shards (and the workers as well),
even a simple SELECT query could start 2PC:
```SQL
WITH cte_1 AS (SELECT * FROM test LIMIT 10) SELECT count(*) FROM cte_1;
```
In this query, the local execution of the shards (and also intermediate
result reads) triggers the 2PC.
To prevent that, Citus now distinguishes local reads and local writes.
And, Citus switches to 2PC only if a modification happens. This may
still lead to unnecessary 2PCs when there is a local modification
and remote SELECTs only. Though, we handle that separately
via #4587.
When we use PROCESS_UTILITY_TOPLEVEL it causes some problems when
combined with other extensions such as pg_audit. With this commit we use
PROCESS_UTILITY_QUERY in the codebase to fix those problems.
When COPY is used for copying into co-located files, it was
not allowed to use local execution. The primary reason was
Citus treating co-located intermediate results as co-located
shards, and COPY into the distributed table was done via
"format result". And, local execution of such COPY commands
was not implemented.
With this change, we implement support for local execution with
"format result". To do that, we use the buffer for every file
on shardState->copyOutState, similar to how local copy on
shards are implemented. In fact, the logic is similar to
local copy on shards, but instead of writing to the shards,
Citus writes the results to a file.
The logic relies on LOCAL_COPY_FLUSH_THRESHOLD, and flushes
only when the size exceeds the threshold. But, unlike local
copy on shards, in this case we write the headers and footers
just once.