If the expression is simple, such as, SELECT function() or PEFORM function()
in PL/PgSQL code, PL engine does a simple expression evaluation which can't
interpret the Citus CustomScan Node. Code checks for simple expressions when
executing an UDF but missed the DO-Block scenario, this commit fixes it.
Removed dependency for EnsureTableOwner. Also removed pg_fini() and columnar_tableam_finish() Still need to remove CheckCitusVersion dependency to make Columnar_tableam.h dependency free from Citus.
Previously, we were wrapping targetlist nodes with Vars that reference
to the result of the worker query, if the node itself is not `Const` or
not a `Param`. Indeed, we should not do that unless the node itself is
a `Var` node or contains a `Var` within it (e.g.: `OpExpr(Var(column_a) > 2)`).
Otherwise, when worker query returns empty result set, then combine
query exec would crash since the `Var` would be pointing to an empty
tuple slot, which is not desirable for the node-executor methods.
Replaces citus.enable_object_propagation with citus.enable_metadata_sync
Also, within Citus 11 release cycle, we added citus.enable_metadata_sync_by_default,
that is also replaced with citus.enable_metadata_sync.
In essence, when citus.enable_metadata_sync is set to true, all the objects
and the metadata is send to the remote node.
We strongly advice that the users never changes the value of
this GUC.
With this commit, rebalancer backends are identified by application_name = citus_rebalancer
and the regular internal backends are identified by application_name = citus_internal
With this commit we've started to propagate sequences and shell
tables within the object dependency resolution. So, ensuring any
dependencies for any object will consider shell tables and sequences
as well. Separate logics for both shell tables and sequences have
been removed.
Since both shell tables and sequences logic were implemented as a
part of the metadata handling before that logic, we were propagating
them while syncing table metadata. With this commit we've divided
metadata (which means anything except shards thereafter) syncing
logic into multiple parts and implemented it either as a part of
ActivateNode. You can check the functions called in ActivateNode
to check definition of different metadata.
Definitions of start_metadata_sync_to_node and citus_activate_node
have also been updated. citus_activate_node will basically create
an active node with all metadata and reference table shards.
start_metadata_sync_to_node will be same with citus_activate_node
except replicating reference tables. stop_metadata_sync_to_node
will remove all the metadata. All of those UDFs need to be called
by superuser.
When creating a new table, we bypass the buffer cache and write the
initial pages directly with smgrwrite(). However, you're supposed to
use smgrextend() when extending a relation, rather than smgrwrite().
There isn't much difference between them, but smgrextend() updates the
relation size cache, which seems important, although I haven't seen
any real bugs caused by that.
Also, write the block to disk only after WAL-logging it, so that we
can include the LSN of the WAL record in the version that we write
out. Currently, the page as written to disk has LSN 0. That doesn't
cause any user-visible issues either, at worst it could make us
WAL-log a full page image of the page earlier than necessary, but that
doesn't matter currently because we WAL-log full page images of all
changes anyway.
I bumped into that issue with LSN 0 in the page header when testing
Citus with Zenith (https://github.com/zenithdb/zenith/issues/1176).
Zenith contains a check that PANICs if you write a block to disk
without WAL-logging it, and it works by checking the LSN of the page
that's written out. In this case, we are WAL-logging the page even
though the LSN on the page is 0, so it was a false alarm, but I'd love
to get this changed in Citus to keep the check in Zenith simple.
A downside of WAL-logging the page first is that if you run out of
disk space, you have already created the WAL record. So if you then
crash and restart, WAL recovery will likely run out of disk space,
too, which is bad. In practice, we have the same problem in other
places, like rewriteheap.c. Also, if you are on the brink of running
out of disk space, you will probably run out at WAL replay anyway,
regardless of which order we write these few pages. But if we wanted
to fix that, we could first extend the relation with zeros, and then
WAL-log the pages. That's how heap extension works.
It would be even nicer to use the buffer cache for this, and skip the
smgrimmedsync() on the relation. However, that would require more
work, because we don't have the Relation struct for the relation here.
We could use ReadBufferWithoutRelcache(), but that doesn't work for
unlogged tables. Unlogged tables are currently not supported
(https://github.com/citusdata/citus/issues/4742), but that would
become a problem if we want to support them in the future.
CreateFakeRelcacheEntry() also doesn't work with unlogged tables. We
could do things differently for logged and unlogged tables, but that
complicates the code further.
Co-authored-by: jeff-davis <Jeffrey.Davis@microsoft.com>
Citus heavily relies on application_name, see
`IsCitusInitiatedRemoteBackend()`.
But if the user set the application name, such as export PGAPPNAME=test_name,
Citus uses that name while connecting to the remote node.
With this commit, we ensure that Citus always connects with
the "citus" user name to the remote nodes.
With https://github.com/citusdata/citus/pull/2780, we allow
COPY to use any number of connections that the executor used
in a tx block.
Meaning that, while COPYing data to the shards, create_distributed_table
could allow sequential mode.
We fall back to local execution if we cannot establish any more
connections to local node. However, we should not do that for the
commands that we don't know how to execute locally (or we know we
shouldn't execute locally). To fix that, we take localExecutionSupported
take into account in CanFailoverPlacementExecutionToLocalExecution too.
Moreover, we also prompt a more accurate hint message to inform user
about whether the execution is failed because local execution is
disabled by them, or because local execution wasn't possible for given
command.
multi_log_hook() hook is called by EmitErrorReport() when emitting the
ereport either to frontend or to the server logs. And some callers of
EmitErrorReport() (e.g.: errfinish()) seems to assume that string fields
of given ErrorData object needs to be freed. For this reason, we copy the
message into heap here.
I don't think we have faced with such a problem before but it seems worth
fixing as it is theoretically possible due to the reasoning above.
BEGIN/COMMIT transaction block or in a UDF calling another UDF.
(2) Prohibit/Limit the delegated function not to do a 2PC (or any work on a
remote connection).
(3) Have a safety net to ensure the (2) i.e. we should block the connections
from the delegated procedure or make sure that no 2PC happens on the node.
(4) Such delegated functions are restricted to use only the distributed argument
value.
Note: To limit the scope of the project we are considering only Functions(not
procedures) for the initial work.
DESCRIPTION: Introduce a new flag "force_delegation" in create_distributed_function(),
which will allow a function to be delegated in an explicit transaction block.
Fixes#3265
Once the function is delegated to the worker, on that node during the planning
distributed_planner()
TryToDelegateFunctionCall()
CheckDelegatedFunctionExecution()
EnableInForceDelegatedFuncExecution()
Save the distribution argument (Constant)
ExecutorStart()
CitusBeginScan()
IsShardKeyValueAllowed()
Ensure to not use non-distribution argument.
ExecutorRun()
AdaptiveExecutor()
StartDistributedExecution()
EnsureNoRemoteExecutionFromWorkers()
Ensure all the shards are local to the node in the remoteTaskList.
NonPushableInsertSelectExecScan()
InitializeCopyShardState()
EnsureNoRemoteExecutionFromWorkers()
Ensure all the shards are local to the node in the placementList.
This also fixes a minor issue: Properly handle expressions+parameters in distribution arguments
* Removed distributed dependency in columnar_metadata.c
* Changed columnar_debug.c so that it no longer needed distributed/tuplestore and made it return a record instead of a tuplestore
* removed distributed/commands.h dependency
* Made columnar_tableam.c dependency-free
* Fixed spacing for columnar_store_memory_stats function
* indentation fix
* fixed test failures
* Require superuser while activating a node
With this change, we require ActiveNode() (hence citus_add_node(),
citus_activate_node()) explicitly require for a superuser.
Before this commit, these functions were designed to work with
non-superuser roles with the relevent GRANTs given.
However, that is not a widely used way for calling the functions
above.
Due to possibility of non-super user calling the UDFs, they were
designed in a way that some commands were using some additional
short-lived superuser connections. That is:
(a) breaking transactional behavior (e.g., ROLLBACK
wouldn't fully rollback the whole transaction)
(b) Making it very complicated to reason about which
parts of the node activation goes over which connections,
and becoming vulnerable to deadlocks / visibility issues.
In addition to starting a new transaction, we also need to tell other
backends --including the ones spawned for connections opened to
localhost to build indexes on shards of this relation-- that concurrent
index builds can safely ignore us.
Normally, DefineIndex() only does that if index doesn't have any
predicates (i.e.: where clause) and no index expressions at all.
However, now that we already called standard process utility, index
build on the shell table is finished anyway.
The reason behind doing so is that we cannot guarantee not grabbing any
snapshots via adaptive executor, and the backends creating indexes on
local shards (if any) might block on waiting for current xact of the
current backend to finish, which would cause self deadlocks that are not
detectable.
With https://github.com/citusdata/citus/pull/5493 we introduced
metadata specific connections.
With this connection we guarantee that there is a single metadata connection.
But note that this connection can be used for any other operation.
In other words, this connection is not only reserved for metadata
operations.
However, as https://github.com/citusdata/citus-enterprise/issues/715 showed
us that the logic has a flaw. We allowed ineligible connections to be
picked as metadata connections: such as exclusively claimed connections
or not fully initialized connections.
With this commit, we make sure that we only consider eligable connections
for metadata operations.
We prefer the background daemon to only sync node metadata. That's
why we move placement metadata changes from disable node to
activate node. With that, we can make sure that disable node
only changes node metadata, whereas activate node syncs all
the metadata changes. In essence, we already expect all
nodes to be up when a node is activated. So, this does not change
the behavior much.
Dropping sequences means we need to recreate
and hence losing the sequence.
With this commit, we keep the existing sequences
such that resyncing wouldn't drop the sequence.
We do that by breaking the dependency of the sequence
from the table.
Split distributed/version_compat.h into dependency-free
pg_version_compat.h, and the original which still has
dependencies. The original doesn't have much purpose, but until other
files have better discipline about including the correct header files,
then it's still needed.
Also make distributed/listutils.h dependency-free. Should be moved
outside of 'distributed' subdirectory, but that will cause significant
code churn, so leave for another cleanup patch.
Now both files can be included in columnar without creating a
dependency on citus.
Previously, we cheated by using the RM_GENERIC_ID record type, but not
actually using the generic WAL API. This worked because we always took
a full page image, and saved the extra work of allocating and copying
to a temporary page.
But it introduced complexity, and perhaps fragility, so better to just
use the API properly. The performance penalty for a serial data load
seems to be less than 1%.
Before this commit, Citus was triggering metadata syncing
in the background when a function is distributed. However,
with Citus 11, we expect all clusters to have metadata synced
enabled. So, we do not expect any nodes not to have the metadata.
This change:
(a) pro: simplifies the code and opens up possibilities
to simplify futher by reducing the scope of
bg worker to only sync node metadata
(b) pro: explicitly asks users to sync the metadata such that
any unforseen impact can be easily detected
(c) con: For distributed functions without distribution
argument, we do not necessarily require the metadata
sycned. However, for completeness and simplicity, we
do so.
With Citus 11, the default behavior is to sync the metadata.
However, partitioned tables created pre-Citus 11 might have
index names that are not compatiable with metadata syncing.
See https://github.com/citusdata/citus/issues/4962 for the
details.
With this commit, we record the existence of partitioned tables
such that we can fix it later if any exists.
With this commit, fix_partition_shard_index_names()
works significantly faster.
For example,
32 shards, 365 partitions, 5 indexes drop from ~120 seconds to ~44 seconds
32 shards, 1095 partitions, 5 indexes drop from ~600 seconds to ~265 seconds
`queryStringList` can be really long, because it may contain #partitions * #indexes entries.
Before this change, we were actually going through the executor where each command
in the query string triggers 1 round trip per entry in queryStringList.
The aim of this commit is to avoid the round-trips by creating a single query string.
I first simply tried sending `q1;q2;..;qn` . However, the executor is designed to
handle `q1;q2;..;qn` type of query executions via the infrastructure mentioned
above (e.g., by tracking the query indexes in the list and doing 1 statement
per round trip).
One another option could have been to change the executor such that only track
the query index when `queryStringList` is provided not with queryString
including multiple `;`s . That is (a) more work (b) could cause weird edge
cases with failure handling (c) felt like coding a special case in to the executor