The previous default was 5 seconds, and we change it to 30 seconds.
The main motivation for this is that for busy clusters, 5 seconds
can be too aggressive. Especially with connection throttling, the servers
might be kept busy for a really long time, and users may see the
connection errors more frequently.
We've done some sanity checks, for really quick queries (like
`SELECT count(*) from table`), 30 seconds is a decent value even
if users execute 300 distributed queries on the coordinator. We've
verified this on Hyperscale(Citus).
Physical planner doesn't support parameters. If the parameters have already
been resolved when the physical planner handling the queries, mark it.
The reason is that the executor is unaware of this, and sends the parameters
along with the worker queries, which fails for composite types.
(See `DissuadePlannerFromUsingPlan()` for the details of paramater resolving)
As reported in #3787, we were having issues while building citus with "GCC Red Hat 10" (maybe in some other versions of gcc as well).
Fixes "multiple definition of 'CitusNodeTagNames'" error by explicitly specifying storage of CitusNodeTagNames to be extern.
We currently put the actual error message to the detail part. However,
many drivers don't show detail part.
As connection errors are somehow common, and hard to trace back, can't
we added the detail to the message itself.
In addition to that, we changed "connection error" message, as it
was confusing to the users who think that the error was happening
while connecting to the coordinator. In fact, this error is showing
up when the coordinator fails to connect remote nodes.
* invalidate plan cache in master_update_node
If a plan is cached by postgres but a user uses master_update_node, then
when the plan cache is used for the updated node, they will get the old
nodename/nodepost in the plan. This is because the plan cache doesn't
know about the master_update_node. This could be a problem in prepared
statements or anything that goes into plancache. As a solution the plan
cache is invalidated inside master_update_node.
* add invalidate_inactive_shared_connections test function
We introduce invalidate_inactive_shared_connections udf to be used in
testing. It is possible that a connection count for an inactive node
will be greater than 0 and in that case it will not be removed at the
time of invalidation. However, later we don't have a mechanism to remove
it, which means that it will stay in the hash. For this not to cause a
problem, we use this udf in testing.
* move invalidate_inactive_shared_connections to udfs from test as it will be used in mx
* remove the test udf
* remove the IsInactive check
This copies over fixes from reference counting branch,
all CitusTableCacheEntry data may be freed when a GetCitusTableCacheEntry call occurs for its relationId
This fix is not complete, but reference counting is being deferred until 9.4
CopyShardInterval: remove dest parameter, always return newly allocated object
We initially considered removing entries just before any change to
pg_dist_node. However, that ended-up being very complex and making
MX even more complex.
Instead, we're switching to a simpler solution, where we remove entries
when the counter gets to 0.
With certain workloads, this may have some performance penalty. But, two
notes on that:
- When counter == 0, it implies that the cluster is not busy
- With cached connections, that's not possible
When we call SetTaskQueryString we would set the task type to
TASK_QUERY_TEXT, and some parts of the codebase rely on the fact that if
TASK_QUERY_TEXT is set, the data can be read safely. However if
SetTaskQueryString is called with a NULL taskQueryString this can cause
crashes. In that case taskQueryType will simply be set to
TASK_QUERY_NULL.
DESCRIPTION: Alter role only works for citus managed roles
Alter role was implemented before we implemented good role management that hooks into the object propagation framework. This is a refactor of all alter role commands that have been implemented to
- be on by default
- only work for supported roles
- make the citus extension owner a supported role
Instead of distributing the alter role commands for roles at the beginning of the node activation role it now _only_ executes the alter role commands for all users in all databases and in the current database.
In preparation of full role support small refactors have been done in the deparser.
Earlier tests targeting other roles than the citus extension owner have been either slightly changed or removed to be put back where we have full role support.
Fixes#2549
We had 9+ parameters in some of the functions related to execution.
Execution params is created to simplify this a bit so that we can set
only the fields that we are interested in and it is easier to read.
With this commit, we're introducing a new infrastructure to throttle
connections to the worker nodes. This infrastructure is useful for
multi-shard queries, router queries are have not been affected by this.
The goal is to prevent establishing more than citus.max_shared_pool_size
number of connections per worker node in total, across sessions.
To do that, we've introduced a new connection flag OPTIONAL_CONNECTION.
The idea is that some connections are optional such as the second
(and further connections) for the adaptive executor. A single connection
is enough to finish the distributed execution, the others are useful to
execute the query faster. Thus, they can be consider as optional connections.
When an optional connection is not allowed to the adaptive executor, it
simply skips it and continues the execution with the already established
connections. However, it'll keep retrying to establish optional
connections, in case some slots are open again.
We currently don't use any cursor flags in local execution, but we can
use CURSOR_OPT_PARALLEL_OK flag to potentially benefit from parallelism
when possible.
This PR:
- Declares variables when they are needed.
- Creates DoCopyFromLocalTableIntoShards for better readability.
- Doesn't use a hardcoded value, instead use a variable for better
readability.
We have two variables that are related to local execution status.
TransactionAccessedLocalPlacement and
TransactionConnectedToLocalGroup. Only one of these fields should be
set, however we didn't have any check for this contraint and it was
error prone.
What those two variables are used is that we are trying to understand if
we should use local execution, the current session, or if we should be
using a connection to execute the current query, therefore the tasks. In
the enum, now it is more clear what these variables mean.
Also, now we have a method to change the local execution status. The
method will error if we are trying to transition from a state to a wrong
state. This will help us avoid problems.
* use local executon when in a transaction block
When we are inside a transaction block, there could be other methods
that need local execution, therefore we will use local execution in a
transaction block.
* update test outputs with transaction block local execution
* add a test to verify we dont leak intermediate schemas
* test that we don't leak intermediate schemas
We have tests to make sure that we don't intermediate any intermediate
files, tables etc but we don't test if we are leaking schemas. It makes
sense to test this as well.
* remove all repartition schemas in case of error
This solution is not an ideal one but it seems to be doing the job.
We should have a more generic solution for the cleanup but it seems that
putting the cleanup in the abort handler is dangerous and it was
crashing.
It is possible to return an error in ExecuteTaskListExtended after
performing local execution with the current structure. However there is
no point in execution the local tasks if we are going to return an error
later. So the local execution is moved after the error check.
When the file does not exist, it could mean two different things.
First -- and a lot more common -- case is that a failure happened
in a concurrent backend on the same distributed transaction. And,
one of the backends in that transaction has already been roll
backed, which has already removed the file. If we throw an error
here, the user might see this error instead of the actual error
message. Instead, we prefer to WARN the user and pretend that the
file has no data in it. In the end, the user would see the actual
error message for the failure.
Second, in case of any bugs in intermediate result broadcasts,
we could try to read a non-existing file. That is most likely
to happen during development. Thus, when asserts enabled, we throw
an error instead of WARNING so that the developers cannot miss.
When we have a query like the following:
```SQL
WITH a AS (SELECT * FROM foo LIMIT 10) SELECT max(x) FROM a JOIN bar 2 USING (y);
```
Citus currently opens side channels for doing the
`COPY "1_1"` FROM STDIN (format 'result')
before starting the execution of
`SELECT * FROM foo LIMIT 10`
Since we need at least 1 connection per worker to do
`SELECT * FROM foo LIMIT 10`
We need to have 2 connections to worker in order to broadcast the results.
However, we don't actually send a single row over the side channel until the
execution of `SELECT * FROM foo LIMIT 10` is completely done (and connections
unclaimed) and the results are written to a tuple store. We could actually
reuse the same connection for doing the `COPY "1_1"` FROM STDIN (format 'result').
This also fixes the issue that Citus doesn't obey `citus.max_adaptive_executor_pool_size`
when the query includes an intermediate result.
We don't need any side channel connections. That is actually
problematic in the sense that it creates extra connections.
Say, citus.max_adaptive_executor_pool_size equals to 1, Citus
ends up using one extra connection for the intermediate results.
Thus, not obeying citus.max_adaptive_executor_pool_size.
In this PR, we remove the following entities from the codebase
to allow further commits to implement not requiring extra connection
for the intermediate results:
- The connection flag REQUIRE_SIDECHANNEL
- The function GivePurposeToConnection
- The ConnectionPurpose struct and related fields
* explicitly return false if transaction connected to local node
* not set TransactionConnectedToLocalGroup if we are writing to a file
We use TransactionConnectedToLocalGroup to prevent local execution from
happening as that might cause visibility problems. As files are visible
to all transactions, we shouldn't set this variable if we are writing to
a file.
In case we don't care about the tupleStoreState in
ExecuteLocalTaskListExtended, it could be passed as null. In that case
we will get a seg error. This changes it so that a dummy tuple store
will be created when it is null.
Do not use local execution in ExecuteTaskListOutsideTransaction.
As we are going to run the tasks outside transaction, we shouldn't use local execution.
However, there is some problem when using local execution related to
repartition joins, when we solve that problem, we can execute the tasks
coming to this path with local execution.
Also logging the local command is simplified.
normalize job id in worker_hash_partition_table in test outputs.
For shardplacements, we were setting nodeid, nodename, nodeport and
nodegroup manually. This makes it very error prone, and it seems that we
already forgot to set some of them. This would mean that they would have
their default values, e.g group id would be 0 when its group id is not
0.
So the implication is that we would have inconsistent worker metadata.
A new method is introduced, and we call the method to set those fields
now, so that as long as we call this method, we won't be setting
inconsistent metadata.
It probably makes sense to have a struct for these fields. We already
have NodeMetadata but it doesn't have nodename or nodeport. So that
could be done over another refactor to make things simpler.
This is possible whenever we aren't pulling up intermediate rows
We want to do this because this was done in 9.2,
some queries rely on the performance of grouping causing distinct values
This change was introduced when implementing window functions on coordinator
The purpose of null_parameters is to make sure that citus doesn't crash
with null parameters. (The related issue is #3493.) The logs in this
file are not that important and they are flaky. The flakiness is related
to postgres part as well so it is hard to reproduce them. Therefore it
makes sense to decrease the log level.
look at sent commands to simplify complex logic in vacuum test
also normalize connection id as that can differ when we don't have to
choose a specific connection.
It seems that sometimes the pruning is deferred and sometimes not with
this statement. What we care in this test is to see that it doesn't
crash. I think we don't care about the log statement for this line. So
it makes sense to not log this statement, and care about the result.
ExecuteTaskListExtended is the common method for different codepaths,
and instead of writing separate local execution logics in different
codepaths, it makes more sense to have the logic here. We still need to
do some refactoring, this is an initial step.
After this commit, we can run create shard commands locally. There is a
special case with shard creation commands. A create shard command might
have a concatenated query string, however local execution did not know
how to execute a task with multiple query strings. This is also
implemented in this commit. We go over each query in the concatenated
query string and plan/execute them one by one.
A more clean solution to this would be to make sure that each task has a
single query. We currently cannot do that because we need to ensure the
task dependencies. However, it would make sense to do that at some point
and it would simplify the code a lot.
It seems that one of the deadlock detection tests fails way too often in
our CI. The difference is only ordering. Currently it seems that it is a
good idea to disable this test for the sake of development.
ExecuteLocalTaskList doesn't need scanState as it only uses
paramListInfo, distributedPlan and tupleStoreState. It is better to pass
only the variables that the function needs, so that we can call this
function from other places when we dont have scanState.
We had many fields in task related to query strings. It was kind of
complex, and only of them could be set at a time. Therefore it makes
more sense to abstract this and use a union so that it is clear that
only of them should be set.
We have three fields that could have query related strings:
- queryForLocation
- queryStringLazy
- perPlacementQueryStrings
Relatively, they can be set with:
- SetTaskQueryString
- SetTaskQueryIfShouldLazyDeparse
- SetTaskPerPlacementQueryStrings
The direct usage of the query related fields are also removed.
Rename queryForLocalExecution
Currently queryForLocalExecution is only used for deparsing purposes,
therefore it makes sense to rename it to what it is doing.
TaskQueryStringForPlacement simplifies how the executor gets the query
string for a given placement. Task will use the necessary fields to
return the correct query placement string. Executor doesn't need to know
the details for this.
rename TaskQueryString as TaskQueryStringAllPlacements
TaskQueryString returns the query string that will be the same for all
the placements. In INSERT..SELECT the query string can be different for
each placement. Adaptive executor uses TaskQueryStringForPlacement,
which returns the query string for a placement. It makes sense to rename
TaskQueryString as TaskQueryStringAllPlacements as it is returning the
query string for all placements.
rename SetTaskQuery as SetTaskQueryIfShouldLazyDeparse
SetTaskQuery does not always sets the task query. It can set the query
string as well. So it is more clear to name it
SetTaskQueryIfShouldLazyDeparse, since it will set the query not query
string only when we should deparse the query in a lazy way.
It is possible that a task will have different query string for each
placement. This is the case in INSERT..SELECT via repartitioning. When
we are setting task->perPlacementQueryString, we should set
queryStringLazy to NULL. Therefore a method for that purpose is created.
In PostgreSQL, user defaults for config parameters can be changed by
ALTER ROLE .. SET statements. We wish to propagate those defaults
accross the Citus cluster so that the behaviour will be similar in
different workers.
The defaults can either be set in a specific database, or the whole
cluster, similarly they can be set for a single role or all roles.
We propagate the ALTER ROLE .. SET if all the conditions below are met:
- The query affects the current database, or all databases
- The user is already created in worker nodes
Sometimes we have concatenated query strings for a task. However,
when we want to find each query string, it is not a trivial task.
Therefore, it makes sense to store this in task so that when we need
each query string we can easily get it.
Some refactoring:
Consolidate expression which decides whether GROUP BY/HAVING are pushed down
Rename early pullUpIntermediateRows to hasNonDistributableAggregates
Create WorkerColumnName to handle formatting WORKER_COLUMN_FORMAT
Ignore NULL StringInfo pointers to SafeToPushdownWindowFunction
Fix bug where SubqueryPushdownMultiNodeTree mutates supplied Query,
SafeToPushdownWindowFunction requires the original query as it relies on rtable
DESCRIPTION: Refactor dependency resolution and resolve from pg_shdepend
This PR refactors how dependencies are resolved by not assuming solely a `pg_depend` record describing the dependency. Instead we keep a definition of the dependency around which records how the dependency is resolved. This can be one of the following ways
- `pg_depend`, data will contain a copy of the `pg_depend` record
- `pg_shdepend`, data will contain a copy of the `pg_shdepend` record
- `ObjectAddress`, data will contain only an `ObjectAddress` describing a dependency
Irregardless of way the dependency was found it will always be able to get to the address of the dependency as that is the most important property.
For some checks we can inspect the source where the dependency was found and perform a deep inspection to decide if we want to follow the dependency. This is important to not distribute dependencies coming from extensions for example.
We cache connections between nodes in our connection management code.
This is good for speed. For security this can be a problem though. If
the user changes settings related to TLS encryption they want those to
be applied to future queries. This is especially important when they did
not have TLS enabled before and now they want to enable it. This can
normally be achieved by changing citus.node_conninfo. However, because
connections are not reopened there will still be old connections that
might not be encrypted at all.
This commit changes that by marking all connections to be shutdown at
the end of their current transaction. This way running transactions will
succeed, even if placement requires connections to be reused for this
transaction. But after this transaction completes any future statements
will use a connection created with the new connection options.
If a connection is requested and a connection is found that is marked
for shutdown, then we don't return this connection. Instead a new one is
created. This is needed to make sure that if there are no running
transactions, then the next statement will not use an old cached
connection, since connections are only actually shutdown at the end of a
transaction.
It seems that when logging is enabled we should not run local shard copy
in parallel with other tests. The reason is that it adds coordinator for
reference tables and if the parallel test creates a schema before this
test is run, the schema will be logged. So it is not deterministic.
If two tables have the same distribution column type, we implicitly
colocate them. This is useful since colocation has a big performance
impact in most applications.
When a table is rebalanced, all of the colocated tables are also
rebalanced. If table A and table B are colocated and we want to
rebalance table A, table B will also be rebalanced. We need replica
identity so that logical replication can replicate updates and deletes
during rebalancing. If table B does not have a replica identity we
error out.
A solution to this is to introduce a UDF so that colocation can be
updated. The remaining tables in the colocation group will stay
colocated. For example if table A, B and C are colocated and after
updating table B's colocations, table A and table C stay colocated.
The "updating colocation" step does not move any data around, it only
updated pg_dist_partition and pg_dist_colocation tables. Specifically it
creates a new colocation group for the table and updates the entry in
pg_dist_partition while invalidating any cache.
Citus coordinator (or MX nodes) caches `citus.max_cached_conns_per_worker` connections
per node. This means that, those connections are not terminated after each statement.
Instead, cached to avoid the cost of re-establishment. This is crucial for OLTP performance.
The problem with that approach is that, we never properly handle the termnation of
those cached connections. For instance, when a session on the coordinator disconnects,
you'd see the following logs on the workers:
```
2020-03-20 09:13:39.454 CET [64028] LOG: could not receive data from client: Connection reset by peer
```
With this patch, we're terminating the cached connections properly at the end of the connection.
We're getting a lot of random failures on CI regarding connection errors. This
works around that by not running that create lots of connections in parallel.
This is needed to automatically generate .bc (bitcode) files when
postgres is compiled with llvmjit support.
It also has the advantage that cmake is not required for the build
anymore.
As discussed with @JelteF; #3559 caused consistent errors on BSD (OSX). Given a group of people use this environment to develop on it is an undesirable change.
This reverts commit ca8f7119fe.
We have special logic to copy into intermediate results and we use a
custom format for that, "result" copy format. Postgres internally does
not know this format and if we use this locally it will error saying
that it does not know this format.
Files are visible to all transactions, which means that we can use any
connection to access files. In order to use the existing logic, it makes
sense that in case we have intermediate results, which means we will
write the results to a file, we preserve the same behavior, which is
opening connections to localhost. Therefore if we have intermediate
results we return false in ShouldExecuteCopyLocally.
We can use local copy in INSERT..SELECT, so the check that disables
local execution is removed.
Also a test for local copy where the data size >
LOCAL_COPY_FLUSH_THRESHOLD is added.
use local execution with insert..select
If current transaction is connected to local group we should not use
local copy, because we might not see some of the changes that are made
over the connection to the local group.
A copy will be executed locally if
- Local execution is enabled and current transaction accessed a local placement
- Local execution is enabled and we are inside a transaction block.
So even if local execution is enabled but we are not in a transaction block, the copy will not be run locally.
This will not run locally:
```
COPY distributed_table FROM STDIN;
....
```
This will run locally:
```
SET citus.enable_local_execution to 'on';
BEGIN;
COPY distributed_table FROM STDIN;
COMMIT;
....
```
.
There are 3 ways to do a copy in postgres programmatically:
- from a file
- from a program
- from a callback function
I have chosen to implement it with a callback function, which means that we write the rows of copy from a callback function to the output buffer, which is used to insert tuples into the actual table.
For each shard id, we have a buffer that keeps the current rows to be written, we perform the actual copy operation either when:
- copy buffer for the given shard id reaches to a threshold, which is currently 512KB
- we reach to the end of the copy
The buffer size is debatable(512KB). At a given time, we might allocate (local placement * buffer size) memory at most.
The local copy uses the same copy format as remote copy, which means that we serialize the data in the same format as remote copy and send it locally.
There was also the option to use ExecSimpleRelationInsert to insert
slots one by one, which would avoid the extra
serialization/deserialization but doing some benchmarks it seems that
using buffers are significantly better in terms of the performance.
You can see this comment for more details: https://github.com/citusdata/citus/pull/3557#discussion_r389499054
On some distros (e.g. Redhat 7) there is cmake version 2 and cmake version 3,
safestringlib requires cmake version 3. On those distros the binary is called
cmake3, so try to use that one before falling back to regular cmake binary.
DESCRIPTION: Fix left join shard pruning in pushdown planner
Due to #2481 which moves outer join planning through the pushdown planner we caused a regression on the shard pruning behaviour for outer joins.
In the pushdown planner we make a union of the placement groups for all shards accessed by a query based on the filters we see during planning. Unfortunately implicit filters for left joins are not available during this part. This causes the inner part of an outer join to not prune any shards away. When we take the union of the placement groups it shows the behaviour of not having any shards pruned.
Since the inner part of an outer query will not return any rows if the outer part does not contain any rows we have observed we do not have to add the shard intervals of the inner part of an outer query to the list of shard intervals to query.
Fixes: #3512
* reimplement ExecuteUtilityTaskListWithoutResults for local utility command execution
* introduce new functions for local execution of utility commands
* change ErrorIfTransactionAccessedPlacementsLocally logic for local utility command execution
* enable local execution for TRUNCATE command on distributed & reference tables
* update existing tests for local utility command execution
* enable local execution for DDL commands on distributed & reference tables
* enable local execution for DROP command on distributed & reference tables
* add normalization rules for cascaded commands
* add new tests for local utility command execution
In between stat at the start of the loop and unlink/rmdir at the end the
item that the filename references might have changed. In some cases this
can be a security bug, but since we only delete the file/directory it
should not be for us as far as I can tell. It could in theory still
cause errors though if the a file is changed into a directory by some
other process. This commit makes the code robust against that, by not
using stat and only rely on error codes and retries.
This fixes 3 bugs:
1. `strtoul` never underflows, so that branch was useless
2. `strtoul` has ULONG_MAX instead of LONG_MAX when it overflows
3. `long` and `unsigned long` are not necessarily 64bit, they can be
either more or less. So now `strtoll` and `strtoull` are used
and 64 bit bounds are checked.
New stack memory can contain anything including passwords/private keys.
In these functions we return structs that can have their padding
bytes uninitialized. By first zeroing out the struct fully, we try to
ensure that any data that is in these padding bytes is at least
overwritten once. It might not be zero anymore after setting the fields,
but at least it shouldn't be private data anymore.