In PostgreSQL, user defaults for config parameters can be changed by
ALTER ROLE .. SET statements. We wish to propagate those defaults
across the Citus cluster so that the behavior will be similar in
different workers.
The defaults can either be set in a specific database, or the whole
cluster, similarly they can be set for a single role or all roles.
We propagate the ALTER ROLE .. SET if all the conditions below are met:
- The query affects the current database, or all databases
- The user is already created in worker nodes
In PostgreSQL, user defaults for config parameters can be changed by
ALTER ROLE .. SET statements. We wish to propagate those defaults
accross the Citus cluster so that the behaviour will be similar in
different workers.
The defaults can either be set in a specific database, or the whole
cluster, similarly they can be set for a single role or all roles.
We propagate the ALTER ROLE .. SET if all the conditions below are met:
- The query affects the current database, or all databases
- The user is already created in worker nodes
Sometimes we have concatenated query strings for a task. However,
when we want to find each query string, it is not a trivial task.
Therefore, it makes sense to store this in task so that when we need
each query string we can easily get it.
Some refactoring:
Consolidate expression which decides whether GROUP BY/HAVING are pushed down
Rename early pullUpIntermediateRows to hasNonDistributableAggregates
Create WorkerColumnName to handle formatting WORKER_COLUMN_FORMAT
Ignore NULL StringInfo pointers to SafeToPushdownWindowFunction
Fix bug where SubqueryPushdownMultiNodeTree mutates supplied Query,
SafeToPushdownWindowFunction requires the original query as it relies on rtable
DESCRIPTION: Refactor dependency resolution and resolve from pg_shdepend
This PR refactors how dependencies are resolved by not assuming solely a `pg_depend` record describing the dependency. Instead we keep a definition of the dependency around which records how the dependency is resolved. This can be one of the following ways
- `pg_depend`, data will contain a copy of the `pg_depend` record
- `pg_shdepend`, data will contain a copy of the `pg_shdepend` record
- `ObjectAddress`, data will contain only an `ObjectAddress` describing a dependency
Irregardless of way the dependency was found it will always be able to get to the address of the dependency as that is the most important property.
For some checks we can inspect the source where the dependency was found and perform a deep inspection to decide if we want to follow the dependency. This is important to not distribute dependencies coming from extensions for example.
We cache connections between nodes in our connection management code.
This is good for speed. For security this can be a problem though. If
the user changes settings related to TLS encryption they want those to
be applied to future queries. This is especially important when they did
not have TLS enabled before and now they want to enable it. This can
normally be achieved by changing citus.node_conninfo. However, because
connections are not reopened there will still be old connections that
might not be encrypted at all.
This commit changes that by marking all connections to be shutdown at
the end of their current transaction. This way running transactions will
succeed, even if placement requires connections to be reused for this
transaction. But after this transaction completes any future statements
will use a connection created with the new connection options.
If a connection is requested and a connection is found that is marked
for shutdown, then we don't return this connection. Instead a new one is
created. This is needed to make sure that if there are no running
transactions, then the next statement will not use an old cached
connection, since connections are only actually shutdown at the end of a
transaction.
It seems that when logging is enabled we should not run local shard copy
in parallel with other tests. The reason is that it adds coordinator for
reference tables and if the parallel test creates a schema before this
test is run, the schema will be logged. So it is not deterministic.
If two tables have the same distribution column type, we implicitly
colocate them. This is useful since colocation has a big performance
impact in most applications.
When a table is rebalanced, all of the colocated tables are also
rebalanced. If table A and table B are colocated and we want to
rebalance table A, table B will also be rebalanced. We need replica
identity so that logical replication can replicate updates and deletes
during rebalancing. If table B does not have a replica identity we
error out.
A solution to this is to introduce a UDF so that colocation can be
updated. The remaining tables in the colocation group will stay
colocated. For example if table A, B and C are colocated and after
updating table B's colocations, table A and table C stay colocated.
The "updating colocation" step does not move any data around, it only
updated pg_dist_partition and pg_dist_colocation tables. Specifically it
creates a new colocation group for the table and updates the entry in
pg_dist_partition while invalidating any cache.
Citus coordinator (or MX nodes) caches `citus.max_cached_conns_per_worker` connections
per node. This means that, those connections are not terminated after each statement.
Instead, cached to avoid the cost of re-establishment. This is crucial for OLTP performance.
The problem with that approach is that, we never properly handle the termnation of
those cached connections. For instance, when a session on the coordinator disconnects,
you'd see the following logs on the workers:
```
2020-03-20 09:13:39.454 CET [64028] LOG: could not receive data from client: Connection reset by peer
```
With this patch, we're terminating the cached connections properly at the end of the connection.
We're getting a lot of random failures on CI regarding connection errors. This
works around that by not running that create lots of connections in parallel.
This is needed to automatically generate .bc (bitcode) files when
postgres is compiled with llvmjit support.
It also has the advantage that cmake is not required for the build
anymore.
As discussed with @JelteF; #3559 caused consistent errors on BSD (OSX). Given a group of people use this environment to develop on it is an undesirable change.
This reverts commit ca8f7119fe.
We have special logic to copy into intermediate results and we use a
custom format for that, "result" copy format. Postgres internally does
not know this format and if we use this locally it will error saying
that it does not know this format.
Files are visible to all transactions, which means that we can use any
connection to access files. In order to use the existing logic, it makes
sense that in case we have intermediate results, which means we will
write the results to a file, we preserve the same behavior, which is
opening connections to localhost. Therefore if we have intermediate
results we return false in ShouldExecuteCopyLocally.
We can use local copy in INSERT..SELECT, so the check that disables
local execution is removed.
Also a test for local copy where the data size >
LOCAL_COPY_FLUSH_THRESHOLD is added.
use local execution with insert..select