We have two variables related to local execution status:
TransactionAccessedLocalPlacement and
TransactionConnectedToLocalGroup. Only one of these fields should be
set at a time, but we did not have any check for this constraint, which
was error prone.
These two variables are used to decide whether we should execute the
current query, and therefore its tasks, via local execution in the
current session or over a connection. Now that they are combined into
an enum, it is much clearer what these states mean.
Also, there is now a method to change the local execution status. The
method errors out if we try to make an invalid state transition. This
will help us catch such problems early.
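A minimal sketch of the idea, with hypothetical enum values and function names (not the exact code):

```c
#include "postgres.h"

typedef enum LocalExecutionStatus
{
	LOCAL_EXECUTION_OPTIONAL,	/* not decided yet */
	LOCAL_EXECUTION_REQUIRED,	/* this session already accessed a local placement */
	LOCAL_EXECUTION_DISABLED	/* we already connected to the local group */
} LocalExecutionStatus;

static LocalExecutionStatus CurrentLocalExecutionStatus = LOCAL_EXECUTION_OPTIONAL;

static void
SetLocalExecutionStatus(LocalExecutionStatus newStatus)
{
	/* once one of the final states is reached, switching to the other is a bug */
	if ((CurrentLocalExecutionStatus == LOCAL_EXECUTION_REQUIRED &&
		 newStatus == LOCAL_EXECUTION_DISABLED) ||
		(CurrentLocalExecutionStatus == LOCAL_EXECUTION_DISABLED &&
		 newStatus == LOCAL_EXECUTION_REQUIRED))
	{
		ereport(ERROR, (errmsg("cannot switch local execution status from %d to %d",
							   CurrentLocalExecutionStatus, newStatus)));
	}

	CurrentLocalExecutionStatus = newStatus;
}
```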
For shard placements, we were setting nodeid, nodename, nodeport and
nodegroup manually. This is very error prone, and it seems that we had
already forgotten to set some of them, which means they were left at
their default values, e.g. a group id of 0 even though the actual group
id is not 0.
The implication is that we would end up with inconsistent worker metadata.
A new method is introduced, and we now call it to set those fields,
so as long as we go through this method we won't be setting
inconsistent metadata.
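A rough sketch of such a helper; the name and the exact field names are approximate, not the signature in the code:

```c
#include "postgres.h"

/*
 * Sketch: copy every node related field from the worker node onto the
 * shard placement in a single place, so none of them can be forgotten.
 * Assumes Citus' ShardPlacement and WorkerNode structs with these fields.
 */
static void
SetPlacementNodeMetadata(ShardPlacement *placement, WorkerNode *workerNode)
{
	placement->nodeId = workerNode->nodeId;
	placement->nodeName = pstrdup(workerNode->workerName);
	placement->nodePort = workerNode->workerPort;
	placement->groupId = workerNode->groupId;
}
```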
It probably makes sense to have a struct for these fields. We already
have NodeMetadata, but it doesn't have nodename or nodeport, so that
could be done in a follow-up refactor to make things simpler.
This is possible whenever we aren't pulling up intermediate rows.
We want to do this because it was done in 9.2, and
some queries rely on the performance benefit of grouping producing distinct values.
This change was introduced when implementing window functions on the coordinator.
ExecuteTaskListExtended is the common method for different codepaths,
so instead of writing separate local execution logic in each codepath,
it makes more sense to have the logic there. We still need to
do some refactoring; this is an initial step.
After this commit, we can run create shard commands locally. There is a
special case with shard creation commands: a create shard command might
have a concatenated query string, but local execution did not know
how to execute a task with multiple query strings. That is also
implemented in this commit: we go over each query in the concatenated
query string and plan/execute them one by one.
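The loop itself is straightforward; a minimal sketch, where PlanAndExecuteLocalTaskQuery is a hypothetical placeholder for the local planning/execution path:

```c
#include "postgres.h"
#include "nodes/parsenodes.h"
#include "nodes/pg_list.h"
#include "tcop/tcopprot.h"	/* pg_parse_query() */

static void
ExecuteConcatenatedQueryString(const char *concatenatedQueryString)
{
	/* pg_parse_query() returns one RawStmt per statement in the string */
	List *parseTreeList = pg_parse_query(concatenatedQueryString);
	ListCell *parseTreeCell = NULL;

	foreach(parseTreeCell, parseTreeList)
	{
		RawStmt *parseTree = (RawStmt *) lfirst(parseTreeCell);

		/* plan and execute each query one by one */
		PlanAndExecuteLocalTaskQuery(parseTree, concatenatedQueryString);
	}
}
```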
A cleaner solution would be to make sure that each task has a
single query. We currently cannot do that because we need to preserve
task dependencies. However, it would make sense to do so at some point,
and it would simplify the code a lot.
We had many fields in Task related to query strings. It was fairly
complex, and only one of them could be set at a time. Therefore it makes
more sense to abstract this and use a union so that it is clear that
only one of them should be set.
We have three fields that could have query related strings:
- queryForLocation
- queryStringLazy
- perPlacementQueryStrings
Respectively, they can be set with:
- SetTaskQueryString
- SetTaskQueryIfShouldLazyDeparse
- SetTaskPerPlacementQueryStrings
Direct usage of the query related fields is also removed.
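Conceptually the shape is a tagged union; the type and member names below are approximate, not the exact code:

```c
#include "postgres.h"
#include "nodes/parsenodes.h"	/* Query */
#include "nodes/pg_list.h"

typedef enum TaskQueryType
{
	TASK_QUERY_NULL,
	TASK_QUERY_OBJECT,				/* set via SetTaskQueryIfShouldLazyDeparse */
	TASK_QUERY_TEXT,				/* set via SetTaskQueryString */
	TASK_QUERY_TEXT_PER_PLACEMENT	/* set via SetTaskPerPlacementQueryStrings */
} TaskQueryType;

typedef struct TaskQuery
{
	TaskQueryType queryType;

	union
	{
		Query *queryForDeparsing;		/* deparsed to text lazily */
		char *queryStringLazy;			/* one string shared by all placements */
		List *perPlacementQueryStrings;	/* one string per placement */
	} data;
} TaskQuery;
```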
Rename queryForLocalExecution
Currently queryForLocalExecution is only used for deparsing purposes,
so it makes sense to rename it to reflect what it does.
TaskQueryStringForPlacement simplifies how the executor gets the query
string for a given placement. The task uses the relevant fields to
return the correct query string for that placement; the executor doesn't
need to know the details.
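In terms of the sketch above, the placement lookup reduces to something like the following (names approximate; TaskQueryStringAllPlacements is the rename described just below):

```c
/* sketch: pick the per-placement string when present, otherwise the shared one */
static char *
TaskQueryStringForPlacement(Task *task, int placementIndex)
{
	if (task->taskQuery.queryType == TASK_QUERY_TEXT_PER_PLACEMENT)
	{
		/* e.g. INSERT..SELECT via repartitioning: one string per placement */
		return (char *) list_nth(task->taskQuery.data.perPlacementQueryStrings,
								 placementIndex);
	}

	/* otherwise the same query string applies to every placement */
	return TaskQueryStringAllPlacements(task);
}
```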
rename TaskQueryString as TaskQueryStringAllPlacements
TaskQueryString returns the query string that is the same for all
placements. In INSERT..SELECT the query string can be different for
each placement, and the adaptive executor uses TaskQueryStringForPlacement,
which returns the query string for a single placement. It therefore
makes sense to rename TaskQueryString to TaskQueryStringAllPlacements,
as it returns the query string that applies to all placements.
rename SetTaskQuery as SetTaskQueryIfShouldLazyDeparse
SetTaskQuery does not always set the task query; it can set the query
string instead. So it is clearer to name it
SetTaskQueryIfShouldLazyDeparse, since it sets the query (rather than
the query string) only when we should deparse the query lazily.
It is possible for a task to have a different query string for each
placement. This is the case in INSERT..SELECT via repartitioning. When
we set task->perPlacementQueryStrings, we should also set
queryStringLazy to NULL, so a method is created for that purpose.
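A minimal sketch of that method, written against the direct fields purely for illustration:

```c
/* sketch: switching to per-placement strings must clear the shared string */
static void
SetTaskPerPlacementQueryStrings(Task *task, List *perPlacementQueryStrings)
{
	task->perPlacementQueryStrings = perPlacementQueryStrings;

	/* the shared query string no longer applies once placements diverge */
	task->queryStringLazy = NULL;
}
```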
Sometimes we have concatenated query strings for a task, but finding
each individual query string within them is not trivial.
Therefore, it makes sense to store them in the task so that when we
need each query string we can easily get it.
Some refactoring:
- Consolidate the expression which decides whether GROUP BY/HAVING are pushed down
- Rename early pullUpIntermediateRows to hasNonDistributableAggregates
- Create WorkerColumnName to handle formatting WORKER_COLUMN_FORMAT
- Ignore NULL StringInfo pointers passed to SafeToPushdownWindowFunction
- Fix a bug where SubqueryPushdownMultiNodeTree mutates the supplied Query;
  SafeToPushdownWindowFunction requires the original query as it relies on rtable
A copy will be executed locally if:
- Local execution is enabled and the current transaction has accessed a local placement, or
- Local execution is enabled and we are inside a transaction block.
So even if local execution is enabled but we are not in a transaction block, the copy will not be run locally.
This will not run locally:
```
COPY distributed_table FROM STDIN;
....
```
This will run locally:
```
SET citus.enable_local_execution to 'on';
BEGIN;
COPY distributed_table FROM STDIN;
....
COMMIT;
```
There are 3 ways to do a copy in postgres programmatically:
- from a file
- from a program
- from a callback function
I have chosen to implement it with a callback function, which means that we write the copy rows from a callback function into the output buffer, which is then used to insert tuples into the actual table.
For each shard id, we have a buffer that keeps the current rows to be written. We perform the actual copy operation when either:
- the copy buffer for the given shard id reaches a threshold, which is currently 512KB, or
- we reach the end of the copy.
The buffer size (512KB) is debatable. At a given time, we might allocate at most (number of local placements * buffer size) of memory.
The local copy uses the same copy format as remote copy, which means that we serialize the data in the same format as remote copy and send it locally.
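A minimal sketch of the buffering, with hypothetical helper names (FlushLocalCopyBuffer stands in for the code path that feeds the buffer to the local shard through the COPY callback):

```c
#include "postgres.h"
#include "lib/stringinfo.h"

#define LOCAL_COPY_FLUSH_THRESHOLD_BYTES (512 * 1024)	/* 512KB, debatable */

/*
 * Sketch: append a serialized row to the shard's buffer and flush when the
 * buffer grows past the threshold or when the copy ends.
 */
static void
AddRowToLocalCopyBuffer(StringInfo shardBuffer, const char *rowData, int rowLength,
						Oid shardRelationId, bool isEndOfCopy)
{
	appendBinaryStringInfo(shardBuffer, rowData, rowLength);

	if (shardBuffer->len > LOCAL_COPY_FLUSH_THRESHOLD_BYTES || isEndOfCopy)
	{
		/* hypothetical: runs the actual local COPY using the buffer's contents */
		FlushLocalCopyBuffer(shardBuffer, shardRelationId);
		resetStringInfo(shardBuffer);
	}
}
```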
There was also the option to use ExecSimpleRelationInsert to insert
slots one by one, which would avoid the extra
serialization/deserialization, but based on some benchmarks it seems
that using buffers is significantly better in terms of performance.
You can see this comment for more details: https://github.com/citusdata/citus/pull/3557#discussion_r389499054
DESCRIPTION: Fix left join shard pruning in pushdown planner
Due to #2481, which moves outer join planning through the pushdown planner, we caused a regression in the shard pruning behaviour for outer joins.
In the pushdown planner we take the union of the placement groups for all shards accessed by a query, based on the filters we see during planning. Unfortunately, implicit filters for left joins are not available during this step, which causes the inner side of an outer join not to prune any shards away. When we then take the union of the placement groups, the end result is that no shards are pruned.
Since the inner side of an outer join will not return any rows for rows that are absent from the outer side, we do not have to add the shard intervals of the inner side to the list of shard intervals to query.
Fixes: #3512
Add failing tests and make changes to at least avoid crashes.
Fix HAVING subquery pushdown ignoring reference-table-only subqueries;
also include HAVING in recursive planning.
Given that we have a function IsDistributedTable which includes reference tables,
it seems best to have IsDistributedTableRTE & QueryContainsDistributedTableRTE
reflect that they do not include reference tables in their check.
Similarly, SublinkList's name should reflect that it only scans WHERE.
contain_agg_clause asserts that we don't have SubLinks;
use contain_aggs_of_level as suggested by the PostgreSQL source code.
Before this commit, we considered !ContainsRecurringRTE() enough
for NotContainsOnlyRecurringTuples. Instead, we can check
for the existence of any distributed table.
DESCRIPTION: Fixes a bug that causes wrong results with complex outer joins
There are two problems with our early exit strategy that this commit fixes:
1. When we decided that a subplan's results are sent to all worker nodes,
we used to skip traversing the whole distributed plan, instead of
skipping only that subplan.
2. We used to consider all available nodes in the cluster (secondaries
and inactive nodes as well as active primaries) when deciding on the early
exit strategy. This resulted in failing to exit early when there were
secondaries or inactive nodes.
Semmle reported quite a few places where we use a value that could be NULL. Most of these are not actually a real issue, but it is better to be on the safe side with these things and make the static analysis happy.
DESCRIPTION: Replace the query planner for the coordinator part with the postgres planner
Closes #2761
Citus had a simple rule based planner for the query executed on the query coordinator. This planner grew over time with the addition of SQL support until it was getting close to the functionality of the postgres planner, except that the code was brittle and its complexity rose, which made it hard to add new SQL support.
Given its resemblance to the postgres planner, it was a long-standing wish to replace our hand crafted planner with the well supported postgres planner. This patch replaces our planner with a call to postgres' planner.
Due to the functionality of the postgres planner we needed to support both projections and filters/quals on the citus custom scan node. When a sort operation is planned above the custom scan it might require fields to be reordered in the custom scan before returning the tuple (projection). The postgres planner assumes every custom scan node implements projections. Because we controlled the plan that was created, we prevented reordering in the custom scan and had never implemented it before.
The same optimisation applies to HAVING clauses that could have been WHERE clauses. Instead of applying the filter as a HAVING on the aggregate, the planner pushes it down into the plan, where it can reach a custom scan node.
We have implemented both filters and projections at the point where tuples are read from the tuple store. If no projections or filters are required, the tuple from the tuple store is returned directly. Otherwise we loop over tuples from the tuple store, applying the filter and projection, until a matching tuple is found and returned.
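A minimal sketch of that loop, using the standard executor primitives; the function in the actual code may be organized differently:

```c
#include "postgres.h"
#include "executor/executor.h"		/* ExecQual(), ExecProject() */
#include "utils/tuplestore.h"

/*
 * Sketch: return the next tuple from the tuple store, applying the
 * pushed-down qual and the projection when they are present.
 */
static TupleTableSlot *
ReturnTupleFromTuplestore(Tuplestorestate *tupleStore, TupleTableSlot *scanSlot,
						  ExprState *qual, ProjectionInfo *projection,
						  ExprContext *econtext)
{
	while (tuplestore_gettupleslot(tupleStore, true, false, scanSlot))
	{
		/* fast path: nothing to filter or reorder, hand the tuple back as-is */
		if (qual == NULL && projection == NULL)
		{
			return scanSlot;
		}

		ResetExprContext(econtext);
		econtext->ecxt_scantuple = scanSlot;

		/* skip tuples that do not satisfy the pushed-down filter */
		if (qual != NULL && !ExecQual(qual, econtext))
		{
			continue;
		}

		/* reorder/compute the columns the parent node expects */
		if (projection != NULL)
		{
			return ExecProject(projection);
		}

		return scanSlot;
	}

	return NULL;	/* tuple store exhausted */
}
```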
Besides filters being pushed down, a side effect of having quals that could have been a WHERE clause is that a read of an intermediate result can happen before the first tuple is fetched from the custom scan. This failed because the intermediate result would only be pulled to the coordinator on the first tuple fetch. To overcome this problem we now run the distributed subplans before we run the postgres executor. This ensures the intermediate result is present on the coordinator in time. We account for total time instrumentation by removing the instrumentation before handing control to the postgres executor and updating the timings ourselves.
For future SQL support it is enough to create a valid query structure for the part of the query to be executed on the query coordinating node. As a utility, we serialise and print the query at debug level 4 so engineers can inspect what kind of query is being planned on the query coordinator.
We don't actually use these functions anymore since merging #1477.
Advantages of removing them:
1. They add work whenever we add a new node.
2. They contain some usage of stdlib APIs that are banned by Microsoft.
Removing them means we don't have to replace those with safe ones.
Comparison between differently sized integers in loop conditions can cause
infinite loops. This can happen when doing something like this:
```c
/* PG_INT32_MAX and the int32/int64 typedefs come from postgres' c.h */
int64 very_big = (int64) PG_INT32_MAX + 1;
for (int32 i = 0; i < very_big; i++)
{
	/* do something */
}

/*
 * Never reached: i wraps around (signed overflow) before it can ever
 * reach very_big, so the loop runs forever.
 */
```