* WIP
* wip
* add basic logic to run a single job with repartioning joins with adaptive executor
* fix some warnings and return in ExecuteDependedTasks if there is none
* Add the logic to run depended jobs in adaptive executor
The execution of depended tasks logic is changed. With the current
logic:
- All tasks are created from the top level task list.
- At one iteration:
- CurTasks whose dependencies are executed are found.
- CurTasks are executed in parallel with adapter executor main
logic.
- The iteration is repeated until all tasks are completed.
* Separate adaptive executor repartioning logic
* Remove duplicate parts
* cleanup directories and schemas
* add basic repartion tests for adaptive executor
* Use the first placement to fetch data
In task tracker, when there are replicas, we try to fetch from a replica
for which a map task is succeeded. TaskExecution is used for this,
however TaskExecution is not used in adaptive executor. So we cannot use
the same thing as task tracker.
Since adaptive executor fails when a map task fails (There is no retry
logic yet). We know that if we try to execute a fetch task, all of its
map tasks already succeeded, so we can just use the first one to fetch
from.
* fix clean directories logic
* do not change the search path while creating a udf
* Enable repartition joins with adaptive executor with only enable_reparitition_joins guc
* Add comments to adaptive_executor_repartition
* dont run adaptive executor repartition test in paralle with other tests
* execute cleanup only in the top level execution
* do cleanup only in the top level ezecution
* not begin a transaction if repartition query is used
* use new connections for repartititon specific queries
New connections are opened to send repartition specific queries. The
opened connections will be closed at the FinishDistributedExecution.
While sending repartition queries no transaction is begun so that
we can see all changes.
* error if a modification was done prior to repartition execution
* not start a transaction if a repartition query and sql task, and clean temporary files and schemas at each subplan level
* fix cleanup logic
* update tests
* add missing function comments
* add test for transaction with DDL before repartition query
* do not close repartition connections in adaptive executor
* rollback instead of commit in repartition join test
* use close connection instead of shutdown connection
* remove unnecesary connection list, ensure schema owner before removing directory
* rename ExecuteTaskListRepartition
* put fetch query string in planner not executor as we currently support only replication factor = 1 with adaptive executor and repartition query and we know the query string in the planner phase in that case
* split adaptive executor repartition to DAG execution logic and repartition logic
* apply review items
* apply review items
* use an enum for remote transaction state and fix cleanup for repartition
* add outside transaction flag to find connections that are unclaimed instead of always opening a new transaction
* fix style
* wip
* rename removejobdir to partition cleanup
* do not close connections at the end of repartition queries
* do repartition cleanup in pg catch
* apply review items
* decide whether to use transaction or not at execution creation
* rename isOutsideTransaction and add missing comment
* not error in pg catch while doing cleanup
* use replication factor of the creation time, not current time to decide if task tracker should be chosen
* apply review items
* apply review items
* apply review item
Use partition column's collation for range distributed tables
Don't allow non deterministic collations for hash distributed tables
CoPartitionedTables: don't compare unequal types
Previously,
- we'd push down ORDER BY, but this doesn't order intermediate results between workers
- we'd keep FILTER on master aggregate, which would raise an error about unexpected cstrings
Support for ARRAY[] expressions is limited to having a consistent shape,
eg ARRAY[(int,text),(int,text)] as opposed to ARRAY[(int,text),(float,text)] or ARRAY[(int,text),(int,text,float)]
Initialization of queryWindowClause and queryOrderByLimit "memset" underflow these variables.
It's possible due to the invalid usage sizeof this part of the program cause buffer overflow and function return data corruption in future changes.
In plain words, each distributed plan pulls the necessary intermediate
results to the worker nodes that the plan hits. This is primarily useful
in three ways.
(i) If the distributed plan that uses intermediate
result(s) is a router query, then the intermediate results are only
broadcasted to a single node.
(ii) If a distributed plan consists of only intermediate results, which
is not uncommon, the intermediate results are broadcasted to a single
node only.
(iii) If a distributed query hits a sub-set of the shards in multiple
workers, the intermediate results will be broadcasted to the relevant
node(s).
The final item (iii) becomes crucial for append/range distributed
tables where typically the distributed queries hit a small subset of
shards/workers.
To do this, for each query that Citus creates a distributed plan, we keep
track of the subPlans used in the queryTree, and save it in the distributed
plan. Just before Citus executes each subPlan, Citus first keeps track of
every worker node that the distributed plan hits, and marks every subPlan
should be broadcasted to these nodes. Later, for each subPlan which is a
distributed plan, Citus does this operation recursively since these
distributed plans may access to different subPlans, and those have to be
recorded as well.
DESCRIPTION: Expression in reference join
Fixed: #2582
This patch allows arbitrary expressions in the join clause when joining to a reference table. An example of such joins could be found in CHbenCHmark queries 7, 8, 9 and 11; `mod((s_w_id * s_i_id),10000) = su_suppkey` and `ascii(substr(c_state,1,1)) = n2.n_nationkey`. Since the join is on a reference table these queries are able to be pushed down to the workers.
To implement these queries we will widen the `IsJoinClause` predicate to not check if the expressions are a type `Var` after stripping the implicit coerciens. Instead we define a join clause when the `Var`'s in a clause come from more than 1 table.
This allows more clauses to pass into the logical planner's `MultiNodeTree(...)` planning function. To compensate for this we tighten down the `LocalJoin`, `SinglePartitionJoin` and `DualPartitionJoin` to check for direct column references when planning. This allows the planner to work with arbitrary join expressions on reference tables.
When the user picks "round-robin" policy, the aim is that the load
is distributed across nodes. However, for reference tables on the
coordinator, since local execution kicks in immediately, round-robin
is ignored.
With this change, we're excluding the placement on the coordinator.
Although the approach seems a little bit invasive because of
modifications in the placement list, that sounds acceptable.
We could have done this in some other ways such as:
1) Add a field to "Task->roundRobinPlacement" (or such), which is
updated as the first element after RoundRobinPolicy is applied.
During the execution, if that placement is local to the coordinator,
skip it and try the other remote placements.
2) On TaskAccessesLocalNode()@local_execution.c, check
task_assignment_policy, if round-robin selected and there is local
placement on the coordinator, skip it. However, task assignment is done
on planning, but this decision is happening on the execution, which
could create weird edge cases.
This change was actually already intended in #3124. However, the
postgres Makefile manually enables this warning too. This way we undo
that.
To confirm that it works two functions were changed to make use of not
having the warning anymore.
Phase 1 seeks to implement minimal infrastructure, so does not include:
- dynamic generation of support aggregates to handle multiple arguments
- configuration methods to direct aggregation strategy,
or mark an aggregate's serialize/deserialize as safe to operate across nodes
Aggregates can be distributed when:
- they have a single argument
- they have a combinefunc
- their transition type is not a pseudotype
This is necassery to support Q20 of the CHbenCHmark: #2582.
To summarize the fix: The subquery is converted into an INNER JOIN on a
table. This fixes the issue, since an INNER JOIN on a table is already
supported by the repartion planner.
The way this replacement is happening.:
1. Postgres replaces `col in (subquery)` with a SEMI JOIN (subquery) on col = subquery_result
2. If this subquery is simple enough Postgres will replace it with a
regular read from a table
3. If the subquery returns unique results (e.g. a primary key) Postgres
will convert the SEMI JOIN into an INNER JOIN during the planning. It
will not change this in the rewritten query though.
4. We check if Postgres sends us any SEMI JOINs during its join order
planning, if it doesn't we replace all SEMI JOINs in the rewritten
query with INNER JOIN (which we already support).
Postgres doesn't require you to add all columns that are in the target list to
the GROUP BY when you group by a unique column (or columns). It even actively
removes these group by clauses when you do.
This is normally fine, but for repartition joins it is not. The reason for this
is that the temporary tables don't have these primary key columns. So when the
worker executes the query it will complain that it is missing columns in the
group by.
This PR fixes that by adding an ANY_VALUE aggregate around each variable in
the target list that does is not contained in the group by or in an aggregate.
This is done only for repartition joins.
The ANY_VALUE aggregate chooses the value from an undefined row in the
group.
It looks like the logic to prevent RETURNING in reference tables to
have duplicate entries that comes from local and remote executions
leads to missing some tuples for distributed tables.
With this PR, we're ensuring to kick in the logic for reference tables
only.