citus

Commit Graph

Author	SHA1	Message	Date
Colm	beb222ea8d	PG17 compatibility: fix multi-1 diffs caused by PG17 optimizer enhancements (#7769) This fix ensures that the expected DEBUG error messages from the router planner in `multi_router_planner`, `multi_router_planner_fast_path` and `query_single_shard_table` are present with PG17. In `query_single_shard_table` the diff: ``` SELECT COUNT(*) FROM citus_local_table t1 WHERE t1.b IN ( SELECT b+1 FROM nullkey_c1_t1 t2 WHERE t2.b = t1.a ); -DEBUG: router planner does not support queries that reference non-colocated distributed tables +DEBUG: Local tables cannot be used in distributed queries. ``` occurred because of [this PG17 commit](https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9f1337639) which enables the optimizer to pull up a correlated ANY subquery to a join. The fix inhibits subquery pull up by including a volatile function in the predicate involving the ANY subquery, preserving the pre-PG17 optimizer treatment of the query. In the case of `multi_router_planner` and `multi_router_planner_fast_path` the diffs: ``` -- partition_column is null clause does not prune out any shards, -- all shards remain after shard pruning, not router plannable SELECT * FROM articles_hash a WHERE a.author_id is null; -DEBUG: Router planner cannot handle multi-shard select queries +DEBUG: Creating router plan ``` are because of [this PG17 commit](https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b262ad440), which enables the optimizer to detect and remove redundant IS (NOT) NULL expressions. The fix is to adjust the table definition so the column used for distribution is not marked NOT NULL, thus preserving the pre-PG17 query planning behavior. 
Finally, a rule is added to `normalize.sed` to ignore DEBUG logging in CREATE MATERIALIZED VIEW AS statements introduced by [this PG17 commit](https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b4da732fd64); _when creating materialized views, use REFRESH logic to load data_, a consequence of which is that with `client_min_messages` at `DEBUG2` Postgres emits extra detail for CREATE MATERIALIZED VIEW AS statements. ``` CREATE MATERIALIZED VIEW mv_articles_hash_empty AS SELECT * FROM articles_hash WHERE author_id = 1; DEBUG: Creating router plan DEBUG: query has a single distribution column value: 1 +DEBUG: drop auto-cascades to type multi_router_planner.pg_temp_61391 +DEBUG: drop auto-cascades to type multi_router_planner.pg_temp_61391[] ``` The rule can be changed to a normalization, or possibly dropped, when 17 becomes the minimum supported version.	2025-03-12 12:25:49 +03:00
Önder Kalacı	cb5eb73048	Add support for router INSERT .. SELECT commands (#7077) Traditionally, our planner works in the following order: router -> pushdown -> repartition -> pull to coordinator. However, for INSERT .. SELECT commands, we did not support "router". In practice, that is not a big issue, because pushdown planning can handle the router case as well. However, with PG 16, certain outer joins are converted to JOIN without any conditions (e.g., JOIN .. ON (true)) and the filters are pushed down to the tables. When the filters are pushed down to the tables, the router planner can detect them. However, the pushdown planner relies on JOIN conditions. An example query: ``` INSERT INTO agg_events (user_id) SELECT raw_events_first.user_id FROM raw_events_first LEFT JOIN raw_events_second ON raw_events_first.user_id = raw_events_second.user_id WHERE raw_events_first.user_id = 10; ``` As a side effect of this change, we can now also relax certain limitations that the "pushdown" planner imposes but "router" does not. So, with this PR, we also allow those. Closes https://github.com/citusdata/citus/pull/6772 DESCRIPTION: Prevents unnecessarily pulling the data into coordinator for some INSERT .. SELECT queries that target a single-shard group	2023-07-28 15:07:20 +03:00
Önder Kalacı	862dae823e	Expand EnableNonColocatedRouterQueryPushdown to cover shard colocation (e.g., shard index) (#7076) Previously, we only checked whether the relations were colocated, but we ignored the shard indexes. That caused certain queries to still be accidentally planned as router queries. We should enforce colocation checks for both shard index and table colocation id to make the check restrictive enough. For example, the following query should not be router, and after this patch, it won't: ```SQL SELECT user_id FROM ((SELECT user_id FROM raw_events_first WHERE user_id = 15) EXCEPT (SELECT user_id FROM raw_events_second where user_id = 17)) as foo; ``` DESCRIPTION: Enforce shard level colocation with citus.enable_non_colocated_router_query_pushdown	2023-07-25 16:20:13 +03:00
Onur Tirtir	8ff9dde4b3	Prevent pushing down INSERT .. SELECT queries that we shouldn't (and allow some more) (#6752) Previously, the INSERT .. SELECT planner was pushing down some queries that should not be pushed down, due to wrong colocation checks. It was checking whether one of the tables in the SELECT part and the target table are colocated. But now, we check colocation for all tables in the SELECT part and the target table. Another problem with the INSERT .. SELECT planner was that some queries, which are valid to be pushed down, were not pushed down due to unnecessary checks for constructs that are currently supported, e.g. the UNION check. As a solution, we reused the pushdown planner checks for the INSERT .. SELECT planner. DESCRIPTION: Fixes a bug that causes incorrectly pushing down some INSERT .. SELECT queries that we shouldn't DESCRIPTION: Prevents unnecessarily pulling the data into coordinator for some INSERT .. SELECT queries DESCRIPTION: Drops support for pushing down INSERT .. SELECT with append table as target Fixes #6749. Fixes #1428. Fixes #6920. --------- Co-authored-by: aykutbozkurt <aykut.bozkurt1995@gmail.com>	2023-05-17 15:05:08 +03:00
Onur Tirtir	56d217b108	Mark objects as distributed even when pg_dist_node is empty (#6900 ) We mark objects as distributed objects in Citus metadata only if we need to propagate given the command that creates it to worker nodes. For this reason, we were not doing this for the objects that are created while pg_dist_node is empty. One implication of doing so is that we defer the schema propagation to the time when user creates the first distributed table in the schema. However, this doesn't help for schema-based sharding (#6866) because we want to sync pg_dist_tenant_schema to the worker nodes even for empty schemas too. * Support test dependencies for isolation tests without a schedule * Comment out a test due to a known issue (#6901) * Also, reduce the verbosity for some log messages and make some tests compatible with run_test.py.	2023-05-16 11:45:42 +03:00
Marco Slot	fcaabfdcf3	Remove remaining master_create_distributed_table usages (#6477 ) Co-authored-by: Marco Slot <marco.slot@gmail.com>	2022-11-04 16:30:06 +01:00
Burak Velioglu	fa6866ed36	Start to propagate functions to worker nodes with CREATE FUNCTION command together with its dependencies. If the function depends on any nondistributable object, the function will be created only locally. The parameterless version of create_distributed_function becomes obsolete with this change; it will be deprecated from the code with a subsequent PR.	2022-02-18 13:56:51 +03:00
Naisila Puka	dbb88f6f8b	Fix insert query with CTEs/sublinks/subqueries etc (#4700 ) * Fix insert query with CTE * Add more cases with deferred pruning but false fast path * Add more tests * Better readability with if statements	2021-02-23 18:00:47 +03:00
Marco Slot	707a6554b1	Support co-located/recurring correlated subqueries	2020-12-15 14:17:16 +01:00
Marco Slot	f2538a456f	Support co-located/recurring sublinks in the target list	2020-12-13 15:45:24 +01:00
Sait Talha Nisanci	01c23b0df2	update test outputs with task-tracker removal	2020-07-21 16:25:08 +03:00
Sait Talha Nisanci	1dbd545cf4	replace task-tracker with adaptive in tests	2020-07-21 16:21:01 +03:00
Onder Kalaci	c25de2cf22	Remove flag from As it doesn't make any sense anymore	2020-07-20 12:45:05 +02:00
SaitTalhaNisanci	b3af63c8ce	Remove task tracker executor (#3850 ) * use adaptive executor even if task-tracker is set * Update check-multi-mx tests for adaptive executor Basically repartition joins are enabled where necessary. For parallel tests max adaptive executor pool size is decresed to 2, otherwise we would get too many clients error. * Update limit_intermediate_size test It seems that when we use adaptive executor instead of task tracker, we exceed the intermediate result size less in the test. Therefore updated the tests accordingly. * Update multi_router_planner It seems that there is one problem with multi_router_planner when we use adaptive executor, we should fix the following error: +ERROR: relation "authors_range_840010" does not exist +CONTEXT: while executing command on localhost:57637 * update repartition join tests for check-multi * update isolation tests for repartitioning * Error out if shard_replication_factor > 1 with repartitioning As we are removing the task tracker, we cannot switch to it if shard_replication_factor > 1. In that case, we simply error out. * Remove MULTI_EXECUTOR_TASK_TRACKER * Remove multi_task_tracker_executor Some utility methods are moved to task_execution_utils.c. * Remove task tracker protocol methods * Remove task_tracker.c methods * remove unused methods from multi_server_executor * fix style * remove task tracker specific tests from worker_schedule * comment out task tracker udf calls in tests We were using task tracker udfs to test permissions in multi_multiuser.sql. We should find some other way to test them, then we should remove the commented out task tracker calls. 
* remove task tracker test from follower schedule * remove task tracker tests from multi mx schedule * Remove task-tracker specific functions from worker functions * remove multi task tracker extra schedule * Remove unused methods from multi physical planner * remove task_executor_type related things in tests * remove LoadTuplesIntoTupleStore * Do initial cleanup for repartition leftovers During startup, task tracker would call TrackerCleanupJobDirectories and TrackerCleanupJobSchemas to clean up leftover directories and job schemas. With adaptive executor, while doing repartitions it is possible to leak these things as well. We don't retry cleanups, so it is possible to have leftover in case of errors. TrackerCleanupJobDirectories is renamed as RepartitionCleanupJobDirectories since it is repartition specific now, however TrackerCleanupJobSchemas cannot be used currently because it is task tracker specific. The thing is that this function is a no-op currently. We should add cleaning up intermediate schemas to DoInitialCleanup method when that problem is solved(We might want to solve it in this PR as well) * Revert "remove task tracker tests from multi mx schedule" This reverts commit `03ecc0a681`. * update multi mx repartition parallel tests * not error with task_tracker_conninfo_cache_invalidate * not run 4 repartition queries in parallel It seems that when we run 4 repartition queries in parallel we get too many clients error on CI even though we don't get it locally. Our guess is that, it is because we open/close many connections without doing some work and postgres has some delay to close the connections. Hence even though connections are removed from the pg_stat_activity, they might still not be closed. 
If the above assumption is correct, it is unlikely for it to happen in practice because: - There is some network latency in clusters, so this leaves some times for connections to be able to close - Repartition joins return some data and that also leaves some time for connections to be fully closed. As we don't get this error in our local, we currently assume that it is not a bug. Ideally this wouldn't happen when we get rid of the task-tracker repartition methods because they don't do any pruning and might be opening more connections than necessary. If this still gives us "too many clients" error, we can try to increase the max_connections in our test suite(which is 100 by default). Also there are different places where this error is given in postgres, but adding some backtrace it seems that we get this from ProcessStartupPacket. The backtraces can be found in this link: https://circleci.com/gh/citusdata/citus/138702 * Set distributePlan->relationIdList when it is needed It seems that we were setting the distributedPlan->relationIdList after JobExecutorType is called, which would choose task-tracker if replication factor > 1 and there is a repartition query. However, it uses relationIdList to decide if the query has a repartition query, and since it was not set yet, it would always think it is not a repartition query and would choose adaptive executor when it should choose task-tracker. * use adaptive executor even with shard_replication_factor > 1 It seems that we were already using adaptive executor when replication_factor > 1. So this commit removes the check. 
* remove multi_resowner.c and deprecate some settings * remove TaskExecution related leftovers * change deprecated API error message * not recursively plan single relation repartition subquery * recursively plan single relation repartition subquery * test deprecated task tracker functions * fix overlapping shard intervals in range-distributed test * fix error message for citus_metadata_container * drop task-tracker deprecated functions * put the implementation back to worker_cleanup_job_schema_cache since citus cloud uses it * drop some functions, add downgrade script Some deprecated functions are dropped. Downgrade script is added. Some gucs are deprecated. A new guc for repartition joins bucket size is added. * order by a test to fix flakiness	2020-07-18 13:11:36 +03:00
Philip Dubé	1722d8ac8b	Allow routing modifying CTEs We still recursively plan some cases, eg: - INSERTs - SELECT FOR UPDATE when reference tables in query - Everything must be same single shard & replication model	2020-06-11 15:14:06 +00:00
Marco Slot	cb3d90bdc8	Simplify INSERT logic in router planner	2020-03-10 15:54:40 +01:00
Onder Kalaci	975c4c2264	Do not prune shards if the distribution key is NULL The root of the problem is that, standard_planner() converts the following qual ``` {OPEXPR :opno 98 :opfuncid 67 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 100 :args ( {VAR :varno 1 :varattno 1 :vartype 25 :vartypmod -1 :varcollid 100 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 45 } {CONST :consttype 25 :consttypmod -1 :constcollid 100 :constlen -1 :constbyval false :constisnull true :location 51 :constvalue <> } ) :location 49 } ``` To ``` ( {CONST :consttype 16 :consttypmod -1 :constcollid 0 :constlen 1 :constbyval true :constisnull true :location -1 :constvalue <> } ) ``` So, Citus doesn't deal with NULL values in real-time or non-fast path router queries. And, in the FastPathRouter planner, we check constisnull in DistKeyInSimpleOpExpression(). However, in deferred pruning case, we do not check for isnull for const. Thus, the fix consists of two parts: - Let PruneShards() not crash when NULL parameter is passed - For deferred shard pruning in fast-path queries, explicitly check that we have CONST which is not NULL	2020-02-13 15:00:31 +01:00
Hadi Moshayedi	89463f9760	Repartitioned INSERT/SELECT: cast columns in SELECT targets	2020-01-16 23:24:52 -08:00
Onder Kalaci	dc17c2658e	Defer shard pruning for fast-path router queries to execution This is purely to enable better performance with prepared statements. Before this commit, the fast path queries with prepared statements where the distribution key includes a parameter always went through distributed planning. After this change, we only go through distributed planning on the first 5 executions.	2020-01-16 16:59:36 +01:00
Onder Kalaci	5cb203b276	Update regression tests-1 These set of tests has changed in both PG 11 and PG 12. The changes are only about CTE inlining kicking in both versions, and yielding the exact same distributed planning.	2020-01-16 12:28:15 +01:00
Philip Dubé	bf7d86a3e8	Fix typo: aggragate -> aggregate	2020-01-07 01:16:09 +00:00
Jelte Fennema	acd12a6de5	Normalize tests: s/read_intermediate_result\('[0-9]+_/read_intermediate_result('XXX_/g	2020-01-06 09:32:03 +01:00
Jelte Fennema	21dbd4e55d	Normalize tests: s/generating subplan [0-9]+\_/generating subplan XXX\_/g	2020-01-06 09:32:03 +01:00
Jelte Fennema	58723dd8b0	Normalize tests: s/DEBUG: Plan [0-9]+/DEBUG: Plan XXX/g	2020-01-06 09:32:03 +01:00
Jelte Fennema	7730bd449c	Normalize tests: Remove trailing whitespace	2020-01-06 09:32:03 +01:00
Jelte Fennema	6353c9907f	Normalize tests: Line info varies between versions	2020-01-06 09:32:03 +01:00
Jelte Fennema	7f3de68b0d	Normalize tests: header separator length	2020-01-06 09:32:03 +01:00
Philip Dubé	c563e0825c	Strip trailing whitespace and add final newline (#3186 ) This brings files in line with our editorconfig file	2019-11-21 14:25:37 +01:00
Jelte Fennema	7abedc38b0	Support subqueries in HAVING (#3098) Areas for further optimization: - Don't save subquery results to a local file on the coordinator when the subquery is not in the having clause - Push the HAVING with subquery to the workers if there's a group by on the distribution column - Don't push down the results to the workers when we don't push down the HAVING clause, only the coordinator needs it Fixes #520 Fixes #756 Closes #2047	2019-10-16 16:40:14 +02:00
Nils Dijk	936d546a3c	Refactor Ensure Schema Exists to Ensure Dependencies Exists (#2882) DESCRIPTION: Refactor ensure schema exists to dependency exists Historically we only supported schemas as table dependencies to be created on the workers before a table gets distributed. This PR puts infrastructure in place to walk pg_depend to figure out which dependencies to create on the workers. Currently only schemas are supported as objects to create before creating a table. We also keep track of dependencies that have been created in the cluster. When we add a new node to the cluster we use this catalog to know which objects need to be created on the worker. A side effect of knowing which objects are already distributed is that we don't have debug messages anymore when creating schemas that are already created on the workers.	2019-09-04 14:10:20 +02:00
Önder Kalacı	40da78c6fd	Introduce the adaptive executor (#2798 ) With this commit, we're introducing the Adaptive Executor. The commit message consists of two distinct sections. The first part explains how the executor works. The second part consists of the commit messages of the individual smaller commits that resulted in this commit. The readers can search for the each of the smaller commit messages on https://github.com/citusdata/citus and can learn more about the history of the change. /------------------------------------------------------------------------- * adaptive_executor.c * * The adaptive executor executes a list of tasks (queries on shards) over * a connection pool per worker node. The results of the queries, if any, * are written to a tuple store. * * The concepts in the executor are modelled in a set of structs: * * - DistributedExecution: * Execution of a Task list over a set of WorkerPools. * - WorkerPool * Pool of WorkerSessions for the same worker which opportunistically * executes "unassigned" tasks from a queue. * - WorkerSession: * Connection to a worker that is used to execute "assigned" tasks * from a queue and may execute unasssigned tasks from the WorkerPool. * - ShardCommandExecution: * Execution of a Task across a list of placements. * - TaskPlacementExecution: * Execution of a Task on a specific placement. * Used in the WorkerPool and WorkerSession queues. * * Every connection pool (WorkerPool) and every connection (WorkerSession) * have a queue of tasks that are ready to execute (readyTaskQueue) and a * queue/set of pending tasks that may become ready later in the execution * (pendingTaskQueue). The tasks are wrapped in a ShardCommandExecution, * which keeps track of the state of execution and is referenced from a * TaskPlacementExecution, which is the data structure that is actually * added to the queues and describes the state of the execution of a task * on a particular worker node. 
* * When the task list is part of a bigger distributed transaction, the * shards that are accessed or modified by the task may have already been * accessed earlier in the transaction. We need to make sure we use the * same connection since it may hold relevant locks or have uncommitted * writes. In that case we "assign" the task to a connection by adding * it to the task queue of specific connection (in * AssignTasksToConnections). Otherwise we consider the task unassigned * and add it to the task queue of a worker pool, which means that it * can be executed over any connection in the pool. * * A task may be executed on multiple placements in case of a reference * table or a replicated distributed table. Depending on the type of * task, it may not be ready to be executed on a worker node immediately. * For instance, INSERTs on a reference table are executed serially across * placements to avoid deadlocks when concurrent INSERTs take conflicting * locks. At the beginning, only the "first" placement is ready to execute * and therefore added to the readyTaskQueue in the pool or connection. * The remaining placements are added to the pendingTaskQueue. Once * execution on the first placement is done the second placement moves * from pendingTaskQueue to readyTaskQueue. The same approach is used to * fail over read-only tasks to another placement. * * Once all the tasks are added to a queue, the main loop in * RunDistributedExecution repeatedly does the following: * * For each pool: * - ManageWorkPool evaluates whether to open additional connections * based on the number unassigned tasks that are ready to execute * and the targetPoolSize of the execution. * * Poll all connections: * - We use a WaitEventSet that contains all (non-failed) connections * and is rebuilt whenever the set of active connections or any of * their wait flags change. 
* * We almost always check for WL_SOCKET_READABLE because a session * can emit notices at any time during execution, but it will only * wake up WaitEventSetWait when there are actual bytes to read. * * We check for WL_SOCKET_WRITEABLE just after sending bytes in case * there is not enough space in the TCP buffer. Since a socket is * almost always writable we also use WL_SOCKET_WRITEABLE as a * mechanism to wake up WaitEventSetWait for non-I/O events, e.g. * when a task moves from pending to ready. * * For each connection that is ready: * - ConnectionStateMachine handles connection establishment and failure * as well as command execution via TransactionStateMachine. * * When a connection is ready to execute a new task, it first checks its * own readyTaskQueue and otherwise takes a task from the worker pool's * readyTaskQueue (on a first-come-first-serve basis). * * In cases where the tasks finish quickly (e.g. <1ms), a single * connection will often be sufficient to finish all tasks. It is * therefore not necessary that all connections are established * successfully or open a transaction (which may be blocked by an * intermediate pgbouncer in transaction pooling mode). It is therefore * essential that we take a task from the queue only after opening a * transaction block. * * When a command on a worker finishes or the connection is lost, we call * PlacementExecutionDone, which then updates the state of the task * based on whether we need to run it on other placements. When a * connection fails or all connections to a worker fail, we also call * PlacementExecutionDone for all queued tasks to try the next placement * and, if necessary, mark shard placements as inactive. If a task fails * to execute on all placements, the execution fails and the distributed * transaction rolls back. 
* * For multi-row INSERTs, tasks are executed sequentially by * SequentialRunDistributedExecution instead of in parallel, which allows * a high degree of concurrency without high risk of deadlocks. * Conversely, multi-row UPDATE/DELETE/DDL commands take aggressive locks * which forbids concurrency, but allows parallelism without high risk * of deadlocks. Note that this is unrelated to SEQUENTIAL_CONNECTION, * which indicates that we should use at most one connection per node, but * can run tasks in parallel across nodes. This is used when there are * writes to a reference table that has foreign keys from a distributed * table. * * Execution finishes when all tasks are done, the query errors out, or * the user cancels the query. * ------------------------------------------------------------------------- / All the commits involved here: * Initial unified executor prototype * Latest changes * Fix rebase conflicts to master branch * Add missing variable for assertion * Ensure that master_modify_multiple_shards() returns the affectedTupleCount * Adjust intermediate result sizes The real-time executor uses COPY command to get the results from the worker nodes. Unified executor avoids that which results in less data transfer. Simply adjust the tests to lower sizes. * Force one connection per placement (or co-located placements) when requested The existing executors (real-time and router) always open 1 connection per placement when parallel execution is requested. That might be useful under certain circumstances: (a) User wants to utilize as much as CPUs on the workers per distributed query (b) User has a transaction block which involves COPY command Also, lots of regression tests rely on this execution semantics. So, we'd enable few of the tests with this change as well. 
* For parameters to be resolved before using them For the details, see PostgreSQL's copyParamList() * Unified executor sorts the returning output * Ensure that unified executor doesn't ignore sequential execution of DDLJob's Certain DDL commands, mainly creating foreign keys to reference tables, should be executed sequentially. Otherwise, we'd end up with a self distributed deadlock. To overcome this situaiton, we set a flag `DDLJob->executeSequentially` and execute it sequentially. Note that we have to do this because the command might not be called within a transaction block, and we cannot call `SetLocalMultiShardModifyModeToSequential()`. This fixes at least two test: multi_insert_select_on_conflit.sql and multi_foreign_key.sql Also, I wouldn't mind scattering local `targetPoolSize` variables within the code. The reason is that we'll soon have a GUC (or a global variable based on a GUC) that'd set the pool size. In that case, we'd simply replace `targetPoolSize` with the global variables. * Fix 2PC conditions for DDL tasks * Improve closing connections that are not fully established in unified execution * Support foreign keys to reference tables in unified executor The idea for supporting foreign keys to reference tables is simple: Keep track of the relation accesses within a transaction block. - If a parallel access happens on a distributed table which has a foreign key to a reference table, one cannot modify the reference table in the same transaction. Otherwise, we're very likely to end-up with a self-distributed deadlock. - If an access to a reference table happens, and then a parallel access to a distributed table (which has a fkey to the reference table) happens, we switch to sequential mode. Unified executor misses the function calls that marks the relation accesses during the execution. Thus, simply add the necessary calls and let the logic kick in. 
* Make sure to close the failed connections after the execution * Improve comments * Fix savepoints in unified executor. * Rebuild the WaitEventSet only when necessary * Unclaim connections on all errors. * Improve failure handling for unified executor - Implement the notion of errorOnAnyFailure. This is similar to Critical Connections that the connection managament APIs provide - If the nodes inside a modifying transaction expand, activate 2PC - Fix few bugs related to wait event sets - Mark placement INACTIVE during the execution as much as possible as opposed to we do in the COMMIT handler - Fix few bugs related to scheduling next placement executions - Improve decision on when to use 2PC Improve the logic to start a transaction block for distributed transactions - Make sure that only reference table modifications are always executed with distributed transactions - Make sure that stored procedures and functions are executed with distributed transactions * Move waitEventSet to DistributedExecution This could also be local to RunDistributedExecution(), but in that case we had to mark it as "volatile" to avoid PG_TRY()/PG_CATCH() issues, and cast it to non-volatile when doing WaitEventSetFree(). We thought that would make code a bit harder to read than making this non-local, so we move it here. See comments for PG_TRY() in postgres/src/include/elog.h and "man 3 siglongjmp" for more context. * Fix multi_insert_select test outputs Two things: 1) One complex transaction block is now supported. Simply update the test output 2) Due to dynamic nature of the unified executor, the orders of the errors coming from the shards might change (e.g., all of the queries on the shards would fail, but which one appears on the error message?). To fix that, we simply added it to our shardId normalization tool which happens just before diff. 
* Fix subquery_and_cte test The error message is updated from: failed to execute task To: more than one row returned by a subquery or an expression which is a lot clearer to the user. * Fix intermediate_results test outputs Simply update the error message from: could not receive query results to result "squares" does not exist which makes a lot more sense. * Fix multi_function_in_join test The error messages are updated from: Failed to execute task XXX To: function f(..) does not exist * Fix multi_query_directory_cleanup test The unified executor does not create any intermediate files. * Fix with_transactions test A test case that just started to work fine * Fix multi_router_planner test outputs The error message is updated from: Could not receive query results To: Relation does not exist which is a lot clearer for the users * Fix multi_router_planner_fast_path test The error message is updated from: Could not receive query results To: Relation does not exist which is a lot clearer for the users * Fix isolation_copy_placement_vs_modification by disabling select_opens_transaction_block * Fix ordering in isolation_multi_shard_modify_vs_all * Add executor locks to unified executor * Make sure to allocate enough WaitEvents The previous code was missing the waitEvents for the latch and postmaster death. * Fix rebase conflicts for master rebase * Make sure that TRUNCATE relies on unified executor * Implement true sequential execution for multi-row INSERTS Execute the individual tasks one by one. Note that this is different than MultiShardConnectionType == SEQUENTIAL_CONNECTION case (e.g., sequential execution mode). In that case, running the tasks across the nodes in parallel is acceptable and implemented in that way. However, the executions that are qualified here would perform poorly if the tasks across the workers are executed in parallel. We currently qualify only one class of distributed queries here, multi-row INSERTs. 
If we do not enforce true sequential execution, concurrent multi-row upserts could easily form a distributed deadlock when the upserts touch the same rows. * Remove SESSION_LIFESPAN flag in unified_executor * Apply failure test updates We've changed the failure behaviour a bit, and also the error messages that show up to the user. This PR covers the majority of the updates. * Unified executor honors citus.node_connection_timeout With this commit, unified executor errors out if even a single connection cannot be established within citus.node_connection_timeout. And, as a side effect this fixes failure_connection_establishment test. * Properly increment/decrement pool size variables Before this commit, the idle and active connection counts were not properly calculated. * insert_select_executor goes through unified executor. * Add missing file for task tracker * Modify ExecuteTaskListExtended()'s signature * Sort output of INSERT ... SELECT ... RETURNING * Take partition locks correctly in unified executor * Alternative implementation for force_max_query_parallelization * Fix compile warnings in unified executor * Fix style issues * Decrement idleConnectionCount when idle connection is lost * Always rebuild the wait event sets In the previous implementation, on waitFlag changes, we were only modifying the wait events. However, we've realized that it might be an over-optimization since (a) we couldn't see any performance benefits (b) we see some errors on failures and because of (a) we prefer to disable it now. * Make sure to allocate enough sized waitEventSet With multi-row INSERTs, we might have more sessions than taskworkerCount after a few calls of RunDistributedExecution() because the previous sessions would also be alive. Instead, re-allocate events when the connection set changes. 
* Implement SELECT FOR UPDATE on reference tables On master branch, we do two extra things on SELECT FOR UPDATE queries on reference tables: - Acquire executor locks - Execute the query on all replicas With this commit, we're implementing the same logic on the new executor. * SELECT FOR UPDATE opens transaction block even if SelectOpensTransactionBlock disabled Otherwise, users would be very confused and their logic is very likely to break. * Fix build error * Fix the newConnectionCount calculation in ManageWorkerPool * Fix rebase conflicts * Fix minor test output differences * Fix citus indent * Remove duplicate sorts that were added with the rebase * Create distributed table via executor * Fix wait flags in CheckConnectionReady * failure_savepoints output for unified executor. * failure_vacuum output (pg 10) for unified executor. * Fix WaitEventSetWait timeout in unified executor * Stabilize failure_truncate test output * Add an ORDER BY to multi_upsert * Fix regression test outputs after rebase to master * Add executor.c comment * Rename executor.c to adaptive_executor.c * Do not schedule tasks if the failed placement is not ready to execute Before this commit, we were blindly scheduling the next placement executions even if the failed placement was not on the ready queue. Now, we're ensuring that if the failed placement execution is on a failed pool or session where the execution is on the pendingQueue, we do not schedule the next task, because the other placement execution should already be running. * Implement a proper custom scan node for adaptive executor - Switch between the executors, add GUC to set the pool size - Add non-adaptive regression test suites - Enable CIRCLE CI for non-adaptive tests - Adjust test output files * Add slow start interval to the executor * Expose max_cached_connection_per_worker to user * Do not start slow when there are cached connections * Consider ExecutorSlowStartInterval in NextEventTimeout * Fix memory issues with ReceiveResults(). 
* Disable executor via TaskExecutorType * Make sure to execute the tests with the other executor * Use task_executor_type to enable-disable adaptive executor * Remove useless code * Adjust the regression tests * Add slow start regression test * Rebase to master * Fix test failures in adaptive executor. * Rebase to master - 2 * Improve comments & debug messages * Set force_max_query_parallelization in isolation_citus_dist_activity * Force max parallelization for creating shards when asked to use exclusive connection. * Adjust the default pool size * Expand description of max_adaptive_executor_pool_size GUC * Update warnings in FinishRemoteTransactionCommit() * Improve session clean up at the end of execution Explicitly list all the states in which the execution might end; otherwise warn. * Remove MULTI_CONNECTION_WAIT_RETRY which is not used at all * Add more ORDER BYs to multi_mx_partitioning	2019-06-28 14:04:40 +02:00
Hanefi Onaldi	7e8fd49b94	Create Schemas as superuser on all shard/table creation UDFs - All the schema creations on the workers will now be via superuser connections - If a shard is being repaired or a shard is replicated, we will create the schema only in the relevant worker; and in all the other cases where a schema creation is needed, we will block operations until we ensure the schema exists in all the workers	2019-06-26 17:12:28 +02:00
Philip Dubé	84fe626378	multi_router_planner: refactor error propagation	2019-06-26 10:32:01 +02:00
Onder Kalaci	f144bb4911	Introduce fast path router planning In this context, we define "Fast Path Planning for SELECT" as trivial queries where Citus can skip relying on the standard_planner() and handle all the planning. For router planner, standard_planner() is mostly important to generate the necessary restriction information. Later, the restriction information generated by the standard_planner is used to decide whether all the shards that a distributed query touches reside on a single worker node. However, standard_planner() does a lot of extra things such as cost estimation and execution path generation which are completely unnecessary in the context of distributed planning. There are certain types of queries where Citus could skip relying on standard_planner() to generate the restriction information. For queries in the following format, Citus does not need any information that the standard_planner() generates: SELECT ... FROM single_table WHERE distribution_key = X; or DELETE FROM single_table WHERE distribution_key = X; or UPDATE single_table SET value_1 = value_2 + 1 WHERE distribution_key = X; Note that the queries need not be as simple as the above: GROUP BY, WINDOW FUNCTIONS, ORDER BY, HAVING etc. are all acceptable. The only rule is that the query is on a single distributed (or reference) table and there is a "distribution_key = X;" in the WHERE clause. With that filter alone, we can decide which shard the distributed query touches, and therefore the worker node on which it resides.	2019-02-21 13:27:01 +03:00
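
The fast path rule in the commit above (a single table plus a "distribution_key = X" filter) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Citus code: `Table`, `Query`, `is_fast_path` and `target_shard` are hypothetical names, reference tables are not modeled, and the CRC32-modulo shard choice is a stand-in for whatever hashing Citus actually uses.

```python
import zlib
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Table:
    """Hypothetical stand-in for a table known to the planner."""
    name: str
    distribution_key: Optional[str]  # None models a table that is not distributed
    shard_count: int = 4


@dataclass
class Query:
    """Hypothetical, heavily simplified query description."""
    tables: List[Table]
    # Top-level, AND-ed "column = constant" filters from the WHERE clause.
    equality_filters: Dict[str, Any] = field(default_factory=dict)


def is_fast_path(query: Query) -> bool:
    """Apply the two rules from the commit message: exactly one distributed
    table, and an equality filter on that table's distribution key."""
    if len(query.tables) != 1:
        return False
    key = query.tables[0].distribution_key
    return key is not None and key in query.equality_filters


def target_shard(query: Query) -> int:
    """Pick a shard index from the distribution key value alone.
    Assumption: CRC32 modulo shard count, chosen only to keep the toy deterministic."""
    if not is_fast_path(query):
        raise ValueError("query needs the full planner")
    table = query.tables[0]
    value = query.equality_filters[table.distribution_key]
    return zlib.crc32(repr(value).encode()) % table.shard_count


articles = Table("articles_hash", distribution_key="author_id")
single = Query([articles], {"author_id": 1})
joined = Query([articles, Table("authors", "id")], {"author_id": 1})
print(is_fast_path(single), is_fast_path(joined))
```

The point of the sketch is that the shard decision needs only the table metadata and the constant in the filter, which is why the restriction information from standard_planner() can be skipped for such queries.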

34 Commits (35d1160ace75b44b0942bc29b7c8678ca84fe728)