708 Commits (07cca85227f8f162cb03ecfff8d9626f23a9af93)

Author SHA1 Message Date
Onder Kalaci 0b0c779c77 Introduce the concept of Local Execution
/*
 * local_executor.c
 *
 * The scope of the local execution is locally executing the queries on the
 * shards. In other words, local execution does not deal with any local tables
 * that are not shards on the node that the query is being executed. In that sense,
 * the local executor is only triggered if the node has both the metadata and the
 * shards (e.g., only Citus MX worker nodes).
 *
 * The goal of the local execution is to skip the unnecessary network round-trip
 * happening on the node itself. Instead, identify the locally executable tasks and
 * simply call PostgreSQL's planner and executor.
 *
 * The local executor is an extension of the adaptive executor. So, the executor uses
 * adaptive executor's custom scan nodes.
 *
 * Note that Citus MX is only supported with replication factor = 1, so
 * keep that in mind while reading the comments below.
 *
 * At a high level, there are 3 slightly different ways of utilizing local execution:
 *
 * (1) Execution of local single shard queries of a distributed table
 *
 *      This is the simplest case. The executor kicks in at the start of the adaptive
 *      executor, and since the query is only a single task the execution finishes
 *      without going to the network at all.
 *
 *      Even if there is a transaction block (or recursively planned CTEs), as long
 *      as the queries hit the shards on the same node, the local execution will kick in.
 *
 * (2) Execution of local single queries and remote multi-shard queries
 *
 *      The rule is simple. If a transaction block starts with a local query execution,
 *      all the other queries in the same transaction block that touch any local shard
 *      have to use the local execution. Although this sounds restrictive, we prefer to
 *      implement it this way; otherwise we'd end up with scenarios as complex as the
 *      ones we have in connection management due to foreign keys.
 *
 *      See the following example:
 *      BEGIN;
 *          -- assume that the query is executed locally
 *          SELECT count(*) FROM test WHERE key = 1;
 *
 *          -- at this point, all the shards that reside on the
 *          -- node are executed locally one-by-one. After those finish,
 *          -- the remaining tasks are handled by the adaptive executor
 *          SELECT count(*) FROM test;
 *
 *
 * (3) Modifications of reference tables
 *
 *		Modifications to reference tables have to be executed on all nodes. So, after the
 *		local execution, the adaptive executor keeps continuing the execution on the other
 *		nodes.
 *
 *		Note that for read-only queries, after the local execution, there is no need to
 *		kick in adaptive executor.
 *
 *  There are also a few limitations/trade-offs that are worth mentioning. First, the
 *  local execution on multiple shards might be slow because the execution has to
 *  happen one task at a time (e.g., no parallelism). Second, if a transaction
 *  block/CTE starts with a multi-shard command, we do not use local query execution
 *  since local execution is sequential. Basically, we do not want to lose parallelism
 *  across local tasks by switching to local execution. Third, the local execution
 *  currently only supports queries. In other words, any utility command, such as
 *  TRUNCATE, fails if it is executed after a local execution inside a transaction block.
 *  Fourth, the local execution cannot be mixed with the executors other than adaptive,
 *  namely task-tracker, real-time and router executors. Finally, related to the
 *  previous item, the COPY command cannot be mixed with local execution in a transaction.
 *  The implication is that any part of INSERT..SELECT via coordinator cannot happen
 *  via the local execution.
 */
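
The second case above can be sketched roughly as follows. This is a simplified Python model rather than Citus code: `Task`, `split_tasks`, `execute` and the node names are hypothetical, and it assumes each shard has exactly one placement (replication factor = 1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    shard_id: int
    node: str  # node that holds the single placement of the shard


def split_tasks(tasks, local_node):
    """Partition tasks into those runnable locally and those needing the network."""
    local = [t for t in tasks if t.node == local_node]
    remote = [t for t in tasks if t.node != local_node]
    return local, remote


def execute(tasks, local_node, run_locally, run_remotely):
    """Run local tasks one at a time, then hand the rest to a remote executor."""
    local, remote = split_tasks(tasks, local_node)
    results = [run_locally(t) for t in local]   # sequential, no network round-trip
    if remote:
        results.extend(run_remotely(remote))    # stands in for the adaptive executor
    return results
```

For a single-shard query on a local shard, `remote` is empty and the network is never touched, which matches case (1).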
2019-09-12 11:51:25 +02:00
SaitTalhaNisanci d99deab7d9
Add upgrade postgres version test (#2940)
* Add creating a citus cluster script

Creating a citus cluster is automated.
Before running this script:
- Citus should be installed and its control file should be added to postgres. (make install)
- Postgres should be installed.

* Initialize upgrade test table and fill

* Finalize the layout of upgrade tests

Postgres upgrade function is added.
The newly added UDFs (citus_prepare_pg_upgrade, citus_finish_pg_upgrade) are used to
perform the upgrade.

* Refactor upgrade test and add config file

* Add schedules for upgrade testing

* Use pg_regress for upgrade tests

pg_regress is used for creating a simple distributed table in
upgrade tests. After upgrading another schedule is used to verify
that the distributed table exists. Router and realtime queries are
used for verifying.

* Run upgrade tests as a postgres user in a temp dir

postgres user is used for psql to be consistent at running tests.
A temp dir is created and the temp dir's permissions are changed so
that postgres user can access it. All psql commands are now run with
postgres user.

The "Select * from t" query is changed to "Select * from t order by a"
so that the result is always in the same order.

* Add docopt and arguments for the upgrade script

Docopt dependency is added to parse flags in script.
Some refactoring in variable names is done.

* Add readme for upgrade tests

* Refactor upgrade tests

Use relative data path instead of absolute assuming that this script will
always be run from 'src/test/regress'
Remove 'citus-path' flag
Use specific version for docopt instead of *
Use named args in string formatting

* Resolve a security problem

Instead of using string formatting in subprocess.call, an argument
list is used. Otherwise users could perform shell injection.
shell=True is removed from the subprocess call as its use is not
recommended.
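
A minimal sketch of the difference, not taken from the upgrade script itself. It uses `subprocess.run` rather than `subprocess.call`, and `run_command` and `user_value` are hypothetical names:

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as an argument list, without involving a shell."""
    # With a list and the default shell=False, each element is passed to the
    # program as-is, so shell metacharacters in user input are not interpreted.
    return subprocess.run(args, check=True, capture_output=True, text=True)


# Hypothetical untrusted input that would be dangerous under shell=True.
user_value = "demo; echo injected"
result = run_command(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", user_value]
)
print(result.stdout.strip())  # the value arrives as one literal argument
```

Had the same value been interpolated into a single command string and run with `shell=True`, the shell would have treated everything after the `;` as a second command.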

* Add how the test works to readme

* Refactor some variables to be consistent

* Update upgrade script based on the reviews

It was possible that the postgres server would stay running even when the script
crashes. The atexit library is used to ensure that we always do a teardown where we stop
the databases.
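
The teardown pattern looks roughly like this. It is only a sketch: `Cluster` is a hypothetical stand-in for the databases the script starts, not the script's actual class.

```python
import atexit


class Cluster:
    """Hypothetical stand-in for the databases started by the test script."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        # Safe to call more than once.
        self.running = False


cluster = Cluster()
# Register the teardown before starting, so it also runs if a later step raises.
atexit.register(cluster.stop)
cluster.start()
```

Note that atexit handlers run on normal interpreter exit and after an unhandled exception, but not when the process is killed by a signal that Python does not handle or when `os._exit()` is called.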

Some formatting is done in the code for better readability.

Config class is used instead of a dictionary.

A target for upgrade test is added to makefile.

Unused flags/functions/variables are removed.

* Format commands and remove unnecessary flag from readme
2019-09-10 17:56:04 +03:00
Philip Dubé b301cf628a Test worker_cleanup_job_schema_cache actually drops schemas 2019-09-05 16:52:24 +00:00
Philip Dubé 8979fd038b worker_check_invalid_arguments: invalid task/job ids 2019-09-05 16:52:24 +00:00
Philip Dubé 5f9e88b260 multi_multiuser: test that worker_merge_files_and_query doesn't allow privilege escalation 2019-09-05 16:52:24 +00:00
Philip Dubé bdd30bb181 Don't allow distributing by a generated column 2019-09-04 14:50:17 +00:00
Nils Dijk 936d546a3c
Refactor Ensure Schema Exists to Ensure Dependencies Exists (#2882)
DESCRIPTION: Refactor ensure schema exists to dependency exists

Historically we only supported schemas as table dependencies to be created on the workers before a table gets distributed. This PR puts infrastructure in place to walk pg_depend to figure out which dependencies to create on the workers. Currently only schemas are supported as objects to create before creating a table.

We also keep track of dependencies that have been created in the cluster. When we add a new node to the cluster we use this catalog to know which objects need to be created on the worker.

A side effect of knowing which objects are already distributed is that we no longer have debug messages when creating schemas that are already created on the workers.
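
The idea of walking dependencies and skipping what is already distributed can be sketched as below. This is an illustration only: `dependency_creation_order` is a hypothetical name, the `depends_on` mapping stands in for rows read from pg_depend, and the object names are made up (the PR itself only creates schemas).

```python
def dependency_creation_order(target, depends_on, already_distributed=frozenset()):
    """Return the objects `target` depends on, dependencies first.

    Objects already known to exist on the workers are skipped, together with
    anything that is reachable only through them.
    """
    ordered = []
    visited = set(already_distributed)

    def visit(obj):
        for dep in depends_on.get(obj, ()):
            if dep in visited:
                continue
            visited.add(dep)     # mark first, so cycles cannot recurse forever
            visit(dep)           # create whatever `dep` itself needs first
            ordered.append(dep)

    visit(target)
    return ordered


# Hypothetical dependency data: a table in schema s that also uses object x.
deps = {"table t": ["schema s", "object x"], "object x": ["schema s"]}
print(dependency_creation_order("table t", deps))
print(dependency_creation_order("table t", deps, {"schema s"}))
```

The second call models adding a table when the schema is already recorded as distributed, so nothing is re-created for it.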
2019-09-04 14:10:20 +02:00
Philip Dubé da00c62eea create_distributed_table: include COLLATE on columns 2019-08-29 14:22:54 +00:00
Matthias Kurz fc069dc611 Test SET LOCAL propagation when GUC is used in RLS policy 2019-08-22 20:29:52 +00:00
Philip Dubé 6b0d8ed83d SortList in FinalizedShardPlacementList, makes 3 failure tests consistent between 11/12 2019-08-22 19:30:56 +00:00
Philip Dubé 693d4695d7 Create a test 'pg12' for pg12 features & error on unsupported new features
Unsupported new features: COPY FROM WHERE, GENERATED ALWAYS AS, non-heap table access methods
2019-08-22 19:30:56 +00:00
Philip Dubé e84fcc0b12 Modify tests to be consistent between versions
Normalize
UNION to prevent optimization
Remove WITH OIDS
Sort ddl events
client_min_messages no longer accepts FATAL
2019-08-22 19:30:50 +00:00
Hadi Moshayedi a5b087c89b Support FKs between reference tables 2019-08-21 16:11:27 -07:00
Philip Dubé f4b90419ae Raise an error when REINDEX TABLE or INDEX is invoked on a distributed relation 2019-08-21 17:03:14 +00:00
Philip Dubé f62d4a6712 citus_rm_job_directory for multi_query_directory_cleanup 2019-08-19 17:04:42 +00:00
Philip Dubé 9777f22e1e Avoid invalid array accesses to partitionFileArray 2019-08-19 17:04:42 +00:00
Philip Dubé cd951fa9ca Avoid multiple pg_dist_colocation records being created for reference tables
master_deactivate_node is updated to decrement the replication factor
Otherwise deactivation could have create_reference_table produce a second record

UpdateColocationGroupReplicationFactor is renamed UpdateColocationGroupReplicationFactorForReferenceTables
& the implementation looks up the record based on distributioncolumntype == InvalidOid, rather than by id
Otherwise the record's replication factor fails to be maintained when there are no reference tables
2019-08-13 17:21:02 +00:00
Nils Dijk be6b7bec69
Add UDF citus_(prepare|finish)_pg_upgrade to aid with upgrading citus (#2877)
DESCRIPTION: Add functions to help with postgres upgrades

Currently there is [a list of manual steps](https://docs.citusdata.com/en/v8.2/admin_guide/upgrading_citus.html?highlight=upgrade#upgrading-postgresql-version-from-10-to-11) to perform during a postgres upgrade. These steps guarantee our catalog tables are kept and counter values are maintained across upgrades.

Having more than 1 command in our docs for users to manually execute during upgrades is error prone for both the user, and our docs. There are already 2 catalog tables that have been introduced to citus that have not been added to our docs for backing up during upgrades (`pg_authinfo` and `pg_dist_poolinfo`).

As we add more functionality to citus we run into situations where there are more steps required either before or after the upgrade. At the same time, when we move catalog tables to a place where the contents will be maintained automatically during upgrades we could have fewer steps in our docs. This will lead to a hard-to-maintain matrix of citus versions and steps to be performed.

Instead we could take ownership of these steps within the extension itself. This PR introduces two new functions for the user to use instead of long lists of error prone instructions to follow.
 - `citus_prepare_pg_upgrade`
    This function should be called by the user right before shutting down the cluster. This will ensure all citus catalog tables are backed up in a location where the information will be retained during an upgrade.
- `citus_finish_pg_upgrade`
    This function should be called right after a pg_upgrade of the cluster. This will restore the catalog tables to the state before the upgrade happened.

Both functions need to be executed both on the coordinator and on all the workers, in the same fashion our current documentation instructs to do.
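
As a rough illustration of the per-node order described above, the following sketch builds a plan without executing anything. The node names and `upgrade_plan` are hypothetical, and calling the functions with no arguments via `SELECT` is an assumption not spelled out in this message.

```python
# Assumed invocation form; the exact signatures are not given in this message.
PREPARE = "SELECT citus_prepare_pg_upgrade();"
FINISH = "SELECT citus_finish_pg_upgrade();"


def upgrade_plan(nodes):
    """Return (node, kind, action) steps; every node gets the same three steps."""
    plan = []
    for node in nodes:
        plan.append((node, "sql", PREPARE))      # right before shutting down
        plan.append((node, "manual", "shut down, run pg_upgrade, start new server"))
        plan.append((node, "sql", FINISH))       # right after pg_upgrade
    return plan


for step in upgrade_plan(["coordinator", "worker-1", "worker-2"]):
    print(step)
```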

There are two known problems with this function in its current form, which is also a problem with our docs. We should schedule time in the future to improve on this, but having it automated now is better as we are about to add extra steps to take after upgrades.
 - When you install citus in a clean cluster we do enable ssl for communication between the coordinator and the workers. If an upgrade to a clean cluster is performed we do not setup ssl on the new cluster causing the communication to fail.
 - There are no automated tests added in this PR to execute an upgrade test during every build.
    Our current test infrastructure does not allow for 2 versions of postgres to exist in the same environment. We will need to invest time to create a new testing harness that could run the following scenario:
      1. Create cluster
      2. Run extensible scripts to execute arbitrary statements on this cluster
      3. Perform an upgrade by preparing, upgrading and finishing
      4. Run extensible scripts to verify all objects created by earlier scripts exists in correct form in the upgraded cluster

    Given the non-trivial amount of work involved for such a suite I'd like to land this before we have automated testing.

On a side note: as the reviewer noticed, the tables created in the public namespace are not visible in `psql` with `\d`. The backup catalog tables have the same name as the tables in `pg_catalog`. Due to postgres internals `pg_catalog` is first in the search path and therefore the non-qualified name would always resolve to `pg_catalog.pg_dist_*`. Internally this is called a non-visible table as it would resolve to a different table without a qualified name. Only visible tables are shown with `\d`.
2019-08-13 15:53:10 +02:00
Philip Dubé 5459c01956 multi_partitioning_utils: version_above_ten 2019-08-09 15:25:59 +00:00
Philip Dubé 5e835e7565 Fix multi_repair_shards. There's already a group/shardid entry, pg11 gives us back the inserted one, pg12 gives us the preexisting one 2019-08-09 15:25:59 +00:00
Philip Dubé 66ce2d2d2d Materialize c1 to keep subplan ids in sync 2019-08-09 15:25:59 +00:00
Philip Dubé 9065ef429c foreign_key_to_reference_table: terse to avoid differing order of drop cascade details 2019-08-09 15:25:59 +00:00
Philip Dubé 0d9e5bde9c window_functions: 'ORDER BY time' when using lag(time) & coordinator_plan 2019-08-09 15:25:59 +00:00
Philip Dubé 7992077fd9 multi_modifying_xacts: don't differ in output if reference table select tries broken worker first 2019-08-09 15:25:59 +00:00
Philip Dubé 546b71ac18 multi_router_planner: be terse for ctes with false wheres 2019-08-09 15:25:59 +00:00
Philip Dubé a523a5b773 multi_null_minmax_value_pruning: no versioning & coordinator_plan 2019-08-09 15:25:59 +00:00
Philip Dubé 871dabdc63 Force CTE materialization in pg12 2019-08-09 15:25:59 +00:00
Philip Dubé 667c67891e intermediate_results: COSTS OFF 2019-08-09 15:25:59 +00:00
Onder Kalaci 060ac11476 Do not record relation accesses unnecessarily
Before this commit, we've recorded the relation accesses in 3 different
places
    - FindPlacementListConnection         -- applies to all executors in a tx block
    - StartPlacementExecutionOnSession()  -- adaptive executor only
    - StartPlacementListConnection()      -- router/real-time only

This is different from Citus 8.2, and could considerably increase query
execution times for multi-shard commands in transaction blocks
that are on partitioned tables.

Benchmarks:

```
1+8 c5.4xlarge cluster

Empty distributed partitioned table with 365 partitions: https://gist.github.com/onderkalaci/1edace4ed6bd6f061c8a15594865bb51#file-partitions_365-sql

./pgbench -f /tmp/multi_shard.sql -c10 -j10 -P 1 -T 120 postgres://citus:w3r6KLJpv3mxe9E-NIUeJw@c.fy5fkjcv45vcepaogqcaskmmkee.db.citusdata.com:5432/citus?sslmode=require

cat  /tmp/multi_shard.sql
BEGIN;
	DELETE FROM collections_list;
	DELETE FROM collections_list;
	DELETE FROM collections_list;
COMMIT;
cat  /tmp/single_shard.sql
BEGIN;
	DELETE FROM collections_list WHERE key = :aid;
	DELETE FROM collections_list WHERE key = :aid;
	DELETE FROM collections_list WHERE key = :aid;
COMMIT;

cat  /tmp/mix.sql
BEGIN;
	DELETE FROM collections_list WHERE key = :aid;
	DELETE FROM collections_list WHERE key = :aid;
	DELETE FROM collections_list WHERE key = :aid;

	DELETE FROM collections_list;
	DELETE FROM collections_list;
	DELETE FROM collections_list;
COMMIT;
```

The table shows `latency average` of pgbench runs explained above, so we have a pretty solid improvement even over 8.2.2.

| Test | Citus 8.2.2 | Citus 8.3.1 | Citus 8.3.2 (this branch) | Citus 8.3.1 (FKEYs disabled via GUC) |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| multi_shard | 2370.083 ms | 3605.040 ms | 1324.094 ms | 1247.255 ms |
| single_shard | 85.338 ms | 120.934 ms | 73.216 ms | 78.765 ms |
| mix | 2434.459 ms | 3727.080 ms | 1306.456 ms | 1280.326 ms |
2019-08-08 18:42:08 +02:00
Hadi Moshayedi b1ab805ce2 Fix a typo in foreign_key_restriction_enforcement 2019-08-02 16:06:52 -07:00
Philip Dubé 19bcb1b4f7 multi_modifications: extend to demonstrate issue in adaptive executor 2019-08-01 23:55:04 +00:00
Philip Dubé 0e233c63a3 multi_colocation_utils: sort by nodeport, not placementid
multi_copy: replace smgr with aclitem, smgr is removed in pg12
2019-07-25 14:33:43 +00:00
Philip Dubé acbaa38a62 Squash migrations for versions 5/6, don't use WITH OIDS 2019-07-24 11:03:29 -07:00
Philip Dubé 6598c68993 Fix multi_prune_shard_list & don't set next_shard_id unnecessarily in multi_null_minmax_value_pruning 2019-07-23 19:44:18 +00:00
Marco Slot efbe58eab2 Fix SQL schema version, we skipped 8.3 2019-07-17 16:05:25 +02:00
Philip Dubé befd0caddd Tests: normalize sql_procedure and custom_aggregate_support
Also fix typo in multi_insert_select
2019-07-10 14:36:17 +00:00
Nils Dijk 791cc26a86
Fix an issue with subquery map merge jobs as non-root
Also automated all manual tests around multi user isolation for internal citus udf's

automate upgrade_to_reference_table tests
add negative tests for lock_relation_if_exists
add tests for permissions on worker_cleanup_job_schema_cache
add tests for worker_fetch_partition_file
add tests for worker_merge_files_into_table
fix problem with worker_merge_files_and_run_query when run as non-super user and add tests for behaviour
2019-07-10 12:40:05 +02:00
Hadi Moshayedi 46608e42f9 Add hyperscale tutorial to the regression tests. 2019-07-10 10:47:55 +02:00
Marco Slot 70434bc716 Increase slow start time in test to make valgrind tests pass 2019-07-08 06:04:13 +02:00
Marco Slot 07d2266e11 Fix RESET and other types of SET 2019-07-05 19:30:48 +02:00
Hadi Moshayedi d233887d68 Fix multi_extension in check-multi-vg 2019-07-04 13:03:46 +02:00
Marco Slot d6c667946c Fix citus_executor_name mapping by reimplementing it in C 2019-06-29 22:38:29 +02:00
Önder Kalacı 40da78c6fd
Introduce the adaptive executor (#2798)
With this commit, we're introducing the Adaptive Executor. 


The commit message consists of two distinct sections. The first part explains
how the executor works. The second part consists of the commit messages of
the individual smaller commits that resulted in this commit. Readers
can search for each of the smaller commit messages on
https://github.com/citusdata/citus to learn more about the history
of the change.

/*-------------------------------------------------------------------------
 *
 * adaptive_executor.c
 *
 * The adaptive executor executes a list of tasks (queries on shards) over
 * a connection pool per worker node. The results of the queries, if any,
 * are written to a tuple store.
 *
 * The concepts in the executor are modelled in a set of structs:
 *
 * - DistributedExecution:
 *     Execution of a Task list over a set of WorkerPools.
 * - WorkerPool
 *     Pool of WorkerSessions for the same worker which opportunistically
 *     executes "unassigned" tasks from a queue.
 * - WorkerSession:
 *     Connection to a worker that is used to execute "assigned" tasks
 *     from a queue and may execute unassigned tasks from the WorkerPool.
 * - ShardCommandExecution:
 *     Execution of a Task across a list of placements.
 * - TaskPlacementExecution:
 *     Execution of a Task on a specific placement.
 *     Used in the WorkerPool and WorkerSession queues.
 *
 * Every connection pool (WorkerPool) and every connection (WorkerSession)
 * have a queue of tasks that are ready to execute (readyTaskQueue) and a
 * queue/set of pending tasks that may become ready later in the execution
 * (pendingTaskQueue). The tasks are wrapped in a ShardCommandExecution,
 * which keeps track of the state of execution and is referenced from a
 * TaskPlacementExecution, which is the data structure that is actually
 * added to the queues and describes the state of the execution of a task
 * on a particular worker node.
 *
 * When the task list is part of a bigger distributed transaction, the
 * shards that are accessed or modified by the task may have already been
 * accessed earlier in the transaction. We need to make sure we use the
 * same connection since it may hold relevant locks or have uncommitted
 * writes. In that case we "assign" the task to a connection by adding
 * it to the task queue of specific connection (in
 * AssignTasksToConnections). Otherwise we consider the task unassigned
 * and add it to the task queue of a worker pool, which means that it
 * can be executed over any connection in the pool.
 *
 * A task may be executed on multiple placements in case of a reference
 * table or a replicated distributed table. Depending on the type of
 * task, it may not be ready to be executed on a worker node immediately.
 * For instance, INSERTs on a reference table are executed serially across
 * placements to avoid deadlocks when concurrent INSERTs take conflicting
 * locks. At the beginning, only the "first" placement is ready to execute
 * and therefore added to the readyTaskQueue in the pool or connection.
 * The remaining placements are added to the pendingTaskQueue. Once
 * execution on the first placement is done the second placement moves
 * from pendingTaskQueue to readyTaskQueue. The same approach is used to
 * fail over read-only tasks to another placement.
 *
 * Once all the tasks are added to a queue, the main loop in
 * RunDistributedExecution repeatedly does the following:
 *
 * For each pool:
 * - ManageWorkPool evaluates whether to open additional connections
 *   based on the number of unassigned tasks that are ready to execute
 *   and the targetPoolSize of the execution.
 *
 * Poll all connections:
 * - We use a WaitEventSet that contains all (non-failed) connections
 *   and is rebuilt whenever the set of active connections or any of
 *   their wait flags change.
 *
 *   We almost always check for WL_SOCKET_READABLE because a session
 *   can emit notices at any time during execution, but it will only
 *   wake up WaitEventSetWait when there are actual bytes to read.
 *
 *   We check for WL_SOCKET_WRITEABLE just after sending bytes in case
 *   there is not enough space in the TCP buffer. Since a socket is
 *   almost always writable we also use WL_SOCKET_WRITEABLE as a
 *   mechanism to wake up WaitEventSetWait for non-I/O events, e.g.
 *   when a task moves from pending to ready.
 *
 * For each connection that is ready:
 * - ConnectionStateMachine handles connection establishment and failure
 *   as well as command execution via TransactionStateMachine.
 *
 * When a connection is ready to execute a new task, it first checks its
 * own readyTaskQueue and otherwise takes a task from the worker pool's
 * readyTaskQueue (on a first-come-first-serve basis).
 *
 * In cases where the tasks finish quickly (e.g. <1ms), a single
 * connection will often be sufficient to finish all tasks. It is
 * therefore not necessary that all connections are established
 * successfully or open a transaction (which may be blocked by an
 * intermediate pgbouncer in transaction pooling mode). It is therefore
 * essential that we take a task from the queue only after opening a
 * transaction block.
 *
 * When a command on a worker finishes or the connection is lost, we call
 * PlacementExecutionDone, which then updates the state of the task
 * based on whether we need to run it on other placements. When a
 * connection fails or all connections to a worker fail, we also call
 * PlacementExecutionDone for all queued tasks to try the next placement
 * and, if necessary, mark shard placements as inactive. If a task fails
 * to execute on all placements, the execution fails and the distributed
 * transaction rolls back.
 *
 * For multi-row INSERTs, tasks are executed sequentially by
 * SequentialRunDistributedExecution instead of in parallel, which allows
 * a high degree of concurrency without high risk of deadlocks.
 * Conversely, multi-row UPDATE/DELETE/DDL commands take aggressive locks
 * which forbids concurrency, but allows parallelism without high risk
 * of deadlocks. Note that this is unrelated to SEQUENTIAL_CONNECTION,
 * which indicates that we should use at most one connection per node, but
 * can run tasks in parallel across nodes. This is used when there are
 * writes to a reference table that has foreign keys from a distributed
 * table.
 *
 * Execution finishes when all tasks are done, the query errors out, or
 * the user cancels the query.
 *
 *-------------------------------------------------------------------------
 */
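
The queueing rules in the comment above can be modelled in a few lines. This is a deliberately simplified Python sketch, not the C implementation: the class names mirror the structs, but the methods (`next_task`, `start_next`, `placement_done`) and the worker names are hypothetical, and connection handling, state machines and failure marking are left out.

```python
from collections import deque


class WorkerPool:
    """Per-worker pool: holds unassigned tasks that any of its sessions may take."""

    def __init__(self):
        self.ready = deque()


class WorkerSession:
    """One connection: holds tasks that were assigned to this connection."""

    def __init__(self, pool):
        self.pool = pool
        self.ready = deque()

    def next_task(self):
        # Assigned tasks first, then first-come-first-served from the pool.
        if self.ready:
            return self.ready.popleft()
        if self.pool.ready:
            return self.pool.ready.popleft()
        return None


class ShardCommandExecution:
    """Tracks one task across its placements, run one placement after another."""

    def __init__(self, task, placements):
        self.task = task
        self.pending = deque(placements)   # workers whose placement has not started

    def start_next(self, pools):
        """Move the next pending placement to its worker pool's ready queue."""
        if not self.pending:
            return False
        worker = self.pending.popleft()
        pools[worker].ready.append(self.task)
        return True

    def placement_done(self, pools):
        # Only when one placement finishes does the next one become ready.
        return self.start_next(pools)
```

The same pending-to-ready step serves both uses mentioned in the comment: serial INSERTs across reference table placements, and failing a read-only task over to another placement.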



All the commits involved here:
* Initial unified executor prototype

* Latest changes

* Fix rebase conflicts to master branch

* Add missing variable for assertion

* Ensure that master_modify_multiple_shards() returns the affectedTupleCount

* Adjust intermediate result sizes

The real-time executor uses COPY command to get the results
from the worker nodes. Unified executor avoids that which
results in less data transfer. Simply adjust the tests to lower
sizes.

* Force one connection per placement (or co-located placements) when requested

The existing executors (real-time and router) always open 1 connection per
placement when parallel execution is requested.

That might be useful under certain circumstances:

(a) User wants to utilize as many CPUs as possible on the workers per
distributed query
(b) User has a transaction block which involves COPY command

Also, lots of regression tests rely on these execution semantics.
So, we'd enable a few of the tests with this change as well.

* Force parameters to be resolved before using them

For the details, see PostgreSQL's copyParamList()

* Unified executor sorts the returning output

* Ensure that unified executor doesn't ignore sequential execution of DDLJob's

Certain DDL commands, mainly creating foreign keys to reference tables,
should be executed sequentially. Otherwise, we'd end up with a self
distributed deadlock.

To overcome this situation, we set a flag `DDLJob->executeSequentially`
and execute it sequentially. Note that we have to do this because
the command might not be called within a transaction block, and
we cannot call `SetLocalMultiShardModifyModeToSequential()`.

This fixes at least two tests: multi_insert_select_on_conflit.sql and
multi_foreign_key.sql

Also, I wouldn't mind scattering local `targetPoolSize` variables within
the code. The reason is that we'll soon have a GUC (or a global
variable based on a GUC) that'd set the pool size. In that case, we'd
simply replace `targetPoolSize` with the global variables.

* Fix 2PC conditions for DDL tasks

* Improve closing connections that are not fully established in unified execution

* Support foreign keys to reference tables in unified executor

The idea for supporting foreign keys to reference tables is simple:
Keep track of the relation accesses within a transaction block.
    - If a parallel access happens on a distributed table which
      has a foreign key to a reference table, one cannot modify
      the reference table in the same transaction. Otherwise,
      we're very likely to end-up with a self-distributed deadlock.
    - If an access to a reference table happens, and then a parallel
      access to a distributed table (which has a fkey to the reference
      table) happens, we switch to sequential mode.

Unified executor misses the function calls that mark the relation
accesses during the execution. Thus, simply add the necessary calls
and let the logic kick in.
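
The two rules above can be illustrated with a small model. This is a sketch under simplifying assumptions, not the actual relation access tracking code: `RelationAccessTracker`, its methods and the table names are hypothetical, and it does not distinguish between SELECT, DML and DDL accesses.

```python
class RelationAccessTracker:
    """Simplified per-transaction record of relation accesses.

    `fkeys` maps a distributed table to the reference tables it has
    foreign keys to.
    """

    def __init__(self, fkeys):
        self.fkeys = fkeys
        self.parallel_accessed = set()    # distributed tables accessed in parallel
        self.accessed_reference = set()   # reference tables accessed so far
        self.sequential_mode = False

    def access_distributed(self, table, parallel=True):
        if self.sequential_mode:
            parallel = False
        if parallel and set(self.fkeys.get(table, ())) & self.accessed_reference:
            # Second rule: a referenced table was accessed earlier, go sequential.
            self.sequential_mode = True
            parallel = False
        if parallel:
            self.parallel_accessed.add(table)
        return "parallel" if parallel else "sequential"

    def access_reference(self, ref_table, modify=False):
        if modify:
            # First rule: refuse if a referencing table was accessed in parallel.
            for table in self.parallel_accessed:
                if ref_table in self.fkeys.get(table, ()):
                    raise RuntimeError(
                        f"cannot modify {ref_table} after parallel access to {table}"
                    )
        self.accessed_reference.add(ref_table)
```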

* Make sure to close the failed connections after the execution

* Improve comments

* Fix savepoints in unified executor.

* Rebuild the WaitEventSet only when necessary

* Unclaim connections on all errors.

* Improve failure handling for unified executor

   - Implement the notion of errorOnAnyFailure. This is similar to
     Critical Connections that the connection management APIs provide
   - If the nodes inside a modifying transaction expand, activate 2PC
   - Fix a few bugs related to wait event sets
   - Mark placement INACTIVE during the execution as much as possible
     as opposed to doing so in the COMMIT handler
   - Fix a few bugs related to scheduling next placement executions
   - Improve decision on when to use 2PC

Improve the logic to start a transaction block for distributed transactions

- Make sure that only reference table modifications are always
  executed with distributed transactions
- Make sure that stored procedures and functions are executed
  with distributed transactions

* Move waitEventSet to DistributedExecution

This could also be local to RunDistributedExecution(), but in that case
we had to mark it as "volatile" to avoid PG_TRY()/PG_CATCH() issues, and
cast it to non-volatile when doing WaitEventSetFree(). We thought that
would make code a bit harder to read than making this non-local, so we
move it here. See comments for PG_TRY() in postgres/src/include/elog.h
and "man 3 siglongjmp" for more context.

* Fix multi_insert_select test outputs

Two things:
   1) One complex transaction block is now supported. Simply update
      the test output
   2) Due to dynamic nature of the unified executor, the orders of
      the errors coming from the shards might change (e.g., all of
      the queries on the shards would fail, but which one appears
      on the error message?). To fix that, we simply added it to
      our shardId normalization tool which happens just before diff.
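
To show why normalization makes the output stable, here is a small sketch. It is not the project's normalization tool, whose patterns are not shown in this message: `normalize_shard_ids`, the regular expression and the sample error text are all assumptions.

```python
import re

# Assumption: shard names end in an underscore followed by a numeric shard ID.
SHARD_SUFFIX = re.compile(r"_(\d{6,})\b")


def normalize_shard_ids(line):
    """Replace numeric shard ID suffixes with a fixed placeholder."""
    return SHARD_SUFFIX.sub("_xxxxxx", line)


# Whichever shard reports the error first, the normalized line is the same.
first = normalize_shard_ids('ERROR: relation "test_102008" is broken')
second = normalize_shard_ids('ERROR: relation "test_102011" is broken')
print(first == second)
```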

* Fix subquery_and_cte test

The error message is updated from:
	failed to execute task
To:
        more than one row returned by a subquery or an expression

which is a lot clearer to the user.

* Fix intermediate_results test outputs

Simply update the error message from:
	could not receive query results
to
	result "squares" does not exist

which makes a lot more sense.

* Fix multi_function_in_join test

The error message is updated from:
     Failed to execute task XXX
To:
     function f(..) does not exist

* Fix multi_query_directory_cleanup test

The unified executor does not create any intermediate files.

* Fix with_transactions test

A test case that just started to work fine

* Fix multi_router_planner test outputs

The error message is updated from:
	Could not receive query results
To:
	Relation does not exists

which is a lot clearer for the users

* Fix multi_router_planner_fast_path test

The error message is updated from:
	Could not receive query results
To:
	Relation does not exists

which is a lot clearer for the users

* Fix isolation_copy_placement_vs_modification by disabling select_opens_transaction_block

* Fix ordering in isolation_multi_shard_modify_vs_all

* Add executor locks to unified executor

* Make sure to allocate enough WaitEvents

The previous code was missing the waitEvents for the latch and
postmaster death.

* Fix rebase conflicts for master rebase

* Make sure that TRUNCATE relies on unified executor

* Implement true sequential execution for multi-row INSERTS

Execute the individual tasks one by one. Note that this is different from
MultiShardConnectionType == SEQUENTIAL_CONNECTION case (e.g., sequential execution
mode). In that case, running the tasks across the nodes in parallel is acceptable
and implemented in that way.

However, the executions that are qualified here would perform poorly if the
tasks across the workers are executed in parallel. We currently qualify only
one class of distributed queries here, multi-row INSERTs. If we do not enforce
true sequential execution, concurrent multi-row upserts could easily form
a distributed deadlock when the upserts touch the same rows.
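
The difference between the two modes can be pictured by grouping tasks into "waves" that may run at the same time. This is only a rough model: the real executor is not synchronized in waves, and `execution_waves`, the mode strings and the task names are hypothetical.

```python
from itertools import zip_longest


def execution_waves(tasks, mode):
    """Group (node, task) pairs into waves; tasks in one wave may run concurrently."""
    if mode == "true_sequential":
        # One task at a time, regardless of which node it targets.
        return [[t] for t in tasks]
    if mode == "sequential_connection":
        # At most one task per node at a time, but nodes proceed in parallel.
        per_node = {}
        for node, name in tasks:
            per_node.setdefault(node, []).append((node, name))
        return [
            [t for t in wave if t is not None]
            for wave in zip_longest(*per_node.values())
        ]
    raise ValueError(f"unknown mode: {mode}")


tasks = [("w1", "a"), ("w1", "b"), ("w2", "c")]
print(execution_waves(tasks, "true_sequential"))
print(execution_waves(tasks, "sequential_connection"))
```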

* Remove SESSION_LIFESPAN flag in unified_executor

* Apply failure test updates

We've changed the failure behaviour a bit, and also the error messages
that show up to the user. This PR covers the majority of the updates.

* Unified executor honors citus.node_connection_timeout

With this commit, unified executor errors out if even
a single connection cannot be established within
citus.node_connection_timeout.

As a side effect, this fixes the failure_connection_establishment
test.
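
The timeout check can be sketched roughly as below. This is only an illustration, assuming all timestamps and the timeout are expressed in milliseconds. `ConnectionEstablishmentTimedOut` is a hypothetical helper, not a function from the executor.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when a connection attempt that started
 * at startMs has used up the whole timeout by nowMs. The caller would error
 * out as soon as this holds for any single connection.
 */
static bool
ConnectionEstablishmentTimedOut(int64_t startMs, int64_t nowMs, int64_t timeoutMs)
{
	return (nowMs - startMs) >= timeoutMs;
}
```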

* Properly increment/decrement pool size variables

Before this commit, the idle and active connection
counts were not properly calculated.
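
One way to keep such counters consistent is to change them only through small transition helpers. The sketch below is an assumption-heavy illustration: the struct, the helper names, and the choice to treat idle and active connections as disjoint sets are all hypothetical and may differ from the real executor.

```c
#include <assert.h>

/* Hypothetical counters; idle and active are treated as disjoint here. */
typedef struct WorkerPoolCounters
{
	int activeConnectionCount;
	int idleConnectionCount;
} WorkerPoolCounters;

/* A newly established connection starts out idle. */
static void
ConnectionOpened(WorkerPoolCounters *pool)
{
	pool->idleConnectionCount++;
}

/* An idle connection picks up a task. */
static void
ConnectionBecameActive(WorkerPoolCounters *pool)
{
	assert(pool->idleConnectionCount > 0);
	pool->idleConnectionCount--;
	pool->activeConnectionCount++;
}

/* An active connection finishes its task. */
static void
ConnectionBecameIdle(WorkerPoolCounters *pool)
{
	assert(pool->activeConnectionCount > 0);
	pool->activeConnectionCount--;
	pool->idleConnectionCount++;
}

/* An idle connection is lost, so it must no longer be counted. */
static void
IdleConnectionLost(WorkerPoolCounters *pool)
{
	assert(pool->idleConnectionCount > 0);
	pool->idleConnectionCount--;
}
```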

* insert_select_executor goes through unified executor.

* Add missing file for task tracker

* Modify ExecuteTaskListExtended()'s signature

* Sort output of INSERT ... SELECT ... RETURNING

* Take partition locks correctly in unified executor

* Alternative implementation for force_max_query_parallelization

* Fix compile warnings in unified executor

* Fix style issues

* Decrement idleConnectionCount when idle connection is lost

* Always rebuild the wait event sets

In the previous implementation, on waitFlag changes, we were only
modifying the wait events. However, we've realized that this might
be an over-optimization since (a) we couldn't see any performance
benefit and (b) we saw some errors on failures. Because of (a),
we prefer to disable it for now.

* Make sure to allocate a large enough waitEventSet

With multi-row INSERTs, we might have more sessions than
task*workerCount after a few calls of RunDistributedExecution()
because the previous sessions would also be alive.

Instead, re-allocate events when the connection set changes.

* Implement SELECT FOR UPDATE on reference tables

On master branch, we do two extra things on SELECT FOR UPDATE
queries on reference tables:
   - Acquire executor locks
   - Execute the query on all replicas

With this commit, we're implementing the same logic on the
new executor.

* SELECT FOR UPDATE opens transaction block even if SelectOpensTransactionBlock disabled

Otherwise, users would be very confused and their logic is very likely
to break.

* Fix build error

* Fix the newConnectionCount calculation in ManageWorkerPool

* Fix rebase conflicts

* Fix minor test output differences

* Fix citus indent

* Remove duplicate sorts that were added with the rebase

* Create distributed table via executor

* Fix wait flags in CheckConnectionReady

* failure_savepoints output for unified executor.

* failure_vacuum output (pg 10) for unified executor.

* Fix WaitEventSetWait timeout in unified executor

* Stabilize failure_truncate test output

* Add an ORDER BY to multi_upsert

* Fix regression test outputs after rebase to master

* Add executor.c comment

* Rename executor.c to adaptive_executor.c

* Do not schedule tasks if the failed placement is not ready to execute

Before this commit, we were blindly scheduling the next placement execution
even if the failed placement was not on the ready queue. Now, if the failed
placement execution is on a failed pool or session and is still on the
pendingQueue, we do not schedule the next task, because the other placement
execution should already be running.
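
The decision can be sketched as a small predicate. This is a simplified illustration under stated assumptions: the state enum and `ShouldScheduleNextPlacement` are hypothetical, and "not ready" is taken to mean the failed placement execution was still waiting on the pending queue.

```c
#include <stdbool.h>

/* Hypothetical states of a placement execution, for illustration only. */
typedef enum SketchPlacementState
{
	SKETCH_PLACEMENT_NOT_READY, /* still waiting on the pending queue */
	SKETCH_PLACEMENT_READY,     /* on the ready queue */
	SKETCH_PLACEMENT_RUNNING    /* already sent to a worker */
} SketchPlacementState;

/*
 * Hypothetical predicate: after a placement execution fails, schedule the
 * next one only if the failed one was ready or running. If it was not ready
 * yet, another placement execution is assumed to be running already.
 */
static bool
ShouldScheduleNextPlacement(SketchPlacementState failedPlacementState)
{
	return failedPlacementState != SKETCH_PLACEMENT_NOT_READY;
}
```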

* Implement a proper custom scan node for adaptive executor

- Switch between the executors, add GUC to set the pool size
- Add non-adaptive regression test suites
- Enable CIRCLE CI for non-adaptive tests
- Adjust test output files

* Add slow start interval to the executor

* Expose max_cached_connection_per_worker to user

* Do not start slow when there are cached connections

* Consider ExecutorSlowStartInterval in NextEventTimeout

* Fix memory issues with ReceiveResults().

* Disable executor via TaskExecutorType

* Make sure to execute the tests with the other executor

* Use task_executor_type to enable-disable adaptive executor

* Remove useless code

* Adjust the regression tests

* Add slow start regression test

* Rebase to master

* Fix test failures in adaptive executor.

* Rebase to master - 2

* Improve comments & debug messages

* Set force_max_query_parallelization in isolation_citus_dist_activity

* Force max parallelization for creating shards when asked to use exclusive connection.

* Adjust the default pool size

* Expand description of max_adaptive_executor_pool_size GUC

* Update warnings in FinishRemoteTransactionCommit()

* Improve session clean up at the end of execution

Explicitly list all the states in which the execution might end;
otherwise, warn.

* Remove MULTI_CONNECTION_WAIT_RETRY which is not used at all

* Add more ORDER BYs to multi_mx_partitioning
2019-06-28 14:04:40 +02:00
Hanefi Onaldi 4e08477fed Add test case for issue 2575 2019-06-26 17:12:28 +02:00
Hanefi Onaldi 7e8fd49b94 Create Schemas as superuser on all shard/table creation UDFs
- All the schema creations on the workers will now be done via superuser connections
- If a shard is being repaired or a shard is replicated, we will create the
  schema only in the relevant worker; and in all the other cases where a schema
  creation is needed, we will block operations until we ensure the schema exists
  in all the workers
2019-06-26 17:12:28 +02:00
Philip Dubé aa0c47848e subquery_and_cte: test rejecting volatile ctes
Also update isolation_citus_dist_activity from after merge
2019-06-26 16:27:07 +02:00
Philip Dubé 18575ccfd3 Add tests to subquery_and_cte, update check-multi-mx expected results 2019-06-26 10:32:01 +02:00
Philip Dubé 77efec04a0 Router Planner: accept SELECT_CMD ctes in modification queries 2019-06-26 10:32:01 +02:00
Hadi Moshayedi 3d0a521295 Show just coordinator plan in some test outputs. 2019-06-24 12:24:30 +02:00
Hanefi Onaldi 7a6eb2aba0 Fix one regression test that fails on enterprise (#2786)
GRANT queries are propagated on Enterprise. If a user attempts to
create a user and run a GRANT query before creating it on workers, we
fail. This issue does not happen in community as the user needs to run
the GRANTs on the workers manually.
2019-06-21 15:46:28 +03:00