Commit Graph

167 Commits (63efc140de3c7d97e77393a36adef7df57b6a93b)

Author SHA1 Message Date
Onder Kalaci 5482d5822f Keep more statistics about connection establishment times
When DEBUG4 enabled, Citus now prints per connection establishment
time.
2021-04-16 14:56:31 +02:00
SaitTalhaNisanci b453563e88
Warm up connections params hash (#4872)
ConnParams(AuthInfo and PoolInfo) gets a snapshot, which will block the
remote connectinos to localhost. And the release of snapshot will be
blocked by the snapshot. This leads to a deadlock.

We warm up the conn params hash before starting a new transaction so
that the entries will already be there when we start a new transaction.
Hence GetConnParams will not get a snapshot.
2021-04-12 13:08:38 +03:00
Marco Slot fbc2147e11 Replace MAX_PUT_COPY_DATA_BUFFER_SIZE by citus.remote_copy_flush_threshold GUC 2021-03-16 06:00:38 +01:00
Marco Slot 1646fca445 Add GUC to set maximum connection lifetime 2021-03-16 01:57:57 +01:00
Philip Dubé 4e22f02997 Fix various typos due to zealous repetition 2021-03-04 19:28:15 +00:00
Onder Kalaci fc9a23792c COPY uses adaptive connection management on local node
With #4338, the executor is smart enough to failover to
local node if there is not enough space in max_connections
for remote connections.

For COPY, the logic is different. With #4034, we made COPY
work with the adaptive connection management slightly
differently. The cause of the difference is that COPY doesn't
know which placements are going to be accessed hence requires
to get connections up-front.

Similarly, COPY decides to use local execution up-front.

With this commit, we change the logic for COPY on local nodes:

Try to reserve a connection to local host. This logic follows
the same logic (e.g., citus.local_shared_pool_size) as the
executor because COPY also relies on TryToIncrementSharedConnectionCounter().
If reservation to local node fails, switch to local execution
Apart from this, if local execution is disabled, we follow the
exact same logic for multi-node Citus. It means that if we are
out of the connection, we'd give an error.
2021-02-04 09:45:07 +01:00
Onur Tirtir 253c19062a
Rename IsCitusInitiatedBackend to IsCitusInitiatedRemoteBackend (#4562) 2021-01-23 01:07:43 +03:00
Onur Tirtir 941c8fbf32
Automatically undistribute citus local tables when no more fkeys with reference tables (#4538) 2021-01-22 18:15:41 +03:00
Onder Kalaci c546ec5e78 Local node connection management
When Citus needs to parallelize queries on the local node (e.g., the node
executing the distributed query and the shards are the same), we need to
be mindful about the connection management. The reason is that the client
backends that are running distributed queries are competing with the client
backends that Citus initiates to parallelize the queries in order to get
a slot on the max_connections.

In that regard, we implemented a "failover" mechanism where if the distributed
queries cannot get a connection, the execution failovers the tasks to the local
execution.

The failover logic is follows:

- As the connection manager if it is OK to get a connection
	- If yes, we are good.
	- If no, we fail the workerPool and the failure triggers
	  the failover of the tasks to local execution queue

The decision of getting a connection is follows:

/*
 * For local nodes, solely relying on citus.max_shared_pool_size or
 * max_connections might not be sufficient. The former gives us
 * a preview of the future (e.g., we let the new connections to establish,
 * but they are not established yet). The latter gives us the close to
 * precise view of the past (e.g., the active number of client backends).
 *
 * Overall, we want to limit both of the metrics. The former limit typically
 * kics in under regular loads, where the load of the database increases in
 * a reasonable pace. The latter limit typically kicks in when the database
 * is issued lots of concurrent sessions at the same time, such as benchmarks.
 */
2020-12-03 14:16:13 +03:00
Onur Tirtir 03bcccdee0
Fix hostname length check in StartNodeUserDatabaseConnection (#4363)
Copying string before hostname length check makes the check useless
2020-11-30 20:00:35 +03:00
Onur Tirtir 7f3d1182ed
Handle invalid connection hash entries (#4362)
If MemoryContextAlloc errors out -e.g. during an OOM-, ConnectionHashEntry->connections
stays as NULL.

With this commit, we add isValid flag to ConnectionHashEntry that should be set to true
right after we allocate & initialize ConnectionHashEntry->connections list properly, and we
check it before accesing to ConnectionHashEntry->connections.
2020-11-30 19:44:03 +03:00
Onur Tirtir f80f4839ad Remove unused functions that cppcheck found 2020-10-19 13:50:52 +03:00
Onur Tirtir ba208eae4d
Record non-distributed table accesses in local executor (#4139) 2020-09-07 18:19:08 +03:00
Sait Talha Nisanci 1112b254a7 adapt recently added code for pg13
This commit mostly adds pg_get_triggerdef_command to our ruleutils_13.
This doesn't add anything extra for ruleutils 13 so it is basically a copy
of the change on ruleutils_12
2020-08-04 15:18:27 +03:00
Sait Talha Nisanci 01632c56a0 Change utils/hashutils.h to common/hashfn.h for PG >= 13
Commit on postgres side:
05d8449e73694585b59f8b03aaa087f04cc4679a

Command on postgres side:
git log --all --grep="hashutils"

include common/hashfn.h for pg >= 13

tag_hash was moved from hsearch.h to hashutils.h then to hashfn.h

Commits on Postgres side:
9341c783cc42ffae5860c86bdc713bd47d734ffd
2020-08-04 15:10:22 +03:00
Onder Kalaci eeb8c81de2 Implement shared connection count reservation & enable `citus.max_shared_pool_size` for COPY
With this patch, we introduce `locally_reserved_shared_connections.c/h` files
which are responsible for reserving some space in shared memory counters
upfront.

We sometimes need to reserve connections, but not necessarily
establish them. For example:
-  COPY command should reserve connections as it cannot know which
   connections it needs in which order. COPY establishes connections
   as any input data hits the workers. For example, for router COPY
   command, it only establishes 1 connection.

   As discussed here (https://github.com/citusdata/citus/pull/3849#pullrequestreview-431792473),
   COPY needs to reserve connections up-front, otherwise we can end
   up with resource starvation/un-detected deadlocks.
2020-08-03 18:51:40 +02:00
Onder Kalaci a2f53dff74 Make FindAvailableConnection() more strict
With adaptive connection management, we might have some connections
which are not fully initialized. Those connections should not be
qualified as available.
2020-07-23 15:59:50 +02:00
Onder Kalaci cfb633601d Minor refactorings in COPY command execution
1) Rename CONNECTION_PER_PLACEMENT to REQUIRE_CLEAN_CONNECTION. This is
mostly to make things clear as the new name reveals more.

2) We also make sure that mark all the copy connections critical,
even if they are accessed earlier in the transction
2020-07-23 15:36:19 +02:00
Onder Kalaci 52c0fccb08 Move executor specific logic to a function
Because as we're planning to use the same logic, it'd be nice to
use the exact same functions.
2020-07-22 15:09:47 +02:00
SaitTalhaNisanci b3af63c8ce
Remove task tracker executor (#3850)
* use adaptive executor even if task-tracker is set

* Update check-multi-mx tests for adaptive executor

Basically repartition joins are enabled where necessary. For parallel
tests max adaptive executor pool size is decresed to 2, otherwise we
would get too many clients error.

* Update limit_intermediate_size test

It seems that when we use adaptive executor instead of task tracker, we
exceed the intermediate result size less in the test. Therefore updated
the tests accordingly.

* Update multi_router_planner

It seems that there is one problem with multi_router_planner when we use
adaptive executor, we should fix the following error:
+ERROR:  relation "authors_range_840010" does not exist
+CONTEXT:  while executing command on localhost:57637

* update repartition join tests for check-multi

* update isolation tests for repartitioning

* Error out if shard_replication_factor > 1 with repartitioning

As we are removing the task tracker, we cannot switch to it if
shard_replication_factor > 1. In that case, we simply error out.

* Remove MULTI_EXECUTOR_TASK_TRACKER

* Remove multi_task_tracker_executor

Some utility methods are moved to task_execution_utils.c.

* Remove task tracker protocol methods

* Remove task_tracker.c methods

* remove unused methods from multi_server_executor

* fix style

* remove task tracker specific tests from worker_schedule

* comment out task tracker udf calls in tests

We were using task tracker udfs to test permissions in
multi_multiuser.sql. We should find some other way to test them, then we
should remove the commented out task tracker calls.

* remove task tracker test from follower schedule

* remove task tracker tests from multi mx schedule

* Remove task-tracker specific functions from worker functions

* remove multi task tracker extra schedule

* Remove unused methods from multi physical planner

* remove task_executor_type related things in tests

* remove LoadTuplesIntoTupleStore

* Do initial cleanup for repartition leftovers

During startup, task tracker would call TrackerCleanupJobDirectories and
TrackerCleanupJobSchemas to clean up leftover directories and job
schemas. With adaptive executor, while doing repartitions it is possible
to leak these things as well. We don't retry cleanups, so it is possible
to have leftover in case of errors.

TrackerCleanupJobDirectories is renamed as
RepartitionCleanupJobDirectories since it is repartition specific now,
however TrackerCleanupJobSchemas cannot be used currently because it is
task tracker specific. The thing is that this function is a no-op
currently.

We should add cleaning up intermediate schemas to DoInitialCleanup
method when that problem is solved(We might want to solve it in this PR
as well)

* Revert "remove task tracker tests from multi mx schedule"

This reverts commit 03ecc0a681.

* update multi mx repartition parallel tests

* not error with task_tracker_conninfo_cache_invalidate

* not run 4 repartition queries in parallel

It seems that when we run 4 repartition queries in parallel we get too
many clients error on CI even though we don't get it locally. Our guess
is that, it is because we open/close many connections without doing some
work and postgres has some delay to close the connections. Hence even
though connections are removed from the pg_stat_activity, they might
still not be closed. If the above assumption is correct, it is unlikely
for it to happen in practice because:
- There is some network latency in clusters, so this leaves some times
for connections to be able to close
- Repartition joins return some data and that also leaves some time for
connections to be fully closed.

As we don't get this error in our local, we currently assume that it is
not a bug. Ideally this wouldn't happen when we get rid of the
task-tracker repartition methods because they don't do any pruning and
might be opening more connections than necessary.

If this still gives us "too many clients" error, we can try to increase
the max_connections in our test suite(which is 100 by default).

Also there are different places where this error is given in postgres,
but adding some backtrace it seems that we get this from
ProcessStartupPacket. The backtraces can be found in this link:
https://circleci.com/gh/citusdata/citus/138702

* Set distributePlan->relationIdList when it is needed

It seems that we were setting the distributedPlan->relationIdList after
JobExecutorType is called, which would choose task-tracker if
replication factor > 1 and there is a repartition query. However, it
uses relationIdList to decide if the query has a repartition query, and
since it was not set yet, it would always think it is not a repartition
query and would choose adaptive executor when it should choose
task-tracker.

* use adaptive executor even with shard_replication_factor > 1

It seems that we were already using adaptive executor when
replication_factor > 1. So this commit removes the check.

* remove multi_resowner.c and deprecate some settings

* remove TaskExecution related leftovers

* change deprecated API error message

* not recursively plan single relatition repartition subquery

* recursively plan single relation repartition subquery

* test depreceated task tracker functions

* fix overlapping shard intervals in range-distributed test

* fix error message for citus_metadata_container

* drop task-tracker deprecated functions

* put the implemantation back to worker_cleanup_job_schema_cachesince citus cloud uses it

* drop some functions, add downgrade script

Some deprecated functions are dropped.
Downgrade script is added.
Some gucs are deprecated.
A new guc for repartition joins bucket size is added.

* order by a test to fix flappiness
2020-07-18 13:11:36 +03:00
Nils Dijk d0b6e62c9a
change wording to allowlist and the likes (#3906)
In the same line as #3904

Change wording to better reflect use and remove words that enforce/maintain bias.
2020-07-15 16:24:40 +02:00
Marco Slot d1bab78d79 Remove master from file hierarchy 2020-06-16 17:49:09 +02:00
Jelte Fennema 0e12d045b1
Support use of binary protocol in between nodes (#3877)
This can save a lot of data to be sent in some cases, thus improving
performance for which inter query bandwidth is the bottleneck.
There's some issues with enabling this as default, so that's currently not done.
2020-06-12 15:02:51 +02:00
Jelte Fennema b87bae71bb
Error out when using different users in the same transaction (#3869)
Fixes #3867

As described in the issue above we return incorrect results when
changing user within a transaction. This causes us to error out instead.
2020-06-10 14:07:40 +02:00
Jelte Fennema c6f5d5fe88
Add some asserts to pass static analysis (#3805) 2020-04-29 11:19:11 +02:00
Onder Kalaci bc54c5125f Increase the default value of citus.node_connection_timeout
The previous default was 5 seconds, and we change it to 30 seconds.
The main motivation for this is that for busy clusters, 5 seconds
can be too aggressive. Especially with connection throttling, the servers
might be kept busy for a really long time, and users may see the
connection errors more frequently.

We've done some sanity checks, for really quick queries (like
`SELECT count(*) from table`), 30 seconds is a decent value even
if users execute 300 distributed queries on the coordinator. We've
verified this on Hyperscale(Citus).
2020-04-24 15:16:42 +02:00
Onder Kalaci e182215d96 Improve connection error message from the worker nodes
We currently put the actual error message to the detail part. However,
many drivers don't show detail part.

As connection errors are somehow common, and hard to trace back, can't
we added the detail to the message itself.

In addition to that, we changed "connection error" message, as it
was confusing to the users who think that the error was happening
while connecting to the coordinator. In fact, this error is showing
up when the coordinator fails to connect remote nodes.
2020-04-20 13:32:55 +02:00
SaitTalhaNisanci 1d0f4bdcd2
invalidate plan cache in master_update_node (#3758)
* invalidate plan cache in master_update_node

If a plan is cached by postgres but a user uses master_update_node, then
when the plan cache is used for the updated node, they will get the old
nodename/nodepost in the plan. This is because the plan cache doesn't
know about the master_update_node. This could be a problem in prepared
statements or anything that goes into plancache. As a solution the plan
cache is invalidated inside master_update_node.

* add invalidate_inactive_shared_connections test function

We introduce invalidate_inactive_shared_connections udf to be used in
testing. It is possible that a connection count for an inactive node
will be greater than 0 and in that case it will not be removed at the
time of invalidation. However, later we don't have a mechanism to remove
it, which means that it will stay in the hash. For this not to cause a
problem, we use this udf in testing.

* move invalidate_inactive_shared_connections to udfs from test as it will be used in mx

* remove the test udf

* remove the IsInactive check
2020-04-17 17:43:48 +03:00
Önder Kalacı a919f09c96
Remove the entries from the shared connection counter hash when no connections remain (#3775)
We initially considered removing entries just before any change to
pg_dist_node. However, that ended-up being very complex and making
MX even more complex.

Instead, we're switching to a simpler solution, where we remove entries
when the counter gets to 0.

With certain workloads, this may have some performance penalty. But, two
notes on that:
 - When counter == 0, it implies that the cluster is not busy
 - With cached connections, that's not possible
2020-04-17 17:14:58 +03:00
Marco Slot 8b83306a27 Issue worker messages with the same log level 2020-04-14 21:08:25 +02:00
Onder Kalaci aa6b641828 Throttle connections to the worker nodes
With this commit, we're introducing a new infrastructure to throttle
connections to the worker nodes. This infrastructure is useful for
multi-shard queries, router queries are have not been affected by this.

The goal is to prevent establishing more than citus.max_shared_pool_size
number of connections per worker node in total, across sessions.

To do that, we've introduced a new connection flag OPTIONAL_CONNECTION.
The idea is that some connections are optional such as the second
(and further connections) for the adaptive executor. A single connection
is enough to finish the distributed execution, the others are useful to
execute the query faster. Thus, they can be consider as optional connections.
When an optional connection is not allowed to the adaptive executor, it
simply skips it and continues the execution with the already established
connections. However, it'll keep retrying to establish optional
connections, in case some slots are open again.
2020-04-14 10:27:48 +02:00
Onder Kalaci 38b8a9ad62 Add citus_remote_connection_stats() function
This function is intended to be used for monitoring
the remote connections.
2020-04-14 10:03:27 +02:00
Onder Kalaci 0dbfbe0c37 Add the necessary shared memory infrastructure
- The hashmap in the shared memory
- The lock to access the hashmap
- The GUC to control the size
2020-04-14 10:03:26 +02:00
Philip Dubé ab0b59ad3b GetConnParams: Set runtimeParamStart before setting keywords/values to avoid out of bounds access 2020-04-10 13:14:06 +00:00
Onder Kalaci 9b29a32d7a Remove all references for side channel connections
We don't need any side channel connections. That is actually
problematic in the sense that it creates extra connections.
Say, citus.max_adaptive_executor_pool_size equals to 1, Citus
ends up using one extra connection for the intermediate results.
Thus, not obeying citus.max_adaptive_executor_pool_size.

In this PR, we remove the following entities from the codebase
to allow further commits to implement not requiring extra connection
for the intermediate results:

- The connection flag REQUIRE_SIDECHANNEL
- The function GivePurposeToConnection
- The ConnectionPurpose struct and related fields
2020-04-07 17:06:55 +02:00
Jelte Fennema 2aabe3e2ef
Mark all connections for shutdown when citus.node_conninfo chan… (#3642)
We cache connections between nodes in our connection management code.
This is good for speed. For security this can be a problem though. If
the user changes settings related to TLS encryption they want those to
be applied to future queries. This is especially important when they did
not have TLS enabled before and now they want to enable it. This can
normally be achieved by changing citus.node_conninfo.  However, because
connections are not reopened there will still be old connections that
might not be encrypted at all.

This commit changes that by marking all connections to be shutdown at
the end of their current transaction. This way running transactions will
succeed, even if placement requires connections to be reused for this
transaction. But after this transaction completes any future statements
will use a connection created with the new connection options.

If a connection is requested and a connection is found that is marked
for shutdown, then we don't return this connection. Instead a new one is
created. This is needed to make sure that if there are no running
transactions, then the next statement will not use an old cached
connection, since connections are only actually shutdown at the end of a
transaction.
2020-03-24 15:31:41 +01:00
Onder Kalaci 7b4eb9611b Properly terminate connections at the end session
Citus coordinator (or MX nodes) caches `citus.max_cached_conns_per_worker` connections
per node. This means that, those connections are not terminated after each statement.
Instead, cached to avoid the cost of re-establishment. This is crucial for OLTP performance.

The problem with that approach is that, we never properly handle the termnation of
those cached connections. For instance, when a session on the coordinator disconnects,
you'd see the following logs on the workers:

```
2020-03-20 09:13:39.454 CET [64028] LOG:  could not receive data from client: Connection reset by peer
```

With this patch, we're terminating the cached connections properly at the end of the connection.
2020-03-20 17:34:34 +01:00
Philip Dubé 20abc4d2b5
Replace foreach with foreach_ptr/foreach_oid (#3544) 2020-02-27 16:54:49 +01:00
Jelte Fennema 8de8b62669 Convert unsafe APIs to safe ones 2020-02-25 15:39:27 +01:00
Philip Dubé 52042d4a00 Prefer instr_time to TimestampTz when we want CLOCK_MONOTONIC 2020-02-19 00:34:17 +00:00
Jelte Fennema 3f7c5a5cf6
Semmle: Fix possible infite loops caused by overflow (#3503)
Comparison between differently sized integers in loop conditions can cause
infinite loops. This can happen when doing something like this:

```c
int64 very_big = MAX_INT32 + 1;
for (int32 i = 0; i < very_big; i++) {
    // do something
}
// never reached because i overflows before it can reach the value of very_big
```
2020-02-17 14:35:10 +01:00
Philip Dubé c252811884 dont: don't, wont: won't, acylic: acyclic 2020-02-05 17:32:22 +00:00
Marco Slot be77d3304f Fixup 2020-02-03 11:59:55 +01:00
Marco Slot b0fd6aa006 If reference tables was read over multiple connections, do not assign connection 2020-02-03 11:54:29 +01:00
Önder Kalacı ef7d1ea91d
Locally execute queries that don't need any data access (#3410)
* Update shardPlacement->nodeId to uint

As the source of the shardPlacement->nodeId is always workerNode->nodeId,
and that is uint32.

We had this hack because of: 0ea4e52df5 (r266421409)

And, that is gone with: 90056f7d3c (diff-c532177d74c72d3f0e7cd10e448ab3c6L1123)

So, we're safe to do it now.

* Relax the restrictions on using the local execution

Previously, whenever any local execution happens, we disabled further
commands to do any remote queries. The basic motivation for doing that
is to prevent any accesses in the same transaction block to access the
same placements over multiple sessions: one is local session the other
is remote session to the same placement.

However, the current implementation does not distinguish local accesses
being to a placement or not. For example, we could have local accesses
that only touches intermediate results. In that case, we should not
implement the same restrictions as they become useless.

So, this is a pre-requisite for executing the intermediate result only
queries locally.

* Update the error messages

As the underlying implementation has changed, reflect it in the error
messages.

* Keep track of connections to local node

With this commit, we're adding infrastructure to track if any connection
to the same local host is done or not.

The main motivation for doing this is that we've previously were more
conservative about not choosing local execution. Simply, we disallowed
local execution if any connection to any remote node is done. However,
if we want to use local execution for intermediate result only queries,
this'd be annoying because we expect all queries to touch remote node
before the final query.

Note that this approach is still limiting in Citus MX case, but for now
we can ignore that.

* Formalize the concept of Local Node

Also some minor refactoring while creating the dummy placement

* Write intermediate results locally when the results are only needed locally

Before this commit, Citus used to always broadcast all the intermediate
results to remote nodes. However, it is possible to skip pushing
the results to remote nodes always.

There are two notable cases for doing that:

   (a) When the query consists of only intermediate results
   (b) When the query is a zero shard query

In both of the above cases, we don't need to access any data on the shards. So,
it is a valuable optimization to skip pushing the results to remote nodes.

The pattern mentioned in (a) is actually a common patterns that Citus users
use in practice. For example, if you have the following query:

WITH cte_1 AS (...), cte_2 AS (....), ... cte_n (...)
SELECT ... FROM cte_1 JOIN cte_2 .... JOIN cte_n ...;

The final query could be operating only on intermediate results. With this patch,
the intermediate results of the ctes are not unnecessarily pushed to remote
nodes.

* Add specific regression tests

As there are edge cases in Citus MX and with round-robin policy,
use the same queries on those cases as well.

* Fix failure tests

By forcing not to use local execution for intermediate results since
all the tests expects the results to be pushed remotely.

* Fix flaky test

* Apply code-review feedback

Mostly style changes

* Limit the max value of pg_dist_node_seq to reserve for internal use
2020-01-23 18:28:34 +01:00
Onder Kalaci a0dff301c7 Update shardPlacement->nodeId to uint
As the source of the shardPlacement->nodeId is always workerNode->nodeId,
and that is uint32.

We had this hack because of: 0ea4e52df5 (r266421409)

And, that is gone with: 90056f7d3c (diff-c532177d74c72d3f0e7cd10e448ab3c6L1123)

So, we're safe to do it now.
2020-01-23 13:00:24 +01:00
Onder Kalaci 4be69bbf6f Fix reference table issue 2020-01-20 18:45:18 +00:00
Philip Dubé fdcc413559 Code cleanup of adaptive_executor, connection_management, placement_connection
adaptive_executor: sort includes, use foreach_ptr, remove lies from FinishDistributedExecution docs
connection_management: rename msecs, which isn't milliseconds
placement_connection: small typos
2020-01-17 17:44:47 +00:00
Marco Slot f1a0582973 Make ApplyLogRedaction a macro and redefine ereport 2020-01-13 18:24:36 +01:00
Marco Slot 06709ee108 Always use NOTICE in log_remote_commands and avoid redaction when possible 2020-01-13 18:24:36 +01:00