Sometimes in CI we run into this failure:
```diff
SELECT resultId, nodeport, rowcount, targetShardId, targetShardIndex
FROM partition_task_list_results('test', $$ SELECT * FROM source_table $$, 'target_table')
NATURAL JOIN pg_dist_node;
-WARNING: connection to the remote node localhost:xxxxx failed with the following error: connection not open
+ERROR: connection to the remote node localhost:9060 failed with the following error: connection not open
SELECT * FROM distributed_result_info ORDER BY resultId;
- resultid | nodeport | rowcount | targetshardid | targetshardindex
----------------------------------------------------------------------
- test_from_100800_to_0 | 9060 | 22 | 100805 | 0
- test_from_100801_to_0 | 57637 | 2 | 100805 | 0
- test_from_100801_to_1 | 57637 | 15 | 100806 | 1
- test_from_100802_to_1 | 57637 | 10 | 100806 | 1
- test_from_100802_to_2 | 57637 | 5 | 100807 | 2
- test_from_100803_to_2 | 57637 | 18 | 100807 | 2
- test_from_100803_to_3 | 57637 | 4 | 100808 | 3
- test_from_100804_to_3 | 9060 | 24 | 100808 | 3
-(8 rows)
-
+ERROR: current transaction is aborted, commands ignored until end of transaction block
-- fetch from worker 2 should fail
SAVEPOINT s1;
+ERROR: current transaction is aborted, commands ignored until end of transaction block
SELECT fetch_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[], 'localhost', :worker_2_port) > 0 AS fetched;
-ERROR: could not open file "base/pgsql_job_cache/xx_x_xxx/test_from_100802_to_1.data": No such file or directory
-CONTEXT: while executing command on localhost:xxxxx
+ERROR: current transaction is aborted, commands ignored until end of transaction block
ROLLBACK TO SAVEPOINT s1;
+ERROR: savepoint "s1" does not exist
-- fetch from worker 1 should succeed
SELECT fetch_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[], 'localhost', :worker_1_port) > 0 AS fetched;
- fetched
----------------------------------------------------------------------
- t
-(1 row)
-
+ERROR: current transaction is aborted, commands ignored until end of transaction block
-- make sure the results read are same as the previous transaction block
SELECT count(*), sum(x) FROM
read_intermediate_results('{test_from_100802_to_1,test_from_100802_to_2}'::text[],'binary') AS res (x int);
- count | sum
----------------------------------------------------------------------
- 15 | 863
-(1 row)
-
+ERROR: current transaction is aborted, commands ignored until end of transaction block
ROLLBACk;
```
As outlined in the #7306 I created, the reason for this is related to
only having a single connection open to the node. Finding and fixing the
full cause is not trivial, so instead this PR starts working around
this bug by forcing maximum parallelism. Preferably we'd want
this workaround not to be necessary, but that requires
spending time to fix this. For now having a less flaky CI is
good enough.
When Citus needs to parallelize queries on the local node (e.g., the node
executing the distributed query and the shards are the same), we need to
be mindful about the connection management. The reason is that the client
backends that are running distributed queries are competing with the client
backends that Citus initiates to parallelize the queries in order to get
a slot on the max_connections.
In that regard, we implemented a "failover" mechanism where if the distributed
queries cannot get a connection, the execution failovers the tasks to the local
execution.
The failover logic is follows:
- As the connection manager if it is OK to get a connection
- If yes, we are good.
- If no, we fail the workerPool and the failure triggers
the failover of the tasks to local execution queue
The decision of getting a connection is follows:
/*
* For local nodes, solely relying on citus.max_shared_pool_size or
* max_connections might not be sufficient. The former gives us
* a preview of the future (e.g., we let the new connections to establish,
* but they are not established yet). The latter gives us the close to
* precise view of the past (e.g., the active number of client backends).
*
* Overall, we want to limit both of the metrics. The former limit typically
* kics in under regular loads, where the load of the database increases in
* a reasonable pace. The latter limit typically kicks in when the database
* is issued lots of concurrent sessions at the same time, such as benchmarks.
*/