PG15 commit d1ef5631e620f9a5b6480a32bb70124c857af4f1
and PG16 commit 695f5deb7902865901eb2d50a70523af655c3a00
disallow replacing joins with scans in queries with pseudoconstant quals.
This commit prevents the set_join_pathlist_hook from being called
if any of the join restrictions is a pseudo-constant.
So in these cases, citus has no info on the join, never sees that
the query has an outer join, and ends up producing an incorrect plan.
PG17 fixes this by commit 9e9931d2bf40e2fea447d779c2e133c2c1256ef3
Therefore, we take this extra measure here for PG versions less than 17.
hasOuterJoin can never be true when set_join_pathlist_hook is absent.
DESCRIPTION: Adds JSON_TABLE() support
PG17 has added basic `JSON_TABLE()` functionality
`JSON_TABLE()` allows `JSON` data to be converted into a relational view
and thus used, for example, in a `FROM` clause, like other tabular data.
We treat `JSON_TABLE` the same as correlated functions (e.g., recurring
tuples). In the end, for multi-shard `JSON_TABLE` commands, we apply the
same restrictions as reference tables (e.g., cannot perform a lateral
outer join when a distributed subquery references a (reference
table)/(json table) etc.)
Relevant PG17 commits:
[basic JSON
table](https://github.com/postgres/postgres/commit/de3600452), [nested
paths in json
table](https://github.com/postgres/postgres/commit/bb766cde6)
Onder had previously added json table support for PG15BETA1, but we
reverted that commit because json table was reverted in PG15.
ce7f1a530f
Previous relevant PG15Beta1 commit:
https://github.com/postgres/postgres/commit/4e34747c8
Therefore, I referred to Onder's commit for this commit as well, with a
few changes due to some differences between PG15/PG17:
1) In PG15Beta1, we had also `PLAN` clauses for `JSON_TABLE`
https://github.com/postgres/postgres/commit/fadb48b00, and Onder's
commit includes tests for those as well. However, `PLAN` nodes are _not_
added in PG17. Therefore, I didn't include the `json_table_select_only`
test, which had mostly queries involving `PLAN`. I only included the
last query from that test.
2) In PG15 timeline (Citus 11.1), we didn't support outer joins where
the outer rel is a recurring one and the inner one is a non-recurring
one. However, [Onur added support for that one in Citus
11.2](https://github.com/citusdata/citus/pull/6512), therefore I updated
the tests from Onder's commit accordingly.
3) PG17 json table has nested paths and columns, therefore I added a
test
with a distributed table, which is exactly the same as the one in
sqljson_jsontable in PG17.
https://github.com/postgres/postgres/commit/bb766cde6
This pull request also adds some basic tests on validation of SQL/JSON
constructor functions JSON(), JSON_SCALAR(), and JSON_SERIALIZE(),
and also SQL/JSON query functions JSON_EXISTS(), JSON_QUERY(), and
JSON_VALUE(). The relevant PG commits are the following:
[JSON(), JSON_SCALAR(),
JSON_SERIALIZE()](https://github.com/postgres/postgres/commit/03734a7fe)
[JSON_EXISTS(), JSON_VALUE(),
JSON_QUERY()](https://github.com/postgres/postgres/commit/6185c9737)
This change adds a script to programatically group all includes in a
specific order. The script was used as a one time invocation to group
and sort all includes throught our formatted code. The grouping is as
follows:
- System includes (eg. `#include<...>`)
- Postgres.h (eg. `#include "postgres.h"`)
- Toplevel imports from postgres, not contained in a directory (eg.
`#include "miscadmin.h"`)
- General postgres includes (eg . `#include "nodes/..."`)
- Toplevel citus includes, not contained in a directory (eg. `#include
"citus_verion.h"`)
- Columnar includes (eg. `#include "columnar/..."`)
- Distributed includes (eg. `#include "distributed/..."`)
Because it is quite hard to understand the difference between toplevel
citus includes and toplevel postgres includes it hardcodes the list of
toplevel citus includes. In the same manner it assumes anything not
prefixed with `columnar/` or `distributed/` as a postgres include.
The sorting/grouping is enforced by CI. Since we do so with our own
script there are not changes required in our uncrustify configuration.
Postgres supports JSON_TABLE feature on PG 15.
We treat JSON_TABLE the same as correlated functions (e.g., recurring tuples).
In the end, for multi-shard JSON_TABLE commands, we apply the same
restrictions as reference tables (e.g., cannot be in the outer part of
an outer join etc.)
Co-authored-by: Onder Kalaci <onderkalaci@gmail.com>
As long as the VALUES clause contains constant values, we should not
recursively plan the queries/CTEs.
This is a follow-up work of #1805. So, we can easily apply OUTER join
checks as if VALUES clause is a reference table/immutable function.
In plain words, each distributed plan pulls the necessary intermediate
results to the worker nodes that the plan hits. This is primarily useful
in three ways.
(i) If the distributed plan that uses intermediate
result(s) is a router query, then the intermediate results are only
broadcasted to a single node.
(ii) If a distributed plan consists of only intermediate results, which
is not uncommon, the intermediate results are broadcasted to a single
node only.
(iii) If a distributed query hits a sub-set of the shards in multiple
workers, the intermediate results will be broadcasted to the relevant
node(s).
The final item (iii) becomes crucial for append/range distributed
tables where typically the distributed queries hit a small subset of
shards/workers.
To do this, for each query that Citus creates a distributed plan, we keep
track of the subPlans used in the queryTree, and save it in the distributed
plan. Just before Citus executes each subPlan, Citus first keeps track of
every worker node that the distributed plan hits, and marks every subPlan
should be broadcasted to these nodes. Later, for each subPlan which is a
distributed plan, Citus does this operation recursively since these
distributed plans may access to different subPlans, and those have to be
recorded as well.
This is necassery to support Q20 of the CHbenCHmark: #2582.
To summarize the fix: The subquery is converted into an INNER JOIN on a
table. This fixes the issue, since an INNER JOIN on a table is already
supported by the repartion planner.
The way this replacement is happening.:
1. Postgres replaces `col in (subquery)` with a SEMI JOIN (subquery) on col = subquery_result
2. If this subquery is simple enough Postgres will replace it with a
regular read from a table
3. If the subquery returns unique results (e.g. a primary key) Postgres
will convert the SEMI JOIN into an INNER JOIN during the planning. It
will not change this in the rewritten query though.
4. We check if Postgres sends us any SEMI JOINs during its join order
planning, if it doesn't we replace all SEMI JOINs in the rewritten
query with INNER JOIN (which we already support).
Areas for further optimization:
- Don't save subquery results to a local file on the coordinator when the subquery is not in the having clause
- Push the the HAVING with subquery to the workers if there's a group by on the distribution column
- Don't push down the results to the workers when we don't push down the HAVING clause, only the coordinator needs it
Fixes#520Fixes#756Closes#2047