diff --git a/images/insert-select-modes.png b/images/insert-select-modes.png new file mode 100755 index 000000000..6d5de65dc Binary files /dev/null and b/images/insert-select-modes.png differ diff --git a/src/backend/distributed/README.md b/src/backend/distributed/README.md index e3a191d38..84b1f194e 100644 --- a/src/backend/distributed/README.md +++ b/src/backend/distributed/README.md @@ -1672,7 +1672,7 @@ These stages are run in parallel for all dependent jobs (read: all tables in a j Once all dependent jobs are finished, the main Job is executed via the regular adaptive executor code path. The main job will include calls to read_intermediate_result that concatenate all the intermediate results for a particular hash range. - +Single hash re-partition join example Dependent jobs have similarities with subplans. A Job can only be distributed query without a merge step, which is what allows the results to be repartitioned, while a subplan can be any type of plan, and is always broadcast. One could imagine a subplan also being repartitioned if it is subsequently used in a join with a distributed table. The difference between subplans and jobs in the distributed query plans are one of the most significant technical debts. @@ -1704,11 +1704,7 @@ The COPY distributed_table TO .. syntax will typically return a lot of data and The INSERT.. SELECT command inserts the result of a SELECT query into a target table. In real-time analytics use cases, INSERT..SELECT enables transformation of an incoming stream of data inside the database. A typical example is maintaining a rollup table or converting raw data into a more structured form and adding indexes. -A diagram of a network - -Description automatically generated - -Source: citus-paper/sigmod2021/submission/citus-insert-select.pdf at main ยท citusdata/citus-paper (github.com) +INSERT..SELECT modes Citus has three different methods of handling INSERT..SELECT commands that insert into a distributed table as shown in the figure above. We identify these methods as: (1) co-located, where shards for the source and destination tables are co-located; (2) repartitioning, where the source and destination tables are not co-located and the operation requires a distributed reshuffle; and (3) pull to coordinator, where neither of the previous two methods can be applied. These three approaches can process around 100M, 10M, and 1M rows per second, respectively, in a single command.