Distributed PostgreSQL as an extension
Jelte Fennema 86876c0473
CTE pushdown via CTE inlining in distributed planning (#3161)
Before this patch, Citus always recursively planned CTEs.
PostgreSQL 12 added [logic](https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=608b167f9f9c4553c35bb1ec0eab9ddae643989b) for inlining CTEs, which essentially converts certain CTEs into subqueries.

With this patch, Citus becomes capable of doing the same and can avoid
recursively planning all the CTEs. Instead, the pushdown-able ones are
simply converted to subquery pushdown. If the inlined CTE query cannot
be pushed down, it simply follows the recursive planning logic.

See an example below:

```SQL
-- the query that users pass
WITH some_users AS (
  SELECT users_table.user_id
  FROM users_table JOIN events_table USING (user_id)
  WHERE event_type = 5
)
SELECT count(*) FROM users_table JOIN some_users USING (user_id);

-- worker query
SELECT count(*) AS COUNT
FROM ((users_table_102039 users_table
       JOIN users_table_102039 users_table_1 ON ((users_table_1.user_id OPERATOR(pg_catalog.=) users_table.user_id)))
      JOIN events_table_102071 events_table ON ((users_table.user_id OPERATOR(pg_catalog.=) events_table.user_id)))
WHERE (events_table.event_type OPERATOR(pg_catalog.=) 5)
```

There are a few things to call out for future reference and
to help the reviewer(s) understand the patch more easily:

1) On top of Postgres' restrictions on inlining CTEs, Citus enforces one more.
This is to prevent regressions in SQL support. For example, Postgres considers
the following CTE safe to inline. However, if it were inlined, Citus could not
plan the whole query, so we prefer to skip inlining that CTE:

```SQL
-- Citus should not inline this CTE because otherwise it cannot
-- plan the query
WITH cte_1 AS (SELECT * FROM test_table) 
SELECT 
	*, row_number() OVER () 
FROM 
	cte_1;
```

2) Some exotic queries that involve multiple colocation groups
   could become repartition joins. Basically, after the CTE inlining
   happens, ShouldRecursivelyPlanNonColocatedSubqueries() fails to
   detect that the query is a non-colocated subquery. We should improve
   that function to fix this. But, since we fall back to planning again,
   the query is still executed successfully by Citus.
```SQL
SET citus.shard_count TO 4;
CREATE TABLE  colocation_1 (key int, value int);
SELECT create_distributed_table('colocation_1', 'key');

SET citus.shard_count TO 8;
CREATE TABLE  colocation_2 (key int, value int);
SELECT create_distributed_table('colocation_2', 'key');

-- this used to work because the cte was recursively planned;
-- now the cte becomes a repartition join since
    --- (a) the cte is replaced with a subquery
    --- (b) since the subquery is very simple, postgres pulls it up
    ---     into a simple join
WITH cte AS (SELECT * FROM colocation_1)
SELECT count(*) FROM cte JOIN colocation_2 USING (key);
...
message: the query contains a join that requires repartitioning
detail: 
hint: Set citus.enable_repartition_joins to on to enable repartitioning
...
┌───────┐
│ count │
├───────┤
│     0 │
└───────┘
(1 row)
```
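
Following the hint in the output above, the repartition-join path can be enabled explicitly (a sketch, reusing the tables defined in the example):

```SQL
-- allow Citus to repartition the non-colocated join
SET citus.enable_repartition_joins TO on;

WITH cte AS (SELECT * FROM colocation_1)
SELECT count(*) FROM cte JOIN colocation_2 USING (key);
```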

3) We decided to implement CTE inlining even after the standard planner runs.
In PostgreSQL 12+, restriction information is generated for CTEs because
the CTEs are actually treated as subqueries via Postgres' inlining
capabilities.

In PostgreSQL 11 and below, restriction information is not generated for CTEs.
Because of that, some queries behave differently on PG 11 vs PG 12. For such
queries, see the cte_inline.sql file, which has two output files.

4) As a side-effect of (2), we're now able to inline CTEs for INSERT .. SELECT
queries as well. Postgres prevents this, but I cannot see a reason to keep
preventing it. With this capability, some INSERT ... SELECT queries where the
CTE is in the SELECT part become pushdownable. See an example:

```SQL
INSERT INTO test_table
WITH first_table_cte AS
  (SELECT * FROM test_table)
SELECT
  key, value
FROM
  first_table_cte;

5) A class of queries can now be supported. Previously, if a CTE was used in the
outer part of an outer join, Citus would complain about it.

So, the following query:

```SQL
WITH cte AS (
  SELECT * FROM users_table WHERE user_id = 1 ORDER BY value_1
)
SELECT
  cte.user_id, cte.time, events_table.event_type
FROM
  cte
LEFT JOIN
  events_table ON cte.user_id = events_table.user_id
ORDER BY
  1,2,3
LIMIT
  5;
ERROR:  cannot pushdown the subquery
DETAIL:  Complex subqueries and CTEs cannot be in the outer part of the outer join
```

Becomes:
```SQL
-- cte LEFT JOIN distributed_table is now supported
WITH cte AS (
  SELECT * FROM users_table WHERE user_id = 1 ORDER BY value_1
)
SELECT
  cte.user_id, cte.time, events_table.event_type
FROM
  cte
LEFT JOIN
  events_table ON cte.user_id = events_table.user_id
ORDER BY
  1,2,3
LIMIT
  5;
 user_id |              time               | event_type
---------+---------------------------------+------------
       1 | Wed Nov 22 22:51:43.132261 2017 |          0
       1 | Wed Nov 22 22:51:43.132261 2017 |          0
       1 | Wed Nov 22 22:51:43.132261 2017 |          1
       1 | Wed Nov 22 22:51:43.132261 2017 |          1
       1 | Wed Nov 22 22:51:43.132261 2017 |          2
(5 rows)
```
2020-01-16 12:43:48 +01:00

README.md

What is Citus?

  • Open-source PostgreSQL extension (not a fork)
  • Built to scale out across multiple nodes
  • Distributed engine for query parallelization
  • Database designed to scale out multi-tenant applications, real-time analytics dashboards, and high-throughput transactional workloads

Citus is an open source extension to Postgres that distributes your data and your queries across multiple nodes. Because Citus is an extension to Postgres, and not a fork, Citus gives developers and enterprises a scale-out database while keeping the power and familiarity of a relational database. As an extension, Citus supports new PostgreSQL releases, and allows you to benefit from new features while maintaining compatibility with existing PostgreSQL tools.

Citus serves many use cases. Three common ones are:

  1. Multi-tenant & SaaS applications: Most B2B applications already have the notion of a tenant / customer / account built into their data model. Citus allows you to scale out your transactional relational database to 100K+ tenants with minimal changes to your application.

  2. Real-time analytics: Citus enables ingesting large volumes of data and running analytical queries on that data in human real-time. Example applications include analytic dashboards with sub-second response times and exploratory queries on unfolding events.

  3. High-throughput transactional workloads: By distributing your workload across a database cluster, Citus ensures low latency and high performance even with a large number of concurrent users and high volumes of transactions.
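
As a minimal sketch of the multi-tenant pattern above (table and column names are illustrative), distributing a table by its tenant column looks like this:

```SQL
-- create a regular Postgres table, then shard it across the cluster by tenant
CREATE TABLE events (tenant_id bigint, event_id bigint, payload jsonb);
SELECT create_distributed_table('events', 'tenant_id');

-- queries filtered by tenant_id are routed to the shard for that tenant
SELECT count(*) FROM events WHERE tenant_id = 42;
```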

To learn more, visit citusdata.com and join the Citus slack to stay on top of the latest developments.

Getting started with Citus

The fastest way to get up and running is to deploy Citus in the cloud. You can also set up a local Citus database cluster with Docker.

Hyperscale (Citus) on Azure Database for PostgreSQL

Hyperscale (Citus) is a deployment option on Azure Database for PostgreSQL, a fully-managed database as a service. Hyperscale (Citus) employs the Citus open source extension so you can scale out across multiple nodes. To get started with Hyperscale (Citus), learn more on the Citus website or use the Hyperscale (Citus) Quickstart in the Azure docs.

Citus Cloud

Citus Cloud runs on top of AWS as a fully managed database as a service. You can provision a Citus Cloud account at https://console.citusdata.com and get started with just a few clicks.

Local Citus Cluster

If you're looking to get started locally, you can follow these steps to get up and running.

  1. Install Docker Community Edition and Docker Compose
  • Mac:
    1. Download and install Docker.
    2. Start Docker by clicking on the applications icon.
  • Linux:
    curl -sSL https://get.docker.com/ | sh
    sudo usermod -aG docker $USER && exec sg docker newgrp `id -gn`
    sudo systemctl start docker
    
    sudo curl -sSL https://github.com/docker/compose/releases/download/1.11.2/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
    sudo chmod +x /usr/local/bin/docker-compose
    
    The above version of Docker Compose is sufficient for running Citus, or you can install the latest version.
  2. Pull and start the Docker images
    curl -sSLO https://raw.githubusercontent.com/citusdata/docker/master/docker-compose.yml
    docker-compose -p citus up -d
  3. Connect to the master database
    docker exec -it citus_master psql -U postgres
  4. Follow the first tutorial instructions
  5. To shut the cluster down, run
    docker-compose -p citus down

Talk to Contributors and Learn More

  • Documentation: Try the Citus tutorial for a hands-on introduction or the documentation for a more comprehensive reference.
  • Slack: Chat with us in our community Slack channel.
  • Github Issues: We track specific bug reports and feature requests on our project issues.
  • Twitter: Follow @citusdata for general updates and PostgreSQL scaling tips.
  • Citus Blog: Read our Citus Data Blog for posts on Postgres, Citus, and scaling your database.

Contributing

Citus is built on and of open source, and we welcome your contributions. The CONTRIBUTING.md file explains how to get started developing the Citus extension itself and our code quality guidelines.

Who is Using Citus?

Citus is deployed in production by many customers, ranging from technology start-ups to large enterprises. Here are some examples:

  • Algolia uses Citus to provide real-time analytics for over 1B searches per day. For faster insights, they also use TopN and HLL extensions. User Story
  • Heap uses Citus to run dynamic funnel, segmentation, and cohort queries across billions of users and has more than 700B events in their Citus database cluster. Watch Video
  • Pex uses Citus to ingest 80B data points per day and analyze that data in real-time. They use a 20+ node cluster on Google Cloud. User Story
  • MixRank uses Citus to efficiently collect and analyze vast amounts of data to allow inside B2B sales teams to find new customers. User Story
  • Agari uses Citus to secure more than 85 percent of U.S. consumer emails on two 6-8 TB clusters. User Story
  • Copper (formerly ProsperWorks) powers a cloud CRM service with Citus. User Story

You can read more user stories about how customers employ Citus to scale Postgres for both multi-tenant SaaS applications and real-time analytics dashboards here.


Copyright © Citus Data, Inc.