citus

Distributed PostgreSQL as an extension

citus citus-extension database database-cluster distributed-database multi-tenant postgres postgresql relational-database scale sharding sql


Markus Sintonen cdedb98c54 Improve shard pruning logic to understand OR-conditions.	2020-02-14 17:58:13 +00:00

Previously a limitation in the shard pruning logic caused multi distribution value queries to always go to all the shards/workers whenever the query also used OR conditions in the WHERE clause.

Related to https://github.com/citusdata/citus/issues/2593 and https://github.com/citusdata/citus/issues/1537

There was no good workaround for this limitation. The limitation caused quite a bit of overhead, with simple queries being sent to all workers/shards (especially in setups having a lot of workers/shards).

An example of a previous plan which was inadequately pruned:

```
EXPLAIN SELECT count(*) FROM orders_hash_partitioned
	WHERE (o_orderkey IN (1,2)) AND (o_custkey = 11 OR o_custkey = 22);
                                  QUERY PLAN
---------------------------------------------------------------------
 Aggregate  (cost=0.00..0.00 rows=0 width=0)
   ->  Custom Scan (Citus Adaptive)  (cost=0.00..0.00 rows=0 width=0)
         Task Count: 4
         Tasks Shown: One of 4
         ->  Task
               Node: host=localhost port=xxxxx dbname=regression
               ->  Aggregate  (cost=13.68..13.69 rows=1 width=8)
                     ->  Seq Scan on orders_hash_partitioned_630000 orders_hash_partitioned  (cost=0.00..13.68 rows=1 width=0)
                           Filter: ((o_orderkey = ANY ('{1,2}'::integer[])) AND ((o_custkey = 11) OR (o_custkey = 22)))
(9 rows)
```

After this commit the task count is what one would expect from a query defining multiple distinct values for the distribution column:

```
EXPLAIN SELECT count(*) FROM orders_hash_partitioned
	WHERE (o_orderkey IN (1,2)) AND (o_custkey = 11 OR o_custkey = 22);
                                  QUERY PLAN
---------------------------------------------------------------------
 Aggregate  (cost=0.00..0.00 rows=0 width=0)
   ->  Custom Scan (Citus Adaptive)  (cost=0.00..0.00 rows=0 width=0)
         Task Count: 2
         Tasks Shown: One of 2
         ->  Task
               Node: host=localhost port=xxxxx dbname=regression
               ->  Aggregate  (cost=13.68..13.69 rows=1 width=8)
                     ->  Seq Scan on orders_hash_partitioned_630000 orders_hash_partitioned  (cost=0.00..13.68 rows=1 width=0)
                           Filter: ((o_orderkey = ANY ('{1,2}'::integer[])) AND ((o_custkey = 11) OR (o_custkey = 22)))
(9 rows)
```

The "core" of the pruning logic works as before: it uses `PrunableInstances` to queue ORable valid constraints for shard pruning. The difference is that we now build a compact internal representation of the query expression tree with PruningTreeNodes before the actual shard pruning is run. Pruning tree nodes represent boolean operators and their associated constraints. This internal format gives us a compact representation of the query's WHERE clauses, which allows the "core" pruning logic to work correctly with OR-clauses.

For example, a query having `WHERE (o_orderkey IN (1,2)) AND (o_custkey=11 OR (o_shippriority > 1 AND o_shippriority < 10))` gets transformed into:

1. AND(o_orderkey IN (1,2), OR(X, AND(X, X)))
2. AND(o_orderkey IN (1,2), OR(X, X))
3. AND(o_orderkey IN (1,2), X)

Here X is any set of unknown condition(s) for shard pruning. This allows the final shard pruning to correctly recognize that shard pruning is done with the valid condition of `o_orderkey IN (1,2)`.

Another example, with an unprunable condition: a query having `WHERE (o_orderkey IN (1,2)) OR (o_custkey=11 AND o_custkey=22)` gets transformed into:

1. OR(o_orderkey IN (1,2), AND(X, X))
2. OR(o_orderkey IN (1,2), X)

This is recognized as unprunable due to the OR condition between the distribution column and an unknown constraint, so the query goes to all shards.

Issue https://github.com/citusdata/citus/issues/1537 originally suggested transforming the query conditions into a full disjunctive normal form (DNF), but transforming into DNF is quite a heavy operation. It may "blow up" into a really large DNF form for complex queries having non-trivial `WHERE` clauses. I think the logic for shard pruning could be simplified further, but I decided to leave the "core" of the shard pruning untouched.
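
The reduction described in the commit message can be illustrated with a small, self-contained sketch. This is not the Citus implementation (which is written in C around `PrunableInstances` and PruningTreeNodes); the names `Constraint`, `Unknown`, `BoolNode`, `simplify`, `can_prune`, and `render` are hypothetical and exist only for this illustration. The sketch assumes a condition is either usable for pruning (a constraint on the distribution column) or unknown (X), and applies two rules: a subtree made only of unknowns collapses to a single X, and an AND is prunable if any child is prunable while an OR is prunable only if every child is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unknown:
    """Any condition that cannot be used for shard pruning (the X above)."""


@dataclass(frozen=True)
class Constraint:
    """A condition on the distribution column that pruning can use."""
    text: str


@dataclass(frozen=True)
class BoolNode:
    """A boolean operator ("AND" or "OR") with its child expressions."""
    op: str
    children: tuple


X = Unknown()


def simplify(node):
    """Collapse unknown-only subtrees and merge repeated unknowns into one X."""
    if not isinstance(node, BoolNode):
        return node
    children = [simplify(child) for child in node.children]
    known = [child for child in children if not isinstance(child, Unknown)]
    if not known:
        return X  # e.g. AND(X, X) -> X and OR(X, X) -> X
    if len(known) < len(children):
        known.append(X)  # keep a single X to remember the unknown part
    return BoolNode(node.op, tuple(known))


def can_prune(node):
    """AND needs one prunable child; OR needs every child to be prunable."""
    if isinstance(node, Unknown):
        return False
    if isinstance(node, Constraint):
        return True
    results = [can_prune(child) for child in node.children]
    return any(results) if node.op == "AND" else all(results)


def render(node):
    """Print a tree in the AND(..., OR(...)) notation used in the commit message."""
    if isinstance(node, Unknown):
        return "X"
    if isinstance(node, Constraint):
        return node.text
    return f"{node.op}({', '.join(render(child) for child in node.children)})"


orderkey = Constraint("o_orderkey IN (1,2)")

# WHERE (o_orderkey IN (1,2)) AND (o_custkey=11 OR (o_shippriority > 1 AND o_shippriority < 10))
first = simplify(BoolNode("AND", (orderkey, BoolNode("OR", (X, BoolNode("AND", (X, X)))))))
print(render(first), can_prune(first))  # AND(o_orderkey IN (1,2), X) True

# WHERE (o_orderkey IN (1,2)) OR (o_custkey=11 AND o_custkey=22)
second = simplify(BoolNode("OR", (orderkey, BoolNode("AND", (X, X)))))
print(render(second), can_prune(second))  # OR(o_orderkey IN (1,2), X) False
```

The first tree stays prunable because an unknown condition under AND can only narrow the result, while the second falls back to all shards because an unknown branch under OR may match rows in any shard.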
.circleci	Actually check that test output normalization is applied in CI (#3358 )	2020-01-06 10:37:34 +01:00
.github	Add DESCRIPTION to PR template	2018-12-12 05:35:12 +01:00
ci	Ensure that only normalized test output is commited	2020-01-03 11:30:08 +01:00
config	Add citus_version(), analogous to PG's version()	2017-10-16 18:09:29 -06:00
src	Improve shard pruning logic to understand OR-conditions.	2020-02-14 17:58:13 +00:00
.codecov.yml	Update .codecov.yml after moving ruleutils files	2019-11-16 14:25:35 +01:00
.editorconfig	Fix editorconfig syntax (#3272 )	2019-12-06 17:05:04 +01:00
.gitattributes	Move C files into the appropriate directory	2019-11-16 11:36:17 +01:00
.gitignore	Ignore .vscode (#2969 )	2019-09-18 17:08:22 +03:00
CHANGELOG.md	Update CHANGELOG for 9.2.1 (#3501 )	2020-02-14 11:18:40 +03:00
CONTRIBUTING.md	update contributing (#3284 )	2019-12-11 20:55:21 +03:00
LICENSE	Strip trailing whitespace and add final newline (#3186 )	2019-11-21 14:25:37 +01:00
Makefile	Makefile fix DESTDIR together with cleanup (#3342 )	2019-12-27 10:34:57 +01:00
Makefile.global.in	add gitref to the output of citus_version (#3246 )	2019-11-29 15:54:09 +01:00
README.md	add circleci build status (#3310 ) (#3309 )	2019-12-16 19:25:32 +03:00
aclocal.m4	Basic usage statistics collection. (#1656 )	2017-10-11 09:55:15 -04:00
autogen.sh	Changed product name to citus	2016-02-15 16:04:31 +02:00
configure	Bump citus version to 9.3devel (#3482 )	2020-02-13 16:22:05 +03:00
configure.in	Bump citus version to 9.3devel (#3482 )	2020-02-13 16:22:05 +03:00
github-banner.png	Readme for 5.0	2016-03-18 13:32:13 -07:00
prep_buildtree	Changed product name to citus	2016-02-15 16:04:31 +02:00

README.md

What is Citus?

Open-source PostgreSQL extension (not a fork)
Built to scale out across multiple nodes
Distributed engine for query parallelization
Database designed to scale out multi-tenant applications, real-time analytics dashboards, and high-throughput transactional workloads

Citus is an open source extension to Postgres that distributes your data and your queries across multiple nodes. Because Citus is an extension to Postgres, and not a fork, Citus gives developers and enterprises a scale-out database while keeping the power and familiarity of a relational database. As an extension, Citus supports new PostgreSQL releases, and allows you to benefit from new features while maintaining compatibility with existing PostgreSQL tools.

Citus serves many use cases. Three common ones are:

Multi-tenant & SaaS applications: Most B2B applications already have the notion of a tenant / customer / account built into their data model. Citus allows you to scale out your transactional relational database to 100K+ tenants with minimal changes to your application.
Real-time analytics: Citus enables ingesting large volumes of data and running analytical queries on that data in human real-time. Example applications include analytic dashboards with sub-second response times and exploratory queries on unfolding events.
High-throughput transactional workloads: By distributing your workload across a database cluster, Citus ensures low latency and high performance even with a large number of concurrent users and high volumes of transactions.

To learn more, visit citusdata.com and join the Citus Slack to stay on top of the latest developments.

Getting started with Citus

The fastest way to get up and running is to deploy Citus in the cloud. You can also set up a local Citus database cluster with Docker.

Hyperscale (Citus) on Azure Database for PostgreSQL

Hyperscale (Citus) is a deployment option on Azure Database for PostgreSQL, a fully-managed database as a service. Hyperscale (Citus) employs the Citus open source extension so you can scale out across multiple nodes. To get started with Hyperscale (Citus), learn more on the Citus website or use the Hyperscale (Citus) Quickstart in the Azure docs.

Citus Cloud

Citus Cloud runs on top of AWS as a fully managed database as a service. You can provision a Citus Cloud account at https://console.citusdata.com and get started with just a few clicks.

Local Citus Cluster

If you're looking to get started locally, follow these steps to get up and running.

Install Docker Community Edition and Docker Compose

Mac:
1. Download and install Docker.
2. Start Docker by clicking on the application’s icon.

Linux:

curl -sSL https://get.docker.com/ | sh
sudo usermod -aG docker $USER && exec sg docker newgrp `id -gn`
sudo systemctl start docker

sudo curl -sSL https://github.com/docker/compose/releases/download/1.11.2/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

The above version of Docker Compose is sufficient for running Citus, or you can install the latest version.

Pull and start the Docker images

curl -sSLO https://raw.githubusercontent.com/citusdata/docker/master/docker-compose.yml
docker-compose -p citus up -d

Connect to the master database

docker exec -it citus_master psql -U postgres

Follow the first tutorial instructions
To shut the cluster down, run

docker-compose -p citus down

Talk to Contributors and Learn More

Documentation	Try the Citus tutorial for a hands-on introduction or the documentation for a more comprehensive reference.
Slack	Chat with us in our community Slack channel.
GitHub Issues	We track specific bug reports and feature requests on our project issues.
Twitter	Follow @citusdata for general updates and PostgreSQL scaling tips.
Citus Blog	Read our Citus Data Blog for posts on Postgres, Citus, and scaling your database.

Contributing

Citus is built on and of open source, and we welcome your contributions. The CONTRIBUTING.md file explains how to get started developing the Citus extension itself and our code quality guidelines.

Who is Using Citus?

Citus is deployed in production by many customers, ranging from technology start-ups to large enterprises. Here are some examples:

Algolia uses Citus to provide real-time analytics for over 1B searches per day. For faster insights, they also use TopN and HLL extensions. User Story
Heap uses Citus to run dynamic funnel, segmentation, and cohort queries across billions of users and has more than 700B events in their Citus database cluster. Watch Video
Pex uses Citus to ingest 80B data points per day and analyze that data in real-time. They use a 20+ node cluster on Google Cloud. User Story
MixRank uses Citus to efficiently collect and analyze vast amounts of data to allow inside B2B sales teams to find new customers. User Story
Agari uses Citus to secure more than 85 percent of U.S. consumer emails on two 6-8 TB clusters. User Story
Copper (formerly ProsperWorks) powers a cloud CRM service with Citus. User Story

You can read more user stories about how customers employ Citus to scale Postgres for both multi-tenant SaaS applications and real-time analytics dashboards here.
