Distributed PostgreSQL as an extension
 
 
 
 
 
 
Go to file
SaitTalhaNisanci 7ff4ce2169
Add adaptive executor support for repartition joins (#3169)
* WIP

* wip

* add basic logic to run a single job with repartioning joins with adaptive executor

* fix some warnings and return in ExecuteDependedTasks if there is none

* Add the logic to run depended jobs in adaptive executor

The execution of depended tasks logic is changed. With the current
logic:
- All tasks are created from the top level task list.
- At one iteration:
	- CurTasks whose dependencies are executed are found.
	- CurTasks are executed in parallel with adapter executor main
logic.
- The iteration is repeated until all tasks are completed.

* Separate adaptive executor repartioning logic

* Remove duplicate parts

* cleanup directories and schemas

* add basic repartion tests for adaptive executor

* Use the first placement to fetch data

In task tracker, when there are replicas, we try to fetch from a replica
for which a map task is succeeded. TaskExecution is used for this,
however TaskExecution is not used in adaptive executor. So we cannot use
the same thing as task tracker.

Since adaptive executor fails when a map task fails (There is no retry
logic yet). We know that if we try to execute a fetch task, all of its
map tasks already succeeded, so we can just use the first one to fetch
from.

* fix clean directories logic

* do not change the search path while creating a udf

* Enable repartition joins with adaptive executor with only enable_reparitition_joins guc

* Add comments to adaptive_executor_repartition

* dont run adaptive executor repartition test in paralle with other tests

* execute cleanup only in the top level execution

* do cleanup only in the top level ezecution

* not begin a transaction if repartition query is used

* use new connections for repartititon specific queries

New connections are opened to send repartition specific queries. The
opened connections will be closed at the FinishDistributedExecution.

While sending repartition queries no transaction is begun so that
we can see all changes.

* error if a modification was done prior to repartition execution

* not start a transaction if a repartition query and sql task, and clean temporary files and schemas at each subplan level

* fix cleanup logic

* update tests

* add missing function comments

* add test for transaction with DDL before repartition query

* do not close repartition connections in adaptive executor

* rollback instead of commit in repartition join test

* use close connection instead of shutdown connection

* remove unnecesary connection list, ensure schema owner before removing directory

* rename ExecuteTaskListRepartition

* put fetch query string in planner not executor as we currently support only replication factor = 1 with adaptive executor and repartition query and we know the query string in the planner phase in that case

* split adaptive executor repartition to DAG execution logic and repartition logic

* apply review items

* apply review items

* use an enum for remote transaction state and fix cleanup for repartition

* add outside transaction flag to find connections that are unclaimed instead of always opening a new transaction

* fix style

* wip

* rename removejobdir to partition cleanup

* do not close connections at the end of repartition queries

* do repartition cleanup in pg catch

* apply review items

* decide whether to use transaction or not at execution creation

* rename isOutsideTransaction and add missing comment

* not error in pg catch while doing cleanup

* use replication factor of the creation time, not current time to decide if task tracker should be chosen

* apply review items

* apply review items

* apply review item
2019-12-17 19:09:45 +03:00
.circleci Strip trailing whitespace and add final newline (#3186) 2019-11-21 14:25:37 +01:00
.github Add DESCRIPTION to PR template 2018-12-12 05:35:12 +01:00
ci Add a script that fixes style related things (#3234) 2019-12-09 14:23:53 +03:00
config Add citus_version(), analogous to PG's version() 2017-10-16 18:09:29 -06:00
src Add adaptive executor support for repartition joins (#3169) 2019-12-17 19:09:45 +03:00
.codecov.yml Update .codecov.yml after moving ruleutils files 2019-11-16 14:25:35 +01:00
.editorconfig Fix editorconfig syntax (#3272) 2019-12-06 17:05:04 +01:00
.gitattributes Move C files into the appropriate directory 2019-11-16 11:36:17 +01:00
.gitignore Ignore .vscode (#2969) 2019-09-18 17:08:22 +03:00
CHANGELOG.md Update CHANGELOG.md (#3241) 2019-11-28 16:13:58 +03:00
CONTRIBUTING.md update contributing (#3284) 2019-12-11 20:55:21 +03:00
LICENSE Strip trailing whitespace and add final newline (#3186) 2019-11-21 14:25:37 +01:00
Makefile Add a script that fixes style related things (#3234) 2019-12-09 14:23:53 +03:00
Makefile.global.in add gitref to the output of citus_version (#3246) 2019-11-29 15:54:09 +01:00
README.md add circleci build status (#3310) (#3309) 2019-12-16 19:25:32 +03:00
aclocal.m4 Basic usage statistics collection. (#1656) 2017-10-11 09:55:15 -04:00
autogen.sh Changed product name to citus 2016-02-15 16:04:31 +02:00
configure add gitref to the output of citus_version (#3246) 2019-11-29 15:54:09 +01:00
configure.in add gitref to the output of citus_version (#3246) 2019-11-29 15:54:09 +01:00
github-banner.png Readme for 5.0 2016-03-18 13:32:13 -07:00
prep_buildtree Changed product name to citus 2016-02-15 16:04:31 +02:00

README.md

Citus Banner

Slack Status Latest Docs Circleci Status Code Coverage

What is Citus?

  • Open-source PostgreSQL extension (not a fork)
  • Built to scale out across multiple nodes
  • Distributed engine for query parallelization
  • Database designed to scale out multi-tenant applications, real-time analytics dashboards, and high-throughput transactional workloads

Citus is an open source extension to Postgres that distributes your data and your queries across multiple nodes. Because Citus is an extension to Postgres, and not a fork, Citus gives developers and enterprises a scale-out database while keeping the power and familiarity of a relational database. As an extension, Citus supports new PostgreSQL releases, and allows you to benefit from new features while maintaining compatibility with existing PostgreSQL tools.

Citus serves many use cases. Three common ones are:

  1. Multi-tenant & SaaS applications: Most B2B applications already have the notion of a tenant / customer / account built into their data model. Citus allows you to scale out your transactional relational database to 100K+ tenants with minimal changes to your application.

  2. Real-time analytics: Citus enables ingesting large volumes of data and running analytical queries on that data in human real-time. Example applications include analytic dashboards with sub-second response times and exploratory queries on unfolding events.

  3. High-throughput transactional workloads: By distributing your workload across a database cluster, Citus ensures low latency and high performance even with a large number of concurrent users and high volumes of transactions.

To learn more, visit citusdata.com and join the Citus slack to stay on top of the latest developments.

Getting started with Citus

The fastest way to get up and running is to deploy Citus in the cloud. You can also setup a local Citus database cluster with Docker.

Hyperscale (Citus) on Azure Database for PostgreSQL

Hyperscale (Citus) is a deployment option on Azure Database for PostgreSQL, a fully-managed database as a service. Hyperscale (Citus) employs the Citus open source extension so you can scale out across multiple nodes. To get started with Hyperscale (Citus), learn more on the Citus website or use the Hyperscale (Citus) Quickstart in the Azure docs.

Citus Cloud

Citus Cloud runs on top of AWS as a fully managed database as a service. You can provision a Citus Cloud account at https://console.citusdata.com and get started with just a few clicks.

Local Citus Cluster

If you're looking to get started locally, you can follow the following steps to get up and running.

  1. Install Docker Community Edition and Docker Compose
  • Mac:
    1. Download and install Docker.
    2. Start Docker by clicking on the applications icon.
  • Linux:
    curl -sSL https://get.docker.com/ | sh
    sudo usermod -aG docker $USER && exec sg docker newgrp `id -gn`
    sudo systemctl start docker
    
    sudo curl -sSL https://github.com/docker/compose/releases/download/1.11.2/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
    sudo chmod +x /usr/local/bin/docker-compose
    
    The above version of Docker Compose is sufficient for running Citus, or you can install the latest version.
  1. Pull and start the Docker images
curl -sSLO https://raw.githubusercontent.com/citusdata/docker/master/docker-compose.yml
docker-compose -p citus up -d
  1. Connect to the master database
docker exec -it citus_master psql -U postgres
  1. Follow the first tutorial instructions
  2. To shut the cluster down, run
docker-compose -p citus down

Talk to Contributors and Learn More

Documentation Try the Citus tutorial for a hands-on introduction or
the documentation for a more comprehensive reference.
Slack Chat with us in our community Slack channel.
Github Issues We track specific bug reports and feature requests on our project issues.
Twitter Follow @citusdata for general updates and PostgreSQL scaling tips.
Citus Blog Read our Citus Data Blog for posts on Postgres, Citus, and scaling your database.

Contributing

Citus is built on and of open source, and we welcome your contributions. The CONTRIBUTING.md file explains how to get started developing the Citus extension itself and our code quality guidelines.

Who is Using Citus?

Citus is deployed in production by many customers, ranging from technology start-ups to large enterprises. Here are some examples:

  • Algolia uses Citus to provide real-time analytics for over 1B searches per day. For faster insights, they also use TopN and HLL extensions. User Story
  • Heap uses Citus to run dynamic funnel, segmentation, and cohort queries across billions of users and has more than 700B events in their Citus database cluster. Watch Video
  • Pex uses Citus to ingest 80B data points per day and analyze that data in real-time. They use a 20+ node cluster on Google Cloud. User Story
  • MixRank uses Citus to efficiently collect and analyze vast amounts of data to allow inside B2B sales teams to find new customers. User Story
  • Agari uses Citus to secure more than 85 percent of U.S. consumer emails on two 6-8 TB clusters. User Story
  • Copper (formerly ProsperWorks) powers a cloud CRM service with Citus. User Story

You can read more user stories about how they employ Citus to scale Postgres for both multi-tenant SaaS applications as well as real-time analytics dashboards here.


Copyright © Citus Data, Inc.