mirror of https://github.com/citusdata/citus.git
parent
07d3b4fd04
commit
62e0383150
|
@ -0,0 +1,324 @@
|
|||
# Introduction
|
||||
|
||||
Citus Columnar offers a per-table option for columnar storage to
|
||||
reduce IO requirements though compression and projection pushdown.
|
||||
|
||||
# Design Trade-Offs
|
||||
|
||||
Existing PostgreSQL row tables work well for OLTP:
|
||||
|
||||
* Support `UPDATE`/`DELETE` efficiently
|
||||
* Efficient single-tuple lookups
|
||||
|
||||
The Citus Columnar tables work best for analytic or DW workloads:
|
||||
|
||||
* Compression
|
||||
* Doesn't read unnecessary columns
|
||||
* Efficient `VACUUM`
|
||||
|
||||
# Limitations
|
||||
|
||||
* Append-only (no ``UPDATE``/``DELETE`` support)
|
||||
* No space reclamation (e.g. rolled-back transactions may still
|
||||
consume disk space)
|
||||
* No index support, index scans, or bitmap index scans
|
||||
* No tidscans
|
||||
* No sample scans
|
||||
* No TOAST support (large values supported inline)
|
||||
* No support for [``ON
|
||||
CONFLICT``](https://www.postgresql.org/docs/12/sql-insert.html#SQL-ON-CONFLICT)
|
||||
statements (except ``DO NOTHING`` actions with no target specified).
|
||||
* No support for tuple locks (``SELECT ... FOR SHARE``, ``SELECT
|
||||
... FOR UPDATE``)
|
||||
* No support for serializable isolation level
|
||||
* Support for PostgreSQL server versions 12+ only
|
||||
* No support for foreign keys, unique constraints, or exclusion
|
||||
constraints
|
||||
* No support for logical decoding
|
||||
* No support for intra-node parallel scans
|
||||
* No support for ``AFTER ... FOR EACH ROW`` triggers
|
||||
* No `UNLOGGED` columnar tables
|
||||
* No `TEMPORARY` columnar tables
|
||||
|
||||
Future iterations will incrementally lift the limitations listed above.
|
||||
|
||||
# User Experience
|
||||
|
||||
Create a Columnar table by specifying ``USING columnar`` when creating
|
||||
the table.
|
||||
|
||||
```sql
|
||||
CREATE TABLE my_columnar_table
|
||||
(
|
||||
id INT,
|
||||
i1 INT,
|
||||
i2 INT8,
|
||||
n NUMERIC,
|
||||
t TEXT
|
||||
) USING columnar;
|
||||
```
|
||||
|
||||
Insert data into the table and read from it like normal (subject to
|
||||
the limitations listed above).
|
||||
|
||||
To see internal statistics about the table, use ``VACUUM
|
||||
VERBOSE``. Note that ``VACUUM`` (without ``FULL``) is much faster on a
|
||||
columnar table, because it scans only the metadata, and not the actual
|
||||
data.
|
||||
|
||||
## Options
|
||||
|
||||
Set options using:
|
||||
|
||||
```sql
|
||||
alter_columnar_table_set(
|
||||
relid REGCLASS,
|
||||
chunk_row_count INT4 DEFAULT NULL,
|
||||
stripe_row_count INT4 DEFAULT NULL,
|
||||
compression NAME DEFAULT NULL,
|
||||
compression_level INT4)
|
||||
```
|
||||
|
||||
For example:
|
||||
|
||||
```sql
|
||||
SELECT alter_columnar_table_set(
|
||||
'my_columnar_table',
|
||||
compression => 'none',
|
||||
stripe_row_count => 10000);
|
||||
```
|
||||
|
||||
The following options are available:
|
||||
|
||||
* **compression**: `[none|pglz|zstd|lz4|lz4hc]` - set the compression type
|
||||
for _newly-inserted_ data. Existing data will not be
|
||||
recompressed/decompressed. The default value is `zstd` (if support
|
||||
has been compiled in).
|
||||
* **compression_level**: ``<integer>`` - Sets compression level. Valid
|
||||
settings are from 1 through 19. If the compression method does not
|
||||
support the level chosen, the closest level will be selected
|
||||
instead.
|
||||
* **stripe_row_count**: ``<integer>`` - the maximum number of rows per
|
||||
stripe for _newly-inserted_ data. Existing stripes of data will not
|
||||
be changed and may have more rows than this maximum value. The
|
||||
default value is `150000`.
|
||||
* **chunk_row_count**: ``<integer>`` - the maximum number of rows per
|
||||
chunk for _newly-inserted_ data. Existing chunks of data will not be
|
||||
changed and may have more rows than this maximum value. The default
|
||||
value is `10000`.
|
||||
|
||||
View options for all tables with:
|
||||
|
||||
```sql
|
||||
SELECT * FROM columnar.options;
|
||||
```
|
||||
|
||||
You can also adjust options with a `SET` command of one of the
|
||||
following GUCs:
|
||||
|
||||
* `columnar.compression`
|
||||
* `columnar.compression_level`
|
||||
* `columnar.stripe_row_count`
|
||||
* `columnar.chunk_row_count`
|
||||
|
||||
GUCs only affect newly-created *tables*, not any newly-created
|
||||
*stripes* on an existing table.
|
||||
|
||||
## Partitioning
|
||||
|
||||
Columnar tables can be used as partitions; and a partitioned table may
|
||||
be made up of any combination of row and columnar partitions.
|
||||
|
||||
```sql
|
||||
CREATE TABLE parent(ts timestamptz, i int, n numeric, s text)
|
||||
PARTITION BY RANGE (ts);
|
||||
|
||||
-- columnar partition
|
||||
CREATE TABLE p0 PARTITION OF parent
|
||||
FOR VALUES FROM ('2020-01-01') TO ('2020-02-01')
|
||||
USING COLUMNAR;
|
||||
-- columnar partition
|
||||
CREATE TABLE p1 PARTITION OF parent
|
||||
FOR VALUES FROM ('2020-02-01') TO ('2020-03-01')
|
||||
USING COLUMNAR;
|
||||
-- row partition
|
||||
CREATE TABLE p2 PARTITION OF parent
|
||||
FOR VALUES FROM ('2020-03-01') TO ('2020-04-01');
|
||||
|
||||
INSERT INTO parent VALUES ('2020-01-15', 10, 100, 'one thousand'); -- columnar
|
||||
INSERT INTO parent VALUES ('2020-02-15', 20, 200, 'two thousand'); -- columnar
|
||||
INSERT INTO parent VALUES ('2020-03-15', 30, 300, 'three thousand'); -- row
|
||||
```
|
||||
|
||||
When performing operations on a partitioned table with a mix of row
|
||||
and columnar partitions, take note of the following behaviors for
|
||||
operations that are supported on row tables but not columnar
|
||||
(e.g. ``UPDATE``, ``DELETE``, tuple locks, etc.):
|
||||
|
||||
* If the operation is targeted at a specific row partition
|
||||
(e.g. ``UPDATE p2 SET i = i + 1``), it will succeed; if targeted at
|
||||
a specified columnar partition (e.g. ``UPDATE p1 SET i = i + 1``),
|
||||
it will fail.
|
||||
* If the operation is targeted at the partitioned table and has a
|
||||
``WHERE`` clause that excludes all columnar partitions
|
||||
(e.g. ``UPDATE parent SET i = i + 1 WHERE ts = '2020-03-15'``), it
|
||||
will succeed.
|
||||
* If the operation is targeted at the partitioned table, but does not
|
||||
exclude all columnar partitions, it will fail; even if the actual
|
||||
data to be updated only affects row tables (e.g. ``UPDATE parent SET
|
||||
i = i + 1 WHERE n = 300``).
|
||||
|
||||
Because columnar tables do not support indexes, it's impossible to
|
||||
create indexes on the partitioned table if some partitions are
|
||||
columnar. Instead, you must create indexes on the individual row
|
||||
partitions. Similarly for constraints that require indexes, e.g.:
|
||||
|
||||
```sql
|
||||
CREATE INDEX p2_ts_idx ON p2 (ts);
|
||||
CREATE UNIQUE INDEX p2_i_unique ON p2 (i);
|
||||
ALTER TABLE p2 ADD UNIQUE (n);
|
||||
```
|
||||
|
||||
## Converting Between Row and Columnar
|
||||
|
||||
Note: ensure that you understand any advanced features that may be
|
||||
used with the table before converting it (e.g. row-level security,
|
||||
storage options, constraints, inheritance, etc.), and ensure that they
|
||||
are reproduced in the new table or partition appropriately. ``LIKE``,
|
||||
used below, is a shorthand that works only in simple cases.
|
||||
|
||||
```sql
|
||||
CREATE TABLE my_table(i INT8 DEFAULT '7');
|
||||
INSERT INTO my_table VALUES(1);
|
||||
-- convert to columnar
|
||||
SELECT alter_table_set_access_method('my_table', 'columnar');
|
||||
-- back to row
|
||||
SELECT alter_table_set_access_method('my_table', 'heap');
|
||||
```
|
||||
|
||||
# Performance Microbenchmark
|
||||
|
||||
*Important*: This microbenchmark is not intended to represent any real
|
||||
workload. Compression ratios, and therefore performance, will depend
|
||||
heavily on the specific workload. This is only for the purpose of
|
||||
illustrating a "columnar friendly" contrived workload that showcases
|
||||
the benefits of columnar.
|
||||
|
||||
## Schema
|
||||
|
||||
```sql
|
||||
CREATE TABLE perf_row(
|
||||
id INT8,
|
||||
ts TIMESTAMPTZ,
|
||||
customer_id INT8,
|
||||
vendor_id INT8,
|
||||
name TEXT,
|
||||
description TEXT,
|
||||
value NUMERIC,
|
||||
quantity INT4
|
||||
);
|
||||
|
||||
CREATE TABLE perf_columnar(LIKE perf_row) USING COLUMNAR;
|
||||
```
|
||||
|
||||
## Data
|
||||
|
||||
```sql
|
||||
CREATE OR REPLACE FUNCTION random_words(n INT4) RETURNS TEXT LANGUAGE plpython2u AS $$
|
||||
import random
|
||||
t = ''
|
||||
words = ['zero','one','two','three','four','five','six','seven','eight','nine','ten']
|
||||
for i in xrange(0,n):
|
||||
if (i != 0):
|
||||
t += ' '
|
||||
r = random.randint(0,len(words)-1)
|
||||
t += words[r]
|
||||
return t
|
||||
$$;
|
||||
```
|
||||
|
||||
```sql
|
||||
INSERT INTO perf_row
|
||||
SELECT
|
||||
g, -- id
|
||||
'2020-01-01'::timestamptz + ('1 minute'::interval * g), -- ts
|
||||
(random() * 1000000)::INT4, -- customer_id
|
||||
(random() * 100)::INT4, -- vendor_id
|
||||
random_words(7), -- name
|
||||
random_words(100), -- description
|
||||
(random() * 100000)::INT4/100.0, -- value
|
||||
(random() * 100)::INT4 -- quantity
|
||||
FROM generate_series(1,75000000) g;
|
||||
|
||||
INSERT INTO perf_columnar SELECT * FROM perf_row;
|
||||
```
|
||||
|
||||
## Compression Ratio
|
||||
|
||||
```
|
||||
=> SELECT pg_total_relation_size('perf_row')::numeric/pg_total_relation_size('perf_columnar') AS compression_ratio;
|
||||
compression_ratio
|
||||
--------------------
|
||||
5.4080768380134124
|
||||
(1 row)
|
||||
```
|
||||
|
||||
The overall compression ratio of columnar table, versus the same data
|
||||
stored with row storage, is **5.4X**.
|
||||
|
||||
```
|
||||
=> VACUUM VERBOSE perf_row;
|
||||
INFO: vacuuming "public.perf_row"
|
||||
INFO: "perf_row": found 0 removable, 10 nonremovable row versions in 1 out of 5769231 pages
|
||||
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3110
|
||||
There were 0 unused item identifiers.
|
||||
Skipped 0 pages due to buffer pins, 5769230 frozen pages.
|
||||
0 pages are entirely empty.
|
||||
CPU: user: 0.10 s, system: 0.05 s, elapsed: 0.26 s.
|
||||
INFO: vacuuming "pg_toast.pg_toast_17133"
|
||||
INFO: index "pg_toast_17133_index" now contains 0 row versions in 1 pages
|
||||
DETAIL: 0 index row versions were removed.
|
||||
0 index pages have been deleted, 0 are currently reusable.
|
||||
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
|
||||
INFO: "pg_toast_17133": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
|
||||
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3110
|
||||
There were 0 unused item identifiers.
|
||||
Skipped 0 pages due to buffer pins, 0 frozen pages.
|
||||
0 pages are entirely empty.
|
||||
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
|
||||
|
||||
=> VACUUM VERBOSE perf_columnar;
|
||||
INFO: statistics for "perf_columnar":
|
||||
storage id: 10000000020
|
||||
total file size: 8741486592, total data size: 8714771176
|
||||
compression rate: 4.96x
|
||||
total row count: 75000000, stripe count: 501, average rows per stripe: 149700
|
||||
chunk count: 60000, containing data for dropped columns: 0, zstd compressed: 60000
|
||||
```
|
||||
|
||||
``VACUUM VERBOSE`` reports a smaller compression ratio, because it
|
||||
only averages the compression ratio of the individual chunks, and does
|
||||
not account for the metadata savings of the columnar format.
|
||||
|
||||
## System
|
||||
|
||||
* 16GB physical memory
|
||||
* 128MB PG shared buffers
|
||||
|
||||
Note: because this was run on a system with enough physical memory to
|
||||
hold a substantial fraction of the table, the IO benefits of columnar
|
||||
won't be entirely realized by the query runtime unless the data size
|
||||
is substantially increased.
|
||||
|
||||
## Query
|
||||
|
||||
```sql
|
||||
-- OFFSET 1000 so that no rows are returned, and we collect only timings
|
||||
SELECT vendor_id, SUM(quantity) FROM perf_row GROUP BY vendor_id OFFSET 1000;
|
||||
SELECT vendor_id, SUM(quantity) FROM perf_columnar GROUP BY vendor_id OFFSET 1000;
|
||||
```
|
||||
|
||||
Timing (median of three runs):
|
||||
* row: 201700ms
|
||||
* columnar: 14202ms
|
||||
* speedup: **14X**
|
Loading…
Reference in New Issue