# Introduction

Citus Columnar offers a per-table option for columnar storage to
reduce IO requirements through compression and projection pushdown.

# Design Trade-Offs

Existing PostgreSQL row tables work well for OLTP:

* Support `UPDATE`/`DELETE` efficiently
* Efficient single-tuple lookups

Citus Columnar tables work best for analytic or DW workloads:

* Compression
* Doesn't read unnecessary columns
* Efficient `VACUUM`

# Limitations

* Append-only (no ``UPDATE``/``DELETE`` support)
* No space reclamation (e.g. rolled-back transactions may still
  consume disk space)
* No index support, index scans, or bitmap index scans
* No TID scans
* No sample scans
* No TOAST support (large values supported inline)
* No support for [``ON
  CONFLICT``](https://www.postgresql.org/docs/12/sql-insert.html#SQL-ON-CONFLICT)
  statements (except ``DO NOTHING`` actions with no target specified).
* No support for tuple locks (``SELECT ... FOR SHARE``, ``SELECT
  ... FOR UPDATE``)
* No support for serializable isolation level
* Support for PostgreSQL server versions 12+ only
* No support for foreign keys, unique constraints, or exclusion
  constraints
* No support for logical decoding
* No support for intra-node parallel scans
* No support for ``AFTER ... FOR EACH ROW`` triggers
* No `UNLOGGED` columnar tables
* No `TEMPORARY` columnar tables

Future iterations will incrementally lift the limitations listed above.

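For example, a quick way to see the append-only limitation in practice (the table and values here are illustrative, not from the original document):

```sql
-- Hypothetical table for demonstration purposes only.
CREATE TABLE events (id INT, payload TEXT) USING columnar;
INSERT INTO events VALUES (1, 'hello');  -- appends are supported

-- Both of these fail, because columnar tables are append-only
-- (the exact error message depends on the version):
UPDATE events SET payload = 'changed' WHERE id = 1;
DELETE FROM events WHERE id = 1;
```
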
# User Experience

Create a Columnar table by specifying ``USING columnar`` when creating
the table.

```sql
CREATE TABLE my_columnar_table
(
    id INT,
    i1 INT,
    i2 INT8,
    n NUMERIC,
    t TEXT
) USING columnar;
```

Insert data into the table and read from it like normal (subject to
the limitations listed above).

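For instance, a minimal load-and-query round trip on the table above (the generated values are arbitrary, chosen only for illustration):

```sql
-- Bulk loads compress best, but single-row inserts also work.
INSERT INTO my_columnar_table
SELECT g, g % 10, g * 2, g / 3.0, 'value_' || g
FROM generate_series(1, 100000) g;

SELECT COUNT(*), MAX(i2) FROM my_columnar_table;
```
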
To see internal statistics about the table, use ``VACUUM
VERBOSE``. Note that ``VACUUM`` (without ``FULL``) is much faster on a
columnar table, because it scans only the metadata, and not the actual
data.

## Options

Set options using:

```sql
alter_columnar_table_set(
    relid REGCLASS,
    chunk_row_count INT4 DEFAULT NULL,
    stripe_row_count INT4 DEFAULT NULL,
    compression NAME DEFAULT NULL,
    compression_level INT4)
```

For example:

```sql
SELECT alter_columnar_table_set(
    'my_columnar_table',
    compression => 'none',
    stripe_row_count => 10000);
```

The following options are available:

* **compression**: `[none|pglz|zstd|lz4|lz4hc]` - set the compression type
  for _newly-inserted_ data. Existing data will not be
  recompressed/decompressed. The default value is `zstd` (if support
  has been compiled in).
* **compression_level**: ``<integer>`` - Sets compression level. Valid
  settings are from 1 through 19. If the compression method does not
  support the level chosen, the closest level will be selected
  instead.
* **stripe_row_count**: ``<integer>`` - the maximum number of rows per
  stripe for _newly-inserted_ data. Existing stripes of data will not
  be changed and may have more rows than this maximum value. The
  default value is `150000`.
* **chunk_row_count**: ``<integer>`` - the maximum number of rows per
  chunk for _newly-inserted_ data. Existing chunks of data will not be
  changed and may have more rows than this maximum value. The default
  value is `10000`.

View options for all tables with:

```sql
SELECT * FROM columnar.options;
```

You can also adjust options with a `SET` command of one of the
following GUCs:

* `columnar.compression`
* `columnar.compression_level`
* `columnar.stripe_row_count`
* `columnar.chunk_row_count`

GUCs only affect newly-created *tables*, not any newly-created
*stripes* on an existing table.

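For example, a session-level `SET` ahead of table creation (a sketch; the table name is illustrative):

```sql
SET columnar.compression TO 'pglz';

-- A table created now picks up the session's columnar defaults.
CREATE TABLE pglz_compressed (i INT, t TEXT) USING columnar;
```
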
## Partitioning

Columnar tables can be used as partitions; and a partitioned table may
be made up of any combination of row and columnar partitions.

```sql
CREATE TABLE parent(ts timestamptz, i int, n numeric, s text)
  PARTITION BY RANGE (ts);

-- columnar partition
CREATE TABLE p0 PARTITION OF parent
  FOR VALUES FROM ('2020-01-01') TO ('2020-02-01')
  USING columnar;
-- columnar partition
CREATE TABLE p1 PARTITION OF parent
  FOR VALUES FROM ('2020-02-01') TO ('2020-03-01')
  USING columnar;
-- row partition
CREATE TABLE p2 PARTITION OF parent
  FOR VALUES FROM ('2020-03-01') TO ('2020-04-01');

INSERT INTO parent VALUES ('2020-01-15', 10, 100, 'one thousand'); -- columnar
INSERT INTO parent VALUES ('2020-02-15', 20, 200, 'two thousand'); -- columnar
INSERT INTO parent VALUES ('2020-03-15', 30, 300, 'three thousand'); -- row
```

When performing operations on a partitioned table with a mix of row
and columnar partitions, take note of the following behaviors for
operations that are supported on row tables but not columnar
(e.g. ``UPDATE``, ``DELETE``, tuple locks, etc.):

* If the operation is targeted at a specific row partition
  (e.g. ``UPDATE p2 SET i = i + 1``), it will succeed; if targeted at
  a specific columnar partition (e.g. ``UPDATE p1 SET i = i + 1``),
  it will fail.
* If the operation is targeted at the partitioned table and has a
  ``WHERE`` clause that excludes all columnar partitions
  (e.g. ``UPDATE parent SET i = i + 1 WHERE ts = '2020-03-15'``), it
  will succeed.
* If the operation is targeted at the partitioned table but does not
  exclude all columnar partitions, it will fail, even if the data
  actually being updated lives only in row partitions
  (e.g. ``UPDATE parent SET i = i + 1 WHERE n = 300``).

Because columnar tables do not support indexes, it's impossible to
create indexes on the partitioned table if some partitions are
columnar. Instead, you must create indexes on the individual row
partitions. The same applies to constraints that require indexes, e.g.:

```sql
CREATE INDEX p2_ts_idx ON p2 (ts);
CREATE UNIQUE INDEX p2_i_unique ON p2 (i);
ALTER TABLE p2 ADD UNIQUE (n);
```

## Converting Between Row and Columnar

Note: ensure that you understand any advanced features that may be
used with the table before converting it (e.g. row-level security,
storage options, constraints, inheritance, etc.), and ensure that they
are reproduced in the new table or partition appropriately. ``LIKE``,
used below, is a shorthand that works only in simple cases.

```sql
CREATE TABLE my_table(i INT8 DEFAULT '7');
INSERT INTO my_table VALUES(1);
-- convert to columnar
SELECT alter_table_set_access_method('my_table', 'columnar');
-- back to row
SELECT alter_table_set_access_method('my_table', 'heap');
```

# Performance Microbenchmark

*Important*: This microbenchmark is not intended to represent any real
workload. Compression ratios, and therefore performance, will depend
heavily on the specific workload. This is only for the purpose of
illustrating a "columnar friendly" contrived workload that showcases
the benefits of columnar.

## Schema

```sql
CREATE TABLE perf_row(
    id INT8,
    ts TIMESTAMPTZ,
    customer_id INT8,
    vendor_id INT8,
    name TEXT,
    description TEXT,
    value NUMERIC,
    quantity INT4
);

CREATE TABLE perf_columnar(LIKE perf_row) USING columnar;
```

## Data

```sql
CREATE OR REPLACE FUNCTION random_words(n INT4) RETURNS TEXT LANGUAGE plpython2u AS $$
import random
t = ''
words = ['zero','one','two','three','four','five','six','seven','eight','nine','ten']
for i in xrange(0,n):
    if (i != 0):
        t += ' '
    r = random.randint(0,len(words)-1)
    t += words[r]
return t
$$;
```

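On installations without Python 2 support (``plpython2u``), an equivalent definition under ``plpython3u`` would look like the following. This is an adaptation for convenience, not the function used to produce the numbers below:

```sql
-- plpython3u adaptation of random_words (not the original benchmark code).
CREATE OR REPLACE FUNCTION random_words(n INT4) RETURNS TEXT LANGUAGE plpython3u AS $$
import random
words = ['zero','one','two','three','four','five','six','seven','eight','nine','ten']
return ' '.join(random.choice(words) for _ in range(n))
$$;
```
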
```sql
INSERT INTO perf_row
    SELECT
        g, -- id
        '2020-01-01'::timestamptz + ('1 minute'::interval * g), -- ts
        (random() * 1000000)::INT4, -- customer_id
        (random() * 100)::INT4, -- vendor_id
        random_words(7), -- name
        random_words(100), -- description
        (random() * 100000)::INT4/100.0, -- value
        (random() * 100)::INT4 -- quantity
    FROM generate_series(1,75000000) g;

INSERT INTO perf_columnar SELECT * FROM perf_row;
```

## Compression Ratio

```
=> SELECT pg_total_relation_size('perf_row')::numeric/pg_total_relation_size('perf_columnar') AS compression_ratio;
 compression_ratio
--------------------
 5.4080768380134124
(1 row)
```

The overall compression ratio of the columnar table, versus the same data
stored with row storage, is **5.4X**.

```
=> VACUUM VERBOSE perf_row;
INFO: vacuuming "public.perf_row"
INFO: "perf_row": found 0 removable, 10 nonremovable row versions in 1 out of 5769231 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3110
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 5769230 frozen pages.
0 pages are entirely empty.
CPU: user: 0.10 s, system: 0.05 s, elapsed: 0.26 s.
INFO: vacuuming "pg_toast.pg_toast_17133"
INFO: index "pg_toast_17133_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_toast_17133": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3110
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.

=> VACUUM VERBOSE perf_columnar;
INFO: statistics for "perf_columnar":
storage id: 10000000020
total file size: 8741486592, total data size: 8714771176
compression rate: 4.96x
total row count: 75000000, stripe count: 501, average rows per stripe: 149700
chunk count: 60000, containing data for dropped columns: 0, zstd compressed: 60000
```

``VACUUM VERBOSE`` reports a smaller compression ratio, because it
only averages the compression ratio of the individual chunks, and does
not account for the metadata savings of the columnar format.

``VACUUM VERBOSE`` reports a smaller compression ratio, because it
|
||||||
|
only averages the compression ratio of the individual chunks, and does
|
||||||
|
not account for the metadata savings of the columnar format.
|
||||||
|
|
||||||
|
## System

* 16GB physical memory
* 128MB PG shared buffers

Note: because this was run on a system with enough physical memory to
hold a substantial fraction of the table, the IO benefits of columnar
won't be entirely realized by the query runtime unless the data size
is substantially increased.

## Query

```sql
-- OFFSET 1000 so that no rows are returned, and we collect only timings
SELECT vendor_id, SUM(quantity) FROM perf_row GROUP BY vendor_id OFFSET 1000;
SELECT vendor_id, SUM(quantity) FROM perf_columnar GROUP BY vendor_id OFFSET 1000;
```

Timing (median of three runs):

* row: 201700ms
* columnar: 14202ms
* speedup: **14X**