From 62e0383150a134a289578bb72862e3942734b1eb Mon Sep 17 00:00:00 2001
From: jeff-davis
Date: Wed, 27 Jan 2021 09:33:35 -0800
Subject: [PATCH] Columnar readme. (#4585)

Co-authored-by: Jeff Davis
---
 src/backend/columnar/README.md | 324 +++++++++++++++++++++++++++++++++
 1 file changed, 324 insertions(+)
 create mode 100644 src/backend/columnar/README.md

diff --git a/src/backend/columnar/README.md b/src/backend/columnar/README.md
new file mode 100644
index 000000000..2940b2b1f
--- /dev/null
+++ b/src/backend/columnar/README.md
@@ -0,0 +1,324 @@

# Introduction

Citus Columnar offers a per-table option for columnar storage to
reduce IO requirements through compression and projection pushdown.

# Design Trade-Offs

Existing PostgreSQL row tables work well for OLTP:

* Support `UPDATE`/`DELETE` efficiently
* Efficient single-tuple lookups

Citus Columnar tables work best for analytic or data-warehouse (DW)
workloads:

* Compression
* Avoids reading unnecessary columns (projection pushdown)
* Efficient `VACUUM`

# Limitations

* Append-only (no ``UPDATE``/``DELETE`` support)
* No space reclamation (e.g. rolled-back transactions may still
  consume disk space)
* No index support, index scans, or bitmap index scans
* No TID scans
* No sample scans
* No TOAST support (large values supported inline)
* No support for [``ON
  CONFLICT``](https://www.postgresql.org/docs/12/sql-insert.html#SQL-ON-CONFLICT)
  statements (except ``DO NOTHING`` actions with no target specified).
* No support for tuple locks (``SELECT ... FOR SHARE``, ``SELECT
  ... FOR UPDATE``)
* No support for serializable isolation level
* Support for PostgreSQL server versions 12+ only
* No support for foreign keys, unique constraints, or exclusion
  constraints
* No support for logical decoding
* No support for intra-node parallel scans
* No support for ``AFTER ... FOR EACH ROW`` triggers
* No `UNLOGGED` columnar tables
* No `TEMPORARY` columnar tables

Future iterations will incrementally lift the limitations listed above.

# User Experience

Create a columnar table by specifying ``USING columnar`` when creating
the table.

```sql
CREATE TABLE my_columnar_table
(
    id INT,
    i1 INT,
    i2 INT8,
    n NUMERIC,
    t TEXT
) USING columnar;
```

Insert data into the table and read from it like normal (subject to
the limitations listed above).

To see internal statistics about the table, use ``VACUUM
VERBOSE``. Note that ``VACUUM`` (without ``FULL``) is much faster on a
columnar table, because it scans only the metadata, and not the actual
data.

## Options

Set options using:

```sql
alter_columnar_table_set(
    relid REGCLASS,
    chunk_row_count INT4 DEFAULT NULL,
    stripe_row_count INT4 DEFAULT NULL,
    compression NAME DEFAULT NULL,
    compression_level INT4)
```

For example:

```sql
SELECT alter_columnar_table_set(
    'my_columnar_table',
    compression => 'none',
    stripe_row_count => 10000);
```

The following options are available:

* **compression**: `[none|pglz|zstd|lz4|lz4hc]` - set the compression type
  for _newly-inserted_ data. Existing data will not be
  recompressed/decompressed. The default value is `zstd` (if support
  has been compiled in).
* **compression_level**: `<integer>` - set the compression level. Valid
  settings are from 1 through 19. If the compression method does not
  support the level chosen, the closest supported level will be
  selected instead. (See the sketch just after this list.)
* **stripe_row_count**: `<integer>` - the maximum number of rows per
  stripe for _newly-inserted_ data. Existing stripes of data will not
  be changed and may have more rows than this maximum value. The
  default value is `150000`.
* **chunk_row_count**: `<integer>` - the maximum number of rows per
  chunk for _newly-inserted_ data. Existing chunks of data will not be
  changed and may have more rows than this maximum value. The default
  value is `10000`.
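For instance, to compress newly inserted data at a higher `zstd` level, a
call like the following can be used (a sketch: level `15` is an arbitrary
choice within the documented 1 through 19 range, and `zstd` must be
compiled in):

```sql
-- applies only to data inserted after this call; existing stripes are untouched
SELECT alter_columnar_table_set(
    'my_columnar_table',
    compression => 'zstd',
    compression_level => 15);
```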
View options for all tables with:

```sql
SELECT * FROM columnar.options;
```

You can also adjust options with a `SET` command on one of the
following GUCs:

* `columnar.compression`
* `columnar.compression_level`
* `columnar.stripe_row_count`
* `columnar.chunk_row_count`

GUCs only affect newly-created *tables*, not any newly-created
*stripes* on an existing table.

## Partitioning

Columnar tables can be used as partitions, and a partitioned table may
be made up of any combination of row and columnar partitions.

```sql
CREATE TABLE parent(ts timestamptz, i int, n numeric, s text)
  PARTITION BY RANGE (ts);

-- columnar partition
CREATE TABLE p0 PARTITION OF parent
  FOR VALUES FROM ('2020-01-01') TO ('2020-02-01')
  USING COLUMNAR;
-- columnar partition
CREATE TABLE p1 PARTITION OF parent
  FOR VALUES FROM ('2020-02-01') TO ('2020-03-01')
  USING COLUMNAR;
-- row partition
CREATE TABLE p2 PARTITION OF parent
  FOR VALUES FROM ('2020-03-01') TO ('2020-04-01');

INSERT INTO parent VALUES ('2020-01-15', 10, 100, 'one thousand'); -- columnar
INSERT INTO parent VALUES ('2020-02-15', 20, 200, 'two thousand'); -- columnar
INSERT INTO parent VALUES ('2020-03-15', 30, 300, 'three thousand'); -- row
```

When performing operations on a partitioned table with a mix of row
and columnar partitions, take note of the following behaviors for
operations that are supported on row tables but not columnar
(e.g. ``UPDATE``, ``DELETE``, tuple locks, etc.):

* If the operation is targeted at a specific row partition
  (e.g. ``UPDATE p2 SET i = i + 1``), it will succeed; if targeted at
  a specific columnar partition (e.g. ``UPDATE p1 SET i = i + 1``),
  it will fail.
* If the operation is targeted at the partitioned table and has a
  ``WHERE`` clause that excludes all columnar partitions
  (e.g. ``UPDATE parent SET i = i + 1 WHERE ts = '2020-03-15'``), it
  will succeed.
* If the operation is targeted at the partitioned table but does not
  exclude all columnar partitions, it will fail, even if the actual
  data to be updated only affects row tables (e.g. ``UPDATE parent SET
  i = i + 1 WHERE n = 300``).

Because columnar tables do not support indexes, it's impossible to
create indexes on the partitioned table if some partitions are
columnar. Instead, you must create indexes on the individual row
partitions. The same applies to constraints that require indexes, e.g.:

```sql
CREATE INDEX p2_ts_idx ON p2 (ts);
CREATE UNIQUE INDEX p2_i_unique ON p2 (i);
ALTER TABLE p2 ADD UNIQUE (n);
```

## Converting Between Row and Columnar

Note: ensure that you understand any advanced features that may be
used with the table before converting it (e.g. row-level security,
storage options, constraints, inheritance, etc.), and ensure that they
are reproduced in the new table or partition appropriately. ``LIKE``,
used in the example below, is a shorthand that works only in simple
cases.
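A conversion can be verified with a standard PostgreSQL catalog query;
this is plain PostgreSQL, not specific to columnar, and `my_table`
refers to the table created in the example that follows (run the query
before and after each conversion):

```sql
-- reports 'columnar' or 'heap' depending on the table's current access method
SELECT am.amname
FROM pg_class c
JOIN pg_am am ON am.oid = c.relam
WHERE c.relname = 'my_table';
```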
```sql
CREATE TABLE my_table(i INT8 DEFAULT '7');
INSERT INTO my_table VALUES(1);
-- convert to columnar
SELECT alter_table_set_access_method('my_table', 'columnar');
-- back to row
SELECT alter_table_set_access_method('my_table', 'heap');
```

# Performance Microbenchmark

*Important*: This microbenchmark is not intended to represent any real
workload. Compression ratios, and therefore performance, will depend
heavily on the specific workload. This is only for the purpose of
illustrating a "columnar friendly" contrived workload that showcases
the benefits of columnar.

## Schema

```sql
CREATE TABLE perf_row(
    id INT8,
    ts TIMESTAMPTZ,
    customer_id INT8,
    vendor_id INT8,
    name TEXT,
    description TEXT,
    value NUMERIC,
    quantity INT4
);

CREATE TABLE perf_columnar(LIKE perf_row) USING COLUMNAR;
```

## Data

```sql
CREATE OR REPLACE FUNCTION random_words(n INT4) RETURNS TEXT LANGUAGE plpython2u AS $$
import random
t = ''
words = ['zero','one','two','three','four','five','six','seven','eight','nine','ten']
for i in xrange(0,n):
    if (i != 0):
        t += ' '
    r = random.randint(0,len(words)-1)
    t += words[r]
return t
$$;
```

```sql
INSERT INTO perf_row
  SELECT
    g, -- id
    '2020-01-01'::timestamptz + ('1 minute'::interval * g), -- ts
    (random() * 1000000)::INT4, -- customer_id
    (random() * 100)::INT4, -- vendor_id
    random_words(7), -- name
    random_words(100), -- description
    (random() * 100000)::INT4/100.0, -- value
    (random() * 100)::INT4 -- quantity
  FROM generate_series(1,75000000) g;

INSERT INTO perf_columnar SELECT * FROM perf_row;
```

## Compression Ratio

```
=> SELECT pg_total_relation_size('perf_row')::numeric/pg_total_relation_size('perf_columnar') AS compression_ratio;
 compression_ratio
--------------------
 5.4080768380134124
(1 row)
```

The overall compression ratio of the columnar table, versus the same
data stored with row storage, is **5.4X**.

```
=> VACUUM VERBOSE perf_row;
INFO: vacuuming "public.perf_row"
INFO: "perf_row": found 0 removable, 10 nonremovable row versions in 1 out of 5769231 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3110
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 5769230 frozen pages.
0 pages are entirely empty.
CPU: user: 0.10 s, system: 0.05 s, elapsed: 0.26 s.
INFO: vacuuming "pg_toast.pg_toast_17133"
INFO: index "pg_toast_17133_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_toast_17133": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3110
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.

=> VACUUM VERBOSE perf_columnar;
INFO: statistics for "perf_columnar":
storage id: 10000000020
total file size: 8741486592, total data size: 8714771176
compression rate: 4.96x
total row count: 75000000, stripe count: 501, average rows per stripe: 149700
chunk count: 60000, containing data for dropped columns: 0, zstd compressed: 60000
```

``VACUUM VERBOSE`` reports a smaller compression ratio, because it
only averages the compression ratio of the individual chunks, and does
not account for the metadata savings of the columnar format.
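For absolute on-disk sizes rather than a ratio, the same measurements can
be wrapped in `pg_size_pretty` (standard PostgreSQL functions; the exact
numbers will vary with the randomly generated data):

```sql
-- human-readable total size of each table, including metadata and TOAST
SELECT pg_size_pretty(pg_total_relation_size('perf_row'))      AS row_size,
       pg_size_pretty(pg_total_relation_size('perf_columnar')) AS columnar_size;
```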
## System

* 16GB physical memory
* 128MB PG shared buffers

Note: because this was run on a system with enough physical memory to
hold a substantial fraction of the table, the IO benefits of columnar
won't be fully reflected in the query runtime unless the data size is
increased substantially.

## Query

```sql
-- OFFSET 1000 so that no rows are returned, and we collect only timings
SELECT vendor_id, SUM(quantity) FROM perf_row GROUP BY vendor_id OFFSET 1000;
SELECT vendor_id, SUM(quantity) FROM perf_columnar GROUP BY vendor_id OFFSET 1000;
```

Timing (median of three runs):
 * row: 201700ms
 * columnar: 14202ms
 * speedup: **14X**
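To see where the time goes, the same aggregates can be run under
``EXPLAIN``. This was not part of the original measurement, and the plan
shape and buffer counts will vary by PostgreSQL version and
configuration; it is shown only as a way to inspect the IO difference:

```sql
-- compare buffer reads and execution time for the row and columnar tables
EXPLAIN (ANALYZE, BUFFERS)
SELECT vendor_id, SUM(quantity) FROM perf_row GROUP BY vendor_id;

EXPLAIN (ANALYZE, BUFFERS)
SELECT vendor_id, SUM(quantity) FROM perf_columnar GROUP BY vendor_id;
```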