During pg upgrades, we have seen that it is not guaranteed that a
columnar table will be created after metadata objects got created.
Prior to changes done in this commit, we had such a dependency
relationship in `pg_depend`:
```
columnar_table ----> columnarAM ----> citus extension
^ ^
| |
columnar.storage_id_seq -------------------- |
|
columnar.stripe -------------------------------
```
Since `pg_upgrade` just knows to follow topological sort of the objects
when creating database dump, above dependency graph doesn't imply that
`columnar_table` should be created before metadata objects such as
`columnar.storage_id_seq` and `columnar.stripe` are created.
For this reason, with this commit we add new records to `pg_depend` to
make columnarAM depending on all rel objects living in `columnar`
schema. That way, `pg_upgrade` will know it needs to create those before
creating `columnarAM`, and similarly, before creating any tables using
`columnarAM`.
Note that in addition to inserting those records via installation script,
we also do the same in `citus_finish_pg_upgrade()`. This is because,
`pg_upgrade` rebuilds catalog tables in the new cluster and that means,
we must insert them in the new cluster too.
Drop extension might cascade to columnar.options before dropping a
columnar table. In that case, we were getting below error when opening
columnar.options to delete records for the columnar table that we are
about to drop.: "ERROR: could not open relation with OID 0".
I somehow reproduced this bug easily when upgrading pg, that is why
adding added the test to after_pg_upgrade_schedule.
Under high write concurrency, we were sometimes reading columnar
metapage as all zeros.
In `WriteToBlock()`, if `clear == true`, then it will clear the page before
writing the new one, rather than just adding data to the page. That
means any concurrent connection that is holding only a pin will be
able to see the all-zero state between the `InitPage()` and the
`memcpy_s()`.
Moreover, postgres/storage/buffer/README states that:
> Buffer access rules:
>
> 1. To scan a page for tuples, one must hold a pin and either shared or
> exclusive content lock. To examine the commit status (XIDs and status bits)
> of a tuple in a shared buffer, one must likewise hold a pin and either shared
> or exclusive lock.
For those reasons, we have to make sure to never keep a pin on the
page without (at least) the shared lock, to avoid having such problems.
A write operation might trigger index deletion if index already had
dead entries for the key we are about to insert.
There are two ways of index deletion:
a) simple deletion
b) bottom-up deletion (>= pg14)
Since columnar_index_fetch_tuple never sets all_dead to true,
columnarAM doesn't ever expect to receive simple deletion requests
(columnar_index_delete_tuples) as we don't mark any index entries
as dead.
However, since columnarAM doesn't delete any dead entries via simple
deletion, postgres might ask for a more comprehensive deletion
(i.e.: bottom-up) at some point when pg >= 14.
So with this commit, we start gracefully ignoring bottom-up deletion
requests made to columnar_index_delete_tuples.
Given that users can anyway "VACUUM FULL" their columnar tables,
we don't see any problem in ignoring deletion requests.
* Make (columnar.stripe) first_row_number index a unique constraint
Since stripe_first_row_number_idx is required to scan a columnar
table, we need to make sure that it is created before doing anything
with columnar tables during pg upgrades.
However, a plain btree index is not a dependency of a table, so
pg_upgrade cannot guarantee that stripe_first_row_number_idx gets
created when creating columnar.stripe, unless we make it a unique
"constraint".
To do that, drop stripe_first_row_number_idx and create a unique
constraint with the same name to keep the code change at minimum.
* Add more pg upgrade tests for columnar
* Fix a logic error in uprade_columnar_after test
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
BuildStripeMetadata() calls HeapTupleHeaderGetXmin(), which must only
be called on a proper heap tuple with MVCC information. Make sure the
caller passes the heap tuple, and not a datum tuple.
Fixes#5318.
Considering all code-paths that we might interact with a columnar table,
add `CheckCitusVersion` calls to tableAM callbacks:
- initializing table scan (`columnar_beginscan` & `columnar_index_fetch_begin`)
- setting a new filenode for a relation (storage initializiation or a table rewrite)
- truncating the storage
- inserting tuple (single and multi)
Also add `CheckCitusVersion` call to:
- drop hook (`ColumnarTableDropHook`)
- `alter_columnar_table_set` & `alter_columnar_table_reset` UDFs
* Columnar: separate plain and exec quals.
Make a clear separation between plain quals, which contain constants
or extern params; and exec quals, which contain exec params and can't
be evaluated until a rescan.
Fixes#5258.
* more vanilla tests
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
When performing a partition-wise join, the planner will adjust paths
parameterized by the parent rel to instead parameterize by the child
rel directly. When this reparameterization happens, we also need to
adjust the join quals to reference the child rather than the parent.
Fixes#5257.
Not flush pending writes if given tid belongs to a "flushed" or
"aborted" stripe write, or to an "in-progress" stripe write of
another backend.
That way, we would reduce the cases where we flush single-tuple
stripes during index scan.
To do that, we follow below steps for index look-up's:
- Do not flush any pending writes and do stripe metadata look-up for
given tid.
If tuple with tid is found, then no need to do another look-up
since we already found the tuple without needing to flush pending
writes.
- If tuple is not found without flushing pending writes, then we have two
scenarios:
- If given tid belongs to a pending write of my backend, then do stripe
metadata look-up for given tid. But this time first **flush any pending
writes**.
- Otherwise, just return false from `index_fetch_tuple` since flushing
pending writes wouldn't help.
With 5825c44d5f, we made the changes to
skip aborted writes when scanning a columnar table.
However, looks like we forgot to handle such cases for the very first
call made to columnar_getnextslot. That means, that commit only
considered the intermediate stripe read operations.
However, functions called by columnar_getnextslot to find first stripe
to read (ColumnarBeginRead & ColumnarRescan) were not caring about
those aborted writes.
To fix that, we teach AdvanceStripeRead to find the very first stripe
to read, and then start using it where were blindly calling
FindNextStripeByRowNumber.
Previously, even when `EXPLAIN` output tells that we will do
index-only scan, it was never the case since columnar tables
don't have the visibility fork that postgres is looking for.
For this reason, visibility check done in
`IndexOnlyNext->VM_ALL_VISIBLE`
code-path was always returning false and postgres was reading
the tuple from the columnar relation itself.
Previously, for regular table scans, we were setting `RelOptInfo->partial_pathlist`
to `NIL` via `set_rel_pathlist_hook` to discard scan `Path`s that need to use any
parallel workers, this was working nicely.
However, when building indexes, this hook doesn't get called so we were not
able to prevent spawning parallel workers when building an index. For this
reason, 9b4dc2f804 added basic
implementation for `columnar_parallelscan_*` callbacks but also made some
changes to skip using those workers when building the index.
However, now that we are doing stripe reservation in two stages, we call
`heap_inplace_update` at some point to complete stripe reservation.
However, postgres throws an error if we call `heap_inplace_update` during
a parallel operation, even if we don't actually make use of those workers.
For this reason, with this pr, we make sure to not generate scan `Path`s that
need to use any parallel workers by using `get_relation_info_hook`.
This is indeed useful to prevent spawning parallel workers during index builds.
If it is certain that we will not use any `parallel_worker`s for a columnar table,
then stripe entries inserted by aborted transactions become visible to
`SnapshotAny` and that causes `REINDEX` to fail by throwing a duplicate key
error.
To fix that:
* consider three states for a stripe write operation:
"flushed", "aborted", or "in-progress",
* make sure to have a clear separation between them, and
* act according to those three states when reading from a columnar table
* Rename RecostColumnarPaths to CostColumnarPaths
* Rename RecostColumnarIndexPath to CostColumnarIndexPath
* Reorder args of CostColumnarScan to align with other two costing functions
* Not adjust index scan start-up cost
* Rename ColumnarIndexScanAddTotalCost to ColumnarIndexScanAdditionalCost
* Reflect that index scan will at least read one stripe in totalCost calculation
* Organize declarations in columnar_customscan.c
These changes were removed in commit: Introduces ExecSimpleRelationInsert_compat and modifyStateResultRelInfo macros
We shouldn't have removed them but instead kept them for before PG14
Postgres expects to set the HASH_STRINGS explicitly in case of the
default behaivor for string hash function.
Postgres Commit
b3817f5f774663d55931dd4fab9c5a94a15ae7ab
index_insert function now has a new parameter, indexUnchanged
This new macro give us the ability to use these new parameter for PG14 and they don't give the parameters for previous versions
Existing parameter is set to false
Relevant PG commit:
9dc718bdf2b1a574481a45624d42b674332e2903
es_result_relation_info is removed from Estate. In this commit we make some changes to handle that.
resultRelationInfo filed is added to ModifyState to support the removed field.
Relevant PG commits:
1375422c7826a2bf387be29895e961614f69de4b
a04daa97a4339c38e304cd6164d37da540d665a8
GetOldestXmin function is removed so we use GetOldestNonRemovableTransactionId functions instead
GetOldestNonRemovableTransactionId_compat picks the appropriate one
Relevant PG commit:
dc7420c2c9274a283779ec19718d2d16323640c0
New macros: standard_ProcessUtility_compat, ProcessUtility_compat, ColumnarProcessUtility_compat, PrevProcessUtilityHook_compat
The functions now have a new bool parameter: readOnlyTree
These new macros give us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
In multi_ProcessUtility and ColumnarProcessUtility, before doing anything else, we check if readOnlyTree parameter is true and create a copy of pstmt
Existing readOnlyTree parameters are set to false since we already handle the read only case at multi_ProcessUtility and ColumnarProcessUtility
Relevant PG commit:
7c337b6b527b7052e6a751f966d5734c56f668b5
VacOptTernaryValue enum is renamed to VacOptValue.
In the enum there were three values, VACOPT_TERNARY_DEFAULT, VACOPT_TERNARY_DISABLED, and VACOPT_TERNARY_ENABLED
Now there are four values VACOPTVALUE_UNSPECIFIED, VACOPTVALUE_AUTO, VACOPTVALUE_DISABLED, and VACOPTVALUE_ENABLED
New macros are VacOptValue_compat, VACOPTVALUE_UNSPECIFIED_COMPAT, VACOPTVALUE_DISABLED_COMPAT, and VACOPTVALUE_ENABLED_COMPAT
The VACOPTVALUE_UNSPECIFIED_COMPAT matches VACOPT_TERNARY_DEFAULT and VACOPTVALUE_UNSPECIFIED. And there are no macro for VACOPTVALUE_AUTO.
Relevant PG commit:
3499df0dee8c4ea51d264a674df5b5e31991319a
stats function now have a new bool print_to_stderr parameter
This new macro gives us the ability to use this new parameter for PG14 and it doesn't give the parameter for previous versions
Existing print_to_stderr parameter is set to true to keep current behavior
Relevant PG commit:
43620e328617c1f41a2a54c8cee01723064e3ffa
Allow ColumnarScans to push down join quals by generating
parameterized paths. This significantly expands the utility of chunk
group filtering, making a ColumnarScan behave similar to an index when
on the inner of a nested loop join.
Also, evaluate all parameters on beginscan/rescan, which also works
for external parameters.
Fixes#4488.
Previously, we were doing `first_row_number` reservation for the first
row written to current `WriteState` but were doing `stripe_id`
reservation when flushing the `WriteState` and were inserting the
related record to `columnar.stripe` at that time as well.
However, inserting `columnar.stripe` record at flush-time is
problematic. This is because, as told in #5160, if relation has
any index-based constraints and if there are two concurrent
writes that are inserting conflicting key values for that constraint,
then postgres relies on `tableAM->fetch_index_tuple`
(=`columnar_fetch_index_tuple`) callback to return `true` when
indexAM is checking against possible constraint violations.
However, pending writes of other backends are not visible to concurrent
sessions in columnar since we were not inserting the stripe metadata
record until flushing the stripe.
With this commit, we split stripe reservation into two phases:
i) Reserve `stripe_id` and insert a "dummy" record to `columnar.stripe`
at the very same time we reserve `first_row_number`, i.e. when writing
the first row to the current `WriteState`.
ii) At flush time, do the storage level allocation and complete the
missing fields of the dummy record inserted into `columnar.stripe`
during i).
That way, any concurrent writes would be able to check against possible
constraint violations by using `SnapshotDirty` when scanning
`columnar.stripe`.
Note that `columnar_fetch_index_tuple` still wouldn't be able to fill
the output tupleslot for the requested tid but it would at least return
`true` for such index look-up's and we believe this should be sufficient
for the caller indexAM callback to make the concurrent writer block on
prior one.
That is how we fix#5160.
Only downside of reserving `stripe_id` at the same time we reserve
`first_row_number` is that now any aborted writes would also waste
some amount of `stripe_id` as in the case of `first_row_number` but
we are just wasting them one-by-one.
Considering the fact that we waste `first_row_number` by the amount
stripe row limit (=150k by default) in such cases, this shouldn't be
important at all.
Before starting to scan a columnar table, we always flush the pending
writes to disk.
However, we increment command counter after modifying metadata tables.
On the other hand, now that we _don't always use_ xact snapshot to scan
a columnar table, writes that we just flushed might not be visible to
the query that just flushed pending writes to disk since curcid of
provided snapshot would become smaller than the command id being used
when modifying metadata tables.
To give an example, before this change, below was a possible scenario
due to the changes that we made to use the correct snapshot.
```sql
CREATE TABLE t(a int, b int) USING columnar;
BEGIN;
INSERT INTO t VALUES (5, 10);
SELECT * FROM t;
┌───┬───┐
│ a │ b │
├───┼───┤
└───┴───┘
(0 rows)
SELECT * FROM t;
┌───┬────┐
│ a │ b │
├───┼────┤
│ 5 │ 10 │
└───┴────┘
(1 row)
```
In next commit, we will adjust curcid of the snapshot being used when
scanning the columnar table.
However, for index scan, snapshot is provided not when beginning scan
but within fetch-tuple call.
For this reason, start flushing pending writes in init_columnar_read_state
since this seem to be a prerequisite step that needs to be done before
scanning a columnar table regardless of the scan method being used.
Seems that we always increment the command counter right after
finishing metadata table modification.
For this reason, it makes sense to call CommandCounterIncrement
within FinishModifyRelation.
All the callers except columnar_relation_copy_for_cluster were already
switching to right memory context when creating ColumnarReadState.
With this commit, we embed that logic into init_columnar_read_state
to avoid further such bugs.
That way, we start using the right memory context for
columnar_relation_copy_for_cluster too.
Instead of setting stripeReadState to NULL, call ColumnarResetRead
before re-scanning a columnar table since this function is already
designed for doing the necessary clean up when finishing a stripe
read.
Note that this change shouldn't have a great effect on memory usage
since AdvanceStripe was already doing the clean-up for all the
stripes except the last one.
Previously, we were only using chunk group reader for sequential scan.
However, to support index scans on columnar tables, now we use very
same low level functions for index scan too.
Since those low-level functions were only used for sequential scan, it
was guaranteed that we would never read the same chunk group more than
once, so we were freeing chunk buffers after deserializing them into a
separate buffer.
Now that we use those low level functions for index scan, we cannot
free chunk buffers since it's possible to read the same chunk group
again, such that:
- read chunk group 1 of stripe 5
- read chunk group 2 of stripe 5
- read chunk group 1 of stripe 5 again
Here, when we decide to read chunk group 1 for a second time,
chunk group 1 is not cached. Plus, before this commit, we were
freeing the chunk buffers for chunk group 1 after the first
read and then we were getting segfault or errors from low-level
de-compression APIs.
With this commit, we add (`CREATE INDEX` / `REINDEX`) `CONCURRENTLY` support for columnar tables.
For that, we implement `columnar_index_validate_scan` callback.
The reasoning behind the implementation is as follows:
* Postgres function `validate_index` provides all the TIDs that are currently in the
index to `columnar_index_validate_scan` callback via a `tupleSort` object..
* We start scanning the table by using `columnar_getnextslot` as usual.
Before moving forward, note that `columnar_getnextslot` guarantees
to return tuples in the order of their TIDs.
* For us to use during table scan, postgres provides a snapshot guaranteeing
that any tuples that are valid according to that snapshot but are not in the
index must be added to the index.
* Then for each tuple that we read from our table, we continue iterating
given `tupleSort` to find the first TID that is greater than or equal to our
tuple's TID.
If both TID's are equal to each other, then we skip the tuple since it's already
indexed.
If the TID that we read from tupleSort is greater then our tuple's TID, then
we decide to insert this tuple into index.
systable_getnext already uses ForwardScanDirection if relation has any
open indexes, but let's be more explicit doing ordered scan on columnar
catalog tables.
* Make VACUUM hint for upgrade scenario actually work
* Suggest using VACUUM if metapage doesn't exist
Plus, suggest upgrading sql version as another option.
* Always force read metapage block
* Fix two typos
* Columnar: introduce columnar storage API.
This new API is responsible for the low-level storage details of
columnar; translating large reads and writes into individual block
reads and writes that respect the page headers and emit WAL. It's also
responsible for the columnar metapage, resource reservations (stripe
IDs, row numbers, and data), and truncation.
This new API is not used yet, but will be used in subsequent
forthcoming commits.
* Columnar: add columnar_storage_info() for debugging purposes.
* Columnar: expose ColumnarMetadataNewStorageId().
* Columnar: always initialize metapage at creation time.
This avoids the complexity of dealing with tables where the metapage
has not yet been initialized.
* Columnar: columnar storage upgrade/downgrade UDFs.
Necessary upgrade/downgrade step so that new code doesn't see an old
metapage.
* Columnar: improve metadata.c comment.
* Columnar: make ColumnarMetapage internal to the storage API.
Callers should not have or need direct access to the metapage.
* Columnar: perform resource reservation using storage API.
* Columnar: implement truncate using storage API.
* Columnar: implement read/write paths with storage API.
* Columnar: add storage tests.
* Revert "Columnar: don't include stripe reservation locks in lock graph."
This reverts commit c3dcd6b9f8.
No longer needed because the columnar storage API takes care of
concurrency for resource reservation.
* Columnar: remove unnecessary lock when reserving.
No longer necessary because the columnar storage API takes care of
concurrent resource reservation.
* Add simple upgrade tests for storage/ branch
* fix multi_extension.out
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
* Columnar: use clause Vars for chunk group filtering.
This solves #4780 and also provides a cleaner separation between chunk
group filtering and projection pushdown.
* Columnar: sort and deduplicate Vars pulled from clauses.
* Columnar: cleanup variable names.
* Columnar: remove alternate test output.
* Columnar: do not recurse when looking for whereClauseVars.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
* Columnar: fix misnamed file.
* Columnar: make compression not dependent on columnar.h.
* Columnar: rename columnar_metadata_tables.c to columnar_metadata.c.
* Columnar: make customscan not depend on columnar.h.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
Postgres keeps AFTER trigger state for each transaction, because we can have deferred AFTER triggers which will be fired at the end of a transaction. Postgres cleans up this state at the end of transaction.
Postgres processes ON COMMIT triggers after cleaning-up the AFTER trigger states. So if we fire any triggers in ON COMMIT, the AFTER trigger state won't be cleaned-up properly and the transaction state will be left in an inconsistent state, which might result in assertion failure.
So with this commit, we remove foreign keys between columnar metadata tables and enforce constraints between them manually when dropping columnar tables.