Do not flush pending writes if the given tid belongs to a "flushed" or
"aborted" stripe write, or to an "in-progress" stripe write of
another backend.
That way, we would reduce the cases where we flush single-tuple
stripes during index scan.
To do that, we follow the steps below for index look-ups:
- Do not flush any pending writes and do a stripe metadata look-up for
  the given tid.
  If the tuple with that tid is found, then there is no need to do another
  look-up since we already found the tuple without needing to flush
  pending writes.
- If the tuple is not found without flushing pending writes, then we have
  two scenarios:
  - If the given tid belongs to a pending write of my backend, then do the
    stripe metadata look-up for the given tid again. But this time first
    **flush any pending writes**.
  - Otherwise, just return false from `index_fetch_tuple` since flushing
    pending writes wouldn't help.
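The look-up steps above can be sketched with a toy model. This is only an illustration in Python, not the columnar code: `ToyColumnarTable`, its fields, and the backend names are hypothetical, and "flushed" tids stand in for tuples that are reachable through stripe metadata.

```python
class ToyColumnarTable:
    """Toy model: flushed tids are readable; pending tids are buffered per backend."""

    def __init__(self):
        self.flushed = set()   # tids reachable through stripe metadata
        self.pending = {}      # backend name -> set of tids not flushed yet
        self.flush_count = 0   # number of flushes that actually wrote a stripe

    def write(self, backend, tid):
        self.pending.setdefault(backend, set()).add(tid)

    def flush(self, backend):
        tids = self.pending.pop(backend, set())
        if tids:
            self.flushed |= tids
            self.flush_count += 1

    def index_fetch_tuple(self, backend, tid):
        # Step 1: look up stripe metadata without flushing anything.
        if tid in self.flushed:
            return True
        # Step 2a: the tid belongs to our own pending write, so flush and retry.
        if tid in self.pending.get(backend, set()):
            self.flush(backend)
            return tid in self.flushed
        # Step 2b: aborted write or another backend's write; flushing would not help.
        return False
```

In this model, fetching an already flushed tid or a tid written by another backend leaves `flush_count` unchanged; only a look-up that hits the caller's own pending write triggers a flush.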
If it is certain that we will not use any `parallel_worker`s for a columnar table,
then stripe entries inserted by aborted transactions become visible to
`SnapshotAny` and that causes `REINDEX` to fail by throwing a duplicate key
error.
To fix that:
* consider three states for a stripe write operation:
"flushed", "aborted", or "in-progress",
* make sure to have a clear separation between them, and
* act according to those three states when reading from a columnar table.
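A minimal sketch of the three-state idea, written in Python as a toy model rather than the actual columnar reader. `StripeWriteState`, `keys_for_reindex`, and `has_duplicates` are hypothetical names, and skipping "in-progress" stripes is an assumption of this sketch; the commit only states that the reader acts according to the state.

```python
from enum import Enum


class StripeWriteState(Enum):
    FLUSHED = "flushed"
    ABORTED = "aborted"
    IN_PROGRESS = "in-progress"


def keys_for_reindex(stripes):
    """Collect index keys from (state, keys) pairs, ignoring unusable stripes."""
    keys = []
    for state, stripe_keys in stripes:
        if state is StripeWriteState.ABORTED:
            continue  # rows of an aborted write must never reach the index
        if state is StripeWriteState.IN_PROGRESS:
            continue  # assumption of this sketch: nothing readable yet
        keys.extend(stripe_keys)
    return keys


def has_duplicates(keys):
    return len(keys) != len(set(keys))
```

If an aborted write and a later successful write both inserted keys 1 and 2, reading every stripe would report duplicates, while `keys_for_reindex` would not.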
These changes were removed in the commit "Introduces ExecSimpleRelationInsert_compat and modifyStateResultRelInfo macros".
We shouldn't have removed them; we should have kept them for versions before PG14.
`es_result_relation_info` is removed from `EState`. In this commit we make some changes to handle that.
The `resultRelationInfo` field is added to ModifyState to support the removed field.
Relevant PG commits:
1375422c7826a2bf387be29895e961614f69de4b
a04daa97a4339c38e304cd6164d37da540d665a8
Previously, we were doing `first_row_number` reservation for the first
row written to current `WriteState` but were doing `stripe_id`
reservation when flushing the `WriteState` and were inserting the
related record to `columnar.stripe` at that time as well.
However, inserting the `columnar.stripe` record at flush time is
problematic. This is because, as described in #5160, if the relation has
any index-based constraints and if there are two concurrent
writes that are inserting conflicting key values for that constraint,
then postgres relies on the `tableAM->fetch_index_tuple`
(=`columnar_fetch_index_tuple`) callback to return `true` when
indexAM is checking against possible constraint violations.
However, pending writes of other backends are not visible to concurrent
sessions in columnar since we were not inserting the stripe metadata
record until flushing the stripe.
With this commit, we split stripe reservation into two phases:
i) Reserve `stripe_id` and insert a "dummy" record to `columnar.stripe`
at the very same time we reserve `first_row_number`, i.e. when writing
the first row to the current `WriteState`.
ii) At flush time, do the storage level allocation and complete the
missing fields of the dummy record inserted into `columnar.stripe`
during i).
That way, any concurrent writes would be able to check against possible
constraint violations by using `SnapshotDirty` when scanning
`columnar.stripe`.
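The two-phase reservation can be illustrated with a small Python model. This is a sketch under stated assumptions, not the real catalog code: `ToyStripeCatalog`, its method names, and the dictionary layout are hypothetical, and the dummy record is assumed to cover the whole reserved row-number range until it is completed.

```python
class ToyStripeCatalog:
    """Toy stand-in for columnar.stripe with a two-phase stripe reservation."""

    def __init__(self, stripe_row_limit=150_000):
        self.stripe_row_limit = stripe_row_limit
        self.next_stripe_id = 1
        self.next_row_number = 1
        self.records = {}  # stripe_id -> metadata dict

    def reserve(self):
        """Phase i: on the first row, reserve both ids and insert a dummy record."""
        stripe_id = self.next_stripe_id
        self.next_stripe_id += 1
        first_row_number = self.next_row_number
        self.next_row_number += self.stripe_row_limit
        self.records[stripe_id] = {
            "first_row_number": first_row_number,
            "row_count": 0,       # unknown until flush
            "file_offset": None,  # storage not allocated yet
        }
        return stripe_id

    def complete(self, stripe_id, row_count, file_offset):
        """Phase ii: at flush time, fill in the fields the dummy record lacks."""
        self.records[stripe_id].update(row_count=row_count, file_offset=file_offset)

    def covers_row(self, row_number):
        """True if some record, dummy or completed, owns this row number."""
        for rec in self.records.values():
            done = rec["file_offset"] is not None
            span = rec["row_count"] if done else self.stripe_row_limit
            if rec["first_row_number"] <= row_number < rec["first_row_number"] + span:
                return True
        return False
```

The point of the model is that `covers_row` already answers true right after `reserve`, before any data is flushed, which is what lets a concurrent writer notice the pending write.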
Note that `columnar_fetch_index_tuple` still wouldn't be able to fill
the output tupleslot for the requested tid, but it would at least return
`true` for such index look-ups, and we believe this should be sufficient
for the caller indexAM callback to make the concurrent writer block on
the prior one.
That is how we fix #5160.
The only downside of reserving `stripe_id` at the same time we reserve
`first_row_number` is that any aborted write now also wastes a
`stripe_id`, as is already the case for `first_row_number`, but
we only waste them one at a time.
Considering that we waste `first_row_number` values by the
stripe row limit (=150k by default) in such cases, this shouldn't be
important at all.
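To put the comparison in numbers, here is a short Python calculation. It assumes each aborted write reserved exactly one stripe; `wasted_after_aborts` is a hypothetical helper, not part of the code base.

```python
STRIPE_ROW_LIMIT = 150_000  # default stripe row limit mentioned above


def wasted_after_aborts(aborted_writes, stripe_row_limit=STRIPE_ROW_LIMIT):
    """Ids lost to aborted writes, assuming one reserved stripe per write."""
    return {
        "stripe_ids": aborted_writes,
        "row_numbers": aborted_writes * stripe_row_limit,
    }
```

For example, three aborted writes cost 3 stripe ids but 450,000 row numbers under this assumption.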
It seems that we always increment the command counter right after
finishing metadata table modification.
For this reason, it makes sense to call CommandCounterIncrement
within FinishModifyRelation.
`systable_getnext` already uses `ForwardScanDirection` if the relation has
any open indexes, but let's be more explicit about doing an ordered scan
on columnar catalog tables.
* Columnar: introduce columnar storage API.
This new API is responsible for the low-level storage details of
columnar: translating large reads and writes into individual block
reads and writes that respect the page headers and emit WAL. It's also
responsible for the columnar metapage, resource reservations (stripe
IDs, row numbers, and data), and truncation.
This new API is not used yet, but will be used in subsequent
commits.
* Columnar: add columnar_storage_info() for debugging purposes.
* Columnar: expose ColumnarMetadataNewStorageId().
* Columnar: always initialize metapage at creation time.
This avoids the complexity of dealing with tables where the metapage
has not yet been initialized.
* Columnar: columnar storage upgrade/downgrade UDFs.
Necessary upgrade/downgrade step so that new code doesn't see an old
metapage.
* Columnar: improve metadata.c comment.
* Columnar: make ColumnarMetapage internal to the storage API.
Callers should not have or need direct access to the metapage.
* Columnar: perform resource reservation using storage API.
* Columnar: implement truncate using storage API.
* Columnar: implement read/write paths with storage API.
* Columnar: add storage tests.
* Revert "Columnar: don't include stripe reservation locks in lock graph."
This reverts commit c3dcd6b9f8.
No longer needed because the columnar storage API takes care of
concurrency for resource reservation.
* Columnar: remove unnecessary lock when reserving.
No longer necessary because the columnar storage API takes care of
concurrent resource reservation.
* Add simple upgrade tests for storage/ branch
* fix multi_extension.out
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
* Columnar: fix misnamed file.
* Columnar: make compression not dependent on columnar.h.
* Columnar: rename columnar_metadata_tables.c to columnar_metadata.c.
* Columnar: make customscan not depend on columnar.h.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
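
The storage API introduced in the commits above is described as translating large reads and writes into individual block operations that respect page headers. A rough Python sketch of that translation follows. It assumes PostgreSQL's default 8192-byte block size and 24-byte page header, ignores the metapage and WAL, and uses a hypothetical `split_write` helper that is not the real API.

```python
PAGE_SIZE = 8192        # PostgreSQL's default block size (assumed here)
PAGE_HEADER_SIZE = 24   # size of the standard page header (assumed here)
USABLE_BYTES = PAGE_SIZE - PAGE_HEADER_SIZE


def split_write(logical_offset, length):
    """Split a logical byte range into (block, offset_in_page, n_bytes) chunks."""
    chunks = []
    offset = logical_offset
    remaining = length
    while remaining > 0:
        block = offset // USABLE_BYTES
        # Skip the page header: logical data starts right after it.
        in_page = PAGE_HEADER_SIZE + offset % USABLE_BYTES
        n = min(remaining, PAGE_SIZE - in_page)
        chunks.append((block, in_page, n))
        offset += n
        remaining -= n
    return chunks
```

A write that crosses a page boundary is cut at the end of the page and resumes after the next page's header, so callers can think in logical offsets only.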