Instead of setting stripeReadState to NULL, call ColumnarResetRead
before re-scanning a columnar table since this function is already
designed for doing the necessary clean up when finishing a stripe
read.
Note that this change shouldn't have a great effect on memory usage
since AdvanceStripe was already doing the clean-up for all the
stripes except the last one.
Previously, we were only using chunk group reader for sequential scan.
However, to support index scans on columnar tables, now we use very
same low level functions for index scan too.
Since those low-level functions were only used for sequential scan, it
was guaranteed that we would never read the same chunk group more than
once, so we were freeing chunk buffers after deserializing them into a
separate buffer.
Now that we use those low level functions for index scan, we cannot
free chunk buffers since it's possible to read the same chunk group
again, such that:
- read chunk group 1 of stripe 5
- read chunk group 2 of stripe 5
- read chunk group 1 of stripe 5 again
Here, when we decide to read chunk group 1 for a second time,
chunk group 1 is not cached. Plus, before this commit, we were
freeing the chunk buffers for chunk group 1 after the first
read and then we were getting segfault or errors from low-level
de-compression APIs.
With this commit, we add (`CREATE INDEX` / `REINDEX`) `CONCURRENTLY` support for columnar tables.
For that, we implement `columnar_index_validate_scan` callback.
The reasoning behind the implementation is as follows:
* Postgres function `validate_index` provides all the TIDs that are currently in the
index to `columnar_index_validate_scan` callback via a `tupleSort` object..
* We start scanning the table by using `columnar_getnextslot` as usual.
Before moving forward, note that `columnar_getnextslot` guarantees
to return tuples in the order of their TIDs.
* For us to use during table scan, postgres provides a snapshot guaranteeing
that any tuples that are valid according to that snapshot but are not in the
index must be added to the index.
* Then for each tuple that we read from our table, we continue iterating
given `tupleSort` to find the first TID that is greater than or equal to our
tuple's TID.
If both TID's are equal to each other, then we skip the tuple since it's already
indexed.
If the TID that we read from tupleSort is greater then our tuple's TID, then
we decide to insert this tuple into index.
systable_getnext already uses ForwardScanDirection if relation has any
open indexes, but let's be more explicit doing ordered scan on columnar
catalog tables.
* Make VACUUM hint for upgrade scenario actually work
* Suggest using VACUUM if metapage doesn't exist
Plus, suggest upgrading sql version as another option.
* Always force read metapage block
* Fix two typos
* Columnar: introduce columnar storage API.
This new API is responsible for the low-level storage details of
columnar; translating large reads and writes into individual block
reads and writes that respect the page headers and emit WAL. It's also
responsible for the columnar metapage, resource reservations (stripe
IDs, row numbers, and data), and truncation.
This new API is not used yet, but will be used in subsequent
forthcoming commits.
* Columnar: add columnar_storage_info() for debugging purposes.
* Columnar: expose ColumnarMetadataNewStorageId().
* Columnar: always initialize metapage at creation time.
This avoids the complexity of dealing with tables where the metapage
has not yet been initialized.
* Columnar: columnar storage upgrade/downgrade UDFs.
Necessary upgrade/downgrade step so that new code doesn't see an old
metapage.
* Columnar: improve metadata.c comment.
* Columnar: make ColumnarMetapage internal to the storage API.
Callers should not have or need direct access to the metapage.
* Columnar: perform resource reservation using storage API.
* Columnar: implement truncate using storage API.
* Columnar: implement read/write paths with storage API.
* Columnar: add storage tests.
* Revert "Columnar: don't include stripe reservation locks in lock graph."
This reverts commit c3dcd6b9f8.
No longer needed because the columnar storage API takes care of
concurrency for resource reservation.
* Columnar: remove unnecessary lock when reserving.
No longer necessary because the columnar storage API takes care of
concurrent resource reservation.
* Add simple upgrade tests for storage/ branch
* fix multi_extension.out
Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
* Columnar: use clause Vars for chunk group filtering.
This solves #4780 and also provides a cleaner separation between chunk
group filtering and projection pushdown.
* Columnar: sort and deduplicate Vars pulled from clauses.
* Columnar: cleanup variable names.
* Columnar: remove alternate test output.
* Columnar: do not recurse when looking for whereClauseVars.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
* Columnar: fix misnamed file.
* Columnar: make compression not dependent on columnar.h.
* Columnar: rename columnar_metadata_tables.c to columnar_metadata.c.
* Columnar: make customscan not depend on columnar.h.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
Postgres keeps AFTER trigger state for each transaction, because we can have deferred AFTER triggers which will be fired at the end of a transaction. Postgres cleans up this state at the end of transaction.
Postgres processes ON COMMIT triggers after cleaning-up the AFTER trigger states. So if we fire any triggers in ON COMMIT, the AFTER trigger state won't be cleaned-up properly and the transaction state will be left in an inconsistent state, which might result in assertion failure.
So with this commit, we remove foreign keys between columnar metadata tables and enforce constraints between them manually when dropping columnar tables.
Enables an overall plan to be parallel (e.g. over a partition
hierarchy), even though an individual ColumnarScan is not
parallel-aware.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
Previously, if columnar.enable_custom_scan was false, parallel paths
could remain, leading to an unexpected error.
Also, ensure that cheapest_parameterized_paths is cleared if a custom
scan is used.
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
Fixing a division by zero in the cost calculations for scanning a columnar table.
Due to how the columns in a columnar table are counted an empty table would result in a division by zero. Instead this patch keeps the column selection ratio on zero when this happens, resulting in an accurate cost of zero pages to scan a columnar table.
fixes#4589
In pg11.8 it seemingly tries to parse the full sql file creating the extension,
since we use syntax introduced in postgres 12 this fails.
This patch rewrites the statement not recognized by pg11.8 to be dynamically
executed from a string literal via `EXECUTE`.
* Stronger check for triggers on columnar tables (#4493).
Previously, we used a simple ProcessUtility_hook. Change to use an
object_access_hook instead.
* Replace alter_table_set_access_method test on partition with foreign key
Co-authored-by: Jeff Davis <jefdavi@microsoft.com>
Co-authored-by: Marco Slot <marco.slot@gmail.com>
UPDATEs on partitioned tables that affect only row partitions should
succeed, the rest should fail.
Also rename CStoreScan to ColumnarScan to make the error message more
relevant.
When distributing a columnar table, as well as changing options on a distributed columnar table, this patch will forward the settings from the coordinator to the workers.
For propagating options changes on an already distributed table this change is pretty straight forward. Before applying the change in options locally we will create a `DDLJob` that contains a call to `alter_columnar_table_set(...)` for every shard placement with all settings of the current table. This goes both for setting an option as well as resetting. This will reset the values to the defaults configured on the coordinator. Having the effect that the coordinator is authoritative on the settings and makes sure the shards have the same settings set as the table on the coordinator.
When a columnar table is distributed it is using the `TableDDLCommand` infra structure to create a new kind of `TableDDLCommand`. This new type, called a `TableDDLCommandFunction` contains a context and 2 function pointers to execute. One function returns the command as applied on the table, the second function will return the sql command to apply to a shard with a given shard id. The schema name is ignored as it will use the fully qualified name of the shard in the same schema as the base table.
Columnar options were by accident linked to the relfilenode instead of the regclass/relation oid. This PR moves everything related to columnar options to their own catalog table.
I got this warning when compiling citus:
```
../columnar/write_state_management.c: In function ‘PendingWritesInUpperTransactions’:
../columnar/write_state_management.c:364:20: warning: ‘entry’ may be used uninitialized in this function [-Wmaybe-uninitialized]
if (found && entry->writeStateStack != NULL)
~~~~~^~~~~~~~~~~~~~~~
```
I fixed this by checking by always initializing entry, by using an early
return if `WriteStateMap` didn't exist. Instead of using the `found`
variable to check for existence of the key, I now simply check the
`entry` variable itself.
To quote the postgres comment on the hash_enter function:
> If foundPtr isn't NULL, then *foundPtr is set true if we found an
> existing entry in the table, false otherwise. This is needed in the
> HASH_ENTER case, but is redundant with the return value otherwise.
TableAM API doesn't allow us to pass around a state variable along all of the tuple inserts belonging to the same command. We require this in columnar store, since we batch them, and when we have enough rows we flush them as stripes.
To do that, we keep a (relfilenode) -> stack of (subxact id, TableWriteState) global mapping.
**Inserts**
Whenever we want to insert a tuple, we look up for the relation's relfilenode in this mapping. If top of the stack matches current subtransaction, we us the existing TableWriteState. Otherwise, we allocate a new TableWriteState and push it on top of stack.
**(Sub)Transaction Commit/Aborts**
When the subtransaction or transaction is committed, we flush and pop all entries matching current SubTransactionId.
When the subtransaction or transaction is committed, we pop all entries matching current SubTransactionId and discard them without flushing.
**Reads**
Since we might have unwritten rows which needs to be read by a table scan, we flush write states on SELECTs. Since flushing the write state of upper transactions in a subtransaction will cause metadata being written in wrong subtransaction, we ERROR out if any of the upper subtransactions have unflushed rows.
**Table Drops**
We record in which subtransaction the table was dropped. When committing a subtransaction in which table was dropped, we propagate the drop to upper transaction. When aborting a subtransaction in which table was dropped, we mark table as not deleted.