Commit Graph

16 Commits (548395fd77d2f73a026dfa8ed81aa8bae3b75e07)

Author SHA1 Message Date
Jelte Fennema-Nio 48c62095ff Actually sort includes after cherry-pick 2024-04-17 10:26:50 +02:00
Naisila Puka 7c6b4ce103
PG16 compatibility - outer join checks, subscription password, crash fixes (#7097)
PG16 compatibility - Part 4

Check out part 1 42d956888d
part 2 0d503dd5ac
part 3 907d72e60d

This commit is in the series of PG16 compatibility commits.
It adds some outer join checks to the planner,
the new password_required option to the subscription,
and a crash fix related to PGIOAlignedBlock, see below for more details:

- Fix PGIOAlignedBlock Assert crash in PG16 
Relevant PG commit:
faeedbcefd
faeedbcefd40bfdf314e048c425b6d9208896d90

- Pass planner info as argument to make_simple_restrictinfo 
Pre PG16 passing plannerInfo to make_simple_restrictinfo
was only needed for placeholder Vars, which is not the case
in this part of the codebase because we are building the
expression from shard intervals which don't have placeholder
vars.
However, PG16 is counting baserels appearing in clause_relids
and is deleting the rels mentioned in plannerinfo->outer_join_rels
Hence directly accessing plannerinfo.
We will crash if we leave it as NULL.
For reference
2489d76c49 (diff-e045c41eda9686451a7993e91518e40056b3739365e39eb1b70ae438dc1f7c76R207)
Relevant PG commit:
2489d76c49
2489d76c4906f4461a364ca8ad7e0751ead8aa0d

- Add outer join checks, root->simple_rel_array

- fix rebalancer to include passwork_required option 
Relevant PG commit:
c3afe8cf5a
c3afe8cf5a1e465bd71e48e4bc717f5bfdc7a7d6

More PG16 compatibility commits are coming soon ...
2023-08-04 14:51:28 +03:00
Naisila Puka 42d956888d
PG16 compatibility: Resolve compilation issues (#7005)
This PR provides successful compilation against PG16Beta2. It does some
necessary refactoring to prepare for full support of version 16, in
https://github.com/citusdata/citus/pull/6952 .

Change RelFileNode to RelFileNumber or RelFileLocator 
Relevant PG commit
b0a55e43299c4ea2a9a8c757f9c26352407d0ccc

new header for varatt.h 
Relevant PG commit:
d952373a987bad331c0e499463159dd142ced1ef

drop support for Abs, use fabs 
Relevant PG commit
357cfefb09115292cfb98d504199e6df8201c957

tuplesort PGcommit: d37aa3d35832afde94e100c4d2a9618b3eb76472 
Relevant PG commit:
d37aa3d35832afde94e100c4d2a9618b3eb76472

Fix vacuum in columnar 
Relevant PG commit:
4ce3afb82ecfbf64d4f6247e725004e1da30f47c
older one:
b6074846cebc33d752f1d9a66e5a9932f21ad177

Add alloc_flags to pg_clean_ascii 
Relevant PG commit:
45b1a67a0fcb3f1588df596431871de4c93cb76f

Merge GetNumConfigOptions() into get_guc_variables() 
Relevant PG commit:
3057465acfbea2f3dd7a914a1478064022c6eecd

Minor PG refactor PG_FUNCNAME_MACRO __func__ 
Relevant PG commit
320f92b744b44f961e5d56f5f21de003e8027a7f

Pass NULL context to stringToQualifiedNameList, typeStringToTypeName 
The pre-PG16 error behaviour for the following
stringToQualifiedNameList & typeStringToTypeName
was ereport(ERROR, ...)
Now with PG16 we have this context input. We preserve the same behaviour
by passing a NULL context, because of the following:
(copy paste comment from PG16)
If "context" isn't an ErrorSaveContext node, this behaves as
errstart(ERROR, domain), and the errsave() macro ends up acting
exactly like ereport(ERROR, ...).
Relevant PG commit
858e776c84f48841e7e16fba7b690b76e54f3675

Use RangeVarCallbackMaintainsTable instead of RangeVarCallbackOwnsTable 
Relevant PG commit:
60684dd834a222fefedd49b19d1f0a6189c1632e

FIX THIS: Not implemented grant-level control of role inheritance 
see PG commit
e3ce2de09d814f8770b2e3b3c152b7671bcdb83f

Make Scan node abstract 
PG commit:
8c73c11a0d39049de2c1f400d8765a0eb21f5228

Change in Var representations, get_relids_in_jointree 
PG commit
2489d76c4906f4461a364ca8ad7e0751ead8aa0d

Deadlock detection changes because SHM_QUEUE is removed 
Relevant PG Commit:
d137cb52cb7fd44a3f24f3c750fbf7924a4e9532

TU_UpdateIndexes 
Relevant PG commit
19d8e2308bc51ec4ab993ce90077342c915dd116

Use object_ownercheck and object_aclcheck functions 
Relevant PG commits:
afbfc02983f86c4d71825efa6befd547fe81a926
c727f511bd7bf3c58063737bcf7a8f331346f253

Rework Permission Info for successful compilation 
Relevant PG commits:
postgres/postgres@a61b1f7
postgres/postgres@b803b7d
---------

Co-authored-by: onderkalaci <onderkalaci@gmail.com>
2023-07-21 14:32:37 +03:00
Jeff Davis f944722c6a PG15: Use RelationGetSmgr() instead of RelationOpenSmgr().
Handle PG commit f10f0ae420.
2022-05-02 10:12:03 -07:00
jeff-davis b072b9235e
Columnar: fix checksums, broken in a4067913. (#5669)
Checksums must be set directly before writing the page. log_newpage()
sets the page LSN, and therefore invalidates the checksum.
2022-02-02 13:22:11 -08:00
Heikki Linnakangas a40679139b
Use smgrextend() when extending relation, and WAL-log first. (#5654)
When creating a new table, we bypass the buffer cache and write the
initial pages directly with smgrwrite(). However, you're supposed to
use smgrextend() when extending a relation, rather than smgrwrite().
There isn't much difference between them, but smgrextend() updates the
relation size cache, which seems important, although I haven't seen
any real bugs caused by that.

Also, write the block to disk only after WAL-logging it, so that we
can include the LSN of the WAL record in the version that we write
out. Currently, the page as written to disk has LSN 0. That doesn't
cause any user-visible issues either, at worst it could make us
WAL-log a full page image of the page earlier than necessary, but that
doesn't matter currently because we WAL-log full page images of all
changes anyway.

I bumped into that issue with LSN 0 in the page header when testing
Citus with Zenith (https://github.com/zenithdb/zenith/issues/1176).
Zenith contains a check that PANICs if you write a block to disk
without WAL-logging it, and it works by checking the LSN of the page
that's written out. In this case, we are WAL-logging the page even
though the LSN on the page is 0, so it was a false alarm, but I'd love
to get this changed in Citus to keep the check in Zenith simple.

A downside of WAL-logging the page first is that if you run out of
disk space, you have already created the WAL record. So if you then
crash and restart, WAL recovery will likely run out of disk space,
too, which is bad. In practice, we have the same problem in other
places, like rewriteheap.c. Also, if you are on the brink of running
out of disk space, you will probably run out at WAL replay anyway,
regardless of which order we write these few pages. But if we wanted
to fix that, we could first extend the relation with zeros, and then
WAL-log the pages. That's how heap extension works.

It would be even nicer to use the buffer cache for this, and skip the
smgrimmedsync() on the relation. However, that would require more
work, because we don't have the Relation struct for the relation here.
We could use ReadBufferWithoutRelcache(), but that doesn't work for
unlogged tables. Unlogged tables are currently not supported
(https://github.com/citusdata/citus/issues/4742), but that would
become a problem if we want to support them in the future.
CreateFakeRelcacheEntry() also doesn't work with unlogged tables. We
could do things differently for logged and unlogged tables, but that
complicates the code further.

Co-authored-by: jeff-davis <Jeffrey.Davis@microsoft.com>
2022-01-27 12:04:08 -08:00
jeff-davis 1546aa0d9f
Columnar: use proper generic WAL interface. (#5543)
Previously, we cheated by using the RM_GENERIC_ID record type, but not
actually using the generic WAL API. This worked because we always took
a full page image, and saved the extra work of allocating and copying
to a temporary page.

But it introduced complexity, and perhaps fragility, so better to just
use the API properly. The performance penalty for a serial data load
seems to be less than 1%.
2022-01-04 22:42:21 -08:00
Onur Tirtir 76b8006a9e
Allow overwriting columnar storage pages written by aborted xacts (#5484)
When refactoring storage layer in #4907, we deleted the code that allows
overwriting a disk page previously written but not known by metadata.

Readers can see the change that introduced the code allows doing so in
commit a8da9acc63.

The reasoning was that; as of 10.2, we started aligning page
reservations (`AlignReservation`) for subsequent writes right after
allocating pages from disk. That means, even if writer transaction
fails, subsequent writes are guaranteed to allocate a new page and write
to there. For this reason, attempting to write to a page allocated
before is not possible for a columnar table that user created when using
v10.2.x.

However, since the older versions of columnar doesn't do that, following
example scenario can still result in writing to such disk page, even if
user now upgraded to v10.2.x. This is because, when upgrading storage to
2.0 (`ColumnarStorageUpdateIfNeeded`), we calculate `reservedOffset` of
the metapage based on the highest used address known by stripe
metadata (`GetHighestUsedAddressAndId`). However, stripe metadata
doesn't have entries for aborted writes. As a result, highest used
address would be computed by ignoring pages that are allocated but not
used.

- User attempts writing to columnar table on Citus v10.0x/v10.1x.
- Write operation fails for some reason.
- User upgrades Citus to v10.2.x.
- When attempting to write to same columnar table, they hit to "attempt
  to write columnar data .." error since write operation done in the
  older version of columnar already allocated that page, and now we are
  overwriting it.

For this reason, with this commit, we re-do the change done in
a8da9acc63.

And for the reasons given above, it wasn't possible to add a test for
this commit via usual code-paths. For this reason, added a UDF only for
testing purposes so that we can reproduce the exact scenario in our
regression test suite.
2021-11-26 07:51:13 +01:00
Onur Tirtir 5d8f74bd0b
(Share) Lock buffer page when reading from columnar storage (#5338)
Under high write concurrency, we were sometimes reading columnar
metapage as all zeros.

In `WriteToBlock()`, if `clear == true`, then it will clear the page before
writing the new one, rather than just adding data to the page. That
means any concurrent connection that is holding only a pin will be
able to see the all-zero state between the `InitPage()` and the
`memcpy_s()`.

Moreover, postgres/storage/buffer/README states that:

> Buffer access rules:
>
> 1. To scan a page for tuples, one must hold a pin and either shared or
> exclusive content lock.  To examine the commit status (XIDs and status bits)
> of a tuple in a shared buffer, one must likewise hold a pin and either shared
> or exclusive lock.

For those reasons, we have to make sure to never keep a pin on the
page without (at least) the shared lock, to avoid having such problems.
2021-10-06 11:57:02 +03:00
jeff-davis cc58b58f73
Columnar: reserve metapage flag for UNLOGGED support. (#5237)
Reserve space in the metapage for a flag to support UNLOGGED tables in
the future without a metapage upgrade.
2021-09-03 08:40:55 -07:00
Onur Tirtir 889a2731cb
Split columnar stripe reservation into two phases (#5188)
Previously, we were doing `first_row_number` reservation for the first
row written to current `WriteState` but were doing `stripe_id`
reservation when flushing the `WriteState` and were inserting the
related record to `columnar.stripe` at that time as well.

However, inserting `columnar.stripe` record at flush-time is
problematic. This is because, as told in #5160, if relation has
any index-based constraints and if there are two concurrent
writes that are inserting conflicting key values for that constraint,
then postgres relies on `tableAM->fetch_index_tuple`
(=`columnar_fetch_index_tuple`) callback to return `true` when
indexAM is checking against possible constraint violations.

However, pending writes of other backends are not visible to concurrent
sessions in columnar since we were not inserting the stripe metadata
record until flushing the stripe.

With this commit, we split stripe reservation into two phases:
i) Reserve `stripe_id` and insert a "dummy" record to `columnar.stripe`
at the very same time we reserve `first_row_number`, i.e. when writing
the first row to the current `WriteState`.
ii) At flush time, do the storage level allocation and complete the
missing fields of the dummy record inserted into `columnar.stripe`
during i).

That way, any concurrent writes would be able to check against possible
constraint violations by using `SnapshotDirty` when scanning
`columnar.stripe`.

Note that `columnar_fetch_index_tuple` still wouldn't be able to fill
the output tupleslot for the requested tid but it would at least return
`true` for such index look-up's and we believe this should be sufficient
for the caller indexAM callback to make the concurrent writer block on
prior one.

That is how we fix #5160.

Only downside of reserving `stripe_id` at the same time we reserve
`first_row_number` is that now any aborted writes would also waste
some amount of `stripe_id` as in the case of `first_row_number` but
we are just wasting them one-by-one.

Considering the fact that we waste `first_row_number` by the amount
stripe row limit (=150k by default) in such cases, this shouldn't be
important at all.
2021-09-02 11:49:14 +03:00
Onur Tirtir 1af50e98b3 Fix a comment in ColumnarMetapageRead 2021-06-16 19:59:32 +03:00
Onur Tirtir 2552aee404 Handle old versioned columnar metapage after binary upgrade (#4956)
* Make VACUUM hint for upgrade scenario actually work

* Suggest using VACUUM if metapage doesn't exist

Plus, suggest upgrading sql version as another option.

* Always force read metapage block

* Fix two typos
2021-05-10 20:16:50 +03:00
Onur Tirtir 2e419ea177 Add first_row_number column to columnar.stripe for tid mapping 2021-05-10 20:16:50 +03:00
Onur Tirtir 9c1ac3127f Implement ColumnarOverwriteMetapage 2021-05-10 20:16:50 +03:00
jeff-davis 7b9aecff21 Columnnar: metapage changes. (#4907)
* Columnar: introduce columnar storage API.

This new API is responsible for the low-level storage details of
columnar; translating large reads and writes into individual block
reads and writes that respect the page headers and emit WAL. It's also
responsible for the columnar metapage, resource reservations (stripe
IDs, row numbers, and data), and truncation.

This new API is not used yet, but will be used in subsequent
forthcoming commits.

* Columnar: add columnar_storage_info() for debugging purposes.

* Columnar: expose ColumnarMetadataNewStorageId().

* Columnar: always initialize metapage at creation time.

This avoids the complexity of dealing with tables where the metapage
has not yet been initialized.

* Columnar: columnar storage upgrade/downgrade UDFs.

Necessary upgrade/downgrade step so that new code doesn't see an old
metapage.

* Columnar: improve metadata.c comment.

* Columnar: make ColumnarMetapage internal to the storage API.

Callers should not have or need direct access to the metapage.

* Columnar: perform resource reservation using storage API.

* Columnar: implement truncate using storage API.

* Columnar: implement read/write paths with storage API.

* Columnar: add storage tests.

* Revert "Columnar: don't include stripe reservation locks in lock graph."

This reverts commit c3dcd6b9f8.

No longer needed because the columnar storage API takes care of
concurrency for resource reservation.

* Columnar: remove unnecessary lock when reserving.

No longer necessary because the columnar storage API takes care of
concurrent resource reservation.

* Add simple upgrade tests for storage/ branch

* fix multi_extension.out

Co-authored-by: Onur Tirtir <onurcantirtir@gmail.com>
2021-05-10 20:16:46 +03:00