From a40679139b4fd5f0df7223ae7f7886e02edbb84a Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 27 Jan 2022 22:04:08 +0200 Subject: [PATCH] Use smgrextend() when extending relation, and WAL-log first. (#5654) When creating a new table, we bypass the buffer cache and write the initial pages directly with smgrwrite(). However, you're supposed to use smgrextend() when extending a relation, rather than smgrwrite(). There isn't much difference between them, but smgrextend() updates the relation size cache, which seems important, although I haven't seen any real bugs caused by that. Also, write the block to disk only after WAL-logging it, so that we can include the LSN of the WAL record in the version that we write out. Currently, the page as written to disk has LSN 0. That doesn't cause any user-visible issues either, at worst it could make us WAL-log a full page image of the page earlier than necessary, but that doesn't matter currently because we WAL-log full page images of all changes anyway. I bumped into that issue with LSN 0 in the page header when testing Citus with Zenith (https://github.com/zenithdb/zenith/issues/1176). Zenith contains a check that PANICs if you write a block to disk without WAL-logging it, and it works by checking the LSN of the page that's written out. In this case, we are WAL-logging the page even though the LSN on the page is 0, so it was a false alarm, but I'd love to get this changed in Citus to keep the check in Zenith simple. A downside of WAL-logging the page first is that if you run out of disk space, you have already created the WAL record. So if you then crash and restart, WAL recovery will likely run out of disk space, too, which is bad. In practice, we have the same problem in other places, like rewriteheap.c. Also, if you are on the brink of running out of disk space, you will probably run out at WAL replay anyway, regardless of which order we write these few pages. But if we wanted to fix that, we could first extend the relation with zeros, and then WAL-log the pages. That's how heap extension works. It would be even nicer to use the buffer cache for this, and skip the smgrimmedsync() on the relation. However, that would require more work, because we don't have the Relation struct for the relation here. We could use ReadBufferWithoutRelcache(), but that doesn't work for unlogged tables. Unlogged tables are currently not supported (https://github.com/citusdata/citus/issues/4742), but that would become a problem if we want to support them in the future. CreateFakeRelcacheEntry() also doesn't work with unlogged tables. We could do things differently for logged and unlogged tables, but that complicates the code further. Co-authored-by: jeff-davis --- src/backend/columnar/columnar_storage.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/backend/columnar/columnar_storage.c b/src/backend/columnar/columnar_storage.c index 5d5b8cf6d..58eb87b7e 100644 --- a/src/backend/columnar/columnar_storage.c +++ b/src/backend/columnar/columnar_storage.c @@ -187,17 +187,17 @@ ColumnarStorageInit(SMgrRelation srel, uint64 storageId) phdr->pd_lower += sizeof(ColumnarMetapage); PageSetChecksumInplace(page, COLUMNAR_METAPAGE_BLOCKNO); - smgrwrite(srel, MAIN_FORKNUM, COLUMNAR_METAPAGE_BLOCKNO, page, true); log_newpage(&srel->smgr_rnode.node, MAIN_FORKNUM, COLUMNAR_METAPAGE_BLOCKNO, page, true); + smgrextend(srel, MAIN_FORKNUM, COLUMNAR_METAPAGE_BLOCKNO, page, true); /* write empty page */ PageInit(page, BLCKSZ, 0); PageSetChecksumInplace(page, COLUMNAR_EMPTY_BLOCKNO); - smgrwrite(srel, MAIN_FORKNUM, COLUMNAR_EMPTY_BLOCKNO, page, true); log_newpage(&srel->smgr_rnode.node, MAIN_FORKNUM, COLUMNAR_EMPTY_BLOCKNO, page, true); + smgrextend(srel, MAIN_FORKNUM, COLUMNAR_EMPTY_BLOCKNO, page, true); /* * An immediate sync is required even if we xlog'd the page, because the