mirror of https://github.com/citusdata/citus.git
Use smgrextend() when extending relation, and WAL-log first. (#5654)
When creating a new table, we bypass the buffer cache and write the initial pages directly with smgrwrite(). However, you're supposed to use smgrextend() when extending a relation, rather than smgrwrite(). There isn't much difference between them, but smgrextend() updates the relation size cache, which seems important, although I haven't seen any real bugs caused by that. Also, write the block to disk only after WAL-logging it, so that we can include the LSN of the WAL record in the version that we write out. Currently, the page as written to disk has LSN 0. That doesn't cause any user-visible issues either, at worst it could make us WAL-log a full page image of the page earlier than necessary, but that doesn't matter currently because we WAL-log full page images of all changes anyway. I bumped into that issue with LSN 0 in the page header when testing Citus with Zenith (https://github.com/zenithdb/zenith/issues/1176). Zenith contains a check that PANICs if you write a block to disk without WAL-logging it, and it works by checking the LSN of the page that's written out. In this case, we are WAL-logging the page even though the LSN on the page is 0, so it was a false alarm, but I'd love to get this changed in Citus to keep the check in Zenith simple. A downside of WAL-logging the page first is that if you run out of disk space, you have already created the WAL record. So if you then crash and restart, WAL recovery will likely run out of disk space, too, which is bad. In practice, we have the same problem in other places, like rewriteheap.c. Also, if you are on the brink of running out of disk space, you will probably run out at WAL replay anyway, regardless of which order we write these few pages. But if we wanted to fix that, we could first extend the relation with zeros, and then WAL-log the pages. That's how heap extension works. It would be even nicer to use the buffer cache for this, and skip the smgrimmedsync() on the relation. However, that would require more work, because we don't have the Relation struct for the relation here. We could use ReadBufferWithoutRelcache(), but that doesn't work for unlogged tables. Unlogged tables are currently not supported (https://github.com/citusdata/citus/issues/4742), but that would become a problem if we want to support them in the future. CreateFakeRelcacheEntry() also doesn't work with unlogged tables. We could do things differently for logged and unlogged tables, but that complicates the code further. Co-authored-by: jeff-davis <Jeffrey.Davis@microsoft.com>pull/5659/head^2
parent
dcb9c71f19
commit
a40679139b
|
@ -187,17 +187,17 @@ ColumnarStorageInit(SMgrRelation srel, uint64 storageId)
|
|||
phdr->pd_lower += sizeof(ColumnarMetapage);
|
||||
|
||||
PageSetChecksumInplace(page, COLUMNAR_METAPAGE_BLOCKNO);
|
||||
smgrwrite(srel, MAIN_FORKNUM, COLUMNAR_METAPAGE_BLOCKNO, page, true);
|
||||
log_newpage(&srel->smgr_rnode.node, MAIN_FORKNUM,
|
||||
COLUMNAR_METAPAGE_BLOCKNO, page, true);
|
||||
smgrextend(srel, MAIN_FORKNUM, COLUMNAR_METAPAGE_BLOCKNO, page, true);
|
||||
|
||||
/* write empty page */
|
||||
PageInit(page, BLCKSZ, 0);
|
||||
|
||||
PageSetChecksumInplace(page, COLUMNAR_EMPTY_BLOCKNO);
|
||||
smgrwrite(srel, MAIN_FORKNUM, COLUMNAR_EMPTY_BLOCKNO, page, true);
|
||||
log_newpage(&srel->smgr_rnode.node, MAIN_FORKNUM,
|
||||
COLUMNAR_EMPTY_BLOCKNO, page, true);
|
||||
smgrextend(srel, MAIN_FORKNUM, COLUMNAR_EMPTY_BLOCKNO, page, true);
|
||||
|
||||
/*
|
||||
* An immediate sync is required even if we xlog'd the page, because the
|
||||
|
|
Loading…
Reference in New Issue