NotificationsYou must be signed in to change notification settings
Fork6
Star31

Commit9358297

committed

freespace: Don't return blocks past the end of the main fork.

GetPageWithFreeSpace() callers assume the returned block exists in themain fork, failing with "could not read block" errors if that doesn'thold. Make that assumption reliable now. It hadn't been guaranteed,due to the weak WAL and data ordering of participating components. Mostoperations on the fsm fork are not WAL-logged. Relation extension isnot WAL-logged. Hence, an fsm-fork block on disk can reference amain-fork block that no WAL record has initialized. That could happenafter an OS crash, a replica promote, or a PITR restore. wal_log_hintsmakes the trouble easier to hit; a replica promote or PITR ending justafter a relevant fsm-fork FPI_FOR_HINT may yield this broken state. Thev16 RelationAddBlocks() mechanism also makes the trouble easier to hit,since it bulk-extends even without extension lock waiters. Commit917dc7d stopped trouble aroundtruncation, but vectors involving PageIsNew() pages remained.This implementation adds a RelationGetNumberOfBlocks() call when thecached relation size doesn't confirm a block exists. We've been unableto identify a benchmark that slows materially, but this may show up asadditional time in lseek(). An alternative without that overhead wouldbe a new ReadBufferMode such that ReadBufferExtended() returns NULLafter a 0-byte read, with all other errors handled normally. However,each GetFreeIndexPage() caller would then need code for the return-NULLcase. Back-patch to v14, due to earlier versions not caching relationsize and the absence of a pre-v16 problem report.Ronan Dunklau. Reported by Ronan Dunklau.Discussion:https://postgr.es/m/1878547.tdWV9SEqCh%40aivenlaptop

1 parent68ba46d commit9358297Copy full SHA for 9358297

File tree

4 files changed

+112

-20

lines changed

src
- backend/storage
  - freespace
    - README
    - freespace.c
  - smgr
    - smgr.c
- test/recovery/t
  - 008_fsm_truncation.pl

4 files changed

+112

-20

lines changed

`‎src/backend/storage/freespace/README‎`

Lines changed: 13 additions & 3 deletions

Original file line number	Diff line number	Diff line change
`@@ -169,9 +169,7 @@ Recovery`
`169`	`169`	`--------`
`170`	`170`
`171`	`171`	`The FSM is not explicitly WAL-logged. Instead, we rely on a bunch of`
`172`		`-self-correcting measures to repair possible corruption. As a result when`
`173`		`-we write to the FSM we treat that as a hint and thus use MarkBufferDirtyHint()`
`174`		`-rather than MarkBufferDirty().`
	`172`	`+self-correcting measures to repair possible corruption.`
`175`	`173`
`176`	`174`	`First of all, whenever a value is set on an FSM page, the root node of the`
`177`	`175`	`page is compared against the new value after bubbling up the change is`
`@@ -188,6 +186,18 @@ goes through fsm_set_avail(), so that the upper nodes on those pages are`
`188`	`186`	`immediately updated. Periodically, VACUUM calls FreeSpaceMapVacuum[Range]`
`189`	`187`	`to propagate the new free-space info into the upper pages of the FSM tree.`
`190`	`188`
	`189`	`+As a result when we write to the FSM we treat that as a hint and thus use`
	`190`	`+MarkBufferDirtyHint() rather than MarkBufferDirty(). Every read here uses`
	`191`	`+RBM_ZERO_ON_ERROR to bypass checksum mismatches and other verification`
	`192`	`+failures. We'd operate correctly without the full page images that`
	`193`	`+MarkBufferDirtyHint() provides, but they do decrease the chance of losing slot`
	`194`	`+knowledge to RBM_ZERO_ON_ERROR.`
	`195`	`+`
	`196`	`+Relation extension is not WAL-logged. Hence, after WAL replay, an on-disk FSM`
	`197`	`+slot may indicate free space in PageIsNew() blocks that never reached disk.`
	`198`	`+We detect this case by comparing against the actual relation size, and we mark`
	`199`	`+the block as full in that case.`
	`200`	`+`
`191`	`201`	`TODO`
`192`	`202`	`----`
`193`	`203`

`‎src/backend/storage/freespace/freespace.c‎`

Lines changed: 94 additions & 12 deletions

Original file line number	Diff line number	Diff line change
`@@ -112,6 +112,7 @@ static BlockNumber fsm_search(Relation rel, uint8 min_cat);`
`112`	`112`	`staticuint8fsm_vacuum_page(Relationrel,FSMAddressaddr,`
`113`	`113`	`BlockNumberstart,BlockNumberend,`
`114`	`114`	`bool*eof_p);`
	`115`	`+staticboolfsm_does_block_exist(Relationrel,BlockNumberblknumber);`
`115`	`116`
`116`	`117`
`117`	`118`	`/****** Public API ******/`
`@@ -128,6 +129,9 @@ static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr,`
`128`	`129`	`* amount of free space available on that page and then try again (see`
`129`	`130`	`* RecordAndGetPageWithFreeSpace). If InvalidBlockNumber is returned,`
`130`	`131`	`* extend the relation.`
	`132`	`+ *`
	`133`	`+ * This can trigger FSM updates if any FSM entry is found to point to a block`
	`134`	`+ * past the end of the relation.`
`131`	`135`	`*/`
`132`	`136`	`BlockNumber`
`133`	`137`	`GetPageWithFreeSpace(Relationrel,SizespaceNeeded)`
`@@ -166,9 +170,17 @@ RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,`
`166`	`170`	`* Otherwise, search as usual.`
`167`	`171`	`*/`
`168`	`172`	`if (search_slot!=-1)`
`169`		`-returnfsm_get_heap_blk(addr,search_slot);`
`170`		`-else`
`171`		`-returnfsm_search(rel,search_cat);`
	`173`	`+{`
	`174`	`+BlockNumberblknum=fsm_get_heap_blk(addr,search_slot);`
	`175`	`+`
	`176`	`+/*`
	`177`	`+ * Check that the blknum is actually in the relation. Don't try to`
	`178`	`+ * update the FSM in that case, just fall back to the other case`
	`179`	`+ */`
	`180`	`+if (fsm_does_block_exist(rel,blknum))`
	`181`	`+returnblknum;`
	`182`	`+}`
	`183`	`+returnfsm_search(rel,search_cat);`
`172`	`184`	`}`
`173`	`185`
`174`	`186`	`/*`
`@@ -297,14 +309,25 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)`
`297`	`309`	`fsm_truncate_avail(BufferGetPage(buf),first_removed_slot);`
`298`	`310`
`299`	`311`	`/*`
`300`		`- * Truncation of a relation is WAL-logged at a higher-level, and we`
`301`		`- * will be called at WAL replay. But if checksums are enabled, we need`
`302`		`- * to still write a WAL record to protect against a torn page, if the`
`303`		`- * page is flushed to disk before the truncation WAL record. We cannot`
`304`		`- * use MarkBufferDirtyHint here, because that will not dirty the page`
`305`		`- * during recovery.`
	`312`	`+ * This change is non-critical, because fsm_does_block_exist() would`
	`313`	`+ * stop us from returning a truncated-away block. However, since this`
	`314`	`+ * may remove up to SlotsPerFSMPage slots, it's nice to avoid the cost`
	`315`	`+ * of that many fsm_does_block_exist() rejections. Use a full`
	`316`	`+ * MarkBufferDirty(), not MarkBufferDirtyHint().`
`306`	`317`	`*/`
`307`	`318`	`MarkBufferDirty(buf);`
	`319`	`+`
	`320`	`+/*`
	`321`	`+ * WAL-log like MarkBufferDirtyHint() might have done, just to avoid`
	`322`	`+ * differing from the rest of the file in this respect. This is`
	`323`	`+ * optional; see README mention of full page images. XXX consider`
	`324`	`+ * XLogSaveBufferForHint() for even closer similarity.`
	`325`	`+ *`
	`326`	`+ * A higher-level operation calls us at WAL replay. If we crash`
	`327`	`+ * before the XLOG_SMGR_TRUNCATE flushes to disk, main fork length has`
	`328`	`+ * not changed, and our fork remains valid. If we crash after that`
	`329`	`+ * flush, redo will return here.`
	`330`	`+ */`
`308`	`331`	`if (!InRecovery&&RelationNeedsWAL(rel)&&XLogHintBitIsNeeded())`
`309`	`332`	`log_newpage_buffer(buf, false);`
`310`	`333`
`@@ -674,8 +697,15 @@ fsm_search(Relation rel, uint8 min_cat)`
`674`	`697`	`(addr.level==FSM_BOTTOM_LEVEL),`
`675`	`698`	`false);`
`676`	`699`	`if (slot==-1)`
	`700`	`+{`
`677`	`701`	`max_avail=fsm_get_max_avail(BufferGetPage(buf));`
`678`		`-UnlockReleaseBuffer(buf);`
	`702`	`+UnlockReleaseBuffer(buf);`
	`703`	`+}`
	`704`	`+else`
	`705`	`+{`
	`706`	`+/* Keep the pin for possible update below */`
	`707`	`+LockBuffer(buf,BUFFER_LOCK_UNLOCK);`
	`708`	`+}`
`679`	`709`	`}`
`680`	`710`	`else`
`681`	`711`	`slot=-1;`
`@@ -687,8 +717,37 @@ fsm_search(Relation rel, uint8 min_cat)`
`687`	`717`	`* bottom.`
`688`	`718`	`*/`
`689`	`719`	`if (addr.level==FSM_BOTTOM_LEVEL)`
`690`		`-returnfsm_get_heap_blk(addr,slot);`
`691`		`-`
	`720`	`+{`
	`721`	`+BlockNumberblkno=fsm_get_heap_blk(addr,slot);`
	`722`	`+Pagepage;`
	`723`	`+`
	`724`	`+if (fsm_does_block_exist(rel,blkno))`
	`725`	`+{`
	`726`	`+ReleaseBuffer(buf);`
	`727`	`+returnblkno;`
	`728`	`+}`
	`729`	`+`
	`730`	`+/*`
	`731`	`+ * Block is past the end of the relation. Update FSM, and`
	`732`	`+ * restart from root. The usual "advancenext" behavior is`
	`733`	`+ * pessimal for this rare scenario, since every later slot is`
	`734`	`+ * unusable in the same way. We could zero all affected slots`
	`735`	`+ * on the same FSM page, but don't bet on the benefits of that`
	`736`	`+ * optimization justifying its compiled code bulk.`
	`737`	`+ */`
	`738`	`+page=BufferGetPage(buf);`
	`739`	`+LockBuffer(buf,BUFFER_LOCK_EXCLUSIVE);`
	`740`	`+fsm_set_avail(page,slot,0);`
	`741`	`+MarkBufferDirtyHint(buf, false);`
	`742`	`+UnlockReleaseBuffer(buf);`
	`743`	`+if (restarts++>10000)/* same rationale as below */`
	`744`	`+returnInvalidBlockNumber;`
	`745`	`+addr=FSM_ROOT_ADDRESS;`
	`746`	`+}`
	`747`	`+else`
	`748`	`+{`
	`749`	`+ReleaseBuffer(buf);`
	`750`	`+}`
`692`	`751`	`addr=fsm_get_child(addr,slot);`
`693`	`752`	`}`
`694`	`753`	`elseif (addr.level==FSM_ROOT_LEVEL)`
`@@ -856,3 +915,26 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,`
`856`	`915`
`857`	`916`	`returnmax_avail;`
`858`	`917`	`}`
	`918`	`+`
	`919`	`+`
	`920`	`+/*`
	`921`	`+ * Check whether a block number is past the end of the relation. This can`
	`922`	`+ * happen after WAL replay, if the FSM reached disk but newly-extended pages`
	`923`	`+ * it refers to did not.`
	`924`	`+ */`
	`925`	`+staticbool`
	`926`	`+fsm_does_block_exist(Relationrel,BlockNumberblknumber)`
	`927`	`+{`
	`928`	`+SMgrRelationsmgr=RelationGetSmgr(rel);`
	`929`	`+`
	`930`	`+/*`
	`931`	`+ * If below the cached nblocks, the block surely exists. Otherwise, we`
	`932`	`+ * face a trade-off. We opt to compare to a fresh nblocks, incurring`
	`933`	`+ * lseek() overhead. The alternative would be to assume the block does`
	`934`	`+ * not exist, but that would cause FSM to set zero space available for`
	`935`	`+ * blocks that main fork extension just recorded.`
	`936`	`+ */`
	`937`	`+return ((BlockNumberIsValid(smgr->smgr_cached_nblocks[MAIN_FORKNUM])&&`
	`938`	`+blknumber<smgr->smgr_cached_nblocks[MAIN_FORKNUM])\|\|`
	`939`	`+blknumber<RelationGetNumberOfBlocks(rel));`
	`940`	`+}`

`‎src/backend/storage/smgr/smgr.c‎`

Lines changed: 3 additions & 2 deletions

Original file line number	Diff line number	Diff line change
`@@ -679,8 +679,9 @@ BlockNumber`
`679`	`679`	`smgrnblocks_cached(SMgrRelationreln,ForkNumberforknum)`
`680`	`680`	`{`
`681`	`681`	`/*`
`682`		`- * For now, we only use cached values in recovery due to lack of a shared`
`683`		`- * invalidation mechanism for changes in file size.`
	`682`	`+ * For now, this function uses cached values only in recovery due to lack`
	`683`	`+ * of a shared invalidation mechanism for changes in file size. Code`
	`684`	`+ * elsewhere reads smgr_cached_nblocks and copes with stale data.`
`684`	`685`	`*/`
`685`	`686`	`if (InRecovery&&reln->smgr_cached_nblocks[forknum]!=InvalidBlockNumber)`
`686`	`687`	`returnreln->smgr_cached_nblocks[forknum];`

`‎src/test/recovery/t/008_fsm_truncation.pl‎`

Lines changed: 2 additions & 3 deletions

Original file line number	Diff line number	Diff line change
`@@ -1,9 +1,8 @@`
`1`	`1`
`2`	`2`	`# Copyright (c) 2021-2024, PostgreSQL Global Development Group`
`3`	`3`
`4`		`-# Test WAL replay of FSM changes.`
`5`		`-#`
`6`		`-# FSM changes don't normally need to be WAL-logged, except for truncation.`
	`4`	`+# Test FSM-driven INSERT just after truncation clears FSM slots indicating`
	`5`	`+# free space in removed blocks.`
`7`	`6`	`# The FSM mustn't return a page that doesn't exist (anymore).`
`8`	`7`	`use strict;`
`9`	`8`	`use warningsFATAL=>'all';`

0 commit comments

Comments

(0)

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Commit9358297

File tree

4 files changed

4 files changed

`‎src/backend/storage/freespace/README‎`

`‎src/backend/storage/freespace/freespace.c‎`

`‎src/backend/storage/smgr/smgr.c‎`

`‎src/test/recovery/t/008_fsm_truncation.pl‎`

0 commit comments