Commit 08059fc

freespace: Don't return blocks past the end of the main fork.

GetPageWithFreeSpace() callers assume the returned block exists in the
main fork, failing with "could not read block" errors if that doesn't
hold. Make that assumption reliable now. It hadn't been guaranteed,
due to the weak WAL and data ordering of participating components. Most
operations on the fsm fork are not WAL-logged. Relation extension is
not WAL-logged. Hence, an fsm-fork block on disk can reference a
main-fork block that no WAL record has initialized. That could happen
after an OS crash, a replica promote, or a PITR restore. wal_log_hints
makes the trouble easier to hit; a replica promote or PITR ending just
after a relevant fsm-fork FPI_FOR_HINT may yield this broken state. The
v16 RelationAddBlocks() mechanism also makes the trouble easier to hit,
since it bulk-extends even without extension lock waiters. Commit
917dc7d stopped trouble around truncation, but vectors involving
PageIsNew() pages remained.

This implementation adds a RelationGetNumberOfBlocks() call when the
cached relation size doesn't confirm a block exists. We've been unable
to identify a benchmark that slows materially, but this may show up as
additional time in lseek(). An alternative without that overhead would
be a new ReadBufferMode such that ReadBufferExtended() returns NULL
after a 0-byte read, with all other errors handled normally. However,
each GetFreeIndexPage() caller would then need code for the return-NULL
case. Back-patch to v14, due to earlier versions not caching relation
size and the absence of a pre-v16 problem report.

Ronan Dunklau. Reported by Ronan Dunklau.

Discussion: https://postgr.es/m/1878547.tdWV9SEqCh%40aivenlaptop
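
The existence check described in the commit message can be modeled outside PostgreSQL as follows. This is a minimal sketch, not the committed code: `ToyRelation`, `toy_fresh_nblocks()` and `toy_block_exists()` are hypothetical stand-ins, and the toy refreshes its cache on every fresh lookup, which is an assumption of this model only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyBlockNumber;
#define TOY_INVALID_BLOCK ((ToyBlockNumber) 0xFFFFFFFF)

/* Hypothetical stand-in for a relation; not a PostgreSQL structure. */
typedef struct ToyRelation
{
    ToyBlockNumber cached_nblocks;  /* possibly stale, or TOY_INVALID_BLOCK */
    ToyBlockNumber actual_nblocks;  /* what a fresh size query would report */
    unsigned       fresh_lookups;   /* counts trips down the expensive path */
} ToyRelation;

/* Models the expensive size query (an lseek() in the real system). */
static ToyBlockNumber
toy_fresh_nblocks(ToyRelation *rel)
{
    rel->fresh_lookups++;
    rel->cached_nblocks = rel->actual_nblocks;  /* assumption of this toy */
    return rel->actual_nblocks;
}

/*
 * Cheap path first: a block below the cached size is taken to exist.
 * Only when the cache cannot confirm that do we pay for a fresh lookup.
 */
static bool
toy_block_exists(ToyRelation *rel, ToyBlockNumber blkno)
{
    assert(blkno != TOY_INVALID_BLOCK);

    if (rel->cached_nblocks != TOY_INVALID_BLOCK &&
        blkno < rel->cached_nblocks)
        return true;

    return blkno < toy_fresh_nblocks(rel);
}
```

In this model, only blocks at or beyond the cached size trigger the fresh lookup, which is why the extra cost is expected to be visible mainly as additional lseek() time.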
1 parent da11a14, commit 08059fc

4 files changed (+112 -20 lines)


‎src/backend/storage/freespace/README

Lines changed: 13 additions & 3 deletions
@@ -169,9 +169,7 @@ Recovery
 --------
 
 The FSM is not explicitly WAL-logged. Instead, we rely on a bunch of
-self-correcting measures to repair possible corruption. As a result when
-we write to the FSM we treat that as a hint and thus use MarkBufferDirtyHint()
-rather than MarkBufferDirty().
+self-correcting measures to repair possible corruption.
 
 First of all, whenever a value is set on an FSM page, the root node of the
 page is compared against the new value after bubbling up the change is
@@ -188,6 +186,18 @@ goes through fsm_set_avail(), so that the upper nodes on those pages are
 immediately updated. Periodically, VACUUM calls FreeSpaceMapVacuum[Range]
 to propagate the new free-space info into the upper pages of the FSM tree.
 
+As a result when we write to the FSM we treat that as a hint and thus use
+MarkBufferDirtyHint() rather than MarkBufferDirty(). Every read here uses
+RBM_ZERO_ON_ERROR to bypass checksum mismatches and other verification
+failures. We'd operate correctly without the full page images that
+MarkBufferDirtyHint() provides, but they do decrease the chance of losing slot
+knowledge to RBM_ZERO_ON_ERROR.
+
+Relation extension is not WAL-logged. Hence, after WAL replay, an on-disk FSM
+slot may indicate free space in PageIsNew() blocks that never reached disk.
+We detect this case by comparing against the actual relation size, and we mark
+the block as full in that case.
+
 TODO
 ----
 
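
The self-correcting behavior that the README addition describes (a slot advertising space in a block past the end of the relation is marked full, then the search restarts) can be sketched with a toy single-page map. Everything here is hypothetical: `ToyFsmPage`, `toy_fsm_search()`, the eight-slot page, and the assumption that slot index equals heap block number. The real FSM is a multi-level tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS 8
#define TOY_INVALID_BLOCK UINT32_MAX

/* Toy FSM page: avail[i] is the free-space category of heap block i. */
typedef struct ToyFsmPage
{
    uint8_t avail[TOY_SLOTS];
} ToyFsmPage;

/*
 * Return the first block with at least min_cat free space that lies below
 * nblocks.  A candidate at or beyond nblocks is marked full (category 0)
 * and the search restarts; a restart cap guards against looping forever.
 */
static uint32_t
toy_fsm_search(ToyFsmPage *page, uint8_t min_cat, uint32_t nblocks)
{
    int restarts = 0;

    assert(page != NULL);

    for (;;)
    {
        int slot = -1;

        for (int i = 0; i < TOY_SLOTS; i++)
        {
            if (page->avail[i] >= min_cat)
            {
                slot = i;
                break;
            }
        }

        if (slot < 0)
            return TOY_INVALID_BLOCK;   /* caller would extend the relation */

        if ((uint32_t) slot < nblocks)
            return (uint32_t) slot;     /* block really exists */

        page->avail[slot] = 0;          /* stale entry: mark block as full */
        if (restarts++ > TOY_SLOTS)
            return TOY_INVALID_BLOCK;
    }
}
```

Zeroing one slot per restart mirrors the trade-off noted in the patch comments: clearing every affected slot at once would need fewer restarts, but the scenario is rare enough that the simpler loop was preferred.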

‎src/backend/storage/freespace/freespace.c

Lines changed: 94 additions & 12 deletions
@@ -111,6 +111,7 @@ static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr,
                              BlockNumber start, BlockNumber end,
                              bool *eof);
+static bool fsm_does_block_exist(Relation rel, BlockNumber blknumber);
 
 
 /******** Public API ********/
@@ -127,6 +128,9 @@ static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr,
  * amount of free space available on that page and then try again (see
  * RecordAndGetPageWithFreeSpace). If InvalidBlockNumber is returned,
  * extend the relation.
+ *
+ * This can trigger FSM updates if any FSM entry is found to point to a block
+ * past the end of the relation.
  */
 BlockNumber
 GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
@@ -165,9 +169,17 @@ RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
      * Otherwise, search as usual.
      */
     if (search_slot != -1)
-        return fsm_get_heap_blk(addr, search_slot);
-    else
-        return fsm_search(rel, search_cat);
+    {
+        BlockNumber blknum = fsm_get_heap_blk(addr, search_slot);
+
+        /*
+         * Check that the blknum is actually in the relation. Don't try to
+         * update the FSM in that case, just fall back to the other case
+         */
+        if (fsm_does_block_exist(rel, blknum))
+            return blknum;
+    }
+    return fsm_search(rel, search_cat);
 }
 
 /*
@@ -295,14 +307,25 @@ FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
         fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
 
         /*
-         * Truncation of a relation is WAL-logged at a higher-level, and we
-         * will be called at WAL replay. But if checksums are enabled, we need
-         * to still write a WAL record to protect against a torn page, if the
-         * page is flushed to disk before the truncation WAL record. We cannot
-         * use MarkBufferDirtyHint here, because that will not dirty the page
-         * during recovery.
+         * This change is non-critical, because fsm_does_block_exist() would
+         * stop us from returning a truncated-away block. However, since this
+         * may remove up to SlotsPerFSMPage slots, it's nice to avoid the cost
+         * of that many fsm_does_block_exist() rejections. Use a full
+         * MarkBufferDirty(), not MarkBufferDirtyHint().
          */
         MarkBufferDirty(buf);
+
+        /*
+         * WAL-log like MarkBufferDirtyHint() might have done, just to avoid
+         * differing from the rest of the file in this respect. This is
+         * optional; see README mention of full page images. XXX consider
+         * XLogSaveBufferForHint() for even closer similarity.
+         *
+         * A higher-level operation calls us at WAL replay. If we crash
+         * before the XLOG_SMGR_TRUNCATE flushes to disk, main fork length has
+         * not changed, and our fork remains valid. If we crash after that
+         * flush, redo will return here.
+         */
         if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
             log_newpage_buffer(buf, false);
 
@@ -719,8 +742,15 @@ fsm_search(Relation rel, uint8 min_cat)
                                     (addr.level == FSM_BOTTOM_LEVEL),
                                     false);
             if (slot == -1)
+            {
                 max_avail = fsm_get_max_avail(BufferGetPage(buf));
-            UnlockReleaseBuffer(buf);
+                UnlockReleaseBuffer(buf);
+            }
+            else
+            {
+                /* Keep the pin for possible update below */
+                LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+            }
         }
         else
             slot = -1;
@@ -732,8 +762,37 @@ fsm_search(Relation rel, uint8 min_cat)
              * bottom.
              */
             if (addr.level == FSM_BOTTOM_LEVEL)
-                return fsm_get_heap_blk(addr, slot);
-
+            {
+                BlockNumber blkno = fsm_get_heap_blk(addr, slot);
+                Page        page;
+
+                if (fsm_does_block_exist(rel, blkno))
+                {
+                    ReleaseBuffer(buf);
+                    return blkno;
+                }
+
+                /*
+                 * Block is past the end of the relation. Update FSM, and
+                 * restart from root. The usual "advancenext" behavior is
+                 * pessimal for this rare scenario, since every later slot is
+                 * unusable in the same way. We could zero all affected slots
+                 * on the same FSM page, but don't bet on the benefits of that
+                 * optimization justifying its compiled code bulk.
+                 */
+                page = BufferGetPage(buf);
+                LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+                fsm_set_avail(page, slot, 0);
+                MarkBufferDirtyHint(buf, false);
+                UnlockReleaseBuffer(buf);
+                if (restarts++ > 10000) /* same rationale as below */
+                    return InvalidBlockNumber;
+                addr = FSM_ROOT_ADDRESS;
+            }
+            else
+            {
+                ReleaseBuffer(buf);
+            }
             addr = fsm_get_child(addr, slot);
         }
         else if (addr.level == FSM_ROOT_LEVEL)
@@ -901,3 +960,26 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
 
     return max_avail;
 }
+
+
+/*
+ * Check whether a block number is past the end of the relation. This can
+ * happen after WAL replay, if the FSM reached disk but newly-extended pages
+ * it refers to did not.
+ */
+static bool
+fsm_does_block_exist(Relation rel, BlockNumber blknumber)
+{
+    SMgrRelation smgr = RelationGetSmgr(rel);
+
+    /*
+     * If below the cached nblocks, the block surely exists. Otherwise, we
+     * face a trade-off. We opt to compare to a fresh nblocks, incurring
+     * lseek() overhead. The alternative would be to assume the block does
+     * not exist, but that would cause FSM to set zero space available for
+     * blocks that main fork extension just recorded.
+     */
+    return ((BlockNumberIsValid(smgr->smgr_cached_nblocks[MAIN_FORKNUM]) &&
+             blknumber < smgr->smgr_cached_nblocks[MAIN_FORKNUM]) ||
+            blknumber < RelationGetNumberOfBlocks(rel));
+}
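
The lseek() overhead mentioned in the fsm_does_block_exist() comment comes from asking the operating system for the current file length. The sketch below shows that idea on a plain POSIX file descriptor. It is a simplification under stated assumptions: `toy_nblocks_from_fd()`, `toy_block_exists_on_disk()` and `TOY_BLCKSZ` are hypothetical names, 8192 is assumed as the block size, and a relation fork is treated as one file, ignoring segment files and the storage manager layer.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define TOY_BLCKSZ 8192     /* assumed block size for this sketch */

/*
 * Return the number of whole TOY_BLCKSZ blocks in the file behind fd,
 * or -1 if lseek() fails.  Note that this moves the file offset to EOF.
 */
static int64_t
toy_nblocks_from_fd(int fd)
{
    off_t end;

    assert(fd >= 0);

    end = lseek(fd, 0, SEEK_END);
    if (end < 0)
    {
        perror("lseek");
        return -1;
    }
    return (int64_t) (end / TOY_BLCKSZ);    /* a partial trailing block is ignored */
}

/* A block exists only if it lies wholly below the current end of file. */
static int
toy_block_exists_on_disk(int fd, uint32_t blkno)
{
    int64_t nblocks = toy_nblocks_from_fd(fd);

    return nblocks >= 0 && (int64_t) blkno < nblocks;
}
```

One system call per uncertain block is the cost the commit accepts, in exchange for never handing a caller a block that cannot be read.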

‎src/backend/storage/smgr/smgr.c

Lines changed: 3 additions & 2 deletions
@@ -572,8 +572,9 @@ BlockNumber
 smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
 {
     /*
-     * For now, we only use cached values in recovery due to lack of a shared
-     * invalidation mechanism for changes in file size.
+     * For now, this function uses cached values only in recovery due to lack
+     * of a shared invalidation mechanism for changes in file size. Code
+     * elsewhere reads smgr_cached_nblocks and copes with stale data.
      */
     if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
         return reln->smgr_cached_nblocks[forknum];
‎src/test/recovery/t/008_fsm_truncation.pl

Lines changed: 2 additions & 3 deletions
@@ -1,9 +1,8 @@
 
 # Copyright (c) 2021, PostgreSQL Global Development Group
 
-# Test WAL replay of FSM changes.
-#
-# FSM changes don't normally need to be WAL-logged, except for truncation.
+# Test FSM-driven INSERT just after truncation clears FSM slots indicating
+# free space in removed blocks.
 # The FSM mustn't return a page that doesn't exist (anymore).
 use strict;
 use warnings;
