Commit 18b87b2

Fix possible HOT corruption when RECENTLY_DEAD changes to DEAD while pruning.
Since dc7420c the horizon used for pruning is determined "lazily". A more
accurate horizon is built on-demand, rather than in GetSnapshotData(). If a
horizon computation is triggered between two HeapTupleSatisfiesVacuum() calls
for the same tuple, the result can change from RECENTLY_DEAD to DEAD.

heap_page_prune() can process the same tid multiple times (once following an
update chain, once "directly"). When the result of HeapTupleSatisfiesVacuum()
of a tuple changes from RECENTLY_DEAD during the first access to DEAD in the
second, the "tuple is DEAD and doesn't chain to anything else" path in
heap_prune_chain() can end up marking the target of a LP_REDIRECT ItemId
unused.

The corruption is initially not easily visible. Once the target of a
LP_REDIRECT ItemId is marked unused, a new tuple version can reuse it. At that
point the corruption may become visible, as index entries pointing to the
"original" redirect item now point to an unrelated tuple.

To fix, compute HTSV for all tuples on a page only once. This fixes the entire
class of problems of HTSV changing inside heap_page_prune(). However,
visibility changes can obviously still occur between HTSV checks inside
heap_page_prune() and outside (e.g. in lazy_scan_prune()).

The computation of HTSV is now done in bulk, in heap_page_prune(), rather than
on-demand in heap_prune_chain(). Besides being a bit simpler, it is also
faster: memory accesses can happen sequentially, rather than in the order of
HOT chains.

There are other causes of HeapTupleSatisfiesVacuum() results changing between
two visibility checks for the same tuple, even before dc7420c. E.g.
HEAPTUPLE_INSERT_IN_PROGRESS can change to HEAPTUPLE_DEAD when a transaction
aborts between the two checks. None of these other visibility status changes
are known to cause corruption, but heap_page_prune()'s approach makes it hard
to be confident.

A patch implementing a more fundamental redesign of heap_page_prune(), which
fixes this bug and simplifies pruning substantially, has been proposed by
Peter Geoghegan in
https://postgr.es/m/CAH2-WzmNk6V6tqzuuabxoxM8HJRaWU6h12toaS-bqYcLiht16A@mail.gmail.com

However, that redesign is a larger change than is desirable for backpatching.
As the new design still benefits from the batched visibility determination
introduced in this commit, it makes sense to commit this narrower fix to 14
and master, and then commit Peter's improvement in master.

The precise sequence required to trigger the bug is complicated and hard to
exercise in an isolation test (until we have wait points). Due to that, the
isolation test initially posted at
https://postgr.es/m/20211119003623.d3jusiytzjqwb62p%40alap3.anarazel.de
and updated in
https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb%40alap3.anarazel.de
isn't committable.

A followup commit will introduce additional assertions, to detect problems
like this more easily.

Bug: #17255
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Debugged-By: Andres Freund <andres@anarazel.de>
Debugged-By: Peter Geoghegan <pg@bowt.ie>
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb@alap3.anarazel.de
Backpatch: 14-, the oldest branch containing dc7420c
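To make the core idea of the fix concrete outside the PostgreSQL sources, below is a minimal, self-contained C sketch of the "determine visibility once per item, then have every pruning path reuse the cached result" pattern the commit message describes. All names in it (ItemVis, item_visibility, vis_cache, MAX_ITEMS, FIRST_OFFSET) are invented for illustration and are not PostgreSQL APIs.

/*
 * Illustrative sketch only -- not PostgreSQL code.  It demonstrates caching
 * a per-item visibility result so that a "direct" check and a "chain
 * following" check can never disagree about the same item.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_ITEMS    16
#define FIRST_OFFSET 1          /* item offsets are 1-based, like FirstOffsetNumber */

typedef enum { VIS_LIVE, VIS_RECENTLY_DEAD, VIS_DEAD } ItemVis;

/* Stand-in for HeapTupleSatisfiesVacuum(); in the real system its result can
 * change between two calls for the same tuple (e.g. when the pruning horizon
 * is refined lazily), which is exactly why the fix evaluates it only once. */
static ItemVis item_visibility(int offnum);

/* -1 means "no visibility computed", e.g. the slot holds no tuple.
 * Sized MAX_ITEMS + 1 so it can be indexed directly by 1-based offset. */
static int8_t vis_cache[MAX_ITEMS + 1];

static void
compute_visibility_once(int maxoff)
{
	for (int offnum = maxoff; offnum >= FIRST_OFFSET; offnum--)
		vis_cache[offnum] = (int8_t) item_visibility(offnum);
}

static void
prune_item(int offnum)
{
	/* Every pruning path consults the cache instead of re-running the
	 * visibility check, so the answer for a given item is stable. */
	if ((ItemVis) vis_cache[offnum] == VIS_DEAD)
		printf("offset %d: would be marked unused\n", offnum);
}

static ItemVis
item_visibility(int offnum)
{
	/* Dummy data standing in for a real visibility determination. */
	return (offnum % 3 == 0) ? VIS_DEAD : VIS_RECENTLY_DEAD;
}

int
main(void)
{
	int			maxoff = 8;

	compute_visibility_once(maxoff);
	for (int offnum = FIRST_OFFSET; offnum <= maxoff; offnum++)
		prune_item(offnum);
	return 0;
}

As in the committed change, the sketch fills the cache by walking the offsets in reverse; the diff below explains that this visits the tuple bodies in increasing memory order, which is friendlier to CPU prefetching.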
1 parent 43c2175 · commit 18b87b2

File tree

1 file changed: +82 −26 lines

src/backend/access/heap/pruneheap.c

Lines changed: 82 additions & 26 deletions
@@ -56,11 +56,30 @@ typedef struct
 	OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
 	OffsetNumber nowdead[MaxHeapTuplesPerPage];
 	OffsetNumber nowunused[MaxHeapTuplesPerPage];
-	/* marked[i] is true if item i is entered in one of the above arrays */
+
+	/*
+	 * marked[i] is true if item i is entered in one of the above arrays.
+	 *
+	 * This needs to be MaxHeapTuplesPerPage + 1 long as FirstOffsetNumber is
+	 * 1. Otherwise every access would need to subtract 1.
+	 */
 	bool		marked[MaxHeapTuplesPerPage + 1];
+
+	/*
+	 * Tuple visibility is only computed once for each tuple, for correctness
+	 * and efficiency reasons; see comment in heap_page_prune() for
+	 * details. This is of type int8[], instead of HTSV_Result[], so we can
+	 * use -1 to indicate no visibility has been computed, e.g. for LP_DEAD
+	 * items.
+	 *
+	 * Same indexing as ->marked.
+	 */
+	int8		htsv[MaxHeapTuplesPerPage + 1];
 } PruneState;
 
 /* Local functions */
+static HTSV_Result heap_prune_satisfies_vacuum(PruneState *prstate,
+											   HeapTuple tup,
+											   Buffer buffer);
 static int	heap_prune_chain(Buffer buffer,
 							 OffsetNumber rootoffnum,
 							 PruneState *prstate);
@@ -252,6 +271,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	HeapTupleData tup;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -274,8 +294,60 @@ heap_page_prune(Relation relation, Buffer buffer,
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
 
-	/* Scan the page */
 	maxoff = PageGetMaxOffsetNumber(page);
+	tup.t_tableOid = RelationGetRelid(prstate.rel);
+
+	/*
+	 * Determine HTSV for all tuples.
+	 *
+	 * This is required for correctness to deal with cases where running HTSV
+	 * twice could result in different results (e.g. RECENTLY_DEAD can turn to
+	 * DEAD if another checked item causes GlobalVisTestIsRemovableFullXid()
+	 * to update the horizon, INSERT_IN_PROGRESS can change to DEAD if the
+	 * inserting transaction aborts, ...). That in turn could cause
+	 * heap_prune_chain() to behave incorrectly if a tuple is reached twice,
+	 * once directly via a heap_prune_chain() and once following a HOT chain.
+	 *
+	 * It's also good for performance. Most commonly tuples within a page are
+	 * stored at decreasing offsets (while the items are stored at increasing
+	 * offsets). When processing all tuples on a page this leads to reading
+	 * memory at decreasing offsets within a page, with a variable stride.
+	 * That's hard for CPU prefetchers to deal with. Processing the items in
+	 * reverse order (and thus the tuples in increasing order) increases
+	 * prefetching efficiency significantly / decreases the number of cache
+	 * misses.
+	 */
+	for (offnum = maxoff;
+		 offnum >= FirstOffsetNumber;
+		 offnum = OffsetNumberPrev(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		HeapTupleHeader htup;
+
+		/* Nothing to do if slot doesn't contain a tuple */
+		if (!ItemIdIsNormal(itemid))
+		{
+			prstate.htsv[offnum] = -1;
+			continue;
+		}
+
+		htup = (HeapTupleHeader) PageGetItem(page, itemid);
+		tup.t_data = htup;
+		tup.t_len = ItemIdGetLength(itemid);
+		ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), offnum);
+
+		/*
+		 * Set the offset number so that we can display it along with any
+		 * error that occurred while processing this tuple.
+		 */
+		if (off_loc)
+			*off_loc = offnum;
+
+		prstate.htsv[offnum] = heap_prune_satisfies_vacuum(&prstate, &tup,
+														   buffer);
+	}
+
+	/* Scan the page */
 	for (offnum = FirstOffsetNumber;
 		 offnum <= maxoff;
 		 offnum = OffsetNumberNext(offnum))
@@ -286,10 +358,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		if (prstate.marked[offnum])
 			continue;
 
-		/*
-		 * Set the offset number so that we can display it along with any
-		 * error that occurred while processing this tuple.
-		 */
+		/* see preceding loop */
 		if (off_loc)
 			*off_loc = offnum;
 
@@ -491,7 +560,7 @@ heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
 * the HOT chain is pruned by removing all DEAD tuples at the start of the HOT
 * chain. We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
 * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really
- * DEAD, the heap_prune_satisfies_vacuum test is just too coarse to detect it.
+ * DEAD, our visibility test is just too coarse to detect it.
 *
 * In general, pruning must never leave behind a DEAD tuple that still has
 * tuple storage. VACUUM isn't prepared to deal with that case. That's why
@@ -505,11 +574,7 @@ heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
 * pointer is marked LP_DEAD. (This includes the case of a DEAD simple
 * tuple, which we treat as a chain of length 1.)
 *
- * prstate->vistest is used to distinguish whether tuples are DEAD or
- * RECENTLY_DEAD.
- *
- * We don't actually change the page here, except perhaps for hint-bit updates
- * caused by heap_prune_satisfies_vacuum. We just add entries to the arrays in
+ * We don't actually change the page here. We just add entries to the arrays in
 * prstate showing the changes to be made. Items to be redirected are added
 * to the redirected[] array (two entries per redirection); items to be set to
 * LP_DEAD state are added to nowdead[]; and items to be set to LP_UNUSED
@@ -531,9 +596,6 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 	OffsetNumber chainitems[MaxHeapTuplesPerPage];
 	int			nchain = 0,
 				i;
-	HeapTupleData tup;
-
-	tup.t_tableOid = RelationGetRelid(prstate->rel);
 
 	rootlp = PageGetItemId(dp, rootoffnum);
 
@@ -542,12 +604,9 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 	 */
 	if (ItemIdIsNormal(rootlp))
 	{
+		Assert(prstate->htsv[rootoffnum] != -1);
 		htup = (HeapTupleHeader) PageGetItem(dp, rootlp);
 
-		tup.t_data = htup;
-		tup.t_len = ItemIdGetLength(rootlp);
-		ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), rootoffnum);
-
 		if (HeapTupleHeaderIsHeapOnly(htup))
 		{
 			/*
@@ -568,8 +627,8 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 			 * either here or while following a chain below. Whichever path
 			 * gets there first will mark the tuple unused.
 			 */
-			if (heap_prune_satisfies_vacuum(prstate, &tup, buffer)
-				== HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
+			if (prstate->htsv[rootoffnum] == HEAPTUPLE_DEAD &&
+				!HeapTupleHeaderIsHotUpdated(htup))
 			{
 				heap_prune_record_unused(prstate, rootoffnum);
 				HeapTupleHeaderAdvanceLatestRemovedXid(htup,
@@ -636,12 +695,9 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 			break;
 
 		Assert(ItemIdIsNormal(lp));
+		Assert(prstate->htsv[offnum] != -1);
 		htup = (HeapTupleHeader) PageGetItem(dp, lp);
 
-		tup.t_data = htup;
-		tup.t_len = ItemIdGetLength(lp);
-		ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), offnum);
-
 		/*
 		 * Check the tuple XMIN against prior XMAX, if any
 		 */
@@ -659,7 +715,7 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 		 */
 		tupdead = recent_dead = false;
 
-		switch (heap_prune_satisfies_vacuum(prstate, &tup, buffer))
+		switch ((HTSV_Result) prstate->htsv[offnum])
 		{
 			case HEAPTUPLE_DEAD:
 				tupdead = true;
