Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commit3bbf668

Browse files
committed
Fix multiple problems in WAL replay.
Most of the replay functions for WAL record types that modify more thanone page failed to ensure that those pages were locked correctly to ensurethat concurrent queries could not see inconsistent page states. This isa hangover from coding decisions made long before Hot Standby was added,when it was hardly necessary to acquire buffer locks during WAL replayat all, let alone hold them for carefully-chosen periods.The key problem was that RestoreBkpBlocks was written to hold lock on eachpage restored from a full-page image for only as long as it took to updatethat page. This was guaranteed to break any WAL replay function in whichthere was any update-ordering constraint between pages, because even if thenominal order of the pages is the right one, any mixture of full-page andnon-full-page updates in the same record would result in out-of-orderupdates. Moreover, it wouldn't work for situations where there's arequirement to maintain lock on one page while updating another. Failureto honor an update ordering constraint in this way is thought to be thecause of bug #7648 from Daniel Farina: what seems to have happened thereis that a btree page being split was rewritten from a full-page imagebefore the new right sibling page was written, and because lock on theoriginal page was not maintained it was possible for hot standby queries totry to traverse the page's right-link to the not-yet-existing sibling page.To fix, get rid of RestoreBkpBlocks as such, and instead create a newfunction RestoreBackupBlock that restores just one full-page image at atime. This function can be invoked by WAL replay functions at the pointswhere they would otherwise perform non-full-page updates; in this way, thephysical order of page updates remains the same no matter which pages arereplaced by full-page images. We can then further adjust the logic inindividual replay functions if it is necessary to hold buffer locksfor overlapping periods. A side benefit is that we can simplify thehandling of concurrency conflict resolution by moving that code into therecord-type-specfic functions; there's no more need to contort the codelayout to keep conflict resolution in front of the RestoreBkpBlocks call.In connection with that, standardize on zero-based numbering rather thanone-based numbering for referencing the full-page images. In HEAD, Iremoved the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They arestill there in the header files in previous branches, but are no longerused by the code.In addition, fix some other bugs identified in the course of making thesechanges:spgRedoAddNode could fail to update the parent downlink at all, if theparent tuple is in the same page as either the old or new split tuple andwe're not doing a full-page image: it would get fooled by the LSN havingbeen advanced already. This would result in permanent index corruption,not just transient failure of concurrent queries.Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the oldtail page as a candidate for a full-page image; in the worst case thiscould result in torn-page corruption.heap_xlog_freeze() was inconsistent about using a cleanup lock or plainexclusive lock: it did the former in the normal path but the latter for afull-page image. A plain exclusive lock seems sufficient, so change tothat.Also, remove gistRedoPageDeleteRecord(), which has been dead code sinceVACUUM FULL was rewritten.Back-patch to 9.0, where hot standby was introduced. Note however that 9.0had a significantly different WAL-logging scheme for GIST index updates,and it doesn't appear possible to make that scheme safe for concurrent hotstandby queries, because it can leave inconsistent states in the index evenbetween WAL records. Given the lack of complaints from the field, we won'twork too hard on fixing that branch.
1 parent9b3ac49 commit3bbf668

File tree

10 files changed

+753
-464
lines changed

10 files changed

+753
-464
lines changed

‎src/backend/access/gin/ginfast.c

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -290,7 +290,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
290290
if (metadata->head==InvalidBlockNumber)
291291
{
292292
/*
293-
* Main list is empty, so justcopy sublistinto main list
293+
* Main list is empty, so justinsert sublistas main list
294294
*/
295295
START_CRIT_SECTION();
296296

@@ -313,6 +313,14 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
313313
LockBuffer(buffer,GIN_EXCLUSIVE);
314314
page=BufferGetPage(buffer);
315315

316+
rdata[0].next=rdata+1;
317+
318+
rdata[1].buffer=buffer;
319+
rdata[1].buffer_std= true;
320+
rdata[1].data=NULL;
321+
rdata[1].len=0;
322+
rdata[1].next=NULL;
323+
316324
Assert(GinPageGetOpaque(page)->rightlink==InvalidBlockNumber);
317325

318326
START_CRIT_SECTION();

‎src/backend/access/gin/ginxlog.c

Lines changed: 85 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -77,6 +77,9 @@ ginRedoCreateIndex(XLogRecPtr lsn, XLogRecord *record)
7777
MetaBuffer;
7878
Pagepage;
7979

80+
/* Backup blocks are not used in create_index records */
81+
Assert(!(record->xl_info&XLR_BKP_BLOCK_MASK));
82+
8083
MetaBuffer=XLogReadBuffer(*node,GIN_METAPAGE_BLKNO, true);
8184
Assert(BufferIsValid(MetaBuffer));
8285
page= (Page)BufferGetPage(MetaBuffer);
@@ -109,6 +112,9 @@ ginRedoCreatePTree(XLogRecPtr lsn, XLogRecord *record)
109112
Bufferbuffer;
110113
Pagepage;
111114

115+
/* Backup blocks are not used in create_ptree records */
116+
Assert(!(record->xl_info&XLR_BKP_BLOCK_MASK));
117+
112118
buffer=XLogReadBuffer(data->node,data->blkno, true);
113119
Assert(BufferIsValid(buffer));
114120
page= (Page)BufferGetPage(buffer);
@@ -159,9 +165,12 @@ ginRedoInsert(XLogRecPtr lsn, XLogRecord *record)
159165
}
160166
}
161167

162-
/* nothing else to do if page was backed up */
163-
if (record->xl_info&XLR_BKP_BLOCK_1)
168+
/* If we have a full-page image, restore it and we're done */
169+
if (record->xl_info&XLR_BKP_BLOCK(0))
170+
{
171+
(void)RestoreBackupBlock(lsn,record,0, false, false);
164172
return;
173+
}
165174

166175
buffer=XLogReadBuffer(data->node,data->blkno, false);
167176
if (!BufferIsValid(buffer))
@@ -256,6 +265,9 @@ ginRedoSplit(XLogRecPtr lsn, XLogRecord *record)
256265
if (data->isData)
257266
flags |=GIN_DATA;
258267

268+
/* Backup blocks are not used in split records */
269+
Assert(!(record->xl_info&XLR_BKP_BLOCK_MASK));
270+
259271
lbuffer=XLogReadBuffer(data->node,data->lblkno, true);
260272
Assert(BufferIsValid(lbuffer));
261273
lpage= (Page)BufferGetPage(lbuffer);
@@ -369,9 +381,12 @@ ginRedoVacuumPage(XLogRecPtr lsn, XLogRecord *record)
369381
Bufferbuffer;
370382
Pagepage;
371383

372-
/* nothing to do if page was backed up (and no info to do it with) */
373-
if (record->xl_info&XLR_BKP_BLOCK_1)
384+
/* If we have a full-page image, restore it and we're done */
385+
if (record->xl_info&XLR_BKP_BLOCK(0))
386+
{
387+
(void)RestoreBackupBlock(lsn,record,0, false, false);
374388
return;
389+
}
375390

376391
buffer=XLogReadBuffer(data->node,data->blkno, false);
377392
if (!BufferIsValid(buffer))
@@ -420,63 +435,74 @@ static void
420435
ginRedoDeletePage(XLogRecPtrlsn,XLogRecord*record)
421436
{
422437
ginxlogDeletePage*data= (ginxlogDeletePage*)XLogRecGetData(record);
423-
Bufferbuffer;
438+
Bufferdbuffer;
439+
Bufferpbuffer;
440+
Bufferlbuffer;
424441
Pagepage;
425442

426-
if (!(record->xl_info&XLR_BKP_BLOCK_1))
443+
if (record->xl_info&XLR_BKP_BLOCK(0))
444+
dbuffer=RestoreBackupBlock(lsn,record,0, false, true);
445+
else
427446
{
428-
buffer=XLogReadBuffer(data->node,data->blkno, false);
429-
if (BufferIsValid(buffer))
447+
dbuffer=XLogReadBuffer(data->node,data->blkno, false);
448+
if (BufferIsValid(dbuffer))
430449
{
431-
page=BufferGetPage(buffer);
450+
page=BufferGetPage(dbuffer);
432451
if (!XLByteLE(lsn,PageGetLSN(page)))
433452
{
434453
Assert(GinPageIsData(page));
435454
GinPageGetOpaque(page)->flags=GIN_DELETED;
436455
PageSetLSN(page,lsn);
437456
PageSetTLI(page,ThisTimeLineID);
438-
MarkBufferDirty(buffer);
457+
MarkBufferDirty(dbuffer);
439458
}
440-
UnlockReleaseBuffer(buffer);
441459
}
442460
}
443461

444-
if (!(record->xl_info&XLR_BKP_BLOCK_2))
462+
if (record->xl_info&XLR_BKP_BLOCK(1))
463+
pbuffer=RestoreBackupBlock(lsn,record,1, false, true);
464+
else
445465
{
446-
buffer=XLogReadBuffer(data->node,data->parentBlkno, false);
447-
if (BufferIsValid(buffer))
466+
pbuffer=XLogReadBuffer(data->node,data->parentBlkno, false);
467+
if (BufferIsValid(pbuffer))
448468
{
449-
page=BufferGetPage(buffer);
469+
page=BufferGetPage(pbuffer);
450470
if (!XLByteLE(lsn,PageGetLSN(page)))
451471
{
452472
Assert(GinPageIsData(page));
453473
Assert(!GinPageIsLeaf(page));
454474
GinPageDeletePostingItem(page,data->parentOffset);
455475
PageSetLSN(page,lsn);
456476
PageSetTLI(page,ThisTimeLineID);
457-
MarkBufferDirty(buffer);
477+
MarkBufferDirty(pbuffer);
458478
}
459-
UnlockReleaseBuffer(buffer);
460479
}
461480
}
462481

463-
if (!(record->xl_info&XLR_BKP_BLOCK_3)&&data->leftBlkno!=InvalidBlockNumber)
482+
if (record->xl_info&XLR_BKP_BLOCK(2))
483+
(void)RestoreBackupBlock(lsn,record,2, false, false);
484+
elseif (data->leftBlkno!=InvalidBlockNumber)
464485
{
465-
buffer=XLogReadBuffer(data->node,data->leftBlkno, false);
466-
if (BufferIsValid(buffer))
486+
lbuffer=XLogReadBuffer(data->node,data->leftBlkno, false);
487+
if (BufferIsValid(lbuffer))
467488
{
468-
page=BufferGetPage(buffer);
489+
page=BufferGetPage(lbuffer);
469490
if (!XLByteLE(lsn,PageGetLSN(page)))
470491
{
471492
Assert(GinPageIsData(page));
472493
GinPageGetOpaque(page)->rightlink=data->rightLink;
473494
PageSetLSN(page,lsn);
474495
PageSetTLI(page,ThisTimeLineID);
475-
MarkBufferDirty(buffer);
496+
MarkBufferDirty(lbuffer);
476497
}
477-
UnlockReleaseBuffer(buffer);
498+
UnlockReleaseBuffer(lbuffer);
478499
}
479500
}
501+
502+
if (BufferIsValid(pbuffer))
503+
UnlockReleaseBuffer(pbuffer);
504+
if (BufferIsValid(dbuffer))
505+
UnlockReleaseBuffer(dbuffer);
480506
}
481507

482508
staticvoid
@@ -505,7 +531,9 @@ ginRedoUpdateMetapage(XLogRecPtr lsn, XLogRecord *record)
505531
/*
506532
* insert into tail page
507533
*/
508-
if (!(record->xl_info&XLR_BKP_BLOCK_1))
534+
if (record->xl_info&XLR_BKP_BLOCK(0))
535+
(void)RestoreBackupBlock(lsn,record,0, false, false);
536+
else
509537
{
510538
buffer=XLogReadBuffer(data->node,data->metadata.tail, false);
511539
if (BufferIsValid(buffer))
@@ -553,20 +581,25 @@ ginRedoUpdateMetapage(XLogRecPtr lsn, XLogRecord *record)
553581
/*
554582
* New tail
555583
*/
556-
buffer=XLogReadBuffer(data->node,data->prevTail, false);
557-
if (BufferIsValid(buffer))
584+
if (record->xl_info&XLR_BKP_BLOCK(0))
585+
(void)RestoreBackupBlock(lsn,record,0, false, false);
586+
else
558587
{
559-
Pagepage=BufferGetPage(buffer);
560-
561-
if (!XLByteLE(lsn,PageGetLSN(page)))
588+
buffer=XLogReadBuffer(data->node,data->prevTail, false);
589+
if (BufferIsValid(buffer))
562590
{
563-
GinPageGetOpaque(page)->rightlink=data->newRightlink;
591+
Pagepage=BufferGetPage(buffer);
564592

565-
PageSetLSN(page,lsn);
566-
PageSetTLI(page,ThisTimeLineID);
567-
MarkBufferDirty(buffer);
593+
if (!XLByteLE(lsn,PageGetLSN(page)))
594+
{
595+
GinPageGetOpaque(page)->rightlink=data->newRightlink;
596+
597+
PageSetLSN(page,lsn);
598+
PageSetTLI(page,ThisTimeLineID);
599+
MarkBufferDirty(buffer);
600+
}
601+
UnlockReleaseBuffer(buffer);
568602
}
569-
UnlockReleaseBuffer(buffer);
570603
}
571604
}
572605

@@ -585,8 +618,12 @@ ginRedoInsertListPage(XLogRecPtr lsn, XLogRecord *record)
585618
tupsize;
586619
IndexTupletuples= (IndexTuple) (XLogRecGetData(record)+sizeof(ginxlogInsertListPage));
587620

588-
if (record->xl_info&XLR_BKP_BLOCK_1)
621+
/* If we have a full-page image, restore it and we're done */
622+
if (record->xl_info&XLR_BKP_BLOCK(0))
623+
{
624+
(void)RestoreBackupBlock(lsn,record,0, false, false);
589625
return;
626+
}
590627

591628
buffer=XLogReadBuffer(data->node,data->blkno, true);
592629
Assert(BufferIsValid(buffer));
@@ -632,6 +669,9 @@ ginRedoDeleteListPages(XLogRecPtr lsn, XLogRecord *record)
632669
Pagemetapage;
633670
inti;
634671

672+
/* Backup blocks are not used in delete_listpage records */
673+
Assert(!(record->xl_info&XLR_BKP_BLOCK_MASK));
674+
635675
metabuffer=XLogReadBuffer(data->node,GIN_METAPAGE_BLKNO, false);
636676
if (!BufferIsValid(metabuffer))
637677
return;/* assume index was deleted, nothing to do */
@@ -645,6 +685,16 @@ ginRedoDeleteListPages(XLogRecPtr lsn, XLogRecord *record)
645685
MarkBufferDirty(metabuffer);
646686
}
647687

688+
/*
689+
* In normal operation, shiftList() takes exclusive lock on all the
690+
* pages-to-be-deleted simultaneously.During replay, however, it should
691+
* be all right to lock them one at a time. This is dependent on the fact
692+
* that we are deleting pages from the head of the list, and that readers
693+
* share-lock the next page before releasing the one they are on. So we
694+
* cannot get past a reader that is on, or due to visit, any page we are
695+
* going to delete. New incoming readers will block behind our metapage
696+
* lock and then see a fully updated page list.
697+
*/
648698
for (i=0;i<data->ndeleted;i++)
649699
{
650700
Bufferbuffer=XLogReadBuffer(data->node,data->toDelete[i], false);
@@ -678,7 +728,6 @@ gin_redo(XLogRecPtr lsn, XLogRecord *record)
678728
* implement a similar optimization as we have in b-tree, and remove
679729
* killed tuples outside VACUUM, we'll need to handle that here.
680730
*/
681-
RestoreBkpBlocks(lsn,record, false);
682731

683732
topCtx=MemoryContextSwitchTo(opCtx);
684733
switch (info)

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp