Commit 64a8687

Fix WAL replay in presence of an incomplete record
Physical replication always ships WAL segment files to replicas once they are complete. This is a problem if one WAL record is split across a segment boundary and the primary server crashes before writing down the segment with the next portion of the WAL record: WAL writing after crash recovery would happily resume at the point where the broken record started, overwriting that record ... but any standby or backup may have already received a copy of that segment, and they are not rewinding. This causes standbys to stop following the primary after the latter crashes:

  LOG: invalid contrecord length 7262 at A8/D9FFFBC8

because the standby is still trying to read the continuation record (contrecord) for the original long WAL record, but it is not there and it will never be. A workaround is to stop the replica, delete the WAL file, and restart it -- at which point a fresh copy is brought over from the primary. But that's pretty labor intensive, and I bet many users would just give up and re-clone the standby instead.

A fix for this problem was already attempted in commit 515e3d8, but it only addressed the case for the scenario of WAL archiving, so streaming replication would still be a problem (as well as other things such as taking a filesystem-level backup while the server is down after having crashed), and it had performance scalability problems too; so it had to be reverted.

This commit fixes the problem using an approach suggested by Andres Freund, whereby the initial portion(s) of the split-up WAL record are kept, and a special type of WAL record is written where the contrecord was lost, so that WAL replay in the replica knows to skip the broken parts. With this approach, we can continue to stream/archive segment files as soon as they are complete, and replay of the broken records will proceed across the crash point without a hitch.

Because a new type of WAL record is added, users should be careful to upgrade standbys first, primaries later. Otherwise they risk the standby being unable to start if the primary happens to write such a record.

A new TAP test that exercises this is added, but the portability of it is yet to be seen.

This has been wrong since the introduction of physical replication, so backpatch all the way back. In stable branches, keep the new XLogReaderState members at the end of the struct, to avoid an ABI break.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Nathan Bossart <bossartn@amazon.com>
Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
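
To make the failure scenario concrete, here is a small sketch of the page and segment arithmetic for the LSN quoted in the log line of the commit message. It assumes the default build settings of 8 kB WAL pages and 16 MB WAL segments (both are configurable), and it assumes the reported position is the start of the split record. The names `Lsn`, `make_lsn`, `page_start`, `next_page_start` and `is_segment_start` are made up for this illustration and are not PostgreSQL functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed defaults; real builds may use other sizes. */
#define SKETCH_BLCKSZ  UINT64_C(8192)                  /* WAL page size */
#define SKETCH_SEGSZ   (UINT64_C(16) * 1024 * 1024)    /* WAL segment size */

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr */

/* Build an LSN from the two halves that the server prints as "hi/lo". */
Lsn make_lsn(uint32_t hi, uint32_t lo)
{
    return ((Lsn) hi << 32) | lo;
}

/* Start of the page containing lsn (same formula as targetPagePtr in the diff). */
Lsn page_start(Lsn lsn)
{
    return lsn - (lsn % SKETCH_BLCKSZ);
}

/* Start of the page that would hold the first continuation piece. */
Lsn next_page_start(Lsn lsn)
{
    return page_start(lsn) + SKETCH_BLCKSZ;
}

/* True if lsn is the first byte of a segment file. */
bool is_segment_start(Lsn lsn)
{
    return lsn % SKETCH_SEGSZ == 0;
}
```

Under those assumptions, a record starting at A8/D9FFFBC8 sits on the page starting at A8/D9FFE000, and its continuation would begin at A8/DA000000, which is also the first byte of the next segment file. That is the situation the message describes: the first part of the record was shipped with a completed segment, and the rest never made it to disk.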
1 parent 4f2c753, commit 64a8687

9 files changed: +450 -7 lines changed


‎src/backend/access/rmgrdesc/xlogdesc.c

Lines changed: 12 additions & 0 deletions
@@ -139,6 +139,15 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec.ThisTimeLineID, xlrec.PrevTimeLineID,
 						 timestamptz_to_str(xlrec.end_time));
 	}
+	else if (info == XLOG_OVERWRITE_CONTRECORD)
+	{
+		xl_overwrite_contrecord xlrec;
+
+		memcpy(&xlrec, rec, sizeof(xl_overwrite_contrecord));
+		appendStringInfo(buf, "lsn %X/%X; time %s",
+						 LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
+						 timestamptz_to_str(xlrec.overwrite_time));
+	}
 }
 
 const char *
@@ -178,6 +187,9 @@ xlog_identify(uint8 info)
 		case XLOG_END_OF_RECOVERY:
 			id = "END_OF_RECOVERY";
 			break;
+		case XLOG_OVERWRITE_CONTRECORD:
+			id = "OVERWRITE_CONTRECORD";
+			break;
 		case XLOG_FPI:
 			id = "FPI";
 			break;

‎src/backend/access/transam/xlog.c

Lines changed: 149 additions & 5 deletions
@@ -215,6 +215,15 @@ static XLogRecPtr LastRec;
 static XLogRecPtr flushedUpto = 0;
 static TimeLineID receiveTLI = 0;
 
+/*
+ * abortedRecPtr is the start pointer of a broken record at end of WAL when
+ * recovery completes; missingContrecPtr is the location of the first
+ * contrecord that went missing. See CreateOverwriteContrecordRecord for
+ * details.
+ */
+static XLogRecPtr abortedRecPtr;
+static XLogRecPtr missingContrecPtr;
+
 /*
  * During recovery, lastFullPageWrites keeps track of full_page_writes that
  * the replayed WAL records indicate. It's initialized with full_page_writes
@@ -903,8 +912,11 @@ static void CheckRequiredParameterValues(void);
 static void XLogReportParameters(void);
 static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
 								TimeLineID prevTLI);
+static void VerifyOverwriteContrecord(xl_overwrite_contrecord *xlrec,
+									  XLogReaderState *state);
 static void LocalSetXLogInsertAllowed(void);
 static void CreateEndOfRecoveryRecord(void);
+static XLogRecPtr CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn);
 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
 static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
 static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
@@ -2257,6 +2269,18 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		if (!Insert->forcePageWrites)
 			NewPage->xlp_info |= XLP_BKP_REMOVABLE;
 
+		/*
+		 * If a record was found to be broken at the end of recovery, and
+		 * we're going to write on the page where its first contrecord was
+		 * lost, set the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag on the page
+		 * header. See CreateOverwriteContrecordRecord().
+		 */
+		if (missingContrecPtr == NewPageBeginPtr)
+		{
+			NewPage->xlp_info |= XLP_FIRST_IS_OVERWRITE_CONTRECORD;
+			missingContrecPtr = InvalidXLogRecPtr;
+		}
+
 		/*
 		 * If first page of an XLOG segment file, make it a long header.
 		 */
@@ -4390,6 +4414,19 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
+			/*
+			 * When not in standby mode we find that WAL ends in an incomplete
+			 * record, keep track of that record. After recovery is done,
+			 * we'll write a record to indicate downstream WAL readers that
+			 * that portion is to be ignored.
+			 */
+			if (!StandbyMode &&
+				!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
+			{
+				abortedRecPtr = xlogreader->abortedRecPtr;
+				missingContrecPtr = xlogreader->missingContrecPtr;
+			}
+
 			if (readFile >= 0)
 			{
 				close(readFile);
@@ -7026,6 +7063,12 @@ StartupXLOG(void)
 		InRecovery = true;
 	}
 
+	/*
+	 * Start recovery assuming that the final record isn't lost.
+	 */
+	abortedRecPtr = InvalidXLogRecPtr;
+	missingContrecPtr = InvalidXLogRecPtr;
+
 	/* REDO */
 	if (InRecovery)
 	{
@@ -7622,8 +7665,9 @@ StartupXLOG(void)
 
 	/*
 	 * Kill WAL receiver, if it's still running, before we continue to write
-	 * the startup checkpoint record. It will trump over the checkpoint and
-	 * subsequent records if it's still alive when we start writing WAL.
+	 * the startup checkpoint and aborted-contrecord records. It will trump
+	 * over these records and subsequent ones if it's still alive when we
+	 * start writing WAL.
 	 */
 	ShutdownWalRcv();
 
@@ -7656,8 +7700,12 @@ StartupXLOG(void)
 	StandbyMode = false;
 
 	/*
-	 * Re-fetch the last valid or last applied record, so we can identify the
-	 * exact endpoint of what we consider the valid portion of WAL.
+	 * Determine where to start writing WAL next.
+	 *
+	 * When recovery ended in an incomplete record, write a WAL record about
+	 * that and continue after it. In all other cases, re-fetch the last
+	 * valid or last applied record, so we can identify the exact endpoint of
+	 * what we consider the valid portion of WAL.
 	 */
 	XLogBeginRead(xlogreader, LastRec);
 	record = ReadRecord(xlogreader, PANIC, false);
@@ -7806,6 +7854,18 @@ StartupXLOG(void)
 	XLogCtl->ThisTimeLineID = ThisTimeLineID;
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
+	/*
+	 * Actually, if WAL ended in an incomplete record, skip the parts that
+	 * made it through and start writing after the portion that persisted.
+	 * (It's critical to first write an OVERWRITE_CONTRECORD message, which
+	 * we'll do as soon as we're open for writing new WAL.)
+	 */
+	if (!XLogRecPtrIsInvalid(missingContrecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
+		EndOfLog = missingContrecPtr;
+	}
+
 	/*
 	 * Prepare to write WAL starting at EndOfLog location, and init xlog
 	 * buffer cache using the block containing the last record from the
@@ -7858,13 +7918,23 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+	LocalSetXLogInsertAllowed();
+
+	/* If necessary, write overwrite-contrecord before doing anything else */
+	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	{
+		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
+		CreateOverwriteContrecordRecord(abortedRecPtr);
+		abortedRecPtr = InvalidXLogRecPtr;
+		missingContrecPtr = InvalidXLogRecPtr;
+	}
+
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
 	 * record is written.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
-	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
 
@@ -9364,6 +9434,53 @@ CreateEndOfRecoveryRecord(void)
 	LocalXLogInsertAllowed = -1;	/* return to "check" state */
 }
 
+/*
+ * Write an OVERWRITE_CONTRECORD message.
+ *
+ * When on WAL replay we expect a continuation record at the start of a page
+ * that is not there, recovery ends and WAL writing resumes at that point.
+ * But it's wrong to resume writing new WAL back at the start of the record
+ * that was broken, because downstream consumers of that WAL (physical
+ * replicas) are not prepared to "rewind". So the first action after
+ * finishing replay of all valid WAL must be to write a record of this type
+ * at the point where the contrecord was missing; to support xlogreader
+ * detecting the special case, XLP_FIRST_IS_OVERWRITE_CONTRECORD is also added
+ * to the page header where the record occurs. xlogreader has an ad-hoc
+ * mechanism to report metadata about the broken record, which is what we
+ * use here.
+ *
+ * At replay time, XLP_FIRST_IS_OVERWRITE_CONTRECORD instructs xlogreader to
+ * skip the record it was reading, and pass back the LSN of the skipped
+ * record, so that its caller can verify (on "replay" of that record) that the
+ * XLOG_OVERWRITE_CONTRECORD matches what was effectively overwritten.
+ */
+static XLogRecPtr
+CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn)
+{
+	xl_overwrite_contrecord xlrec;
+	XLogRecPtr	recptr;
+
+	/* sanity check */
+	if (!RecoveryInProgress())
+		elog(ERROR, "can only be used at end of recovery");
+
+	xlrec.overwritten_lsn = aborted_lsn;
+	xlrec.overwrite_time = GetCurrentTimestamp();
+
+	START_CRIT_SECTION();
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(xl_overwrite_contrecord));
+
+	recptr = XLogInsert(RM_XLOG_ID, XLOG_OVERWRITE_CONTRECORD);
+
+	XLogFlush(recptr);
+
+	END_CRIT_SECTION();
+
+	return recptr;
+}
+
 /*
  * Flush all data in shared memory to disk, and fsync
  *
@@ -10291,6 +10408,13 @@ xlog_redo(XLogReaderState *record)
 
 		RecoveryRestartPoint(&checkPoint);
 	}
+	else if (info == XLOG_OVERWRITE_CONTRECORD)
+	{
+		xl_overwrite_contrecord xlrec;
+
+		memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_overwrite_contrecord));
+		VerifyOverwriteContrecord(&xlrec, record);
+	}
 	else if (info == XLOG_END_OF_RECOVERY)
 	{
 		xl_end_of_recovery xlrec;
@@ -10451,6 +10575,26 @@ xlog_redo(XLogReaderState *record)
 	}
 }
 
+/*
+ * Verify the payload of a XLOG_OVERWRITE_CONTRECORD record.
+ */
+static void
+VerifyOverwriteContrecord(xl_overwrite_contrecord *xlrec, XLogReaderState *state)
+{
+	if (xlrec->overwritten_lsn != state->overwrittenRecPtr)
+		elog(FATAL, "mismatching overwritten LSN %X/%X -> %X/%X",
+			 LSN_FORMAT_ARGS(xlrec->overwritten_lsn),
+			 LSN_FORMAT_ARGS(state->overwrittenRecPtr));
+
+	ereport(LOG,
+			(errmsg("sucessfully skipped missing contrecord at %X/%X, overwritten at %s",
+					LSN_FORMAT_ARGS(xlrec->overwritten_lsn),
+					timestamptz_to_str(xlrec->overwrite_time))));
+
+	/* Verifying the record should only happen once */
+	state->overwrittenRecPtr = InvalidXLogRecPtr;
+}
+
 #ifdef WAL_DEBUG
 
 static void
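
The two decisions that xlog.c makes in this diff can be summarized in a few lines: where new WAL starts once recovery ends, and how replay checks an OVERWRITE_CONTRECORD record against what the reader skipped. The sketch below is a simplified stand-alone model, not the server code: `Lsn`, `INVALID_LSN`, `choose_end_of_log` and `verify_overwrite` are hypothetical names, and the real `VerifyOverwriteContrecord` raises FATAL on a mismatch instead of returning false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;                   /* stand-in for XLogRecPtr */
#define INVALID_LSN ((Lsn) 0)           /* stand-in for InvalidXLogRecPtr */

/*
 * Where new WAL is written after recovery.  Normally that is the end of the
 * last valid record; if recovery stopped at a broken record, writing resumes
 * at the page where the first continuation piece went missing, so the
 * surviving pieces of the broken record are left untouched.
 */
Lsn choose_end_of_log(Lsn last_valid_end, Lsn missing_contrec_ptr)
{
    if (missing_contrec_ptr != INVALID_LSN)
        return missing_contrec_ptr;
    return last_valid_end;
}

/*
 * Replay-side check: the LSN stored in the overwrite record must match the
 * LSN of the record the reader skipped.  On success the remembered LSN is
 * cleared so the check applies only once.
 */
bool verify_overwrite(Lsn recorded_lsn, Lsn *reader_overwritten)
{
    if (recorded_lsn != *reader_overwritten)
        return false;                   /* the real code reports FATAL here */
    *reader_overwritten = INVALID_LSN;
    return true;
}
```

The point of returning the missing contrecord's position rather than the start of the broken record is that standbys which already received the first part never have to rewind.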

‎src/backend/access/transam/xlogreader.c

Lines changed: 39 additions & 1 deletion
@@ -275,6 +275,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 				total_len;
 	uint32		targetRecOff;
 	uint32		pageHeaderSize;
+	bool		assembled;
 	bool		gotheader;
 	int			readOff;
 
@@ -290,6 +291,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	state->errormsg_buf[0] = '\0';
 
 	ResetDecoder(state);
+	state->abortedRecPtr = InvalidXLogRecPtr;
+	state->missingContrecPtr = InvalidXLogRecPtr;
 
 	RecPtr = state->EndRecPtr;
 
@@ -316,7 +319,9 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		randAccess = true;
 	}
 
+restart:
 	state->currRecPtr = RecPtr;
+	assembled = false;
 
 	targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
 	targetRecOff = RecPtr % XLOG_BLCKSZ;
@@ -412,6 +417,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		char	   *buffer;
 		uint32		gotlen;
 
+		assembled = true;
+
 		/*
 		 * Enlarge readRecordBuf as needed.
 		 */
@@ -445,8 +452,25 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 			Assert(SizeOfXLogShortPHD <= readOff);
 
-			/* Check that the continuation on next page looks valid */
 			pageHeader = (XLogPageHeader) state->readBuf;
+
+			/*
+			 * If we were expecting a continuation record and got an
+			 * "overwrite contrecord" flag, that means the continuation record
+			 * was overwritten with a different record. Restart the read by
+			 * assuming the address to read is the location where we found
+			 * this flag; but keep track of the LSN of the record we were
+			 * reading, for later verification.
+			 */
+			if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
+			{
+				state->overwrittenRecPtr = state->currRecPtr;
+				ResetDecoder(state);
+				RecPtr = targetPagePtr;
+				goto restart;
+			}
+
+			/* Check that the continuation on next page looks valid */
 			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
 			{
 				report_invalid_record(state,
@@ -548,6 +572,20 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	return NULL;
 
 err:
+	if (assembled)
+	{
+		/*
+		 * We get here when a record that spans multiple pages needs to be
+		 * assembled, but something went wrong -- perhaps a contrecord piece
+		 * was lost. If caller is WAL replay, it will know where the aborted
+		 * record was and where to direct followup WAL to be written, marking
+		 * the next piece with XLP_FIRST_IS_OVERWRITE_CONTRECORD, which will
+		 * in turn signal downstream WAL consumers that the broken WAL record
+		 * is to be ignored.
+		 */
+		state->abortedRecPtr = RecPtr;
+		state->missingContrecPtr = targetPagePtr;
+	}
 
 	/*
 	 * Invalidate the read state. We might read from a different source after
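
The reader-side control flow in this file (restart at the flagged page, otherwise report the broken record) can be illustrated with a toy model. This is a deliberately crude sketch: it treats each WAL page as holding one record that either fits or spills onto the next page, and every name in it (`ToyPage`, `ToyReader`, `toy_read_record`, the `READ_*` results) is hypothetical. Only the two flag values are taken from the header file; it does not reflect the actual xlogreader API.

```c
#include <assert.h>
#include <stddef.h>

#define FIRST_IS_CONTRECORD            0x0001
#define FIRST_IS_OVERWRITE_CONTRECORD  0x0008

/* Toy page: header flags, and whether the record starting here spills over. */
typedef struct ToyPage
{
    unsigned    flags;
    int         record_continues;   /* nonzero: record needs the next page */
} ToyPage;

typedef struct ToyReader
{
    size_t      pos;            /* page index of the next record to read */
    long        overwritten;    /* page index of a skipped record, or -1 */
    long        aborted;        /* page index of a broken record, or -1 */
} ToyReader;

typedef enum { READ_OK, READ_BROKEN, READ_END } ReadResult;

ReadResult toy_read_record(const ToyPage *pages, size_t npages, ToyReader *r)
{
    size_t      start;
    size_t      next;

    r->aborted = -1;

restart:
    if (r->pos >= npages)
        return READ_END;
    start = r->pos;

    if (!pages[start].record_continues)
    {
        r->pos = start + 1;     /* record fits on its page */
        return READ_OK;
    }

    next = start + 1;
    if (next < npages && (pages[next].flags & FIRST_IS_OVERWRITE_CONTRECORD))
    {
        /* Continuation was overwritten: remember what we skipped, restart. */
        r->overwritten = (long) start;
        r->pos = next;
        goto restart;
    }
    if (next >= npages || !(pages[next].flags & FIRST_IS_CONTRECORD))
    {
        /* Continuation is missing: report where the broken record began. */
        r->aborted = (long) start;
        return READ_BROKEN;
    }

    r->pos = next + 1;          /* record assembled from both pages */
    return READ_OK;
}
```

With the flag set on the page after the broken record, the toy reader skips that record and carries on; without it, the same input stops at the broken record, which corresponds to the stuck-standby symptom described in the commit message.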

‎src/include/access/xlog_internal.h

Lines changed: 10 additions & 1 deletion
@@ -76,8 +76,10 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define XLP_LONG_HEADER				0x0002
 /* This flag indicates backup blocks starting in this page are optional */
 #define XLP_BKP_REMOVABLE			0x0004
+/* Replaces a missing contrecord; see CreateOverwriteContrecordRecord */
+#define XLP_FIRST_IS_OVERWRITE_CONTRECORD 0x0008
 /* All defined flag bits in xlp_info (used for validity checking of header) */
-#define XLP_ALL_FLAGS				0x0007
+#define XLP_ALL_FLAGS				0x000F
 
 #define XLogPageHeaderSize(hdr)		\
 	(((hdr)->xlp_info & XLP_LONG_HEADER) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD)
@@ -249,6 +251,13 @@ typedef struct xl_restore_point
 	char		rp_name[MAXFNAMELEN];
 } xl_restore_point;
 
+/* Overwrite of prior contrecord */
+typedef struct xl_overwrite_contrecord
+{
+	XLogRecPtr	overwritten_lsn;
+	TimestampTz overwrite_time;
+} xl_overwrite_contrecord;
+
 /* End of recovery mark, when we don't do an END_OF_RECOVERY checkpoint */
 typedef struct xl_end_of_recovery
 {
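
The first hunk of this header adds a fourth page-header flag bit and widens XLP_ALL_FLAGS from 0x0007 to 0x000F. The sketch below shows what that mask change means for a header validity check of the kind the comment on XLP_ALL_FLAGS describes. The flag values are copied from the diff, except `XLP_FIRST_IS_CONTRECORD`, which is not shown in the hunk and is assumed to be 0x0001 because the old mask was 0x0007. `page_flags_valid` and the `_OLD`/`_NEW` mask names are hypothetical and exist only for this comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XLP_FIRST_IS_CONTRECORD            0x0001  /* assumed; not in the hunk */
#define XLP_LONG_HEADER                    0x0002
#define XLP_BKP_REMOVABLE                  0x0004
#define XLP_FIRST_IS_OVERWRITE_CONTRECORD  0x0008
#define XLP_ALL_FLAGS_OLD                  0x0007  /* mask before this commit */
#define XLP_ALL_FLAGS_NEW                  0x000F  /* mask after this commit */

/* Hypothetical helper: true if xlp_info sets no bits outside the mask. */
bool page_flags_valid(uint16_t xlp_info, uint16_t all_flags)
{
    return (xlp_info & ~all_flags) == 0;
}
```

A page carrying the new bit passes with the new mask and fails with the old one. That is one plausible reading of why the commit message asks users to upgrade standbys before primaries: a reader that still uses the old mask would treat such a page header as invalid.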

‎src/include/access/xlogreader.h

Lines changed: 10 additions & 0 deletions
@@ -252,6 +252,16 @@ struct XLogReaderState
 
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
+
+	/*
+	 * Set at the end of recovery: the start point of a partial record at the
+	 * end of WAL (InvalidXLogRecPtr if there wasn't one), and the start
+	 * location of its first contrecord that went missing.
+	 */
+	XLogRecPtr	abortedRecPtr;
+	XLogRecPtr	missingContrecPtr;
+	/* Set when XLP_FIRST_IS_OVERWRITE_CONTRECORD is found */
+	XLogRecPtr	overwrittenRecPtr;
 };
 
 /* Get a new XLogReader */
