Commit e6767c0

Fix header check for continuation records where standbys could be stuck
XLogPageRead() checks immediately for an invalid WAL record header on a
standby, to be able to handle the case of continuation records that need to
be read across two different sources. As written, the check was too generic,
applying to any target LSN. Based on an analysis by Kyotaro Horiguchi, what
really matters is to make sure that the page header is checked when
attempting to read an LSN at the boundary of a segment, to handle the case of
a continuation record that spans multiple pages across multiple segments;
when WAL receivers are spawned, they request WAL from the beginning of a
segment. This fix has been proposed by Kyotaro Horiguchi.

This could cause standbys to loop infinitely when dealing with a continuation
record during a timeline jump, in the case where the contents of the record
in the follow-up page are invalid.

Some regression tests are added to check such scenarios, able to reproduce
the original problem. In the test, the contents of a continuation record are
overwritten with junk zeros on its follow-up page, and replayed on standbys.
This is inspired by 039_end_of_wal.pl, and is enough to show how standbys
should react on promotion by not being stuck. Without the fix, the test would
fail with a timeout. The test to reproduce the problem has been written by
Alexander Kukushkin.

The original check has been introduced in 0668719, for a similar problem.

Author: Kyotaro Horiguchi, Alexander Kukushkin
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAFh8B=mozC+e1wGJq0H=0O65goZju+6ab5AU7DEWCSUA2OtwDg@mail.gmail.com
Backpatch-through: 13
1 parent d1bf86a, commit e6767c0
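
To make the narrowed condition concrete, here is a minimal standalone sketch (an illustration only, not the committed code, which appears in the xlogrecovery.c diff below). It assumes the default 16MB wal_segment_size and shows that the added "(targetPagePtr % wal_segment_size) == 0" clause restricts the early page-header validation to pages that begin a WAL segment, the position from which a newly spawned WAL receiver requests WAL; the helper function name is made up for the example.

/*
 * Illustrative sketch only.  early_header_check_applies() is a hypothetical
 * helper mirroring the shape of the condition added to XLogPageRead(),
 * assuming the default 16MB segment size.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool
early_header_check_applies(bool standby_mode, uint64_t target_page_ptr,
						   uint64_t wal_segment_size)
{
	/* New behavior: only check early when the page starts a segment. */
	return standby_mode && (target_page_ptr % wal_segment_size) == 0;
}

int
main(void)
{
	const uint64_t seg = 16 * 1024 * 1024;	/* default wal_segment_size */

	/* Page at a segment boundary: the early header check applies. */
	printf("%d\n", early_header_check_applies(true, 3 * seg, seg));

	/* Page in the middle of a segment: left to the generic retry path. */
	printf("%d\n", early_header_check_applies(true, 3 * seg + 8192, seg));

	return 0;
}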

File tree

3 files changed: +161 −6 lines

‎src/backend/access/transam/xlogrecovery.c

Lines changed: 7 additions & 6 deletions

@@ -3436,12 +3436,12 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * recycled WAL segment present in pg_wal, with garbage contents, however.
 	 * We would read the first page from the local WAL segment, but when
 	 * reading the second page, we would read the bogus, recycled, WAL
@@ -3463,6 +3463,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*

‎src/test/recovery/meson.build

Lines changed: 1 addition & 0 deletions

@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_no_contrecord_switch.pl',
     ],
   },
 }
src/test/recovery/t/043_no_contrecord_switch.pl

Lines changed: 153 additions & 0 deletions

@@ -0,0 +1,153 @@
# Copyright (c) 2021-2025, PostgreSQL Global Development Group

# Tests for already-propagated WAL segments ending in incomplete WAL records.

use strict;
use warnings;

use File::Copy;
use PostgreSQL::Test::Cluster;
use Test::More;
use Fcntl qw(SEEK_SET);

use integer;    # causes / operator to use integer math

# Values queried from the server
my $WAL_SEGMENT_SIZE;
my $WAL_BLOCK_SIZE;
my $TLI;

# Build name of a WAL segment, used when filtering the contents of the server
# logs.
sub wal_segment_name
{
	my $tli = shift;
	my $segment = shift;
	return sprintf("%08X%08X%08X", $tli, 0, $segment);
}

# Calculate from a LSN (in bytes) its segment number and its offset, used
# when filtering the contents of the server logs.
sub lsn_to_segment_and_offset
{
	my $lsn = shift;
	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
}

# Get GUC value, converted to an int.
sub get_int_setting
{
	my $node = shift;
	my $name = shift;
	return int(
		$node->safe_psql(
			'postgres',
			"SELECT setting FROM pg_settings WHERE name = '$name'"));
}

sub start_of_page
{
	my $lsn = shift;
	return $lsn & ~($WAL_BLOCK_SIZE - 1);
}

my $primary = PostgreSQL::Test::Cluster->new('primary');
$primary->init(allows_streaming => 1, has_archiving => 1);

# The configuration is chosen here to minimize the friction with
# concurrent WAL activity. checkpoint_timeout avoids noise with
# checkpoint activity, and autovacuum is disabled to avoid any
# WAL activity generated by it.
$primary->append_conf(
	'postgresql.conf', qq(
autovacuum = off
checkpoint_timeout = '30min'
wal_keep_size = 1GB
));

$primary->start;
$primary->backup('backup');

$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");

$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
$TLI = $primary->safe_psql('postgres',
	"SELECT timeline_id FROM pg_control_checkpoint()");

# Get close to the end of the current WAL page, enough to fit the
# beginning of a record that spans on two pages, generating a
# continuation record.
$primary->emit_wal(0);
my $end_lsn =
  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);

# Do some math to find the record size that will overflow the page, and
# write it.
my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
$end_lsn = $primary->emit_wal($overflow_size);
$primary->stop('immediate');

# Find the beginning of the page with the continuation record and fill
# the entire page with zero bytes to simulate broken replication.
my $start_page = start_of_page($end_lsn);
my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
	"\x00" x $WAL_BLOCK_SIZE);

# Copy the file we just "hacked" to the archives.
copy($wal_file, $primary->archive_dir);

# Start standby nodes and make sure they replay the file "hacked" from
# the archives.
my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
$standby1->init_from_backup(
	$primary, 'backup',
	standby => 1,
	has_restoring => 1);

my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
$standby2->init_from_backup(
	$primary, 'backup',
	standby => 1,
	has_restoring => 1);

my $log_size1 = -s $standby1->logfile;
my $log_size2 = -s $standby2->logfile;

$standby1->start;
$standby2->start;

my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
my $segment_name = wal_segment_name($TLI, $segment);
my $pattern =
  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);

# We expect both standby nodes to complain about empty page when trying to
# assemble the record that spans over two pages, so wait for these in their
# logs.
$standby1->wait_for_log($pattern, $log_size1);
$standby2->wait_for_log($pattern, $log_size2);

# Now check the case of a promotion with a timeline jump handled at
# page boundary with a continuation record.
$standby1->promote;

# This command forces standby2 to read a continuation record from the page
# that is filled with zero bytes.
$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');

# Make sure WAL moves forward.
$standby1->safe_psql('postgres',
	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');

# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
# files from the archive). It should be able to catch up.
$standby2->enable_streaming($standby1);
$standby2->reload;
$standby1->wait_for_replay_catchup($standby2);

my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
print "standby2: $result\n";
is($result, qq(1001), 'check streamed content on standby2');

done_testing();
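
For readers who want to sanity-check the arithmetic performed by the test's Perl helpers, here is a small self-contained C sketch (an illustration, not part of the commit). The LSN, timeline, and the 16MB/8kB sizes are made-up example values; the segment-name format mirrors the wal_segment_name helper above.

/*
 * Illustrative sketch only: reproduces the arithmetic of the test helpers
 * wal_segment_name(), lsn_to_segment_and_offset() and start_of_page().
 * All constants below are assumptions (default 16MB segments, 8kB pages,
 * an arbitrary LSN and timeline), not values taken from a test run.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	const uint64_t wal_segment_size = 16 * 1024 * 1024;	/* 16MB default */
	const uint64_t wal_block_size = 8192;	/* 8kB default */
	const uint64_t lsn = 27440068;	/* hypothetical byte position in WAL */
	const uint32_t tli = 1;			/* hypothetical timeline */

	/* lsn_to_segment_and_offset(): integer segment number and byte offset */
	uint64_t	segment = lsn / wal_segment_size;
	uint64_t	offset = lsn % wal_segment_size;

	/* start_of_page(): round the LSN down to the start of its WAL page */
	uint64_t	page_start = lsn & ~(wal_block_size - 1);

	/* wal_segment_name(): timeline and segment number as 8-digit hex fields */
	printf("segment file name: %08X%08X%08X\n", tli, 0u, (uint32_t) segment);
	printf("segment %llu, offset %llu, page start %llu\n",
		   (unsigned long long) segment, (unsigned long long) offset,
		   (unsigned long long) page_start);

	return 0;
}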
