
Commit 9ccdd7f


PANIC on fsync() failure.

On some operating systems, it doesn't make sense to retry fsync(), because dirty data cached by the kernel may have been dropped on write-back failure. In that case the only remaining copy of the data is in the WAL. A subsequent fsync() could appear to succeed, but not have flushed the data. That means that a future checkpoint could apparently complete successfully but have lost data.

Therefore, violently prevent any future checkpoint attempts by panicking on the first fsync() failure. Note that we already did the same for WAL data; this change extends that behavior to non-temporary data files.

Provide a GUC data_sync_retry to control this new behavior, for users of operating systems that don't eject dirty data, and possibly forensic/testing uses. If it is set to on and the write-back error was transient, a later checkpoint might genuinely succeed (on a system that does not throw away buffers on failure); if the error is permanent, later checkpoints will continue to fail. The GUC defaults to off, meaning that we panic.

Back-patch to all supported releases.

There is still a narrow window for error-loss on some operating systems: if the file is closed and later reopened and a write-back error occurs in the intervening time, but the inode has the bad luck to be evicted due to memory pressure before we reopen, we could miss the error. A later patch will address that with a scheme for keeping files with dirty data open at all times, but we judge that to be too complicated to back-patch.

Author: Craig Ringer, with some adjustments by Thomas Munro
Reported-by: Craig Ringer
Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
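
The failure mode described in the commit message can be illustrated with a small simulation. This is only a sketch under stated assumptions: `SimFile`, `sim_write`, `sim_fsync`, and `checkpoint_with_retry` are hypothetical names invented for illustration, and the model assumes a kernel that reports a write-back error once and then drops the affected dirty pages. It is not PostgreSQL code and does not call the real fsync().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the kernel page cache for a single file. */
typedef struct
{
    size_t dirty_pages;    /* written by the application, not yet on disk */
    size_t durable_pages;  /* pages that actually reached stable storage */
    bool   device_faulty;  /* the next write-back attempt will fail */
} SimFile;

/* A buffered write only reaches the cache. */
void sim_write(SimFile *f, size_t pages)
{
    f->dirty_pages += pages;
}

/*
 * Simulated fsync() on a system that forgets dirty data when write-back
 * fails (an assumption of this sketch): the error is reported once, the
 * pages are dropped, and later calls find nothing left to flush.
 */
int sim_fsync(SimFile *f)
{
    if (f->device_faulty)
    {
        f->dirty_pages = 0;       /* data silently discarded */
        f->device_faulty = false; /* error is reported only once */
        return -1;
    }
    f->durable_pages += f->dirty_pages;
    f->dirty_pages = 0;
    return 0;
}

/* Naive policy: keep retrying fsync() until it reports success. */
bool checkpoint_with_retry(SimFile *f, int max_attempts)
{
    assert(max_attempts > 0);
    for (int i = 0; i < max_attempts; i++)
    {
        if (sim_fsync(f) == 0)
            return true; /* "success", even if nothing was flushed */
    }
    return false;
}
```

In this model, writing eight pages to a faulty file and then calling `checkpoint_with_retry(&f, 2)` returns true while `durable_pages` is still zero. That is the apparent success with lost data that panicking on the first failure is meant to rule out.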

1 parent 1556cb2, commit 9ccdd7f

File tree

12 files changed, +99 -18 lines changed

doc/src/sgml
- config.sgml
src
- backend
  - access
    - heap
      - rewriteheap.c
    - transam
      - slru.c
      - timeline.c
      - xlog.c
  - replication/logical
    - snapbuild.c
  - storage
    - file
      - fd.c
    - smgr
      - md.c
  - utils
    - cache
      - relmapper.c
    - misc
      - guc.c
      - postgresql.conf.sample
- include/storage
  - fd.h

`‎doc/src/sgml/config.sgml‎`

Lines changed: 32 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -8161,6 +8161,38 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'`
`8161`	`8161`	`</listitem>`
`8162`	`8162`	`</varlistentry>`
`8163`	`8163`
	`8164`	`+ <varlistentry id="guc-data-sync-retry" xreflabel="data_sync_retry">`
	`8165`	`+ <term><varname>data_sync_retry</varname> (<type>boolean</type>)`
	`8166`	`+ <indexterm>`
	`8167`	`+ <primary><varname>data_sync_retry</varname> configuration parameter</primary>`
	`8168`	`+ </indexterm>`
	`8169`	`+ </term>`
	`8170`	`+ <listitem>`
	`8171`	`+ <para>`
	`8172`	`+ When set to false, which is the default, <productname>PostgreSQL</productname>`
	`8173`	`+ will raise a PANIC-level error on failure to flush modified data files`
	`8174`	`+ to the filesystem. This causes the database server to crash.`
	`8175`	`+ </para>`
	`8176`	`+ <para>`
	`8177`	`+ On some operating systems, the status of data in the kernel's page`
	`8178`	`+ cache is unknown after a write-back failure. In some cases it might`
	`8179`	`+ have been entirely forgotten, making it unsafe to retry; the second`
	`8180`	`+ attempt may be reported as successful, when in fact the data has been`
	`8181`	`+ lost. In these circumstances, the only way to avoid data loss is to`
	`8182`	`+ recover from the WAL after any failure is reported, preferably`
	`8183`	`+ after investigating the root cause of the failure and replacing any`
	`8184`	`+ faulty hardware.`
	`8185`	`+ </para>`
	`8186`	`+ <para>`
	`8187`	`+ If set to true, <productname>PostgreSQL</productname> will instead`
	`8188`	`+ report an error but continue to run so that the data flushing`
	`8189`	`+ operation can be retried in a later checkpoint. Only set it to true`
	`8190`	`+ after investigating the operating system's treatment of buffered data`
	`8191`	`+ in case of write-back failure.`
	`8192`	`+ </para>`
	`8193`	`+ </listitem>`
	`8194`	`+ </varlistentry>`
	`8195`	`+`
`8164`	`8196`	`</variablelist>`
`8165`	`8197`
`8166`	`8198`	`</sect1>`

`‎src/backend/access/heap/rewriteheap.c‎`

Lines changed: 3 additions & 3 deletions

Original file line number	Diff line number	Diff line change
`@@ -978,7 +978,7 @@ logical_end_heap_rewrite(RewriteState state)`
`978`	`978`	`while ((src= (RewriteMappingFile*)hash_seq_search(&seq_status))!=NULL)`
`979`	`979`	`{`
`980`	`980`	`if (FileSync(src->vfd,WAIT_EVENT_LOGICAL_REWRITE_SYNC)!=0)`
`981`		`-ereport(ERROR,`
	`981`	`+ereport(data_sync_elevel(ERROR),`
`982`	`982`	`(errcode_for_file_access(),`
`983`	`983`	`errmsg("could not fsync file \"%s\": %m",src->path)));`
`984`	`984`	`FileClose(src->vfd);`
`@@ -1199,7 +1199,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)`
`1199`	`1199`	`*/`
`1200`	`1200`	`pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC);`
`1201`	`1201`	`if (pg_fsync(fd)!=0)`
`1202`		`-ereport(ERROR,`
	`1202`	`+ereport(data_sync_elevel(ERROR),`
`1203`	`1203`	`(errcode_for_file_access(),`
`1204`	`1204`	`errmsg("could not fsync file \"%s\": %m",path)));`
`1205`	`1205`	`pgstat_report_wait_end();`
`@@ -1298,7 +1298,7 @@ CheckPointLogicalRewriteHeap(void)`
`1298`	`1298`	`*/`
`1299`	`1299`	`pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC);`
`1300`	`1300`	`if (pg_fsync(fd)!=0)`
`1301`		`-ereport(ERROR,`
	`1301`	`+ereport(data_sync_elevel(ERROR),`
`1302`	`1302`	`(errcode_for_file_access(),`
`1303`	`1303`	`errmsg("could not fsync file \"%s\": %m",path)));`
`1304`	`1304`	`pgstat_report_wait_end();`

`‎src/backend/access/transam/slru.c‎`

Lines changed: 1 addition & 1 deletion

Original file line number	Diff line number	Diff line change
`@@ -928,7 +928,7 @@ SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid)`
`928`	`928`	`path,offset)));`
`929`	`929`	`break;`
`930`	`930`	`case SLRU_FSYNC_FAILED:`
`931`		`-ereport(ERROR,`
	`931`	`+ereport(data_sync_elevel(ERROR),`
`932`	`932`	`(errcode_for_file_access(),`
`933`	`933`	`errmsg("could not access status of transaction %u",xid),`
`934`	`934`	`errdetail("Could not fsync file \"%s\": %m.",`

`‎src/backend/access/transam/timeline.c‎`

Lines changed: 2 additions & 2 deletions

Original file line number	Diff line number	Diff line change
`@@ -406,7 +406,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,`
`406`	`406`
`407`	`407`	`pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_SYNC);`
`408`	`408`	`if (pg_fsync(fd)!=0)`
`409`		`-ereport(ERROR,`
	`409`	`+ereport(data_sync_elevel(ERROR),`
`410`	`410`	`(errcode_for_file_access(),`
`411`	`411`	`errmsg("could not fsync file \"%s\": %m",tmppath)));`
`412`	`412`	`pgstat_report_wait_end();`
`@@ -485,7 +485,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)`
`485`	`485`
`486`	`486`	`pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC);`
`487`	`487`	`if (pg_fsync(fd)!=0)`
`488`		`-ereport(ERROR,`
	`488`	`+ereport(data_sync_elevel(ERROR),`
`489`	`489`	`(errcode_for_file_access(),`
`490`	`490`	`errmsg("could not fsync file \"%s\": %m",tmppath)));`
`491`	`491`	`pgstat_report_wait_end();`

`‎src/backend/access/transam/xlog.c‎`

Lines changed: 1 addition & 1 deletion

Original file line number	Diff line number	Diff line change
`@@ -3455,7 +3455,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,`
`3455`	`3455`
`3456`	`3456`	`pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_SYNC);`
`3457`	`3457`	`if (pg_fsync(fd)!=0)`
`3458`		`-ereport(ERROR,`
	`3458`	`+ereport(data_sync_elevel(ERROR),`
`3459`	`3459`	`(errcode_for_file_access(),`
`3460`	`3460`	`errmsg("could not fsync file \"%s\": %m",tmppath)));`
`3461`	`3461`	`pgstat_report_wait_end();`

`‎src/backend/replication/logical/snapbuild.c‎`

Lines changed: 3 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -1629,6 +1629,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)`
`1629`	`1629`	`* fsync the file before renaming so that even if we crash after this we`
`1630`	`1630`	`* have either a fully valid file or nothing.`
`1631`	`1631`	`*`
	`1632`	`+ * It's safe to just ERROR on fsync() here because we'll retry the whole`
	`1633`	`+ * operation including the writes.`
	`1634`	`+ *`
`1632`	`1635`	`* TODO: Do the fsync() via checkpoints/restartpoints, doing it here has`
`1633`	`1636`	`* some noticeable overhead since it's performed synchronously during`
`1634`	`1637`	`* decoding?`
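
The comment added above says a plain ERROR is safe here because the caller redoes the whole operation, writes included, instead of retrying only the fsync(). A minimal sketch of that pattern using ordinary POSIX calls follows. The function name `write_file_atomically` and the `.tmp` suffix are assumptions for illustration and are not taken from snapbuild.c. The sketch also leaves out EINTR handling and the directory fsync that a fully durable rename would need.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write len bytes to path through a temporary file, fsync it, then rename
 * it into place. On any failure the temporary file is removed and -1 is
 * returned, so the caller can redo everything from the first write rather
 * than trusting a second fsync() on possibly dropped data.
 */
int write_file_atomically(const char *path, const void *data, size_t len)
{
    char        tmppath[4096];
    const char *p = data;
    size_t      remaining = len;
    int         fd;
    int         n;

    assert(path != NULL && data != NULL);

    n = snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
    if (n < 0 || (size_t) n >= sizeof(tmppath))
        return -1;

    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    while (remaining > 0)
    {
        ssize_t w = write(fd, p, remaining);

        if (w < 0)
        {
            close(fd);
            unlink(tmppath);
            return -1;
        }
        p += w;
        remaining -= (size_t) w;
    }

    /* A failed fsync() abandons the file; it is never retried in place. */
    if (fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    if (close(fd) != 0 || rename(tmppath, path) != 0)
    {
        unlink(tmppath);
        return -1;
    }
    return 0;
}
```

Because every failure path discards the temporary file, a retry by the caller starts again from the in-memory data and never depends on pages the kernel may already have thrown away.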

`‎src/backend/storage/file/fd.c‎`

Lines changed: 41 additions & 7 deletions

Original file line number	Diff line number	Diff line change
`@@ -145,6 +145,8 @@ int max_files_per_process = 1000;`
`145`	`145`	`*/`
`146`	`146`	`int max_safe_fds = 32; /* default if not changed */`
`147`	`147`
	`148`	`+/* Whether it is safe to continue running after fsync() fails. */`
	`149`	`+bool data_sync_retry = false;`
`148`	`150`
`149`	`151`	`/* Debugging.... */`
`150`	`152`
`@@ -430,11 +432,9 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)`
`430`	`432`	`*/`
`431`	`433`	`rc=sync_file_range(fd,offset,nbytes,`
`432`	`434`	`SYNC_FILE_RANGE_WRITE);`
`433`		`-`
`434`		`-/* don't error out, this is just a performance optimization */`
`435`	`435`	`if (rc!=0)`
`436`	`436`	`{`
`437`		`-ereport(WARNING,`
	`437`	`+ereport(data_sync_elevel(WARNING),`
`438`	`438`	`(errcode_for_file_access(),`
`439`	`439`	`errmsg("could not flush dirty data: %m")));`
`440`	`440`	`}`
`@@ -506,7 +506,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)`
`506`	`506`	`rc=msync(p, (size_t)nbytes,MS_ASYNC);`
`507`	`507`	`if (rc!=0)`
`508`	`508`	`{`
`509`		`-ereport(WARNING,`
	`509`	`+ereport(data_sync_elevel(WARNING),`
`510`	`510`	`(errcode_for_file_access(),`
`511`	`511`	`errmsg("could not flush dirty data: %m")));`
`512`	`512`	`/* NB: need to fall through to munmap()! */`
`@@ -562,7 +562,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)`
`562`	`562`	`void`
`563`	`563`	`fsync_fname(const char *fname, bool isdir)`
`564`	`564`	`{`
`565`		`-fsync_fname_ext(fname,isdir, false,ERROR);`
	`565`	`+fsync_fname_ext(fname,isdir, false,data_sync_elevel(ERROR));`
`566`	`566`	`}`
`567`	`567`
`568`	`568`	`/*`
`@@ -1022,7 +1022,8 @@ LruDelete(File file)`
`1022`	`1022`	`* to leak the FD than to mess up our internal state.`
`1023`	`1023`	`*/`
`1024`	`1024`	`if (close(vfdP->fd))`
`1025`		`-elog(LOG,"could not close file \"%s\": %m",vfdP->fileName);`
	`1025`	`+elog(vfdP->fdstate & FD_TEMP_FILE_LIMIT ? LOG : data_sync_elevel(LOG),`
	`1026`	`+"could not close file \"%s\": %m",vfdP->fileName);`
`1026`	`1027`	`vfdP->fd=VFD_CLOSED;`
`1027`	`1028`	`--nfile;`
`1028`	`1029`
`@@ -1698,7 +1699,14 @@ FileClose(File file)`
`1698`	`1699`	`{`
`1699`	`1700`	`/* close the file */`
`1700`	`1701`	`if (close(vfdP->fd))`
`1701`		`-elog(LOG,"could not close file \"%s\": %m",vfdP->fileName);`
	`1702`	`+{`
	`1703`	`+/*`
	`1704`	`+ * We may need to panic on failure to close non-temporary files;`
	`1705`	`+ * see LruDelete.`
	`1706`	`+ */`
	`1707`	`+elog(vfdP->fdstate & FD_TEMP_FILE_LIMIT ? LOG : data_sync_elevel(LOG),`
	`1708`	`+"could not close file \"%s\": %m",vfdP->fileName);`
	`1709`	`+}`
`1702`	`1710`
`1703`	`1711`	`--nfile;`
`1704`	`1712`	`vfdP->fd=VFD_CLOSED;`
`@@ -3091,6 +3099,9 @@ looks_like_temp_rel_name(const char *name)`
`3091`	`3099`	`* harmless cases such as read-only files in the data directory, and that's`
`3092`	`3100`	`* not good either.`
`3093`	`3101`	`*`
	`3102`	`+ * Note that if we previously crashed due to a PANIC on fsync(), we'll be`
	`3103`	`+ * rewriting all changes again during recovery.`
	`3104`	`+ *`
`3094`	`3105`	`* Note we assume we're chdir'd into PGDATA to begin with.`
`3095`	`3106`	`*/`
`3096`	`3107`	`void`
`@@ -3413,3 +3424,26 @@ MakePGDirectory(const char *directoryName)`
`3413`	`3424`	`{`
`3414`	`3425`	`return mkdir(directoryName, pg_dir_create_mode);`
`3415`	`3426`	`}`
	`3427`	`+`
	`3428`	`+/*`
	`3429`	`+ * Return the passed-in error level, or PANIC if data_sync_retry is off.`
	`3430`	`+ *`
	`3431`	`+ * Failure to fsync any data file is cause for immediate panic, unless`
	`3432`	`+ * data_sync_retry is enabled. Data may have been written to the operating`
	`3433`	`+ * system and removed from our buffer pool already, and if we are running on`
	`3434`	`+ * an operating system that forgets dirty data on write-back failure, there`
	`3435`	`+ * may be only one copy of the data remaining: in the WAL. A later attempt to`
	`3436`	`+ * fsync again might falsely report success. Therefore we must not allow any`
	`3437`	`+ * further checkpoints to be attempted. data_sync_retry can in theory be`
	`3438`	`+ * enabled on systems known not to drop dirty buffered data on write-back`
	`3439`	`+ * failure (with the likely outcome that checkpoints will continue to fail`
	`3440`	`+ * until the underlying problem is fixed).`
	`3441`	`+ *`
	`3442`	`+ * Any code that reports a failure from fsync() or related functions should`
	`3443`	`+ * filter the error level with this function.`
	`3444`	`+ */`
	`3445`	`+int`
	`3446`	`+data_sync_elevel(int elevel)`
	`3447`	`+{`
	`3448`	`+return data_sync_retry ? elevel : PANIC;`
	`3449`	`+}`
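
The level-filtering idea in this file, including the temporary-file exemption applied on the close() paths, can be tried outside the server with a standalone sketch. Everything prefixed `sketch_` or `SKETCH_` is a stand-in introduced for illustration. The numeric severity values and the flag bit are assumptions and do not match the real definitions in PostgreSQL's headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in severity levels; the numeric values are assumptions. */
enum
{
    SKETCH_LOG = 1,
    SKETCH_WARNING,
    SKETCH_ERROR,
    SKETCH_PANIC
};

/* Hypothetical flag bit marking a temporary file. */
#define SKETCH_FD_TEMP (1u << 0)

/* Mirrors the data_sync_retry setting, which defaults to off. */
bool sketch_data_sync_retry = false;

/* Return the requested level, or the panic level when retry is off. */
int sketch_data_sync_elevel(int elevel)
{
    assert(elevel >= SKETCH_LOG && elevel <= SKETCH_PANIC);
    return sketch_data_sync_retry ? elevel : SKETCH_PANIC;
}

/*
 * Level for reporting a failed close(): temporary files are only logged,
 * while any other file goes through the filter above.
 */
int sketch_close_failure_elevel(unsigned fdstate)
{
    return (fdstate & SKETCH_FD_TEMP)
        ? SKETCH_LOG
        : sketch_data_sync_elevel(SKETCH_LOG);
}
```

With the flag off, every filtered level is promoted to the panic level except close() failures on temporary files. With the flag on, each call site keeps the level it asked for.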

`‎src/backend/storage/smgr/md.c‎`

Lines changed: 3 additions & 3 deletions

Original file line number	Diff line number	Diff line change
`@@ -1012,7 +1012,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)`
`1012`	`1012`	`MdfdVec*v=&reln->md_seg_fds[forknum][segno-1];`
`1013`	`1013`
`1014`	`1014`	`if (FileSync(v->mdfd_vfd,WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC)<0)`
`1015`		`-ereport(ERROR,`
	`1015`	`+ereport(data_sync_elevel(ERROR),`
`1016`	`1016`	`(errcode_for_file_access(),`
`1017`	`1017`	`errmsg("could not fsync file \"%s\": %m",`
`1018`	`1018`	`FilePathName(v->mdfd_vfd))));`
`@@ -1257,7 +1257,7 @@ mdsync(void)`
`1257`	`1257`	`bms_join(new_requests,requests);`
`1258`	`1258`
`1259`	`1259`	`errno=save_errno;`
`1260`		`-ereport(ERROR,`
	`1260`	`+ereport(data_sync_elevel(ERROR),`
`1261`	`1261`	`(errcode_for_file_access(),`
`1262`	`1262`	`errmsg("could not fsync file \"%s\": %m",`
`1263`	`1263`	`path)));`
`@@ -1431,7 +1431,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)`
`1431`	`1431`	`(errmsg("could not forward fsync request because request queue is full")));`
`1432`	`1432`
`1433`	`1433`	`if (FileSync(seg->mdfd_vfd,WAIT_EVENT_DATA_FILE_SYNC)<0)`
`1434`		`-ereport(ERROR,`
	`1434`	`+ereport(data_sync_elevel(ERROR),`
`1435`	`1435`	`(errcode_for_file_access(),`
`1436`	`1436`	`errmsg("could not fsync file \"%s\": %m",`
`1437`	`1437`	`FilePathName(seg->mdfd_vfd))));`

`‎src/backend/utils/cache/relmapper.c‎`

Lines changed: 1 addition & 1 deletion

Original file line number	Diff line number	Diff line change
`@@ -876,7 +876,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,`
`876`	`876`	`*/`
`877`	`877`	`pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_SYNC);`
`878`	`878`	`if (pg_fsync(fd)!=0)`
`879`		`-ereport(ERROR,`
	`879`	`+ereport(data_sync_elevel(ERROR),`
`880`	`880`	`(errcode_for_file_access(),`
`881`	`881`	`errmsg("could not fsync file \"%s\": %m",`
`882`	`882`	`mapfilename)));`

`‎src/backend/utils/misc/guc.c‎`

Lines changed: 9 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -1830,6 +1830,15 @@ static struct config_bool ConfigureNamesBool[] =`
`1830`	`1830`	`NULL,NULL,NULL`
`1831`	`1831`	`},`
`1832`	`1832`
	`1833`	`+{`
	`1834`	`+{"data_sync_retry",PGC_POSTMASTER,ERROR_HANDLING_OPTIONS,`
	`1835`	`+gettext_noop("Whether to continue running after a failure to sync data files."),`
	`1836`	`+},`
	`1837`	`+&data_sync_retry,`
	`1838`	`+false,`
	`1839`	`+NULL,NULL,NULL`
	`1840`	`+},`
	`1841`	`+`
`1833`	`1842`	`/* End-of-list marker */`
`1834`	`1843`	`{`
`1835`	`1844`	`{NULL,0,0,NULL,NULL},NULL, false,NULL,NULL,NULL`
