Commit 428b1d6

Allow to trigger kernel writeback after a configurable number of writes.

Currently writes to the main data files of postgres all go through the OS page cache. This means that some operating systems can end up collecting a large number of dirty buffers in their respective page caches. When these dirty buffers are flushed to storage rapidly, be it because of fsync(), timeouts, or dirty ratios, latency for other reads and writes can increase massively. This is the primary reason for regular massive stalls observed in real world scenarios and artificial benchmarks; on rotating disks stalls on the order of hundreds of seconds have been observed.

On Linux it is possible to control this by reducing the global dirty limits significantly, reducing the above problem. But global configuration is rather problematic because it'll affect other applications; also PostgreSQL itself doesn't always want this behavior, e.g. for temporary files it's undesirable.

Several operating systems allow some control over the kernel page cache. Linux has sync_file_range(2), several POSIX systems have msync(2) and posix_fadvise(2). sync_file_range(2) is preferable because it requires no special setup, whereas msync() requires the to-be-flushed range to be mmap'ed. For the purpose of flushing dirty data posix_fadvise(2) is the worst alternative, as flushing dirty data is just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages from the page cache. Thus the feature is enabled by default only on Linux, but can be enabled on all systems that have any of the above APIs.

While desirable and likely possible, this patch does not contain an implementation for Windows.

With the infrastructure added, writes made via checkpointer, bgwriter and normal user backends can be flushed after a configurable number of writes. Each of these sources of writes is controlled by a separate GUC, checkpoint_flush_after, bgwriter_flush_after and backend_flush_after respectively; they're separate because the number of flushes that is beneficial differs between them, and because the performance considerations of controlled flushing for each of these are different.

A later patch will add checkpoint sorting - after that flushes from the checkpoint will almost always be desirable. Bgwriter flushes are most of the time going to be random, which are slow on lots of storage hardware. Flushing in backends works well if the storage and bgwriter can keep up, but if not it can have negative consequences. This patch is likely to have negative performance consequences without checkpoint sorting, but unfortunately so has sorting without flush control.

Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
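
The commit message's comparison of sync_file_range(2) and posix_fadvise(2) can be sketched in a few lines of C. This is a minimal illustration and not the code added by the patch: `hint_writeback` is a hypothetical helper name, and the choice to treat a zero-length range as a no-op is an assumption made here (sync_file_range(2) itself interprets a length of 0 as "through end of file").

```c
#define _GNU_SOURCE             /* exposes sync_file_range() in <fcntl.h> on Linux */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not from the patch): ask the kernel to start writing
 * back the byte range [offset, offset + nbytes) of fd.
 * Returns 0 on success or when no hint API is available, -1 on failure.
 */
int
hint_writeback(int fd, off_t offset, off_t nbytes)
{
    assert(offset >= 0 && nbytes >= 0);

    /* Assumption of this sketch: an empty range means "nothing to do". */
    if (nbytes == 0)
        return 0;

#if defined(__linux__)
    /*
     * SYNC_FILE_RANGE_WRITE starts asynchronous writeback of dirty pages in
     * the range and leaves the pages in the page cache.
     */
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
#elif defined(POSIX_FADV_DONTNEED)
    /*
     * Fallback: flushing is only a side effect here, and the pages are also
     * dropped from the cache. posix_fadvise() returns an error number
     * directly instead of setting errno.
     */
    return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED) == 0 ? 0 : -1;
#else
    (void) fd;
    return 0;                   /* no hint API: silently do nothing */
#endif
}
```

A caller would invoke the helper after writing a batch of blocks to a data file, passing the offset and length of the batch. The msync(2) route mentioned above is left out because it would require the range to be mapped first.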

1 parent c82c92b, commit 428b1d6

File tree (15 files changed, +601 -31 lines changed)

doc/src/sgml
- config.sgml
- wal.sgml
src
- backend
  - postmaster
    - bgwriter.c
  - storage
    - buffer
      - buf_init.c
      - bufmgr.c
    - file
      - copydir.c
      - fd.c
    - smgr
      - md.c
      - smgr.c
  - utils/misc
    - guc.c
- include/storage
- tools/pgindent
  - typedefs.list

`‎doc/src/sgml/config.sgml`

Lines changed: 87 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -1843,6 +1843,35 @@ include_dir 'conf.d'`
`1843`	`1843`	`</para>`
`1844`	`1844`	`</listitem>`
`1845`	`1845`	`</varlistentry>`
	`1846`	`+`
	`1847`	`+ <varlistentry id="guc-bgwriter-flush-after" xreflabel="bgwriter_flush_after">`
	`1848`	`+ <term><varname>bgwriter_flush_after</varname> (<type>int</type>)`
	`1849`	`+ <indexterm>`
	`1850`	`+ <primary><varname>bgwriter_flush_after</> configuration parameter</primary>`
	`1851`	`+ </indexterm>`
	`1852`	`+ </term>`
	`1853`	`+ <listitem>`
	`1854`	`+ <para>`
	`1855`	`+ Whenever more than <varname>bgwriter_flush_after</varname> bytes have`
	`1856`	`+ been written by the bgwriter, attempt to force the OS to issue these`
	`1857`	`+ writes to the underlying storage. Doing so will limit the amount of`
	`1858`	`+ dirty data in the kernel's page cache, reducing the likelihood of`
	`1859`	`+ stalls when an fsync is issued at the end of a checkpoint, or when`
	`1860`	`+ the OS writes data back in larger batches in the background. Often`
	`1861`	`+ that will result in greatly reduced transaction latency, but there`
	`1862`	`+ also are some cases, especially with workloads that are bigger than`
	`1863`	`+ <xref linkend="guc-shared-buffers">, but smaller than the OS's page`
	`1864`	`+ cache, where performance might degrade. This setting may have no`
	`1865`	`+ effect on some platforms. The valid range is between`
	`1866`	`+ <literal>0</literal>, which disables controlled writeback, and`
	`1867`	`+ <literal>2MB</literal>. The default is <literal>512kB</> on Linux,`
	`1868`	`+ <literal>0</> elsewhere. (Non-default values of`
	`1869`	`+ <symbol>BLCKSZ</symbol> change the default and maximum.)`
	`1870`	`+ This parameter can only be set in the <filename>postgresql.conf</>`
	`1871`	`+ file or on the server command line.`
	`1872`	`+ </para>`
	`1873`	`+ </listitem>`
	`1874`	`+ </varlistentry>`
`1846`	`1875`	`</variablelist>`
`1847`	`1876`
`1848`	`1877`	`<para>`
`@@ -1944,6 +1973,35 @@ include_dir 'conf.d'`
`1944`	`1973`	`</para>`
`1945`	`1974`	`</listitem>`
`1946`	`1975`	`</varlistentry>`
	`1976`	`+`
	`1977`	`+ <varlistentry id="guc-backend-flush-after" xreflabel="backend_flush_after">`
	`1978`	`+ <term><varname>backend_flush_after</varname> (<type>int</type>)`
	`1979`	`+ <indexterm>`
	`1980`	`+ <primary><varname>backend_flush_after</> configuration parameter</primary>`
	`1981`	`+ </indexterm>`
	`1982`	`+ </term>`
	`1983`	`+ <listitem>`
	`1984`	`+ <para>`
	`1985`	`+ Whenever more than <varname>backend_flush_after</varname> bytes have`
	`1986`	`+ been written by a single backend, attempt to force the OS to issue`
	`1987`	`+ these writes to the underlying storage. Doing so will limit the`
	`1988`	`+ amount of dirty data in the kernel's page cache, reducing the`
	`1989`	`+ likelihood of stalls when an fsync is issued at the end of a`
	`1990`	`+ checkpoint, or when the OS writes data back in larger batches in the`
	`1991`	`+ background. Often that will result in greatly reduced transaction`
	`1992`	`+ latency, but there also are some cases, especially with workloads`
	`1993`	`+ that are bigger than <xref linkend="guc-shared-buffers">, but smaller`
	`1994`	`+ than the OS's page cache, where performance might degrade. This`
	`1995`	`+ setting may have no effect on some platforms. The valid range is`
	`1996`	`+ between <literal>0</literal>, which disables controlled writeback,`
	`1997`	`+ and <literal>2MB</literal>. The default is <literal>128kB</> on`
	`1998`	`+ Linux, <literal>0</> elsewhere. (Non-default values of`
	`1999`	`+ <symbol>BLCKSZ</symbol> change the default and maximum.)`
	`2000`	`+ This parameter can only be set in the <filename>postgresql.conf</>`
	`2001`	`+ file or on the server command line.`
	`2002`	`+ </para>`
	`2003`	`+ </listitem>`
	`2004`	`+ </varlistentry>`
`1947`	`2005`	`</variablelist>`
`1948`	`2006`	`</sect2>`
`1949`	`2007`	`</sect1>`
`@@ -2475,6 +2533,35 @@ include_dir 'conf.d'`
`2475`	`2533`	`</listitem>`
`2476`	`2534`	`</varlistentry>`
`2477`	`2535`
	`2536`	`+ <varlistentry id="guc-checkpoint-flush-after" xreflabel="checkpoint_flush_after">`
	`2537`	`+ <term><varname>checkpoint_flush_after</varname> (<type>int</type>)`
	`2538`	`+ <indexterm>`
	`2539`	`+ <primary><varname>checkpoint_flush_after</> configuration parameter</primary>`
	`2540`	`+ </indexterm>`
	`2541`	`+ </term>`
	`2542`	`+ <listitem>`
	`2543`	`+ <para>`
	`2544`	`+ Whenever more than <varname>checkpoint_flush_after</varname> bytes`
	`2545`	`+ have been written while performing a checkpoint, attempt to force the`
	`2546`	`+ OS to issue these writes to the underlying storage. Doing so will`
	`2547`	`+ limit the amount of dirty data in the kernel's page cache, reducing`
	`2548`	`+ the likelihood of stalls when an fsync is issued at the end of the`
	`2549`	`+ checkpoint, or when the OS writes data back in larger batches in the`
	`2550`	`+ background. Often that will result in greatly reduced transaction`
	`2551`	`+ latency, but there also are some cases, especially with workloads`
	`2552`	`+ that are bigger than <xref linkend="guc-shared-buffers">, but smaller`
	`2553`	`+ than the OS's page cache, where performance might degrade. This`
	`2554`	`+ setting may have no effect on some platforms. The valid range is`
	`2555`	`+ between <literal>0</literal>, which disables controlled writeback,`
	`2556`	`+ and <literal>2MB</literal>. The default is <literal>128kB</> on`
	`2557`	`+ Linux, <literal>0</> elsewhere. (Non-default values of`
	`2558`	`+ <symbol>BLCKSZ</symbol> change the default and maximum.)`
	`2559`	`+ This parameter can only be set in the <filename>postgresql.conf</>`
	`2560`	`+ file or on the server command line.`
	`2561`	`+ </para>`
	`2562`	`+ </listitem>`
	`2563`	`+ </varlistentry>`
	`2564`	`+`
`2478`	`2565`	`<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">`
`2479`	`2566`	`<term><varname>checkpoint_warning</varname> (<type>integer</type>)`
`2480`	`2567`	`<indexterm>`
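
The note that non-default values of BLCKSZ change the default and maximum suggests that the limits are kept as a number of pages and not as bytes. The arithmetic can be sketched as follows, assuming (this is not shown in the hunks above) that each setting is stored in units of BLCKSZ-sized pages; `flush_after_bytes` is a hypothetical helper introduced only for illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper: convert a flush limit expressed in pages into bytes
 * for a given block size. Assumes the setting is stored as a page count.
 */
long
flush_after_bytes(int pages, int blcksz)
{
    assert(pages >= 0 && blcksz > 0);
    return (long) pages * (long) blcksz;
}
```

Under that assumption and the default 8kB block size, 64 pages correspond to the documented 512kB, 16 pages to 128kB, and 256 pages to the 2MB maximum. Building with a 16kB BLCKSZ would double each of those byte values for the same page counts.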

`‎doc/src/sgml/wal.sgml`

Lines changed: 11 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -545,6 +545,17 @@`
`545`	`545`	`unexpected variation in the number of WAL segments needed.`
`546`	`546`	`</para>`
`547`	`547`
	`548`	`+ <para>`
	`549`	`+ On Linux and POSIX platforms <xref linkend="guc-checkpoint-flush-after">`
	`550`	`+ allows forcing the OS to flush pages written by the checkpoint to disk`
	`551`	`+ after a configurable number of bytes. Otherwise, these`
	`552`	`+ pages may be kept in the OS's page cache, inducing a stall when`
	`553`	`+ <literal>fsync</> is issued at the end of a checkpoint. This setting will`
	`554`	`+ often help to reduce transaction latency, but it also can have an adverse effect`
	`555`	`+ on performance; particularly for workloads that are bigger than`
	`556`	`+ <xref linkend="guc-shared-buffers">, but smaller than the OS's page cache.`
	`557`	`+ </para>`
	`558`	`+`
`548`	`559`	`<para>`
`549`	`560`	`The number of WAL segment files in <filename>pg_xlog</> directory depends on`
`550`	`561`	`<varname>min_wal_size</>, <varname>max_wal_size</> and`

`‎src/backend/postmaster/bgwriter.c`

Lines changed: 7 additions & 1 deletion

Original file line number	Diff line number	Diff line change
`@@ -111,6 +111,7 @@ BackgroundWriterMain(void)`
`111`	`111`	`sigjmp_buf local_sigjmp_buf;`
`112`	`112`	`MemoryContext bgwriter_context;`
`113`	`113`	`bool prev_hibernate;`
	`114`	`+WritebackContext wb_context;`
`114`	`115`
`115`	`116`	`/*`
`116`	`117`	`* Properly accept or ignore signals the postmaster might send us.`
`@@ -164,6 +165,8 @@ BackgroundWriterMain(void)`
`164`	`165`	`ALLOCSET_DEFAULT_MAXSIZE);`
`165`	`166`	`MemoryContextSwitchTo(bgwriter_context);`
`166`	`167`
	`168`	`+WritebackContextInit(&wb_context, &bgwriter_flush_after);`
	`169`	`+`
`167`	`170`	`/*`
`168`	`171`	`* If an exception is encountered, processing resumes here.`
`169`	`172`	`*`
`@@ -208,6 +211,9 @@ BackgroundWriterMain(void)`
`208`	`211`	`/* Flush any leaked data in the top-level context */`
`209`	`212`	`MemoryContextResetAndDeleteChildren(bgwriter_context);`
`210`	`213`
	`214`	`+/* re-initialize to avoid repeated errors causing problems */`
	`215`	`+WritebackContextInit(&wb_context, &bgwriter_flush_after);`
	`216`	`+`
`211`	`217`	`/* Now we can allow interrupts again */`
`212`	`218`	`RESUME_INTERRUPTS();`
`213`	`219`
`@@ -272,7 +278,7 @@ BackgroundWriterMain(void)`
`272`	`278`	`/*`
`273`	`279`	`* Do one cycle of dirty-buffer writing.`
`274`	`280`	`*/`
`275`		`-can_hibernate = BgBufferSync();`
	`281`	`+can_hibernate = BgBufferSync(&wb_context);`
`276`	`282`
`277`	`283`	`/*`
`278`	`284`	`* Send off activity statistics to the stats collector`
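
The pattern in this hunk (initialize a writeback context with a pointer to the setting, pass it to the write loop, and re-initialize it after an error) can be illustrated with a small stand-alone sketch. The struct layout and the `sketch_` function names below are hypothetical and are not the patch's `WritebackContext` implementation; the flush itself is only counted here, where real code would issue an OS writeback hint.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a writeback context; the layout is illustrative. */
typedef struct SketchWritebackContext
{
    const int  *max_pending;    /* points at the live setting, so changes apply */
    int         nr_pending;     /* writes recorded since the last flush */
    int         nr_flushes;     /* flushes triggered so far (for demonstration) */
} SketchWritebackContext;

/* (Re-)initialize the context; any pending writes are forgotten. */
void
sketch_writeback_init(SketchWritebackContext *ctx, const int *max_pending)
{
    assert(ctx != NULL && max_pending != NULL);
    ctx->max_pending = max_pending;
    ctx->nr_pending = 0;
    ctx->nr_flushes = 0;
}

/* Record one written page. Returns 1 if this write triggered a flush. */
int
sketch_writeback_schedule(SketchWritebackContext *ctx)
{
    assert(ctx != NULL);

    /* A limit of 0 disables controlled writeback entirely. */
    if (*ctx->max_pending <= 0)
    {
        ctx->nr_pending = 0;
        return 0;
    }

    ctx->nr_pending++;
    if (ctx->nr_pending >= *ctx->max_pending)
    {
        /* Real code would ask the OS to write back the pending pages here. */
        ctx->nr_pending = 0;
        ctx->nr_flushes++;
        return 1;
    }
    return 0;
}
```

Storing a pointer to the setting, as the `&bgwriter_flush_after` argument in the diff suggests, lets a configuration change take effect without rebuilding the context. Calling the init function again in the error-recovery path discards whatever was pending, so a failure during a flush is not retried on every loop iteration.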

`‎src/backend/storage/buffer/buf_init.c`

Lines changed: 5 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -23,6 +23,7 @@ char *BufferBlocks;`
`23`	`23`	`LWLockMinimallyPadded *BufferIOLWLockArray = NULL;`
`24`	`24`	`LWLockTranche BufferIOLWLockTranche;`
`25`	`25`	`LWLockTranche BufferContentLWLockTranche;`
	`26`	`+WritebackContext BackendWritebackContext;`
`26`	`27`
`27`	`28`
`28`	`29`	`/*`
`@@ -149,6 +150,10 @@ InitBufferPool(void)`
`149`	`150`
`150`	`151`	`/* Init other shared buffer-management stuff */`
`151`	`152`	`StrategyInitialize(!foundDescs);`
	`153`	`+`
	`154`	`+/* Initialize per-backend file flush context */`
	`155`	`+WritebackContextInit(&BackendWritebackContext,`
	`156`	`+&backend_flush_after);`
`152`	`157`	`}`
`153`	`158`
`154`	`159`	`/*`
