NotificationsYou must be signed in to change notification settings
Fork13
Star145

Commitf796961

committed

Merge branch 'nik-streaming-replication-lag-troubleshooting' into 'main'

Troubleshooting of streaming replication lagsSee merge request postgres-ai/postgresql-consulting/postgres-howtos!33

2 parents4e7e31a +fe3b81e commitf796961Copy full SHA for f796961

File tree

1 file changed

+79

-0

lines changed

0093_how_to_troubleshoot_streaming_replication_lag.md

1 file changed

+79

-0

lines changed

`‎0093_how_to_troubleshoot_streaming_replication_lag.md`

Lines changed: 79 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,79 @@`
	`1`	`+#How to troubleshoot streaming replication lag`
	`2`	`+Streaming replication in Postgres allows for continuous data replication from a primary server to standby servers, to ensure high availability and balance read-only workloads. However, replication lag can occur, leading to delays in data synchronization. This guide provides steps to troubleshoot and mitigate replication lag.`
	`3`	`+`
	`4`	`+##Identifying the lag`
	`5`	`+To start investigation we need to understand where we actually have lag, on which stage of replication:`
	`6`	+- sending WAL stream to replica via network by`walsender`
	`7`	+- receiving WAL stream on replica from network by`walreciever`
	`8`	+- writing WAL on disk on replica by`walreciever`
	`9`	`+- applying (replaying) WAL as a recovery process`
	`10`	`+`
	`11`	`+ Thus, streaming replication lag can be categorized into three types:`
	`12`	`+-Write Lag: The delay between when a transaction is committed on the primary and when it is written to the WAL on the standby.`
	`13`	`+-Flush Lag: The delay between when a transaction is written to the WAL on the standby and when it is flushed to the disk.`
	`14`	`+-Apply (Replay) Lag: The delay between when a transaction is flushed to the disk and when it is applied to the database on the standby.`
	`15`	`+`
	`16`	`+###Analysis query`
	`17`	`+To identify the lag, use the following SQL query:`
	`18`	+```sql
	`19`	`+select`
	`20`	`+ pid,`
	`21`	`+ client_addr,`
	`22`	`+ application_name,`
	`23`	`+ state,`
	`24`	`+ coalesce(pg_current_wal_lsn()- sent_lsn,0)AS sent_lag_bytes,`
	`25`	`+ coalesce(sent_lsn- write_lsn,0)AS write_lag_bytes,`
	`26`	`+ coalesce(write_lsn- flush_lsn,0)AS flush_lag_bytes,`
	`27`	`+ coalesce(flush_lsn- replay_lsn,0)AS replay_lag_bytes,`
	`28`	`+ coalesce(pg_current_wal_lsn()- replay_lsn,0)AS total_lag_bytes`
	`29`	`+from pg_stat_replication;`
	`30`	+```
	`31`	`+`
	`32`	`+###Example`
	`33`	`+We will get something like this:`
	`34`	+```
	`35`	`+postgres=# select`
	`36`	`+ pid,`
	`37`	`+ client_addr,`
	`38`	`+ application_name,`
	`39`	`+ state,`
	`40`	`+ coalesce(pg_current_wal_lsn() - sent_lsn, 0) AS sent_lag_bytes,`
	`41`	`+ coalesce(sent_lsn - write_lsn, 0) AS write_lag_bytes,`
	`42`	`+ coalesce(write_lsn - flush_lsn, 0) AS flush_lag_bytes,`
	`43`	`+ coalesce(flush_lsn - replay_lsn, 0) AS replay_lag_bytes,`
	`44`	`+ coalesce(pg_current_wal_lsn() - replay_lsn, 0) AS total_lag_bytes`
	`45`	`+from pg_stat_replication;`
	`46`	`+`
	`47`	`+ pid \| client_addr \| application_name \| state \| sent_lag_bytes \| write_lag_bytes \| flush_lag_bytes \| replay_lag_bytes \| total_lag_bytes`
	`48`	`+---------+----------------+------------------+-----------+----------------+-----------------+-----------------+------------------+-----------------`
	`49`	`+ 3602908 \| 10.122.224.101 \| backupmachine1 \| streaming \| 0 \| 728949184 \| 0 \| 0 \| 0`
	`50`	`+ 2490863 \| 10.122.224.102 \| backupmachine1 \| streaming \| 0 \| 519580176 \| 0 \| 0 \| 0`
	`51`	`+ 2814582 \| 10.122.224.103 \| replica1 \| streaming \| 0 \| 743384 \| 0 \| 1087208 \| 1830592`
	`52`	`+ 3596177 \| 10.122.224.104 \| replica2 \| streaming \| 0 \| 2426856 \| 0 \| 4271952 \| 6698808`
	`53`	`+ 319473 \| 10.122.224.105 \| replica3 \| streaming \| 0 \| 125080 \| 162040 \| 4186920 \| 4474040`
	`54`	+```
	`55`	`+`
	`56`	`+###How to read results`
	`57`	+Meaning of those`_lsn`
	`58`	+-`sent_lsn`: How much WAL (lsn position) has already been sent over the network
	`59`	+-`write_lsn`: How much WAL (lsn position) has been sent to the operating system (before flushing)
	`60`	+-`flush_lsn`: How much WAL (lsn position) has been flushed to the disk (written on the disk)
	`61`	+-`replay_lsn`: How much WAL (lsn position) has been applied (visible for queries)
	`62`	`+`
	`63`	+So lag is a gap between`pg_current_wal_lsn` and`replay_lsn` (`total_lag_bytes`, and it's a good idea to add it to monitoring, but for troubleshooting purposes we will need all 4
	`64`	`+`
	`65`	+- Lag on`sent_lag_bytes` means we have issues with sending the data, i.e. CPU saturated`WALsender` or overloaded network socket on the primary side
	`66`	+- Lag on`write_lag_bytes` means we have issues with receiving the data, i.e. CPU saturated`WALreceiver` or overloaded network socket on the replica side
	`67`	+- Lag on`flush_lag_bytes` means we have issues with writing the data on the disk on replica side, i.e. CPU saturated or IO contention of`WALreceiver`
	`68`	+- Lag`replay_lag_bytes` means we have issues with applying WAL on replica, usually CPU saturated or IO contention of postgres process
	`69`	`+`
	`70`	`+Once we pinpointed the problem, we need to troubleshoot the process(es) on the OS level to find the bottleneck.`
	`71`	`+`
	`72`	`+`
	`73`	`+##Possible bottlenecks`
	`74`	`+TBD`
	`75`	`+`
	`76`	`+##Additional resources`
	`77`	`+-[Streaming replication](https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION) (official Postgres docs)`
	`78`	`+-[pg_stat_replication view](https://www.postgresql.org/docs/current/monitoring-stats.html#PG-STAT-REPLICATION-VIEW) (official Postgres docs)`
	`79`	`+-[Replication configuration parameters](https://www.postgresql.org/docs/current/runtime-config-replication.html) (official Postgres docs)`

0 commit comments

Comments

(0)

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Commitf796961

File tree

1 file changed

1 file changed

`‎0093_how_to_troubleshoot_streaming_replication_lag.md`

0 commit comments