Commit f796961

Merge branch 'nik-streaming-replication-lag-troubleshooting' into 'main'

Troubleshooting of streaming replication lags

See merge request postgres-ai/postgresql-consulting/postgres-howtos!33

2 parents: 4e7e31a + fe3b81e
1 file changed: +79 -0
# How to troubleshoot streaming replication lag

Streaming replication in Postgres allows for continuous data replication from a primary server to standby servers, to ensure high availability and to balance read-only workloads. However, replication lag can occur, leading to delays in data synchronization. This guide provides steps to troubleshoot and mitigate replication lag.

## Identifying the lag

To start the investigation, we need to understand where we actually have lag, at which stage of replication:
- sending the WAL stream to the replica via the network, by `walsender`
- receiving the WAL stream from the network on the replica, by `walreceiver`
- writing the WAL to disk on the replica, by `walreceiver`
- applying (replaying) the WAL by the recovery process

Thus, streaming replication lag can be categorized into three types:

- **Write Lag**: The delay between when a transaction is committed on the primary and when it is written to the WAL on the standby.
- **Flush Lag**: The delay between when a transaction is written to the WAL on the standby and when it is flushed to the disk.
- **Apply (Replay) Lag**: The delay between when a transaction is flushed to the disk and when it is applied to the database on the standby.
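Note that since Postgres 10, `pg_stat_replication` also reports these three lags directly as time intervals, in the `write_lag`, `flush_lag`, and `replay_lag` columns. When time, rather than bytes, is the more convenient unit, a query like this can be used:

```sql
-- Time-based counterparts of the three lag types
-- (interval columns, available since Postgres 10):
select
    application_name,
    write_lag,   -- Write Lag
    flush_lag,   -- Flush Lag
    replay_lag   -- Apply (Replay) Lag
from pg_stat_replication;
```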

### Analysis query

To identify the lag, use the following SQL query:
```sql
select
    pid,
    client_addr,
    application_name,
    state,
    coalesce(pg_current_wal_lsn() - sent_lsn, 0) AS sent_lag_bytes,
    coalesce(sent_lsn - write_lsn, 0) AS write_lag_bytes,
    coalesce(write_lsn - flush_lsn, 0) AS flush_lag_bytes,
    coalesce(flush_lsn - replay_lsn, 0) AS replay_lag_bytes,
    coalesce(pg_current_wal_lsn() - replay_lsn, 0) AS total_lag_bytes
from pg_stat_replication;
```
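Subtracting one `pg_lsn` value from another, as this query does, yields the distance in bytes; the built-in function `pg_wal_lsn_diff()` performs the same calculation explicitly. For example:

```sql
-- pg_lsn subtraction and pg_wal_lsn_diff() are equivalent:
select '0/3000000'::pg_lsn - '0/1000000'::pg_lsn as diff_bytes,    -- 33554432
       pg_wal_lsn_diff('0/3000000', '0/1000000') as diff_bytes_fn; -- 33554432
```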

### Example

We will get something like this:
```
postgres=# select
    pid,
    client_addr,
    application_name,
    state,
    coalesce(pg_current_wal_lsn() - sent_lsn, 0) AS sent_lag_bytes,
    coalesce(sent_lsn - write_lsn, 0) AS write_lag_bytes,
    coalesce(write_lsn - flush_lsn, 0) AS flush_lag_bytes,
    coalesce(flush_lsn - replay_lsn, 0) AS replay_lag_bytes,
    coalesce(pg_current_wal_lsn() - replay_lsn, 0) AS total_lag_bytes
from pg_stat_replication;

   pid   |  client_addr   | application_name |   state   | sent_lag_bytes | write_lag_bytes | flush_lag_bytes | replay_lag_bytes | total_lag_bytes
---------+----------------+------------------+-----------+----------------+-----------------+-----------------+------------------+-----------------
 3602908 | 10.122.224.101 | backupmachine1   | streaming |              0 |       728949184 |               0 |                0 |               0
 2490863 | 10.122.224.102 | backupmachine1   | streaming |              0 |       519580176 |               0 |                0 |               0
 2814582 | 10.122.224.103 | replica1         | streaming |              0 |          743384 |               0 |          1087208 |         1830592
 3596177 | 10.122.224.104 | replica2         | streaming |              0 |         2426856 |               0 |          4271952 |         6698808
  319473 | 10.122.224.105 | replica3         | streaming |              0 |          125080 |          162040 |          4186920 |         4474040
```
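The byte-based lags above are measured from the primary. When connecting to the standby itself, the replay lag can also be expressed in time. A common approach:

```sql
-- Run on the standby: time elapsed since the last replayed transaction.
-- Caveat: on an idle primary this value grows even with zero real lag,
-- so interpret it with care.
select now() - pg_last_xact_replay_timestamp() as replay_time_lag;
```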

### How to read results

The meaning of those `_lsn` values:

- `sent_lsn`: how much WAL (LSN position) has already been sent over the network
- `write_lsn`: how much WAL (LSN position) has been sent to the operating system (before flushing)
- `flush_lsn`: how much WAL (LSN position) has been flushed to the disk (written on the disk)
- `replay_lsn`: how much WAL (LSN position) has been applied (visible for queries)

So the lag is the gap between `pg_current_wal_lsn()` and `replay_lsn` (`total_lag_bytes`). It's a good idea to add it to monitoring, but for troubleshooting purposes we will need all four lag metrics:
- Lag in `sent_lag_bytes` means we have issues with sending the data, e.g., a CPU-saturated `walsender` or an overloaded network socket on the primary side
- Lag in `write_lag_bytes` means we have issues with receiving the data, e.g., a CPU-saturated `walreceiver` or an overloaded network socket on the replica side
- Lag in `flush_lag_bytes` means we have issues with writing the data to disk on the replica side, e.g., CPU saturation or IO contention of `walreceiver`
- Lag in `replay_lag_bytes` means we have issues with applying WAL on the replica, usually CPU saturation or IO contention of the recovery process
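These four checks can also be folded into a single query. A sketch (the stage labels here are ours, not columns of `pg_stat_replication`, and ties go to the first match):

```sql
-- A sketch: report, per standby, the stage with the largest lag,
-- using the same column arithmetic as the analysis query above.
select
    application_name,
    coalesce(pg_current_wal_lsn() - replay_lsn, 0) as total_lag_bytes,
    case greatest(
           coalesce(pg_current_wal_lsn() - sent_lsn, 0),
           coalesce(sent_lsn - write_lsn, 0),
           coalesce(write_lsn - flush_lsn, 0),
           coalesce(flush_lsn - replay_lsn, 0))
      when coalesce(pg_current_wal_lsn() - sent_lsn, 0) then 'sending (primary side)'
      when coalesce(sent_lsn - write_lsn, 0) then 'receiving (network / replica)'
      when coalesce(write_lsn - flush_lsn, 0) then 'flushing to disk (replica)'
      else 'applying WAL (replica)'
    end as dominant_stage
from pg_stat_replication;
```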

Once we have pinpointed the problem, we need to troubleshoot the process(es) at the OS level to find the bottleneck.

## Possible bottlenecks

TBD

## Additional resources

- [Streaming replication](https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION) (official Postgres docs)
- [pg_stat_replication view](https://www.postgresql.org/docs/current/monitoring-stats.html#PG-STAT-REPLICATION-VIEW) (official Postgres docs)
- [Replication configuration parameters](https://www.postgresql.org/docs/current/runtime-config-replication.html) (official Postgres docs)
