Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commite732277

Browse files
committed
Merge branch 'main' into 'main'
Days 71-73See merge request postgres-ai/postgresql-consulting/postgres-howtos!27
2 parentseddd71a +c9b74ba commite732277

7 files changed

+250
-0
lines changed
Lines changed: 131 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,131 @@
1+
Originally from:[tweet](https://twitter.com/samokhvalov/status/1732486201830240283),[LinkedIn post]().
2+
3+
---
4+
5+
#How to understand what's blocking DDL
6+
7+
>I post a new PostgreSQL "howto" article every day. Join me in this
8+
>journey –[subscribe](https://twitter.com/samokhvalov/), provide feedback, share!
9+
10+
In[day 60](0060_how_to_add_a_column.md), we discussed how to add a column and that under load you need
11+
low`lock_timeout` and retries (also
12+
see:[Zero-downtime Postgres schema migrations need this: lock_timeout and retries](https://postgres.ai/blog/20210923-zero-downtime-postgres-schema-migrations-lock-timeout-and-retries)).
13+
14+
There we used`max_attempts` set to`1000` and this is probably too many, but interesting question is: how to understand
15+
what's blocking our DDL, if it doesn't succeed after many retries?
16+
17+
The first thing to do is to enable`log_lock_waits`. In this case, after`deadlock_timeout` (`1s` by default; and this
18+
setting defines when deadlock detection happens), you'll see some information about our blocked session.
19+
20+
But we don't want to start with`lock_timeout` more than`1s` – it would be too invasive. Solution is to set a lower
21+
`deadlock_timeout` right in session. Example (assuming that there is another session just read from table`t` and keeps
22+
transaction open):
23+
24+
```sql
25+
postgres=# set lock_timeout to '100ms';
26+
SET
27+
28+
postgres=# set deadlock_timeout to '50ms';
29+
SET
30+
31+
postgres=# set log_lock_waits to 'on';
32+
SET
33+
34+
postgres=# alter table t1 add column new_c int8;
35+
ERROR: canceling statement due to lock timeout
36+
```
37+
38+
In the logs we have two entries – one after~`50ms` (`deadlock_timeout`) with some useful details about the waiting on
39+
lock acquisition, and another when our statement fails after`100ms`:
40+
41+
```
42+
2023-12-06 19:12:21.823 UTC [197] LOG: process 197 still waiting for AccessExclusiveLock on relation 16384 of database 13481 after 54.022 ms
43+
2023-12-06 19:12:21.823 UTC [197] DETAIL: Process holding the lock: 211. Wait queue: 197.
44+
2023-12-06 19:12:21.823 UTC [197] STATEMENT: alter table t1 add column new_c int8;
45+
2023-12-06 19:12:21.874 UTC [197] ERROR: canceling statement due to lock timeout
46+
2023-12-06 19:12:21.874 UTC [197] STATEMENT: alter table t1 add column new_c int8;
47+
```
48+
49+
Unfortunately, the log message that was written when we achieved`deadlock_timeout` (`50ms` in this case) doesn't report
50+
details about the blocker – just`PID` (`Process holding the lock: 211`). If we're lucky, we might have details about
51+
this session around the timestamp we look at in the Postgres logs – we should just look around the timestamp when
52+
failure occurred. We might find that it's`autovacuum` running in the transaction ID wraparound prevention mode, some
53+
DDL (if we log DDL with`log_statement='ddl'`), or some long-running query that has reached`log_min_duration_statement`
54+
(worth setting it to`1s` or a lower value controlling how much we actually log to avoid the "observer effect"), and got
55+
logged. But in many cases, there is nothing else in the logs, and we need another solution.
56+
57+
In this case, we can do this:
58+
59+
1) Use a custom`application_name` value for sessions that run DDL, to more easily distinguish them
60+
in`pg_stat_activity` – simply setting it in`PGAPPNAME` or just via`SET`:
61+
62+
```sql
63+
set application_name='ddl_runner';
64+
```
65+
66+
2)In a separate session, run an observer query. For example, this anonymous`DO` block with PL/pgSQL code:
67+
68+
```sql
69+
do $$
70+
declare
71+
i int;
72+
rec record;
73+
wait_more boolean := false;
74+
begin
75+
for i in select generate_series(1, 2000) loop
76+
for rec in
77+
select
78+
pid,
79+
left(query, 50) as query_left50,
80+
*
81+
from pg_stat_activity
82+
where
83+
application_name = 'ddl_runner'
84+
and wait_event_type = 'Lock'
85+
loop
86+
raise log 'DDL session blocked. Session info: pid=%, query_left50=%, wait=%:%. Blockers: %',
87+
rec.pid,
88+
rec.query_left50,
89+
rec.wait_event_type,
90+
rec.wait_event,
91+
(
92+
select
93+
array_agg(json_build_object(
94+
'pid', pid,
95+
'state', state,
96+
'wait', (wait_event_type || ':' || wait_event),
97+
'query_l50', left(query, 50)
98+
))
99+
from pg_stat_activity
100+
where array[pid] <@ pg_blocking_pids(rec.pid)
101+
);
102+
103+
wait_more := true;
104+
end loop;
105+
106+
if wait_more then
107+
perform pg_sleep(1);
108+
109+
wait_more := false;
110+
else
111+
perform pg_sleep(0.05);
112+
end if;
113+
end loop;
114+
end $$;
115+
```
116+
117+
This observer code will report somethinglike this, if DDL session waits more than 50ms:
118+
119+
>2023-12-06T19:37:35.746363053Z2023-12-0619:37:35.746 UTC [237] LOG: DDL session blocked. Session info: pid=197, query_left50=alter table t1 add column new_c int8;, wait=Lock
120+
>
121+
> :relation. Blockers: {"{\"pid\" : 211, \"state\" : \"idlein transaction\", \"wait\" : \"Client:ClientRead\", \"query_l50\" : \"selectfrom t1limit0;\"}"}
122+
123+
This should be enough for troubleshooting of failing DDL attempts.
124+
125+
A couple of more notes:
126+
127+
- Instead of anonymous DO block, it's probably better to wrap PL/pgSQL code into a function, so the `CONTEXT` and
128+
`STATEMENT` parts of the log messages don't consume too much space.
129+
- Such a function then can be invoked by`pg_cron`, so you have a permanent observability tool. But it should be taken
130+
into account that it's a whole backend that runs this, so it might be not wise to have it constantly running –
131+
instead, it could wake up from time to time only during periods when we know some DDL can be executed.

‎0072_how_to_remove_a_foreign_key.md

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
Originally from:[tweet](https://twitter.com/samokhvalov/status/1732855546141933694),[LinkedIn post]().
2+
3+
---
4+
5+
#How to remove a foreign key
6+
7+
>I post a new PostgreSQL "howto" article every day. Join me in this
8+
>journey –[subscribe](https://twitter.com/samokhvalov/), provide feedback, share!
9+
10+
Removing a foreign key (FK) is straightforward:
11+
12+
```
13+
alter table messages
14+
drop constraint fk_messages_users;
15+
```
16+
17+
However, under heavy load, this is not a safe operation, due to the reasons discussed in
18+
[zero-downtime Postgres schema migrations need this: lock_timeout and retries](https://postgres.ai/blog/20210923-zero-downtime-postgres-schema-migrations-lock-timeout-and-retries):
19+
20+
- this operation needs a brief`AccessExclusiveLock` on*both* tables
21+
- if at least one of the locks cannot be quickly acquired, it waits, potentially blocking other sessions (including
22+
`SELECT`s), to both tables.
23+
24+
To solve this problem, we need to use a low lock_timeout and retries:
25+
26+
```sql
27+
set lock_timeout to'100ms';
28+
alter ...-- be ready to fail and try again.
29+
```
30+
31+
The same technique needs to be used in the first step of the operation of new FK creation, when we define a new FK with
32+
the`NOT VALID` option, as was discussed in[day 70, How to add a foreign key](0070_how_to_add_a_foreign_key.md).
33+
34+
See also:[day 71, How to understand what's blocking DDL](0071_how_to_understand_what_is_blocking_ddl.md).
Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
1+
Originally from:[tweet](https://twitter.com/samokhvalov/status/1733252574479040950),[LinkedIn post]().
2+
3+
---
4+
5+
#How to analyze heavyweight locks, part 3. Persistent monitoring
6+
7+
>I post a new PostgreSQL "howto" article every day. Join me in this
8+
>journey –[subscribe](https://twitter.com/samokhvalov/), provide feedback, share!
9+
10+
If a spike of active sessions (`pg_stat_activity`,`state='active'`) occurs along with a spike of exclusive locks
11+
(`pg_locks`,`mode ~* 'exclusive'`), we need to analyze the blocked and blocking sessions, to understand the root cause.
12+
13+
Ad-hoc approaches were previously discussed:
14+
15+
-[Day 22, How to analyze heavyweight locks, part 1](0022_how_to_analyze_heavyweight_locks_part_1.md)
16+
-[Day 42, How to analyze heavyweight locks, part 2: Lock trees (a.k.a. "lock queues", "wait queues", "blocking chains")](0042_how_to_analyze_heavyweight_locks_part_2.md)
17+
- Specifically for the cases, when DDL is
18+
blocked:[Day 71, How to understand what's blocking DDL](0071_how_to_understand_what_is_blocking_ddl.md)
19+
20+
However, an ad-hoc approach is not helpful if the incident is already over. A big part of this is because even with
21+
`log_lock_waits = 'on'`, Postgres logs report only the "victim" (blocked session) query text (in the`STATEMENT` part of
22+
log entries), with only`PID` information available about the blocker session.
23+
24+
To troubleshoot past incidents and understand trends, we need lock analysis to be implemented in monitoring, providing
25+
details about the blocked and blocking sessions.
26+
27+
Here is a query that[@VKukharik](https://twitter.com/VKukharik) developed for
28+
the[pgwatch2 - Postgres.ai Edition](https://hub.docker.com/r/postgresai/pgwatch2). It doesn't involve
29+
the function`pg_blocking_pids()` because it would be unsafe due to possible observer effect.
30+
31+
```sql
32+
with sa_snapshotas (
33+
select*
34+
from pg_stat_activity
35+
where
36+
datname= current_database()
37+
and pid<> pg_backend_pid()
38+
and state<>'idle'
39+
)
40+
select
41+
(extract(epochfrom now())* 1e9)::bigintas epoch_ns,
42+
waiting.pidas waiting_pid,
43+
waiting_stm.usename::textas tag_waiting_user,
44+
waiting_stm.application_name::textas tag_waiting_appname,
45+
waiting.modeas waiting_mode,
46+
waiting.locktypeas waiting_locktype,
47+
waiting.relation::regclass::textas tag_waiting_table,
48+
waiting_stm.queryas waiting_query,
49+
(extract(epochfrom (now()-waiting_stm.state_change))*1000)::bigintas waiting_ms,
50+
blocker.pidas blocker_pid,
51+
blocker_stm.usename::textas tag_blocker_user,
52+
blocker_stm.application_name::textas tag_blocker_appname,
53+
blocker.modeas blocker_mode,
54+
blocker.locktypeas blocker_locktype,
55+
blocker.relation::regclass::textas tag_blocker_table,
56+
blocker_stm.queryas blocker_query,
57+
(extract(epochfrom (now()-blocker_stm.xact_start))*1000)::bigintas blocker_tx_ms
58+
frompg_catalog.pg_locksas waiting
59+
join sa_snapshotas waiting_stmonwaiting_stm.pid=waiting.pid
60+
joinpg_catalog.pg_locksas blockeron
61+
waiting.pid<>blocker.pid
62+
andblocker.granted
63+
andwaiting.database=blocker.database
64+
and (
65+
waiting.relation=blocker.relation
66+
orwaiting.transactionid=blocker.transactionid
67+
)
68+
join sa_snapshotas blocker_stmonblocker_stm.pid=blocker.pid
69+
where notwaiting.granted;
70+
```
71+
72+
With this added to monitoring, here is what can be observed there during incidents involving lock contention spikes:
73+
74+
![](./files/0073_01.jpg)
75+
76+
And it has all the details available for both blocked and blocking sessions, supporting the troubleshooting and
77+
root-cause analysis activities:
78+
79+
![](./files/0073_02.jpg)
80+
81+
![](./files/0073_03.jpg)

‎README.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -96,6 +96,10 @@ As an example, first 2 rows:
9696
- 0067[Autovacuum "queue" and progress](./0067_autovacuum_queue_and_progress.md)
9797
- 0068[psql shortcuts](./0068_psql_shortcuts.md)
9898
- 0069[How to add a CHECK constraint without downtime](./0069_howd_tod_addd_ad_checkd_constraintd_withoutd_downtime.md)
99+
- 0070[How to add a foreign key](./0070_how_to_add_a_foreign_key.md)
100+
- 0071[How to understand what's blocking DDL](./0071_how_to_understand_what_is_blocking_ddl.md)
101+
- 0072[How to remove a foreign key](./0072_how_to_remove_a_foreign_key.md)
102+
- 0073[How to analyze heavyweight locks, part 3. Persistent monitoring](./0073_how_to_analyze_heavyweight_locks_part_3_persistent_monitoring.md)
99103
- ...
100104

101105
##Contributors

‎files/0073_01.jpg

531 KB
Loading

‎files/0073_02.jpg

291 KB
Loading

‎files/0073_03.jpg

283 KB
Loading

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp