NotificationsYou must be signed in to change notification settings
Fork13
Star145

Commite732277

committed

Merge branch 'main' into 'main'

Days 71-73See merge request postgres-ai/postgresql-consulting/postgres-howtos!27

2 parentseddd71a +c9b74ba commite732277Copy full SHA for e732277

File tree

7 files changed

+250

-0

lines changed

7 files changed

+250

-0

lines changed

`‎0071_how_to_understand_what_is_blocking_ddl.md`

Lines changed: 131 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,131 @@`
	`1`	`+Originally from:[tweet](https://twitter.com/samokhvalov/status/1732486201830240283),[LinkedIn post]().`
	`2`	`+`
	`3`	`+---`
	`4`	`+`
	`5`	`+#How to understand what's blocking DDL`
	`6`	`+`
	`7`	`+>I post a new PostgreSQL "howto" article every day. Join me in this`
	`8`	`+>journey –[subscribe](https://twitter.com/samokhvalov/), provide feedback, share!`
	`9`	`+`
	`10`	`+In[day 60](0060_how_to_add_a_column.md), we discussed how to add a column and that under load you need`
	`11`	+low`lock_timeout` and retries (also
	`12`	`+see:[Zero-downtime Postgres schema migrations need this: lock_timeout and retries](https://postgres.ai/blog/20210923-zero-downtime-postgres-schema-migrations-lock-timeout-and-retries)).`
	`13`	`+`
	`14`	+There we used`max_attempts` set to`1000` and this is probably too many, but interesting question is: how to understand
	`15`	`+what's blocking our DDL, if it doesn't succeed after many retries?`
	`16`	`+`
	`17`	+The first thing to do is to enable`log_lock_waits`. In this case, after`deadlock_timeout` (`1s` by default; and this
	`18`	`+setting defines when deadlock detection happens), you'll see some information about our blocked session.`
	`19`	`+`
	`20`	+But we don't want to start with`lock_timeout` more than`1s` – it would be too invasive. Solution is to set a lower
	`21`	+`deadlock_timeout` right in session. Example (assuming that there is another session just read from table`t` and keeps
	`22`	`+transaction open):`
	`23`	`+`
	`24`	+```sql
	`25`	`+postgres=# set lock_timeout to '100ms';`
	`26`	`+SET`
	`27`	`+`
	`28`	`+postgres=# set deadlock_timeout to '50ms';`
	`29`	`+SET`
	`30`	`+`
	`31`	`+postgres=# set log_lock_waits to 'on';`
	`32`	`+SET`
	`33`	`+`
	`34`	`+postgres=# alter table t1 add column new_c int8;`
	`35`	`+ERROR: canceling statement due to lock timeout`
	`36`	+```
	`37`	`+`
	`38`	+In the logs we have two entries – one after~`50ms` (`deadlock_timeout`) with some useful details about the waiting on
	`39`	+lock acquisition, and another when our statement fails after`100ms`:
	`40`	`+`
	`41`	+```
	`42`	`+2023-12-06 19:12:21.823 UTC [197] LOG: process 197 still waiting for AccessExclusiveLock on relation 16384 of database 13481 after 54.022 ms`
	`43`	`+2023-12-06 19:12:21.823 UTC [197] DETAIL: Process holding the lock: 211. Wait queue: 197.`
	`44`	`+2023-12-06 19:12:21.823 UTC [197] STATEMENT: alter table t1 add column new_c int8;`
	`45`	`+2023-12-06 19:12:21.874 UTC [197] ERROR: canceling statement due to lock timeout`
	`46`	`+2023-12-06 19:12:21.874 UTC [197] STATEMENT: alter table t1 add column new_c int8;`
	`47`	+```
	`48`	`+`
	`49`	+Unfortunately, the log message that was written when we achieved`deadlock_timeout` (`50ms` in this case) doesn't report
	`50`	+details about the blocker – just`PID` (`Process holding the lock: 211`). If we're lucky, we might have details about
	`51`	`+this session around the timestamp we look at in the Postgres logs – we should just look around the timestamp when`
	`52`	+failure occurred. We might find that it's`autovacuum` running in the transaction ID wraparound prevention mode, some
	`53`	+DDL (if we log DDL with`log_statement='ddl'`), or some long-running query that has reached`log_min_duration_statement`
	`54`	+(worth setting it to`1s` or a lower value controlling how much we actually log to avoid the "observer effect"), and got
	`55`	`+logged. But in many cases, there is nothing else in the logs, and we need another solution.`
	`56`	`+`
	`57`	`+In this case, we can do this:`
	`58`	`+`
	`59`	+1) Use a custom`application_name` value for sessions that run DDL, to more easily distinguish them
	`60`	+ in`pg_stat_activity` – simply setting it in`PGAPPNAME` or just via`SET`:
	`61`	`+`
	`62`	+```sql
	`63`	`+set application_name='ddl_runner';`
	`64`	+ ```
	`65`	`+`
	`66`	+2)In a separate session, run an observer query. For example, this anonymous`DO` block with PL/pgSQL code:
	`67`	`+`
	`68`	+```sql
	`69`	`+ do $$`
	`70`	`+ declare`
	`71`	`+ i int;`
	`72`	`+ rec record;`
	`73`	`+ wait_more boolean := false;`
	`74`	`+ begin`
	`75`	`+ for i in select generate_series(1, 2000) loop`
	`76`	`+ for rec in`
	`77`	`+ select`
	`78`	`+ pid,`
	`79`	`+ left(query, 50) as query_left50,`
	`80`	`+ *`
	`81`	`+ from pg_stat_activity`
	`82`	`+ where`
	`83`	`+ application_name = 'ddl_runner'`
	`84`	`+ and wait_event_type = 'Lock'`
	`85`	`+ loop`
	`86`	`+ raise log 'DDL session blocked. Session info: pid=%, query_left50=%, wait=%:%. Blockers: %',`
	`87`	`+ rec.pid,`
	`88`	`+ rec.query_left50,`
	`89`	`+ rec.wait_event_type,`
	`90`	`+ rec.wait_event,`
	`91`	`+ (`
	`92`	`+ select`
	`93`	`+ array_agg(json_build_object(`
	`94`	`+ 'pid', pid,`
	`95`	`+ 'state', state,`
	`96`	`+ 'wait', (wait_event_type \|\| ':' \|\| wait_event),`
	`97`	`+ 'query_l50', left(query, 50)`
	`98`	`+ ))`
	`99`	`+ from pg_stat_activity`
	`100`	`+ where array[pid] <@ pg_blocking_pids(rec.pid)`
	`101`	`+ );`
	`102`	`+`
	`103`	`+ wait_more := true;`
	`104`	`+ end loop;`
	`105`	`+`
	`106`	`+ if wait_more then`
	`107`	`+ perform pg_sleep(1);`
	`108`	`+`
	`109`	`+ wait_more := false;`
	`110`	`+ else`
	`111`	`+ perform pg_sleep(0.05);`
	`112`	`+ end if;`
	`113`	`+ end loop;`
	`114`	`+ end $$;`
	`115`	+```
	`116`	`+`
	`117`	`+ This observer code will report somethinglike this, if DDL session waits more than 50ms:`
	`118`	`+`
	`119`	`+>2023-12-06T19:37:35.746363053Z2023-12-0619:37:35.746 UTC [237] LOG: DDL session blocked. Session info: pid=197, query_left50=alter table t1 add column new_c int8;, wait=Lock`
	`120`	`+>`
	`121`	`+> :relation. Blockers: {"{\"pid\" : 211, \"state\" : \"idlein transaction\", \"wait\" : \"Client:ClientRead\", \"query_l50\" : \"selectfrom t1limit0;\"}"}`
	`122`	`+`
	`123`	`+ This should be enough for troubleshooting of failing DDL attempts.`
	`124`	`+`
	`125`	`+A couple of more notes:`
	`126`	`+`
	`127`	+- Instead of anonymous DO block, it's probably better to wrap PL/pgSQL code into a function, so the `CONTEXT` and
	`128`	+ `STATEMENT` parts of the log messages don't consume too much space.
	`129`	+- Such a function then can be invoked by`pg_cron`, so you have a permanent observability tool. But it should be taken
	`130`	`+ into account that it's a whole backend that runs this, so it might be not wise to have it constantly running –`
	`131`	`+ instead, it could wake up from time to time only during periods when we know some DDL can be executed.`

`‎0072_how_to_remove_a_foreign_key.md`

Lines changed: 34 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,34 @@`
	`1`	`+Originally from:[tweet](https://twitter.com/samokhvalov/status/1732855546141933694),[LinkedIn post]().`
	`2`	`+`
	`3`	`+---`
	`4`	`+`
	`5`	`+#How to remove a foreign key`
	`6`	`+`
	`7`	`+>I post a new PostgreSQL "howto" article every day. Join me in this`
	`8`	`+>journey –[subscribe](https://twitter.com/samokhvalov/), provide feedback, share!`
	`9`	`+`
	`10`	`+Removing a foreign key (FK) is straightforward:`
	`11`	`+`
	`12`	+```
	`13`	`+alter table messages`
	`14`	`+drop constraint fk_messages_users;`
	`15`	+```
	`16`	`+`
	`17`	`+However, under heavy load, this is not a safe operation, due to the reasons discussed in`
	`18`	`+[zero-downtime Postgres schema migrations need this: lock_timeout and retries](https://postgres.ai/blog/20210923-zero-downtime-postgres-schema-migrations-lock-timeout-and-retries):`
	`19`	`+`
	`20`	+- this operation needs a brief`AccessExclusiveLock` onboth tables
	`21`	`+- if at least one of the locks cannot be quickly acquired, it waits, potentially blocking other sessions (including`
	`22`	+`SELECT`s), to both tables.
	`23`	`+`
	`24`	`+To solve this problem, we need to use a low lock_timeout and retries:`
	`25`	`+`
	`26`	+```sql
	`27`	`+set lock_timeout to'100ms';`
	`28`	`+alter ...-- be ready to fail and try again.`
	`29`	+```
	`30`	`+`
	`31`	`+The same technique needs to be used in the first step of the operation of new FK creation, when we define a new FK with`
	`32`	+the`NOT VALID` option, as was discussed in[day 70, How to add a foreign key](0070_how_to_add_a_foreign_key.md).
	`33`	`+`
	`34`	`+See also:[day 71, How to understand what's blocking DDL](0071_how_to_understand_what_is_blocking_ddl.md).`

`‎0073_how_to_analyze_heavyweight_locks_part_3_persistent_monitoring.md`

Lines changed: 81 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,81 @@`
	`1`	`+Originally from:[tweet](https://twitter.com/samokhvalov/status/1733252574479040950),[LinkedIn post]().`
	`2`	`+`
	`3`	`+---`
	`4`	`+`
	`5`	`+#How to analyze heavyweight locks, part 3. Persistent monitoring`
	`6`	`+`
	`7`	`+>I post a new PostgreSQL "howto" article every day. Join me in this`
	`8`	`+>journey –[subscribe](https://twitter.com/samokhvalov/), provide feedback, share!`
	`9`	`+`
	`10`	+If a spike of active sessions (`pg_stat_activity`,`state='active'`) occurs along with a spike of exclusive locks
	`11`	+(`pg_locks`,`mode ~* 'exclusive'`), we need to analyze the blocked and blocking sessions, to understand the root cause.
	`12`	`+`
	`13`	`+Ad-hoc approaches were previously discussed:`
	`14`	`+`
	`15`	`+-[Day 22, How to analyze heavyweight locks, part 1](0022_how_to_analyze_heavyweight_locks_part_1.md)`
	`16`	`+-[Day 42, How to analyze heavyweight locks, part 2: Lock trees (a.k.a. "lock queues", "wait queues", "blocking chains")](0042_how_to_analyze_heavyweight_locks_part_2.md)`
	`17`	`+- Specifically for the cases, when DDL is`
	`18`	`+ blocked:[Day 71, How to understand what's blocking DDL](0071_how_to_understand_what_is_blocking_ddl.md)`
	`19`	`+`
	`20`	`+However, an ad-hoc approach is not helpful if the incident is already over. A big part of this is because even with`
	`21`	+`log_lock_waits = 'on'`, Postgres logs report only the "victim" (blocked session) query text (in the`STATEMENT` part of
	`22`	+log entries), with only`PID` information available about the blocker session.
	`23`	`+`
	`24`	`+To troubleshoot past incidents and understand trends, we need lock analysis to be implemented in monitoring, providing`
	`25`	`+details about the blocked and blocking sessions.`
	`26`	`+`
	`27`	`+Here is a query that[@VKukharik](https://twitter.com/VKukharik) developed for`
	`28`	`+the[pgwatch2 - Postgres.ai Edition](https://hub.docker.com/r/postgresai/pgwatch2). It doesn't involve`
	`29`	+the function`pg_blocking_pids()` because it would be unsafe due to possible observer effect.
	`30`	`+`
	`31`	+```sql
	`32`	`+with sa_snapshotas (`
	`33`	`+select*`
	`34`	`+from pg_stat_activity`
	`35`	`+where`
	`36`	`+ datname= current_database()`
	`37`	`+and pid<> pg_backend_pid()`
	`38`	`+and state<>'idle'`
	`39`	`+)`
	`40`	`+select`
	`41`	`+ (extract(epochfrom now())* 1e9)::bigintas epoch_ns,`
	`42`	`+waiting.pidas waiting_pid,`
	`43`	`+waiting_stm.usename::textas tag_waiting_user,`
	`44`	`+waiting_stm.application_name::textas tag_waiting_appname,`
	`45`	`+waiting.modeas waiting_mode,`
	`46`	`+waiting.locktypeas waiting_locktype,`
	`47`	`+waiting.relation::regclass::textas tag_waiting_table,`
	`48`	`+waiting_stm.queryas waiting_query,`
	`49`	`+ (extract(epochfrom (now()-waiting_stm.state_change))*1000)::bigintas waiting_ms,`
	`50`	`+blocker.pidas blocker_pid,`
	`51`	`+blocker_stm.usename::textas tag_blocker_user,`
	`52`	`+blocker_stm.application_name::textas tag_blocker_appname,`
	`53`	`+blocker.modeas blocker_mode,`
	`54`	`+blocker.locktypeas blocker_locktype,`
	`55`	`+blocker.relation::regclass::textas tag_blocker_table,`
	`56`	`+blocker_stm.queryas blocker_query,`
	`57`	`+ (extract(epochfrom (now()-blocker_stm.xact_start))*1000)::bigintas blocker_tx_ms`
	`58`	`+frompg_catalog.pg_locksas waiting`
	`59`	`+join sa_snapshotas waiting_stmonwaiting_stm.pid=waiting.pid`
	`60`	`+joinpg_catalog.pg_locksas blockeron`
	`61`	`+waiting.pid<>blocker.pid`
	`62`	`+andblocker.granted`
	`63`	`+andwaiting.database=blocker.database`
	`64`	`+and (`
	`65`	`+waiting.relation=blocker.relation`
	`66`	`+orwaiting.transactionid=blocker.transactionid`
	`67`	`+ )`
	`68`	`+join sa_snapshotas blocker_stmonblocker_stm.pid=blocker.pid`
	`69`	`+where notwaiting.granted;`
	`70`	+```
	`71`	`+`
	`72`	`+With this added to monitoring, here is what can be observed there during incidents involving lock contention spikes:`
	`73`	`+`
	`74`	`+![](./files/0073_01.jpg)`
	`75`	`+`
	`76`	`+And it has all the details available for both blocked and blocking sessions, supporting the troubleshooting and`
	`77`	`+root-cause analysis activities:`
	`78`	`+`
	`79`	`+![](./files/0073_02.jpg)`
	`80`	`+`
	`81`	`+![](./files/0073_03.jpg)`

`‎README.md`

Lines changed: 4 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -96,6 +96,10 @@ As an example, first 2 rows:`
`96`	`96`	`- 0067[Autovacuum "queue" and progress](./0067_autovacuum_queue_and_progress.md)`
`97`	`97`	`- 0068[psql shortcuts](./0068_psql_shortcuts.md)`
`98`	`98`	`- 0069[How to add a CHECK constraint without downtime](./0069_howd_tod_addd_ad_checkd_constraintd_withoutd_downtime.md)`
	`99`	`+- 0070[How to add a foreign key](./0070_how_to_add_a_foreign_key.md)`
	`100`	`+- 0071[How to understand what's blocking DDL](./0071_how_to_understand_what_is_blocking_ddl.md)`
	`101`	`+- 0072[How to remove a foreign key](./0072_how_to_remove_a_foreign_key.md)`
	`102`	`+- 0073[How to analyze heavyweight locks, part 3. Persistent monitoring](./0073_how_to_analyze_heavyweight_locks_part_3_persistent_monitoring.md)`
`99`	`103`	`- ...`
`100`	`104`
`101`	`105`	`##Contributors`

`‎files/0073_01.jpg`

531 KB

`‎files/0073_02.jpg`

291 KB

`‎files/0073_03.jpg`

283 KB

0 commit comments

Comments

(0)

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Commite732277

File tree

7 files changed

7 files changed

`‎0071_how_to_understand_what_is_blocking_ddl.md`

`‎0072_how_to_remove_a_foreign_key.md`

`‎0073_how_to_analyze_heavyweight_locks_part_3_persistent_monitoring.md`

`‎README.md`

`‎files/0073_01.jpg`

`‎files/0073_02.jpg`

`‎files/0073_03.jpg`

0 commit comments