
Commit 347d412

Days 77-79

1 parent ee2863e commit 347d412

File tree

4 files changed: +234 -0 lines changed


‎0077_zero_downtime_major_upgrade.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
Originally from: [tweet](https://twitter.com/samokhvalov/status/), [LinkedIn post]().

---

# Postgres major upgrade without any downtime for a very large cluster running under heavy load

> I post a new PostgreSQL "howto" article every day. Join me in this
> journey – [subscribe](https://twitter.com/samokhvalov/), provide feedback, share!

Right now on the HN frontpage there is an article by knock.app
on [zero downtime Postgres upgrades](https://news.ycombinator.com/item?id=38616181).

Later this week, [Alexander Sosna](https://twitter.com/xxorde) (GitLab) is
presenting [a talk](https://postgresql.eu/events/pgconfeu2023/schedule/session/4791-how-we-execute-postgresql-major-upgrades-at-gitlab-with-zero-downtime/)
explaining how GitLab's large clusters were upgraded under heavy load without any downtime – I highly recommend looking
at that work.

There are many details behind a proper zero-downtime upgrade and many challenges to solve; here I present only a
high-level plan. It works well for very large (dozens of TiB) clusters with many replicas, working under high TPS
(many tens of thousands). The process described here involves logical replication (with the "physical-to-logical" trick)
and PgBouncer's PAUSE/RESUME (assuming that PgBouncer is used).

The detailed material will require several separate howtos to be written.

The process consists of 2 steps:

1) **UPGRADE:** a new cluster is created, running on the new Postgres major version.

2) **SWITCHOVER:** step-by-step switchover of the traffic.

## Step 1: UPGRADE

There is a "standard" approach, when a new cluster is first created from scratch (`initdb`) and then a brand new logical
replica is created based on it (`copy_data = false` when creating the logical subscription). However, this way of logical
replica provisioning takes a lot of time and can be very problematic to execute under heavy load.

An alternative is to take a physical replica and convert it to a logical one. This is relatively easy to do knowing the
logical slot's LSN position and using `recovery_target_lsn`. This recipe is called "physical2logical" conversion, and it
allows for the creation of a logical replica based on a physical replica (e.g., created from a cloud snapshot in
minutes), reliably and quickly.

However, combining the physical2logical conversion with `pg_upgrade` should be done with care.
Details can be found [here](https://postgresql.org/message-id/flat/20230217075433.u5mjly4d5cr4hcfe%40jrouhaud).

Steps:

1. Create a new cluster, which will be a secondary cluster using cascaded physical replication (Patroni supports it).

2. Ensure that the new cluster's leader is not significantly lagging behind the old cluster's primary.

3. Stop the new cluster's nodes in proper order (making sure stopped replicas are fully in sync with their leader).

4. On the old cluster's primary, create a logical slot and a publication `FOR ALL TABLES` (this doesn't require
table-level locks, thus we don't need a low `lock_timeout` and retries) and remember the slot's position (see the SQL
sketch after this list).

5. Reconfigure the new cluster's leader, setting `recovery_target_lsn` to the remembered slot's LSN, and disabling
`restore_command`.

6. Start the new cluster's leader and replicas, and let them reach the desired LSN. Then the leader can be promoted.

7. Stop the new cluster's nodes again, in proper order.

8. Run `pg_upgrade --link` on the new cluster's leader.

9. Use `rsync --hard-links --size-only` on the new cluster's replicas – this is a disputable step (see
details [here](https://postgresql.org/message-id/flat/CAM527d8heqkjG5VrvjU3Xjsqxg41ufUyabD9QZccdAxnpbRH-Q%40mail.gmail.com)),
but this is what most people use for in-place (w/o logical replication) upgrades with `pg_upgrade --link`, and no other
fast alternative has been invented yet.

10. Configure the new cluster's leader (now already a primary) to use logical replication – create a subscription
with `copy_data = false` and let it catch up with the working old cluster.

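Here is a minimal SQL sketch illustrating steps 4, 5, and 10 (all object names, the connection string, and the LSN value
are illustrative, not prescriptive):

```sql
-- Step 4, on the OLD primary: publication + logical slot; remember the returned LSN
create publication upgrade_pub for all tables;

select lsn
from pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');
-- say it returned 'AB/CD012345'

-- Step 5, on the NEW (stopped) cluster's leader, in postgresql.conf
-- (or the Patroni configuration):
--   recovery_target_lsn = 'AB/CD012345'
--   restore_command = ''

-- Step 10, on the NEW primary, after pg_upgrade and promotion:
-- the slot already exists, hence create_slot = false and an explicit slot_name
create subscription upgrade_sub
  connection 'host=old-primary dbname=mydb user=postgres'
  publication upgrade_pub
  with (copy_data = false, create_slot = false, slot_name = 'upgrade_slot');
```
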
During all these steps, the old cluster is up and running, and the new cluster is invisible to users. This gives you a
huge benefit: the ability to test the whole process right in production (after proper testing in lower environments).

## Step 2: SWITCHOVER

First, it makes sense to switch over the read-only (RO) traffic. If the application allows, it makes sense to first
redirect only part of the RO traffic to the new replicas. This requires advanced replication lag detection in the
load-balancing code
(see:
[Day 17: How to determine the replication lag - Hybrid case: logical & physical](./0017_how_to_determine_the_replication_lag.md#hybrid-case-logical-physical)).

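For example, the logical part of that lag – how far the new cluster is behind the old primary – can be checked on the
old primary via the replication slot (a minimal sketch; the slot name `upgrade_slot` is an assumption from the sketch
above):

```sql
-- on the old primary: how many bytes of WAL the new cluster still has to consume
select
  slot_name,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as logical_lag
from pg_replication_slots
where slot_name = 'upgrade_slot';
```
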
When it is time to redirect the RW traffic, to achieve zero downtime, one can use PgBouncer's PAUSE/RESUME. If there are
multiple PgBouncer nodes (running on separate hosts/ports, or involving `SO_REUSEPORT`), it is important to implement a
graceful approach to PAUSE acquisition. Once all PAUSEs are acquired, the traffic can be redirected, and it is time to
issue RESUME. Before that, it is important to make sure that all writes are fully propagated to the new primary.

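The PAUSE/RESUME commands themselves are issued against each PgBouncer node's admin console (a minimal sketch; the
connection details are illustrative, and the graceful retry/timeout logic around PAUSE is omitted):

```sql
-- connect to the admin console on each PgBouncer node, e.g.:
--   psql -h <pgbouncer-host> -p 6432 -U pgbouncer -d pgbouncer

-- 1) on every PgBouncer node:
PAUSE;

-- 2) once all nodes are paused and all writes are confirmed to be replayed
--    on the new primary, repoint the configuration to the new primary;

-- 3) then, on every PgBouncer node:
RESUME;
```
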
It is important to have measures in place to protect the old cluster from writes coming from the application or users –
e.g., adjusting `pg_hba.conf` or shutting the cluster down. However, for advanced rollback capabilities, it makes sense
to implement "reverse" logical replication, in the same manner as the "forward" one, setting it up during the
switchover. In this case, writes to the old cluster are allowed – but only from logical replication. The reverse
replication allows for a rollback even some time after the whole process is finished; this makes the whole process fully
reversible from any point.

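A minimal sketch of setting up such reverse replication during the switchover (object names and the connection string
are illustrative):

```sql
-- on the NEW primary (now receiving the application writes):
create publication reverse_pub for all tables;

-- on the OLD primary (protected from all writes except logical replication):
create subscription reverse_sub
  connection 'host=new-primary dbname=mydb user=postgres'
  publication reverse_pub
  with (copy_data = false);
```
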
As already mentioned, there are many aspects to address, and this is only a high-level plan. If you can attend the talk,
do it [here](https://postgresql.eu/events/pgconfeu2023/schedule/session/4791-how-we-execute-postgresql-major-upgrades-at-gitlab-with-zero-downtime/).

‎0078_estimate_yoy_table_growth.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
Originally from: [tweet](https://twitter.com/samokhvalov/status/1735568654996283501), [LinkedIn post]().

---

# How to estimate the YoY growth of a very large table using row creation timestamps and the planner statistics

> I post a new PostgreSQL "howto" article every day. Join me in this
> journey – [subscribe](https://twitter.com/samokhvalov/), provide feedback, share!

Assume we have a 10 TiB table created many years ago, partitioned or not, like this one:

```sql
create table t (
  id int8 primary key, -- of course, not int4
  created_at timestamptz default now(),
  ...
);
```

and we need to quickly understand how the table grew year over year, assuming that no rows were deleted (or only a
negligible amount). So, we just need to count rows for every year.

A straightforward approach would be:

```sql
select
  date_trunc('year', created_at) as year,
  count(*)
from t
group by year
order by year;
```

However, for a 10 TiB table, we'd need to wait many hours, if not days, for this analysis to complete.

Here is a fast, but not precise, way to get the row counts for each year (assuming the table has up-to-date stats; if in
doubt, run `ANALYZE` on it first):

```sql
do $$
declare
  table_fqn text := 'public.t';
  year_start int := 2000;
  year_end int := extract(year from now())::int;
  year int;
  explain_json json;
begin
  for year in year_start..year_end loop
    execute format(
      $e$
        explain (format json) select *
        from %s
        where created_at
        between '%s-01-01' and '%s-12-31'
      $e$,
      table_fqn,
      year,
      year
    ) into explain_json;

    raise info 'Year: %, Estimated rows: %',
      year,
      explain_json->0->'Plan'->>'Plan Rows';
  end loop;
end $$;
```

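Before trusting the estimates, it can be worth confirming that the planner statistics are reasonably fresh (a minimal
sketch, using the example table `public.t` from above):

```sql
-- when the table's statistics were last refreshed, manually or by autovacuum
select last_analyze, last_autoanalyze
from pg_stat_user_tables
where relid = 'public.t'::regclass;
```
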
0079_rebuild_indexes_without_deadlocks.md

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
Originally from: [tweet](https://twitter.com/samokhvalov/status/1735204785232715997), [LinkedIn post]().

---

# How to rebuild many indexes using many backends avoiding deadlocks

> I post a new PostgreSQL "howto" article every day. Join me in this
> journey – [subscribe](https://twitter.com/samokhvalov/), provide feedback, share!

Sometimes we need to reindex many indexes – or all of them – and want to do it faster.

For example, this makes sense after upgrading from pre-14 Postgres to version 14+, when we want to rebuild all B-tree
indexes to benefit from optimizations that reduce bloat growth rates.

We might decide to use a single session and a higher value of
[max_parallel_maintenance_workers](https://postgresqlco.nf/doc/en/param/max_parallel_maintenance_workers/), processing
one index at a time. But if we have powerful resources (a lot of vCPUs and fast disks), then the maximum value of
`max_parallel_maintenance_workers` may not be enough to move as fast as we can (changing
`max_parallel_maintenance_workers` doesn't require a restart, but we cannot use more than `max_worker_processes`
workers, and changing that requires a restart). In this case, it may make sense to process multiple indexes in parallel,
using `REINDEX INDEX CONCURRENTLY`.

But in this case, indexes need to be processed in proper order. The problem is that if you attempt to rebuild two
indexes belonging to the same table in parallel, a deadlock will be detected, and one of the sessions will fail:

```sql
nik=# reindex index concurrently t1_hash_record_idx3;
ERROR:  deadlock detected
DETAIL:  Process 40 waits for ShareLock on virtual transaction 4/2506; blocked by process 1313.
Process 1313 waits for ShareUpdateExclusiveLock on relation 16634 of database 16401; blocked by process 40.
HINT:  See server log for query details.
```

To address this, we can use this approach:

1. Decide how many reindexing sessions you want to use, taking into account `max_parallel_maintenance_workers` and the
planned resource utilization / saturation risks (CPU and disk IO).

2. Assuming we want to use N reindexing sessions, build the full list of indexes, with the table names they belong to,
and "assign" each table to a particular reindexing session. See the query below that does it.

3. Using this "assignment", divide the whole list of indexes into N separate lists, so all the indexes for a particular
table are present only in a single list – and now we can just run N sessions using these N lists (see the sketch after
the query below).

For step 2, here is a query that can help:

```sql
\set NUMBER_OF_SESSIONS 10

select
  format('%I.%I', n.nspname, c.relname) as table_fqn,
  format('%I.%I', n.nspname, i.relname) as index_fqn,
  mod(
    hashtext(format('%I.%I', n.nspname, c.relname)) & 2147483647,
    :NUMBER_OF_SESSIONS
  ) as session_id
from
  pg_index idx
  join pg_class c on idx.indrelid = c.oid
  join pg_class i on idx.indexrelid = i.oid
  join pg_namespace n on c.relnamespace = n.oid
where
  n.nspname not in ('pg_catalog', 'pg_toast', 'information_schema')
  -- and ... additional filters if needed
order by
  table_fqn, index_fqn;
```

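And here is one possible way to drive step 3 from psql itself (a minimal sketch; the `SESSION_ID` variable and the
`\gexec` approach are illustrative additions, not part of the original recipe) – start N psql sessions, each with its
own `SESSION_ID` value, and let each one rebuild only its share of the indexes:

```sql
\set NUMBER_OF_SESSIONS 10
\set SESSION_ID 0

-- SESSION_ID must be 0, 1, ..., N-1; use a different value in each psql session.
-- The query generates the REINDEX commands for this session's tables only,
-- and \gexec executes each generated command as a separate statement.
select format('reindex index concurrently %I.%I', n.nspname, i.relname)
from
  pg_index idx
  join pg_class c on idx.indrelid = c.oid
  join pg_class i on idx.indexrelid = i.oid
  join pg_namespace n on c.relnamespace = n.oid
where
  n.nspname not in ('pg_catalog', 'pg_toast', 'information_schema')
  and mod(
    hashtext(format('%I.%I', n.nspname, c.relname)) & 2147483647,
    :NUMBER_OF_SESSIONS
  ) = :SESSION_ID
order by 1 \gexec
```
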

‎README.md

Lines changed: 3 additions & 0 deletions
@@ -103,6 +103,9 @@ As an example, first 2 rows:
- 0074 [How to flush caches (OS page cache and Postgres buffer pool)](./0074_how_to_flush_caches.md)
- 0075 [How to find redundant indexes](./0076_how_to_find_redundent_indexes)
- 0076 [How to find unused indexes](./0076_how_to_find_redundent_indexes.md)
- 0077 [Postgres major upgrade without any downtime for a very large cluster running under heavy load](./0077_zero_downtime_major_upgrade.md)
- 0078 [How to estimate the YoY growth of a very large table using row creation timestamps and the planner statistics](./0078_estimate_yoy_table_growth.md)
- 0079 [How to rebuild many indexes using many backends avoiding deadlocks](./0079_rebuild_indexes_without_deadlocks.md)
- ...

## Contributors
