
Commit 347d412

Days 77-79

1 parent ee2863e commit 347d412

File tree

4 files changed: +234 -0 lines changed


‎0077_zero_downtime_major_upgrade.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
Originally from: [tweet](https://twitter.com/samokhvalov/status/), [LinkedIn post]().

---

# Postgres major upgrade without any downtime for a very large cluster running under heavy load

> I post a new PostgreSQL "howto" article every day. Join me in this
> journey – [subscribe](https://twitter.com/samokhvalov/), provide feedback, share!

Right now on the HN frontpage there is an article by knock.app
on [zero downtime Postgres upgrades](https://news.ycombinator.com/item?id=38616181).

Later this week, [Alexander Sosna](https://twitter.com/xxorde) (GitLab) is
presenting [a talk](https://postgresql.eu/events/pgconfeu2023/schedule/session/4791-how-we-execute-postgresql-major-upgrades-at-gitlab-with-zero-downtime/)
explaining how GitLab's large clusters were upgraded under heavy load without any downtime – I highly recommend looking
at that work.

There are many details behind a proper zero-downtime upgrade and many challenges to solve; here I present only a
high-level plan. It works well for very large (dozens of TiB) clusters with many replicas, working under high TPS
(many tens of thousands). The process described here involves logical replication (with the "physical-to-logical" trick)
and PgBouncer's PAUSE/RESUME (assuming that PgBouncer is used).

The detailed material will require several separate howtos to be written.

The process consists of 2 steps:

1) **UPGRADE:** a new cluster is created, running on the new Postgres major version.

2) **SWITCHOVER:** step-by-step switchover of the traffic.

## Step 1: UPGRADE

There is a "standard" approach, when a new cluster is first created from scratch (`initdb`) and then a brand new logical
replica is created based on it (`copy_data = false` when creating the logical subscription). However, this way of logical
replica provisioning takes a lot of time and can be very problematic to execute under heavy load.

An alternative is to take a physical replica and convert it to a logical one. This is relatively easy to do knowing the
logical slot's LSN position and using `recovery_target_lsn`. This recipe is called "physical2logical" conversion, and it
allows for the creation of a logical replica based on a physical replica (e.g., created from a cloud snapshot in
minutes), reliably and quickly.

However, combining the physical2logical conversion with `pg_upgrade` should be done with care.
Details can be found [here](https://postgresql.org/message-id/flat/20230217075433.u5mjly4d5cr4hcfe%40jrouhaud).

Steps:

1. Create a new cluster, which will be a secondary cluster using cascaded physical replication (Patroni supports it).

2. Ensure that the new cluster's leader is not significantly lagging behind the old cluster's primary.

3. Stop the new cluster's nodes in proper order (making sure stopped replicas are fully in sync with their leader).

4. On the old cluster's primary, create a logical slot and a publication `FOR ALL TABLES` (this doesn't require
table-level locks, thus we don't need a low `lock_timeout` and retries) and remember the slot's position (see the SQL
sketch after this list).

5. Reconfigure the new cluster's leader, setting `recovery_target_lsn` to the remembered slot's LSN, and disabling
`restore_command`.

6. Start the new cluster's leader and replicas, and let them reach the desired LSN. Then the leader can be promoted.

7. Stop the new cluster's nodes again, in proper order.

8. Run `pg_upgrade --link` on the new cluster's leader.

9. Use `rsync --hard-links --size-only` on the new cluster's replicas – this is a disputable step (see
details [here](https://postgresql.org/message-id/flat/CAM527d8heqkjG5VrvjU3Xjsqxg41ufUyabD9QZccdAxnpbRH-Q%40mail.gmail.com)),
but this is what most people use for in-place (w/o logical replication) upgrades with `pg_upgrade --link`, and no other
fast alternative has been invented yet.

10. Configure the new cluster's leader (now already a primary) to use logical replication – create a subscription
with `copy_data = false` and let it catch up with the working old cluster.

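Here is a minimal SQL sketch illustrating steps 4, 5, and 10 (all object names, the connection string, and the LSN value
are illustrative, not prescriptive):

```sql
-- Step 4, on the OLD primary: publication + logical slot; remember the returned LSN
create publication upgrade_pub for all tables;

select lsn
from pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');
-- say it returned 'AB/CD012345'

-- Step 5, on the NEW (stopped) cluster's leader, in postgresql.conf
-- (or the Patroni configuration):
--   recovery_target_lsn = 'AB/CD012345'
--   restore_command = ''

-- Step 10, on the NEW primary, after pg_upgrade and promotion:
-- the slot already exists, hence create_slot = false and an explicit slot_name
create subscription upgrade_sub
  connection 'host=old-primary dbname=mydb user=postgres'
  publication upgrade_pub
  with (copy_data = false, create_slot = false, slot_name = 'upgrade_slot');
```
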
During all these steps, the old cluster is up and running, and the new cluster is invisible to users. This gives you a
huge benefit: the ability to test the whole process right in production (after proper testing in lower environments).

## Step 2: SWITCHOVER

First, it makes sense to switch over the read-only (RO) traffic. If the application allows, it makes sense to first
redirect only part of the RO traffic to the new replicas. This requires advanced replication lag detection in the
load-balancing code
(see:
[Day 17: How to determine the replication lag - Hybrid case: logical & physical](./0017_how_to_determine_the_replication_lag.md#hybrid-case-logical-physical)).

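For example, the logical part of that lag – how far the new cluster is behind the old primary – can be checked on the
old primary via the replication slot (a minimal sketch; the slot name `upgrade_slot` is an assumption from the sketch
above):

```sql
-- on the old primary: how many bytes of WAL the new cluster still has to consume
select
  slot_name,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as logical_lag
from pg_replication_slots
where slot_name = 'upgrade_slot';
```
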
When it is time to redirect the RW traffic, to achieve zero downtime, one can use PgBouncer's PAUSE/RESUME. If there are
multiple PgBouncer nodes (running on separate hosts/ports, or involving `SO_REUSEPORT`), it is important to implement a
graceful approach to PAUSE acquisition. Once all PAUSEs are acquired, the traffic can be redirected, and it is time to
issue RESUME. Before that, it is important to make sure that all writes are fully propagated to the new primary.

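The PAUSE/RESUME commands themselves are issued against each PgBouncer node's admin console (a minimal sketch; the
connection details are illustrative, and the graceful retry/timeout logic around PAUSE is omitted):

```sql
-- connect to the admin console on each PgBouncer node, e.g.:
--   psql -h <pgbouncer-host> -p 6432 -U pgbouncer -d pgbouncer

-- 1) on every PgBouncer node:
PAUSE;

-- 2) once all nodes are paused and all writes are confirmed to be replayed
--    on the new primary, repoint the configuration to the new primary;

-- 3) then, on every PgBouncer node:
RESUME;
```
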
It is important to have measures in place to protect the old cluster from writes coming from the application or users –
e.g., adjusting `pg_hba.conf` or shutting the cluster down. However, for advanced rollback capabilities, it makes sense
to implement "reverse" logical replication, in the same manner as the "forward" one, setting it up during the
switchover. In this case, writes to the old cluster are allowed – but only from logical replication. The reverse
replication allows for a rollback even some time after the whole process is finished; this makes the whole process fully
reversible from any point.

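A minimal sketch of setting up such reverse replication during the switchover (object names and the connection string
are illustrative):

```sql
-- on the NEW primary (now receiving the application writes):
create publication reverse_pub for all tables;

-- on the OLD primary (protected from all writes except logical replication):
create subscription reverse_sub
  connection 'host=new-primary dbname=mydb user=postgres'
  publication reverse_pub
  with (copy_data = false);
```
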
As already mentioned, there are many aspects to address, and this is only a high-level plan. If you can attend the talk,
do it [here](https://postgresql.eu/events/pgconfeu2023/schedule/session/4791-how-we-execute-postgresql-major-upgrades-at-gitlab-with-zero-downtime/).

‎0078_estimate_yoy_table_growth.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
Originally from: [tweet](https://twitter.com/samokhvalov/status/1735568654996283501), [LinkedIn post]().

---

# How to estimate the YoY growth of a very large table using row creation timestamps and the planner statistics

> I post a new PostgreSQL "howto" article every day. Join me in this
> journey – [subscribe](https://twitter.com/samokhvalov/), provide feedback, share!

Assume we have a 10 TiB table created many years ago, partitioned or not, like this one:

```sql
create table t (
  id int8 primary key, -- of course, not int4
  created_at timestamptz default now(),
  ...
);
```

and we need to quickly understand how the table grew year over year, assuming that no rows were deleted (or only a
negligible amount). So, we just need to count rows for every year.

A straightforward approach would be:

```sql
select
  date_trunc('year', created_at) as year,
  count(*)
from t
group by year
order by year;
```

However, for a 10 TiB table, we'd need to wait many hours, if not days, for this analysis to complete.

Here is a fast, but not precise, way to get the row counts for each year (assuming the table has up-to-date stats; if in
doubt, run `ANALYZE` on it first):

```sql
do $$
declare
  table_fqn text := 'public.t';
  year_start int := 2000;
  year_end int := extract(year from now())::int;
  year int;
  explain_json json;
begin
  for year in year_start..year_end loop
    execute format(
      $e$
        explain (format json) select *
        from %s
        where created_at
        between '%s-01-01' and '%s-12-31'
      $e$,
      table_fqn,
      year,
      year
    ) into explain_json;

    raise info 'Year: %, Estimated rows: %',
      year,
      explain_json->0->'Plan'->>'Plan Rows';
  end loop;
end $$;
```

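Before trusting the estimates, it can be worth confirming that the planner statistics are reasonably fresh (a minimal
sketch, using the example table `public.t` from above):

```sql
-- when the table's statistics were last refreshed, manually or by autovacuum
select last_analyze, last_autoanalyze
from pg_stat_user_tables
where relid = 'public.t'::regclass;
```
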
0079_rebuild_indexes_without_deadlocks.md

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
Originally from: [tweet](https://twitter.com/samokhvalov/status/1735204785232715997), [LinkedIn post]().

---

# How to rebuild many indexes using many backends avoiding deadlocks

> I post a new PostgreSQL "howto" article every day. Join me in this
> journey – [subscribe](https://twitter.com/samokhvalov/), provide feedback, share!

Sometimes we need to reindex many indexes – or all of them – and want to do it faster.

For example, this makes sense after upgrading from pre-14 Postgres to version 14+, when we want to rebuild all B-tree
indexes to benefit from optimizations that reduce bloat growth rates.

We might decide to use a single session and a higher value of
[max_parallel_maintenance_workers](https://postgresqlco.nf/doc/en/param/max_parallel_maintenance_workers/), processing
one index at a time. But if we have powerful resources (a lot of vCPUs and fast disks), then the maximum value of
`max_parallel_maintenance_workers` may not be enough to move as fast as we can (changing
`max_parallel_maintenance_workers` doesn't require a restart, but we cannot use more than `max_worker_processes`
workers, and changing that requires a restart). In this case, it may make sense to process multiple indexes in parallel,
using `REINDEX INDEX CONCURRENTLY`.

But in this case, indexes need to be processed in proper order. The problem is that if you attempt to rebuild two
indexes belonging to the same table in parallel, a deadlock will be detected, and one of the sessions will fail:

```sql
nik=# reindex index concurrently t1_hash_record_idx3;
ERROR:  deadlock detected
DETAIL:  Process 40 waits for ShareLock on virtual transaction 4/2506; blocked by process 1313.
Process 1313 waits for ShareUpdateExclusiveLock on relation 16634 of database 16401; blocked by process 40.
HINT:  See server log for query details.
```

To address this, we can use this approach:

1. Decide how many reindexing sessions you want to use, taking into account `max_parallel_maintenance_workers` and the
planned resource utilization / saturation risks (CPU and disk IO).

2. Assuming we want to use N reindexing sessions, build the full list of indexes, with the table names they belong to,
and "assign" each table to a particular reindexing session. See the query below that does it.

3. Using this "assignment", divide the whole list of indexes into N separate lists, so all the indexes for a particular
table are present only in a single list – and now we can just run N sessions using these N lists (see the sketch after
the query below).

For step 2, here is a query that can help:

```sql
\set NUMBER_OF_SESSIONS 10

select
  format('%I.%I', n.nspname, c.relname) as table_fqn,
  format('%I.%I', n.nspname, i.relname) as index_fqn,
  mod(
    hashtext(format('%I.%I', n.nspname, c.relname)) & 2147483647,
    :NUMBER_OF_SESSIONS
  ) as session_id
from
  pg_index idx
  join pg_class c on idx.indrelid = c.oid
  join pg_class i on idx.indexrelid = i.oid
  join pg_namespace n on c.relnamespace = n.oid
where
  n.nspname not in ('pg_catalog', 'pg_toast', 'information_schema')
  -- and ... additional filters if needed
order by
  table_fqn, index_fqn;
```

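And here is one possible way to drive step 3 from psql itself (a minimal sketch; the `SESSION_ID` variable and the
`\gexec` approach are illustrative additions, not part of the original recipe) – start N psql sessions, each with its
own `SESSION_ID` value, and let each one rebuild only its share of the indexes:

```sql
\set NUMBER_OF_SESSIONS 10
\set SESSION_ID 0

-- SESSION_ID must be 0, 1, ..., N-1; use a different value in each psql session.
-- The query generates the REINDEX commands for this session's tables only,
-- and \gexec executes each generated command as a separate statement.
select format('reindex index concurrently %I.%I', n.nspname, i.relname)
from
  pg_index idx
  join pg_class c on idx.indrelid = c.oid
  join pg_class i on idx.indexrelid = i.oid
  join pg_namespace n on c.relnamespace = n.oid
where
  n.nspname not in ('pg_catalog', 'pg_toast', 'information_schema')
  and mod(
    hashtext(format('%I.%I', n.nspname, c.relname)) & 2147483647,
    :NUMBER_OF_SESSIONS
  ) = :SESSION_ID
order by 1 \gexec
```
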

‎README.md

Lines changed: 3 additions & 0 deletions
@@ -103,6 +103,9 @@ As an example, first 2 rows:
- 0074 [How to flush caches (OS page cache and Postgres buffer pool)](./0074_how_to_flush_caches.md)
- 0075 [How to find redundant indexes](./0076_how_to_find_redundent_indexes)
- 0076 [How to find unused indexes](./0076_how_to_find_redundent_indexes.md)
- 0077 [Postgres major upgrade without any downtime for a very large cluster running under heavy load](./0077_zero_downtime_major_upgrade.md)
- 0078 [How to estimate the YoY growth of a very large table using row creation timestamps and the planner statistics](./0078_estimate_yoy_table_growth.md)
- 0079 [How to rebuild many indexes using many backends avoiding deadlocks](./0079_rebuild_indexes_without_deadlocks.md)
- ...

## Contributors
