I recently had an issue with a test Postgres server that didn't shut down cleanly and could not restart because it was stuck in recovery. I tried all the quick tricks I could find by browsing the internet, but all of them failed. I ended up using pg_resetwal to get to a point where the database could start up again. This meant losing some data that had not been written to disk before the crash. I decided to look into how recovery really works in Postgres, which ended up in yet another extension, pg_wal_recovery. In this post, I'll demonstrate a simple table recovery using Postgres. I'll be using /usr/local/pgsql/data as my Postgres data location. If this is different for you, you can change it accordingly.
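If you are not sure where your data directory is, you can ask a running server directly from psql:

-- Shows the data directory of the server you are connected to
SHOW data_directory;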
First, the definition of recovery [I googled :)]: Data recovery is the process of restoring a database to a consistent state after a crash or power loss. Changes to data are usually stored separately as a sequential log. The actual data pages (table data) can be written to permanent storage later. When a crash occurs before the data is flushed to disk, the database can be left in an inconsistent state where there are changes in the log that are not reflected in the table files. During the next restart, the database needs to "fix" this inconsistency: it reads the log and replays every record from the last REDO location in order to restore the database to a consistent state.
Postgres uses the WAL to log changes made to tables before the data is flushed to permanent storage during a checkpoint or by the background writer. The WAL plays a very important role in database activities such as replication, point-in-time recovery (PITR), crash recovery, backups and others.
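You can watch the WAL position advance as changes are logged. As a small illustration (on Postgres 10 or later), the built-in functions below report the current insert position and the WAL segment file it maps to:

-- Current WAL insert position and the segment file it falls in
SELECT pg_current_wal_insert_lsn() AS insert_lsn,
       pg_walfile_name(pg_current_wal_insert_lsn()) AS wal_segment;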
Postgres uses a control file, PGDATA/global/pg_control, to store state information required by various parts of the database. This is where details like the checkpoint location, the REDO location, the database state value (an enum), the next transaction ID to be assigned, the next OID and many more are stored. A copy of the file data is kept in shared memory for easy access and is routinely synced with the data on disk. When the database starts up, it reads the control file and checks the values of the REDO and checkpoint locations. If these two values are the same and the database was shut down cleanly, the normal startup process continues. If not, the recovery process follows. See StartupProcessMain in src/backend/postmaster/startup.c and StartupXLOG in src/backend/access/transam/xlog.c. Also see InitWalRecovery and PerformWalRecovery in transam/xlogrecovery.c for the low-level implementations.
The control file data can be read using the pg_controldata tool. A sample output is as follows.
postgres@f63dffa121c7:~$ pg_controldata -D /usr/local/pgsql/data/
pg_control version number:            1800
Catalog version number:               202505071
Database system identifier:           7504776984629272596
Database cluster state:               in production
pg_control last modified:             Tue Jul 8 21:38:34 2025
Latest checkpoint location:           16/18031820
Latest checkpoint's REDO location:    16/18031820
Latest checkpoint's REDO WAL file:    000000010000001600000018
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          1:1847483885
Latest checkpoint's NextOID:          443131
.....
In this case, both the REDO and checkpoint locations point to the same position in the WAL file, and the database is up and running, as shown in the value of Database cluster state: in production.
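The same information is also exposed at the SQL level. If you would rather not shell out, pg_control_checkpoint() returns the checkpoint data from a live session; the values should mirror what pg_controldata prints:

-- Read checkpoint information from the control file via SQL
SELECT checkpoint_lsn, redo_lsn, redo_wal_file, timeline_id
FROM pg_control_checkpoint();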
Now, if we run the CHECKPOINT command, the checkpoint and REDO positions in the control file will change.
postgres=# CHECKPOINT;
CHECKPOINT
The control file data has now changed:
pg_control version number:            1800
Catalog version number:               202505071
Database system identifier:           7504776984629272596
Database cluster state:               in production
pg_control last modified:             Tue Jul 8 22:06:08 2025
Latest checkpoint location:           16/18031928
Latest checkpoint's REDO location:    16/180318D0
Latest checkpoint's REDO WAL file:    000000010000001600000018
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          1:1847483885
.....
Looking at how far the checkpoint and REDO positions moved before and after the CHECKPOINT command, we get:
postgres=# select '16/18031928'::pg_lsn - '16/18031820'::pg_lsn as checkpoint_diff_bytes;
 checkpoint_diff_bytes
-----------------------
                   264
(1 row)

postgres=# select '16/180318D0'::pg_lsn - '16/18031820'::pg_lsn as redo_diff_bytes;
 redo_diff_bytes
-----------------
             176
(1 row)
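The pg_lsn type makes this kind of arithmetic easy. Relatedly, if you ever need to map an LSN back to the WAL segment file that contains it (and the byte offset within that file), there is a built-in helper for that too:

-- Map an LSN to its WAL segment file name and byte offset
SELECT * FROM pg_walfile_name_offset('16/18031928');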
We can see which records were added to the WAL when the CHECKPOINT command was called using the pg_waldump tool.
postgres@f63dffa121c7:~$ pg_waldump --start=16/18031820 /usr/local/pgsql/data/pg_wal/000000010000001600000018
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 16/18031820, prev 16/180317E8, desc: CHECKPOINT_SHUTDOWN redo 16/18031820; tli 1; prev tli 1; fpw true; wal_level replica; xid 1:1847483885; oid 443131; multi 1; offset 0; oldest xid 2147483649 in DB 5; oldest multi 1 in DB 5; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 16/18031898, prev 16/18031820, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
rmgr: XLOG        len (rec/tot):     30/    30, tx:          0, lsn: 16/180318D0, prev 16/18031898, desc: CHECKPOINT_REDO wal_level replica
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 16/180318F0, prev 16/180318D0, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 16/18031928, prev 16/180318F0, desc: CHECKPOINT_ONLINE redo 16/180318D0; tli 1; prev tli 1; fpw true; wal_level replica; xid 1:1847483885; oid 443131; multi 1; offset 0; oldest xid 2147483649 in DB 5; oldest multi 1 in DB 5; oldest/newest commit timestamp xid: 0/0; oldest running xid 1847483885; online
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 16/180319A0, prev 16/18031928, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
pg_waldump: error: error in WAL record at 16/180319A0: invalid record length at 16/180319D8: expected at least 24, got 0
No actual data records have been added. The REDO record moved from 16/18031820 to 16/180318D0, while the checkpoint record moved from 16/18031820 to 16/18031928. Now we proceed to the demo.
Open a client session, create a table foo and insert some data into it, then do a "dirty" shutdown of the server before any checkpoint is run.
postgres=# DROP TABLE IF EXISTS foo;
NOTICE:  table "foo" does not exist, skipping
DROP TABLE
postgres=# CREATE TABLE foo(id SERIAL, k INT NOT NULL);
CREATE TABLE
postgres=# INSERT INTO foo SELECT i, i*100 FROM generate_series(1, 100) i;
INSERT 0 100
In another terminal session, kill the server using the --mode=immediate option.
postgres@f63dffa121c7:~$ pg_ctl -D /usr/local/pgsql/data -l ~/logfile stop --mode=immediate
waiting for server to shut down.... done
server stopped
Table foo should now have 100 records since the transaction committed, right? Not really.
On commit, a user can only be sure that the data has been stored in the WAL, not in the table files. Before the database can be restarted, we need to do two things: copy the control file and copy the WAL segment file to another location. Create a directory named global inside the new location and copy the control file into that global directory. Then copy the last active WAL segment file (shown in the Latest checkpoint's REDO WAL file row when you run pg_controldata -D /usr/local/pgsql/data/) to the same location. Ensure the postgres user has read/write access to the location. I'll use the /tmp directory in this case. The final structure should look like this:
postgres@f63dffa121c7:~$ cp /usr/local/pgsql/data/global/pg_control /tmp/global/
postgres@f63dffa121c7:~$ cp /usr/local/pgsql/data/pg_wal/000000010000001600000018 /tmp/
postgres@f63dffa121c7:~$ ls /tmp/
000000010000001600000018  global
postgres@f63dffa121c7:~$ ls /tmp/global/
pg_control
Now clear the WAL data using the pg_resetwal tool. This is needed because, by default, when Postgres starts up it first checks whether it needs to go into recovery mode, and it will use the WAL to recover any un-checkpointed data. Manual recovery will be done later using the pg_wal_recovery extension.
postgres@f63dffa121c7:~$ pg_resetwal -D /usr/local/pgsql/data/ -f
Write-ahead log reset
pg_resetwal should never be used in a production database. It has destructive side effects that can lead to data loss. Its usage should be limited to situations where the user knows what they are doing and they are aware of the risks involved.
Looking at the control data now, the checkpoint and REDO locations match. If there were any WAL segment files in the pg_wal directory prior to running the command, they have been removed. Also, a new WAL segment file, 000000010000001600000019, has been created.
postgres@f63dffa121c7:~$ pg_controldata -D /usr/local/pgsql/data/
pg_control version number:            1800
Catalog version number:               202505071
Database system identifier:           7504776984629272596
Database cluster state:               shut down
pg_control last modified:             Wed Jul 9 10:46:20 2025
Latest checkpoint location:           16/19000028
Latest checkpoint's REDO location:    16/19000028
Latest checkpoint's REDO WAL file:    000000010000001600000019
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          1:1847483887
Latest checkpoint's NextOID:          451323
.....
Now the database can be restarted since we know it won't go into automatic recovery mode.
postgres@f63dffa121c7:~$ pg_ctl -D /usr/local/pgsql/data -l ~/logfile start
waiting for server to start.... done
server started
Once it is up, we can check the foo table to find out if there is any data in it.
postgres=# SELECT * FROM foo;
 id | k
----+---
(0 rows)
It returned no rows. Does this mean the data inserted earlier has been lost? Well... not really.
The data was committed, so we can be sure that, at the very least, the data has been logged in the WAL. We can check this using the pg_wal_recovery extension.
postgres=# SELECT * FROM wal_list_records('/tmp');
WARNING:  invalid record length at 16/18065410: expected at least 24, got 0
      wal_file_name       |     wal_type      | wal_record
--------------------------+-------------------+-------------
 000000010000001600000018 | CHECKPOINT_REDO   | 16/1805D4E0
 000000010000001600000018 | RUNNING_XACTS     | 16/1805D500
 000000010000001600000018 | CHECKPOINT_ONLINE | 16/1805D538
 000000010000001600000018 | RUNNING_XACTS     | 16/1805D5B0
 000000010000001600000018 | FPI_FOR_HINT      | 16/1805D5E8
 000000010000001600000018 | FPI_FOR_HINT      | 16/1805E0C0
 000000010000001600000018 | FPI_FOR_HINT      | 16/1805F880
 000000010000001600000018 | RUNNING_XACTS     | 16/18060D00
 000000010000001600000018 | INSERT+INIT       | 16/18060D38
 000000010000001600000018 | INSERT            | 16/18060D78
 000000010000001600000018 | INSERT            | 16/18060DB8
 000000010000001600000018 | INSERT            | 16/18060DF8
 000000010000001600000018 | INSERT            | 16/18060E38
 .....
 000000010000001600000018 | INSERT            | 16/18062550
 000000010000001600000018 | INSERT            | 16/18062590
 000000010000001600000018 | INSERT            | 16/180625D0
 000000010000001600000018 | INSERT            | 16/18062610
 000000010000001600000018 | COMMIT            | 16/18062650
 000000010000001600000018 | INSERT            | 16/18062678
 000000010000001600000018 | INSERT_LEAF       | 16/18063CD8
 000000010000001600000018 | INSERT            | 16/18064268
 000000010000001600000018 | INSERT_LEAF       | 16/180644B8
 000000010000001600000018 | INPLACE           | 16/180644F8
 000000010000001600000018 | COMMIT            | 16/18065388
(115 rows)
The extension read WAL records starting from 16/1805D4E0, the last REDO location in the control file that was copied earlier. There are 100 WAL records from offset 16/18060D38 to 16/18062610 in the WAL segment file. This matches the number of inserts that were done just before the database was shut down. We can verify this using the pg_waldump tool, which provides much more information about the WAL records.
postgres@f63dffa121c7:~$ pg_waldump --start=16/18060D38 --end=16/18062610 /tmp/000000010000001600000018
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/18060D38, prev 16/18060D00, desc: INSERT+INIT off: 1, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/18060D78, prev 16/18060D38, desc: INSERT off: 2, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/18060DB8, prev 16/18060D78, desc: INSERT off: 3, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
.....
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/18062590, prev 16/18062550, desc: INSERT off: 98, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/180625D0, prev 16/18062590, desc: INSERT off: 99, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
The last thing to check before replaying the WAL records is the data file (table foo), to confirm whether there is any data present. For this, the pageinspect extension comes in handy.
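pageinspect ships with the standard contrib modules; if it is not already installed in your database (and assuming contrib is available in your build), a single statement makes its functions available:

-- Install the contrib extension that exposes raw page contents
CREATE EXTENSION IF NOT EXISTS pageinspect;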
postgres=# SELECT lp, t_ctid, t_xmin, t_xmax FROM heap_page_items(get_raw_page('foo', 0)) WHERE lp_len > 0;
 lp | t_ctid | t_xmin | t_xmax
----+--------+--------+--------
(0 rows)
Here we query the contents of the first page (zero-based) of the data file. lp (line pointer) is the item number of the record within the page. t_ctid is the location of the record within the data file, e.g. (0,3) can be read as the third item on the first page. t_xmin and t_xmax are the transaction IDs of the inserting and deleting transactions respectively.
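If you want to see which on-disk file get_raw_page is actually reading, pg_relation_filepath() returns its path relative to the data directory. In this run it should end in the same relfilenode (443132) that appears in the pg_waldump output above as rel 1663/5/443132, although the exact numbers will differ on your system:

-- Path of the heap file backing table foo, relative to PGDATA
SELECT pg_relation_filepath('foo');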
The table is empty. We have to replay the WAL in order to get the data into the table. The pg_wal_recovery extension will be used for this purpose. It will output the last WAL record replayed when done.
postgres=# select * FROM wal_recover('/tmp');
WARNING:  invalid record length at 16/18065410: expected at least 24, got 0
 wal_type | wal_record
----------+-------------
 COMMIT   | 16/18065388
(1 row)
The "lost" data should now appear in the data files. Again, we can run the query using the helper functions from thepageinspect
extension
postgres=# SELECT lp, t_ctid, t_xmin, t_xmax FROM heap_page_items(get_raw_page('foo', 0)) WHERE lp_len > 0;
  lp | t_ctid  |   t_xmin   | t_xmax
-----+---------+------------+--------
   1 | (0,1)   | 1847483887 |      0
   2 | (0,2)   | 1847483887 |      0
   3 | (0,3)   | 1847483887 |      0
.....
  97 | (0,97)  | 1847483887 |      0
  98 | (0,98)  | 1847483887 |      0
  99 | (0,99)  | 1847483887 |      0
 100 | (0,100) | 1847483887 |      0
(100 rows)
The data is now present in the table file. In this case, all 100 records fit in a single page (each page is 8 kB in size). After a restart, querying the foo table now returns all 100 records. The data has been restored.
postgres=# SELECT * FROM foo;
 id  |   k
-----+-------
   1 |   100
   2 |   200
.....
  99 |  9900
 100 | 10000
(100 rows)
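As one last sanity check, the inserting transaction ID is also visible through the xmin system column of an ordinary query; it should match the t_xmin value seen in heap_page_items (1847483887 in this run):

-- The xmin system column should show the original inserting transaction
SELECT ctid, xmin, id, k FROM foo LIMIT 3;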
In this post I demonstrated how to perform a simple data recovery using the WAL. In real-life systems, the chances of having to do a recovery like this are very low. Postgres already ships with a robust recovery infrastructure that has been battle-tested for many years, so automatic recovery would handle such a situation for you anyway. Building the extension made me go deeper into the lower-level parts of recovery in Postgres. The extension is for educational purposes only and is not suitable for a production database or any database with data you care about.