Commit 54d0e28

Add some documentation about how we WAL-log filesystem actions.
Per a question from Robert Haas.

1 parent 594419e · commit 54d0e28

1 file changed (+80, -1 lines): src/backend/access/transam/README

src/backend/access/transam/README
Lines changed: 80 additions & 1 deletion

@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.13 2009/12/19 01:32:33 sriggs Exp $
+$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.14 2010/09/17 00:42:39 tgl Exp $

The Transaction System
======================

@@ -543,6 +543,85 @@ consistency. Such insertions occur after WAL is operational, so they can
and should write WAL records for the additional generated actions.

Write-Ahead Logging for Filesystem Actions
------------------------------------------

The previous section described how to WAL-log actions that only change page
contents within shared buffers. For that type of action it is generally
possible to check all likely error cases (such as insufficient space on the
page) before beginning to make the actual change. Therefore we can make
the change and the creation of the associated WAL log record "atomic" by
wrapping them into a critical section --- the odds of failure partway
through are low enough that PANIC is acceptable if it does happen.

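To make that pattern concrete, here is a rough sketch of the shape such an
action takes in backend code. This is an illustration, not code from this
commit: the rmgr id, info byte, and rdata chain are whatever the particular
record type requires, and the three-argument XLogInsert() of this era is
assumed.

    #include "postgres.h"

    #include "access/xlog.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"

    void
    log_and_apply_page_change(Buffer buffer, RmgrId rmid, uint8 info,
                              XLogRecData *rdata)
    {
        Page        page = BufferGetPage(buffer);
        XLogRecPtr  recptr;

        /*
         * All failure-prone work (locking, free-space checks, building the
         * rdata chain) has already been done.  Inside the critical section
         * any ERROR is promoted to PANIC, so the page change and its WAL
         * record cannot come apart.
         */
        START_CRIT_SECTION();

        /* ... apply the change to the page in shared buffers here ... */
        MarkBufferDirty(buffer);

        /* emit the WAL record describing the change, and stamp the page LSN */
        recptr = XLogInsert(rmid, info, rdata);
        PageSetLSN(page, recptr);

        END_CRIT_SECTION();
    }
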
Clearly, that approach doesn't work for cases where there's a significant
probability of failure within the action to be logged, such as creation
of a new file or database. We don't want to PANIC, and we especially don't
want to PANIC after having already written a WAL record that says we did
the action --- if we did, replay of the record would probably fail again
and PANIC again, making the failure unrecoverable. This means that the
ordinary WAL rule of "write WAL before the changes it describes" doesn't
work, and we need a different design for such cases.

There are several basic types of filesystem actions that have this
issue. Here is how we deal with each:

1. Adding a disk page to an existing table.
570+
571+
This action isn't WAL-logged at all. We extend a table by writing a page
572+
of zeroes at its end. We must actually do this write so that we are sure
573+
the filesystem has allocated the space. If the write fails we can just
574+
error out normally. Once the space is known allocated, we can initialize
575+
and fill the page via one or more normal WAL-logged actions. Because it's
576+
possible that we crash between extending the file and writing out the WAL
577+
entries, we have to treat discovery of an all-zeroes page in a table or
578+
index as being a non-error condition. In such cases we can just reclaim
579+
the space for re-use.
580+
581+
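A rough illustration of this ordering in plain POSIX C follows. It is not
PostgreSQL's smgr code; BLCKSZ is simply set to the default page size for
the sketch.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLCKSZ 8192            /* PostgreSQL's default page size */

    /*
     * Extend an already-open relation file by one page of zeroes.  Returns
     * 0 on success, -1 if the filesystem refused to allocate the space.
     */
    int
    extend_with_zero_page(int fd)
    {
        char    zeroes[BLCKSZ];

        memset(zeroes, 0, sizeof(zeroes));

        /* position at the current end of file and physically write the page */
        if (lseek(fd, 0, SEEK_END) < 0 ||
            write(fd, zeroes, sizeof(zeroes)) != (ssize_t) sizeof(zeroes))
        {
            /* nothing has been WAL-logged yet, so an ordinary error is fine */
            fprintf(stderr, "could not extend file: %s\n", strerror(errno));
            return -1;
        }

        /*
         * The space is now known to be allocated.  Initializing and filling
         * the page happens later via normal WAL-logged actions; if we crash
         * before those reach the log, a later reader finds an all-zeroes
         * page here and must treat it as reusable space, not corruption.
         */
        return 0;
    }
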
2. Creating a new table, which requires a new file in the filesystem.

We try to create the file, and if successful we make a WAL record saying
we did it. If not successful, we can just throw an error. Notice that
there is a window where we have created the file but not yet written any
WAL about it to disk. If we crash during this window, the file remains
on disk as an "orphan". It would be possible to clean up such orphans
by having database restart search for files that don't have any committed
entry in pg_class, but that currently isn't done because of the possibility
of deleting data that is useful for forensic analysis of the crash.
Orphan files are harmless --- at worst they waste a bit of disk space ---
because we check for on-disk collisions when allocating new relfilenode
OIDs. So cleaning up isn't really necessary.

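A minimal sketch of that ordering in plain POSIX C (not the backend's actual
smgrcreate path): wal_log_file_create() is a hypothetical stand-in for
emitting the WAL record.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* hypothetical stand-in for emitting the "file created" WAL record */
    void
    wal_log_file_create(const char *path)
    {
        printf("WAL: created relation file %s\n", path);
    }

    /*
     * Create the file backing a new relation.  The WAL record is written
     * only after the creation has succeeded.
     */
    int
    create_relation_file(const char *path)
    {
        int     fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

        if (fd < 0)
        {
            /* creation failed and nothing was logged: just report an error */
            fprintf(stderr, "could not create \"%s\": %s\n",
                    path, strerror(errno));
            return -1;
        }

        /*
         * Window of interest: the file exists on disk but no WAL about it
         * has been written.  A crash here leaves the file behind as a
         * harmless "orphan" that merely wastes a little disk space.
         */
        wal_log_file_create(path);

        return fd;
    }
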
3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)

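The same ordering in plain POSIX C, again as a sketch rather than the real
backend code; wal_log_file_unlink() is a hypothetical stand-in for the
deletion entry that rides in the commit record.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* hypothetical stand-in for the deletion entry in the commit record */
    void
    wal_log_file_unlink(const char *path)
    {
        printf("WAL: file %s to be removed at commit\n", path);
    }

    /*
     * Remove the file backing a dropped relation.  The WAL comes first; a
     * failure of the unlink() itself is only a warning, because the DROP
     * has already committed and cannot be rolled back.
     */
    void
    drop_relation_file(const char *path)
    {
        wal_log_file_unlink(path);

        /* in the backend this runs only after the commit record is flushed */
        if (unlink(path) != 0)
            fprintf(stderr, "WARNING: could not remove \"%s\": %s\n",
                    path, strerror(errno));
        /* a leftover file is just another orphan wasting some disk space */
    }
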
4. Creating and deleting databases and tablespaces, which requires creating
and deleting directories and entire directory trees.

These cases are handled similarly to creating individual files, ie, we
try to do the action first and then write a WAL entry if it succeeded.
The potential amount of wasted disk space is rather larger, of course.
In the creation case we try to delete the directory tree again if creation
fails, so as to reduce the risk of wasted space. Failure partway through
a deletion operation results in a corrupt database: the DROP failed, but
some of the data is gone anyway. There is little we can do about that,
though, and in any case it was presumably data the user no longer wants.

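A sketch of the creation half in plain POSIX C. Real database and tablespace
creation copies an entire template tree, which this does not attempt;
wal_log_dir_create() is a hypothetical stand-in for the WAL record.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* hypothetical stand-in for the "directory created" WAL record */
    void
    wal_log_dir_create(const char *path)
    {
        printf("WAL: created directory %s\n", path);
    }

    /*
     * Create the directory for a new database or tablespace: do the
     * filesystem work first, log it only on success, and undo partial work
     * on failure so the wasted space stays small.
     */
    int
    create_database_dir(const char *path)
    {
        if (mkdir(path, 0700) != 0)
        {
            /* nothing logged yet, so an ordinary error is fine */
            fprintf(stderr, "could not create directory \"%s\": %s\n",
                    path, strerror(errno));
            return -1;
        }

        /*
         * ... populate the directory (copy the template tree, etc.) here;
         * if that fails, remove what was created (rmdir(path)) before
         * reporting the error, then return -1 ...
         */

        wal_log_dir_create(path);
        return 0;
    }
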
In all of these cases, if WAL replay fails to redo the original action
we must panic and abort recovery. The DBA will have to manually clean up
(for instance, free up some disk space or fix directory permissions) and
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.

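The replay-side rule, sketched for one of the logged actions in the same
plain POSIX C style; abort() stands in here for the backend's
ereport(PANIC, ...).

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>

    /*
     * Redo a logged "create directory" action during WAL replay.  If the
     * action cannot be redone, recovery cannot continue.
     */
    void
    redo_create_dir(const char *path)
    {
        if (mkdir(path, 0700) != 0 && errno != EEXIST)
        {
            /*
             * The DBA must fix the underlying problem (disk space,
             * permissions) and restart recovery; replay must not silently
             * skip the action.
             */
            fprintf(stderr, "PANIC: could not re-create directory \"%s\": %s\n",
                    path, strerror(errno));
            abort();
        }
    }
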
Asynchronous Commit
-------------------
