Commit 54d0e28

Add some documentation about how we WAL-log filesystem actions.
Per a question from Robert Haas.

1 parent 594419e · commit 54d0e28

1 file changed (+80, -1 lines): src/backend/access/transam/README

src/backend/access/transam/README
Lines changed: 80 additions & 1 deletion

@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.13 2009/12/19 01:32:33 sriggs Exp $
+$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.14 2010/09/17 00:42:39 tgl Exp $

The Transaction System
======================

@@ -543,6 +543,85 @@ consistency. Such insertions occur after WAL is operational, so they can
and should write WAL records for the additional generated actions.

Write-Ahead Logging for Filesystem Actions
------------------------------------------

The previous section described how to WAL-log actions that only change page
contents within shared buffers. For that type of action it is generally
possible to check all likely error cases (such as insufficient space on the
page) before beginning to make the actual change. Therefore we can make
the change and the creation of the associated WAL log record "atomic" by
wrapping them into a critical section --- the odds of failure partway
through are low enough that PANIC is acceptable if it does happen.

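To make that pattern concrete, here is a rough sketch of the shape such an
action takes in backend code. This is an illustration, not code from this
commit: the rmgr id, info byte, and rdata chain are whatever the particular
record type requires, and the three-argument XLogInsert() of this era is
assumed.

    #include "postgres.h"

    #include "access/xlog.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"

    void
    log_and_apply_page_change(Buffer buffer, RmgrId rmid, uint8 info,
                              XLogRecData *rdata)
    {
        Page        page = BufferGetPage(buffer);
        XLogRecPtr  recptr;

        /*
         * All failure-prone work (locking, free-space checks, building the
         * rdata chain) has already been done.  Inside the critical section
         * any ERROR is promoted to PANIC, so the page change and its WAL
         * record cannot come apart.
         */
        START_CRIT_SECTION();

        /* ... apply the change to the page in shared buffers here ... */
        MarkBufferDirty(buffer);

        /* emit the WAL record describing the change, and stamp the page LSN */
        recptr = XLogInsert(rmid, info, rdata);
        PageSetLSN(page, recptr);

        END_CRIT_SECTION();
    }
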
Clearly, that approach doesn't work for cases where there's a significant
probability of failure within the action to be logged, such as creation
of a new file or database. We don't want to PANIC, and we especially don't
want to PANIC after having already written a WAL record that says we did
the action --- if we did, replay of the record would probably fail again
and PANIC again, making the failure unrecoverable. This means that the
ordinary WAL rule of "write WAL before the changes it describes" doesn't
work, and we need a different design for such cases.

There are several basic types of filesystem actions that have this
issue. Here is how we deal with each:

1. Adding a disk page to an existing table.
570+
571+
This action isn't WAL-logged at all. We extend a table by writing a page
572+
of zeroes at its end. We must actually do this write so that we are sure
573+
the filesystem has allocated the space. If the write fails we can just
574+
error out normally. Once the space is known allocated, we can initialize
575+
and fill the page via one or more normal WAL-logged actions. Because it's
576+
possible that we crash between extending the file and writing out the WAL
577+
entries, we have to treat discovery of an all-zeroes page in a table or
578+
index as being a non-error condition. In such cases we can just reclaim
579+
the space for re-use.
580+
581+
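A rough illustration of this ordering in plain POSIX C follows. It is not
PostgreSQL's smgr code; BLCKSZ is simply set to the default page size for
the sketch.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLCKSZ 8192            /* PostgreSQL's default page size */

    /*
     * Extend an already-open relation file by one page of zeroes.  Returns
     * 0 on success, -1 if the filesystem refused to allocate the space.
     */
    int
    extend_with_zero_page(int fd)
    {
        char    zeroes[BLCKSZ];

        memset(zeroes, 0, sizeof(zeroes));

        /* position at the current end of file and physically write the page */
        if (lseek(fd, 0, SEEK_END) < 0 ||
            write(fd, zeroes, sizeof(zeroes)) != (ssize_t) sizeof(zeroes))
        {
            /* nothing has been WAL-logged yet, so an ordinary error is fine */
            fprintf(stderr, "could not extend file: %s\n", strerror(errno));
            return -1;
        }

        /*
         * The space is now known to be allocated.  Initializing and filling
         * the page happens later via normal WAL-logged actions; if we crash
         * before those reach the log, a later reader finds an all-zeroes
         * page here and must treat it as reusable space, not corruption.
         */
        return 0;
    }
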
2. Creating a new table, which requires a new file in the filesystem.

We try to create the file, and if successful we make a WAL record saying
we did it. If not successful, we can just throw an error. Notice that
there is a window where we have created the file but not yet written any
WAL about it to disk. If we crash during this window, the file remains
on disk as an "orphan". It would be possible to clean up such orphans
by having database restart search for files that don't have any committed
entry in pg_class, but that currently isn't done because of the possibility
of deleting data that is useful for forensic analysis of the crash.
Orphan files are harmless --- at worst they waste a bit of disk space ---
because we check for on-disk collisions when allocating new relfilenode
OIDs. So cleaning up isn't really necessary.

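A minimal sketch of that ordering in plain POSIX C (not the backend's actual
smgrcreate path): wal_log_file_create() is a hypothetical stand-in for
emitting the WAL record.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* hypothetical stand-in for emitting the "file created" WAL record */
    void
    wal_log_file_create(const char *path)
    {
        printf("WAL: created relation file %s\n", path);
    }

    /*
     * Create the file backing a new relation.  The WAL record is written
     * only after the creation has succeeded.
     */
    int
    create_relation_file(const char *path)
    {
        int     fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

        if (fd < 0)
        {
            /* creation failed and nothing was logged: just report an error */
            fprintf(stderr, "could not create \"%s\": %s\n",
                    path, strerror(errno));
            return -1;
        }

        /*
         * Window of interest: the file exists on disk but no WAL about it
         * has been written.  A crash here leaves the file behind as a
         * harmless "orphan" that merely wastes a little disk space.
         */
        wal_log_file_create(path);

        return fd;
    }
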
3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)

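The same ordering in plain POSIX C, again as a sketch rather than the real
backend code; wal_log_file_unlink() is a hypothetical stand-in for the
deletion entry that rides in the commit record.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* hypothetical stand-in for the deletion entry in the commit record */
    void
    wal_log_file_unlink(const char *path)
    {
        printf("WAL: file %s to be removed at commit\n", path);
    }

    /*
     * Remove the file backing a dropped relation.  The WAL comes first; a
     * failure of the unlink() itself is only a warning, because the DROP
     * has already committed and cannot be rolled back.
     */
    void
    drop_relation_file(const char *path)
    {
        wal_log_file_unlink(path);

        /* in the backend this runs only after the commit record is flushed */
        if (unlink(path) != 0)
            fprintf(stderr, "WARNING: could not remove \"%s\": %s\n",
                    path, strerror(errno));
        /* a leftover file is just another orphan wasting some disk space */
    }
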
4. Creating and deleting databases and tablespaces, which requires creating
and deleting directories and entire directory trees.

These cases are handled similarly to creating individual files, ie, we
try to do the action first and then write a WAL entry if it succeeded.
The potential amount of wasted disk space is rather larger, of course.
In the creation case we try to delete the directory tree again if creation
fails, so as to reduce the risk of wasted space. Failure partway through
a deletion operation results in a corrupt database: the DROP failed, but
some of the data is gone anyway. There is little we can do about that,
though, and in any case it was presumably data the user no longer wants.

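A sketch of the creation half in plain POSIX C. Real database and tablespace
creation copies an entire template tree, which this does not attempt;
wal_log_dir_create() is a hypothetical stand-in for the WAL record.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* hypothetical stand-in for the "directory created" WAL record */
    void
    wal_log_dir_create(const char *path)
    {
        printf("WAL: created directory %s\n", path);
    }

    /*
     * Create the directory for a new database or tablespace: do the
     * filesystem work first, log it only on success, and undo partial work
     * on failure so the wasted space stays small.
     */
    int
    create_database_dir(const char *path)
    {
        if (mkdir(path, 0700) != 0)
        {
            /* nothing logged yet, so an ordinary error is fine */
            fprintf(stderr, "could not create directory \"%s\": %s\n",
                    path, strerror(errno));
            return -1;
        }

        /*
         * ... populate the directory (copy the template tree, etc.) here;
         * if that fails, remove what was created (rmdir(path)) before
         * reporting the error, then return -1 ...
         */

        wal_log_dir_create(path);
        return 0;
    }
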
In all of these cases, if WAL replay fails to redo the original action
we must panic and abort recovery. The DBA will have to manually clean up
(for instance, free up some disk space or fix directory permissions) and
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.

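The replay-side rule, sketched for one of the logged actions in the same
plain POSIX C style; abort() stands in here for the backend's
ereport(PANIC, ...).

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>

    /*
     * Redo a logged "create directory" action during WAL replay.  If the
     * action cannot be redone, recovery cannot continue.
     */
    void
    redo_create_dir(const char *path)
    {
        if (mkdir(path, 0700) != 0 && errno != EEXIST)
        {
            /*
             * The DBA must fix the underlying problem (disk space,
             * permissions) and restart recovery; replay must not silently
             * skip the action.
             */
            fprintf(stderr, "PANIC: could not re-create directory \"%s\": %s\n",
                    path, strerror(errno));
            abort();
        }
    }
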
Asynchronous Commit
-------------------
