@@ -213,7 +213,7 @@ this flag must be clear before splitting a bucket; thus, a bucket can't be
213213split again until the previous split is totally complete.
214214
215215The moved-by-split flag on a tuple indicates that the tuple was moved from the old to the
216- new bucket. Concurrent scans can skip such tuples till the split operation
216+ new bucket. Concurrent scans will skip such tuples until the split operation
217217is finished. Once the tuple is marked as moved-by-split, it will remain so
218218forever but that does no harm. We have intentionally not cleared it as that
219219can generate an additional I/O which is not necessary.
@@ -287,13 +287,17 @@ The insertion algorithm is rather similar:
287287if current page is full, release lock but not pin, read/exclusive-lock
288288 next page; repeat as needed
289289>> see below if no space in any page of bucket
290+ take buffer content lock in exclusive mode on metapage
290291insert tuple at appropriate place in page
291- mark current page dirty and release buffer content lock and pin
292- if the current page is not a bucket page, release the pin on bucket page
293- pin meta page and take buffer content lock in exclusive mode
292+ mark current page dirty
294293increment tuple count, decide if split needed
295- mark meta page dirty and release buffer content lock and pin
296- done if no split needed, else enter Split algorithm below
294+ mark meta page dirty
295+ write WAL for insertion of tuple
296+ release the buffer content lock on metapage
297+ release buffer content lock on current page
298+ if current page is not a bucket page, release the pin on bucket page
299+ if split is needed, enter Split algorithm below
300+ release the pin on metapage
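
As an illustration of the ordering above (modify the pages, mark them dirty,
then write one WAL record covering them, all inside a single critical
section), here is a rough C sketch of how such an insert could be logged with
the generic XLog API.  It is a simplified sketch rather than the actual
implementation: the record name XLOG_HASH_INSERT, struct xl_hash_insert and
SizeOfHashInsert are assumed to come from access/hash_xlog.h, the caller is
assumed to already hold exclusive content locks on both pages, and space
checks, the metapage update and split handling are elided.

    #include "postgres.h"
    #include "access/hash_xlog.h"
    #include "access/itup.h"
    #include "access/xloginsert.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Sketch only: log one tuple inserted into "buf"; "metabuf" is also locked. */
    static void
    hash_log_insert_sketch(Relation rel, Buffer buf, Buffer metabuf,
                           IndexTuple itup, OffsetNumber itup_off)
    {
        START_CRIT_SECTION();

        /* modify the data page first ... */
        if (PageAddItem(BufferGetPage(buf), (Item) itup, IndexTupleSize(itup),
                        itup_off, false, false) == InvalidOffsetNumber)
            elog(ERROR, "failed to add index item");
        /* (the metapage tuple count would be bumped here as well) */
        MarkBufferDirty(buf);
        MarkBufferDirty(metabuf);

        /* ... then describe the change in a single WAL record */
        if (RelationNeedsWAL(rel))
        {
            xl_hash_insert xlrec;
            XLogRecPtr  recptr;

            xlrec.offnum = itup_off;

            XLogBeginInsert();
            XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
            XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
            XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
            XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);

            recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);

            PageSetLSN(BufferGetPage(buf), recptr);
            PageSetLSN(BufferGetPage(metabuf), recptr);
        }

        END_CRIT_SECTION();
    }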
297301
298302To speed searches, the index entries within any individual index page are
299303kept sorted by hash code; the insertion code must take care to insert new
@@ -328,12 +332,17 @@ existing bucket in two, thereby lowering the fill ratio:
328332 try to finish the split and the cleanup work
329333 if that succeeds, start over; if it fails, give up
330334mark the old and new buckets indicating split is in progress
335+ mark both old and new buckets as dirty
336+ write WAL for allocation of new page for split
331337copy the tuples that belong to the new bucket from the old bucket, marking
332338 them as moved-by-split
339+ write WAL record for moving tuples to the new page once the new page is full
340+ or all the pages of the old bucket have been processed
333341release lock but not pin for primary bucket page of old bucket,
334342 read/shared-lock next page; repeat as needed
335343clear the bucket-being-split and bucket-being-populated flags
336344mark the old bucket indicating split-cleanup
345+ write WAL for changing the flags on both old and new buckets
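
The "marking them as moved-by-split" step above sets a reserved bit in each
copied tuple's header before the copy goes into the new bucket; concurrent
scans that see the split still in progress skip such flagged copies and find
the originals in the old bucket instead.  A minimal sketch, assuming
INDEX_MOVED_BY_SPLIT_MASK is the reserved t_info bit used for this (as in
access/hash.h) and reusing the headers from the insert sketch earlier:

    /* Sketch only: copy an index tuple and flag the copy as moved-by-split. */
    static IndexTuple
    copy_tuple_marked_moved(IndexTuple src)
    {
        Size        sz = IndexTupleSize(src);
        IndexTuple  dst = (IndexTuple) palloc(sz);

        memcpy(dst, src, sz);
        /* assumed flag bit; it is never cleared, as noted earlier in this README */
        dst->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
        return dst;
    }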
337346
338347The split operation's attempt to acquire cleanup-lock on the old bucket number
339348could fail if another process holds any lock or pin on it. We do not want to
@@ -369,6 +378,8 @@ The fourth operation is garbage collection (bulk deletion):
369378acquire cleanup lock on primary bucket page
370379loop:
371380scan and remove tuples
381+ mark the target page dirty
382+ write WAL for deleting tuples from target page
372383if this is the last bucket page, break out of loop
373384pin and x-lock next page
374385release prior lock and pin (except keep pin on primary bucket page)
@@ -383,7 +394,8 @@ The fourth operation is garbage collection (bulk deletion):
383394check if number of buckets changed
384395if so, release content lock and pin and return to for-each-bucket loop
385396else update metapage tuple count
386- mark meta page dirty and release buffer content lock and pin
397+ mark meta page dirty and write WAL for update of metapage
398+ release buffer content lock and pin
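
The per-page "scan and remove tuples ... write WAL for deleting tuples" step
in the loop above could be logged roughly as in the fragment below, reusing
the conventions of the insert sketch earlier.  The record name
XLOG_HASH_DELETE is assumed from access/hash_xlog.h; deletable[] holds the
offsets found removable on this page, and the real record also carries a few
flags that are omitted here.

    /* Sketch only: remove ndeletable tuples from the page in "buf" and log it. */
    START_CRIT_SECTION();

    PageIndexMultiDelete(BufferGetPage(buf), deletable, ndeletable);
    MarkBufferDirty(buf);

    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
        /* the removed offsets ride along as per-buffer data for redo */
        XLogRegisterBufData(0, (char *) deletable,
                            ndeletable * sizeof(OffsetNumber));

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
        PageSetLSN(BufferGetPage(buf), recptr);
    }

    END_CRIT_SECTION();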
387399
388400Note that this is designed to allow concurrent splits and scans. If a split
389401occurs, tuples relocated into the new bucket will be visited twice by the
@@ -425,18 +437,16 @@ Obtaining an overflow page:
425437search for a free page (zero bit in bitmap)
426438if found:
427439set bit in bitmap
428- mark bitmap page dirty and release content lock
440+ mark bitmap page dirty
429441take metapage buffer content lock in exclusive mode
430442if first-free-bit value did not change,
431443update it and mark meta page dirty
432- release meta page buffer content lock
433- return page number
434444else (not found):
435445release bitmap page buffer content lock
436446loop back to try next bitmap page, if any
437447-- here when we have checked all bitmap pages; we hold meta excl. lock
438448extend index to add another overflow page; update meta information
439- mark meta page dirty and release buffer content lock
449+ mark meta page dirty
440450return page number
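
Each bitmap page is simply an array of bits, one bit per overflow page, with a
set bit meaning the page is in use.  The "search for a free page (zero bit in
bitmap)" step is therefore a linear scan for a clear bit.  The helper below is
purely illustrative (it is not the function the hash code uses) and operates
on an in-memory array of 32-bit words:

    /* Illustrative only: find the first clear bit at or after "start",
     * or return -1 if all "nbits" bits are set. */
    static int
    first_clear_bit(const uint32 *words, int nbits, int start)
    {
        int         i;

        for (i = start; i < nbits; i++)
        {
            if ((words[i / 32] & ((uint32) 1 << (i % 32))) == 0)
                return i;       /* this overflow page is free */
        }
        return -1;              /* caller must extend the index instead */
    }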
441451
442452It is slightly annoying to release and reacquire the metapage lock
@@ -456,12 +466,17 @@ like this:
456466
457467-- having determined that no space is free in the target bucket:
458468remember last page of bucket, drop write lock on it
459- call free-page-acquire routine
460469re-write-lock last page of bucket
461470if it is not last anymore, step to the last page
462- update (former) last page to point to new page
471+ execute free-page-acquire (obtaining an overflow page) mechanism
472+ described above
473+ update (former) last page to point to the new page and mark buffer dirty
463474write-lock and initialize new page, with back link to former last page
464- write and release former last page
475+ write WAL for addition of overflow page
476+ release the locks on meta page and bitmap page acquired in
477+ free-page-acquire algorithm
478+ release the lock on former last page
479+ release the lock on new overflow page
465480insert tuple into new page
466481-- etc.
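
The "write-lock and initialize new page, with back link to former last page"
step amounts to setting up the new overflow page's special space and fixing
the chain links on both pages before the WAL record for the addition is
written.  A rough sketch, assuming the field and flag names from access/hash.h
(hasho_prevblkno, hasho_nextblkno, LH_OVERFLOW_PAGE, HASHO_PAGE_ID) and that
both buffers are already exclusively locked; the WAL record covering the
metapage, bitmap page, former last page and new page is omitted here:

    /* Sketch only: initialize a new overflow page and link it after "lastbuf". */
    static void
    init_and_link_ovfl_page(Buffer lastbuf, Buffer ovflbuf, Bucket bucket)
    {
        Page        ovflpage = BufferGetPage(ovflbuf);
        HashPageOpaque ovflopaque;
        HashPageOpaque lastopaque;

        PageInit(ovflpage, BufferGetPageSize(ovflbuf), sizeof(HashPageOpaqueData));

        ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
        ovflopaque->hasho_prevblkno = BufferGetBlockNumber(lastbuf);  /* back link */
        ovflopaque->hasho_nextblkno = InvalidBlockNumber;
        ovflopaque->hasho_bucket = bucket;
        ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
        ovflopaque->hasho_page_id = HASHO_PAGE_ID;

        lastopaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(lastbuf));
        lastopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf); /* forward link */

        MarkBufferDirty(ovflbuf);
        MarkBufferDirty(lastbuf);
    }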
467482
@@ -488,12 +503,14 @@ accessors of pages in the bucket. The algorithm is:
488503determine which bitmap page contains the free space bit for page
489504release meta page buffer content lock
490505pin bitmap page and take buffer content lock in exclusive mode
491- update bitmap bit
492- mark bitmap page dirty and release buffer content lock and pin
493- if page number is less than what we saw as first-free-bit in meta:
494506retake meta page buffer content lock in exclusive mode
507+ move (insert) tuples that belong to the overflow page being freed
508+ update bitmap bit
509+ mark bitmap page dirty
495510if page number is still less than first-free-bit,
496511update first-free-bit field and mark meta page dirty
512+ write WAL for delinking overflow page operation
513+ release buffer content lock and pin
497514release meta page buffer content lock and pin
498515
499516We have to do it this way because we must clear the bitmap bit before
@@ -504,8 +521,91 @@ page acquirer will scan more bitmap bits than he needs to. What must be
504521avoided is having first-free-bit greater than the actual first free bit,
505522because then that free page would never be found by searchers.
506523
507- All the freespace operations should be called while holding no buffer
508- locks. Since they need no lmgr locks, deadlock is not possible.
524+ The reason for moving tuples out of the overflow page while delinking it is
525+ to make those two changes a single atomic operation. Not doing so could lead
526+ to spurious reads on a standby: the user might see the same tuple twice.
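
Concretely, the tuple moves and the delink are described by a single WAL
record that registers every page touched, so replay applies all of the change
or none of it.  A condensed sketch of the registration, reusing the
conventions of the earlier sketches; the record name XLOG_HASH_SQUEEZE_PAGE is
assumed from access/hash_xlog.h, wbuf is the page receiving the tuples, and
the moved tuples' data and the chain-link updates are omitted:

    /* Sketch only: one record covers the page gaining tuples, the page being
     * freed, the bitmap page whose bit is cleared, and the metapage. */
    START_CRIT_SECTION();

    /* ... move tuples into wbuf, delink ovflbuf, clear the bitmap bit, update
     * first-free-bit in the metapage, marking each buffer dirty ... */

    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, wbuf, REGBUF_STANDARD);     /* receives tuples */
        XLogRegisterBuffer(1, ovflbuf, REGBUF_STANDARD);  /* page being freed */
        XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);   /* bitmap page */
        XLogRegisterBuffer(3, metabuf, REGBUF_STANDARD);  /* metapage */

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);

        PageSetLSN(BufferGetPage(wbuf), recptr);
        PageSetLSN(BufferGetPage(ovflbuf), recptr);
        PageSetLSN(BufferGetPage(mapbuf), recptr);
        PageSetLSN(BufferGetPage(metabuf), recptr);
    }

    END_CRIT_SECTION();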
527+
528+
529+ WAL Considerations
530+ ------------------
531+
532+ The hash index operations (create index, insert, delete, bucket split,
533+ allocate overflow page, and squeeze) do not by themselves guarantee hash
534+ index consistency after a crash. To provide robustness, we write WAL for
535+ each of these operations.
536+
537+ CREATE INDEX writes multiple WAL records. First, we write a record to cover
538+ the initialization of the metapage, followed by one for each new bucket
539+ created, followed by one for the initial bitmap page. It's not important for
540+ index creation to appear atomic, because the index isn't yet visible to any
541+ other transaction, and the creating transaction will roll back in the event of
542+ a crash. It would be difficult to cover the whole operation with a single
543+ write-ahead log record anyway, because we can log only a fixed number of
544+ pages, as given by XLR_MAX_BLOCK_ID (32), with current XLog machinery.
545+
546+ Ordinary item insertions (that don't force a page split or need a new overflow
547+ page) are single WAL entries. They touch a single bucket page and the
548+ metapage. The metapage is updated during replay just as it is during the
549+ original operation.
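
On the replay side, redo for such a record might look roughly like the sketch
below: the data page is touched only if it actually needs redo (it may already
be up to date thanks to a full-page image), and the metapage's tuple count is
bumped just as the original operation did.  Names such as xl_hash_insert,
XLogReadBufferForRedo and hashm_ntuples follow access/hash_xlog.h,
access/xlogutils.h and access/hash.h; this is a simplified illustration, not
the actual redo routine.

    /* Sketch only: redo an ordinary hash-index insert. */
    static void
    hash_insert_redo_sketch(XLogReaderState *record)
    {
        XLogRecPtr  lsn = record->EndRecPtr;
        xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
        Buffer      buf;
        Buffer      metabuf;

        if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
        {
            Size        len;
            char       *itup = XLogRecGetBlockData(record, 0, &len);
            Page        page = BufferGetPage(buf);

            /* re-insert the tuple at the offset recorded by the do path */
            if (PageAddItem(page, (Item) itup, len, xlrec->offnum,
                            false, false) == InvalidOffsetNumber)
                elog(PANIC, "hash insert redo: failed to add item");

            PageSetLSN(page, lsn);
            MarkBufferDirty(buf);
        }
        if (BufferIsValid(buf))
            UnlockReleaseBuffer(buf);

        if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
        {
            Page        metapage = BufferGetPage(metabuf);

            /* bump ntuples in the metapage, as the original insert did */
            HashPageGetMeta(metapage)->hashm_ntuples += 1;

            PageSetLSN(metapage, lsn);
            MarkBufferDirty(metabuf);
        }
        if (BufferIsValid(metabuf))
            UnlockReleaseBuffer(metabuf);
    }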
550+
551+ If an insertion causes the addition of an overflow page, there will be one
552+ WAL entry for the new overflow page and a second entry for the insert itself.
553+
554+ If an insertion causes a bucket split, there will be one WAL entry for the
555+ insert itself, followed by a WAL entry for allocating the new bucket, followed
556+ by a WAL entry for each overflow page of the new bucket to which tuples are
557+ moved from the old bucket, followed by a WAL entry to indicate that the split
558+ is complete for both old and new buckets. A split operation that requires
559+ overflow pages to complete will additionally need to write a WAL record for
560+ each overflow page it allocates.
561+
562+ As splitting involves multiple atomic actions, it's possible for the system to
563+ crash partway through moving tuples from the pages of the old bucket to the
564+ new bucket. In such a case, after recovery, the old and new buckets will still
565+ be marked with the bucket-being-split and bucket-being-populated flags,
566+ respectively, which indicate that a split is in progress for those buckets.
567+ The reader algorithm works correctly: it will scan both the old and new buckets
568+ while the split is in progress, as explained in the reader algorithm section above.
569+
570+ We finish the split at next insert or split operation on the old bucket as
571+ explained in insert and split algorithm above. It could be done during
572+ searches, too, but it seems best not to put any extra updates in what would
573+ otherwise be a read-only operation (updating is not possible in hot standby
574+ mode anyway). It would seem natural to complete the split in VACUUM, but since
575+ splitting a bucket might require allocating a new page, it might fail if you
576+ run out of disk space. That would be bad during VACUUM - the reason for
577+ running VACUUM in the first place might be that you have run out of disk space,
578+ and now VACUUM won't finish because you're out of disk space. In contrast,
579+ an insertion can require enlarging the physical file anyway.
580+
581+ Deletion of tuples from a bucket is performed for two reasons: to remove dead
582+ tuples, and to remove tuples that were moved by a bucket split. A WAL entry
583+ is made for each bucket page from which tuples are removed, and then another
584+ WAL entry is made when we clear the needs-split-cleanup flag. If dead tuples
585+ are removed, a separate WAL entry is made to update the metapage.
586+
587+ As deletion involves multiple atomic operations, it is quite possible that the
588+ system crashes (a) after removing tuples from only some of the bucket pages,
589+ (b) before clearing the garbage flag, or (c) before updating the metapage.
590+ If the system crashes before completing (b), it will again try to clean the
591+ bucket during the next vacuum or insert after recovery, which can have some
592+ performance impact, but it will work fine. If the system crashes before completing (c),
593+ after recovery there could be some additional splits until the next vacuum
594+ updates the metapage, but the other operations like insert, delete and scan
595+ will work correctly. We can fix this problem by actually updating the
596+ metapage based on delete operation during replay, but it's not clear whether
597+ it's worth the complication.
598+
599+ A squeeze operation moves tuples from one of the pages later in the bucket
600+ chain to one of the pages earlier in the chain, and writes a WAL record when
601+ either the page to which it is writing tuples becomes full or the page from
602+ which it is removing tuples becomes empty.
603+
604+ As a squeeze operation involves multiple atomic operations, it is quite
605+ possible that the system crashes before completing the operation on the
606+ entire bucket. After recovery, the operations will work correctly, but
607+ the index will remain bloated, which can hurt the performance of read and
608+ insert operations until the next vacuum squeezes the bucket completely.
509609
510610
511611Other Notes