Commit c4afdca
Fix two serious bugs introduced into hash indexes by the 8.4 patch that made
hash indexes keep entries sorted by hash value.

First, the original plans for concurrency assumed that insertions would happen
only at the end of a page, which is no longer true; this could cause scans to
transiently fail to find index entries in the presence of concurrent
insertions.  We can compensate by teaching scans to re-find their position
after re-acquiring read locks.

Second, neither the bucket split nor the bucket compaction logic had been
fixed to preserve hashvalue ordering, so application of either of those
processes could lead to permanent corruption of an index, in the sense that
searches might fail to find entries that are present.

This patch fixes the split and compaction logic to preserve hashvalue
ordering, but it cannot do anything about pre-existing corruption.  We will
need to recommend reindexing all hash indexes in the 8.4.2 release notes.

To buy back the performance loss hereby induced in split and compaction, fix
them to use PageIndexMultiDelete instead of retail PageIndexDelete operations.
We might later want to do something with qsort'ing the page contents rather
than doing a binary search for each insertion, but that seemed more invasive
than I cared to risk in a back-patch.

Per bug #5157 from Jeff Janes and subsequent investigation.
1 parent ef59fa0 · commit c4afdca

File tree

6 files changed: +250 / -190 lines changed

src/backend/access/hash/README

Lines changed: 35 additions & 15 deletions

@@ -1,14 +1,15 @@
-$PostgreSQL: pgsql/src/backend/access/hash/README,v 1.8 2008/03/20 17:55:14 momjian Exp $
+$PostgreSQL: pgsql/src/backend/access/hash/README,v 1.9 2009/11/01 21:25:25 tgl Exp $
 
 Hash Indexing
 =============
 
-This directory contains an implementation of hash indexing for Postgres. Most
-of the core ideas are taken from Margo Seltzer and Ozan Yigit, A New Hashing
-Package for UNIX, Proceedings of the Winter USENIX Conference, January 1991.
-(Our in-memory hashtable implementation, src/backend/utils/hash/dynahash.c,
-also relies on some of the same concepts; it is derived from code written by
-Esmond Pitt and later improved by Margo among others.)
+This directory contains an implementation of hash indexing for Postgres.
+Most of the core ideas are taken from Margo Seltzer and Ozan Yigit,
+A New Hashing Package for UNIX, Proceedings of the Winter USENIX Conference,
+January 1991.  (Our in-memory hashtable implementation,
+src/backend/utils/hash/dynahash.c, also relies on some of the same concepts;
+it is derived from code written by Esmond Pitt and later improved by Margo
+among others.)
 
 A hash index consists of two or more "buckets", into which tuples are
 placed whenever their hash key maps to the bucket number.  The

@@ -32,6 +33,14 @@ rebuilding it with REINDEX.  Overflow pages can be recycled for reuse
 in other buckets, but we never give them back to the operating system.
 There is no provision for reducing the number of buckets, either.
 
+As of PostgreSQL 8.4, hash index entries store only the hash code, not the
+actual data value, for each indexed item.  This makes the index entries
+smaller (perhaps very substantially so) and speeds up various operations.
+In particular, we can speed searches by keeping the index entries in any
+one index page sorted by hash code, thus allowing binary search to be used
+within an index page.  Note however that there is *no* assumption about the
+relative ordering of hash codes across different index pages of a bucket.
+
 Page Addressing
 ---------------

@@ -205,8 +214,18 @@ the reader ensures that the target bucket calculation is valid (otherwise
 the bucket might be split before the reader arrives at it, and the target
 entries might go into the new bucket).  Holding the bucket sharelock for
 the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by other processes.  Notice though that the reader need
-not prevent other buckets from being split or compacted.
+being invalidated by splits or compactions.  Notice that the reader's lock
+does not prevent other buckets from being split or compacted.
+
+To keep concurrency reasonably good, we require readers to cope with
+concurrent insertions, which means that they have to be able to re-find
+their current scan position after re-acquiring the page sharelock.  Since
+deletion is not possible while a reader holds the bucket sharelock, and
+we assume that heap tuple TIDs are unique, this can be implemented by
+searching for the same heap tuple TID previously returned.  Insertion does
+not move index entries across pages, so the previously-returned index entry
+should always be on the same page, at the same or higher offset number,
+as it was before.
 
 The insertion algorithm is rather similar:

@@ -220,19 +239,20 @@ The insertion algorithm is rather similar:
 	read/exclusive-lock current page of bucket
 	if full, release, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
-	insert tuple
+	insert tuple at appropriate place in page
 	write/release current page
 	release bucket share-lock
 	read/exclusive-lock meta page
 	increment tuple count, decide if split needed
 	write/release meta page
 	done if no split needed, else enter Split algorithm below
 
-It is okay for an insertion to take place in a bucket that is being
-actively scanned, because it does not change the position of any existing
-item in the bucket, so scan states are not invalidated.  We only need the
-short-term buffer locks to ensure that readers do not see a
-partially-updated page.
+To speed searches, the index entries within any individual index page are
+kept sorted by hash code; the insertion code must take care to insert new
+entries in the right place.  It is okay for an insertion to take place in a
+bucket that is being actively scanned, because readers can cope with this
+as explained above.  We only need the short-term buffer locks to ensure
+that readers do not see a partially-updated page.
 
 It is clearly impossible for readers and inserters to deadlock, and in
 fact this algorithm allows them a very high degree of concurrency.
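The reader re-find step described above (scan forward from the old offset for the previously-returned heap TID) can be sketched in isolation. This is a minimal standalone model, not PostgreSQL code: the `Tid` struct and the `refind_position` helper are illustrative stand-ins for the real ItemPointer machinery, using a plain array in place of a page.

```c
#include <assert.h>

/* Illustrative stand-in for an index entry's heap TID. */
typedef struct { unsigned blkno; unsigned short offset; } Tid;

static int tid_equals(const Tid *a, const Tid *b)
{
    return a->blkno == b->blkno && a->offset == b->offset;
}

/*
 * Re-find a scan position after concurrent insertions may have pushed the
 * previously-returned entry to a higher index on the same page.  Because
 * the reader's bucket sharelock blocks deletions and splits, the entry can
 * only have moved forward within this page, so a forward linear scan from
 * the old position suffices.  Returns the new 0-based index, or -1 if the
 * TID is gone (the real code treats that as a can't-happen error).
 */
static int refind_position(const Tid *items, int nitems, int old_index,
                           const Tid *target)
{
    for (int i = old_index; i < nitems; i++)
        if (tid_equals(&items[i], target))
            return i;
    return -1;
}
```

The unique-TID assumption stated in the README is what makes this search unambiguous: at most one entry on the page can match.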

src/backend/access/hash/hash.c

Lines changed: 49 additions & 22 deletions

@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/hash/hash.c,v 1.113 2009/07/29 20:56:18 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/hash/hash.c,v 1.114 2009/11/01 21:25:25 tgl Exp $
  *
  * NOTES
  *	  This file contains only the public interface routines.

@@ -206,8 +206,10 @@ hashgettuple(PG_FUNCTION_ARGS)
 	ScanDirection dir = (ScanDirection) PG_GETARG_INT32(1);
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
+	Buffer		buf;
 	Page		page;
 	OffsetNumber offnum;
+	ItemPointer current;
 	bool		res;
 
 	/* Hash indexes are always lossy since we store only the hash code */

@@ -225,8 +227,38 @@ hashgettuple(PG_FUNCTION_ARGS)
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	if (ItemPointerIsValid(&(so->hashso_curpos)))
+	current = &(so->hashso_curpos);
+	if (ItemPointerIsValid(current))
 	{
+		/*
+		 * An insertion into the current index page could have happened while
+		 * we didn't have read lock on it.  Re-find our position by looking
+		 * for the TID we previously returned.  (Because we hold share lock on
+		 * the bucket, no deletions or splits could have occurred; therefore
+		 * we can expect that the TID still exists in the current index page,
+		 * at an offset >= where we were.)
+		 */
+		OffsetNumber maxoffnum;
+
+		buf = so->hashso_curbuf;
+		Assert(BufferIsValid(buf));
+		page = BufferGetPage(buf);
+		maxoffnum = PageGetMaxOffsetNumber(page);
+		for (offnum = ItemPointerGetOffsetNumber(current);
+			 offnum <= maxoffnum;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			IndexTuple	itup;
+
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+			if (ItemPointerEquals(&scan->xs_ctup.t_self, &itup->t_tid))
+				break;
+		}
+		if (offnum > maxoffnum)
+			elog(ERROR, "failed to re-find scan position within index \"%s\"",
+				 RelationGetRelationName(rel));
+		ItemPointerSetOffsetNumber(current, offnum);
+
 		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */

@@ -235,16 +267,14 @@ hashgettuple(PG_FUNCTION_ARGS)
 			/*
 			 * Yes, so mark it by setting the LP_DEAD state in the item flags.
 			 */
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
 			ItemIdMarkDead(PageGetItemId(page, offnum));
 
 			/*
 			 * Since this can be redone later if needed, it's treated the same
 			 * as a commit-hint-bit status update for heap tuples: we mark the
 			 * buffer dirty but don't make a WAL log entry.
 			 */
-			SetBufferCommitInfoNeedsSave(so->hashso_curbuf);
+			SetBufferCommitInfoNeedsSave(buf);
 		}
 
 		/*

@@ -262,7 +292,7 @@ hashgettuple(PG_FUNCTION_ARGS)
 	{
 		while (res)
 		{
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+			offnum = ItemPointerGetOffsetNumber(current);
 			page = BufferGetPage(so->hashso_curbuf);
 			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
 				break;

@@ -517,7 +547,8 @@ hashbulkdelete(PG_FUNCTION_ARGS)
 		HashPageOpaque opaque;
 		OffsetNumber offno;
 		OffsetNumber maxoffno;
-		bool		page_dirty = false;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
 
 		vacuum_delay_point();
 

@@ -529,9 +560,10 @@ hashbulkdelete(PG_FUNCTION_ARGS)
 		Assert(opaque->hasho_bucket == cur_bucket);
 
 		/* Scan each tuple in page */
-		offno = FirstOffsetNumber;
 		maxoffno = PageGetMaxOffsetNumber(page);
-		while (offno <= maxoffno)
+		for (offno = FirstOffsetNumber;
+			 offno <= maxoffno;
+			 offno = OffsetNumberNext(offno))
 		{
 			IndexTuple	itup;
 			ItemPointer htup;

@@ -541,30 +573,25 @@ hashbulkdelete(PG_FUNCTION_ARGS)
 			htup = &(itup->t_tid);
 			if (callback(htup, callback_state))
 			{
-				/* delete the item from the page */
-				PageIndexTupleDelete(page, offno);
-				bucket_dirty = page_dirty = true;
-
-				/* don't increment offno, instead decrement maxoffno */
-				maxoffno = OffsetNumberPrev(maxoffno);
-
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
 				tuples_removed += 1;
 			}
 			else
-			{
-				offno = OffsetNumberNext(offno);
-
 				num_index_tuples += 1;
-			}
 		}
 
 		/*
-		 * Write page if needed, advance to next page.
+		 * Apply deletions and write page if needed, advance to next page.
 		 */
 		blkno = opaque->hasho_nextblkno;
 
-		if (page_dirty)
+		if (ndeletable > 0)
+		{
+			PageIndexMultiDelete(page, deletable, ndeletable);
 			_hash_wrtbuf(rel, buf);
+			bucket_dirty = true;
+		}
 		else
 			_hash_relbuf(rel, buf);
 	}
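The hashbulkdelete change above collects doomed offsets and removes them in one PageIndexMultiDelete call instead of one PageIndexTupleDelete per item. The win can be modeled outside PostgreSQL as a single left-to-right compaction pass; this sketch uses a bare unsigned array for page items, and `multi_delete` is an invented name for illustration, not the real page API.

```c
#include <assert.h>

/*
 * Delete several items from an array in one compaction pass, analogous to
 * PageIndexMultiDelete: `deletable` lists the doomed 0-based indexes in
 * ascending order.  Returns the new item count.  Doing this once per page
 * avoids both the repeated shifting cost of one-at-a-time deletion and the
 * offset-renumbering bookkeeping that retail deletes force on the caller.
 */
static int multi_delete(unsigned *items, int nitems,
                        const int *deletable, int ndeletable)
{
    int src, dst = 0, d = 0;

    for (src = 0; src < nitems; src++)
    {
        if (d < ndeletable && deletable[d] == src)
        {
            d++;                /* this item is being deleted: skip it */
            continue;
        }
        items[dst++] = items[src];  /* survivor: slide left into place */
    }
    return dst;
}
```

Note that because surviving items keep their relative order, a page kept sorted by hash code stays sorted after the batch delete, which is exactly the invariant the commit is repairing in the split and compaction paths.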

src/backend/access/hash/hashinsert.c

Lines changed: 10 additions & 13 deletions

@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/hash/hashinsert.c,v 1.52 2009/01/01 17:23:35 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/hash/hashinsert.c,v 1.53 2009/11/01 21:25:25 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */

@@ -20,10 +20,6 @@
 #include "utils/rel.h"
 
 
-static OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
-				Size itemsize, IndexTuple itup);
-
-
 /*
  *	_hash_doinsert() -- Handle insertion of a single index tuple.
  *

@@ -180,15 +176,16 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 /*
  *	_hash_pgaddtup() -- add a tuple to a particular page in the index.
  *
- *		This routine adds the tuple to the page as requested; it does
- *		not write out the page.  It is an error to call pgaddtup() without
- *		a write lock and pin.
+ * This routine adds the tuple to the page as requested; it does not write out
+ * the page.  It is an error to call pgaddtup() without pin and write lock on
+ * the target buffer.
+ *
+ * Returns the offset number at which the tuple was inserted.  This function
+ * is responsible for preserving the condition that tuples in a hash index
+ * page are sorted by hashkey value.
 */
-static OffsetNumber
-_hash_pgaddtup(Relation rel,
-			   Buffer buf,
-			   Size itemsize,
-			   IndexTuple itup)
+OffsetNumber
+_hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup)
 {
 	OffsetNumber itup_off;
 	Page		page;
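Since 8.4, each hash index page keeps its entries sorted by hash code, so _hash_pgaddtup must locate the correct slot before inserting. A minimal sketch of that position search, under the assumption that it is a standard lower-bound binary search over the page's hash codes (`insert_position` is my own helper name, operating on a bare array rather than a real page):

```c
#include <assert.h>

/*
 * Return the 0-based position at which `hashkey` should be inserted to keep
 * `hashes` sorted ascending: the first index whose value is >= hashkey.
 * Entries with equal hash codes may land in any relative order, which is
 * acceptable because, per the README, searches compare only hash values and
 * make no assumption about ordering among equal codes.
 */
static int insert_position(const unsigned *hashes, int n, unsigned hashkey)
{
    int lo = 0, hi = n;         /* search the half-open range [lo, hi) */

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (hashes[mid] < hashkey)
            lo = mid + 1;       /* answer lies strictly to the right */
        else
            hi = mid;           /* mid is a candidate; keep it in range */
    }
    return lo;
}
```

The commit message floats qsort'ing the whole page as a possible alternative to one binary search per insertion, but deliberately leaves that for later as too invasive for a back-patch.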
