@@ -268,65 +268,20 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
     HashScanOpaque so = (HashScanOpaque) scan->opaque;
-    Relation    rel = scan->indexRelation;
-    Buffer      buf;
-    Page        page;
-    OffsetNumber offnum;
-    ItemPointer current;
     bool        res;
 
     /* Hash indexes are always lossy since we store only the hash code */
     scan->xs_recheck = true;
 
-    /*
-     * We hold pin but not lock on current buffer while outside the hash AM.
-     * Reacquire the read lock here.
-     */
-    if (BufferIsValid(so->hashso_curbuf))
-        LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
     /*
      * If we've already initialized this scan, we can just advance it in the
      * appropriate direction.  If we haven't done so yet, we call a routine to
      * get the first item in the scan.
      */
-    current = &(so->hashso_curpos);
-    if (ItemPointerIsValid(current))
+    if (!HashScanPosIsValid(so->currPos))
+        res = _hash_first(scan, dir);
+    else
     {
-        /*
-         * An insertion into the current index page could have happened while
-         * we didn't have read lock on it.  Re-find our position by looking
-         * for the TID we previously returned.  (Because we hold a pin on the
-         * primary bucket page, no deletions or splits could have occurred;
-         * therefore we can expect that the TID still exists in the current
-         * index page, at an offset >= where we were.)
-         */
-        OffsetNumber maxoffnum;
-
-        buf = so->hashso_curbuf;
-        Assert(BufferIsValid(buf));
-        page = BufferGetPage(buf);
-
-        /*
-         * We don't need test for old snapshot here as the current buffer is
-         * pinned, so vacuum can't clean the page.
-         */
-        maxoffnum = PageGetMaxOffsetNumber(page);
-        for (offnum = ItemPointerGetOffsetNumber(current);
-             offnum <= maxoffnum;
-             offnum = OffsetNumberNext(offnum))
-        {
-            IndexTuple  itup;
-
-            itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-            if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-                break;
-        }
-        if (offnum > maxoffnum)
-            elog(ERROR, "failed to re-find scan position within index \"%s\"",
-                 RelationGetRelationName(rel));
-        ItemPointerSetOffsetNumber(current, offnum);
-
         /*
          * Check to see if we should kill the previously-fetched tuple.
          */
@@ -341,47 +296,18 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
             * entries.
             */
            if (so->killedItems == NULL)
-               so->killedItems = palloc(MaxIndexTuplesPerPage *
-                                        sizeof(HashScanPosItem));
+               so->killedItems = (int *)
+                   palloc(MaxIndexTuplesPerPage * sizeof(int));
 
            if (so->numKilled < MaxIndexTuplesPerPage)
-           {
-               so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-               so->killedItems[so->numKilled].indexOffset =
-                   ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-               so->numKilled++;
-           }
+               so->killedItems[so->numKilled++] = so->currPos.itemIndex;
        }
 
        /*
        * Now continue the scan.
        */
        res = _hash_next(scan, dir);
    }
-    else
-        res = _hash_first(scan, dir);
-
-    /*
-     * Skip killed tuples if asked to.
-     */
-    if (scan->ignore_killed_tuples)
-    {
-        while (res)
-        {
-            offnum = ItemPointerGetOffsetNumber(current);
-            page = BufferGetPage(so->hashso_curbuf);
-            if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-                break;
-            res = _hash_next(scan, dir);
-        }
-    }
-
-    /* Release read lock on current buffer, but keep it pinned */
-    if (BufferIsValid(so->hashso_curbuf))
-        LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
-
-    /* Return current heap TID on success */
-    scan->xs_ctup.t_self = so->hashso_heappos;
 
     return res;
 }
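The currPos field and the HashScanPosIsValid test above belong to the page-at-a-time scan state this patch introduces: a whole page's matching tuples are batched into local memory, which is why hashgettuple no longer has to relock the buffer and re-find its position on every call, and why a killed tuple can now be remembered as a plain int index into the batch rather than a full TID/offset pair. A minimal sketch of what that state might look like, using only the fields this diff itself relies on (the real definitions live in the hash AM's header and carry more fields):

    /* Sketch only: per-scan position state assumed by the code above. */
    typedef struct HashScanPosItem
    {
        ItemPointerData heapTid;        /* TID of referenced heap item */
        OffsetNumber    indexOffset;    /* index item's location within page */
    } HashScanPosItem;

    typedef struct HashScanPosData
    {
        Buffer      buf;                /* if valid, the buffer is pinned */
        BlockNumber currPage;           /* current hash index page */
        int         firstItem;          /* first valid index in items[] */
        int         lastItem;           /* last valid index in items[] */
        int         itemIndex;          /* current index in items[] */
        HashScanPosItem items[MaxIndexTuplesPerPage];   /* must be last */
    } HashScanPosData;

    /* Valid once a page's matching tuples have been batched into items[]. */
    #define HashScanPosIsValid(scanpos) \
        BlockNumberIsValid((scanpos).currPage)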
@@ -396,35 +322,21 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
     HashScanOpaque so = (HashScanOpaque) scan->opaque;
     bool        res;
     int64       ntids = 0;
+    HashScanPosItem *currItem;
 
     res = _hash_first(scan, ForwardScanDirection);
 
     while (res)
     {
-        bool        add_tuple;
+        currItem = &so->currPos.items[so->currPos.itemIndex];
 
         /*
-         * Skip killed tuples if asked to.
+         * _hash_first and _hash_next handle eliminating dead index entries
+         * whenever scan->ignore_killed_tuples is true.  Therefore, there's
+         * nothing to do here except add the results to the TIDBitmap.
          */
-        if (scan->ignore_killed_tuples)
-        {
-            Page        page;
-            OffsetNumber offnum;
-
-            offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-            page = BufferGetPage(so->hashso_curbuf);
-            add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-        }
-        else
-            add_tuple = true;
-
-        /* Save tuple ID, and continue scanning */
-        if (add_tuple)
-        {
-            /* Note we mark the tuple ID as requiring recheck */
-            tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-            ntids++;
-        }
+        tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+        ntids++;
 
         res = _hash_next(scan, ForwardScanDirection);
     }
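With the batching in place, _hash_next can satisfy most calls without touching the buffer manager at all. A sketch of the consumption pattern the new hashgetbitmap loop depends on, forward direction only; load_next_page_items() is a hypothetical stand-in for the real page-reading logic:

    /* Sketch only: advance within the batched items[]; refill at page end. */
    static bool
    sketch_hash_next(HashScanOpaque so)
    {
        if (++so->currPos.itemIndex > so->currPos.lastItem)
        {
            /* Current page exhausted; batch the next page, if any. */
            if (!load_next_page_items(so))      /* hypothetical helper */
                return false;                   /* end of scan */
            so->currPos.itemIndex = so->currPos.firstItem;
        }
        /* so->currPos.items[so->currPos.itemIndex] now names the result */
        return true;
    }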
@@ -448,12 +360,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
     scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
     so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-    so->hashso_curbuf = InvalidBuffer;
+    HashScanPosInvalidate(so->currPos);
     so->hashso_bucket_buf = InvalidBuffer;
     so->hashso_split_bucket_buf = InvalidBuffer;
-    /* set position invalid (this will cause _hash_first call) */
-    ItemPointerSetInvalid(&(so->hashso_curpos));
-    ItemPointerSetInvalid(&(so->hashso_heappos));
 
     so->hashso_buc_populated = false;
     so->hashso_buc_split = false;
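HashScanPosInvalidate presumably resets every field of the position struct so that the next hashgettuple call takes the _hash_first path. A sketch consistent with the fields assumed earlier:

    /* Sketch only: reset currPos so HashScanPosIsValid() reports false. */
    #define HashScanPosInvalidate(scanpos) \
        do { \
            (scanpos).buf = InvalidBuffer; \
            (scanpos).currPage = InvalidBlockNumber; \
            (scanpos).firstItem = 0; \
            (scanpos).lastItem = 0; \
            (scanpos).itemIndex = 0; \
        } while (0)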
@@ -476,22 +385,17 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
     HashScanOpaque so = (HashScanOpaque) scan->opaque;
     Relation    rel = scan->indexRelation;
 
-    /*
-     * Before leaving current page, deal with any killed items. Also, ensure
-     * that we acquire lock on current page before calling _hash_kill_items.
-     */
-    if (so->numKilled > 0)
+    if (HashScanPosIsValid(so->currPos))
     {
-        LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-        _hash_kill_items(scan);
-        LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+        /* Before leaving current page, deal with any killed items */
+        if (so->numKilled > 0)
+            _hash_kill_items(scan);
     }
 
     _hash_dropscanbuf(rel, so);
 
     /* set position invalid (this will cause _hash_first call) */
-    ItemPointerSetInvalid(&(so->hashso_curpos));
-    ItemPointerSetInvalid(&(so->hashso_heappos));
+    HashScanPosInvalidate(so->currPos);
 
     /* Update scan key, if a new one is given */
     if (scankey && scan->numberOfKeys > 0)
@@ -514,15 +418,11 @@ hashendscan(IndexScanDesc scan)
     HashScanOpaque so = (HashScanOpaque) scan->opaque;
     Relation    rel = scan->indexRelation;
 
-    /*
-     * Before leaving current page, deal with any killed items. Also, ensure
-     * that we acquire lock on current page before calling _hash_kill_items.
-     */
-    if (so->numKilled > 0)
+    if (HashScanPosIsValid(so->currPos))
     {
-        LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-        _hash_kill_items(scan);
-        LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+        /* Before leaving current page, deal with any killed items */
+        if (so->numKilled > 0)
+            _hash_kill_items(scan);
     }
 
     _hash_dropscanbuf(rel, so);
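Note the pattern shared by hashrescan and hashendscan: the callers no longer bracket _hash_kill_items with LockBuffer calls, and the guard widens from numKilled alone to HashScanPosIsValid. The implication is that _hash_kill_items now manages its own locking, roughly along these lines (a sketch of the assumed behavior, not the function's actual body):

    /* Sketch only: _hash_kill_items takes the lock itself when the
     * scan position still holds a pinned buffer. */
    if (BufferIsValid(so->currPos.buf))
    {
        LockBuffer(so->currPos.buf, BUFFER_LOCK_SHARE);
        /* ... mark each killedItems[] entry's line pointer dead ... */
        LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
    }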
@@ -755,16 +655,15 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * primary bucket page.  The lock won't necessarily be held continuously,
  * though, because we'll release it when visiting overflow pages.
  *
- * It would be very bad if this function cleaned a page while some other
- * backend was in the midst of scanning it, because hashgettuple assumes
- * that the next valid TID will be greater than or equal to the current
- * valid TID.  There can't be any concurrent scans in progress when we first
- * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * There can't be any concurrent scans in progress when we first enter this
+ * function because of the cleanup lock we hold on the primary bucket page,
+ * but as soon as we release that lock, there might be.  If those scans got
+ * ahead of our cleanup scan, they might see a tuple before we kill it and
+ * wake up only after VACUUM has completed and the TID has been recycled for
+ * an unrelated tuple.  To avoid that calamity, we prevent scans from passing
+ * our cleanup scan by locking the next page in the bucket chain before
+ * releasing the lock on the previous page.  (This type of lock chaining is
+ * not ideal, so we might want to look for a better solution at some point.)
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
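The chaining that comment describes is easiest to see in code. This sketch leans on the hash AM's _hash_getbuf/_hash_relbuf helpers for illustration; it is a simplified picture, not the actual cleanup loop:

    /* Sketch only: lock page N+1 before releasing page N, so no concurrent
     * scan can overtake the cleanup scan within the bucket chain. */
    buf = bucket_buf;               /* cleanup lock on primary bucket held */
    for (;;)
    {
        Page            page = BufferGetPage(buf);
        HashPageOpaque  opaque = (HashPageOpaque) PageGetSpecialPointer(page);
        BlockNumber     next_blkno = opaque->hasho_nextblkno;
        Buffer          next_buf;

        /* ... remove killable tuples from this page ... */

        if (!BlockNumberIsValid(next_blkno))
            break;                  /* end of bucket chain */

        /* Lock the next page while still holding this page's lock. */
        next_buf = _hash_getbuf(rel, next_blkno, HASH_WRITE, LH_OVERFLOW_PAGE);

        /* Only now is it safe to let go of the current page. */
        if (buf == bucket_buf)
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* keep pin: blocks splits */
        else
            _hash_relbuf(rel, buf);                 /* unlock and unpin */

        buf = next_buf;
    }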