Commit 9de3aa6

Rewrite the GiST insertion logic so that we don't need the post-recovery cleanup stage to finish incomplete inserts or splits anymore. There were two reasons for the cleanup step:

1. When a new tuple was inserted to a leaf page, the downlink in the parent needed to be updated to contain (ie. to be consistent with) the new key. Updating the parent in turn might require recursively updating the parent of the parent. We now handle that by updating the parent while traversing down the tree, so that when we insert the leaf tuple, all the parents are already consistent with the new key, and the tree is consistent at every step.

2. When a page is split, we need to insert the downlink for the new right page(s), and update the downlink for the original page to not include keys that moved to the right page(s). We now handle that by setting a new flag, F_FOLLOW_RIGHT, on the non-rightmost pages in the split. When that flag is set, scans always follow the rightlink, regardless of the NSN mechanism used to detect concurrent page splits. That way the tree is consistent right after split, even though the downlink is still missing. This is very similar to the way B-tree splits are handled. When the downlink is inserted in the parent, the flag is cleared. To keep the insertion algorithm simple, when an insertion sees an incomplete split, indicated by the F_FOLLOW_RIGHT flag, it finishes the split before doing anything else.

These changes allow removing the whole "invalid tuple" mechanism, but I retained the scan code to still follow invalid tuples correctly. While we don't create any such tuples anymore, we want to handle them gracefully in case you pg_upgrade a GiST index that has them. If we encounter any on an insert, though, we just throw an error saying that you need to REINDEX.

The issue that got me into doing this is that if you did a checkpoint while an insert or split was in progress, and the checkpoint finishes quickly so that there is no WAL record related to the insert between RedoRecPtr and the checkpoint record, recovery from that checkpoint would not know to finish the incomplete insert. IOW, we have the same issue we solved with the rm_safe_restartpoint mechanism during normal operation too. It's highly unlikely to happen in practice, and this fix is far too large to backpatch, so we're just going to live with it in previous versions, but this refactoring fixes it going forward.

With this patch, you don't get the annoying 'index "FOO" needs VACUUM or REINDEX to finish crash recovery' notices anymore if you crash at an unfortunate moment.
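
To make the first point of the commit message concrete, here is a minimal sketch of a descent that enlarges each downlink key on the way down, so that the parents already cover the new key when the leaf is reached. It is not the PostgreSQL code: the `Key` interval type, the `Page` struct, `MAX_ENTRIES`, and the `sketch_insert`, `key_union`, `key_equal`, and `key_penalty` names are all hypothetical, keys are 1-D intervals, pages live in plain memory, and locking, WAL logging, and page splits are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical 1-D bounding key: a closed interval [lo, hi]. */
typedef struct { int lo, hi; } Key;

#define MAX_ENTRIES 8   /* assumed page capacity, for illustration only */

/* Hypothetical in-memory page; real GiST pages live in shared buffers. */
typedef struct Page {
    bool         is_leaf;
    int          nentries;
    Key          keys[MAX_ENTRIES];     /* leaf: tuple keys; internal: downlink keys */
    struct Page *children[MAX_ENTRIES]; /* used by internal pages only */
} Page;

/* Union: the smallest interval covering both arguments. */
static Key key_union(Key a, Key b)
{
    Key u = { a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi };
    return u;
}

static bool key_equal(Key a, Key b)
{
    return a.lo == b.lo && a.hi == b.hi;
}

/* Penalty: how much the downlink key has to grow to cover the new key. */
static int key_penalty(Key downlink, Key newkey)
{
    Key u = key_union(downlink, newkey);
    return (u.hi - u.lo) - (downlink.hi - downlink.lo);
}

/*
 * Walk down from the root along the path of smallest penalty, replacing
 * each downlink key with its union with the new key before descending.
 * Internal pages are assumed to be non-empty. Returns false if the leaf
 * is full; a real implementation would split the page at that point.
 */
static bool sketch_insert(Page *root, Key newkey)
{
    Page *page = root;

    while (!page->is_leaf)
    {
        int best = 0;

        for (int i = 1; i < page->nentries; i++)
            if (key_penalty(page->keys[i], newkey) <
                key_penalty(page->keys[best], newkey))
                best = i;

        /* Always compute the union; skip the replacement if nothing changed. */
        Key u = key_union(page->keys[best], newkey);
        if (!key_equal(u, page->keys[best]))
            page->keys[best] = u;

        page = page->children[best];
    }

    if (page->nentries >= MAX_ENTRIES)
        return false;

    page->keys[page->nentries++] = newkey;
    return true;
}
```

Because the downlink is enlarged before the descent continues, a crash right after the leaf insertion in this model leaves no parent key that fails to cover the new tuple, which is the property the commit relies on.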

1 parent 7a1ca89, commit 9de3aa6

File tree

11 files changed, +1030 -1235 lines changed

doc/src/sgml
- gist.sgml
src
- backend/access
  - gist
  - transam
    - rmgr.c
- include/access
  - gist.h
  - gist_private.h

`doc/src/sgml/gist.sgml`

Lines changed: 0 additions & 29 deletions

Original file line number	Diff line number	Diff line change
`@@ -709,33 +709,4 @@ my_distance(PG_FUNCTION_ARGS)`
`709`	`709`
`710`	`710`	`</sect1>`
`711`	`711`
`712`		`-<sect1 id="gist-recovery">`
`713`		`- <title>Crash Recovery</title>`
`714`		`-`
`715`		`- <para>`
`716`		`- Usually, replay of the WAL log is sufficient to restore the integrity`
`717`		`- of a GiST index following a database crash. However, there are some`
`718`		`- corner cases in which the index state is not fully rebuilt. The index`
`719`		`- will still be functionally correct, but there might be some performance`
`720`		`- degradation. When this occurs, the index can be repaired by`
`721`		`- <command>VACUUM</>ing its table, or by rebuilding the index using`
`722`		`- <command>REINDEX</>. In some cases a plain <command>VACUUM</> is`
`723`		`- not sufficient, and either <command>VACUUM FULL</> or <command>REINDEX</>`
`724`		`- is needed. The need for one of these procedures is indicated by occurrence`
`725`		`- of this log message during crash recovery:`
`726`		`-<programlisting>`
`727`		`-LOG: index NNN/NNN/NNN needs VACUUM or REINDEX to finish crash recovery`
`728`		`-</programlisting>`
`729`		`- or this log message during routine index insertions:`
`730`		`-<programlisting>`
`731`		`-LOG: index "FOO" needs VACUUM or REINDEX to finish crash recovery`
`732`		`-</programlisting>`
`733`		`- If a plain <command>VACUUM</> finds itself unable to complete recovery`
`734`		`- fully, it will return a notice:`
`735`		`-<programlisting>`
`736`		`-NOTICE: index "FOO" needs VACUUM FULL or REINDEX to finish crash recovery`
`737`		`-</programlisting>`
`738`		`- </para>`
`739`		`-</sect1>`
`740`		`-`
`741`	`712`	`</chapter>`

`src/backend/access/gist/README`

Lines changed: 111 additions & 70 deletions

Original file line number	Diff line number	Diff line change
`@@ -108,43 +108,71 @@ Penalty is used for choosing a subtree to insert; method PickSplit is used for`
`108`	`108`	`the node splitting algorithm; method Union is used for propagating changes`
`109`	`109`	`upward to maintain the tree properties.`
`110`	`110`
`111`		`-NOTICE: We modified original INSERT algorithm for performance reason. In`
`112`		`-particularly, it is now a single-pass algorithm.`
`113`		`-`
`114`		`-Function findLeaf is used to identify subtree for insertion. Page, in which`
`115`		`-insertion is proceeded, is locked as well as its parent page. Functions`
`116`		`-findParent and findPath are used to find parent pages, which could be changed`
`117`		`-because of concurrent access. Function pageSplit is recurrent and could split`
`118`		`-page by more than 2 pages, which could be necessary if keys have different`
`119`		`-lengths or more than one key are inserted (in such situation, user defined`
`120`		`-function pickSplit cannot guarantee free space on page).`
`121`		`-`
`122`		`-findLeaf(new-key)`
`123`		`-push(stack, [root, 0]) //page, LSN`
`124`		`-while(true)`
`125`		`-ptr = top of stack`
`126`		`-latch( ptr->page, S-mode )`
`127`		`-ptr->lsn = ptr->page->lsn`
`128`		`-if ( exists ptr->parent AND ptr->parent->lsn < ptr->page->nsn )`
`129`		`-unlatch( ptr->page )`
`130`		`-pop stack`
`131`		`-else if ( ptr->page is not leaf )`
`132`		`-push( stack, [get_best_child(ptr->page, new-key), 0] )`
`133`		`-unlatch( ptr->page )`
`134`		`-else`
`135`		`-unlatch( ptr->page )`
`136`		`-latch( ptr->page, X-mode )`
`137`		`-if ( ptr->page is not leaf )`
`138`		`-//the only root page can become a non-leaf`
`139`		`-unlatch( ptr->page )`
`140`		`-else if ( ptr->parent->lsn < ptr->page->nsn )`
`141`		`-unlatch( ptr->page )`
`142`		`-pop stack`
`143`		`-else`
`144`		`-return stack`
`145`		`-end`
`146`		`-end`
`147`		`-end`
	`111`	`+To insert a tuple, we first have to find a suitable leaf page to insert to.`
	`112`	`+The algorithm walks down the tree, starting from the root, along the path`
	`113`	`+of smallest Penalty. At each step:`
	`114`	`+`
	`115`	`+1. Has this page been split since we looked at the parent? If so, it's`
	`116`	`+possible that we should be inserting to the other half instead, so retreat`
	`117`	`+back to the parent.`
	`118`	`+2. If this is a leaf node, we've found our target node.`
	`119`	`+3. Otherwise use Penalty to pick a new target subtree.`
	`120`	`+4. Check the key representing the target subtree. If it doesn't already cover`
	`121`	`+the key we're inserting, replace it with the Union of the old downlink key`
	`122`	`+and the key being inserted. (Actually, we always call Union, and just skip`
	`123`	`+the replacement if the Unioned key is the same as the existing key)`
	`124`	`+5. Replacing the key in step 4 might cause the page to be split. In that case,`
	`125`	`+propagate the change upwards and restart the algorithm from the first parent`
	`126`	`+that didn't need to be split.`
	`127`	`+6. Walk down to the target subtree, and goto 1.`
	`128`	`+`
	`129`	`+This differs from the insertion algorithm in the original paper. In the`
	`130`	`+original paper, you first walk down the tree until you reach a leaf page, and`
	`131`	`+then you adjust the downlink in the parent, and propagating the adjustment up,`
	`132`	`+all the way up to the root in the worst case. But we adjust the downlinks to`
	`133`	`+cover the new key already when we walk down, so that when we reach the leaf`
	`134`	`+page, we don't need to update the parents anymore, except to insert the`
	`135`	`+downlinks if we have to split the page. This makes crash recovery simpler:`
	`136`	`+after inserting a key to the page, the tree is immediately self-consistent`
	`137`	`+without having to update the parents. Even if we split a page and crash before`
	`138`	`+inserting the downlink to the parent, the tree is self-consistent because the`
	`139`	`+right half of the split is accessible via the rightlink of the left page`
	`140`	`+(which replaced the original page).`
	`141`	`+`
	`142`	`+Note that the algorithm can walk up and down the tree before reaching a leaf`
	`143`	`+page, if internal pages need to split while adjusting the downlinks for the`
	`144`	`+new key. Eventually, you should reach the bottom, and proceed with the`
	`145`	`+insertion of the new tuple.`
	`146`	`+`
	`147`	`+Once we've found the target page to insert to, we check if there's room`
	`148`	`+for the new tuple. If there is, the tuple is inserted, and we're done.`
	`149`	`+If it doesn't fit, however, the page needs to be split. Note that it is`
	`150`	`+possible that a page needs to be split into more than two pages, if keys have`
	`151`	`+different lengths or more than one key is being inserted at a time (which can`
	`152`	`+happen when inserting downlinks for a page split that resulted in more than`
	`153`	`+two pages at the lower level). After splitting a page, the parent page needs`
	`154`	`+to be updated. The downlink for the new page needs to be inserted, and the`
	`155`	`+downlink for the old page, which became the left half of the split, needs to`
	`156`	`+be updated to only cover those tuples that stayed on the left page. Inserting`
	`157`	`+the downlink in the parent can again lead to a page split, recursing up to the`
	`158`	`+root page in the worst case.`
	`159`	`+`
	`160`	`+gistplacetopage is the workhorse function that performs one step of the`
	`161`	`+insertion. If the tuple fits, it inserts it to the given page, otherwise`
	`162`	`+it splits the page, and constructs the new downlink tuples for the split`
	`163`	`+pages. The caller must then call gistplacetopage() on the parent page to`
	`164`	`+insert the downlink tuples. The parent page that holds the downlink to`
	`165`	`+the child might have migrated as a result of concurrent splits of the`
	`166`	`+parent, gistfindCorrectParent() is used to find the parent page.`
	`167`	`+`
	`168`	`+Splitting the root page works slightly differently. At root split,`
	`169`	`+gistplacetopage() allocates the new child pages and replaces the old root`
	`170`	`+page with the new root containing downlinks to the new children, all in one`
	`171`	`+operation.`
	`172`	`+`
	`173`	`+`
	`174`	`+findPath is a subroutine of findParent, used when the correct parent page`
	`175`	`+can't be found by following the rightlinks at the parent level:`
`148`	`176`
`149`	`177`	`findPath( stack item )`
`150`	`178`	`push stack, [root, 0, 0] // page, LSN, parent`
`@@ -165,9 +193,13 @@ findPath( stack item )`
`165`	`193`	`pop stack`
`166`	`194`	`end`
`167`	`195`
	`196`	`+`
	`197`	`+gistFindCorrectParent is used to re-find the parent of a page during`
	`198`	`+insertion. It might have migrated to the right since we traversed down the`
	`199`	`+tree because of page splits.`
	`200`	`+`
`168`	`201`	`findParent( stack item )`
`169`	`202`	`parent = item->parent`
`170`		`-latch( parent->page, X-mode )`
`171`	`203`	`if ( parent->page->lsn != parent->lsn )`
`172`	`204`	`while(true)`
`173`	`205`	`search parent tuple on parent->page, if found the return`
`@@ -181,9 +213,13 @@ findParent( stack item )`
`181`	`213`	`end`
`182`	`214`	`newstack = findPath( item->parent )`
`183`	`215`	`replace part of stack to new one`
	`216`	`+latch( parent->page, X-mode )`
`184`	`217`	`return findParent( item )`
`185`	`218`	`end`
`186`	`219`
	`220`	`+pageSplit function decides how to distribute keys to the new pages after`
	`221`	`+page split:`
	`222`	`+`
`187`	`223`	`pageSplit(page, allkeys)`
`188`	`224`	`(lkeys, rkeys) = pickSplit( allkeys )`
`189`	`225`	`if ( page is root )`
`@@ -204,39 +240,44 @@ pageSplit(page, allkeys)`
`204`	`240`	`return newkeys`
`205`	`241`
`206`	`242`
`207`		`-placetopage(page, keysarray)`
`208`		`-if ( no space left on page )`
`209`		`-keysarray = pageSplit(page, [ extract_keys(page), keysarray])`
`210`		`-last page in chain gets old NSN,`
`211`		`-original and others - new NSN equals to LSN`
`212`		`-if ( page is root )`
`213`		`-make new root with keysarray`
`214`		`-end`
`215`		`-else`
`216`		`-put keysarray on page`
`217`		`-if ( length of keysarray > 1 )`
`218`		`-keysarray = [ union(keysarray) ]`
`219`		`-end`
`220`		`-end`
`221`	`243`
`222`		`-insert(new-key)`
`223`		`-stack = findLeaf(new-key)`
`224`		`-keysarray = [new-key]`
`225`		`-ptr = top of stack`
`226`		`-while(true)`
`227`		`-findParent( ptr ) //findParent latches parent page`
`228`		`-keysarray = placetopage(ptr->page, keysarray)`
`229`		`-unlatch( ptr->page )`
`230`		`-pop stack;`
`231`		`-ptr = top of stack`
`232`		`-if (length of keysarray == 1)`
`233`		`-newboundingkey = union(oldboundingkey, keysarray)`
`234`		`-if (newboundingkey == oldboundingkey)`
`235`		`-unlatch ptr->page`
`236`		`-break loop`
`237`		`-end`
`238`		`-end`
`239`		`-end`
	`244`	`+Concurrency control`
	`245`	`+-------------------`
	`246`	`+As a rule of thumb, if you need to hold a lock on multiple pages at the`
	`247`	`+same time, the locks should be acquired in the following order: child page`
	`248`	`+before parent, and left-to-right at the same level. Always acquiring the`
	`249`	`+locks in the same order avoids deadlocks.`
	`250`	`+`
	`251`	`+The search algorithm only looks at and locks one page at a time. Consequently`
	`252`	`+there's a race condition between a search and a page split. A page split`
	`253`	`+happens in two phases: 1. The page is split 2. The downlink is inserted to the`
	`254`	`+parent. If a search looks at the parent page between those steps, before the`
	`255`	`+downlink is inserted, it will still find the new right half by following the`
	`256`	`+rightlink on the left half. But it must not follow the rightlink if it saw the`
	`257`	`+downlink in the parent, or the page will be visited twice!`
	`258`	`+`
	`259`	`+A split initially marks the left page with the F_FOLLOW_RIGHT flag. If a scan`
	`260`	`+sees that flag set, it knows that the right page is missing the downlink, and`
	`261`	`+should be visited too. When split inserts the downlink to the parent, it`
	`262`	`+clears the F_FOLLOW_RIGHT flag in the child, and sets the NSN field in the`
	`263`	`+child page header to match the LSN of the insertion on the parent. If the`
	`264`	`+F_FOLLOW_RIGHT flag is not set, a scan compares the NSN on the child and the`
	`265`	`+LSN it saw in the parent. If NSN < LSN, the scan looked at the parent page`
	`266`	`+before the downlink was inserted, so it should follow the rightlink. Otherwise`
	`267`	`+the scan saw the downlink in the parent page, and will/did follow that as`
	`268`	`+usual.`
	`269`	`+`
	`270`	`+A scan can't normally see a page with the F_FOLLOW_RIGHT flag set, because`
	`271`	`+a page split keeps the child pages locked until the downlink has been inserted`
	`272`	`+to the parent and the flag cleared again. But if a crash happens in the middle`
	`273`	`+of a page split, before the downlinks are inserted into the parent, that will`
	`274`	`+leave a page with F_FOLLOW_RIGHT in the tree. Scans handle that just fine,`
	`275`	`+but we'll eventually want to fix that for performance reasons. And more`
	`276`	`+importantly, dealing with pages with missing downlink pointers in the parent`
	`277`	`+would complicate the insertion algorithm. So when an insertion sees a page`
	`278`	`+with F_FOLLOW_RIGHT set, it immediately tries to bring the split that`
	`279`	`+crashed in the middle to completion by adding the downlink in the parent.`
	`280`	`+`
`240`	`281`
`241`	`282`	`Authors:`
`242`	`283`	`Teodor Sigaev<teodor@sigaev.ru>`
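
The scan-side rule from the "Concurrency control" section added to the README can be sketched as a single predicate. This is an illustration under stated assumptions, not the actual scan code: the `ChildPageHeader` struct, the `Lsn` typedef, the `SKETCH_F_FOLLOW_RIGHT` bit value, and the `scan_must_follow_rightlink` function are hypothetical names, and the `has_rightlink` check is an added assumption for the rightmost page. The comparison is written as "the LSN the scan saw in the parent is older than the child's NSN", which is what the stated reasoning implies, since the NSN is set to the LSN of the downlink insertion on the parent.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;                 /* stand-in for a WAL position */

/* Hypothetical bit value; the real flag is defined in the GiST headers. */
#define SKETCH_F_FOLLOW_RIGHT 0x01u

/* Hypothetical subset of a child page header, for illustration only. */
typedef struct {
    unsigned flags;         /* page flags, may contain SKETCH_F_FOLLOW_RIGHT */
    Lsn      nsn;           /* set to the LSN of the downlink insertion on the parent */
    bool     has_rightlink; /* false on the rightmost page of a level */
} ChildPageHeader;

/*
 * Decide whether a scan that reached this child through its parent must
 * also visit the child's right sibling.
 *
 * parent_lsn_seen is the LSN the parent page had when the scan read it.
 */
static bool scan_must_follow_rightlink(const ChildPageHeader *child,
                                       Lsn parent_lsn_seen)
{
    if (!child->has_rightlink)
        return false;               /* nothing to the right */

    /* Incomplete split: the right sibling has no downlink yet. */
    if (child->flags & SKETCH_F_FOLLOW_RIGHT)
        return true;

    /*
     * Split completed. If the scan read the parent before the downlink
     * was inserted, it cannot have seen the downlink, so it follows the
     * rightlink. Otherwise it reaches the sibling through the downlink
     * and must not visit it twice.
     */
    return parent_lsn_seen < child->nsn;
}
```

In this model the flag covers the window between the split and the downlink insertion, and the NSN comparison covers scans that read the parent before that insertion but reach the child afterwards.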
