Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commit4b099e1

Browse files
committed
Fix race condition in GIN posting tree page deletion.
If a page is deleted, and reused for something else, just as a search isfollowing a rightlink to it from its left sibling, the search would continuescanning whatever the new contents of the page are. That could lead toincorrect query results, or even something more curious if the page isreused for a different kind of a page.To fix, modify the search algorithm to lock the next page before releasingthe previous one, and refrain from deleting pages from the leftmost branchof the tree.Add a new Concurrency section to the README, explaining why this works.There is a lot more one could say about concurrency in GIN, but that's foranother patch.Backpatch to all supported versions.
1 parent987f05e commit4b099e1

File tree

5 files changed

+121
-59
lines changed

5 files changed

+121
-59
lines changed

‎src/backend/access/gin/README

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -80,6 +80,56 @@ From experience, a value of 'gin_fuzzy_search_limit' in the thousands
8080
have no effect for queries returning a result set with less tuples than this
8181
number.
8282

83+
Concurrency
84+
-----------
85+
86+
The entry tree and each posting tree is a B-tree, with right-links connecting
87+
sibling pages at the same level. This is the same structure that is used in
88+
the regular B-tree indexam (invented by Lehman & Yao), but we don't support
89+
scanning a GIN trees backwards, so we don't need left-links.
90+
91+
To avoid deadlocks, B-tree pages must always be locked in the same order:
92+
left to right, and bottom to top. When searching, the tree is traversed from
93+
top to bottom, so the lock on the parent page must be released before
94+
descending to the next level. Concurrent page splits move the keyspace to
95+
right, so after following a downlink, the page actually containing the key
96+
we're looking for might be somewhere to the right of the page we landed on.
97+
In that case, we follow the right-links until we find the page we're looking
98+
for.
99+
100+
To delete a page, the page's left sibling, the target page, and its parent,
101+
are locked in that order, and the page is marked as deleted. However, a
102+
concurrent search might already have read a pointer to the page, and might be
103+
just about to follow it. A page can be reached via the right-link of its left
104+
sibling, or via its downlink in the parent.
105+
106+
To prevent a backend from reaching a deleted page via a right-link, when
107+
following a right-link the lock on the previous page is not released until
108+
the lock on next page has been acquired.
109+
110+
The downlink is more tricky. A search descending the tree must release the
111+
lock on the parent page before locking the child, or it could deadlock with
112+
a concurrent split of the child page; a page split locks the parent, while
113+
already holding a lock on the child page. However, posting trees are only
114+
fully searched from left to right, starting from the leftmost leaf. (The
115+
tree-structure is only needed by insertions, to quickly find the correct
116+
insert location). So as long as we don't delete the leftmost page on each
117+
level, a search can never follow a downlink to page that's about to be
118+
deleted.
119+
120+
The previous paragraph's reasoning only applies to searches, and only to
121+
posting trees. To protect from inserters following a downlink to a deleted
122+
page, vacuum simply locks out all concurrent insertions to the posting tree,
123+
by holding a super-exclusive lock on the posting tree root. Inserters hold a
124+
pin on the root page, but searches do not, so while new searches cannot begin
125+
while root page is locked, any already-in-progress scans can continue
126+
concurrently with vacuum. In the entry tree, we never delete pages.
127+
128+
(This is quite different from the mechanism the btree indexam uses to make
129+
page-deletions safe; it stamps the deleted pages with an XID and keeps the
130+
deleted pages around with the right-link intact until all concurrent scans
131+
have finished.)
132+
83133
Limitations
84134
-----------
85135

‎src/backend/access/gin/ginbtree.c

Lines changed: 42 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -112,10 +112,8 @@ ginFindLeafPage(GinBtree btree, GinBtreeStack *stack)
112112
/* rightmost page */
113113
break;
114114

115+
stack->buffer=ginStepRight(stack->buffer,btree->index,access);
115116
stack->blkno=rightlink;
116-
LockBuffer(stack->buffer,GIN_UNLOCK);
117-
stack->buffer=ReleaseAndReadBuffer(stack->buffer,btree->index,stack->blkno);
118-
LockBuffer(stack->buffer,access);
119117
page=BufferGetPage(stack->buffer);
120118
}
121119

@@ -151,6 +149,41 @@ ginFindLeafPage(GinBtree btree, GinBtreeStack *stack)
151149
returnNULL;
152150
}
153151

152+
/*
153+
* Step right from current page.
154+
*
155+
* The next page is locked first, before releasing the current page. This is
156+
* crucial to protect from concurrent page deletion (see comment in
157+
* ginDeletePage).
158+
*/
159+
Buffer
160+
ginStepRight(Bufferbuffer,Relationindex,intlockmode)
161+
{
162+
Buffernextbuffer;
163+
Pagepage=BufferGetPage(buffer);
164+
boolisLeaf=GinPageIsLeaf(page);
165+
boolisData=GinPageIsData(page);
166+
BlockNumberblkno=GinPageGetOpaque(page)->rightlink;
167+
168+
nextbuffer=ReadBuffer(index,blkno);
169+
LockBuffer(nextbuffer,lockmode);
170+
UnlockReleaseBuffer(buffer);
171+
172+
/* Sanity check that the page we stepped to is of similar kind. */
173+
page=BufferGetPage(nextbuffer);
174+
if (isLeaf!=GinPageIsLeaf(page)||isData!=GinPageIsData(page))
175+
elog(ERROR,"right sibling of GIN page is of different type");
176+
177+
/*
178+
* Given the proper lock sequence above, we should never land on a
179+
* deleted page.
180+
*/
181+
if (GinPageIsDeleted(page))
182+
elog(ERROR,"right sibling of GIN page was deleted");
183+
184+
returnnextbuffer;
185+
}
186+
154187
void
155188
freeGinBtreeStack(GinBtreeStack*stack)
156189
{
@@ -240,12 +273,12 @@ findParents(GinBtree btree, GinBtreeStack *stack,
240273
while ((offset=btree->findChildPtr(btree,page,stack->blkno,InvalidOffsetNumber))==InvalidOffsetNumber)
241274
{
242275
blkno=GinPageGetOpaque(page)->rightlink;
243-
LockBuffer(buffer,GIN_UNLOCK);
244-
ReleaseBuffer(buffer);
245276
if (blkno==InvalidBlockNumber)
277+
{
278+
UnlockReleaseBuffer(buffer);
246279
break;
247-
buffer=ReadBuffer(btree->index,blkno);
248-
LockBuffer(buffer,GIN_EXCLUSIVE);
280+
}
281+
buffer=ginStepRight(buffer,btree->index,GIN_EXCLUSIVE);
249282
page=BufferGetPage(buffer);
250283
}
251284

@@ -427,23 +460,21 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack)
427460
{
428461
BlockNumberrightlink=GinPageGetOpaque(page)->rightlink;
429462

430-
LockBuffer(parent->buffer,GIN_UNLOCK);
431-
432463
if (rightlink==InvalidBlockNumber)
433464
{
434465
/*
435466
* rightmost page, but we don't find parent, we should use
436467
* plain search...
437468
*/
469+
LockBuffer(parent->buffer,GIN_UNLOCK);
438470
findParents(btree,stack,rootBlkno);
439471
parent=stack->parent;
440472
page=BufferGetPage(parent->buffer);
441473
break;
442474
}
443475

476+
parent->buffer=ginStepRight(parent->buffer,btree->index,GIN_EXCLUSIVE);
444477
parent->blkno=rightlink;
445-
parent->buffer=ReleaseAndReadBuffer(parent->buffer,btree->index,parent->blkno);
446-
LockBuffer(parent->buffer,GIN_EXCLUSIVE);
447478
page=BufferGetPage(parent->buffer);
448479
}
449480

‎src/backend/access/gin/ginget.c

Lines changed: 6 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -73,14 +73,11 @@ moveRightIfItNeeded(GinBtreeData *btree, GinBtreeStack *stack)
7373
/*
7474
* We scanned the whole page, so we should take right page
7575
*/
76-
stack->blkno=GinPageGetOpaque(page)->rightlink;
77-
7876
if (GinPageRightMost(page))
7977
return false;/* no more pages */
8078

81-
LockBuffer(stack->buffer,GIN_UNLOCK);
82-
stack->buffer=ReleaseAndReadBuffer(stack->buffer,btree->index,stack->blkno);
83-
LockBuffer(stack->buffer,GIN_SHARE);
79+
stack->buffer=ginStepRight(stack->buffer,btree->index,GIN_SHARE);
80+
stack->blkno=BufferGetBlockNumber(stack->buffer);
8481
stack->off=FirstOffsetNumber;
8582
}
8683

@@ -97,7 +94,6 @@ scanForItems(Relation index, GinScanEntry scanEntry, BlockNumber rootPostingTree
9794
GinPostingTreeScan*gdi;
9895
Bufferbuffer;
9996
Pagepage;
100-
BlockNumberblkno;
10197

10298
gdi=prepareScanPostingTree(index,rootPostingTree, TRUE);
10399

@@ -122,16 +118,13 @@ scanForItems(Relation index, GinScanEntry scanEntry, BlockNumber rootPostingTree
122118
scanEntry->predictNumberResult+=GinPageGetOpaque(page)->maxoff;
123119
}
124120

125-
blkno=GinPageGetOpaque(page)->rightlink;
126121
if (GinPageRightMost(page))
127122
{
128123
UnlockReleaseBuffer(buffer);
129124
return;/* no more pages */
130125
}
131126

132-
LockBuffer(buffer,GIN_UNLOCK);
133-
buffer=ReleaseAndReadBuffer(buffer,index,blkno);
134-
LockBuffer(buffer,GIN_SHARE);
127+
buffer=ginStepRight(buffer,index,GIN_SHARE);
135128
}
136129
}
137130

@@ -455,7 +448,6 @@ static void
455448
entryGetNextItem(Relationindex,GinScanEntryentry)
456449
{
457450
Pagepage;
458-
BlockNumberblkno;
459451

460452
for (;;)
461453
{
@@ -473,21 +465,16 @@ entryGetNextItem(Relation index, GinScanEntry entry)
473465
* It's needed to go by right link. During that we should refind
474466
* first ItemPointer greater that stored
475467
*/
476-
477-
blkno=GinPageGetOpaque(page)->rightlink;
478-
479-
LockBuffer(entry->buffer,GIN_UNLOCK);
480-
if (blkno==InvalidBlockNumber)
468+
if (GinPageRightMost(page))
481469
{
482-
ReleaseBuffer(entry->buffer);
470+
UnlockReleaseBuffer(entry->buffer);
483471
ItemPointerSetInvalid(&entry->curItem);
484472
entry->buffer=InvalidBuffer;
485473
entry->isFinished= TRUE;
486474
return;
487475
}
488476

489-
entry->buffer=ReleaseAndReadBuffer(entry->buffer,index,blkno);
490-
LockBuffer(entry->buffer,GIN_SHARE);
477+
entry->buffer=ginStepRight(entry->buffer,index,GIN_SHARE);
491478
page=BufferGetPage(entry->buffer);
492479

493480
entry->offset=InvalidOffsetNumber;

‎src/backend/access/gin/ginvacuum.c

Lines changed: 22 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -242,6 +242,9 @@ ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
242242
returnhasVoidPage;
243243
}
244244

245+
/*
246+
* Delete a posting tree page.
247+
*/
245248
staticvoid
246249
ginDeletePage(GinVacuumState*gvs,BlockNumberdeleteBlkno,BlockNumberleftBlkno,
247250
BlockNumberparentBlkno,OffsetNumbermyoff,boolisParentRoot)
@@ -251,39 +254,35 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
251254
BufferpBuffer;
252255
Pagepage,
253256
parentPage;
257+
BlockNumberrightlink;
254258

259+
/*
260+
* Lock the pages in the same order as an insertion would, to avoid
261+
* deadlocks: left, then right, then parent.
262+
*/
263+
lBuffer=ReadBufferExtended(gvs->index,MAIN_FORKNUM,leftBlkno,
264+
RBM_NORMAL,gvs->strategy);
255265
dBuffer=ReadBufferExtended(gvs->index,MAIN_FORKNUM,deleteBlkno,
256266
RBM_NORMAL,gvs->strategy);
257-
258-
if (leftBlkno!=InvalidBlockNumber)
259-
lBuffer=ReadBufferExtended(gvs->index,MAIN_FORKNUM,leftBlkno,
260-
RBM_NORMAL,gvs->strategy);
261-
else
262-
lBuffer=InvalidBuffer;
263-
264267
pBuffer=ReadBufferExtended(gvs->index,MAIN_FORKNUM,parentBlkno,
265268
RBM_NORMAL,gvs->strategy);
266269

270+
LockBuffer(lBuffer,GIN_EXCLUSIVE);
267271
LockBuffer(dBuffer,GIN_EXCLUSIVE);
268272
if (!isParentRoot)/* parent is already locked by
269273
* LockBufferForCleanup() */
270274
LockBuffer(pBuffer,GIN_EXCLUSIVE);
271-
if (leftBlkno!=InvalidBlockNumber)
272-
LockBuffer(lBuffer,GIN_EXCLUSIVE);
273275

274276
START_CRIT_SECTION();
275277

276-
if (leftBlkno!=InvalidBlockNumber)
277-
{
278-
BlockNumberrightlink;
279-
280-
page=BufferGetPage(dBuffer);
281-
rightlink=GinPageGetOpaque(page)->rightlink;
278+
/* Unlink the page by changing left sibling's rightlink */
279+
page=BufferGetPage(dBuffer);
280+
rightlink=GinPageGetOpaque(page)->rightlink;
282281

283-
page=BufferGetPage(lBuffer);
284-
GinPageGetOpaque(page)->rightlink=rightlink;
285-
}
282+
page=BufferGetPage(lBuffer);
283+
GinPageGetOpaque(page)->rightlink=rightlink;
286284

285+
/* Delete downlink from parent */
287286
parentPage=BufferGetPage(pBuffer);
288287
#ifdefUSE_ASSERT_CHECKING
289288
do
@@ -368,10 +367,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
368367
if (!isParentRoot)
369368
LockBuffer(pBuffer,GIN_UNLOCK);
370369
ReleaseBuffer(pBuffer);
371-
372-
if (leftBlkno!=InvalidBlockNumber)
373-
UnlockReleaseBuffer(lBuffer);
374-
370+
UnlockReleaseBuffer(lBuffer);
375371
UnlockReleaseBuffer(dBuffer);
376372

377373
END_CRIT_SECTION();
@@ -439,15 +435,12 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot, DataPageDel
439435

440436
if (GinPageGetOpaque(page)->maxoff<FirstOffsetNumber)
441437
{
442-
if (!(me->leftBlkno==InvalidBlockNumber&&GinPageRightMost(page)))
438+
/* we never delete the left- or rightmost branch */
439+
if (me->leftBlkno!=InvalidBlockNumber&& !GinPageRightMost(page))
443440
{
444-
/* we never delete right most branch */
445441
Assert(!isRoot);
446-
if (GinPageGetOpaque(page)->maxoff<FirstOffsetNumber)
447-
{
448-
ginDeletePage(gvs,blkno,me->leftBlkno,me->parent->blkno,myoff,me->parent->isRoot);
449-
meDelete= TRUE;
450-
}
442+
ginDeletePage(gvs,blkno,me->leftBlkno,me->parent->blkno,myoff,me->parent->isRoot);
443+
meDelete= TRUE;
451444
}
452445
}
453446

‎src/include/access/gin.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -431,6 +431,7 @@ typedef struct GinBtreeData
431431

432432
externGinBtreeStack*ginPrepareFindLeafPage(GinBtreebtree,BlockNumberblkno);
433433
externGinBtreeStack*ginFindLeafPage(GinBtreebtree,GinBtreeStack*stack);
434+
externBufferginStepRight(Bufferbuffer,Relationindex,intlockmode);
434435
externvoidfreeGinBtreeStack(GinBtreeStack*stack);
435436
externvoidginInsertValue(GinBtreebtree,GinBtreeStack*stack);
436437
externvoidfindParents(GinBtreebtree,GinBtreeStack*stack,BlockNumberrootBlkno);

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp