Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commit3527a5f

Browse files
committed
Fix race condition in GIN posting tree page deletion.
If a page is deleted, and reused for something else, just as a search isfollowing a rightlink to it from its left sibling, the search would continuescanning whatever the new contents of the page are. That could lead toincorrect query results, or even something more curious if the page isreused for a different kind of a page.To fix, modify the search algorithm to lock the next page before releasingthe previous one, and refrain from deleting pages from the leftmost branchof the tree.Add a new Concurrency section to the README, explaining why this works.There is a lot more one could say about concurrency in GIN, but that's foranother patch.Backpatch to all supported versions.
1 parentaf38d14 commit3527a5f

File tree

5 files changed

+123
-63
lines changed

5 files changed

+123
-63
lines changed

‎src/backend/access/gin/README

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -210,6 +210,56 @@ fit on one pending-list page must have those pages to itself, even if this
210210
results in wasting much of the space on the preceding page and the last
211211
page for the tuple.)
212212

213+
Concurrency
214+
-----------
215+
216+
The entry tree and each posting tree is a B-tree, with right-links connecting
217+
sibling pages at the same level. This is the same structure that is used in
218+
the regular B-tree indexam (invented by Lehman & Yao), but we don't support
219+
scanning a GIN trees backwards, so we don't need left-links.
220+
221+
To avoid deadlocks, B-tree pages must always be locked in the same order:
222+
left to right, and bottom to top. When searching, the tree is traversed from
223+
top to bottom, so the lock on the parent page must be released before
224+
descending to the next level. Concurrent page splits move the keyspace to
225+
right, so after following a downlink, the page actually containing the key
226+
we're looking for might be somewhere to the right of the page we landed on.
227+
In that case, we follow the right-links until we find the page we're looking
228+
for.
229+
230+
To delete a page, the page's left sibling, the target page, and its parent,
231+
are locked in that order, and the page is marked as deleted. However, a
232+
concurrent search might already have read a pointer to the page, and might be
233+
just about to follow it. A page can be reached via the right-link of its left
234+
sibling, or via its downlink in the parent.
235+
236+
To prevent a backend from reaching a deleted page via a right-link, when
237+
following a right-link the lock on the previous page is not released until
238+
the lock on next page has been acquired.
239+
240+
The downlink is more tricky. A search descending the tree must release the
241+
lock on the parent page before locking the child, or it could deadlock with
242+
a concurrent split of the child page; a page split locks the parent, while
243+
already holding a lock on the child page. However, posting trees are only
244+
fully searched from left to right, starting from the leftmost leaf. (The
245+
tree-structure is only needed by insertions, to quickly find the correct
246+
insert location). So as long as we don't delete the leftmost page on each
247+
level, a search can never follow a downlink to page that's about to be
248+
deleted.
249+
250+
The previous paragraph's reasoning only applies to searches, and only to
251+
posting trees. To protect from inserters following a downlink to a deleted
252+
page, vacuum simply locks out all concurrent insertions to the posting tree,
253+
by holding a super-exclusive lock on the posting tree root. Inserters hold a
254+
pin on the root page, but searches do not, so while new searches cannot begin
255+
while root page is locked, any already-in-progress scans can continue
256+
concurrently with vacuum. In the entry tree, we never delete pages.
257+
258+
(This is quite different from the mechanism the btree indexam uses to make
259+
page-deletions safe; it stamps the deleted pages with an XID and keeps the
260+
deleted pages around with the right-link intact until all concurrent scans
261+
have finished.)
262+
213263
Limitations
214264
-----------
215265

‎src/backend/access/gin/ginbtree.c

Lines changed: 42 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -113,10 +113,8 @@ ginFindLeafPage(GinBtree btree, GinBtreeStack *stack)
113113
/* rightmost page */
114114
break;
115115

116+
stack->buffer=ginStepRight(stack->buffer,btree->index,access);
116117
stack->blkno=rightlink;
117-
LockBuffer(stack->buffer,GIN_UNLOCK);
118-
stack->buffer=ReleaseAndReadBuffer(stack->buffer,btree->index,stack->blkno);
119-
LockBuffer(stack->buffer,access);
120118
page=BufferGetPage(stack->buffer);
121119
}
122120

@@ -152,6 +150,41 @@ ginFindLeafPage(GinBtree btree, GinBtreeStack *stack)
152150
returnNULL;
153151
}
154152

153+
/*
154+
* Step right from current page.
155+
*
156+
* The next page is locked first, before releasing the current page. This is
157+
* crucial to protect from concurrent page deletion (see comment in
158+
* ginDeletePage).
159+
*/
160+
Buffer
161+
ginStepRight(Bufferbuffer,Relationindex,intlockmode)
162+
{
163+
Buffernextbuffer;
164+
Pagepage=BufferGetPage(buffer);
165+
boolisLeaf=GinPageIsLeaf(page);
166+
boolisData=GinPageIsData(page);
167+
BlockNumberblkno=GinPageGetOpaque(page)->rightlink;
168+
169+
nextbuffer=ReadBuffer(index,blkno);
170+
LockBuffer(nextbuffer,lockmode);
171+
UnlockReleaseBuffer(buffer);
172+
173+
/* Sanity check that the page we stepped to is of similar kind. */
174+
page=BufferGetPage(nextbuffer);
175+
if (isLeaf!=GinPageIsLeaf(page)||isData!=GinPageIsData(page))
176+
elog(ERROR,"right sibling of GIN page is of different type");
177+
178+
/*
179+
* Given the proper lock sequence above, we should never land on a
180+
* deleted page.
181+
*/
182+
if (GinPageIsDeleted(page))
183+
elog(ERROR,"right sibling of GIN page was deleted");
184+
185+
returnnextbuffer;
186+
}
187+
155188
void
156189
freeGinBtreeStack(GinBtreeStack*stack)
157190
{
@@ -240,12 +273,12 @@ ginFindParents(GinBtree btree, GinBtreeStack *stack,
240273
while ((offset=btree->findChildPtr(btree,page,stack->blkno,InvalidOffsetNumber))==InvalidOffsetNumber)
241274
{
242275
blkno=GinPageGetOpaque(page)->rightlink;
243-
LockBuffer(buffer,GIN_UNLOCK);
244-
ReleaseBuffer(buffer);
245276
if (blkno==InvalidBlockNumber)
277+
{
278+
UnlockReleaseBuffer(buffer);
246279
break;
247-
buffer=ReadBuffer(btree->index,blkno);
248-
LockBuffer(buffer,GIN_EXCLUSIVE);
280+
}
281+
buffer=ginStepRight(buffer,btree->index,GIN_EXCLUSIVE);
249282
page=BufferGetPage(buffer);
250283
}
251284

@@ -448,23 +481,21 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
448481
{
449482
BlockNumberrightlink=GinPageGetOpaque(page)->rightlink;
450483

451-
LockBuffer(parent->buffer,GIN_UNLOCK);
452-
453484
if (rightlink==InvalidBlockNumber)
454485
{
455486
/*
456487
* rightmost page, but we don't find parent, we should use
457488
* plain search...
458489
*/
490+
LockBuffer(parent->buffer,GIN_UNLOCK);
459491
ginFindParents(btree,stack,rootBlkno);
460492
parent=stack->parent;
461493
page=BufferGetPage(parent->buffer);
462494
break;
463495
}
464496

497+
parent->buffer=ginStepRight(parent->buffer,btree->index,GIN_EXCLUSIVE);
465498
parent->blkno=rightlink;
466-
parent->buffer=ReleaseAndReadBuffer(parent->buffer,btree->index,parent->blkno);
467-
LockBuffer(parent->buffer,GIN_EXCLUSIVE);
468499
page=BufferGetPage(parent->buffer);
469500
}
470501

‎src/backend/access/gin/ginget.c

Lines changed: 8 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -107,16 +107,11 @@ moveRightIfItNeeded(GinBtreeData *btree, GinBtreeStack *stack)
107107
/*
108108
* We scanned the whole page, so we should take right page
109109
*/
110-
stack->blkno=GinPageGetOpaque(page)->rightlink;
111-
112110
if (GinPageRightMost(page))
113111
return false;/* no more pages */
114112

115-
LockBuffer(stack->buffer,GIN_UNLOCK);
116-
stack->buffer=ReleaseAndReadBuffer(stack->buffer,
117-
btree->index,
118-
stack->blkno);
119-
LockBuffer(stack->buffer,GIN_SHARE);
113+
stack->buffer=ginStepRight(stack->buffer,btree->index,GIN_SHARE);
114+
stack->blkno=BufferGetBlockNumber(stack->buffer);
120115
stack->off=FirstOffsetNumber;
121116
}
122117

@@ -134,7 +129,6 @@ scanPostingTree(Relation index, GinScanEntry scanEntry,
134129
GinPostingTreeScan*gdi;
135130
Bufferbuffer;
136131
Pagepage;
137-
BlockNumberblkno;
138132

139133
/* Descend to the leftmost leaf page */
140134
gdi=ginPrepareScanPostingTree(index,rootPostingTree, TRUE);
@@ -164,10 +158,7 @@ scanPostingTree(Relation index, GinScanEntry scanEntry,
164158
if (GinPageRightMost(page))
165159
break;/* no more pages */
166160

167-
blkno=GinPageGetOpaque(page)->rightlink;
168-
LockBuffer(buffer,GIN_UNLOCK);
169-
buffer=ReleaseAndReadBuffer(buffer,index,blkno);
170-
LockBuffer(buffer,GIN_SHARE);
161+
buffer=ginStepRight(buffer,index,GIN_SHARE);
171162
}
172163

173164
UnlockReleaseBuffer(buffer);
@@ -546,7 +537,6 @@ static void
546537
entryGetNextItem(GinState*ginstate,GinScanEntryentry)
547538
{
548539
Pagepage;
549-
BlockNumberblkno;
550540

551541
for (;;)
552542
{
@@ -564,23 +554,18 @@ entryGetNextItem(GinState *ginstate, GinScanEntry entry)
564554
* It's needed to go by right link. During that we should refind
565555
* first ItemPointer greater that stored
566556
*/
567-
568-
blkno=GinPageGetOpaque(page)->rightlink;
569-
570-
LockBuffer(entry->buffer,GIN_UNLOCK);
571-
if (blkno==InvalidBlockNumber)
557+
if (GinPageRightMost(page))
572558
{
573-
ReleaseBuffer(entry->buffer);
559+
UnlockReleaseBuffer(entry->buffer);
574560
ItemPointerSetInvalid(&entry->curItem);
575561
entry->buffer=InvalidBuffer;
576562
entry->isFinished= TRUE;
577563
return;
578564
}
579565

580-
entry->buffer=ReleaseAndReadBuffer(entry->buffer,
581-
ginstate->index,
582-
blkno);
583-
LockBuffer(entry->buffer,GIN_SHARE);
566+
entry->buffer=ginStepRight(entry->buffer,
567+
ginstate->index,
568+
GIN_SHARE);
584569
page=BufferGetPage(entry->buffer);
585570

586571
entry->offset=InvalidOffsetNumber;

‎src/backend/access/gin/ginvacuum.c

Lines changed: 22 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -240,6 +240,9 @@ ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
240240
returnhasVoidPage;
241241
}
242242

243+
/*
244+
* Delete a posting tree page.
245+
*/
243246
staticvoid
244247
ginDeletePage(GinVacuumState*gvs,BlockNumberdeleteBlkno,BlockNumberleftBlkno,
245248
BlockNumberparentBlkno,OffsetNumbermyoff,boolisParentRoot)
@@ -249,39 +252,35 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
249252
BufferpBuffer;
250253
Pagepage,
251254
parentPage;
255+
BlockNumberrightlink;
252256

257+
/*
258+
* Lock the pages in the same order as an insertion would, to avoid
259+
* deadlocks: left, then right, then parent.
260+
*/
261+
lBuffer=ReadBufferExtended(gvs->index,MAIN_FORKNUM,leftBlkno,
262+
RBM_NORMAL,gvs->strategy);
253263
dBuffer=ReadBufferExtended(gvs->index,MAIN_FORKNUM,deleteBlkno,
254264
RBM_NORMAL,gvs->strategy);
255-
256-
if (leftBlkno!=InvalidBlockNumber)
257-
lBuffer=ReadBufferExtended(gvs->index,MAIN_FORKNUM,leftBlkno,
258-
RBM_NORMAL,gvs->strategy);
259-
else
260-
lBuffer=InvalidBuffer;
261-
262265
pBuffer=ReadBufferExtended(gvs->index,MAIN_FORKNUM,parentBlkno,
263266
RBM_NORMAL,gvs->strategy);
264267

268+
LockBuffer(lBuffer,GIN_EXCLUSIVE);
265269
LockBuffer(dBuffer,GIN_EXCLUSIVE);
266270
if (!isParentRoot)/* parent is already locked by
267271
* LockBufferForCleanup() */
268272
LockBuffer(pBuffer,GIN_EXCLUSIVE);
269-
if (leftBlkno!=InvalidBlockNumber)
270-
LockBuffer(lBuffer,GIN_EXCLUSIVE);
271273

272274
START_CRIT_SECTION();
273275

274-
if (leftBlkno!=InvalidBlockNumber)
275-
{
276-
BlockNumberrightlink;
277-
278-
page=BufferGetPage(dBuffer);
279-
rightlink=GinPageGetOpaque(page)->rightlink;
276+
/* Unlink the page by changing left sibling's rightlink */
277+
page=BufferGetPage(dBuffer);
278+
rightlink=GinPageGetOpaque(page)->rightlink;
280279

281-
page=BufferGetPage(lBuffer);
282-
GinPageGetOpaque(page)->rightlink=rightlink;
283-
}
280+
page=BufferGetPage(lBuffer);
281+
GinPageGetOpaque(page)->rightlink=rightlink;
284282

283+
/* Delete downlink from parent */
285284
parentPage=BufferGetPage(pBuffer);
286285
#ifdefUSE_ASSERT_CHECKING
287286
do
@@ -366,10 +365,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
366365
if (!isParentRoot)
367366
LockBuffer(pBuffer,GIN_UNLOCK);
368367
ReleaseBuffer(pBuffer);
369-
370-
if (leftBlkno!=InvalidBlockNumber)
371-
UnlockReleaseBuffer(lBuffer);
372-
368+
UnlockReleaseBuffer(lBuffer);
373369
UnlockReleaseBuffer(dBuffer);
374370

375371
END_CRIT_SECTION();
@@ -437,15 +433,12 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot, DataPageDel
437433

438434
if (GinPageGetOpaque(page)->maxoff<FirstOffsetNumber)
439435
{
440-
if (!(me->leftBlkno==InvalidBlockNumber&&GinPageRightMost(page)))
436+
/* we never delete the left- or rightmost branch */
437+
if (me->leftBlkno!=InvalidBlockNumber&& !GinPageRightMost(page))
441438
{
442-
/* we never delete right most branch */
443439
Assert(!isRoot);
444-
if (GinPageGetOpaque(page)->maxoff<FirstOffsetNumber)
445-
{
446-
ginDeletePage(gvs,blkno,me->leftBlkno,me->parent->blkno,myoff,me->parent->isRoot);
447-
meDelete= TRUE;
448-
}
440+
ginDeletePage(gvs,blkno,me->leftBlkno,me->parent->blkno,myoff,me->parent->isRoot);
441+
meDelete= TRUE;
449442
}
450443
}
451444

‎src/include/access/gin_private.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -508,6 +508,7 @@ typedef struct GinBtreeData
508508

509509
externGinBtreeStack*ginPrepareFindLeafPage(GinBtreebtree,BlockNumberblkno);
510510
externGinBtreeStack*ginFindLeafPage(GinBtreebtree,GinBtreeStack*stack);
511+
externBufferginStepRight(Bufferbuffer,Relationindex,intlockmode);
511512
externvoidfreeGinBtreeStack(GinBtreeStack*stack);
512513
externvoidginInsertValue(GinBtreebtree,GinBtreeStack*stack,
513514
GinStatsData*buildStats);

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp