
Commit 9d37c03

committed
Repair PANIC condition in hash indexes when a previous index extension attempt
failed (due to lock conflicts or out-of-space). We might have already extended
the index's filesystem EOF before failing, causing the EOF to be beyond what
the metapage says is the last used page. Hence the invariant maintained by the
code needs to be "EOF is at or beyond last used page", not "EOF is exactly the
last used page". Problem was created by my patch of 2006-11-19 that attempted
to repair bug #2737. Since that was back-patched to 7.4, this needs to be as
well. Per report and test case from Vlastimil Krejcir.
1 parent 77a41e7 · commit 9d37c03

File tree

4 files changed: +134 −88 lines changed


‎src/backend/access/hash/README

Lines changed: 17 additions & 1 deletion

@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/hash/README,v 1.5 2007/01/09 07:30:49 tgl Exp $
+$PostgreSQL: pgsql/src/backend/access/hash/README,v 1.6 2007/04/19 20:24:04 tgl Exp $
 
 This directory contains an implementation of hash indexing for Postgres. Most
 of the core ideas are taken from Margo Seltzer and Ozan Yigit, A New Hashing
@@ -77,6 +77,18 @@ index, and preparing to allocate additional overflow pages after those
 bucket pages.  hashm_spares[] entries before S cannot change anymore,
 since that would require moving already-created bucket pages.
 
+The last page nominally used by the index is always determinable from
+hashm_spares[S].  To avoid complaints from smgr, the logical EOF as seen by
+the filesystem and smgr must always be greater than or equal to this page.
+We have to allow the case "greater than" because it's possible that during
+an index extension we crash after allocating filesystem space and before
+updating the metapage.  Note that on filesystems that allow "holes" in
+files, it's entirely likely that pages before the logical EOF are not yet
+allocated: when we allocate a new splitpoint's worth of bucket pages, we
+physically zero the last such page to force the EOF up, and the first such
+page will be used immediately, but the intervening pages are not written
+until needed.
+
 Since overflow pages may be recycled if enough tuples are deleted from
 their bucket, we need a way to keep track of currently-free overflow
 pages.  The state of each overflow page (0 = available, 1 = not available)
@@ -310,6 +322,10 @@ we can just error out without any great harm being done.
 Free space management
 ---------------------
 
+(Question: why is this so complicated?  Why not just have a linked list
+of free pages with the list head in the metapage?  It's not like we
+avoid needing to modify the metapage with all this.)
+
 Free space management consists of two sub-algorithms, one for reserving
 an overflow page to add to a bucket chain, and one for returning an empty
 overflow page to the free pool.

‎src/backend/access/hash/hashovfl.c

Lines changed: 7 additions & 19 deletions

@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *    $PostgreSQL: pgsql/src/backend/access/hash/hashovfl.c,v 1.55 2007/04/09 22:03:57 tgl Exp $
+ *    $PostgreSQL: pgsql/src/backend/access/hash/hashovfl.c,v 1.56 2007/04/19 20:24:04 tgl Exp $
  *
  * NOTES
  *    Overflow pages look like ordinary relation pages.
@@ -272,19 +272,12 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
     blkno = bitno_to_blkno(metap, bit);
 
     /*
-     * We have to fetch the page with P_NEW to ensure smgr's idea of the
+     * Fetch the page with _hash_getnewbuf to ensure smgr's idea of the
      * relation length stays in sync with ours.  XXX It's annoying to do this
      * with metapage write lock held; would be better to use a lock that
-     * doesn't block incoming searches.  Best way to fix it would be to stop
-     * maintaining hashm_spares[hashm_ovflpoint] and rely entirely on the
-     * smgr relation length to track where new overflow pages come from;
-     * then we could release the metapage before we do the smgrextend.
-     * FIXME later (not in beta...)
+     * doesn't block incoming searches.
      */
-    newbuf = _hash_getbuf(rel, P_NEW, HASH_WRITE);
-    if (BufferGetBlockNumber(newbuf) != blkno)
-        elog(ERROR, "unexpected hash relation size: %u, should be %u",
-             BufferGetBlockNumber(newbuf), blkno);
+    newbuf = _hash_getnewbuf(rel, blkno, HASH_WRITE);
 
     metap->hashm_spares[splitnum]++;
 
@@ -507,19 +500,14 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno)
     /*
      * It is okay to write-lock the new bitmap page while holding metapage
      * write lock, because no one else could be contending for the new page.
-     * Also, the metapage lock makes it safe to extend the index using P_NEW,
-     * which we want to do to ensure the smgr's idea of the relation size
-     * stays in step with ours.
+     * Also, the metapage lock makes it safe to extend the index using
+     * _hash_getnewbuf.
      *
      * There is some loss of concurrency in possibly doing I/O for the new
      * page while holding the metapage lock, but this path is taken so seldom
      * that it's not worth worrying about.
      */
-    buf = _hash_getbuf(rel, P_NEW, HASH_WRITE);
-    if (BufferGetBlockNumber(buf) != blkno)
-        elog(ERROR, "unexpected hash relation size: %u, should be %u",
-             BufferGetBlockNumber(buf), blkno);
-
+    buf = _hash_getnewbuf(rel, blkno, HASH_WRITE);
     pg = BufferGetPage(buf);
 
     /* initialize the page */
‎src/backend/access/hash/hashpage.c

Lines changed: 108 additions & 67 deletions

@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *    $PostgreSQL: pgsql/src/backend/access/hash/hashpage.c,v 1.65 2007/04/09 22:03:57 tgl Exp $
+ *    $PostgreSQL: pgsql/src/backend/access/hash/hashpage.c,v 1.66 2007/04/19 20:24:04 tgl Exp $
  *
  * NOTES
  *    Postgres hash pages look like ordinary relation pages.  The opaque
@@ -36,7 +36,8 @@
 #include "utils/lsyscache.h"
 
 
-static BlockNumber _hash_alloc_buckets(Relation rel, uint32 nblocks);
+static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
+                    uint32 nblocks);
 static void _hash_splitbucket(Relation rel, Buffer metabuf,
                   Bucket obucket, Bucket nbucket,
                   BlockNumber start_oblkno,
@@ -104,8 +105,9 @@ _hash_droplock(Relation rel, BlockNumber whichlock, int access)
  *        requested buffer and its reference count has been incremented
  *        (ie, the buffer is "locked and pinned").
  *
- *        blkno == P_NEW is allowed, but it is caller's responsibility to
- *        ensure that only one process can extend the index at a time.
+ *        P_NEW is disallowed because this routine should only be used
+ *        to access pages that are known to be before the filesystem EOF.
+ *        Extending the index should be done with _hash_getnewbuf.
  *
  *        All call sites should call either _hash_checkpage or _hash_pageinit
  *        on the returned page, depending on whether the block is expected
@@ -116,6 +118,9 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access)
 {
     Buffer      buf;
 
+    if (blkno == P_NEW)
+        elog(ERROR, "hash AM does not use P_NEW");
+
     buf = ReadBuffer(rel, blkno);
 
     if (access != HASH_NOLOCK)
@@ -125,6 +130,51 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access)
     return buf;
 }
 
+/*
+ *    _hash_getnewbuf() -- Get a new page at the end of the index.
+ *
+ *        This has the same API as _hash_getbuf, except that we are adding
+ *        a page to the index, and hence expect the page to be past the
+ *        logical EOF.  (However, we have to support the case where it isn't,
+ *        since a prior try might have crashed after extending the filesystem
+ *        EOF but before updating the metapage to reflect the added page.)
+ *
+ *        It is caller's responsibility to ensure that only one process can
+ *        extend the index at a time.
+ *
+ *        All call sites should call _hash_pageinit on the returned page.
+ *        Also, it's difficult to imagine why access would not be HASH_WRITE.
+ */
+Buffer
+_hash_getnewbuf(Relation rel, BlockNumber blkno, int access)
+{
+    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
+    Buffer      buf;
+
+    if (blkno == P_NEW)
+        elog(ERROR, "hash AM does not use P_NEW");
+    if (blkno > nblocks)
+        elog(ERROR, "access to noncontiguous page in hash index \"%s\"",
+             RelationGetRelationName(rel));
+
+    /* smgr insists we use P_NEW to extend the relation */
+    if (blkno == nblocks)
+    {
+        buf = ReadBuffer(rel, P_NEW);
+        if (BufferGetBlockNumber(buf) != blkno)
+            elog(ERROR, "unexpected hash relation size: %u, should be %u",
+                 BufferGetBlockNumber(buf), blkno);
+    }
+    else
+        buf = ReadBuffer(rel, blkno);
+
+    if (access != HASH_NOLOCK)
+        LockBuffer(buf, access);
+
+    /* ref count and lock type are correct */
+    return buf;
+}
+
 /*
  *    _hash_relbuf() -- release a locked buffer.
  *
@@ -238,12 +288,11 @@ _hash_metapinit(Relation rel)
 
     /*
      * We initialize the metapage, the first two bucket pages, and the
-     * first bitmap page in sequence, using P_NEW to cause smgrextend()
-     * calls to occur.  This ensures that the smgr level has the right
-     * idea of the physical index length.
+     * first bitmap page in sequence, using _hash_getnewbuf to cause
+     * smgrextend() calls to occur.  This ensures that the smgr level
+     * has the right idea of the physical index length.
      */
-    metabuf = _hash_getbuf(rel, P_NEW, HASH_WRITE);
-    Assert(BufferGetBlockNumber(metabuf) == HASH_METAPAGE);
+    metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, HASH_WRITE);
     pg = BufferGetPage(metabuf);
     _hash_pageinit(pg, BufferGetPageSize(metabuf));
 
@@ -301,8 +350,7 @@ _hash_metapinit(Relation rel)
      */
     for (i = 0; i <= 1; i++)
     {
-        buf = _hash_getbuf(rel, P_NEW, HASH_WRITE);
-        Assert(BufferGetBlockNumber(buf) == BUCKET_TO_BLKNO(metap, i));
+        buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), HASH_WRITE);
         pg = BufferGetPage(buf);
         _hash_pageinit(pg, BufferGetPageSize(buf));
         pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
@@ -350,7 +398,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
     Bucket      old_bucket;
     Bucket      new_bucket;
     uint32      spare_ndx;
-    BlockNumber firstblock = InvalidBlockNumber;
     BlockNumber start_oblkno;
     BlockNumber start_nblkno;
     uint32      maxbucket;
@@ -402,39 +449,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
     if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
         goto fail;
 
-    /*
-     * If the split point is increasing (hashm_maxbucket's log base 2
-     * increases), we need to allocate a new batch of bucket pages.
-     */
-    new_bucket = metap->hashm_maxbucket + 1;
-    spare_ndx = _hash_log2(new_bucket + 1);
-    if (spare_ndx > metap->hashm_ovflpoint)
-    {
-        Assert(spare_ndx == metap->hashm_ovflpoint + 1);
-        /*
-         * The number of buckets in the new splitpoint is equal to the
-         * total number already in existence, i.e. new_bucket.  Currently
-         * this maps one-to-one to blocks required, but someday we may need
-         * a more complicated calculation here.
-         */
-        firstblock = _hash_alloc_buckets(rel, new_bucket);
-        if (firstblock == InvalidBlockNumber)
-            goto fail;          /* can't split due to BlockNumber overflow */
-    }
-
     /*
      * Determine which bucket is to be split, and attempt to lock the old
      * bucket.  If we can't get the lock, give up.
      *
      * The lock protects us against other backends, but not against our own
      * backend.  Must check for active scans separately.
-     *
-     * Ideally we would lock the new bucket too before proceeding, but if we
-     * are about to cross a splitpoint then the BUCKET_TO_BLKNO mapping isn't
-     * correct yet.  For simplicity we update the metapage first and then
-     * lock.  This should be okay because no one else should be trying to lock
-     * the new bucket yet...
      */
+    new_bucket = metap->hashm_maxbucket + 1;
+
     old_bucket = (new_bucket & metap->hashm_lowmask);
 
     start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket);
@@ -445,6 +468,45 @@ _hash_expandtable(Relation rel, Buffer metabuf)
     if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
         goto fail;
 
+    /*
+     * Likewise lock the new bucket (should never fail).
+     *
+     * Note: it is safe to compute the new bucket's blkno here, even though
+     * we may still need to update the BUCKET_TO_BLKNO mapping.  This is
+     * because the current value of hashm_spares[hashm_ovflpoint] correctly
+     * shows where we are going to put a new splitpoint's worth of buckets.
+     */
+    start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
+
+    if (_hash_has_active_scan(rel, new_bucket))
+        elog(ERROR, "scan in progress on supposedly new bucket");
+
+    if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
+        elog(ERROR, "could not get lock on supposedly new bucket");
+
+    /*
+     * If the split point is increasing (hashm_maxbucket's log base 2
+     * increases), we need to allocate a new batch of bucket pages.
+     */
+    spare_ndx = _hash_log2(new_bucket + 1);
+    if (spare_ndx > metap->hashm_ovflpoint)
+    {
+        Assert(spare_ndx == metap->hashm_ovflpoint + 1);
+
+        /*
+         * The number of buckets in the new splitpoint is equal to the
+         * total number already in existence, i.e. new_bucket.  Currently
+         * this maps one-to-one to blocks required, but someday we may need
+         * a more complicated calculation here.
+         */
+        if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+        {
+            /* can't split due to BlockNumber overflow */
+            _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
+            _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+            goto fail;
+        }
+    }
+
     /*
      * Okay to proceed with split.  Update the metapage bucket mapping info.
      *
@@ -477,20 +539,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
         metap->hashm_ovflpoint = spare_ndx;
     }
 
-    /* now we can compute the new bucket's primary block number */
-    start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
-
-    /* if we added a splitpoint, should match result of _hash_alloc_buckets */
-    if (firstblock != InvalidBlockNumber &&
-        firstblock != start_nblkno)
-        elog(PANIC, "unexpected hash relation size: %u, should be %u",
-             firstblock, start_nblkno);
-
-    Assert(!_hash_has_active_scan(rel, new_bucket));
-
-    if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-        elog(PANIC, "could not get lock on supposedly new bucket");
-
     /* Done mucking with metapage */
     END_CRIT_SECTION();
 
@@ -539,7 +587,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
  * This does not need to initialize the new bucket pages; we'll do that as
  * each one is used by _hash_expandtable().  But we have to extend the logical
  * EOF to the end of the splitpoint; this keeps smgr's idea of the EOF in
- * sync with ours, so that overflow-page allocation works correctly.
+ * sync with ours, so that we don't get complaints from smgr.
  *
 @@ -554,37 +602,30 @@ _hash_expandtable(Relation rel, Buffer metabuf)
  * for the purpose.  OTOH, adding a splitpoint is a very infrequent operation,
  * so it may not be worth worrying about.
  *
- * Returns the first block number in the new splitpoint's range, or
- * InvalidBlockNumber if allocation failed due to BlockNumber overflow.
+ * Returns TRUE if successful, or FALSE if allocation failed due to
+ * BlockNumber overflow.
  */
-static BlockNumber
-_hash_alloc_buckets(Relation rel, uint32 nblocks)
+static bool
+_hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
-    BlockNumber firstblock;
     BlockNumber lastblock;
     char        zerobuf[BLCKSZ];
 
-    /*
-     * Since we hold metapage lock, no one else is either splitting or
-     * allocating a new page in _hash_getovflpage(); hence it's safe to
-     * assume that the relation length isn't changing under us.
-     */
-    firstblock = RelationGetNumberOfBlocks(rel);
     lastblock = firstblock + nblocks - 1;
 
     /*
      * Check for overflow in block number calculation; if so, we cannot
      * extend the index anymore.
      */
     if (lastblock < firstblock || lastblock == InvalidBlockNumber)
-        return InvalidBlockNumber;
+        return false;
 
     MemSet(zerobuf, 0, sizeof(zerobuf));
 
-    /* Note: we assume RelationGetNumberOfBlocks did RelationOpenSmgr for us */
+    RelationOpenSmgr(rel);
     smgrextend(rel->rd_smgr, lastblock, zerobuf, rel->rd_istemp);
 
-    return firstblock;
+    return true;
 }
‎src/include/access/hash.h

Lines changed: 2 additions & 1 deletion

@@ -7,7 +7,7 @@
  * Portions Copyright (c) 1996-2007, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $PostgreSQL: pgsql/src/include/access/hash.h,v 1.78 2007/04/09 22:04:05 tgl Exp $
+ * $PostgreSQL: pgsql/src/include/access/hash.h,v 1.79 2007/04/19 20:24:04 tgl Exp $
  *
  * NOTES
  *        modeled after Margo Seltzer's hash implementation for unix.
@@ -284,6 +284,7 @@ extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
 extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
 extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
 extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno, int access);
+extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno, int access);
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_wrtbuf(Relation rel, Buffer buf);