@@ -132,15 +132,6 @@ long-term locking since there is a (small) risk of deadlock, which we must
 be able to detect. Buffer context locks are used for short-term access
 control to individual pages of the index.
 
-We define the following lmgr locks for a hash index:
-
-LockPage(rel, 0) represents the right to modify the hash-code-to-bucket
-mapping. A process attempting to enlarge the hash table by splitting a
-bucket must exclusive-lock this lock before modifying the metapage data
-representing the mapping. Processes intending to access a particular
-bucket must share-lock this lock until they have acquired lock on the
-correct target bucket.
-
 LockPage(rel, page), where page is the page number of a hash bucket page,
 represents the right to split or compact an individual bucket. A process
 splitting a bucket must exclusive-lock both old and new halves of the
@@ -150,7 +141,10 @@ insertions must share-lock the bucket they are scanning or inserting into.
 (It is okay to allow concurrent scans and insertions.)
 
 The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.
+These are available for possible future refinements. LockPage(rel, 0)
+is also currently undefined (it was previously used to represent the right
+to modify the hash-code-to-bucket mapping, but it is no longer needed for
+that purpose).
 
 Note that these lock definitions are conceptually distinct from any sort
 of lock on the pages whose numbers they share. A process must also obtain
@@ -165,9 +159,7 @@ hash index code, since a process holding one of these locks could block
 waiting for an unrelated lock held by another process. If that process
 then does something that requires exclusive lock on the bucket, we have
 deadlock. Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from. This also forces the page-zero lock
-to be an lmgr lock, because as we'll see below it is held while attempting
-to acquire a bucket lock, and so it could also participate in a deadlock.
+can be detected and recovered from.
 
 Processes must obtain read (share) buffer context lock on any hash index
 page while reading it, and write (exclusive) lock while modifying it.
@@ -195,24 +187,30 @@ track of available overflow pages.
 
 The reader algorithm is:
 
-	share-lock page 0 (to prevent active split)
-	read/sharelock meta page
-	compute bucket number for target hash key
-	release meta page
-	share-lock bucket page (to prevent split/compact of this bucket)
-	release page 0 share-lock
+	pin meta page and take buffer content lock in shared mode
+	loop:
+		compute bucket number for target hash key
+		release meta page buffer content lock
+		if (correct bucket page is already locked)
+			break
+		release any existing bucket page lock (if a concurrent split happened)
+		take heavyweight bucket lock
+		retake meta page buffer content lock in shared mode
 -- then, per read request:
-	read/sharelock current page of bucket
+	release pin on metapage
+	read current page of bucket and take shared buffer content lock
 	step to next page if necessary (no chaining of locks)
 	get tuple
-	release current page
+	release buffer content lock and pin on current page
 -- at scan shutdown:
 	release bucket share-lock
 
-By holding the page-zero lock until lock on the target bucket is obtained,
-the reader ensures that the target bucket calculation is valid (otherwise
-the bucket might be split before the reader arrives at it, and the target
-entries might go into the new bucket). Holding the bucket sharelock for
+We can't hold the metapage lock while acquiring a lock on the target bucket,
+because that might result in an undetected deadlock (lwlocks do not participate
+in deadlock detection). Instead, we relock the metapage after acquiring the
+bucket page lock and check whether the bucket has been split. If not, we're
+done. If so, we release our previously-acquired lock and repeat the process
+using the new bucket number. Holding the bucket sharelock for
 the remainder of the scan prevents the reader's current-tuple pointer from
 being invalidated by splits or compactions. Notice that the reader's lock
 does not prevent other buckets from being split or compacted.
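The locate-bucket retry loop in the new reader algorithm can be sketched concretely. The following is an illustrative Python model, not PostgreSQL source: `HashIndexModel`, the modulo bucket mapping, and the use of `threading.Lock` for both the metapage content lock and the heavyweight bucket locks are all invented stand-ins.

```python
import threading

# Toy model of the reader's locate-bucket loop. meta_lock stands in for
# the metapage buffer content lock; per-bucket Locks stand in for the
# heavyweight bucket locks. All names here are invented for illustration.
class HashIndexModel:
    def __init__(self, nbuckets=4):
        self.meta_lock = threading.Lock()
        self.nbuckets = nbuckets          # metapage field, read under meta_lock
        self.bucket_locks = {}            # bucket number -> lock

    def _bucket_for(self, hashkey):
        # trivial stand-in for the real hash-code-to-bucket mapping
        return hashkey % self.nbuckets

    def lock_target_bucket(self, hashkey):
        with self.meta_lock:
            target = self._bucket_for(hashkey)
        while True:
            # take the bucket lock WITHOUT holding meta_lock, so a blocking
            # wait here cannot create an undetectable deadlock
            lock = self.bucket_locks.setdefault(target, threading.Lock())
            lock.acquire()
            # retake the metapage lock and recheck the mapping
            with self.meta_lock:
                recheck = self._bucket_for(hashkey)
            if recheck == target:
                return target             # mapping still valid; lock is held
            lock.release()                # a concurrent split moved our key
            target = recheck              # retry with the new bucket number
```

The point the model preserves is the ordering: the bucket lock is never acquired while the metapage lock is held, which is exactly why the recheck-and-retry loop is needed.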
@@ -229,22 +227,26 @@ as it was before.
 
 The insertion algorithm is rather similar:
 
-	share-lock page 0 (to prevent active split)
-	read/sharelock meta page
-	compute bucket number for target hash key
-	release meta page
-	share-lock bucket page (to prevent split/compact of this bucket)
-	release page 0 share-lock
+	pin meta page and take buffer content lock in shared mode
+	loop:
+		compute bucket number for target hash key
+		release meta page buffer content lock
+		if (correct bucket page is already locked)
+			break
+		release any existing bucket page lock (if a concurrent split happened)
+		take heavyweight bucket lock in shared mode
+		retake meta page buffer content lock in shared mode
 -- (so far same as reader)
-	read/exclusive-lock current page of bucket
+	release pin on metapage
+	pin current page of bucket and take exclusive buffer content lock
 	if full, release, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
 	insert tuple at appropriate place in page
-	write/release current page
-	release bucket share-lock
-	read/exclusive-lock meta page
+	mark current page dirty and release buffer content lock and pin
+	release heavyweight share-lock
+	pin meta page and take buffer content lock in shared mode
 	increment tuple count, decide if split needed
-	write/release meta page
+	mark meta page dirty and release buffer content lock and pin
 	done if no split needed, else enter Split algorithm below
 To speed searches, the index entries within any individual index page are
@@ -269,26 +271,23 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:
 
-	exclusive-lock page 0 (assert the right to begin a split)
-	read/exclusive-lock meta page
+	pin meta page and take buffer content lock in exclusive mode
 	check split still needed
-	if split not needed anymore, drop locks and exit
+	if split not needed anymore, drop buffer content lock and pin and exit
 	decide which bucket to split
 	Attempt to X-lock old bucket number (definitely could fail)
 	Attempt to X-lock new bucket number (shouldn't fail, but...)
-	if above fail, drop locks and exit
+	if above fail, drop locks and pin and exit
 	update meta page to reflect new number of buckets
-	write/release meta page
-	release X-lock on page 0
+	mark meta page dirty and release buffer content lock and pin
 -- now, accesses to all other buckets can proceed.
 	Perform actual split of bucket, moving tuples as needed
 	>> see below about acquiring needed extra space
 	Release X-locks of old and new buckets
 
-Note the page zero and metapage locks are not held while the actual tuple
-rearrangement is performed, so accesses to other buckets can proceed in
-parallel; in fact, it's possible for multiple bucket splits to proceed
-in parallel.
+Note the metapage lock is not held while the actual tuple rearrangement is
+performed, so accesses to other buckets can proceed in parallel; in fact,
+it's possible for multiple bucket splits to proceed in parallel.
 
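The conditional ("attempt, don't wait") lock acquisition in the split algorithm can be sketched as follows. This is an illustrative Python model, not PostgreSQL source; `try_split` and its parameters are invented, with non-blocking `Lock.acquire` standing in for the conditional heavyweight lock attempts.

```python
import threading

# Illustrative sketch: the split routine takes the two bucket X-locks
# conditionally rather than waiting, and simply abandons the split
# attempt if either lock is unavailable.
def try_split(old_bucket_lock, new_bucket_lock, do_split):
    if not old_bucket_lock.acquire(blocking=False):
        return False                      # a scan/insert holds it; try later
    if not new_bucket_lock.acquire(blocking=False):
        old_bucket_lock.release()         # shouldn't happen, but be safe
        return False
    try:
        do_split()                        # move tuples between the buckets
    finally:
        new_bucket_lock.release()
        old_bucket_lock.release()
    return True
```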
 Split's attempt to X-lock the old bucket number could fail if another
 process holds S-lock on it. We do not want to wait if that happens, first
@@ -316,20 +315,20 @@ go-round.
 The fourth operation is garbage collection (bulk deletion):
 
 	next bucket := 0
-	read/sharelock meta page
+	pin metapage and take buffer content lock in exclusive mode
 	fetch current max bucket number
-	release meta page
+	release meta page buffer content lock and pin
 	while next bucket <= max bucket do
 		Acquire X lock on target bucket
 		Scan and remove tuples, compact free space as needed
 		Release X lock
 		next bucket ++
 	end loop
-	exclusive-lock meta page
+	pin metapage and take buffer content lock in exclusive mode
 	check if number of buckets changed
-	if so, release lock and return to for-each-bucket loop
+	if so, release content lock and pin and return to for-each-bucket loop
 	else update metapage tuple count
-	write/release meta page
+	mark meta page dirty and release buffer content lock and pin
 
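The recheck-and-continue structure of the bulk-deletion loop can be sketched as below. This is an illustrative model, not PostgreSQL source; the dict-based index and `clean_bucket` callback are invented, and the metapage locking around the bucket-count reads is elided into comments.

```python
# Illustrative sketch of the bulk-deletion driver: after cleaning buckets
# 0..max, recheck the bucket count (read under the metapage lock in the
# real code) and keep going if a concurrent split added buckets while we
# were working.
def bulk_delete(index, clean_bucket):
    cleaned = []
    next_bucket = 0
    while True:
        max_bucket = index["nbuckets"] - 1
        while next_bucket <= max_bucket:
            clean_bucket(next_bucket)     # X-lock, scan/remove, compact
            cleaned.append(next_bucket)
            next_bucket += 1
        if index["nbuckets"] - 1 == max_bucket:
            return cleaned                # bucket count stable; done
        # else loop again: pick up the buckets added by concurrent splits
```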
 Note that this is designed to allow concurrent splits. If a split occurs,
 tuples relocated into the new bucket will be visited twice by the scan,
@@ -360,25 +359,25 @@ overflow page to the free pool.
 
 Obtaining an overflow page:
 
-	read/exclusive-lock meta page
+	take metapage content lock in exclusive mode
 	determine next bitmap page number; if none, exit loop
-	release meta page lock
-	read/exclusive-lock bitmap page
+	release meta page content lock
+	pin bitmap page and take content lock in exclusive mode
 	search for a free page (zero bit in bitmap)
 	if found:
 		set bit in bitmap
-		write/release bitmap page
-		read/exclusive-lock meta page
+		mark bitmap page dirty and release content lock
+		take metapage buffer content lock in exclusive mode
 		if first-free-bit value did not change,
-			update it and write meta page
-		release meta page
+			update it and mark meta page dirty
+		release meta page buffer content lock
 		return page number
 	else (not found):
-		release bitmap page
+		release bitmap page buffer content lock
 		loop back to try next bitmap page, if any
 -- here when we have checked all bitmap pages; we hold meta excl. lock
 	extend index to add another overflow page; update meta information
-	write/release meta page
+	mark meta page dirty and release buffer content lock
 	return page number
 
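The bitmap search and the first-free-bit bookkeeping above can be modeled compactly. This is an illustrative sketch, not PostgreSQL source; the list-of-flags bitmaps, the dict-based metapage, and the simplified "extend" branch are all invented stand-ins (the real code allocates a physical page and may add a new bitmap page).

```python
# Illustrative sketch of the overflow-page search: scan bitmap pages for
# a zero bit, claim it, and advance the metapage's cached first-free-bit
# hint if no concurrent allocation moved it meanwhile.
def get_overflow_page(bitmaps, meta, bits_per_page):
    for pageno, bitmap in enumerate(bitmaps):  # bitmap: list of 0/1 flags
        for bit, used in enumerate(bitmap):
            if not used:
                bitmap[bit] = 1                # set bit in bitmap
                ovfl = pageno * bits_per_page + bit
                if meta["firstfree"] <= ovfl:  # hint did not change meanwhile
                    meta["firstfree"] = ovfl + 1
                return ovfl
    # here when all bitmap pages are full: extend the index (simplified)
    meta["npages"] += 1
    return meta["npages"] - 1
```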
 It is slightly annoying to release and reacquire the metapage lock
@@ -428,17 +427,17 @@ algorithm is:
 
 	delink overflow page from bucket chain
 	(this requires read/update/write/release of fore and aft siblings)
-	read/share-lock meta page
+	pin meta page and take buffer content lock in shared mode
 	determine which bitmap page contains the free space bit for page
-	release meta page
-	read/exclusive-lock bitmap page
+	release meta page buffer content lock
+	pin bitmap page and take buffer content lock in exclusive mode
 	update bitmap bit
-	write/release bitmap page
+	mark bitmap page dirty and release buffer content lock and pin
 	if page number is less than what we saw as first-free-bit in meta:
-		read/exclusive-lock meta page
+		retake meta page buffer content lock in exclusive mode
 		if page number is still less than first-free-bit,
-			update first-free-bit field and write meta page
-		release meta page
+			update first-free-bit field and mark meta page dirty
+		release meta page buffer content lock and pin
 
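The clear-bit-then-lower-hint ordering in the release algorithm can be sketched as follows. This is an illustrative model, not PostgreSQL source; the flag-list bitmaps and dict metapage are invented, and the retake-lock-and-recheck step is elided into a comment.

```python
# Illustrative sketch of returning an overflow page to the free pool:
# clear the bitmap bit FIRST, then lower the metapage's cached
# first-free-bit hint only if this page sits below it.
def free_overflow_page(ovfl, bitmaps, meta, bits_per_page):
    pageno, bit = divmod(ovfl, bits_per_page)
    bitmaps[pageno][bit] = 0          # clear the bit BEFORE touching the hint
    # in the real algorithm the metapage lock is retaken and the test redone
    if ovfl < meta["firstfree"]:
        meta["firstfree"] = ovfl
```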
 We have to do it this way because we must clear the bitmap bit before
 changing the first-free-bit field (hashm_firstfree). It is possible that