@@ -213,7 +213,7 @@ this flag must be clear before splitting a bucket; thus, a bucket can't be
213213split again until the previous split is totally complete.
214214
215215The moved-by-split flag on a tuple indicates that the tuple was moved from the old to the
216- new bucket. Concurrent scans can skip such tuples till the split operation
216+ new bucket. Concurrent scans will skip such tuples until the split operation
217217is finished. Once the tuple is marked as moved-by-split, it will remain so
218218forever but that does no harm. We have intentionally not cleared it as that
219219can generate an additional I/O which is not necessary.
@@ -287,13 +287,17 @@ The insertion algorithm is rather similar:
287287if current page is full, release lock but not pin, read/exclusive-lock
288288 next page; repeat as needed
289289>> see below if no space in any page of bucket
290+ take buffer content lock in exclusive mode on metapage
290291insert tuple at appropriate place in page
291- mark current page dirty and release buffer content lock and pin
292- if the current page is not a bucket page, release the pin on bucket page
293- pin meta page and take buffer content lock in exclusive mode
292+ mark current page dirty
294293increment tuple count, decide if split needed
295- mark meta page dirty and release buffer content lock and pin
296- done if no split needed, else enter Split algorithm below
294+ mark meta page dirty
295+ write WAL for insertion of tuple
296+ release the buffer content lock on metapage
297+ release buffer content lock on current page
298+ if current page is not a bucket page, release the pin on bucket page
299+ if split is needed, enter Split algorithm below
300+ release the pin on metapage
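
As an illustration of the ordering above (modify the pages, mark them dirty,
then write one WAL record covering them, all inside a single critical
section), here is a rough C sketch of how such an insert could be logged with
the generic XLog API.  It is a simplified sketch rather than the actual
implementation: the record name XLOG_HASH_INSERT, struct xl_hash_insert and
SizeOfHashInsert are assumed to come from access/hash_xlog.h, the caller is
assumed to already hold exclusive content locks on both pages, and space
checks, the metapage update and split handling are elided.

    #include "postgres.h"
    #include "access/hash_xlog.h"
    #include "access/itup.h"
    #include "access/xloginsert.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Sketch only: log one tuple inserted into "buf"; "metabuf" is also locked. */
    static void
    hash_log_insert_sketch(Relation rel, Buffer buf, Buffer metabuf,
                           IndexTuple itup, OffsetNumber itup_off)
    {
        START_CRIT_SECTION();

        /* modify the data page first ... */
        if (PageAddItem(BufferGetPage(buf), (Item) itup, IndexTupleSize(itup),
                        itup_off, false, false) == InvalidOffsetNumber)
            elog(ERROR, "failed to add index item");
        /* (the metapage tuple count would be bumped here as well) */
        MarkBufferDirty(buf);
        MarkBufferDirty(metabuf);

        /* ... then describe the change in a single WAL record */
        if (RelationNeedsWAL(rel))
        {
            xl_hash_insert xlrec;
            XLogRecPtr  recptr;

            xlrec.offnum = itup_off;

            XLogBeginInsert();
            XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
            XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
            XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
            XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);

            recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);

            PageSetLSN(BufferGetPage(buf), recptr);
            PageSetLSN(BufferGetPage(metabuf), recptr);
        }

        END_CRIT_SECTION();
    }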
297301
298302To speed searches, the index entries within any individual index page are
299303kept sorted by hash code; the insertion code must take care to insert new
@@ -328,12 +332,17 @@ existing bucket in two, thereby lowering the fill ratio:
328332 try to finish the split and the cleanup work
329333 if that succeeds, start over; if it fails, give up
330334mark the old and new buckets indicating split is in progress
335+ mark both old and new buckets as dirty
336+ write WAL for allocation of new page for split
331337copy the tuples that belong to the new bucket from the old bucket, marking
332338 them as moved-by-split
339+ write WAL record for moving tuples to the new page once the new page is full
340+ or all the pages of the old bucket have been processed
333341release lock but not pin for primary bucket page of old bucket,
334342 read/shared-lock next page; repeat as needed
335343clear the bucket-being-split and bucket-being-populated flags
336344mark the old bucket indicating split-cleanup
345+ write WAL for changing the flags on both old and new buckets
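
The "marking them as moved-by-split" step above sets a reserved bit in each
copied tuple's header before the copy goes into the new bucket; concurrent
scans that see the split still in progress skip such flagged copies and find
the originals in the old bucket instead.  A minimal sketch, assuming
INDEX_MOVED_BY_SPLIT_MASK is the reserved t_info bit used for this (as in
access/hash.h) and reusing the headers from the insert sketch earlier:

    /* Sketch only: copy an index tuple and flag the copy as moved-by-split. */
    static IndexTuple
    copy_tuple_marked_moved(IndexTuple src)
    {
        Size        sz = IndexTupleSize(src);
        IndexTuple  dst = (IndexTuple) palloc(sz);

        memcpy(dst, src, sz);
        /* assumed flag bit; it is never cleared, as noted earlier in this README */
        dst->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
        return dst;
    }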
337346
338347The split operation's attempt to acquire cleanup-lock on the old bucket number
339348could fail if another process holds any lock or pin on it. We do not want to
@@ -369,6 +378,8 @@ The fourth operation is garbage collection (bulk deletion):
369378acquire cleanup lock on primary bucket page
370379loop:
371380scan and remove tuples
381+ mark the target page dirty
382+ write WAL for deleting tuples from target page
372383if this is the last bucket page, break out of loop
373384pin and x-lock next page
374385release prior lock and pin (except keep pin on primary bucket page)
@@ -383,7 +394,8 @@ The fourth operation is garbage collection (bulk deletion):
383394check if number of buckets changed
384395if so, release content lock and pin and return to for-each-bucket loop
385396else update metapage tuple count
386- mark meta page dirty and release buffer content lock and pin
397+ mark meta page dirty and write WAL for update of metapage
398+ release buffer content lock and pin
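
The per-page "scan and remove tuples ... write WAL for deleting tuples" step
in the loop above could be logged roughly as in the fragment below, reusing
the conventions of the insert sketch earlier.  The record name
XLOG_HASH_DELETE is assumed from access/hash_xlog.h; deletable[] holds the
offsets found removable on this page, and the real record also carries a few
flags that are omitted here.

    /* Sketch only: remove ndeletable tuples from the page in "buf" and log it. */
    START_CRIT_SECTION();

    PageIndexMultiDelete(BufferGetPage(buf), deletable, ndeletable);
    MarkBufferDirty(buf);

    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
        /* the removed offsets ride along as per-buffer data for redo */
        XLogRegisterBufData(0, (char *) deletable,
                            ndeletable * sizeof(OffsetNumber));

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
        PageSetLSN(BufferGetPage(buf), recptr);
    }

    END_CRIT_SECTION();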
387399
388400Note that this is designed to allow concurrent splits and scans. If a split
389401occurs, tuples relocated into the new bucket will be visited twice by the
@@ -425,18 +437,16 @@ Obtaining an overflow page:
425437search for a free page (zero bit in bitmap)
426438if found:
427439set bit in bitmap
428- mark bitmap page dirty and release content lock
440+ mark bitmap page dirty
429441take metapage buffer content lock in exclusive mode
430442if first-free-bit value did not change,
431443update it and mark meta page dirty
432- release meta page buffer content lock
433- return page number
434444else (not found):
435445release bitmap page buffer content lock
436446loop back to try next bitmap page, if any
437447-- here when we have checked all bitmap pages; we hold meta excl. lock
438448extend index to add another overflow page; update meta information
439- mark meta page dirty and release buffer content lock
449+ mark meta page dirty
440450return page number
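
Each bitmap page is simply an array of bits, one bit per overflow page, with a
set bit meaning the page is in use.  The "search for a free page (zero bit in
bitmap)" step is therefore a linear scan for a clear bit.  The helper below is
purely illustrative (it is not the function the hash code uses) and operates
on an in-memory array of 32-bit words:

    /* Illustrative only: find the first clear bit at or after "start",
     * or return -1 if all "nbits" bits are set. */
    static int
    first_clear_bit(const uint32 *words, int nbits, int start)
    {
        int         i;

        for (i = start; i < nbits; i++)
        {
            if ((words[i / 32] & ((uint32) 1 << (i % 32))) == 0)
                return i;       /* this overflow page is free */
        }
        return -1;              /* caller must extend the index instead */
    }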
441451
442452It is slightly annoying to release and reacquire the metapage lock
@@ -456,12 +466,17 @@ like this:
456466
457467-- having determined that no space is free in the target bucket:
458468remember last page of bucket, drop write lock on it
459- call free-page-acquire routine
460469re-write-lock last page of bucket
461470if it is not last anymore, step to the last page
462- update (former) last page to point to new page
471+ execute free-page-acquire (obtaining an overflow page) mechanism
472+ described above
473+ update (former) last page to point to the new page and mark buffer dirty
463474write-lock and initialize new page, with back link to former last page
464- write and release former last page
475+ write WAL for addition of overflow page
476+ release the locks on meta page and bitmap page acquired in
477+ free-page-acquire algorithm
478+ release the lock on former last page
479+ release the lock on new overflow page
465480insert tuple into new page
466481-- etc.
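
The "write-lock and initialize new page, with back link to former last page"
step amounts to setting up the new overflow page's special space and fixing
the chain links on both pages before the WAL record for the addition is
written.  A rough sketch, assuming the field and flag names from access/hash.h
(hasho_prevblkno, hasho_nextblkno, LH_OVERFLOW_PAGE, HASHO_PAGE_ID) and that
both buffers are already exclusively locked; the WAL record covering the
metapage, bitmap page, former last page and new page is omitted here:

    /* Sketch only: initialize a new overflow page and link it after "lastbuf". */
    static void
    init_and_link_ovfl_page(Buffer lastbuf, Buffer ovflbuf, Bucket bucket)
    {
        Page        ovflpage = BufferGetPage(ovflbuf);
        HashPageOpaque ovflopaque;
        HashPageOpaque lastopaque;

        PageInit(ovflpage, BufferGetPageSize(ovflbuf), sizeof(HashPageOpaqueData));

        ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
        ovflopaque->hasho_prevblkno = BufferGetBlockNumber(lastbuf);  /* back link */
        ovflopaque->hasho_nextblkno = InvalidBlockNumber;
        ovflopaque->hasho_bucket = bucket;
        ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
        ovflopaque->hasho_page_id = HASHO_PAGE_ID;

        lastopaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(lastbuf));
        lastopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf); /* forward link */

        MarkBufferDirty(ovflbuf);
        MarkBufferDirty(lastbuf);
    }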
467482
@@ -488,12 +503,14 @@ accessors of pages in the bucket. The algorithm is:
488503determine which bitmap page contains the free space bit for page
489504release meta page buffer content lock
490505pin bitmap page and take buffer content lock in exclusive mode
491- update bitmap bit
492- mark bitmap page dirty and release buffer content lock and pin
493- if page number is less than what we saw as first-free-bit in meta:
494506retake meta page buffer content lock in exclusive mode
507+ move (insert) tuples that belong to the overflow page being freed
508+ update bitmap bit
509+ mark bitmap page dirty
495510if page number is still less than first-free-bit,
496511update first-free-bit field and mark meta page dirty
512+ write WAL for delinking overflow page operation
513+ release buffer content lock and pin
497514release meta page buffer content lock and pin
498515
499516We have to do it this way because we must clear the bitmap bit before
@@ -504,8 +521,91 @@ page acquirer will scan more bitmap bits than he needs to. What must be
504521avoided is having first-free-bit greater than the actual first free bit,
505522because then that free page would never be found by searchers.
506523
507- All the freespace operations should be called while holding no buffer
508- locks. Since they need no lmgr locks, deadlock is not possible.
524+ The reason for moving tuples out of the overflow page while delinking it is
525+ to make those two changes a single atomic operation. Not doing so could lead
526+ to spurious reads on a standby: the user might see the same tuple twice.
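
Concretely, the tuple moves and the delink are described by a single WAL
record that registers every page touched, so replay applies all of the change
or none of it.  A condensed sketch of the registration, reusing the
conventions of the earlier sketches; the record name XLOG_HASH_SQUEEZE_PAGE is
assumed from access/hash_xlog.h, wbuf is the page receiving the tuples, and
the moved tuples' data and the chain-link updates are omitted:

    /* Sketch only: one record covers the page gaining tuples, the page being
     * freed, the bitmap page whose bit is cleared, and the metapage. */
    START_CRIT_SECTION();

    /* ... move tuples into wbuf, delink ovflbuf, clear the bitmap bit, update
     * first-free-bit in the metapage, marking each buffer dirty ... */

    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, wbuf, REGBUF_STANDARD);     /* receives tuples */
        XLogRegisterBuffer(1, ovflbuf, REGBUF_STANDARD);  /* page being freed */
        XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);   /* bitmap page */
        XLogRegisterBuffer(3, metabuf, REGBUF_STANDARD);  /* metapage */

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);

        PageSetLSN(BufferGetPage(wbuf), recptr);
        PageSetLSN(BufferGetPage(ovflbuf), recptr);
        PageSetLSN(BufferGetPage(mapbuf), recptr);
        PageSetLSN(BufferGetPage(metabuf), recptr);
    }

    END_CRIT_SECTION();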
527+
528+
529+ WAL Considerations
530+ ------------------
531+
532+ The hash index operations (create index, insert, delete, bucket split,
533+ allocate overflow page, and squeeze) do not by themselves guarantee hash
534+ index consistency after a crash. To provide robustness, we write WAL for
535+ each of these operations.
536+
537+ CREATE INDEX writes multiple WAL records. First, we write a record to cover
538+ the initialization of the metapage, followed by one for each new bucket
539+ created, followed by one for the initial bitmap page. It's not important for
540+ index creation to appear atomic, because the index isn't yet visible to any
541+ other transaction, and the creating transaction will roll back in the event of
542+ a crash. It would be difficult to cover the whole operation with a single
543+ write-ahead log record anyway, because we can log only a fixed number of
544+ pages, as given by XLR_MAX_BLOCK_ID (32), with current XLog machinery.
545+
546+ Ordinary item insertions (that don't force a page split or need a new overflow
547+ page) are single WAL entries. They touch a single bucket page and the
548+ metapage. The metapage is updated during replay just as it is during the
549+ original operation.
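
On the replay side, redo for such a record might look roughly like the sketch
below: the data page is touched only if it actually needs redo (it may already
be up to date thanks to a full-page image), and the metapage's tuple count is
bumped just as the original operation did.  Names such as xl_hash_insert,
XLogReadBufferForRedo and hashm_ntuples follow access/hash_xlog.h,
access/xlogutils.h and access/hash.h; this is a simplified illustration, not
the actual redo routine.

    /* Sketch only: redo an ordinary hash-index insert. */
    static void
    hash_insert_redo_sketch(XLogReaderState *record)
    {
        XLogRecPtr  lsn = record->EndRecPtr;
        xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
        Buffer      buf;
        Buffer      metabuf;

        if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
        {
            Size        len;
            char       *itup = XLogRecGetBlockData(record, 0, &len);
            Page        page = BufferGetPage(buf);

            /* re-insert the tuple at the offset recorded by the do path */
            if (PageAddItem(page, (Item) itup, len, xlrec->offnum,
                            false, false) == InvalidOffsetNumber)
                elog(PANIC, "hash insert redo: failed to add item");

            PageSetLSN(page, lsn);
            MarkBufferDirty(buf);
        }
        if (BufferIsValid(buf))
            UnlockReleaseBuffer(buf);

        if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
        {
            Page        metapage = BufferGetPage(metabuf);

            /* bump ntuples in the metapage, as the original insert did */
            HashPageGetMeta(metapage)->hashm_ntuples += 1;

            PageSetLSN(metapage, lsn);
            MarkBufferDirty(metabuf);
        }
        if (BufferIsValid(metabuf))
            UnlockReleaseBuffer(metabuf);
    }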
550+
551+ If an insertion causes the addition of an overflow page, there will be one
552+ WAL entry for the new overflow page and a second entry for the insert itself.
553+
554+ If an insertion causes a bucket split, there will be one WAL entry for the
555+ insert itself, followed by a WAL entry for allocating the new bucket, followed
556+ by a WAL entry for each overflow page of the new bucket to which tuples are
557+ moved from the old bucket, followed by a WAL entry to indicate that the split
558+ is complete for both old and new buckets. A split operation that requires
559+ overflow pages to complete will additionally need to write a WAL record for
560+ each overflow page it allocates.
561+
562+ As splitting involves multiple atomic actions, it's possible for the system to
563+ crash partway through moving tuples from the pages of the old bucket to the
564+ new bucket. In such a case, after recovery, the old and new buckets will still
565+ be marked with the bucket-being-split and bucket-being-populated flags,
566+ respectively, which indicate that a split is in progress for those buckets.
567+ The reader algorithm works correctly: it will scan both the old and new buckets
568+ while the split is in progress, as explained in the reader algorithm section above.
569+
570+ We finish the split at next insert or split operation on the old bucket as
571+ explained in insert and split algorithm above. It could be done during
572+ searches, too, but it seems best not to put any extra updates in what would
573+ otherwise be a read-only operation (updating is not possible in hot standby
574+ mode anyway). It would seem natural to complete the split in VACUUM, but since
575+ splitting a bucket might require allocating a new page, it might fail if you
576+ run out of disk space. That would be bad during VACUUM - the reason for
577+ running VACUUM in the first place might be that you have run out of disk space,
578+ and now VACUUM won't finish because you're out of disk space. In contrast,
579+ an insertion can require enlarging the physical file anyway.
580+
581+ Deletion of tuples from a bucket is performed for two reasons: to remove dead
582+ tuples, and to remove tuples that were moved by a bucket split. A WAL entry
583+ is made for each bucket page from which tuples are removed, and then another
584+ WAL entry is made when we clear the needs-split-cleanup flag. If dead tuples
585+ are removed, a separate WAL entry is made to update the metapage.
586+
587+ As deletion involves multiple atomic operations, it is quite possible that the
588+ system crashes (a) after removing tuples from only some of the bucket pages,
589+ (b) before clearing the garbage flag, or (c) before updating the metapage.
590+ If the system crashes before completing (b), it will again try to clean the
591+ bucket during the next vacuum or insert after recovery, which can have some
592+ performance impact, but it will work fine. If the system crashes before completing (c),
593+ after recovery there could be some additional splits until the next vacuum
594+ updates the metapage, but the other operations like insert, delete and scan
595+ will work correctly. We can fix this problem by actually updating the
596+ metapage based on delete operation during replay, but it's not clear whether
597+ it's worth the complication.
598+
599+ A squeeze operation moves tuples from one of the pages later in the bucket
600+ chain to one of the pages earlier in the chain, and writes a WAL record when
601+ either the page to which it is writing tuples becomes full or the page from
602+ which it is removing tuples becomes empty.
603+
604+ As a squeeze operation involves multiple atomic operations, it is quite
605+ possible that the system crashes before completing the operation on the
606+ entire bucket. After recovery, the operations will work correctly, but
607+ the index will remain bloated, which can hurt the performance of read and
608+ insert operations until the next vacuum squeezes the bucket completely.
509609
510610
511611Other Notes