@@ -329,19 +329,26 @@ down in the chain. This is repeated until there are no internal pages left
329
329
in the chain. Finally, the half-dead leaf page itself is unlinked from its
330
330
siblings.
331
331
332
- A deleted page cannot bereclaimed immediately, since there may be other
332
+ A deleted page cannot berecycled immediately, since there may be other
333
333
processes waiting to reference it (ie, search processes that just left the
334
334
parent, or scans moving right or left from one of the siblings). These
335
- processes must observe that the page is marked dead and recover
336
- accordingly. Searches and forward scans simply follow the right-link
337
- until they find a non-dead page --- this will be where the deleted page's
338
- key-space moved to.
335
+ processes must be able to observe a deleted page for some time after the
336
+ deletion operation, in order to be able to at least recover from it (they
337
+ recover by moving right, as with concurrent page splits). Searchers never
338
+ have to worry about concurrent page recycling.
339
+
340
+ See "Placing deleted pages in the FSM" section below for a description of
341
+ when and how deleted pages become safe for VACUUM to make recyclable.
342
+
343
+ Page deletion and backwards scans
344
+ ---------------------------------
339
345
340
346
Moving left in a backward scan is complicated because we must consider
341
347
the possibility that the left sibling was just split (meaning we must find
342
348
the rightmost page derived from the left sibling), plus the possibility
343
349
that the page we were just on has now been deleted and hence isn't in the
344
350
sibling chain at all anymore. So the move-left algorithm becomes:
351
+
345
352
0. Remember the page we are on as the "original page".
346
353
1. Follow the original page's left-link (we're done if this is zero).
347
354
2. If the current page is live and its right-link matches the "original
@@ -358,28 +365,15 @@ sibling chain at all anymore. So the move-left algorithm becomes:
358
365
current left-link). If it is dead, move right until a non-dead page
359
366
is found (there must be one, since rightmost pages are never deleted),
360
367
mark that as the new "original page", and return to step 1.
368
+
361
369
This algorithm is correct because the live page found by step 4 will have
362
370
the same left keyspace boundary as the page we started from. Therefore,
363
371
when we ultimately exit, it must be on a page whose right keyspace
364
372
boundary matches the left boundary of where we started --- which is what
365
373
we need to be sure we don't miss or re-scan any items.
366
374
367
- A deleted page can only be reclaimed once there is no scan or search that
368
- has a reference to it; until then, it must stay in place with its
369
- right-link undisturbed. We implement this by waiting until all active
370
- snapshots and registered snapshots as of the deletion are gone; which is
371
- overly strong, but is simple to implement within Postgres. When marked
372
- dead, a deleted page is labeled with the next-transaction counter value.
373
- VACUUM can reclaim the page for re-use when this transaction number is
374
- guaranteed to be "visible to everyone". As collateral damage, this
375
- implementation also waits for running XIDs with no snapshots and for
376
- snapshots taken until the next transaction to allocate an XID commits.
377
-
378
- Reclaiming a page doesn't actually change the state of the page --- we
379
- simply record it in the free space map, from which it will be handed out
380
- the next time a new page is needed for a page split. The deleted page's
381
- contents will be overwritten by the split operation (it will become the
382
- new right page).
375
+ Page deletion and tree height
376
+ -----------------------------
383
377
384
378
Because we never delete the rightmost page of any level (and in particular
385
379
never delete the root), it's impossible for the height of the tree to
@@ -399,6 +393,43 @@ as part of the atomic update for the delete (either way, the metapage has
399
393
to be the last page locked in the update to avoid deadlock risks). This
400
394
avoids race conditions if two such operations are executing concurrently.
401
395
396
+ Placing deleted pages in the FSM
397
+ --------------------------------
398
+
399
+ Recycling a page is decoupled from page deletion. A deleted page can only
400
+ be put in the FSM to be recycled once there is no possible scan or search
401
+ that has a reference to it; until then, it must stay in place with its
402
+ sibling links undisturbed, as a tombstone that allows concurrent searches
403
+ to detect and then recover from concurrent deletions (which are rather
404
+ like concurrent page splits to searchers). This design is an
405
+ implementation of what Lanin and Shasha call "the drain technique".
406
+
407
+ We implement the technique by waiting until all active snapshots and
408
+ registered snapshots as of the page deletion are gone; which is overly
409
+ strong, but is simple to implement within Postgres. When marked fully
410
+ dead, a deleted page is labeled with the next-transaction counter value.
411
+ VACUUM can reclaim the page for re-use when the stored XID is guaranteed
412
+ to be "visible to everyone". As collateral damage, we wait for snapshots
413
+ taken until the next transaction to allocate an XID commits. We also wait
414
+ for running XIDs with no snapshots.
415
+
416
+ The need for this additional indirection after a page deletion operation
417
+ takes place is a natural consequence of the highly permissive rules for
418
+ index scans with Lehman and Yao's design. In general an index scan
419
+ doesn't have to hold a lock or even a pin on any page when it descends the
420
+ tree (nothing that you'd usually think of as an interlock is held "between
421
+ levels"). At the same time, index scans cannot be allowed to land on a
422
+ truly unrelated page due to concurrent recycling (not to be confused with
423
+ concurrent deletion), because that results in wrong answers to queries.
424
+ Simpler approaches to page deletion that don't need to defer recycling are
425
+ possible, but none seem compatible with Lehman and Yao's design.
426
+
427
+ Placing an already-deleted page in the FSM to be recycled when needed
428
+ doesn't actually change the state of the page. The page will be changed
429
+ whenever it is subsequently taken from the FSM for reuse. The deleted
430
+ page's contents will be overwritten by the split operation (it will become
431
+ the new right sibling page).
432
+
402
433
Fastpath For Index Insertion
403
434
----------------------------
404
435