@@ -3,11 +3,11 @@ src/backend/storage/lmgr/README-SSI
33Serializable Snapshot Isolation (SSI) and Predicate Locking
44===========================================================
55
6- This iscurrently sitting in the lmgr directory because about 90% of
7- the code is an implementation of predicate locking, which is required
8- for SSI, rather than being directly related to SSI itself. When
9- another use for predicate locking justifies the effort to tease these
10- two things apart, this README file should probably be split.
6+ Thiscode is in the lmgr directory because about 90% of it is an
7+ implementation of predicate locking, which is required for SSI,
8+ rather than being directly related to SSI itself. When another use
9+ for predicate locking justifies the effort to tease these two things
10+ apart, this README file should probably be split.
1111
1212
1313Credits
@@ -151,11 +151,11 @@ transactions.
151151SSI Algorithm
152152-------------
153153
154- Serializable transaction in PostgreSQL are implemented using
154+ As of 9.1, serializable transactions in PostgreSQL are implemented using
155155Serializable Snapshot Isolation (SSI), based on the work of Cahill
156156et al. Fundamentally, this allows snapshot isolation to run as it
157- has , while monitoring for conditions which could create a serialization
158- anomaly.
157+ previously did , while monitoring for conditions which could create a
158+ serialization anomaly.
159159
160160SSI is based on the observation [2] that each snapshot isolation
161161anomaly corresponds to a cycle that contains a "dangerous structure"
@@ -168,8 +168,10 @@ SSI works by watching for this dangerous structure, and rolling
168168back a transaction when needed to prevent any anomaly. This means it
169169only needs to track rw-conflicts between concurrent transactions, not
170170wr- and ww-dependencies. It also means there is a risk of false
171- positives, because not every dangerous structure corresponds to an
172- actual serialization failure.
171+ positives, because not every dangerous structure is embedded in an
172+ actual cycle. The number of false positives is low in practice, so
173+ this represents an acceptable tradeoff for keeping the detection
174+ overhead low.
173175
174176The PostgreSQL implementation uses two additional optimizations:
175177
@@ -182,11 +184,12 @@ The PostgreSQL implementation uses two additional optimizations:
182184 one. Proof:
183185
184186 - Because there is a cycle, there must be some transaction T0 that
185- precedes Tin in theserial order . (T0 might be the same as Tout).
187+ precedes Tin in thecycle . (T0 might be the same as Tout.)
186188
187- - Thedependency between T0 and Tin can't be a rw-conflict,
189+ - Theedge between T0 and Tin can't be a rw-conflict or ww-dependency ,
188190 because Tin was read-only, so it must be a wr-dependency.
189- Those can only occur if T0 committed before Tin started.
191+ Those can only occur if T0 committed before Tin took its snapshot,
192+ else Tin would have ignored T0's output.
190193
191194 - Because Tout must commit before any other transaction in the
192195 cycle, it must commit before T0 commits -- and thus before Tin
@@ -258,8 +261,8 @@ full serializable transactions under either strategy. Practical
258261implementations of predicate locking generally involve acquiring
259262locks against data as it is accessed, using multiple granularities
260263(tuple, page, table, etc.) with escalation as needed to keep the lock
261- count to a number which can be tracked within RAM structures, and
262- this was used in PostgreSQL. Coarse granularities can cause some
264+ count to a number which can be tracked within RAM structures. This
265+ approach was used in PostgreSQL. Coarse granularities can cause some
263266false positive indications of conflict. The number of false positives
264267can be influenced by plan choice.
265268
@@ -276,18 +279,18 @@ Hellerstein, Stonebraker and Hamilton paper [3], along with the
276279locking papers referenced from that and the Cahill papers.
277280
278281Because the SIREAD locks don't block, traditional locking techniques
279- were be modified. Intent locking (locking higher level objects
282+ have to be modified. Intent locking (locking higher level objects
280283before locking lower level objects) doesn't work with non-blocking
281284"locks" (which are, in some respects, more like flags than locks).
282285
283286A configurable amount of shared memory is reserved at postmaster
284287start-up to track predicate locks. This size cannot be changed
285288without a restart.
286289
287- * To prevent resource exhaustion, multiple fine-grained locks may
290+ To prevent resource exhaustion, multiple fine-grained locks may
288291be promoted to a single coarser-grained lock as needed.
289292
290- * An attempt to acquire an SIREAD lock on a tuple when the same
293+ An attempt to acquire an SIREAD lock on a tuple when the same
291294transaction already holds an SIREAD lock on the page or the relation
292295will be ignored. Likewise, an attempt to lock a page when the
293296relation is locked will be ignored, and the acquisition of a coarser
@@ -306,8 +309,8 @@ Predicate locks will be acquired for the heap based on the following:
306309will be locked, whether or not it meets selection criteria; except
307310that there is no need to acquire an SIREAD lock on a tuple when the
308311transaction already holds a write lock on any tuple representing the
309- row, since a rw-dependency would also create a ww-dependency which
310- has more aggressive enforcement andwill thus prevent any anomaly.
312+ row, since a rw-conflict would also create a ww-dependency which
313+ has more aggressive enforcement and thus will prevent any anomaly.
311314
312315 * Modifying a heap tuple creates a rw-conflict with any transaction
313316that holds a SIREAD lock on that tuple, or on the page or relation
@@ -341,13 +344,13 @@ need not generate a conflict, although an update which "moves" a row
341344into the scan must generate a conflict. While correctness allows
342345false positives, they should be minimized for performance reasons.
343346
344- Several optimizations are possible, though not all implemented yet:
347+ Several optimizations are possible, though not allare implemented yet:
345348
346349 * An index scan which is just finding the right position for an
347- index insertion or deletionneeds not acquire a predicate lock.
350+ index insertion or deletionneed not acquire a predicate lock.
348351
349352 * An index scan which is comparing for equality on the entire key
350- for a unique indexneeds not acquire a predicate lock as long as a key
353+ for a unique indexneed not acquire a predicate lock as long as a key
351354is found corresponding to a visible tuple which has not been modified
352355by another transaction -- there are no "between or around" gaps to
353356cover.
@@ -362,6 +365,9 @@ x = 1 AND x = 2), then no predicate lock is needed.
362365
363366Other index AM implementation considerations:
364367
368+ * For an index AM that doesn't have support for predicate locking,
369+ we just acquire a predicate lock on the whole index for any search.
370+
365371 * B-tree index searches acquire predicate locks only on the
366372index *leaf* pages needed to lock the appropriate index range. If,
367373however, a search discovers that no root page has yet been created, a
@@ -395,8 +401,8 @@ tracking SIREAD locks.
395401any length of time; lock information is written to the tuples
396402involved in the transactions.
397403 * In PostgreSQL, existing lock structures have pointers to
398- memory which is related to aconnection . SIREAD locks need to persist
399- past the end of the originating transaction and even theconnection
404+ memory which is related to asession . SIREAD locks need to persist
405+ past the end of the originating transaction and even thesession
400406which ran it.
401407 * PostgreSQL needs to be able to tolerate a large number of
402408transactions executing while one long-running transaction stays open
@@ -411,7 +417,8 @@ isolation level distinct from snapshot isolation.
411417in the papers.
412418
413419 5. PostgreSQL doesn't assign a transaction number to a database
414- transaction until and unless necessary.
420+ transaction until and unless necessary (normally, when the transaction
421+ attempts to modify data).
415422
416423 6. PostgreSQL has pluggable data types with user-definable
417424operators, as well as pluggable index types, not all of which are
@@ -453,42 +460,46 @@ versions of the row, based on the following proof that any additional
453460serialization failures we would get from that would be false
454461positives:
455462
456- o If transaction T1 reads a row (thus acquiring a predicate
457- lock on it) and a second transaction T2 updates that row, must a
458- third transaction T3 which updates the new version of the row have a
459- rw-conflict in from T1 to prevent anomalies? In other words, does it
460- matter whether this edge T1 -> T3 is there?
463+ o If transaction T1 reads a row version (thus acquiring a
464+ predicate lock on it) and a second transaction T2 updates that row
465+ version (thus creating a rw-conflict graph edge from T1 to T2), must a
466+ third transaction T3 which re-updates the new version of the row also
467+ have a rw-conflict in from T1 to prevent anomalies? In other words,
468+ does it matter whether we recognize the edge T1 -> T3?
461469
462470 o If T1 has a conflict in, it certainly doesn't. Adding the
463471edge T1 -> T3 would create a dangerous structure, but we already had
464- one from the edge T1 -> T2, so we would have aborted something
465- anyway.
472+ one from the edge T1 -> T2, so we would have aborted something anyway.
473+ (T2 has already committed, else T3 could not have updated its output;
474+ but we would have aborted either T1 or T1's predecessor(s). Hence
475+ no cycle involving T1 and T3 can survive.)
466476
467477 o Now let's consider the case where T1 doesn't have a
468- conflict in. If that's the case, for this edge T1 -> T3 to make a
469- difference, T3 must have a rw-conflict out that induces a cycle in
470- the dependency graph, i.e. a conflict out to some transaction
471- preceding T1 in theserial order . (A conflict out to T1 wouldwork
472- too, but that would mean T1 has a conflict in and wewould have
473- rolled back .)
478+ rw- conflict in. If that's the case, for this edge T1 -> T3 to make a
479+ difference, T3 must have a rw-conflict out that induces a cycle in the
480+ dependency graph, i.e. a conflict out to some transaction preceding T1
481+ in thegraph . (A conflict out to T1itself wouldbe problematic too,
482+ but that would mean T1 has a conflict in, the case wealready
483+ eliminated .)
474484
475485 o So now we're trying to figure out if there can be an
476486rw-conflict edge T3 -> T0, where T0 is some transaction that precedes
477- T1. For T0 to precede T1, there has to be has to be some edge, or
478- sequence of edges, from T0 to T1. At least the last edge has to be a
479- wr-dependency or ww-dependency rather than a rw-conflict, because T1
480- doesn't have a rw-conflict in. And that gives us enough information
481- about the order of transactions to see that T3 can't have a
482- rw-dependency to T0:
487+ T1. For T0 to precede T1, there has to be some edge, or sequence of
488+ edges, from T0 to T1. At least the last edge has to be a wr-dependency
489+ or ww-dependency rather than a rw-conflict, because T1 doesn't have a
490+ rw-conflict in. And that gives us enough information about the order
491+ of transactions to see that T3 can't have a rw-conflict to T0:
483492 - T0 committed before T1 started (the wr/ww-dependency implies this)
484493 - T1 started before T2 committed (the T1->T2 rw-conflict implies this)
485- - T2 committed before T3 started (otherwise, T3 wouldbe aborted
494+ - T2 committed before T3 started (otherwise, T3 wouldget aborted
486495 because of an update conflict)
487496
488497 o That means T0 committed before T3 started, and therefore
489498there can't be a rw-conflict from T3 to T0.
490499
491- o In both cases, we didn't need the T1 -> T3 edge.
500+ o So in all cases, we don't need the T1 -> T3 edge to
501+ recognize cycles. Therefore it's not necessary for T1's SIREAD lock
502+ on the original tuple version to cover later versions as well.
492503
493504 * Predicate locking in PostgreSQL starts at the tuple level
494505when possible. Multiple fine-grained locks are promoted to a single
@@ -520,17 +531,20 @@ NULL to indicate no conflict and a self-reference to indicate
520531multiple conflicts or conflicts with committed transactions, we use a
521532list of rw-conflicts. With the more complete information, false
522533positives are reduced and we have sufficient data for more aggressive
523- clean-up and other optimizations.
534+ clean-up and other optimizations:
535+
524536 o We can avoid ever rolling back a transaction until and
525537unless there is a pivot where a transaction on the conflict *out*
526538side of the pivot committed before either of the other transactions.
539+
527540 o We can avoid ever rolling back a transaction when the
528541transaction on the conflict *in* side of the pivot is explicitly or
529542implicitly READ ONLY unless the transaction on the conflict *out*
530543side of the pivot committed before the READ ONLY transaction acquired
531544its snapshot. (An implicit READ ONLY transaction is one which
532545committed without writing, even though it was not explicitly declared
533546to be READ ONLY.)
547+
534548 o We can more aggressively clean up conflicts, predicate
535549locks, and SSI transaction information.
536550
@@ -543,7 +557,7 @@ overlapping transaction dependencies.
543557until the conditions are right for it to start in the "opt out" state
544558described above. We add a DEFERRABLE state to transactions, which is
545559specified and maintained in a way similar to READ ONLY. It is
546- ignored for transactionswhich are not SERIALIZABLE and READ ONLY.
560+ ignored for transactionsthat are not SERIALIZABLE and READ ONLY.
547561
548562 * When a transaction must be rolled back, we pick among the
549563active transactions such that an immediate retry will not fail again
@@ -593,8 +607,8 @@ might never be touched, or should we keep adding returned items to
593607the end of the available list?
594608
595609
596- Footnotes
597- ---------
610+ References
611+ ----------
598612
599613[1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
600614Search for serial execution to find the relevant section.