@@ -51,13 +51,13 @@ if a transaction can be shown to always do the right thing when it is
run alone (before or after any other transaction), it will always do
the right thing in any mix of concurrent serializable transactions.
Where conflicts with other transactions would result in an
- inconsistent state within the database, or an inconsistent view of
+ inconsistent state within the database or an inconsistent view of
the data, a serializable transaction will block or roll back to
prevent the anomaly. The SQL standard provides a specific SQLSTATE
for errors generated when a transaction rolls back for this reason,
so that transactions can be retried automatically.

- Before version 9.1 PostgreSQL did not support a full serializable
+ Before version 9.1, PostgreSQL did not support a full serializable
isolation level. A request for serializable transaction isolation
actually provided snapshot isolation. This has well known anomalies
which can allow data corruption or inconsistent views of the data
@@ -77,7 +77,7 @@ Serializable Isolation Implementation Strategies

Techniques for implementing full serializable isolation have been
published and in use in many database products for decades. The
- primary technique which has been used is Strict 2 Phase Locking
+ primary technique which has been used is Strict Two-Phase Locking
(S2PL), which operates by blocking writes against data which has been
read by concurrent transactions and blocking any access (read or
write) against data which has been written by concurrent
@@ -112,54 +112,90 @@ visualize the difference between the serializable implementations
described above, is to consider that among transactions executing at
the serializable transaction isolation level, the results are
required to be consistent with some serial (one-at-a-time) execution
- of the transactions[1]. How is that order determined in each?
+ of the transactions [1]. How is that order determined in each?

- S2PL locks rows used by the transaction in a way which blocks
- conflicting access, so that at the moment of a successful commit it
- is certain that no conflicting access has occurred. Some transactions
- may have blocked, essentially being partially serialized with the
- committing transaction, to allow this. Some transactions may have
- been rolled back, due to cycles in the blocking. But with S2PL,
- transactions can always be viewed as having occurred serially, in the
- order of successful commit.
+ In S2PL, each transaction locks any data it accesses. It holds the
+ locks until committing, preventing other transactions from making
+ conflicting accesses to the same data in the interim. Some
+ transactions may have to be rolled back to prevent deadlock. But
+ successful transactions can always be viewed as having occurred
+ sequentially, in the order they committed.

With snapshot isolation, reads never block writes, nor vice versa, so
- there is much less actual serialization. The order in which
- transactions appear to have executed is determined by something more
- subtle than in S2PL: read/write dependencies. If a transaction
- attempts to read data which is not visible to it because the
- transaction which wrote it (or will later write it) is concurrent
- (one of them was running when the other acquired its snapshot), then
- the reading transaction appears to have executed first, regardless of
- the actual sequence of transaction starts or commits (since it sees a
- database state prior to that in which the other transaction leaves
- it). If one transaction has both rw-dependencies in (meaning that a
- concurrent transaction attempts to read data it writes) and out
- (meaning it attempts to read data a concurrent transaction writes),
- and a couple other conditions are met, there can appear to be a cycle
- in execution order of the transactions. This is when the anomalies
- occur.
-
- SSI works by watching for the conditions mentioned above, and rolling
- back a transaction when needed to prevent any anomaly. The apparent
- order of execution will always be consistent with any actual
- serialization (i.e., a transaction which run by itself can always be
- considered to have run after any transactions committed before it
- started and before any transacton which starts after it commits); but
- among concurrent transactions it will appear that the transaction on
- the read side of a rw-dependency executed before the transaction on
- the write side.
+ more concurrency is possible. The order in which transactions appear
+ to have executed is determined by something more subtle than in S2PL:
+ read/write dependencies. If a transaction reads data, it appears to
+ execute after the transaction that wrote the data it is reading.
+ Similarly, if it updates data, it appears to execute after the
+ transaction that wrote the previous version. These dependencies, which
+ we call "wr-dependencies" and "ww-dependencies", are consistent with
+ the commit order, because the first transaction must have committed
+ before the second starts. However, there can also be dependencies
+ between two *concurrent* transactions, i.e. where one was running when
+ the other acquired its snapshot. These "rw-conflicts" occur when one
+ transaction attempts to read data which is not visible to it because
+ the transaction which wrote it (or will later write it) is
+ concurrent. The reading transaction appears to have executed first,
+ regardless of the actual sequence of transaction starts or commits,
+ because it sees a database state prior to that in which the other
+ transaction leaves it.
+
+ Anomalies occur when a cycle is created in the graph of dependencies:
+ when a dependency or series of dependencies causes transaction A to
+ appear to have executed before transaction B, but another series of
+ dependencies causes B to appear before A. If that's the case, then
+ the results can't be consistent with any serial execution of the
+ transactions.
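+
+ As an illustration, this can be checked mechanically: the results are
+ serializable exactly when the graph of apparent-order dependencies is
+ free of cycles. The following standalone C sketch (the transaction
+ numbers and the edge matrix are invented for the example; this is not
+ PostgreSQL code) reports the classic write-skew case as
+ non-serializable:
+
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     #define MAX_XACT 8
+
+     /* edge[a][b] means transaction a appears to execute before b */
+     static bool edge[MAX_XACT][MAX_XACT];
+
+     static bool
+     reachable(int from, int to, bool *visited)
+     {
+         if (from == to)
+             return true;
+         visited[from] = true;
+         for (int next = 0; next < MAX_XACT; next++)
+             if (edge[from][next] && !visited[next] &&
+                 reachable(next, to, visited))
+                 return true;
+         return false;
+     }
+
+     /* A cycle exists if some edge a -> b is closed by a path b -> a. */
+     static bool
+     has_cycle(void)
+     {
+         for (int a = 0; a < MAX_XACT; a++)
+             for (int b = 0; b < MAX_XACT; b++)
+                 if (edge[a][b])
+                 {
+                     bool visited[MAX_XACT] = {false};
+
+                     if (reachable(b, a, visited))
+                         return true;
+                 }
+         return false;
+     }
+
+     int
+     main(void)
+     {
+         /* write skew: each transaction reads data the other writes */
+         edge[0][1] = true;      /* rw-conflict: T0 appears to run first */
+         edge[1][0] = true;      /* rw-conflict: T1 appears to run first */
+         printf("serializable: %s\n", has_cycle() ? "no" : "yes");
+         return 0;
+     }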
+
+
+ SSI Algorithm
+ -------------
+
+ Serializable transactions in PostgreSQL are implemented using
+ Serializable Snapshot Isolation (SSI), based on the work of Cahill
+ et al. Fundamentally, this allows snapshot isolation to run as it
+ has, while monitoring for conditions which could create a serialization
+ anomaly.
+
+ SSI is based on the observation [2] that each snapshot isolation
+ anomaly corresponds to a cycle that contains a "dangerous structure"
+ of two adjacent rw-conflict edges:
+
+ Tin ------> Tpivot ------> Tout
+       rw             rw
+
+ SSI works by watching for this dangerous structure, and rolling
+ back a transaction when needed to prevent any anomaly. This means it
+ only needs to track rw-conflicts between concurrent transactions, not
+ wr- and ww-dependencies. It also means there is a risk of false
+ positives, because not every dangerous structure corresponds to an
+ actual serialization failure.
+
+ The PostgreSQL implementation uses two additional optimizations:
+
+ * Tout must commit before any other transaction in the cycle
+ (see proof of Theorem 2.1 of [2]). We only roll back a transaction
+ if Tout commits before Tpivot and Tin.
+
+ * If Tin is read-only, there can only be an anomaly if Tout committed
+ before Tin takes its snapshot. This optimization is an original
+ one. Proof:
+
+ - Because there is a cycle, there must be some transaction T0 that
+ precedes Tin in the serial order. (T0 might be the same as Tout).
+
+ - The dependency between T0 and Tin can't be a rw-conflict,
+ because Tin was read-only, so it must be a wr-dependency.
+ Those can only occur if T0 committed before Tin started.
+
+ - Because Tout must commit before any other transaction in the
+ cycle, it must commit before T0 commits -- and thus before Tin
+ starts.
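+
+ For illustration only, the commit-ordering tests above can be sketched
+ in standalone C roughly as follows; the XactInfo struct, the sequence
+ numbers, and the function name are invented for this sketch and are
+ not the data structures PostgreSQL actually uses:
+
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     typedef struct
+     {
+         bool    read_only;
+         long    snapshot_seq;       /* when the snapshot was taken */
+         bool    committed;
+         long    commit_seq;         /* meaningful only if committed */
+     } XactInfo;
+
+     /*
+      * Called once rw-conflicts Tin -> Tpivot -> Tout are both present.
+      * Returns true if a transaction must be rolled back to prevent a
+      * possible serialization anomaly.
+      */
+     static bool
+     dangerous_structure_triggers(const XactInfo *t_in,
+                                  const XactInfo *t_pivot,
+                                  const XactInfo *t_out)
+     {
+         /* Tout must commit first in any actual cycle. */
+         if (!t_out->committed)
+             return false;
+         if (t_pivot->committed && t_pivot->commit_seq < t_out->commit_seq)
+             return false;
+         if (t_in->committed && t_in->commit_seq < t_out->commit_seq)
+             return false;
+
+         /*
+          * If Tin is read-only, an anomaly is possible only when Tout
+          * committed before Tin took its snapshot.
+          */
+         if (t_in->read_only && t_out->commit_seq >= t_in->snapshot_seq)
+             return false;
+
+         return true;
+     }
+
+     int
+     main(void)
+     {
+         XactInfo t_in = {true, 100, false, 0};  /* read-only, snapshot at 100 */
+         XactInfo t_pivot = {false, 90, false, 0};
+         XactInfo t_out = {false, 80, true, 95}; /* committed before Tin's snapshot */
+
+         printf("roll back: %d\n",
+                dangerous_structure_triggers(&t_in, &t_pivot, &t_out));
+         return 0;
+     }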


PostgreSQL Implementation
-------------------------

- The implementation of serializable transactions for PostgreSQL is
- accomplished through Serializable Snapshot Isolation (SSI), based on
- the work of Cahill, et al. Fundamentally, this allows snapshot
- isolation to run as it has, while monitoring for conditions which
- could create a serialization anomaly.
-
* Since this technique is based on Snapshot Isolation (SI), those
areas in PostgreSQL which don't use SI can't be brought under SSI.
This includes system tables, temporary tables, sequences, hint bit
@@ -180,7 +216,7 @@ lock or to use SELECT FOR SHARE or SELECT FOR UPDATE.
* Those who want to continue to use snapshot isolation without
the additional protections of SSI (and the associated costs of
enforcing those protections), can use the REPEATABLE READ transaction
- isolation level. This level will retain its legacy behavior, which
+ isolation level. This level retains its legacy behavior, which
is identical to the old SERIALIZABLE implementation and fully
consistent with the standard's requirements for the REPEATABLE READ
transaction isolation level.
@@ -236,7 +272,7 @@ in PostgreSQL, but tailored to the needs of SIREAD predicate locking,
are used. These refer to physical objects actually accessed in the
course of executing the query, to model the predicates through
inference. Anyone interested in this subject should review the
- Hellerstein, Stonebraker and Hamilton paper [2], along with the
+ Hellerstein, Stonebraker and Hamilton paper [3], along with the
locking papers referenced from that and the Cahill papers.

Because the SIREAD locks don't block, traditional locking techniques
@@ -273,6 +309,15 @@ transaction already holds a write lock on any tuple representing the
row, since a rw-dependency would also create a ww-dependency which
has more aggressive enforcement and will thus prevent any anomaly.

+ * Modifying a heap tuple creates a rw-conflict with any transaction
+ that holds a SIREAD lock on that tuple, or on the page or relation
+ that contains it.
+
+ * Inserting a new tuple creates a rw-conflict with any transaction
+ holding a SIREAD lock on the entire relation. It doesn't conflict with
+ page-level locks, because page-level locks are only used to aggregate
+ tuple locks. Unlike index page locks, they don't lock "gaps" on the page.
+
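+ These granularity rules can be sketched in standalone C as below; the
+ SireadLock struct and the relation/page/tuple numbering are invented
+ for the sketch, not the shared-memory structures PostgreSQL uses:
+
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     typedef enum { LOCK_TUPLE, LOCK_PAGE, LOCK_RELATION } LockGranularity;
+
+     typedef struct
+     {
+         int             owner_xact;
+         LockGranularity granularity;
+         int             relation;
+         int             page;       /* unused for relation locks */
+         int             tuple;      /* unused for page/relation locks */
+     } SireadLock;
+
+     /* Does writer_xact modifying (rel, page, tuple) conflict with this lock? */
+     static bool
+     update_conflicts(const SireadLock *l, int writer_xact,
+                      int rel, int page, int tuple)
+     {
+         if (l->owner_xact == writer_xact || l->relation != rel)
+             return false;
+         switch (l->granularity)
+         {
+             case LOCK_RELATION:
+                 return true;
+             case LOCK_PAGE:
+                 return l->page == page;
+             case LOCK_TUPLE:
+                 return l->page == page && l->tuple == tuple;
+         }
+         return false;
+     }
+
+     /* Inserts only conflict with relation-level SIREAD locks. */
+     static bool
+     insert_conflicts(const SireadLock *l, int writer_xact, int rel)
+     {
+         return l->owner_xact != writer_xact &&
+                l->relation == rel &&
+                l->granularity == LOCK_RELATION;
+     }
+
+     int
+     main(void)
+     {
+         SireadLock page_lock = {1, LOCK_PAGE, 1000, 5, 0};
+
+         printf("update on page 5 conflicts: %d\n",
+                update_conflicts(&page_lock, 2, 1000, 5, 7));
+         printf("insert into relation conflicts: %d\n",
+                insert_conflicts(&page_lock, 2, 1000));
+         return 0;
+     }
+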

Index AM implementations
------------------------
@@ -296,13 +341,13 @@ need not generate a conflict, although an update which "moves" a row
into the scan must generate a conflict. While correctness allows
false positives, they should be minimized for performance reasons.

- Several optimizations are possible:
+ Several optimizations are possible, though not all implemented yet:

* An index scan which is just finding the right position for an
- index insertion or deletion need not acquire a predicate lock.
+ index insertion or deletion needs not acquire a predicate lock.

* An index scan which is comparing for equality on the entire key
- for a unique index need not acquire a predicate lock as long as a key
+ for a unique index needs not acquire a predicate lock as long as a key
is found corresponding to a visible tuple which has not been modified
by another transaction -- there are no "between or around" gaps to
cover.
@@ -317,10 +362,10 @@ x = 1 AND x = 2), then no predicate lock is needed.

Other index AM implementation considerations:

- * If a btree search discovers that no root page has yet been
- created, a predicate lock on the index relation is required;
- otherwise btree searches must get to the leaf level to determine
- which tuples match, so predicate locks go there.
+ * B-tree index searches acquire predicate locks only on the
+ index *leaf* pages needed to lock the appropriate index range. If,
+ however, a search discovers that no root page has yet been created, a
+ predicate lock on the index relation is required.

* GiST searches can determine that there are no matches at any
level of the index, so there must be a predicate lock at each index
@@ -346,11 +391,6 @@ to be added from scratch.

2. The existing in-memory lock structures were not suitable for
tracking SIREAD locks.
- * The database products used for the prototype
- implementations for the papers used update-in-place with a rollback
- log for their MVCC implementations, while PostgreSQL leaves the old
- version of a row in place and adds a new tuple to represent the row
- at a new location.
* In PostgreSQL, tuple level locks are not held in RAM for
any length of time; lock information is written to the tuples
involved in the transactions.
@@ -450,18 +490,19 @@ there can't be a rw-conflict from T3 to T0.

o In both cases, we didn't need the T1 -> T3 edge.

- * Predicate locking in PostgreSQL will start at the tuple level
- when possible, with automatic conversion of multiple fine-grained
- locks to coarser granularity as need to avoid resource exhaustion.
- The amount of memory used for these structures will be configurable,
- to balance RAM usage against SIREAD lock granularity.
+ * Predicate locking in PostgreSQL starts at the tuple level
+ when possible. Multiple fine-grained locks are promoted to a single
+ coarser-granularity lock as needed to avoid resource exhaustion. The
+ amount of memory used for these structures is configurable, to balance
+ RAM usage against SIREAD lock granularity.

- * A process-local copy of locks held by a process and the coarser
- covering locks with counts, are kept to support granularity promotion
- decisions with low CPU and locking overhead.
+ * Each backend keeps a process-local table of the locks it holds.
+ To support granularity promotion decisions with low CPU and locking
+ overhead, this table also includes the coarser covering locks and the
+ number of finer-granularity locks they cover.
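+
+ As a rough sketch of that bookkeeping (standalone C; the
+ LocalPageCover struct, the fixed threshold, and the promotion rule are
+ invented for the example -- the real decision is driven by the
+ configurable memory limit described above):
+
+     #include <stdio.h>
+
+     #define PROMOTE_THRESHOLD 3     /* made-up value for the sketch */
+
+     typedef struct
+     {
+         int     relation;
+         int     page;
+         int     tuple_locks_held;   /* finer locks covered by this page */
+         int     page_lock_held;     /* nonzero once promoted */
+     } LocalPageCover;
+
+     /*
+      * Note one more tuple-level SIREAD lock under this page cover and
+      * decide whether to replace the fine-grained locks with a single
+      * page-level lock.
+      */
+     static void
+     remember_tuple_lock(LocalPageCover *cover)
+     {
+         cover->tuple_locks_held++;
+         if (!cover->page_lock_held &&
+             cover->tuple_locks_held >= PROMOTE_THRESHOLD)
+         {
+             cover->page_lock_held = 1;
+             cover->tuple_locks_held = 0;
+             printf("promoted rel %d page %d to a page-level SIREAD lock\n",
+                    cover->relation, cover->page);
+         }
+     }
+
+     int
+     main(void)
+     {
+         LocalPageCover cover = {1000, 7, 0, 0};
+
+         for (int i = 0; i < 4; i++)
+             remember_tuple_lock(&cover);
+         return 0;
+     }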

- * Conflicts will be identified by looking for predicate locks
- when tuples are written and looking at the MVCC information when
+ * Conflicts are identified by looking for predicate locks
+ when tuples are written, and by looking at the MVCC information when
tuples are read. There is no matching between two RAM-based locks.

* Because write locks are stored in the heap tuples rather than a
@@ -493,12 +534,12 @@ to be READ ONLY.)
o We can more aggressively clean up conflicts, predicate
locks, and SSI transaction information.

- * Allow a READ ONLY transaction to "opt out" of SSI if there are
+ * We allow a READ ONLY transaction to "opt out" of SSI if there are
no READ WRITE transactions which could cause the READ ONLY
transaction to ever become part of a "dangerous structure" of
overlapping transaction dependencies.

- * Allow the user to request that a READ ONLY transaction wait
+ * We allow the user to request that a READ ONLY transaction wait
until the conditions are right for it to start in the "opt out" state
described above. We add a DEFERRABLE state to transactions, which is
specified and maintained in a way similar to READ ONLY. It is
@@ -538,28 +579,13 @@ address it?
replication solutions, like Postgres-R, Slony, pgpool, HS/SR, etc.
This is related to the "WAL file replay" issue.

- * Weak-memory-ordering machines. Make sure that shared memory
- access which involves visibility across multiple transactions uses
- locks as needed to avoid problems. On the other hand, ensure that we
- really need volatile where we're using it.
- http://archives.postgresql.org/pgsql-committers/2008-06/msg00228.php
-
* UNIQUE btree search for equality on all columns. Since a search
of a UNIQUE index using equality tests on all columns will lock the
heap tuple if an entry is found, it appears that there is no need to
get a predicate lock on the index in that case. A predicate lock is
still needed for such a search if a matching index entry which points
to a visible tuple is not found.

- * Planner index probes. To avoid problems with data skew at the
- ends of an index which have historically caused bad plans, the
- planner now probes the end of an index to see what the maximum or
- minimum value is when a query appears to be requesting a range of
- data outside what statistics shows is present. These planner checks
- don't require predicate locking, but there's currently no easy way to
- avoid it. What can we do to avoid predicate locking for such planner
- activity?
-
* Minimize touching of shared memory. Should lists in shared
memory push entries which have just been returned to the front of the
available list, so they will be popped back off soon and some memory
@@ -573,13 +599,17 @@ Footnotes
[1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
Search for serial execution to find the relevant section.

- [2] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
- Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007.
+ [2] A. Fekete et al. Making Snapshot Isolation Serializable. In ACM
+ Transactions on Database Systems 30:2, Jun. 2005.
+ http://dx.doi.org/10.1145/1071610.1071615
+
+ [3] Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007.
Architecture of a Database System. Foundations and Trends(R) in
Databases Vol. 1, No. 2 (2007) 141-259.
+ http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
Of particular interest:
* 6.1 A Note on ACID
* 6.2 A Brief Review of Serializability
* 6.3 Locking and Latching
* 6.3.1 Transaction Isolation Levels
- * 6.5.3 Next-Key Locking: Physical Surrogates for Logical
+ * 6.5.3 Next-Key Locking: Physical Surrogates for Logical Properties