Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commit78475b0

Browse files
committed
Update README-SSI. Add a section to describe the "dangerous structure" that
SSI is based on, as well as the optimizations about relative commit timesand read-only transactions. Plus a bunch of other misc fixes andimprovements.Dan Ports
1 parentf3008c3 commit78475b0

File tree

1 file changed

+117
-87
lines changed

1 file changed

+117
-87
lines changed

‎src/backend/storage/lmgr/README-SSI

Lines changed: 117 additions & 87 deletions
Original file line numberDiff line numberDiff line change
@@ -51,13 +51,13 @@ if a transaction can be shown to always do the right thing when it is
5151
run alone (before or after any other transaction), it will always do
5252
the right thing in any mix of concurrent serializable transactions.
5353
Where conflicts with other transactions would result in an
54-
inconsistent state within the database, or an inconsistent view of
54+
inconsistent state within the database or an inconsistent view of
5555
the data, a serializable transaction will block or roll back to
5656
prevent the anomaly. The SQL standard provides a specific SQLSTATE
5757
for errors generated when a transaction rolls back for this reason,
5858
so that transactions can be retried automatically.
5959

60-
Before version 9.1 PostgreSQL did not support a full serializable
60+
Before version 9.1, PostgreSQL did not support a full serializable
6161
isolation level. A request for serializable transaction isolation
6262
actually provided snapshot isolation. This has well known anomalies
6363
which can allow data corruption or inconsistent views of the data
@@ -77,7 +77,7 @@ Serializable Isolation Implementation Strategies
7777

7878
Techniques for implementing full serializable isolation have been
7979
published and in use in many database products for decades. The
80-
primary technique which has been used is Strict2Phase Locking
80+
primary technique which has been used is StrictTwo-Phase Locking
8181
(S2PL), which operates by blocking writes against data which has been
8282
read by concurrent transactions and blocking any access (read or
8383
write) against data which has been written by concurrent
@@ -112,54 +112,90 @@ visualize the difference between the serializable implementations
112112
described above, is to consider that among transactions executing at
113113
the serializable transaction isolation level, the results are
114114
required to be consistent with some serial (one-at-a-time) execution
115-
of the transactions[1]. How is that order determined in each?
115+
of the transactions[1]. How is that order determined in each?
116116

117-
S2PL locks rows used by the transaction in a way which blocks
118-
conflicting access, so that at the moment of a successful commit it
119-
is certain that no conflicting access has occurred. Some transactions
120-
may have blocked, essentially being partially serialized with the
121-
committing transaction, to allow this. Some transactions may have
122-
been rolled back, due to cycles in the blocking. But with S2PL,
123-
transactions can always be viewed as having occurred serially, in the
124-
order of successful commit.
117+
In S2PL, each transaction locks any data it accesses. It holds the
118+
locks until committing, preventing other transactions from making
119+
conflicting accesses to the same data in the interim. Some
120+
transactions may have to be rolled back to prevent deadlock. But
121+
successful transactions can always be viewed as having occurred
122+
sequentially, in the order they committed.
125123

126124
With snapshot isolation, reads never block writes, nor vice versa, so
127-
there is much less actual serialization. The order in which
128-
transactions appear to have executed is determined by something more
129-
subtle than in S2PL: read/write dependencies. If a transaction
130-
attempts to read data which is not visible to it because the
131-
transaction which wrote it (or will later write it) is concurrent
132-
(one of them was running when the other acquired its snapshot), then
133-
the reading transaction appears to have executed first, regardless of
134-
the actual sequence of transaction starts or commits (since it sees a
135-
database state prior to that in which the other transaction leaves
136-
it). If one transaction has both rw-dependencies in (meaning that a
137-
concurrent transaction attempts to read data it writes) and out
138-
(meaning it attempts to read data a concurrent transaction writes),
139-
and a couple other conditions are met, there can appear to be a cycle
140-
in execution order of the transactions. This is when the anomalies
141-
occur.
142-
143-
SSI works by watching for the conditions mentioned above, and rolling
144-
back a transaction when needed to prevent any anomaly. The apparent
145-
order of execution will always be consistent with any actual
146-
serialization (i.e., a transaction which run by itself can always be
147-
considered to have run after any transactions committed before it
148-
started and before any transacton which starts after it commits); but
149-
among concurrent transactions it will appear that the transaction on
150-
the read side of a rw-dependency executed before the transaction on
151-
the write side.
125+
more concurrency is possible. The order in which transactions appear
126+
to have executed is determined by something more subtle than in S2PL:
127+
read/write dependencies. If a transaction reads data, it appears to
128+
execute after the transaction that wrote the data it is reading.
129+
Similarly, if it updates data, it appears to execute after the
130+
transaction that wrote the previous version. These dependencies, which
131+
we call "wr-dependencies" and "ww-dependencies", are consistent with
132+
the commit order, because the first transaction must have committed
133+
before the second starts. However, there can also be dependencies
134+
between two *concurrent* transactions, i.e. where one was running when
135+
the other acquired its snapshot. These "rw-conflicts" occur when one
136+
transaction attempts to read data which is not visible to it because
137+
the transaction which wrote it (or will later write it) is
138+
concurrent. The reading transaction appears to have executed first,
139+
regardless of the actual sequence of transaction starts or commits,
140+
because it sees a database state prior to that in which the other
141+
transaction leaves it.
142+
143+
Anomalies occur when a cycle is created in the graph of dependencies:
144+
when a dependency or series of dependencies causes transaction A to
145+
appear to have executed before transaction B, but another series of
146+
dependencies causes B to appear before A. If that's the case, then
147+
the results can't be consistent with any serial execution of the
148+
transactions.
149+
150+
151+
SSI Algorithm
152+
-------------
153+
154+
Serializable transaction in PostgreSQL are implemented using
155+
Serializable Snapshot Isolation (SSI), based on the work of Cahill
156+
et al. Fundamentally, this allows snapshot isolation to run as it
157+
has, while monitoring for conditions which could create a serialization
158+
anomaly.
159+
160+
SSI is based on the observation [2] that each snapshot isolation
161+
anomaly corresponds to a cycle that contains a "dangerous structure"
162+
of two adjacent rw-conflict edges:
163+
164+
Tin ------> Tpivot ------> Tout
165+
rw rw
166+
167+
SSI works by watching for this dangerous structure, and rolling
168+
back a transaction when needed to prevent any anomaly. This means it
169+
only needs to track rw-conflicts between concurrent transactions, not
170+
wr- and ww-dependencies. It also means there is a risk of false
171+
positives, because not every dangerous structure corresponds to an
172+
actual serialization failure.
173+
174+
The PostgreSQL implementation uses two additional optimizations:
175+
176+
* Tout must commit before any other transaction in the cycle
177+
(see proof of Theorem 2.1 of [2]). We only roll back a transaction
178+
if Tout commits before Tpivot and Tin.
179+
180+
* if Tin is read-only, there can only be an anomaly if Tout committed
181+
before Tin takes its snapshot. This optimization is an original
182+
one. Proof:
183+
184+
- Because there is a cycle, there must be some transaction T0 that
185+
precedes Tin in the serial order. (T0 might be the same as Tout).
186+
187+
- The dependency between T0 and Tin can't be a rw-conflict,
188+
because Tin was read-only, so it must be a wr-dependency.
189+
Those can only occur if T0 committed before Tin started.
190+
191+
- Because Tout must commit before any other transaction in the
192+
cycle, it must commit before T0 commits -- and thus before Tin
193+
starts.
152194

153195

154196
PostgreSQL Implementation
155197
-------------------------
156198

157-
The implementation of serializable transactions for PostgreSQL is
158-
accomplished through Serializable Snapshot Isolation (SSI), based on
159-
the work of Cahill, et al. Fundamentally, this allows snapshot
160-
isolation to run as it has, while monitoring for conditions which
161-
could create a serialization anomaly.
162-
163199
* Since this technique is based on Snapshot Isolation (SI), those
164200
areas in PostgreSQL which don't use SI can't be brought under SSI.
165201
This includes system tables, temporary tables, sequences, hint bit
@@ -180,7 +216,7 @@ lock or to use SELECT FOR SHARE or SELECT FOR UPDATE.
180216
* Those who want to continue to use snapshot isolation without
181217
the additional protections of SSI (and the associated costs of
182218
enforcing those protections), can use the REPEATABLE READ transaction
183-
isolation level. This levelwill retain its legacy behavior, which
219+
isolation level. This levelretains its legacy behavior, which
184220
is identical to the old SERIALIZABLE implementation and fully
185221
consistent with the standard's requirements for the REPEATABLE READ
186222
transaction isolation level.
@@ -236,7 +272,7 @@ in PostgreSQL, but tailored to the needs of SIREAD predicate locking,
236272
are used. These refer to physical objects actually accessed in the
237273
course of executing the query, to model the predicates through
238274
inference. Anyone interested in this subject should review the
239-
Hellerstein, Stonebraker and Hamilton paper[2], along with the
275+
Hellerstein, Stonebraker and Hamilton paper [3], along with the
240276
locking papers referenced from that and the Cahill papers.
241277

242278
Because the SIREAD locks don't block, traditional locking techniques
@@ -273,6 +309,15 @@ transaction already holds a write lock on any tuple representing the
273309
row, since a rw-dependency would also create a ww-dependency which
274310
has more aggressive enforcement and will thus prevent any anomaly.
275311

312+
* Modifying a heap tuple creates a rw-conflict with any transaction
313+
that holds a SIREAD lock on that tuple, or on the page or relation
314+
that contains it.
315+
316+
* Inserting a new tuple creates a rw-conflict with any transaction
317+
holding a SIREAD lock on the entire relation. It doesn't conflict with
318+
page-level locks, because page-level locks are only used to aggregate
319+
tuple locks. Unlike index page locks, they don't lock "gaps" on the page.
320+
276321

277322
Index AM implementations
278323
------------------------
@@ -296,13 +341,13 @@ need not generate a conflict, although an update which "moves" a row
296341
into the scan must generate a conflict. While correctness allows
297342
false positives, they should be minimized for performance reasons.
298343

299-
Several optimizations are possible:
344+
Several optimizations are possible, though not all implemented yet:
300345

301346
* An index scan which is just finding the right position for an
302-
index insertion or deletionneed not acquire a predicate lock.
347+
index insertion or deletionneeds not acquire a predicate lock.
303348

304349
* An index scan which is comparing for equality on the entire key
305-
for a unique indexneed not acquire a predicate lock as long as a key
350+
for a unique indexneeds not acquire a predicate lock as long as a key
306351
is found corresponding to a visible tuple which has not been modified
307352
by another transaction -- there are no "between or around" gaps to
308353
cover.
@@ -317,10 +362,10 @@ x = 1 AND x = 2), then no predicate lock is needed.
317362

318363
Other index AM implementation considerations:
319364

320-
*If a btree search discovers that no root page has yet been
321-
created, a predicate lock onthe indexrelation is required;
322-
otherwise btree searches must get to the leaf level to determine
323-
which tuples match, so predicate locks go there.
365+
*B-tree index searches acquire predicate locks only on the
366+
index *leaf* pages needed to locktheappropriateindexrange. If,
367+
however, a search discovers that no root page has yet been created, a
368+
predicate lock on the index relation is required.
324369

325370
* GiST searches can determine that there are no matches at any
326371
level of the index, so there must be a predicate lock at each index
@@ -346,11 +391,6 @@ to be added from scratch.
346391

347392
2. The existing in-memory lock structures were not suitable for
348393
tracking SIREAD locks.
349-
* The database products used for the prototype
350-
implementations for the papers used update-in-place with a rollback
351-
log for their MVCC implementations, while PostgreSQL leaves the old
352-
version of a row in place and adds a new tuple to represent the row
353-
at a new location.
354394
* In PostgreSQL, tuple level locks are not held in RAM for
355395
any length of time; lock information is written to the tuples
356396
involved in the transactions.
@@ -450,18 +490,19 @@ there can't be a rw-conflict from T3 to T0.
450490

451491
o In both cases, we didn't need the T1 -> T3 edge.
452492

453-
* Predicate locking in PostgreSQLwill start at the tuple level
454-
when possible, with automatic conversion of multiple fine-grained
455-
locks tocoarsergranularity asneed to avoid resource exhaustion.
456-
Theamount of memory used for these structureswill beconfigurable,
457-
to balanceRAM usage against SIREAD lock granularity.
493+
* Predicate locking in PostgreSQLstarts at the tuple level
494+
when possible. Multiple fine-grained locks are promoted to a single
495+
coarser-granularitylockasneeded to avoid resource exhaustion. The
496+
amount of memory used for these structuresisconfigurable, to balance
497+
RAM usage against SIREAD lock granularity.
458498

459-
* A process-local copy of locks held by a process and the coarser
460-
covering locks with counts, are kept to support granularity promotion
461-
decisions with low CPU and locking overhead.
499+
* Each backend keeps a process-local table of the locks it holds.
500+
To support granularity promotion decisions with low CPU and locking
501+
overhead, this table also includes the coarser covering locks and the
502+
number of finer-granularity locks they cover.
462503

463-
* Conflictswill be identified by looking for predicate locks
464-
when tuples are written and looking at the MVCC information when
504+
* Conflictsare identified by looking for predicate locks
505+
when tuples are written, and by looking at the MVCC information when
465506
tuples are read. There is no matching between two RAM-based locks.
466507

467508
* Because write locks are stored in the heap tuples rather than a
@@ -493,12 +534,12 @@ to be READ ONLY.)
493534
o We can more aggressively clean up conflicts, predicate
494535
locks, and SSI transaction information.
495536

496-
*Allow a READ ONLY transaction to "opt out" of SSI if there are
537+
*We allow a READ ONLY transaction to "opt out" of SSI if there are
497538
no READ WRITE transactions which could cause the READ ONLY
498539
transaction to ever become part of a "dangerous structure" of
499540
overlapping transaction dependencies.
500541

501-
*Allow the user to request that a READ ONLY transaction wait
542+
*We allow the user to request that a READ ONLY transaction wait
502543
until the conditions are right for it to start in the "opt out" state
503544
described above. We add a DEFERRABLE state to transactions, which is
504545
specified and maintained in a way similar to READ ONLY. It is
@@ -538,28 +579,13 @@ address it?
538579
replication solutions, like Postgres-R, Slony, pgpool, HS/SR, etc.
539580
This is related to the "WAL file replay" issue.
540581

541-
* Weak-memory-ordering machines. Make sure that shared memory
542-
access which involves visibility across multiple transactions uses
543-
locks as needed to avoid problems. On the other hand, ensure that we
544-
really need volatile where we're using it.
545-
http://archives.postgresql.org/pgsql-committers/2008-06/msg00228.php
546-
547582
* UNIQUE btree search for equality on all columns. Since a search
548583
of a UNIQUE index using equality tests on all columns will lock the
549584
heap tuple if an entry is found, it appears that there is no need to
550585
get a predicate lock on the index in that case. A predicate lock is
551586
still needed for such a search if a matching index entry which points
552587
to a visible tuple is not found.
553588

554-
* Planner index probes. To avoid problems with data skew at the
555-
ends of an index which have historically caused bad plans, the
556-
planner now probes the end of an index to see what the maximum or
557-
minimum value is when a query appears to be requesting a range of
558-
data outside what statistics shows is present. These planner checks
559-
don't require predicate locking, but there's currently no easy way to
560-
avoid it. What can we do to avoid predicate locking for such planner
561-
activity?
562-
563589
* Minimize touching of shared memory. Should lists in shared
564590
memory push entries which have just been returned to the front of the
565591
available list, so they will be popped back off soon and some memory
@@ -573,13 +599,17 @@ Footnotes
573599
[1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
574600
Search for serial execution to find the relevant section.
575601

576-
[2] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
577-
Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007.
602+
[2] A. Fekete et al. Making Snapshot Isolation Serializable. In ACM
603+
Transactions on Database Systems 30:2, Jun. 2005.
604+
http://dx.doi.org/10.1145/1071610.1071615
605+
606+
[3] Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007.
578607
Architecture of a Database System. Foundations and Trends(R) in
579608
Databases Vol. 1, No. 2 (2007) 141-259.
609+
http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
580610
Of particular interest:
581611
* 6.1 A Note on ACID
582612
* 6.2 A Brief Review of Serializability
583613
* 6.3 Locking and Latching
584614
* 6.3.1 Transaction Isolation Levels
585-
* 6.5.3 Next-Key Locking: Physical Surrogates for Logical
615+
* 6.5.3 Next-Key Locking: Physical Surrogates for Logical Properties

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp