@@ -919,3 +919,137 @@ Http://www.rhyme.com.au | / \|
919919PGP key available upon request, | /
920920and from pgp5.ai.mit.edu:11371 |/
921921
922+ From pgsql-hackers-owner+M3501@postgresql.org Sat Jan 20 03:42:19 2001
923+ Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
924+ by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA12652
925+ for <pgman@candle.pha.pa.us>; Sat, 20 Jan 2001 03:42:18 -0500 (EST)
926+ Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
927+ by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0K8ZG020426;
928+ Sat, 20 Jan 2001 03:35:16 -0500 (EST)
929+ (envelope-from pgsql-hackers-owner+M3501@postgresql.org)
930+ Received: from store.z.zembu.com (nat.zembu.com [209.128.96.253])
931+ by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f0K8TU016385
932+ for <pgsql-hackers@postgresql.org>; Sat, 20 Jan 2001 03:29:30 -0500 (EST)
933+ (envelope-from ncm@zembu.com)
934+ Received: by store.z.zembu.com (Postfix, from userid 509)
935+ id B33D9A782; Sat, 20 Jan 2001 00:29:24 -0800 (PST)
936+ Date: Sat, 20 Jan 2001 00:29:24 -0800
937+ To: pgsql-hackers@postgresql.org
938+ Subject: Re: [HACKERS] Transaction ID wraparound: problem and proposed solution
939+ Message-ID: <20010120002924.A2797@store.zembu.com>
940+ Reply-To: pgsql-hackers@postgresql.org
941+ References: <8382.973291660@sss.pgh.pa.us> <200101200500.AAA05265@candle.pha.pa.us>
942+ Mime-Version: 1.0
943+ Content-Type: text/plain; charset=us-ascii
944+ Content-Disposition: inline
945+ User-Agent: Mutt/1.2.5i
946+ In-Reply-To: <200101200500.AAA05265@candle.pha.pa.us>; from pgman@candle.pha.pa.us on Sat, Jan 20, 2001 at 12:00:09AM -0500
947+ From: ncm@zembu.com (Nathan Myers)
948+ Precedence: bulk
949+ Sender: pgsql-hackers-owner@postgresql.org
950+ Status: OR
951+
952+ I think the XID wraparound matter might be handled a bit more simply.
953+
954+ Given a global variable X which is the earliest XID value in use at
955+ some event (e.g. startup) you can compare two XIDs x and y, using
956+ unsigned arithmetic, with just (x-X < y-X). This has the further
957+ advantage that old transaction IDs need be "frozen" only every 4G
958+ transactions, rather than Tom's suggested 256M or 512M transactions.
959+ "Freezing", in this scheme, means to set all older XIDs to equal the
960+ chosen X, rather than setting them to some constant reserved value.
961+ No special cases are required for the comparison, even for folded
962+ values; it is (x-X < y-X) for all valid x and y.
963+
964+ I don't know the role of the "bootstrap" XID, or how it must be
965+ fitted into the above.
966+
967+ Nathan Myers
968+ ncm@zembu.com
969+
970+ ------------------------------------------------------------
971+ > We've expended a lot of worry and discussion in the past about what
972+ > happens if the OID generator wraps around. However, there is another
973+ > 4-byte counter in the system: the transaction ID (XID) generator.
974+ > While OID wraparound is survivable, if XIDs wrap around then we really
975+ > do have a Ragnarok scenario. The tuple validity checks do ordered
976+ > comparisons on XIDs, and will consider tuples with xmin > current xact
977+ > to be invalid. Result: after wraparound, your whole database would
978+ > instantly vanish from view.
979+ >
980+ > The first thought that comes to mind is that XIDs should be promoted to
981+ > eight bytes. However there are several practical problems with this:
982+ > * portability --- I don't believe long long int exists on all the
983+ > platforms we support.
984+ > * performance --- except on true 64-bit platforms, widening Datum to
985+ > eight bytes would be a system-wide performance hit, which is a tad
986+ > unpleasant to fix a scenario that's not yet been reported from the
987+ > field.
988+ > * disk space --- letting pg_log grow without bound isn't a pleasant
989+ > prospect either.
990+ >
991+ > I believe it is possible to fix these problems without widening XID,
992+ > by redefining XIDs in a way that allows for wraparound. Here's my
993+ > plan:
994+ >
995+ > 1. Allow XIDs to range from 0 to WRAPLIMIT-1 (WRAPLIMIT is not
996+ > necessarily 4G, see discussion below). Ordered comparisons on XIDs
997+ > are no longer simply "x < y", but need to be expressed as a macro.
998+ > We consider x < y if (y - x) % WRAPLIMIT < WRAPLIMIT/2.
999+ > This comparison will work as long as the range of interesting XIDs
1000+ > never exceeds WRAPLIMIT/2. Essentially, we envision the actual value
1001+ > of XID as being the low-order bits of a logical XID that always
1002+ > increases, and we assume that no extant XID is more than WRAPLIMIT/2
1003+ > transactions old, so we needn't keep track of the high-order bits.
1004+ >
1005+ > 2. To keep the system from having to deal with XIDs that are more than
1006+ > WRAPLIMIT/2 transactions old, VACUUM should "freeze" known-good old
1007+ > tuples. To do this, we'll reserve a special XID, say 1, that is always
1008+ > considered committed and is always less than any ordinary XID. (So the
1009+ > ordered-comparison macro is really a little more complicated than I said
1010+ > above. Note that there is already a reserved XID just like this in the
1011+ > system, the "bootstrap" XID. We could simply use the bootstrap XID, but
1012+ > it seems better to make another one.) When VACUUM finds a tuple that
1013+ > is committed good and has xmin < XmaxRecent (the oldest XID that might
1014+ > be considered uncommitted by any open transaction), it will replace that
1015+ > tuple's xmin by the special always-good XID. Therefore, as long as
1016+ > VACUUM is run on all tables in the installation more often than once per
1017+ > WRAPLIMIT/2 transactions, there will be no tuples with ordinary XIDs
1018+ > older than WRAPLIMIT/2.
1019+ >
1020+ > 3. At wraparound, the XID counter has to be advanced to skip over the
1021+ > InvalidXID value (zero) and the reserved XIDs, so that no real transaction
1022+ > is generated with those XIDs. No biggie here.
1023+ >
1024+ > 4. With the wraparound behavior, pg_log will have a bounded size: it
1025+ > will never exceed WRAPLIMIT*2 bits = WRAPLIMIT/4 bytes. Since we will
1026+ > recycle pg_log entries every WRAPLIMIT xacts, during transaction start
1027+ > the xact manager will have to take care to actively clear its pg_log
1028+ > entry to zeroes (I'm not sure if it does that already, or just assumes
1029+ > that new pg_log entries will start out zero). As long as that happens
1030+ > before the xact makes any data changes, it's OK to recycle the entry.
1031+ > Note we are assuming that no tuples will remain in the database with
1032+ > xmin or xmax equal to that XID from a prior cycle of the universe.
1033+ >
1034+ > This scheme allows us to survive XID wraparound at the cost of slight
1035+ > additional complexity in ordered comparisons of XIDs (which is not a
1036+ > really performance-critical task AFAIK), and at the cost that the
1037+ > original insertion XIDs of all but recent tuples will be lost by
1038+ > VACUUM. The system doesn't particularly care about that, but old XIDs
1039+ > do sometimes come in handy for debugging purposes. A possible
1040+ > compromise is to overwrite only XIDs that are older than, say,
1041+ > WRAPLIMIT/4 instead of doing so as soon as possible. This would mean
1042+ > the required VACUUM frequency is every WRAPLIMIT/4 xacts instead of
1043+ > every WRAPLIMIT/2 xacts.
1044+ >
1045+ > We have a straightforward tradeoff between the maximum size of pg_log
1046+ > (WRAPLIMIT/4 bytes) and the required frequency of VACUUM (at least
1047+ > every WRAPLIMIT/2 or WRAPLIMIT/4 transactions). This could be made
1048+ > configurable in config.h for those who're intent on customization,
1049+ > but I'd be inclined to set the default value at WRAPLIMIT = 1G.
1050+ >
1051+ > Comments? Vadim, is any of this about to be superseded by WAL?
1052+ > If not, I'd like to fix it for 7.1.
1053+ >
1054+ > regards, tom lane
1055+