@@ -22,9 +22,13 @@ refers to data that is stored in <productname>PostgreSQL</productname> tables.
2222</para>
2323
2424<para>
25- <xref linkend="page-table"> shows how pages in both normal <productname>PostgreSQL</productname> tables
26- and <productname>PostgreSQL</productname> indexes
27- (e.g., a B-tree index) are structured.
25+
26+ <xref linkend="page-table"> shows how pages in both normal
27+ <productname>PostgreSQL</productname> tables and
28+ <productname>PostgreSQL</productname> indexes (e.g., a B-tree index)
29+ are structured. This structure is also used for toast tables and sequences.
30+ There are five parts to each page.
31+
2832</para>
2933
3034<table tocentry="1" id="page-table">
@@ -43,113 +47,255 @@ Item
4347<tbody>
4448
4549<row>
46- <entry>itemPointerData</entry>
47- </row>
48-
49- <row>
50- <entry>filler</entry>
50+ <entry>PageHeaderData</entry>
51+ <entry>20 bytes long. Contains general information about the page that is needed to access it.</entry>
5152</row>
5253
5354<row>
54- <entry>itemData...</entry>
55+ <entry>itemPointerData</entry>
56+ <entry>List of (offset,length) pairs pointing to the actual items.</entry>
5557</row>
5658
5759<row>
58- <entry>Unallocated Space</entry>
60+ <entry>Free space</entry>
61+ <entry>The unallocated space. All new tuples are allocated from here, generally from the end.</entry>
5962</row>
6063
6164<row>
62- <entry>ItemContinuationData</entry>
65+ <entry>items</entry>
66+ <entry>The actual items themselves. Different access methods store different data here.</entry>
6367</row>
6468
6569<row>
6670<entry>Special Space</entry>
71+ <entry>Access method specific data. Different access methods store different data. Unused by normal tables.</entry>
6772</row>
6873
69- <row >
70- <entry><quote>ItemData 2</quote></entry >
71- </row >
74+ </tbody>
75+ </tgroup>
76+ </table>
7277
73- <row>
74- <entry><quote>ItemData 1</quote></entry>
75- </row>
78+ <para>
7679
77- <row>
78- <entry>ItemIdData</entry>
79- </row>
80+ The first 20 bytes of each page consist of a page header
81+ (PageHeaderData). Its format is detailed in <xref
82+ linkend="pageheaderdata-table">. The first two fields hold WAL-related
83+ information. They are followed by three 2-byte integer fields
84+ (<firstterm>lower</firstterm>, <firstterm>upper</firstterm>, and
85+ <firstterm>special</firstterm>). These represent byte offsets to the start
86+ of unallocated space, to the end of unallocated space, and to the start of
87+ the special space.
88+
89+ </para>
90+
91+ <table tocentry="1" id="pageheaderdata-table">
92+ <title>PageHeaderData Layout</title>
93+ <titleabbrev>PageHeaderData Layout</titleabbrev>
94+ <tgroup cols="4">
95+ <thead>
96+ <row>
97+ <entry>Field</entry>
98+ <entry>Type</entry>
99+ <entry>Length</entry>
100+ <entry>Description</entry>
101+ </row>
102+ </thead>
103+ <tbody>
104+ <row>
105+ <entry>pd_lsn</entry>
106+ <entry>XLogRecPtr</entry>
107+ <entry>8 bytes</entry>
108+ <entry>LSN: next byte after last byte of xlog</entry>
109+ </row>
110+ <row>
111+ <entry>pd_sui</entry>
112+ <entry>StartUpID</entry>
113+ <entry>4 bytes</entry>
114+ <entry>StartUpID of the last change (currently used only by the heap AM)</entry>
115+ </row>
116+ <row>
117+ <entry>pd_lower</entry>
118+ <entry>LocationIndex</entry>
119+ <entry>2 bytes</entry>
120+ <entry>Offset to start of free space.</entry>
121+ </row>
122+ <row>
123+ <entry>pd_upper</entry>
124+ <entry>LocationIndex</entry>
125+ <entry>2 bytes</entry>
126+ <entry>Offset to end of free space.</entry>
127+ </row>
128+ <row>
129+ <entry>pd_special</entry>
130+ <entry>LocationIndex</entry>
131+ <entry>2 bytes</entry>
132+ <entry>Offset to start of special space.</entry>
133+ </row>
134+ <row>
135+ <entry>pd_opaque</entry>
136+ <entry>OpaqueData</entry>
137+ <entry>2 bytes</entry>
138+ <entry>AM-generic information. Currently just stores the page size.</entry>
139+ </row>
140+ </tbody>
141+ </tgroup>
142+ </table>
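As a rough illustration of how the three offsets in the table relate to each other, the following C sketch computes the free space and the size of the special space from them. The struct and function names are hypothetical and are not the definitions in bufpage.h; the WAL-related fields are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the last four header fields from the table.
 * This is an illustration only, not the real PageHeaderData. */
typedef struct {
    uint16_t pd_lower;   /* offset to start of free space */
    uint16_t pd_upper;   /* offset to end of free space */
    uint16_t pd_special; /* offset to start of special space */
    uint16_t pd_opaque;  /* currently just the page size */
} SketchPageOffsets;

/* Bytes of unallocated space between the item identifiers and the items. */
unsigned sketch_free_space(const SketchPageOffsets *h)
{
    assert(h->pd_lower <= h->pd_upper);
    return (unsigned)(h->pd_upper - h->pd_lower);
}

/* Bytes reserved at the end of the page for the access method.
 * A result of 0 means the special space is unused. */
unsigned sketch_special_size(const SketchPageOffsets *h, unsigned page_size)
{
    assert(h->pd_special <= page_size);
    return page_size - h->pd_special;
}
```

For an ordinary table the special offset equals the page size, so the second function returns 0.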
80143
81- <row>
82- <entry>PageHeaderData</entry>
83- </row>
144+ <para>
145+ Special space is a region at the end of the page that is allocated at page
146+ initialization time and contains information specific to an access method.
147+ The last 2 bytes of the page header, <firstterm>opaque</firstterm>,
148+ currently store only the page size. Page size is stored in each page
149+ because frames in the buffer pool may be subdivided into equal-sized pages
150+ on a frame-by-frame basis within a table (is this true? - mvo).
84151
85- </tbody>
86- </tgroup>
87- </table>
152+ </para>
88153
89- <!--
90- .\" Running
91- .\" .q .../bin/dumpbpages
92- .\" or
93- .\" .q .../src/support/dumpbpages
94- .\" as the postgres superuser
95- .\" with the file paths associated with
96- .\" (heap or B-tree index) classes,
97- .\" .q .../data/base/<database-name>/<class-name>,
98- .\" will display the page structure used by the classes.
99- .\" Specifying the
100- .\" .q -r
101- .\" flag will cause the classes to be
102- .\" treated as heap classes and for more information to be displayed.
103- -->
154+ <para>
104155
105- <para>
106- The first 8 bytes of each page consists of a page header
107- (PageHeaderData).
108- Within the header, the first three 2-byte integer fields
109- (<firstterm>lower</firstterm>,
110- <firstterm>upper</firstterm>,
111- and
112- <firstterm>special</firstterm>)
113- represent byte offsets to the start of unallocated space, to the end
114- of unallocated space, and to the start of <firstterm>special space</firstterm>.
115- Special space is a region at the end of the page that is allocated at
116- page initialization time and contains information specific to an
117- access method. The last 2 bytes of the page header,
118- <firstterm>opaque</firstterm>,
119- encode the page size and information on the internal fragmentation of
120- the page. Page size is stored in each page because frames in the
121- buffer pool may be subdivided into equal sized pages on a frame by
122- frame basis within a table. The internal fragmentation information is
123- used to aid in determining when page reorganization should occur.
124- </para>
156+ Following the page header are item identifiers
157+ (<firstterm>ItemIdData</firstterm>). New item identifiers are allocated
158+ from the first four bytes of unallocated space. Because an item
159+ identifier is never moved until it is freed, its index may be used to
160+ indicate the location of an item on a page. In fact, every pointer to an
161+ item (<firstterm>ItemPointer</firstterm>, also known as
162+ <firstterm>CTID</firstterm>) created by
163+ <productname>PostgreSQL</productname> consists of a frame number and an
164+ index of an item identifier. An item identifier contains a byte-offset to
165+ the start of an item, its length in bytes, and a set of attribute bits
166+ which affect its interpretation.
125167
126- <para>
127- Following the page header are item identifiers
128- (<firstterm>ItemIdData</firstterm>).
129- New item identifiers are allocated from the first four bytes of
130- unallocated space. Because an item identifier is never moved until it
131- is freed, its index may be used to indicate the location of an item on
132- a page. In fact, every pointer to an item
133- (<firstterm>ItemPointer</firstterm>)
134- created by <productname>PostgreSQL</productname> consists of a frame number and an index of an item
135- identifier. An item identifier contains a byte-offset to the start of
136- an item, its length in bytes, and a set of attribute bits which affect
137- its interpretation.
138- </para>
168+ </para>
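The bookkeeping for item identifiers can be sketched in C as follows. It assumes only what the text states: a 20-byte header, 4-byte identifiers placed directly after it, and new identifiers taken from the start of free space. The names are hypothetical, and the zero-based index is an assumption made for simplicity, since the text does not say which base is used.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes taken from the text; names are hypothetical. */
enum { SKETCH_HEADER_SIZE = 20, SKETCH_ITEMID_SIZE = 4 };

/* Number of item identifiers on the page, derived from pd_lower. */
unsigned sketch_item_count(uint16_t pd_lower)
{
    assert(pd_lower >= SKETCH_HEADER_SIZE);
    return (unsigned)(pd_lower - SKETCH_HEADER_SIZE) / SKETCH_ITEMID_SIZE;
}

/* Byte offset within the page of the identifier with the given
 * zero-based index (assumed base). */
unsigned sketch_itemid_offset(unsigned index)
{
    return SKETCH_HEADER_SIZE + index * SKETCH_ITEMID_SIZE;
}

/* Claim the first four bytes of free space for a new identifier.
 * Returns the new identifier's index, or -1 if there is no room. */
int sketch_alloc_itemid(uint16_t *pd_lower, uint16_t pd_upper)
{
    assert(*pd_lower <= pd_upper);
    if ((unsigned)(pd_upper - *pd_lower) < SKETCH_ITEMID_SIZE)
        return -1;
    int index = (int)sketch_item_count(*pd_lower);
    *pd_lower = (uint16_t)(*pd_lower + SKETCH_ITEMID_SIZE);
    return index;
}
```

Because an identifier never moves once allocated, the index returned here stays valid for as long as the item lives, which is what makes it usable as half of an item pointer.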
139169
140- <para>
141- The items themselves are stored in space allocated backwards from
142- the end of unallocated space. Usually, the items are not interpreted.
143- However when the item is too long to be placed on a single page or
144- when fragmentation of the item is desired, the item is divided and
145- each piece is handled as distinct items in the following manner. The
146- first through the next to last piece are placed in an item
147- continuation structure
148- (<firstterm>ItemContinuationData</firstterm>).
149- This structure contains
150- itemPointerData
151- which points to the next piece and the piece itself. The last piece
152- is handled normally.
153- </para>
170+ <para>
171+
172+ The items themselves are stored in space allocated backwards from the end
173+ of unallocated space. The exact structure varies depending on what the
174+ table is to contain. Sequences and tables both use a structure named
175+ <firstterm>HeapTupleHeaderData</firstterm>, described below.
176+
177+ </para>
178+
179+ <para>
180+
181+ The final section is the "special section" which may contain anything the
182+ access method wishes to store. Ordinary tables do not use this at all
183+ (indicated by setting the offset to the page size).
184+
185+ </para>
186+
187+ <para>
188+
189+ All tuples are structured the same way: a header of around 31 bytes
190+ followed by an optional null bitmask and the data. The header is detailed
191+ below in <xref linkend="heaptupleheaderdata-table">. The null bitmask is
192+ only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in the
193+ <firstterm>t_infomask</firstterm>. If it is present it takes up the space
194+ between the end of the header and the beginning of the data, as indicated
195+ by the <firstterm>t_hoff</firstterm> field. In this bitmask, a 1 bit
196+ indicates a non-null value and a 0 bit indicates a null.
197+
198+ </para>
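A minimal C sketch of testing the null bitmask is shown below. It follows the convention stated above (1 bit for not-null, 0 bit for null). The function name, the zero-based attribute number, and the bit order (least significant bit first within each byte) are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if attribute attnum (zero-based, assumed) is NULL according
 * to the bitmask, 0 otherwise. A 0 bit means NULL. The bit order within
 * a byte is assumed to be least significant bit first. */
int sketch_att_is_null(const uint8_t *bits, unsigned attnum)
{
    assert(bits != NULL);
    return !(bits[attnum >> 3] & (1u << (attnum & 7)));
}
```

The caller is expected to consult the bitmask only when HEAP_HASNULL is set in t_infomask; otherwise no bitmask is present and every attribute is non-null.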
199+
200+ <table tocentry="1" id="heaptupleheaderdata-table">
201+ <title>HeapTupleHeaderData Layout</title>
202+ <titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
203+ <tgroup cols="4">
204+ <thead>
205+ <row>
206+ <entry>Field</entry>
207+ <entry>Type</entry>
208+ <entry>Length</entry>
209+ <entry>Description</entry>
210+ </row>
211+ </thead>
212+ <tbody>
213+ <row>
214+ <entry>t_oid</entry>
215+ <entry>Oid</entry>
216+ <entry>4 bytes</entry>
217+ <entry>OID of this tuple</entry>
218+ </row>
219+ <row>
220+ <entry>t_cmin</entry>
221+ <entry>CommandId</entry>
222+ <entry>4 bytes</entry>
223+ <entry>insert CID stamp</entry>
224+ </row>
225+ <row>
226+ <entry>t_cmax</entry>
227+ <entry>CommandId</entry>
228+ <entry>4 bytes</entry>
229+ <entry>delete CID stamp</entry>
230+ </row>
231+ <row>
232+ <entry>t_xmin</entry>
233+ <entry>TransactionId</entry>
234+ <entry>4 bytes</entry>
235+ <entry>insert XID stamp</entry>
236+ </row>
237+ <row>
238+ <entry>t_xmax</entry>
239+ <entry>TransactionId</entry>
240+ <entry>4 bytes</entry>
241+ <entry>delete XID stamp</entry>
242+ </row>
243+ <row>
244+ <entry>t_ctid</entry>
245+ <entry>ItemPointerData</entry>
246+ <entry>6 bytes</entry>
247+ <entry>current TID of this or newer tuple</entry>
248+ </row>
249+ <row>
250+ <entry>t_natts</entry>
251+ <entry>int16</entry>
252+ <entry>2 bytes</entry>
253+ <entry>number of attributes</entry>
254+ </row>
255+ <row>
256+ <entry>t_infomask</entry>
257+ <entry>uint16</entry>
258+ <entry>2 bytes</entry>
259+ <entry>Various flags</entry>
260+ </row>
261+ <row>
262+ <entry>t_hoff</entry>
263+ <entry>uint8</entry>
264+ <entry>1 byte</entry>
265+ <entry>length of tuple header. Also offset of data.</entry>
266+ </row>
267+ </tbody>
268+ </tgroup>
269+ </table>
270+
271+ <para>
272+
273+ All the details may be found in src/include/storage/bufpage.h.
274+
275+ </para>
276+
277+ <para>
278+
279+ Interpreting the actual data can only be done with information obtained
280+ from other tables, mostly <firstterm>pg_attribute</firstterm>. The
281+ particular fields are <firstterm>attlen</firstterm> and
282+ <firstterm>attalign</firstterm>. There is no way to directly get a
283+ particular attribute, except when there are only fixed width fields and no
284+ NULLs. All this trickery is wrapped up in the functions
285+ <firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
286+ and <firstterm>heap_getsysattr</firstterm>.
287+
288+ </para>
289+ <para>
154290
291+ To read the data you need to examine each attribute in turn. First check
292+ whether the field is NULL according to the null bitmap. If it is, go to
293+ the next attribute. Then make sure you have the right alignment. If the
294+ field is a fixed-width field, then all the bytes are simply placed. If it's a
295+ variable length field (attlen == -1) then it's a bit more complicated,
296+ using the variable length structure <firstterm>varattrib</firstterm>.
297+ Depending on the flags, the data may be either inline, compressed or in
298+ another table (TOAST).
299+
300+ </para>
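The attribute walk described above can be sketched in C for the fixed-width case. This is not heap_getattr or fastgetattr. The names are hypothetical, attalign is given directly as a byte count rather than as stored in pg_attribute, offsets are measured from the start of the data area (assumed to be suitably aligned), and the sketch gives up when a variable-length attribute precedes the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-column description standing in for pg_attribute. */
typedef struct {
    int attlen;        /* width in bytes, or -1 for variable length */
    unsigned attalign; /* required alignment in bytes (1, 2, 4 or 8) */
} SketchAttr;

/* Round off up to the next multiple of align (a power of two). */
size_t sketch_align(size_t off, unsigned align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (off + align - 1) & ~((size_t)align - 1);
}

/* Offset of attribute target (zero-based) within the data area.
 * nullbits may be NULL when the tuple has no null bitmask.
 * Returns -1 if the target is NULL, or if a variable-length
 * attribute comes before it (not handled in this sketch). */
long sketch_attr_offset(const SketchAttr *attrs, const uint8_t *nullbits,
                        unsigned target)
{
    size_t off = 0;
    for (unsigned i = 0; i <= target; i++) {
        /* A 0 bit means NULL; a NULL attribute occupies no space. */
        if (nullbits != NULL && !(nullbits[i >> 3] & (1u << (i & 7)))) {
            if (i == target)
                return -1;
            continue;
        }
        off = sketch_align(off, attrs[i].attalign);
        if (i == target)
            return (long)off;
        if (attrs[i].attlen < 0)
            return -1; /* would need the varattrib length to continue */
        off += (size_t)attrs[i].attlen;
    }
    return -1; /* not reached */
}
```

The sketch also shows why direct access is only possible with fixed-width fields and no NULLs: in any other case the offset of an attribute depends on the contents of the attributes before it.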
155301</chapter>