|
1 | | -<!-- $Header: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v 1.11 2001/09/29 04:02:19 tgl Exp $ --> |
| 1 | +<!-- $Header: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v 1.12 2001/10/26 23:10:21 tgl Exp $ --> |
2 | 2 |
|
3 | 3 | <chapter id="wal"> |
4 | 4 | <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title> |
|
88 | 88 | transaction identifiers. Once UNDO is implemented, |
89 | 89 | <filename>pg_clog</filename> will no longer be required to be |
90 | 90 | permanent; it will be possible to remove |
91 | | - <filename>pg_clog</filename> at shutdown, split it into segments |
92 | | - and remove old segments. |
| 91 | + <filename>pg_clog</filename> at shutdown. (However, the urgency |
| 92 | + of this concern has decreased greatly with the adoption of a segmented |
| 93 | + storage method for <filename>pg_clog</filename> --- it is no longer |
| 94 | + necessary to keep old <filename>pg_clog</filename> entries around |
| 95 | + forever.) |
93 | 96 | </para> |
94 | 97 |
|
95 | 98 | <para> |
|
116 | 119 | copying the data files (operating system copy commands are not |
117 | 120 | suitable). |
118 | 121 | </para> |
| 122 | + |
| 123 | + <para> |
| 124 | + A difficulty standing in the way of realizing these benefits is that they |
| 125 | + require saving <acronym>WAL</acronym> entries for considerable periods |
| 126 | + of time (e.g., as long as the longest possible transaction if transaction |
| 127 | + UNDO is wanted). The present <acronym>WAL</acronym> format is |
| 128 | + extremely bulky since it includes many disk page snapshots. |
| 129 | + This is not a serious concern at present, since the entries only need |
| 130 | + to be kept for one or two checkpoint intervals; but to achieve |
| 131 | + these future benefits some sort of compressed <acronym>WAL</acronym> |
| 132 | + format will be needed. |
| 133 | + </para> |
119 | 134 | </sect2> |
120 | 135 | </sect1> |
121 | 136 |
|
|
133 | 148 | <para> |
134 | 149 | <acronym>WAL</acronym> logs are stored in the directory |
135 | 150 | <Filename><replaceable>$PGDATA</replaceable>/pg_xlog</Filename>, as |
136 | | - a set of segment files, each 16 MB in size. Each segment is |
137 | | - divided into 8 kB pages. The log record headers are described in |
| 151 | + a set of segment files, each 16MB in size. Each segment is |
| 152 | + divided into 8KB pages. The log record headers are described in |
138 | 153 | <filename>access/xlog.h</filename>; record content is dependent on |
139 | 154 | the type of event that is being logged. Segment files are given |
140 | 155 | ever-increasing numbers as names, starting at |
|
147 | 162 | The <acronym>WAL</acronym> buffers and control structure are in |
148 | 163 | shared memory, and are handled by the backends; they are protected |
149 | 164 | by lightweight locks. The demand on shared memory is dependent on the |
150 | | - number of buffers; the default size of the <acronym>WAL</acronym> |
151 | | - buffers is 64 kB. |
| 165 | + number of buffers. The default is 8 <acronym>WAL</acronym> |
| 166 | + buffers of 8KB each, or 64KB in all. |
152 | 167 | </para> |
153 | 168 |
|
154 | 169 | <para> |
|
166 | 181 | disk drives that falsely report a successful write to the kernel, |
167 | 182 | when, in fact, they have only cached the data and not yet stored it |
168 | 183 | on the disk. A power failure in such a situation may still lead to |
169 | | - irrecoverable data corruption; administrators should try to ensure |
170 | | - that disks holding <productname>PostgreSQL</productname>'s data and |
| 184 | + irrecoverable data corruption. Administrators should try to ensure |
| 185 | + that disks holding <productname>PostgreSQL</productname>'s |
171 | 186 | log files do not make such false reports. |
172 | 187 | </para> |
173 | 188 |
|
|
179 | 194 | checkpoint's position is saved in the file |
180 | 195 | <filename>pg_control</filename>. Therefore, when recovery is to be |
181 | 196 | done, the backend first reads <filename>pg_control</filename> and |
182 | | - then the checkpoint record; next it reads the redo record, whose |
183 | | - position is saved in the checkpoint, and begins the REDO operation. |
184 | | - Because the entire content of the pages is saved in the log on the |
185 | | - first page modification after a checkpoint, the pages will be first |
186 | | - restored to a consistent state. |
| 197 | + then the checkpoint record; then it performs the REDO operation by |
| 198 | + scanning forward from the log position indicated in the checkpoint |
| 199 | + record. |
| 200 | + Because the entire content of data pages is saved in the log on the |
| 201 | + first page modification after a checkpoint, all pages changed since |
| 202 | + the checkpoint will be restored to a consistent state. |
187 | 203 | </para> |
188 | 204 |
|
189 | 205 | <para> |
|
217 | 233 | buffers. This is undesirable because <function>LogInsert</function> |
218 | 234 | is used on every database low level modification (for example, |
219 | 235 | tuple insertion) at a time when an exclusive lock is held on |
220 | | - affected data pages and the operation is supposed to be as fast as |
221 | | - possible; what is worse, writing <acronym>WAL</acronym> buffers may |
222 | | - also cause the creation of a new log segment, which takes even more |
| 236 | + affected data pages, so the operation needs to be as fast as |
| 237 | + possible. What is worse, writing <acronym>WAL</acronym> buffers may |
| 238 | + also force the creation of a new log segment, which takes even more |
223 | 239 | time. Normally, <acronym>WAL</acronym> buffers should be written |
224 | 240 | and flushed by a <function>LogFlush</function> request, which is |
225 | 241 | made, for the most part, at transaction commit time to ensure that |
|
230 | 246 | one should increase the number of <acronym>WAL</acronym> buffers by |
231 | 247 | modifying the <varname>WAL_BUFFERS</varname> parameter. The default |
232 | 248 | number of <acronym>WAL</acronym> buffers is 8. Increasing this |
233 | | - value will have an impact on shared memory usage. |
| 249 | + value will correspondingly increase shared memory usage. |
234 | 250 | </para> |
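| | + |
| | + <para> |
| | + As an illustrative sketch only (the best value is workload-dependent), |
| | + a busy system might double the default in |
| | + <filename>postgresql.conf</filename>: |
| | + <programlisting> |
| | + wal_buffers = 16        # 16 buffers * 8KB = 128KB of shared memory |
| | + </programlisting> |
| | + </para> |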
235 | 251 |
|
236 | 252 | <para> |
|
243 | 259 | log (known as the redo record) it should start the REDO operation, |
244 | 260 | since any changes made to data files before that record are already |
245 | 261 | on disk. After a checkpoint has been made, any log segments written |
246 | | - before the undo records are removed, so checkpoints are used to free |
247 | | - disk space in the <acronym>WAL</acronym> directory. (When |
248 | | - <acronym>WAL</acronym>-based <acronym>BAR</acronym> is implemented, |
249 | | - the log segments can be archived instead of just being removed.) |
250 | | - The checkpoint maker is also able to create a few log segments for |
251 | | - future use, so as to avoid the need for |
252 | | - <function>LogInsert</function> or <function>LogFlush</function> to |
253 | | - spend time in creating them. |
| 262 | + before the undo records are no longer needed and can be recycled or |
| 263 | + removed. (When <acronym>WAL</acronym>-based <acronym>BAR</acronym> is |
| 264 | + implemented, the log segments would be archived before being recycled |
| 265 | + or removed.) |
254 | 266 | </para> |
255 | 267 |
|
256 | 268 | <para> |
257 | | - The <acronym>WAL</acronym> log is held on the disk as a set of 16 |
258 | | - MB files called <firstterm>segments</firstterm>. By default a new |
259 | | - segment is created only if more than 75% of the current segment is |
260 | | - used. One can instruct the server to pre-create up to 64 log segments |
| 269 | + The checkpoint maker is also able to create a few log segments for |
| 270 | + future use, so as to avoid the need for |
| 271 | + <function>LogInsert</function> or <function>LogFlush</function> to |
| 272 | + spend time in creating them. (If that happens, the entire database |
| 273 | + system will be delayed by the creation operation, so it's better if |
| 274 | + the files can be created in the checkpoint maker, which is not on |
| 275 | + anyone's critical path.) |
| 276 | + By default a new 16MB segment file is created only if more than 75% of |
| 277 | + the current segment has been used. This is inadequate if the system |
| 278 | + generates more than 4MB of log output between checkpoints. |
| 279 | + One can instruct the server to pre-create up to 64 log segments |
261 | 280 | at checkpoint time by modifying the <varname>WAL_FILES</varname> |
262 | 281 | configuration parameter. |
263 | 282 | </para> |
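| | + |
| | + <para> |
| | + For example (an illustrative sketch, not a recommendation), an |
| | + installation that regularly writes several segments' worth of log |
| | + between checkpoints might set, in <filename>postgresql.conf</filename>: |
| | + <programlisting> |
| | + wal_files = 8           # pre-create up to 8 spare 16MB segments |
| | + </programlisting> |
| | + </para> |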
264 | 283 |
|
265 | | - <para> |
266 | | - For faster after-crash recovery, it would be better to create |
267 | | - checkpoints more often. However, one should balance this against |
268 | | - the cost of flushing dirty data pages; in addition, to ensure data |
269 | | - page consistency, the first modification of a data page after each |
270 | | - checkpoint results in logging the entire page content, thus |
271 | | - increasing output to log and the log's size. |
272 | | - </para> |
273 | | - |
274 | 284 | <para> |
275 | 285 | The postmaster spawns a special backend process every so often |
276 | 286 | to create the next checkpoint. A checkpoint is created every |
|
281 | 291 | <command>CHECKPOINT</command>. |
282 | 292 | </para> |
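| | + |
| | + <para> |
| | + For example, to force an immediate checkpoint from an |
| | + <acronym>SQL</acronym> session: |
| | + <programlisting> |
| | + CHECKPOINT; |
| | + </programlisting> |
| | + </para> |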
283 | 293 |
|
| 294 | + <para> |
| 295 | + Reducing <varname>CHECKPOINT_SEGMENTS</varname> and/or |
| 296 | + <varname>CHECKPOINT_TIMEOUT</varname> causes checkpoints to be |
| 297 | + done more often. This allows faster after-crash recovery (since |
| 298 | + less work will need to be redone). However, one must balance this against |
| 299 | + the increased cost of flushing dirty data pages more often. In addition, |
| 300 | + to ensure data page consistency, the first modification of a data page |
| 301 | + after each checkpoint results in logging the entire page content. |
| 302 | + Thus a smaller checkpoint interval increases the volume of output to |
| 303 | + the log, partially negating the goal of using a smaller interval, and |
| 304 | + in any case causing more disk I/O. |
| 305 | + </para> |
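| | + |
| | + <para> |
| | + For illustration only (the right trade-off is installation-specific), |
| | + a configuration favoring fast recovery over run-time overhead might be: |
| | + <programlisting> |
| | + checkpoint_segments = 1    # checkpoint after every 16MB of log |
| | + checkpoint_timeout = 60    # but at least once per minute |
| | + </programlisting> |
| | + </para> |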
| 306 | + |
| 307 | + <para> |
| 308 | + The number of 16MB segment files will always be at least |
| 309 | + <varname>WAL_FILES</varname> + 1, and will normally not exceed |
| 310 | + <varname>WAL_FILES</varname> + 2 * <varname>CHECKPOINT_SEGMENTS</varname> |
| 311 | + + 1. This may be used to estimate space requirements for <acronym>WAL</acronym>. Ordinarily, |
| 312 | + when an old log segment file is no longer needed, it is recycled (renamed |
| 313 | + to become the next sequential future segment). If, due to a short-term |
| 314 | + peak of log output rate, there are more than <varname>WAL_FILES</varname> + |
| 315 | + 2 * <varname>CHECKPOINT_SEGMENTS</varname> + 1 segment files, then unneeded |
| 316 | + segment files will be deleted instead of recycled until the system gets |
| 317 | + back under this limit. (If this happens on a regular basis, |
| 318 | + <varname>WAL_FILES</varname> should be increased to avoid it. Deleting log |
| 319 | + segments that will only have to be created again later is expensive and |
| 320 | + pointless.) |
| 321 | + </para> |
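| | + |
| | + <para> |
| | + As a worked example, taking for concreteness |
| | + <varname>WAL_FILES</varname> = 0 and |
| | + <varname>CHECKPOINT_SEGMENTS</varname> = 3: |
| | + <programlisting> |
| | + # minimum:        wal_files + 1                          = 1 segment,  16MB |
| | + # normal maximum: wal_files + 2*checkpoint_segments + 1 = 7 segments, 112MB |
| | + </programlisting> |
| | + </para> |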
| 322 | + |
284 | 323 | <para> |
285 | 324 | The <varname>COMMIT_DELAY</varname> parameter defines for how many |
286 | 325 | microseconds the backend will sleep after writing a commit |
|
294 | 333 | Note that on most platforms, the resolution of a sleep request is |
295 | 334 | ten milliseconds, so that any nonzero <varname>COMMIT_DELAY</varname> |
296 | 335 | setting between 1 and 10000 microseconds will have the same effect. |
| 336 | + Good values for these parameters are not yet clear; experimentation |
| 337 | + is encouraged. |
297 | 338 | </para> |
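| | + |
| | + <para> |
| | + One possible starting point for such experiments (the values are |
| | + illustrative only): |
| | + <programlisting> |
| | + commit_delay = 10000      # sleep 10 milliseconds before flushing |
| | + commit_siblings = 5       # but only if 5 other transactions are active |
| | + </programlisting> |
| | + </para> |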
298 | 339 |
|
299 | 340 | <para> |
|