Commit 5edb24a

Buffering GiST index build algorithm.

When building a GiST index that doesn't fit in cache, buffers are attached
to some internal nodes in the index. This speeds up the build by avoiding
random I/O that would otherwise be needed to traverse all the way down the
tree to find the right leaf page for each tuple.

Alexander Korotkov

1 parent 09b68c7 commit 5edb24a
File tree

11 files changed: +2297 −186 lines changed

doc/src/sgml/gist.sgml

Lines changed: 34 additions & 0 deletions

@@ -642,6 +642,40 @@ my_distance(PG_FUNCTION_ARGS)
 
 </variablelist>
 
+ <sect2 id="gist-buffering-build">
+  <title>GiST buffering build</title>
+  <para>
+   Building large GiST indexes by simply inserting all the tuples tends to be
+   slow, because if the index tuples are scattered across the index and the
+   index is large enough to not fit in cache, the insertions need to perform
+   a lot of random I/O. PostgreSQL from version 9.2 supports a more efficient
+   method to build GiST indexes based on buffering, which can dramatically
+   reduce the number of random I/Os needed for non-ordered data sets. For
+   well-ordered data sets the benefit is smaller or non-existent, because
+   only a small number of pages receive new tuples at a time, and those
+   pages fit in cache even if the index as a whole does not.
+  </para>
+
+  <para>
+   However, a buffering index build needs to call the <function>penalty</>
+   function more often, which consumes some extra CPU resources. Also, the
+   buffers used in the buffering build need temporary disk space, up to
+   the size of the resulting index. Buffering can also influence the quality
+   of the produced index, in both positive and negative directions. That
+   influence depends on various factors, like the distribution of the input
+   data and the operator class implementation.
+  </para>
+
+  <para>
+   By default, the index build switches to the buffering method when the
+   index size reaches <xref linkend="guc-effective-cache-size">. It can
+   be manually turned on or off by the <literal>BUFFERING</literal> parameter
+   to the CREATE INDEX clause. The default behavior is good for most cases,
+   but turning buffering off might speed up the build somewhat if the input
+   data is ordered.
+  </para>
+
+ </sect2>
 </sect1>
 
 <sect1 id="gist-examples">

doc/src/sgml/ref/create_index.sgml

Lines changed: 20 additions & 0 deletions

@@ -340,6 +340,26 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable class="parameter">name</
    </listitem>
   </varlistentry>
 
+  </variablelist>
+  <para>
+   GiST indexes additionally accept this parameter:
+  </para>
+
+  <variablelist>
+
+  <varlistentry>
+   <term><literal>BUFFERING</></term>
+   <listitem>
+    <para>
+     Determines whether the buffering build technique described in
+     <xref linkend="gist-buffering-build"> is used to build the index. With
+     <literal>OFF</> it is disabled, with <literal>ON</> it is enabled, and
+     with <literal>AUTO</> it is initially disabled, but turned on
+     on-the-fly once the index size reaches
+     <xref linkend="guc-effective-cache-size">. The default is
+     <literal>AUTO</>.
+    </para>
+   </listitem>
+  </varlistentry>
+
 </variablelist>
 </refsect2>

src/backend/access/common/reloptions.c

Lines changed: 11 additions & 0 deletions

@@ -219,6 +219,17 @@ static relopt_real realRelOpts[] =
 
 static relopt_string stringRelOpts[] =
 {
+	{
+		{
+			"buffering",
+			"Enables buffering build for this GiST index",
+			RELOPT_KIND_GIST
+		},
+		4,
+		false,
+		gistValidateBufferingOption,
+		"auto"
+	},
 	/* list terminator */
 	{{NULL}}
 };
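
The entry registers gistValidateBufferingOption as the option's validator and "auto" as its default. A standalone sketch of what such a validator has to check follows; the real backend reports the error through its own error machinery (ereport) rather than stderr, so treat this as an assumption-level illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Accept only the three values the BUFFERING reloption documents. */
    static void
    validate_buffering_option(const char *value)
    {
        if (value == NULL ||
            (strcmp(value, "on") != 0 &&
             strcmp(value, "off") != 0 &&
             strcmp(value, "auto") != 0))
        {
            fprintf(stderr, "invalid value for \"buffering\" option: "
                    "valid values are \"on\", \"off\", and \"auto\"\n");
            exit(EXIT_FAILURE);
        }
    }

With the option registered, the setting can be chosen at index creation time, e.g. CREATE INDEX ... USING gist (...) WITH (buffering = on).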

src/backend/access/gist/Makefile

Lines changed: 1 addition & 1 deletion

@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = gist.o gistutil.o gistxlog.o gistvacuum.o gistget.o gistscan.o \
-	gistproc.o gistsplit.o
+	gistproc.o gistsplit.o gistbuild.o gistbuildbuffers.o
 
 include $(top_srcdir)/src/backend/common.mk

src/backend/access/gist/README

Lines changed: 135 additions & 0 deletions

@@ -24,13 +24,20 @@ The current implementation of GiST supports:
 * provides NULL-safe interface to GiST core
 * Concurrency
 * Recovery support via WAL logging
+* Buffering build algorithm
 
 The support for concurrency implemented in PostgreSQL was developed based on
 the paper "Access Methods for Next-Generation Database Systems" by
 Marcel Kornacker:
 
 http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz
 
+The buffering build algorithm for GiST was developed based on the paper
+"Efficient Bulk Operations on Dynamic R-trees" by Lars Arge, Klaus Hinrichs,
+Jan Vahrenhold and Jeffrey Scott Vitter.
+
+http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.9894&rep=rep1&type=pdf
+
 The original algorithms were modified in several ways:
 
 * They had to be adapted to PostgreSQL conventions. For example, the SEARCH

@@ -278,6 +285,134 @@ would complicate the insertion algorithm. So when an insertion sees a page
 with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
 crashed in the middle to completion by adding the downlink in the parent.
 
+Buffering build algorithm
+-------------------------
+
+In the buffering index build algorithm, some or all internal nodes have a
+buffer attached to them. When a tuple is inserted at the top, the descent
+down the tree is stopped as soon as a buffer is reached, and the tuple is
+pushed into the buffer. When a buffer gets too full, all the tuples in it are
+flushed to the lower level, where they again hit lower-level buffers or leaf
+pages. This makes the insertions happen in more of a breadth-first than
+depth-first order, which greatly reduces the amount of random I/O required.
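
A toy C sketch of that stop-at-first-buffer descent (all names here are hypothetical; the real code picks the child page with the operator class's penalty function and works with on-disk pages and attached node buffers):

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct Node
    {
        bool         is_leaf;
        bool         has_buffer;   /* only levels that are multiples of
                                    * level_step carry buffers */
        struct Node *children[8];  /* toy fanout */
    } Node;

    /* Toy stand-ins for the real buffer and leaf operations. */
    static void buffer_append(Node *n, int tuple)
    { (void) n; printf("tuple %d parked in a buffer\n", tuple); }
    static void leaf_insert(Node *n, int tuple)
    { (void) n; printf("tuple %d placed on a leaf\n", tuple); }

    /* The real code chooses the child with the least penalty; take the
     * first child here just to keep the sketch self-contained. */
    static Node *choose_child(Node *n, int tuple)
    { (void) tuple; return n->children[0]; }

    /*
     * Descend from the root, but stop at the first buffered node: the
     * tuple waits there and is cascaded further down only when that
     * buffer is emptied in bulk, which is what turns the insertion
     * order breadth-first.
     */
    static void
    buffering_insert(Node *root, int tuple)
    {
        Node *node = root;

        while (!node->is_leaf)
        {
            if (node->has_buffer)
            {
                buffer_append(node, tuple);
                return;
            }
            node = choose_child(node, tuple);
        }
        leaf_insert(node, tuple);
    }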
+In the algorithm, levels are numbered so that leaf pages have level zero,
+and internal node levels count up from 1. This numbering ensures that a page's
+level number never changes, even when the root page is split.
+
+Level   Tree
+
+3                      *
+                  /         \
+2            *                    *
+           / | \                / | \
+1        *   *   *            *   *   *
+        / \ / \ / \          / \ / \ / \
+0      o  o o  o o  o      o  o o  o o  o
+
+* - internal page
+o - leaf page
+
+Internal pages that belong to certain levels have buffers associated with
+them. Leaf pages never have buffers. Which levels have buffers is controlled
+by the "level step" parameter: level numbers that are multiples of level_step
+have buffers, while others do not. For example, if level_step = 2, then
+pages on levels 2, 4, 6, ... have buffers. If level_step = 1 then every
+internal page has a buffer.
+
+Level   Tree (level_step = 1)             Tree (level_step = 2)
+
+3                  *                                 *
+              /         \                       /         \
+2         *(b)           *(b)               *(b)           *(b)
+         / | \          / | \              / | \          / | \
+1    *(b) *(b) *(b) *(b) *(b) *(b)       *    *    *    *    *    *
+     / \  / \  / \  / \  / \  / \       / \  / \  / \  / \  / \  / \
+0   o  o o  o o  o o  o o  o o  o      o  o o  o o  o o  o o  o o  o
+
+(b) - buffer
+
+Logically, a buffer is just a bunch of tuples. Physically, it is divided
+into pages, backed by a temporary file. Each buffer can be in one of two
+states:
+a) The last page of the buffer is kept in main memory. A node buffer is
+automatically switched to this state when a new index tuple is added to it,
+or a tuple is removed from it.
+b) All pages of the buffer are swapped out to disk. When a buffer becomes
+too full, and we start to flush it, all other buffers are switched to this
+state.
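
One way to picture those two states in C, with illustrative field names only (the committed implementation manages this in gistbuildbuffers.c through a temporary file):

    #include <stdlib.h>

    #define BUFFER_PAGE_SIZE 8192

    typedef struct BufferPage
    {
        char payload[BUFFER_PAGE_SIZE];  /* packed index tuples */
    } BufferPage;

    typedef struct NodeBuffer
    {
        int         npages;       /* pages of this buffer in the temp file */
        long        file_offset;  /* where those pages live on disk */
        BufferPage *last_page;    /* state (a): tail page held in memory;
                                   * state (b): NULL, fully swapped out */
    } NodeBuffer;

    /* Enter state (b): flush the in-memory tail page and release it. */
    static void
    swap_out(NodeBuffer *buf)
    {
        if (buf->last_page != NULL)
        {
            /* ... write *buf->last_page to the temp file at file_offset ... */
            free(buf->last_page);
            buf->last_page = NULL;
        }
    }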
+When an index tuple is inserted, its initial processing can end at one of
+the following places:
+1) A leaf page, if the depth of the index is <= level_step, meaning that
+   none of the internal pages have buffers associated with them.
+2) The buffer of the topmost level page that has buffers.
+
+New index tuples are processed until one of the buffers in the topmost
+buffered level becomes half-full. When a buffer becomes half-full, it's added
+to the emptying queue, and will be emptied before a new tuple is processed.
+
+Emptying a buffer means that its index tuples are moved into buffers at a
+lower level, or onto leaf pages. First, all the other buffers are swapped to
+disk to free up the memory. Then tuples are popped from the buffer one by
+one, and cascaded down the tree to the next buffer or leaf page below the
+buffered node.
+
+Emptying a buffer has the interesting dynamic property that any intermediate
+pages between the buffer being emptied, and the next buffered or leaf level
+below it, become cached. If there are no more buffers below the node, the
+leaf pages where the tuples finally land get cached too. If there are, the
+last buffer page of each buffer below is kept in memory. This is illustrated
+in the figures below:
+
+  Buffer being emptied to
+  lower-level buffers               Buffer being emptied to leaf pages
+
+          +(fb)                              +(fb)
+         /     \                            /     \
+        +       +                          +       +
+       / \     / \                        / \     / \
+  *(ab) *(ab) *(ab) *(ab)                x   x   x   x
+
+  +    - cached internal page
+  x    - cached leaf page
+  *    - non-cached internal page
+  (fb) - buffer being emptied
+  (ab) - buffers being appended to, with the last page in memory
+
+At the beginning of the index build, the level-step is chosen so that all the
+pages involved in emptying one buffer fit in cache, so after each of those
+pages has been accessed once and cached, emptying a buffer doesn't involve
+any more I/O. This locality is where the speedup of the buffering algorithm
+comes from.
+
+Emptying one buffer can fill up one or more of the lower-level buffers,
+triggering emptying of them as well. Whenever a buffer becomes too full, it's
+added to the emptying queue, and will be emptied after the current buffer has
+been processed.
+
+To keep the size of each buffer limited even in the worst case, buffer
+emptying is scheduled as soon as a buffer becomes half-full, and emptying it
+continues until half of the nominal buffer size worth of tuples has been
+emptied. This guarantees that when buffer emptying begins, all the
+lower-level buffers are at most half-full. In the worst case that all the
+tuples are cascaded down to the same lower-level buffer, that buffer
+therefore has enough space to accommodate all the tuples emptied from the
+upper-level buffer. There is no hard size limit in any of the data structures
+used, though, so this only needs to be approximate; small overfilling of some
+buffers doesn't matter.
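
The half-full scheduling rule condenses into a toy C model; buffers are reduced to tuple counters and the cascading step is left as a comment, so every name below is illustrative:

    #define NBUFFERS   4
    #define BUFFER_CAP 100              /* nominal buffer size, in tuples */

    static int fill[NBUFFERS];          /* tuples currently in each buffer */
    static int queue[NBUFFERS];         /* emptying queue: buffer indexes */
    static int qlen;

    /* Schedule a buffer for emptying as soon as it becomes half-full. */
    static void
    note_tuple_added(int buf)
    {
        if (++fill[buf] == BUFFER_CAP / 2)
            queue[qlen++] = buf;
    }

    /*
     * Drain the queue before processing the next incoming tuple.  Each
     * pass moves at most half a nominal buffer downward, so the buffers
     * below, themselves at most half-full, always have room for what
     * arrives.
     */
    static void
    drain_emptying_queue(void)
    {
        while (qlen > 0)
        {
            int buf = queue[--qlen];
            int moved = 0;

            while (fill[buf] > 0 && moved < BUFFER_CAP / 2)
            {
                fill[buf]--;    /* pop one tuple ... */
                moved++;
                /* ... and cascade it toward a lower buffer or leaf page,
                 * possibly triggering note_tuple_added() further down */
            }
        }
    }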
+If an internal page that has a buffer associated with it is split, the buffer
+needs to be split too. All tuples in the buffer are scanned through and
+relocated to the correct sibling buffers, using the penalty function to decide
+which buffer each tuple should go to.
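
A self-contained C sketch of that relocation, with a one-dimensional toy penalty standing in for the operator class's penalty function:

    typedef struct
    {
        double x;                 /* toy one-dimensional "key" */
    } Tuple;

    typedef struct
    {
        Tuple  items[256];
        int    n;
        double center;            /* toy stand-in for the sibling's union key */
    } SiblingBuffer;

    /* Toy penalty: how far the sibling's key must stretch for the tuple. */
    static double
    penalty(const SiblingBuffer *b, const Tuple *t)
    {
        double d = t->x - b->center;
        return d < 0 ? -d : d;
    }

    /* After the page split, rescan the old buffer and deal each tuple to
     * the sibling it penalizes least, mirroring the insert-time choice. */
    static void
    split_buffer(const Tuple *old_items, int nold,
                 SiblingBuffer *siblings, int nsiblings)
    {
        for (int i = 0; i < nold; i++)
        {
            int best = 0;

            for (int j = 1; j < nsiblings; j++)
                if (penalty(&siblings[j], &old_items[i]) <
                    penalty(&siblings[best], &old_items[i]))
                    best = j;
            siblings[best].items[siblings[best].n++] = old_items[i];
        }
    }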
+After all tuples from the heap have been processed, there are still some index
+tuples in the buffers. At this point, final buffer emptying starts. All buffers
+are emptied in top-down order. This is slightly complicated by the fact that
+new buffers can be allocated during the emptying, due to page splits. However,
+the new buffers will always be siblings of buffers that haven't been fully
+emptied yet; tuples never move upwards in the tree. The final emptying loops
+through buffers at a given level until all buffers at that level have been
+emptied, and then moves down to the next level.
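
Sketched as a toy C loop, with hypothetical bookkeeping arrays; empty_one() stands in for cascading a buffer's tuples downward, the step that can create new sibling buffers mid-pass:

    #include <stdbool.h>

    #define MAX_LEVELS   8
    #define MAX_PER_LVL 64

    static int nbuffers[MAX_LEVELS];           /* buffers present per level */
    static int fill[MAX_LEVELS][MAX_PER_LVL];  /* tuples left per buffer */

    /* Toy stand-in: the real emptying cascades tuples downward, and page
     * splits below can grow nbuffers[] on this same level. */
    static void
    empty_one(int level, int i)
    {
        fill[level][i] = 0;
    }

    static void
    final_emptying(int top_level, int level_step)
    {
        /* Top-down: leave a level only when every buffer on it is empty,
         * including siblings created by page splits during this pass. */
        for (int level = top_level; level > 0; level -= level_step)
        {
            bool dirty = true;

            while (dirty)
            {
                dirty = false;
                for (int i = 0; i < nbuffers[level]; i++)
                    if (fill[level][i] > 0)
                    {
                        empty_one(level, i);
                        dirty = true;
                    }
            }
        }
    }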
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
