Commit 5edb24a

Buffering GiST index build algorithm.

When building a GiST index that doesn't fit in cache, buffers are attached
to some internal nodes in the index. This speeds up the build by avoiding
random I/O that would otherwise be needed to traverse all the way down the
tree to find the right leaf page for each tuple.

Alexander Korotkov

1 parent 09b68c7 commit 5edb24a
File tree

11 files changed: +2297 −186 lines changed

doc/src/sgml/gist.sgml

Lines changed: 34 additions & 0 deletions

@@ -642,6 +642,40 @@ my_distance(PG_FUNCTION_ARGS)
 
 </variablelist>
 
+ <sect2 id="gist-buffering-build">
+  <title>GiST buffering build</title>
+  <para>
+   Building large GiST indexes by simply inserting all the tuples tends to be
+   slow, because if the index tuples are scattered across the index and the
+   index is large enough to not fit in cache, the insertions need to perform
+   a lot of random I/O. PostgreSQL from version 9.2 supports a more efficient
+   method to build GiST indexes based on buffering, which can dramatically
+   reduce the number of random I/Os needed for non-ordered data sets. For
+   well-ordered data sets the benefit is smaller or non-existent, because
+   only a small number of pages receive new tuples at a time, and those
+   pages fit in cache even if the index as a whole does not.
+  </para>
+
+  <para>
+   However, a buffering index build needs to call the <function>penalty</>
+   function more often, which consumes some extra CPU resources. Also, the
+   buffers used in the buffering build need temporary disk space, up to
+   the size of the resulting index. Buffering can also influence the quality
+   of the produced index, in both positive and negative directions. That
+   influence depends on various factors, like the distribution of the input
+   data and the operator class implementation.
+  </para>
+
+  <para>
+   By default, the index build switches to the buffering method when the
+   index size reaches <xref linkend="guc-effective-cache-size">. It can
+   be manually turned on or off by the <literal>BUFFERING</literal> parameter
+   to the CREATE INDEX clause. The default behavior is good for most cases,
+   but turning buffering off might speed up the build somewhat if the input
+   data is ordered.
+  </para>
+
+ </sect2>
 </sect1>
 
 <sect1 id="gist-examples">

doc/src/sgml/ref/create_index.sgml

Lines changed: 20 additions & 0 deletions

@@ -340,6 +340,26 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable class="parameter">name</
    </listitem>
   </varlistentry>
 
+  </variablelist>
+  <para>
+   GiST indexes additionally accept this parameter:
+  </para>
+
+  <variablelist>
+
+  <varlistentry>
+   <term><literal>BUFFERING</></term>
+   <listitem>
+    <para>
+     Determines whether the buffering build technique described in
+     <xref linkend="gist-buffering-build"> is used to build the index. With
+     <literal>OFF</> it is disabled, with <literal>ON</> it is enabled, and
+     with <literal>AUTO</> it is initially disabled, but turned on
+     on-the-fly once the index size reaches
+     <xref linkend="guc-effective-cache-size">. The default is
+     <literal>AUTO</>.
+    </para>
+   </listitem>
+  </varlistentry>
+
 </variablelist>
 </refsect2>

src/backend/access/common/reloptions.c

Lines changed: 11 additions & 0 deletions

@@ -219,6 +219,17 @@ static relopt_real realRelOpts[] =
 
 static relopt_string stringRelOpts[] =
 {
+	{
+		{
+			"buffering",
+			"Enables buffering build for this GiST index",
+			RELOPT_KIND_GIST
+		},
+		4,
+		false,
+		gistValidateBufferingOption,
+		"auto"
+	},
 	/* list terminator */
 	{{NULL}}
 };
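
The entry registers gistValidateBufferingOption as the option's validator and "auto" as its default. A standalone sketch of what such a validator has to check follows; the real backend reports the error through its own error machinery (ereport) rather than stderr, so treat this as an assumption-level illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Accept only the three values the BUFFERING reloption documents. */
    static void
    validate_buffering_option(const char *value)
    {
        if (value == NULL ||
            (strcmp(value, "on") != 0 &&
             strcmp(value, "off") != 0 &&
             strcmp(value, "auto") != 0))
        {
            fprintf(stderr, "invalid value for \"buffering\" option: "
                    "valid values are \"on\", \"off\", and \"auto\"\n");
            exit(EXIT_FAILURE);
        }
    }

With the option registered, the setting can be chosen at index creation time, e.g. CREATE INDEX ... USING gist (...) WITH (buffering = on).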

src/backend/access/gist/Makefile

Lines changed: 1 addition & 1 deletion

@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = gist.o gistutil.o gistxlog.o gistvacuum.o gistget.o gistscan.o \
-	gistproc.o gistsplit.o
+	gistproc.o gistsplit.o gistbuild.o gistbuildbuffers.o
 
 include $(top_srcdir)/src/backend/common.mk

src/backend/access/gist/README

Lines changed: 135 additions & 0 deletions

@@ -24,13 +24,20 @@ The current implementation of GiST supports:
 * provides NULL-safe interface to GiST core
 * Concurrency
 * Recovery support via WAL logging
+* Buffering build algorithm
 
 The support for concurrency implemented in PostgreSQL was developed based on
 the paper "Access Methods for Next-Generation Database Systems" by
 Marcel Kornacker:
 
 http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz
 
+The buffering build algorithm for GiST was developed based on the paper
+"Efficient Bulk Operations on Dynamic R-trees" by Lars Arge, Klaus Hinrichs,
+Jan Vahrenhold and Jeffrey Scott Vitter.
+
+http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.9894&rep=rep1&type=pdf
+
 The original algorithms were modified in several ways:
 
 * They had to be adapted to PostgreSQL conventions. For example, the SEARCH

@@ -278,6 +285,134 @@ would complicate the insertion algorithm. So when an insertion sees a page
 with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
 crashed in the middle to completion by adding the downlink in the parent.
 
+Buffering build algorithm
+-------------------------
+
+In the buffering index build algorithm, some or all internal nodes have a
+buffer attached to them. When a tuple is inserted at the top, the descent
+down the tree is stopped as soon as a buffer is reached, and the tuple is
+pushed into the buffer. When a buffer gets too full, all the tuples in it are
+flushed to the lower level, where they again hit lower-level buffers or leaf
+pages. This makes the insertions happen in more of a breadth-first than
+depth-first order, which greatly reduces the amount of random I/O required.
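
A toy C sketch of that stop-at-first-buffer descent (all names here are hypothetical; the real code picks the child page with the operator class's penalty function and works with on-disk pages and attached node buffers):

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct Node
    {
        bool         is_leaf;
        bool         has_buffer;   /* only levels that are multiples of
                                    * level_step carry buffers */
        struct Node *children[8];  /* toy fanout */
    } Node;

    /* Toy stand-ins for the real buffer and leaf operations. */
    static void buffer_append(Node *n, int tuple)
    { (void) n; printf("tuple %d parked in a buffer\n", tuple); }
    static void leaf_insert(Node *n, int tuple)
    { (void) n; printf("tuple %d placed on a leaf\n", tuple); }

    /* The real code chooses the child with the least penalty; take the
     * first child here just to keep the sketch self-contained. */
    static Node *choose_child(Node *n, int tuple)
    { (void) tuple; return n->children[0]; }

    /*
     * Descend from the root, but stop at the first buffered node: the
     * tuple waits there and is cascaded further down only when that
     * buffer is emptied in bulk, which is what turns the insertion
     * order breadth-first.
     */
    static void
    buffering_insert(Node *root, int tuple)
    {
        Node *node = root;

        while (!node->is_leaf)
        {
            if (node->has_buffer)
            {
                buffer_append(node, tuple);
                return;
            }
            node = choose_child(node, tuple);
        }
        leaf_insert(node, tuple);
    }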
+In the algorithm, levels are numbered so that leaf pages have level zero,
+and internal node levels count up from 1. This numbering ensures that a page's
+level number never changes, even when the root page is split.
+
+Level   Tree
+
+3                      *
+                  /         \
+2            *                    *
+           / | \                / | \
+1        *   *   *            *   *   *
+        / \ / \ / \          / \ / \ / \
+0      o  o o  o o  o      o  o o  o o  o
+
+* - internal page
+o - leaf page
+
+Internal pages that belong to certain levels have buffers associated with
+them. Leaf pages never have buffers. Which levels have buffers is controlled
+by the "level step" parameter: level numbers that are multiples of level_step
+have buffers, while others do not. For example, if level_step = 2, then
+pages on levels 2, 4, 6, ... have buffers. If level_step = 1 then every
+internal page has a buffer.
+
+Level   Tree (level_step = 1)             Tree (level_step = 2)
+
+3                  *                                 *
+              /         \                       /         \
+2         *(b)           *(b)               *(b)           *(b)
+         / | \          / | \              / | \          / | \
+1    *(b) *(b) *(b) *(b) *(b) *(b)       *    *    *    *    *    *
+     / \  / \  / \  / \  / \  / \       / \  / \  / \  / \  / \  / \
+0   o  o o  o o  o o  o o  o o  o      o  o o  o o  o o  o o  o o  o
+
+(b) - buffer
+
+Logically, a buffer is just a bunch of tuples. Physically, it is divided
+into pages, backed by a temporary file. Each buffer can be in one of two
+states:
+a) The last page of the buffer is kept in main memory. A node buffer is
+automatically switched to this state when a new index tuple is added to it,
+or a tuple is removed from it.
+b) All pages of the buffer are swapped out to disk. When a buffer becomes
+too full, and we start to flush it, all other buffers are switched to this
+state.
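
One way to picture those two states in C, with illustrative field names only (the committed implementation manages this in gistbuildbuffers.c through a temporary file):

    #include <stdlib.h>

    #define BUFFER_PAGE_SIZE 8192

    typedef struct BufferPage
    {
        char payload[BUFFER_PAGE_SIZE];  /* packed index tuples */
    } BufferPage;

    typedef struct NodeBuffer
    {
        int         npages;       /* pages of this buffer in the temp file */
        long        file_offset;  /* where those pages live on disk */
        BufferPage *last_page;    /* state (a): tail page held in memory;
                                   * state (b): NULL, fully swapped out */
    } NodeBuffer;

    /* Enter state (b): flush the in-memory tail page and release it. */
    static void
    swap_out(NodeBuffer *buf)
    {
        if (buf->last_page != NULL)
        {
            /* ... write *buf->last_page to the temp file at file_offset ... */
            free(buf->last_page);
            buf->last_page = NULL;
        }
    }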
+When an index tuple is inserted, its initial processing can end at one of
+the following places:
+1) A leaf page, if the depth of the index is <= level_step, meaning that
+   none of the internal pages have buffers associated with them.
+2) The buffer of the topmost level page that has buffers.
+
+New index tuples are processed until one of the buffers in the topmost
+buffered level becomes half-full. When a buffer becomes half-full, it's added
+to the emptying queue, and will be emptied before a new tuple is processed.
+
+Emptying a buffer means that its index tuples are moved into buffers at a
+lower level, or onto leaf pages. First, all the other buffers are swapped to
+disk to free up the memory. Then tuples are popped from the buffer one by
+one, and cascaded down the tree to the next buffer or leaf page below the
+buffered node.
+
+Emptying a buffer has the interesting dynamic property that any intermediate
+pages between the buffer being emptied, and the next buffered or leaf level
+below it, become cached. If there are no more buffers below the node, the
+leaf pages where the tuples finally land get cached too. If there are, the
+last buffer page of each buffer below is kept in memory. This is illustrated
+in the figures below:
+
+  Buffer being emptied to
+  lower-level buffers               Buffer being emptied to leaf pages
+
+          +(fb)                              +(fb)
+         /     \                            /     \
+        +       +                          +       +
+       / \     / \                        / \     / \
+  *(ab) *(ab) *(ab) *(ab)                x   x   x   x
+
+  +    - cached internal page
+  x    - cached leaf page
+  *    - non-cached internal page
+  (fb) - buffer being emptied
+  (ab) - buffers being appended to, with the last page in memory
+
+At the beginning of the index build, the level-step is chosen so that all the
+pages involved in emptying one buffer fit in cache, so after each of those
+pages has been accessed once and cached, emptying a buffer doesn't involve
+any more I/O. This locality is where the speedup of the buffering algorithm
+comes from.
+
+Emptying one buffer can fill up one or more of the lower-level buffers,
+triggering emptying of them as well. Whenever a buffer becomes too full, it's
+added to the emptying queue, and will be emptied after the current buffer has
+been processed.
+
+To keep the size of each buffer limited even in the worst case, buffer
+emptying is scheduled as soon as a buffer becomes half-full, and emptying it
+continues until half of the nominal buffer size worth of tuples has been
+emptied. This guarantees that when buffer emptying begins, all the
+lower-level buffers are at most half-full. In the worst case that all the
+tuples are cascaded down to the same lower-level buffer, that buffer
+therefore has enough space to accommodate all the tuples emptied from the
+upper-level buffer. There is no hard size limit in any of the data structures
+used, though, so this only needs to be approximate; small overfilling of some
+buffers doesn't matter.
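
The half-full scheduling rule condenses into a toy C model; buffers are reduced to tuple counters and the cascading step is left as a comment, so every name below is illustrative:

    #define NBUFFERS   4
    #define BUFFER_CAP 100              /* nominal buffer size, in tuples */

    static int fill[NBUFFERS];          /* tuples currently in each buffer */
    static int queue[NBUFFERS];         /* emptying queue: buffer indexes */
    static int qlen;

    /* Schedule a buffer for emptying as soon as it becomes half-full. */
    static void
    note_tuple_added(int buf)
    {
        if (++fill[buf] == BUFFER_CAP / 2)
            queue[qlen++] = buf;
    }

    /*
     * Drain the queue before processing the next incoming tuple.  Each
     * pass moves at most half a nominal buffer downward, so the buffers
     * below, themselves at most half-full, always have room for what
     * arrives.
     */
    static void
    drain_emptying_queue(void)
    {
        while (qlen > 0)
        {
            int buf = queue[--qlen];
            int moved = 0;

            while (fill[buf] > 0 && moved < BUFFER_CAP / 2)
            {
                fill[buf]--;    /* pop one tuple ... */
                moved++;
                /* ... and cascade it toward a lower buffer or leaf page,
                 * possibly triggering note_tuple_added() further down */
            }
        }
    }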
+If an internal page that has a buffer associated with it is split, the buffer
+needs to be split too. All tuples in the buffer are scanned through and
+relocated to the correct sibling buffers, using the penalty function to decide
+which buffer each tuple should go to.
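
A self-contained C sketch of that relocation, with a one-dimensional toy penalty standing in for the operator class's penalty function:

    typedef struct
    {
        double x;                 /* toy one-dimensional "key" */
    } Tuple;

    typedef struct
    {
        Tuple  items[256];
        int    n;
        double center;            /* toy stand-in for the sibling's union key */
    } SiblingBuffer;

    /* Toy penalty: how far the sibling's key must stretch for the tuple. */
    static double
    penalty(const SiblingBuffer *b, const Tuple *t)
    {
        double d = t->x - b->center;
        return d < 0 ? -d : d;
    }

    /* After the page split, rescan the old buffer and deal each tuple to
     * the sibling it penalizes least, mirroring the insert-time choice. */
    static void
    split_buffer(const Tuple *old_items, int nold,
                 SiblingBuffer *siblings, int nsiblings)
    {
        for (int i = 0; i < nold; i++)
        {
            int best = 0;

            for (int j = 1; j < nsiblings; j++)
                if (penalty(&siblings[j], &old_items[i]) <
                    penalty(&siblings[best], &old_items[i]))
                    best = j;
            siblings[best].items[siblings[best].n++] = old_items[i];
        }
    }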
+After all tuples from the heap have been processed, there are still some index
+tuples in the buffers. At this point, final buffer emptying starts. All buffers
+are emptied in top-down order. This is slightly complicated by the fact that
+new buffers can be allocated during the emptying, due to page splits. However,
+the new buffers will always be siblings of buffers that haven't been fully
+emptied yet; tuples never move upwards in the tree. The final emptying loops
+through buffers at a given level until all buffers at that level have been
+emptied, and then moves down to the next level.
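
Sketched as a toy C loop, with hypothetical bookkeeping arrays; empty_one() stands in for cascading a buffer's tuples downward, the step that can create new sibling buffers mid-pass:

    #include <stdbool.h>

    #define MAX_LEVELS   8
    #define MAX_PER_LVL 64

    static int nbuffers[MAX_LEVELS];           /* buffers present per level */
    static int fill[MAX_LEVELS][MAX_PER_LVL];  /* tuples left per buffer */

    /* Toy stand-in: the real emptying cascades tuples downward, and page
     * splits below can grow nbuffers[] on this same level. */
    static void
    empty_one(int level, int i)
    {
        fill[level][i] = 0;
    }

    static void
    final_emptying(int top_level, int level_step)
    {
        /* Top-down: leave a level only when every buffer on it is empty,
         * including siblings created by page splits during this pass. */
        for (int level = top_level; level > 0; level -= level_step)
        {
            bool dirty = true;

            while (dirty)
            {
                dirty = false;
                for (int i = 0; i < nbuffers[level]; i++)
                    if (fill[level][i] > 0)
                    {
                        empty_one(level, i);
                        dirty = true;
                    }
            }
        }
    }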
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
