
Commit 36a35c5

Compress GIN posting lists, for smaller index size.

GIN posting lists are now encoded using varbyte-encoding, which allows them to fit in much smaller space than the straight ItemPointer array format used before. The new encoding is used for both the lists stored in-line in entry tree items, and in posting tree leaf pages.

To maintain backwards-compatibility and keep pg_upgrade working, the code can still read old-style pages and tuples. Posting tree leaf pages in the new format are flagged with GIN_COMPRESSED flag, to distinguish old and new format pages. Likewise, entry tree tuples in the new format have a GIN_ITUP_COMPRESSED flag set in a bit that was previously unused.

This patch bumps GIN_CURRENT_VERSION from 1 to 2. New indexes created with version 9.4 will therefore have version number 2 in the metapage, while old pg_upgraded indexes will have version 1. The code treats them the same, but it might come in handy in the future, if we want to drop support for the uncompressed format.

Alexander Korotkov and me. Reviewed by Tomas Vondra and Amit Langote.

1 parent 243ee26, commit 36a35c5

File tree

13 files changed, +2309 -718 lines changed

contrib/pgstattuple/expected
- pgstattuple.out
src
- backend/access
  - gin
  - rmgrdesc
    - gindesc.c
- include/access
  - gin_private.h


`‎contrib/pgstattuple/expected/pgstattuple.out‎`

Lines changed: 1 addition & 1 deletion

Original file line number	Diff line number	Diff line change
`@@ -123,6 +123,6 @@ create index test_ginidx on test using gin (b);`
`123`	`123`	`select * from pgstatginindex('test_ginidx');`
`124`	`124`	`version \| pending_pages \| pending_tuples`
`125`	`125`	`---------+---------------+----------------`
`126`		`-1 \| 0 \| 0`
	`126`	`+2 \| 0 \| 0`
`127`	`127`	`(1 row)`
`128`	`128`

`‎src/backend/access/gin/README‎`

Lines changed: 114 additions & 9 deletions

Original file line number	Diff line number	Diff line change
`@@ -135,15 +135,15 @@ same category of null entry are merged into one index entry just as happens`
`135`	`135`	`with ordinary key entries.`
`136`	`136`
`137`	`137`	`* In a key entry at the btree leaf level, at the next SHORTALIGN boundary,`
`138`		`-there is an array of zero or more ItemPointers, which store the heap tuple`
`139`		`-TIDs for which the indexable items contain this key. This is called the`
`140`		`-"posting list". The TIDs in a posting list must appear in sorted order.`
`141`		`-If the list would be too big for the index tuple to fit on an index page,`
`142`		`-the ItemPointers are pushed out to a separate posting page or pages, and`
`143`		`-none appear in the key entry itself. The separate pages are called a`
`144`		`-"posting tree"; they are organized as a btree of ItemPointer values.`
`145`		`-Note that in either case, the ItemPointers associated with a key can`
`146`		`-easily be read out in sorted order; this is relied on by the scan`
	`138`	`+there is a list of item pointers, in compressed format (see Posting List`
	`139`	`+Compression section), pointing to the heap tuples for which the indexable`
	`140`	`+items contain this key. This is called the "posting list".`
	`141`	`+`
	`142`	`+If the list would be too big for the index tuple to fit on an index page, the`
	`143`	`+ItemPointers are pushed out to a separate posting page or pages, and none`
	`144`	`+appear in the key entry itself. The separate pages are called a "posting`
	`145`	`+tree" (see below); Note that in either case, the ItemPointers associated with`
	`146`	`+a key can easily be read out in sorted order; this is relied on by the scan`
`147`	`147`	`algorithms.`
`148`	`148`
`149`	`149`	`* The index tuple header fields of a leaf key entry are abused as follows:`
`@@ -163,6 +163,11 @@ algorithms.`
`163`	`163`
`164`	`164`	`* The posting list can be accessed with GinGetPosting(itup)`
`165`	`165`
	`166`	`+* If GinITupIsCompressed(itup), the posting list is stored in compressed`
	`167`	`+ format. Otherwise it is just an array of ItemPointers. New tuples are always`
	`168`	`+ stored in compressed format, uncompressed items can be present if the`
	`169`	`+ database was migrated from 9.3 or earlier version.`
	`170`	`+`
`166`	`171`	`2) Posting tree case:`
`167`	`172`
`168`	`173`	`* ItemPointerGetBlockNumber(&itup->t_tid) contains the index block number`
`@@ -210,6 +215,76 @@ fit on one pending-list page must have those pages to itself, even if this`
`210`	`215`	`results in wasting much of the space on the preceding page and the last`
`211`	`216`	`page for the tuple.)`
`212`	`217`
	`218`	`+Posting tree`
	`219`	`+------------`
	`220`	`+`
	`221`	`+If a posting list is too large to store in-line in a key entry, a posting tree`
	`222`	`+is created. A posting tree is a B-tree structure, where the ItemPointer is`
	`223`	`+used as the key.`
	`224`	`+`
	`225`	`+Internal posting tree pages use the standard PageHeader and the same "opaque"`
	`226`	`+struct as other GIN page, but do not contain regular index tuples. Instead,`
	`227`	`+the contents of the page is an array of PostingItem structs. Each PostingItem`
	`228`	`+consists of the block number of the child page, and the right bound of that`
	`229`	`+child page, as an ItemPointer. The right bound of the page is stored right`
	`230`	`+after the page header, before the PostingItem array.`
	`231`	`+`
	`232`	`+Posting tree leaf pages also use the standard PageHeader and opaque struct,`
	`233`	`+and the right bound of the page is stored right after the page header,`
	`234`	`+but the page content comprises of 0-32 compressed posting lists, and an`
	`235`	`+additional array of regular uncompressed item pointers. The compressed posting`
	`236`	`+lists are stored one after each other, between page header and pd_lower. The`
	`237`	`+uncompressed array is stored between pd_upper and pd_special. The space`
	`238`	`+between pd_lower and pd_upper is unused, which allows full-page images of`
	`239`	`+posting tree leaf pages to skip the unused space in middle (buffer_std = true`
	`240`	`+in XLogRecData). For historical reasons, this does not apply to internal`
	`241`	`+pages, or uncompressed leaf pages migrated from earlier versions.`
	`242`	`+`
	`243`	`+The item pointers are stored in a number of independent compressed posting`
	`244`	`+lists (also called segments), instead of one big one, to make random access`
	`245`	`+to a given item pointer faster: to find an item in a compressed list, you`
	`246`	`+have to read the list from the beginning, but when the items are split into`
	`247`	`+multiple lists, you can first skip over to the list containing the item you're`
	`248`	`+looking for, and read only that segment. Also, an update only needs to`
	`249`	`+re-encode the affected segment.`
	`250`	`+`
	`251`	`+The uncompressed items array is used for insertions, to avoid re-encoding`
	`252`	`+a compressed list on every update. If there is room on a page, an insertion`
	`253`	`+simply inserts the new item to the right place in the uncompressed array.`
	`254`	`+When a page becomes full, it is rewritten, merging all the uncompressed items`
	`255`	`+are into the compressed lists. When reading, the uncompressed array and the`
	`256`	`+compressed lists are read in tandem, and merged into one stream of sorted`
	`257`	`+item pointers.`
	`258`	`+`
	`259`	`+Posting List Compression`
	`260`	`+------------------------`
	`261`	`+`
	`262`	`+To fit as many item pointers on a page as possible, posting tree leaf pages`
	`263`	`+and posting lists stored inline in entry tree leaf tuples use a lightweight`
	`264`	`+form of compression. We take advantage of the fact that the item pointers`
	`265`	`+are stored in sorted order. Instead of storing the block and offset number of`
	`266`	`+each item pointer separately, we store the difference from the previous item.`
	`267`	`+That in itself doesn't do much, but it allows us to use so-called varbyte`
	`268`	`+encoding to compress them.`
	`269`	`+`
	`270`	`+Varbyte encoding is a method to encode integers, allowing smaller numbers to`
	`271`	`+take less space at the cost of larger numbers. Each integer is represented by`
	`272`	`+variable number of bytes. High bit of each byte in varbyte encoding determines`
	`273`	`+whether the next byte is still part of this number. Therefore, to read a single`
	`274`	`+varbyte encoded number, you have to read bytes until you find a byte with the`
	`275`	`+high bit not set.`
	`276`	`+`
	`277`	`+When encoding, the block and offset number forming the item pointer are`
	`278`	`+combined into a single integer. The offset number is stored in the 11 low`
	`279`	`+bits (see MaxHeapTuplesPerPageBits in ginpostinglist.c), and the block number`
	`280`	`+is stored in the higher bits. That requires 43 bits in total, which`
	`281`	`+conveniently fits in at most 6 bytes.`
	`282`	`+`
	`283`	`+A compressed posting list is passed around and stored on disk in a`
	`284`	`+PackedPostingList struct. The first item in the list is stored uncompressed`
	`285`	`+as a regular ItemPointerData, followed by the length of the list in bytes,`
	`286`	`+followed by the packed items.`
	`287`	`+`
`213`	`288`	`Concurrency`
`214`	`289`	`-----------`
`215`	`290`
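
The README text added in this hunk describes three steps: pack block and offset into one integer, store the difference from the previous item, and varbyte-encode that difference. The following is a minimal sketch of those steps, not the code from ginpostinglist.c. All `demo_*` names and the `DemoItemPointer` type are hypothetical stand-ins, and the first item is passed separately rather than through a PackedPostingList struct. The sketch uses a generic 7-bits-per-byte varbyte loop, so a full 43-bit value would take 7 bytes here (6 x 7 = 42 bits); it does not attempt to reproduce the 6-byte bound stated in the README.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 11 offset bits, as stated in the README (MaxHeapTuplesPerPageBits). */
#define DEMO_OFFSET_BITS 11

/* Hypothetical stand-in for an item pointer: block number plus offset number. */
typedef struct { uint32_t block; uint16_t offset; } DemoItemPointer;

/* Combine block and offset into one integer, offset in the low 11 bits. */
static uint64_t demo_itemptr_to_uint64(DemoItemPointer ip)
{
    assert(ip.offset < (1u << DEMO_OFFSET_BITS));
    return ((uint64_t) ip.block << DEMO_OFFSET_BITS) | ip.offset;
}

static DemoItemPointer demo_uint64_to_itemptr(uint64_t v)
{
    DemoItemPointer ip;

    ip.offset = (uint16_t) (v & ((1u << DEMO_OFFSET_BITS) - 1));
    ip.block = (uint32_t) (v >> DEMO_OFFSET_BITS);
    return ip;
}

/* Varbyte: 7 data bits per byte, high bit set means another byte follows. */
static size_t demo_encode_varbyte(uint64_t val, unsigned char *out)
{
    size_t n = 0;

    while (val > 0x7F)
    {
        out[n++] = (unsigned char) (0x80 | (val & 0x7F));
        val >>= 7;
    }
    out[n++] = (unsigned char) val;
    return n;
}

/* Read bytes until one has the high bit clear; returns bytes consumed. */
static size_t demo_decode_varbyte(const unsigned char *in, uint64_t *val)
{
    size_t n = 0;
    int shift = 0;
    uint64_t v = 0;
    unsigned char c;

    do
    {
        assert(shift < 64);
        c = in[n++];
        v |= (uint64_t) (c & 0x7F) << shift;
        shift += 7;
    } while (c & 0x80);
    *val = v;
    return n;
}

/* Encode items[1..] as deltas from the previous item; items must be sorted. */
static size_t demo_encode_list(const DemoItemPointer *items, size_t nitems,
                               unsigned char *out)
{
    size_t used = 0;
    uint64_t prev;

    assert(nitems > 0);
    prev = demo_itemptr_to_uint64(items[0]);
    for (size_t i = 1; i < nitems; i++)
    {
        uint64_t cur = demo_itemptr_to_uint64(items[i]);

        assert(cur > prev);
        used += demo_encode_varbyte(cur - prev, out + used);
        prev = cur;
    }
    return used;
}

/* Rebuild the items from the uncompressed first item and the packed deltas. */
static size_t demo_decode_list(DemoItemPointer first, const unsigned char *in,
                               size_t nbytes, DemoItemPointer *out)
{
    size_t n = 0, pos = 0;
    uint64_t prev = demo_itemptr_to_uint64(first);

    out[n++] = first;
    while (pos < nbytes)
    {
        uint64_t delta;

        pos += demo_decode_varbyte(in + pos, &delta);
        prev += delta;
        out[n++] = demo_uint64_to_itemptr(prev);
    }
    return n;
}
```

In this sketch, the three items (10,1), (10,5) and (12,3) need 3 bytes for the two deltas, since neighbouring item pointers usually differ by a small amount. Decoding has to walk the bytes from the start, which matches the README's reason for splitting a leaf page into several independent segments.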
`@@ -260,6 +335,36 @@ page-deletions safe; it stamps the deleted pages with an XID and keeps the`
`260`	`335`	`deleted pages around with the right-link intact until all concurrent scans`
`261`	`336`	`have finished.)`
`262`	`337`
	`338`	`+Compatibility`
	`339`	`+-------------`
	`340`	`+`
	`341`	`+Compression of TIDs was introduced in 9.4. Some GIN indexes could remain in`
	`342`	`+uncompressed format because of pg_upgrade from 9.3 or earlier versions.`
	`343`	`+For compatibility, old uncompressed format is also supported. Following`
	`344`	`+rules are used to handle it:`
	`345`	`+`
	`346`	`+* GIN_ITUP_COMPRESSED flag marks index tuples that contain a posting list.`
	`347`	`+This flag is stored in high bit of ItemPointerGetBlockNumber(&itup->t_tid).`
	`348`	`+Use GinItupIsCompressed(itup) to check the flag.`
	`349`	`+`
	`350`	`+* Posting tree pages in the new format are marked with the GIN_COMPRESSED flag.`
	`351`	`+ Macros GinPageIsCompressed(page) and GinPageSetCompressed(page) are used to`
	`352`	`+ check and set this flag.`
	`353`	`+`
	`354`	`+* All scan operations check format of posting list add use corresponding code`
	`355`	`+to read its content.`
	`356`	`+`
	`357`	`+* When updating an index tuple containing an uncompressed posting list, it`
	`358`	`+will be replaced with new index tuple containing a compressed list.`
	`359`	`+`
	`360`	`+* When updating an uncompressed posting tree leaf page, it's compressed.`
	`361`	`+`
	`362`	`+* If vacuum finds some dead TIDs in uncompressed posting lists, they are`
	`363`	`+converted into compressed posting lists. This assumes that the compressed`
	`364`	`+posting list fits in the space occupied by the uncompressed list. IOW, we`
	`365`	`+assume that the compressed version of the page, with the dead items removed,`
	`366`	`+takes less space than the old uncompressed version.`
	`367`	`+`
`263`	`368`	`Limitations`
`264`	`369`	`-----------`
`265`	`370`
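
The compatibility rules above hinge on a one-bit format marker that readers test before choosing a decoding path. As a rough sketch of that pattern only: the code below keeps a flag in the high bit of a 32-bit field and dispatches on it. The `Demo*` types, the `DEMO_ITUP_COMPRESSED` constant and its bit position are assumptions made for illustration, not the definitions from gin_private.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the flag is the top bit of a 32-bit block number field. */
#define DEMO_ITUP_COMPRESSED (UINT32_C(1) << 31)

/* Hypothetical stand-in for the block number part of itup->t_tid. */
typedef struct { uint32_t tid_block; } DemoIndexTuple;

typedef enum { DEMO_READ_UNCOMPRESSED, DEMO_READ_COMPRESSED } DemoReadPath;

static bool demo_itup_is_compressed(const DemoIndexTuple *itup)
{
    return (itup->tid_block & DEMO_ITUP_COMPRESSED) != 0;
}

/* New tuples are written in the compressed format, so they get the flag. */
static void demo_itup_set_compressed(DemoIndexTuple *itup)
{
    assert(itup != NULL);
    itup->tid_block |= DEMO_ITUP_COMPRESSED;
}

/* The low 31 bits keep carrying whatever the field held before. */
static uint32_t demo_itup_payload(const DemoIndexTuple *itup)
{
    return itup->tid_block & ~DEMO_ITUP_COMPRESSED;
}

/* A scan checks the format first and then picks the matching reader. */
static DemoReadPath demo_choose_reader(const DemoIndexTuple *itup)
{
    return demo_itup_is_compressed(itup) ? DEMO_READ_COMPRESSED
                                         : DEMO_READ_UNCOMPRESSED;
}
```

Reusing a previously unused bit means old tuples read as "not compressed" without being rewritten, which fits the pg_upgrade requirement stated in the commit message.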

`‎src/backend/access/gin/ginbtree.c‎`

Lines changed: 40 additions & 33 deletions

Original file line number	Diff line number	Diff line change
`@@ -325,9 +325,10 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,`
`325`	`325`	`{`
`326`	`326`	`Page page = BufferGetPage(stack->buffer);`
`327`	`327`	`XLogRecData *payloadrdata;`
`328`		`-bool fit;`
	`328`	`+GinPlaceToPageRC rc;`
`329`	`329`	`uint16 xlflags = 0;`
`330`	`330`	`Page childpage = NULL;`
	`331`	`+Page newlpage = NULL, newrpage = NULL;`
`331`	`332`
`332`	`333`	`if (GinPageIsData(page))`
`333`	`334`	`xlflags \|= GIN_INSERT_ISDATA;`
`@@ -345,16 +346,17 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,`
`345`	`346`	`}`
`346`	`347`
`347`	`348`	`/*`
`348`		`- * Try to put the incoming tuple on the page. If it doesn't fit,`
`349`		`- * placeToPage method will return false and leave the page unmodified, and`
`350`		`- * we'll have to split the page.`
	`349`	`+ * Try to put the incoming tuple on the page. placeToPage will decide`
	`350`	`+ * if the page needs to be split.`
`351`	`351`	`*/`
`352`		`-START_CRIT_SECTION();`
`353`		`-fit = btree->placeToPage(btree, stack->buffer, stack->off,`
`354`		`-insertdata, updateblkno,`
`355`		`-&payloadrdata);`
`356`		`-if (fit)`
	`352`	`+rc = btree->placeToPage(btree, stack->buffer, stack,`
	`353`	`+insertdata, updateblkno,`
	`354`	`+&payloadrdata, &newlpage, &newrpage);`
	`355`	`+if (rc == UNMODIFIED)`
	`356`	`+return true;`
	`357`	`+else if (rc == INSERTED)`
`357`	`358`	`{`
	`359`	`+/* placeToPage did START_CRIT_SECTION() */`
`358`	`360`	`MarkBufferDirty(stack->buffer);`
`359`	`361`
`360`	`362`	`/* An insert to an internal page finishes the split of the child. */`
`@@ -373,7 +375,6 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,`
`373`	`375`
`374`	`376`	`xlrec.node = btree->index->rd_node;`
`375`	`377`	`xlrec.blkno = BufferGetBlockNumber(stack->buffer);`
`376`		`-xlrec.offset = stack->off;`
`377`	`378`	`xlrec.flags = xlflags;`
`378`	`379`
`379`	`380`	`rdata[0].buffer = InvalidBuffer;`
`@@ -415,20 +416,16 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,`
`415`	`416`
`416`	`417`	`return true;`
`417`	`418`	`}`
`418`		`-else`
	`419`	`+else if (rc == SPLIT)`
`419`	`420`	`{`
`420`	`421`	`/* Didn't fit, have to split */`
`421`	`422`	`Buffer rbuffer;`
`422`		`-Page newlpage;`
`423`	`423`	`BlockNumber savedRightLink;`
`424`		`-Page rpage;`
`425`	`424`	`XLogRecData rdata[2];`
`426`	`425`	`ginxlogSplit data;`
`427`	`426`	`Buffer lbuffer = InvalidBuffer;`
`428`	`427`	`Page newrootpg = NULL;`
`429`	`428`
`430`		`-END_CRIT_SECTION();`
`431`		`-`
`432`	`429`	`rbuffer = GinNewBuffer(btree->index);`
`433`	`430`
`434`	`431`	`/* During index build, count the new page */`
`@@ -443,12 +440,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,`
`443`	`440`	`savedRightLink = GinPageGetOpaque(page)->rightlink;`
`444`	`441`
`445`	`442`	`/*`
`446`		`- * newlpage is a pointer to memory page, it is not associated with a`
`447`		`- * buffer. stack->buffer is not touched yet.`
	`443`	`+ * newlpage and newrpage are pointers to memory pages, not associated`
	`444`	`+ * with buffers. stack->buffer is not touched yet.`
`448`	`445`	`*/`
`449`		`-newlpage = btree->splitPage(btree, stack->buffer, rbuffer, stack->off,`
`450`		`-insertdata, updateblkno,`
`451`		`-&payloadrdata);`
`452`	`446`
`453`	`447`	`data.node = btree->index->rd_node;`
`454`	`448`	`data.rblkno = BufferGetBlockNumber(rbuffer);`
`@@ -481,8 +475,6 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,`
`481`	`475`	`else`
`482`	`476`	`rdata[0].next = payloadrdata;`
`483`	`477`
`484`		`-rpage = BufferGetPage(rbuffer);`
`485`		`-`
`486`	`478`	`if (stack->parent == NULL)`
`487`	`479`	`{`
`488`	`480`	`/*`
`@@ -508,7 +500,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,`
`508`	`500`	`data.lblkno = BufferGetBlockNumber(lbuffer);`
`509`	`501`	`data.flags \|= GIN_SPLIT_ROOT;`
`510`	`502`
`511`		`-GinPageGetOpaque(rpage)->rightlink = InvalidBlockNumber;`
	`503`	`+GinPageGetOpaque(newrpage)->rightlink = InvalidBlockNumber;`
`512`	`504`	`GinPageGetOpaque(newlpage)->rightlink = BufferGetBlockNumber(rbuffer);`
`513`	`505`
`514`	`506`	`/*`
`@@ -517,20 +509,20 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,`
`517`	`509`	`* than overwriting the original page directly, so that we can still`
`518`	`510`	`* abort gracefully if this fails.)`
`519`	`511`	`*/`
`520`		`-newrootpg = PageGetTempPage(rpage);`
`521`		`-GinInitPage(newrootpg, GinPageGetOpaque(newlpage)->flags & ~GIN_LEAF, BLCKSZ);`
	`512`	`+newrootpg = PageGetTempPage(newrpage);`
	`513`	`+GinInitPage(newrootpg, GinPageGetOpaque(newlpage)->flags & ~(GIN_LEAF \| GIN_COMPRESSED), BLCKSZ);`
`522`	`514`
`523`	`515`	`btree->fillRoot(btree, newrootpg,`
`524`	`516`	`BufferGetBlockNumber(lbuffer), newlpage,`
`525`		`-BufferGetBlockNumber(rbuffer), rpage);`
	`517`	`+BufferGetBlockNumber(rbuffer), newrpage);`
`526`	`518`	`}`
`527`	`519`	`else`
`528`	`520`	`{`
`529`	`521`	`/* split non-root page */`
`530`	`522`	`data.rrlink = savedRightLink;`
`531`	`523`	`data.lblkno = BufferGetBlockNumber(stack->buffer);`
`532`	`524`
`533`		`-GinPageGetOpaque(rpage)->rightlink = savedRightLink;`
	`525`	`+GinPageGetOpaque(newrpage)->rightlink = savedRightLink;`
`534`	`526`	`GinPageGetOpaque(newlpage)->flags \|= GIN_INCOMPLETE_SPLIT;`
`535`	`527`	`GinPageGetOpaque(newlpage)->rightlink = BufferGetBlockNumber(rbuffer);`
`536`	`528`	`}`
`@@ -550,16 +542,24 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,`
`550`	`542`	`START_CRIT_SECTION();`
`551`	`543`
`552`	`544`	`MarkBufferDirty(rbuffer);`
	`545`	`+MarkBufferDirty(stack->buffer);`
`553`	`546`
	`547`	`+/*`
	`548`	`+ * Restore the temporary copies over the real buffers. But don't free`
	`549`	`+ * the temporary copies yet, WAL record data points to them.`
	`550`	`+ */`
`554`	`551`	`if (stack->parent == NULL)`
`555`	`552`	`{`
`556`		`-PageRestoreTempPage(newlpage, BufferGetPage(lbuffer));`
`557`	`553`	`MarkBufferDirty(lbuffer);`
`558`		`-newlpage = newrootpg;`
	`554`	`+memcpy(BufferGetPage(stack->buffer), newrootpg, BLCKSZ);`
	`555`	`+memcpy(BufferGetPage(lbuffer), newlpage, BLCKSZ);`
	`556`	`+memcpy(BufferGetPage(rbuffer), newrpage, BLCKSZ);`
	`557`	`+}`
	`558`	`+else`
	`559`	`+{`
	`560`	`+memcpy(BufferGetPage(stack->buffer), newlpage, BLCKSZ);`
	`561`	`+memcpy(BufferGetPage(rbuffer), newrpage, BLCKSZ);`
`559`	`562`	`}`
`560`		`-`
`561`		`-PageRestoreTempPage(newlpage, BufferGetPage(stack->buffer));`
`562`		`-MarkBufferDirty(stack->buffer);`
`563`	`563`
`564`	`564`	`/* write WAL record */`
`565`	`565`	`if (RelationNeedsWAL(btree->index))`
`@@ -568,7 +568,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,`
`568`	`568`
`569`	`569`	`recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_SPLIT, rdata);`
`570`	`570`	`PageSetLSN(BufferGetPage(stack->buffer), recptr);`
`571`		`-PageSetLSN(rpage, recptr);`
	`571`	`+PageSetLSN(BufferGetPage(rbuffer), recptr);`
`572`	`572`	`if (stack->parent == NULL)`
`573`	`573`	`PageSetLSN(BufferGetPage(lbuffer), recptr);`
`574`	`574`	`}`
`@@ -582,6 +582,11 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,`
`582`	`582`	`if (stack->parent == NULL)`
`583`	`583`	`UnlockReleaseBuffer(lbuffer);`
`584`	`584`
	`585`	`+pfree(newlpage);`
	`586`	`+pfree(newrpage);`
	`587`	`+if (newrootpg)`
	`588`	`+pfree(newrootpg);`
	`589`	`+`
`585`	`590`	`/*`
`586`	`591`	`* If we split the root, we're done. Otherwise the split is not`
`587`	`592`	`* complete until the downlink for the new page has been inserted to`
`@@ -592,6 +597,8 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,`
`592`	`597`	`else`
`593`	`598`	`return false;`
`594`	`599`	`}`
	`600`	`+else`
	`601`	`+elog(ERROR, "unknown return code from GIN placeToPage method: %d", rc);`
`595`	`602`	`}`
`596`	`603`
`597`	`604`	`/*`
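
The ginbtree.c changes above replace the boolean "did it fit" result with a three-way return code, so the callback itself decides between doing nothing, inserting, and splitting. The control flow can be sketched in isolation as below. This is a toy model: the `Demo*` names, the slot-counting callback and the integer results are hypothetical, and the condition that yields the unmodified case is an arbitrary choice made for the example, since the diff shown here does not say when the real method returns UNMODIFIED.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the three outcomes named in the diff. */
typedef enum { DEMO_UNMODIFIED, DEMO_INSERTED, DEMO_SPLIT } DemoPlaceToPageRC;

/* Toy callback signature: decides the outcome from free space and item count. */
typedef DemoPlaceToPageRC (*DemoPlaceFn) (int free_slots, int nitems);

static DemoPlaceToPageRC demo_place(int free_slots, int nitems)
{
    if (nitems == 0)
        return DEMO_UNMODIFIED; /* assumption: nothing to add */
    if (nitems <= free_slots)
        return DEMO_INSERTED;
    return DEMO_SPLIT;
}

/*
 * Returns 1 when the insertion is complete, 0 when a downlink still has to
 * be inserted into the parent, and -1 for an unknown return code (the real
 * code raises an error at that point).
 */
static int demo_place_to_page(DemoPlaceFn place, int free_slots, int nitems,
                              bool is_root)
{
    DemoPlaceToPageRC rc = place(free_slots, nitems);

    if (rc == DEMO_UNMODIFIED)
        return 1;               /* page untouched, nothing more to do */
    else if (rc == DEMO_INSERTED)
        return 1;               /* fit on the page */
    else if (rc == DEMO_SPLIT)
        return is_root ? 1 : 0; /* a root split is self-contained */
    else
        return -1;
}
```

Letting the callback return the already-built left and right pages, as the new signature in the diff does, removes the separate splitPage call from the caller.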
