@@ -15,23 +15,23 @@ is similar to GiST and differs from btree indices, which have predefined,
1515comparison-based operations.
1616
1717An inverted index is an index structure storing a set of (key, posting list)
18- pairs, where 'posting list' is a set ofdocuments in which the key occurs.
18+ pairs, where 'posting list' is a set ofheap rows in which the key occurs.
1919(A text document would usually contain many keys.) The primary goal of
2020Gin indices is support for highly scalable, full-text search in PostgreSQL.
2121
22- Gin consists of a B-tree index constructed overentries (ET, entries tree) ,
23- where eachentry is an element ofthe indexedvalue (element of array, lexeme
24- for tsvector) and where each tuple in a leaf pageis either a pointer to a
25- B-tree over item pointers (PT, posting tree), or a list of item pointers
26- (PL, posting list) if thetuple is small enough.
22+ A Ginindex consists of a B-tree index constructed overkey values ,
23+ where eachkey is an element ofsome indexeditems (element of array, lexeme
24+ for tsvector) and where each tuple in a leaf pagecontains either a pointer to
25+ a B-tree over item pointers (posting tree), or a simple list of item pointers
26+ (posting list) if thelist is small enough.
2727
28- Note: There is no delete operation for ET. The reason for this is that in
29- our experience, the set of distinct words in a large corpus changes very
30- rarely. This greatly simplifies the code and concurrency algorithms.
28+ Note: There is no delete operation in the key (entry) tree. The reason for
29+ this is that in our experience, the set of distinct words in a large corpus
30+ changes very slowly. This greatly simplifies the code and concurrency
31+ algorithms.
3132
32- Gin comes with built-in support for one-dimensional arrays (eg. integer[],
33- text[]), but no support for NULL elements. The following operations are
34- available:
33+ Core PostgreSQL includes built-in Gin support for one-dimensional arrays
34+ (eg. integer[], text[]). The following operations are available:
3535
3636 * contains: value_array @> query_array
3737 * overlaps: value_array && query_array
@@ -71,7 +71,7 @@ limit).
7171If a non-zero search limit is set, then the returned set is a subset of the
7272whole result set, chosen at random.
7373
74- "Soft" means that the actual number of returned results couldslightly differ
74+ "Soft" means that the actual number of returned results could differ
7575from the specified limit, depending on the query and the quality of the
7676system's random number generator.
7777
@@ -80,40 +80,156 @@ From experience, a value of 'gin_fuzzy_search_limit' in the thousands
8080have no effect for queries returning a result set with less tuples than this
8181number.
8282
83- Limitations
84- -----------
85-
86- * No support for multicolumn indices
87- * Gin doesn't uses scan->kill_prior_tuple & scan->ignore_killed_tuples
88- * Gin searches entries only by equality matching. This may be improved in
89- future.
90- * Gin doesn't support full scans of indices.
91- * Gin doesn't index NULL values.
83+ Index structure
84+ ---------------
9285
93- Open Items
94- ----------
86+ The "items" that a GIN index indexes are composite values that contain
87+ zero or more "keys". For example, an item might be an integer array, and
88+ then the keys would be the individual integer values. The index actually
89+ stores and searches for the key values, not the items per se. In the
90+ pg_opclass entry for a GIN opclass, the opcintype is the data type of the
91+ items, and the opckeytype is the data type of the keys. GIN is optimized
92+ for cases where items contain many keys and the same key values appear
93+ in many different items.
94+
95+ A GIN index contains a metapage, a btree of key entries, and possibly
96+ "posting tree" pages, which hold the overflow when a key entry acquires
97+ too many heap tuple pointers to fit in a btree page. Additionally, if the
98+ fast-update feature is enabled, there can be "list pages" holding "pending"
99+ key entries that haven't yet been merged into the main btree. The list
100+ pages have to be scanned linearly when doing a search, so the pending
101+ entries should be merged into the main btree before there get to be too
102+ many of them. The advantage of the pending list is that bulk insertion of
103+ a few thousand entries can be much faster than retail insertion. (The win
104+ comes mainly from not having to do multiple searches/insertions when the
105+ same key appears in multiple new heap tuples.)
106+
107+ Key entries are nominally of the same IndexEntry format as used in other
108+ index types, but since a leaf key entry typically refers to multiple heap
109+ tuples, there are significant differences. (See GinFormTuple, which works
110+ by building a "normal" index tuple and then modifying it.) The points to
111+ know are:
112+
113+ * In a single-column index, a key tuple just contains the key datum, but
114+ in a multi-column index, a key tuple contains the pair (column number,
115+ key datum) where the column number is stored as an int2. This is needed
116+ to support different key data types in different columns. This much of
117+ the tuple is built by index_form_tuple according to the usual rules.
118+ The column number (if present) can never be null, but the key datum can
119+ be, in which case a null bitmap is present as usual. (As usual for index
120+ tuples, the size of the null bitmap is fixed at INDEX_MAX_KEYS.)
121+
122+ * If the key datum is null (ie, IndexTupleHasNulls() is true), then
123+ just after the nominal index data (ie, at offset IndexInfoFindDataOffset
124+ or IndexInfoFindDataOffset + sizeof(int2)) there is a byte indicating
125+ the "category" of the null entry. These are the possible categories:
126+ 1 = ordinary null key value extracted from an indexable item
127+ 2 = placeholder for zero-key indexable item
128+ 3 = placeholder for null indexable item
129+ Placeholder null entries are inserted into the index because otherwise
130+ there would be no index entry at all for an empty or null indexable item,
131+ which would mean that full index scans couldn't be done and various corner
132+ cases would give wrong answers. The different categories of null entries
133+ are treated as distinct keys by the btree, but heap itempointers for the
134+ same category of null entry are merged into one index entry just as happens
135+ with ordinary key entries.
136+
137+ * In a key entry at the btree leaf level, at the next SHORTALIGN boundary,
138+ there is an array of zero or more ItemPointers, which store the heap tuple
139+ TIDs for which the indexable items contain this key. This is called the
140+ "posting list". The TIDs in a posting list must appear in sorted order.
141+ If the list would be too big for the index tuple to fit on an index page,
142+ the ItemPointers are pushed out to a separate posting page or pages, and
143+ none appear in the key entry itself. The separate pages are called a
144+ "posting tree"; they are organized as a btree of ItemPointer values.
145+ Note that in either case, the ItemPointers associated with a key can
146+ easily be read out in sorted order; this is relied on by the scan
147+ algorithms.
148+
149+ * The index tuple header fields of a leaf key entry are abused as follows:
150+
151+ 1) Posting list case:
152+
153+ * ItemPointerGetBlockNumber(&itup->t_tid) contains the offset from index
154+ tuple start to the posting list.
155+ Access macros: GinGetPostingOffset(itup) / GinSetPostingOffset(itup,n)
156+
157+ * ItemPointerGetOffsetNumber(&itup->t_tid) contains the number of elements
158+ in the posting list (number of heap itempointers).
159+ Access macros: GinGetNPosting(itup) / GinSetNPosting(itup,n)
160+
161+ * If IndexTupleHasNulls(itup) is true, the null category byte can be
162+ accessed/set with GinGetNullCategory(itup,gs) / GinSetNullCategory(itup,gs,c)
163+
164+ * The posting list can be accessed with GinGetPosting(itup)
165+
166+ 2) Posting tree case:
167+
168+ * ItemPointerGetBlockNumber(&itup->t_tid) contains the index block number
169+ of the root of the posting tree.
170+ Access macros: GinGetPostingTree(itup) / GinSetPostingTree(itup, blkno)
171+
172+ * ItemPointerGetOffsetNumber(&itup->t_tid) contains the magic number
173+ GIN_TREE_POSTING, which distinguishes this from the posting-list case
174+ (it's large enough that that many heap itempointers couldn't possibly
175+ fit on an index page). This value is inserted automatically by the
176+ GinSetPostingTree macro.
177+
178+ * If IndexTupleHasNulls(itup) is true, the null category byte can be
179+ accessed/set with GinGetNullCategory(itup) / GinSetNullCategory(itup,c)
180+
181+ * The posting list is not present and must not be accessed.
182+
183+ Use the macro GinIsPostingTree(itup) to determine which case applies.
184+
185+ In both cases, itup->t_info & INDEX_SIZE_MASK contains actual total size of
186+ tuple, and the INDEX_VAR_MASK and INDEX_NULL_MASK bits have their normal
187+ meanings as set by index_form_tuple.
188+
189+ Index tuples in non-leaf levels of the btree contain the optional column
190+ number, key datum, and null category byte as above. They do not contain
191+ a posting list. ItemPointerGetBlockNumber(&itup->t_tid) is the downlink
192+ to the next lower btree level, and ItemPointerGetOffsetNumber(&itup->t_tid)
193+ is InvalidOffsetNumber. Use the access macros GinGetDownlink/GinSetDownlink
194+ to get/set the downlink.
195+
196+ Index entries that appear in "pending list" pages work a tad differently as
197+ well. The optional column number, key datum, and null category byte are as
198+ for other GIN index entries. However, there is always exactly one heap
199+ itempointer associated with a pending entry, and it is stored in the t_tid
200+ header field just as in non-GIN indexes. There is no posting list.
201+ Furthermore, the code that searches the pending list assumes that all
202+ entries for a given heap tuple appear consecutively in the pending list and
203+ are sorted by the column-number-plus-key-datum. The GIN_LIST_FULLROW page
204+ flag bit tells whether entries for a given heap tuple are spread across
205+ multiple pending-list pages. If GIN_LIST_FULLROW is set, the page contains
206+ all the entries for one or more heap tuples. If GIN_LIST_FULLROW is clear,
207+ the page contains entries for only one heap tuple, *and* they are not all
208+ the entries for that tuple. (Thus, a heap tuple whose entries do not all
209+ fit on one pending-list page must have those pages to itself, even if this
210+ results in wasting much of the space on the preceding page and the last
211+ page for the tuple.)
95212
96- We appreciate any comments, help and suggestions.
213+ Limitations
214+ -----------
97215
98- *Teach optimizer/executor that GIN is intrinsically clustered. i.e., it
99- always returns ItemPointer in ascending order.
100- * Tweak gincostestimate .
216+ *Gin doesn't use scan->kill_prior_tuple & scan->ignore_killed_tuples
217+ * Gin searches entries only by equality matching, or simple range
218+ matching using the "partial match" feature .
101219
102220TODO
103221----
104222
105223Nearest future:
106224
107- * Opclasses forall types (no programming, just many catalog changes).
225+ * Opclasses formore types (no programming, just many catalog changes)
108226
109227Distant future:
110228
111229 * Replace B-tree of entries to something like GiST
112- * Add multicolumn support
113- * Optimize insert operations (background index insertion)
114230
115231Authors
116232-------
117233
118- All work was done by Teodor Sigaev (teodor@sigaev.ru) and Oleg Bartunov
234+ Original work was done by Teodor Sigaev (teodor@sigaev.ru) and Oleg Bartunov
119235(oleg@sai.msu.su).