|
1 | | -<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.6 2006/11/30 20:50:44 petere Exp $ --> |
| 1 | +<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.7 2006/12/01 23:46:46 tgl Exp $ --> |
2 | 2 |
|
3 | 3 | <chapter id="GIN"> |
4 | 4 | <title>GIN Indexes</title> |
|
14 | 14 | <para> |
15 | 15 | <acronym>GIN</acronym> stands for Generalized Inverted Index. It is |
16 | 16 | an index structure storing a set of (key, posting list) pairs, where |
17 | | - 'posting list' is a set of rows in which the key occurs. Each |
18 | | - row may contain many keys. |
| 17 | + a <quote>posting list</> is a set of rows in which the key occurs. Each |
| 18 | + indexed value may contain many keys, so the same row ID may appear in |
| 19 | + multiple posting lists. |
19 | 20 | </para> |
20 | 21 |
|
21 | 22 | <para> |
|
45 | 46 |
|
46 | 47 | <para> |
47 | 48 | The <acronym>GIN</acronym> interface has a high level of abstraction, |
48 | | - requiring the access method implementertoonly implement the semantics of |
| 49 | + requiring the access method implementer only to implement the semantics of |
49 | 50 | the data type being accessed. The <acronym>GIN</acronym> layer itself |
50 | 51 | takes care of concurrency, logging and searching the tree structure. |
51 | 52 | </para> |
52 | 53 |
|
53 | 54 | <para> |
54 | 55 | All it takes to get a <acronym>GIN</acronym> access method working |
55 | 56 | is to implement four user-defined methods, which define the behavior of |
56 | | - keys in the tree. In short, <acronym>GIN</acronym> combines extensibility |
57 | | - along with generality, code reuse, and a clean interface. |
58 | | - </para> |
59 | | - |
60 | | -</sect1> |
61 | | - |
62 | | -<sect1 id="gin-implementation"> |
63 | | - <title>Implementation</title> |
64 | | - |
65 | | - <para> |
66 | | - Internally, <acronym>GIN</acronym> consists of a B-tree index constructed |
67 | | - over keys, where each key is an element of the indexed value |
68 | | - (element of array, for example) and where each tuple in a leaf page is |
69 | | - either a pointer to a B-tree over heap pointers (PT, posting tree), or a |
70 | | - list of heap pointers (PL, posting list) if the tuple is small enough. |
| 57 | + keys in the tree and the relationships between keys, indexed values, |
| 58 | + and indexable queries. In short, <acronym>GIN</acronym> combines |
| 59 | + extensibility with generality, code reuse, and a clean interface. |
71 | 60 | </para> |
72 | 61 |
|
73 | 62 | <para> |
74 | | -There are four methods that an index operator class for |
75 | | - <acronym>GIN</acronym> must provide(prototypesare in pseudocode): |
| 63 | +The four methods that an index operator class for |
| 64 | + <acronym>GIN</acronym> must provide are: |
76 | 65 | </para> |
77 | 66 |
|
78 | 67 | <variablelist> |
79 | 68 | <varlistentry> |
80 | 69 | <term>int compare(Datum a, Datum b)</term> |
81 | 70 | <listitem> |
82 | 71 | <para> |
83 | | - Compares keys (not indexed values!) and returns an integer less than |
84 | | - zero, zero, or greater than zero, indicating whether the first key is |
85 | | - less than, equal to, or greater than the second. |
| 72 | +Compares keys (not indexed values!) and returns an integer less than |
| 73 | +zero, zero, or greater than zero, indicating whether the first key is |
| 74 | + less than, equal to, or greater than the second. |
86 | 75 | </para> |
87 | 76 | </listitem> |
88 | 77 | </varlistentry> |
|
91 | 80 | <term>Datum* extractValue(Datum inputValue, uint32 *nkeys)</term> |
92 | 81 | <listitem> |
93 | 82 | <para> |
94 | | - Returns an array of keysofvalue to be indexed, nkeys should |
95 | | -contain thenumber of returned keys. |
| 83 | +Returns an array of keysgiven avalue to be indexed. The |
| 84 | +number of returned keys must be stored into <literal>*nkeys</>. |
96 | 85 | </para> |
97 | 86 | </listitem> |
98 | 87 | </varlistentry> |
99 | 88 |
|
100 | 89 | <varlistentry> |
101 | | - <term>Datum* extractQuery(Datum query, uint32 nkeys, |
102 | | -StrategyNumber n)</term> |
| 90 | + <term>Datum* extractQuery(Datum query, uint32*nkeys, |
| 91 | +StrategyNumber n)</term> |
103 | 92 | <listitem> |
104 | 93 | <para> |
105 | | - Returns an array of keys of the query to be executed. n contains the |
106 | | - strategy number of the operation (see <xref |
107 | | - linkend="xindex-strategies">). Depending on n, query may be |
108 | | - different type. |
| 94 | + Returns an array of keys given a value to be queried; that is, |
| 95 | + <literal>query</> is the value on the right-hand side of an |
| 96 | + indexable operator whose left-hand side is the indexed column. |
| 97 | + <literal>n</> is the strategy number of the operator within the |
| 98 | + operator class (see <xref linkend="xindex-strategies">). |
| 99 | + Often, <function>extractQuery</> will need |
| 100 | + to consult <literal>n</> to determine the data type of |
| 101 | + <literal>query</> and the key values that need to be extracted. |
| 102 | + The number of returned keys must be stored into <literal>*nkeys</>. |
109 | 103 | </para> |
110 | 104 | </listitem> |
111 | 105 | </varlistentry> |
|
114 | 108 | <term>bool consistent(bool check[], StrategyNumber n, Datum query)</term> |
115 | 109 | <listitem> |
116 | 110 | <para> |
117 | | - Returns TRUE if the indexed value satisfies the query qualifier with |
118 | | - strategy n (or may satisfy in case of RECHECK mark in operator class). |
119 | | - Each element of the check array is TRUE if the indexed value has a |
120 | | - corresponding key in the query: if (check[i] == TRUE) the i-th key of |
121 | | - the query is present in the indexed value. |
| 111 | + Returns TRUE if the indexed value satisfies the query operator with |
| 112 | + strategy number <literal>n</> (or may satisfy, if the operator is |
| 113 | + marked RECHECK in the operator class). The <literal>check</> array has |
| 114 | + the same length as the number of keys previously returned by |
| 115 | + <function>extractQuery</> for this query. Each element of the |
| 116 | + <literal>check</> array is TRUE if the indexed value contains the |
| 117 | + corresponding query key, ie, if (check[i] == TRUE) the i-th key of the |
| 118 | + <function>extractQuery</> result array is present in the indexed value. |
| 119 | + The original <literal>query</> datum (not the extracted key array!) is |
| 120 | + passed in case the <function>consistent</> method needs to consult it. |
122 | 121 | </para> |
123 | 122 | </listitem> |
124 | 123 | </varlistentry> |
|
127 | 126 |
|
128 | 127 | </sect1> |
129 | 128 |
|
| 129 | +<sect1 id="gin-implementation"> |
| 130 | + <title>Implementation</title> |
| 131 | + |
| 132 | + <para> |
| 133 | + Internally, a <acronym>GIN</acronym> index contains a B-tree index |
| 134 | + constructed over keys, where each key is an element of the indexed value |
| 135 | + (a member of an array, for example) and where each tuple in a leaf page is |
| 136 | + either a pointer to a B-tree over heap pointers (PT, posting tree), or a |
| 137 | + list of heap pointers (PL, posting list) if the list is small enough. |
| 138 | + </para> |
| 139 | + |
| 140 | +</sect1> |
| 141 | + |
130 | 142 | <sect1 id="gin-tips"> |
131 | 143 | <title>GIN tips and tricks</title> |
132 | 144 |
|
133 | 145 | <variablelist> |
134 | 146 | <varlistentry> |
135 | 147 | <term>Create vs insert</term> |
136 | 148 | <listitem> |
137 | | -<para> |
138 | | - In most cases, insertion into a <acronym>GIN</acronym> index is slow |
139 | | - due to the likelihood of many keys being inserted for each value. |
140 | | -So, for bulk insertions into a table it is advisable totodrop the GIN |
141 | | - index and recreate it after finishing bulk insertion. |
142 | | -</para> |
| 149 | +<para> |
| 150 | + In most cases, insertion into a <acronym>GIN</acronym> index is slow |
| 151 | + due to the likelihood of many keys being inserted for each value. |
| 152 | +So, for bulk insertions into a table it is advisable to drop the GIN |
| 153 | + index and recreate it after finishing bulk insertion. |
| 154 | +</para> |
143 | 155 | </listitem> |
144 | 156 | </varlistentry> |
145 | 157 |
|
146 | 158 | <varlistentry> |
147 | | - <term>gin_fuzzy_search_limit</term> |
| 159 | + <term><xref linkend="guc-gin-fuzzy-search-limit"></term> |
148 | 160 | <listitem> |
149 | | -<para> |
150 | | - The primary goal of developing <acronym>GIN</acronym> indices was |
151 | | - support for highly scalable, full-text search in |
152 | | - <productname>PostgreSQL</productname> and there are often situations when |
153 | | - a full-text search returns a very large set of results. Since reading |
154 | | - tuples from the disk and sorting them could take a lot of time, this is |
155 | | - unacceptable for production. (Note that the index search itself is very |
156 | | - fast.) |
| 161 | + <para> |
| 162 | + The primary goal of developing <acronym>GIN</acronym> indexes was |
| 163 | + to create support for highly scalable, full-text search in |
| 164 | + <productname>PostgreSQL</productname>, and there are often situations when |
| 165 | + a full-text search returns a very large set of results. Moreover, this |
| 166 | + often happens when the query contains very frequent words, so that the |
| 167 | + large result set is not even useful. Since reading many |
| 168 | + tuples from the disk and sorting them could take a lot of time, this is |
| 169 | + unacceptable for production. (Note that the index search itself is very |
| 170 | + fast.) |
| 171 | + </para> |
| 172 | + <para> |
| 173 | + To facilitate controlled execution of such queries |
| 174 | + <acronym>GIN</acronym> has a configurable soft upper limit on the size |
| 175 | + of the returned set, the |
| 176 | + <varname>gin_fuzzy_search_limit</varname> configuration parameter. |
| 177 | + It is set to 0 (meaning no limit) by default. |
| 178 | + If a non-zero limit is set, then the returned set is a subset of |
| 179 | + the whole result set, chosen at random. |
| 180 | + </para> |
| 181 | + <para> |
| 182 | + <quote>Soft</quote> means that the actual number of returned results |
| 183 | + could differ slightly from the specified limit, depending on the query |
| 184 | + and the quality of the system's random number generator. |
157 | 185 | </para> |
158 | | -<para> |
159 | | - Such queries usually contain very frequent words, so the results are not |
160 | | - very helpful. To facilitate execution of such queries |
161 | | - <acronym>GIN</acronym> has a configurable soft upper limit of the size |
162 | | - of the returned set, determined by the |
163 | | - <varname>gin_fuzzy_search_limit</varname> GUC variable. It is set to 0 by |
164 | | - default (no limit). |
165 | | -</para> |
166 | | -<para> |
167 | | - If a non-zero search limit is set, then the returned set is a subset of |
168 | | - the whole result set, chosen at random. |
169 | | -</para> |
170 | | -<para> |
171 | | - <quote>Soft</quote> means that the actual number of returned results |
172 | | - could slightly differ from the specified limit, depending on the query |
173 | | - and the quality of the system's random number generator. |
174 | | -</para> |
175 | 186 | </listitem> |
176 | 187 | </varlistentry> |
177 | 188 | </variablelist> |
|
182 | 193 | <title>Limitations</title> |
183 | 194 |
|
184 | 195 | <para> |
185 | | - <acronym>GIN</acronym> doesn't support full index scans due to their |
186 | | -extreme inefficiency: because there areoften many keys per value, |
187 | | -each heap pointer will be returned several times. |
| 196 | + <acronym>GIN</acronym> doesn't support full index scans: because there are |
| 197 | + often many keys per value, each heap pointer would be returned many times, |
| 198 | +and there is no easy way to prevent this. |
188 | 199 | </para> |
189 | 200 |
|
190 | 201 | <para> |
191 | 202 | When <function>extractQuery</function> returns zero keys, |
192 | | - <acronym>GIN</acronym> will emit an error: for different opclasses and |
193 | | - strategies the semantic meaning of a void query may be different (for |
194 | | - example, any array contains the void array, but they don't overlap the |
195 | | - void array), and <acronym>GIN</acronym> can't suggest a reasonable answer. |
| 203 | + <acronym>GIN</acronym> will emit an error. Depending on the operator, |
| 204 | + a void query might match all, some, or none of the indexed values (for |
| 205 | + example, every array contains the empty array, but does not overlap the |
| 206 | + empty array), and <acronym>GIN</acronym> can't determine the correct |
| 207 | + answer, nor produce a full-index-scan result if it could determine that |
| 208 | + that was correct. |
196 | 209 | </para> |
197 | 210 |
|
198 | 211 | <para> |
199 | | - <acronym>GIN</acronym> searches keys only by equality matching. This may |
| 212 | + It is not an error for <function>extractValue</> to return zero keys, |
| 213 | + but in this case the indexed value will be unrepresented in the index. |
| 214 | + This is another reason why full index scan is not useful — it would |
| 215 | + miss such rows. |
| 216 | + </para> |
| 217 | + |
| 218 | + <para> |
| 219 | + <acronym>GIN</acronym> searches keys only by equality matching. This may |
200 | 220 | be improved in future. |
201 | 221 | </para> |
202 | 222 | </sect1> |
|
206 | 226 |
|
207 | 227 | <para> |
208 | 228 | The <productname>PostgreSQL</productname> source distribution includes |
209 | | - <acronym>GIN</acronym> classes for one-dimensional arrays of all internal |
| 229 | + <acronym>GIN</acronym> classes for one-dimensional arrays of all internal |
210 | 230 | types. The following |
211 | 231 | <filename>contrib</> modules also contain <acronym>GIN</acronym> |
212 | | - operator classes: |
| 232 | + operator classes: |
213 | 233 | </para> |
214 | | -
|
| 234 | + |
215 | 235 | <variablelist> |
216 | 236 | <varlistentry> |
217 | 237 | <term>intarray</term> |
|