Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commite6dbcb7

Browse files
committed
Extend GIN to support partial-match searches, and extend tsquery to support
prefix matching using this facility.Teodor Sigaev and Oleg Bartunov
1 parente1bdd07 commite6dbcb7

32 files changed

+1279
-503
lines changed

‎doc/src/sgml/datatype.sgml

Lines changed: 20 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.226 2008/03/30 04:08:14 neilc Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.227 2008/05/16 16:31:01 tgl Exp $ -->
22

33
<chapter id="datatype">
44
<title id="datatype-title">Data Types</title>
@@ -3298,18 +3298,17 @@ SELECT * FROM test;
32983298
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
32993299
tsvector
33003300
----------------------------------------------------
3301-
'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'
3301+
'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
33023302
</programlisting>
33033303

3304-
(As the example shows, the sorting is first by length and then
3305-
alphabetically, but that detail is seldom important.) To represent
3304+
To represent
33063305
lexemes containing whitespace or punctuation, surround them with quotes:
33073306

33083307
<programlisting>
33093308
SELECT $$the lexeme ' ' contains spaces$$::tsvector;
33103309
tsvector
33113310
-------------------------------------------
3312-
'the' ' ' 'lexeme' 'spaces' 'contains'
3311+
'' 'contains' 'lexeme' 'spaces' 'the'
33133312
</programlisting>
33143313

33153314
(We use dollar-quoted string literals in this example and the next one,
@@ -3320,7 +3319,7 @@ SELECT $$the lexeme ' ' contains spaces$$::tsvector;
33203319
SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
33213320
tsvector
33223321
------------------------------------------------
3323-
'a' 'the' 'Joe''s' 'quote' 'lexeme' 'contains'
3322+
'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'
33243323
</programlisting>
33253324

33263325
Optionally, integer <firstterm>position(s)</>
@@ -3330,7 +3329,7 @@ SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
33303329
SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
33313330
tsvector
33323331
-------------------------------------------------------------------------------
3333-
'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
3332+
'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
33343333
</programlisting>
33353334

33363335
A position normally indicates the source word's location in the
@@ -3369,7 +3368,7 @@ SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
33693368
select 'The Fat Rats'::tsvector;
33703369
tsvector
33713370
--------------------
3372-
'Fat' 'The' 'Rats'
3371+
'Fat' 'Rats' 'The'
33733372
</programlisting>
33743373

33753374
For most English-text-searching applications the above words would
@@ -3439,6 +3438,19 @@ SELECT 'fat:ab &amp; cat'::tsquery;
34393438
</programlisting>
34403439
</para>
34413440

3441+
<para>
3442+
Also, lexemes in a <type>tsquery</type> can be labeled with <literal>*</>
3443+
to specify prefix matching:
3444+
<programlisting>
3445+
SELECT 'super:*'::tsquery;
3446+
tsquery
3447+
-----------
3448+
'super':*
3449+
</programlisting>
3450+
This query will match any word in a <type>tsvector</> that begins
3451+
with <quote>super</>.
3452+
</para>
3453+
34423454
<para>
34433455
Quoting rules for lexemes are the same as described above for
34443456
lexemes in <type>tsvector</>; and, as with <type>tsvector</>,

‎doc/src/sgml/gin.sgml

Lines changed: 95 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.14 2008/04/14 17:05:32 tgl Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.15 2008/05/16 16:31:01 tgl Exp $ -->
22

33
<chapter id="GIN">
44
<title>GIN Indexes</title>
@@ -52,15 +52,15 @@
5252
</para>
5353

5454
<para>
55-
All it takes to get a <acronym>GIN</acronym> access method working
56-
is toimplement four user-defined methods, which define the behavior of
55+
All it takes to get a <acronym>GIN</acronym> access method working is to
56+
implement four (or five) user-defined methods, which define the behavior of
5757
keys in the tree and the relationships between keys, indexed values,
5858
and indexable queries. In short, <acronym>GIN</acronym> combines
5959
extensibility with generality, code reuse, and a clean interface.
6060
</para>
6161

6262
<para>
63-
The four methods that anindexoperator class for
63+
The four methods that an operator class for
6464
<acronym>GIN</acronym> must provide are:
6565
</para>
6666

@@ -77,7 +77,7 @@
7777
</varlistentry>
7878

7979
<varlistentry>
80-
<term>Datum*extractValue(Datum inputValue, int32 *nkeys)</term>
80+
<term>Datum *extractValue(Datum inputValue, int32 *nkeys)</term>
8181
<listitem>
8282
<para>
8383
Returns an array of keys given a value to be indexed. The
@@ -87,8 +87,8 @@
8787
</varlistentry>
8888

8989
<varlistentry>
90-
<term>Datum*extractQuery(Datum query, int32 *nkeys,
91-
StrategyNumber n)</term>
90+
<term>Datum *extractQuery(Datum query, int32 *nkeys,
91+
StrategyNumber n, bool **pmatch)</term>
9292
<listitem>
9393
<para>
9494
Returns an array of keys given a value to be queried; that is,
@@ -100,13 +100,22 @@
100100
to consult <literal>n</> to determine the data type of
101101
<literal>query</> and the key values that need to be extracted.
102102
The number of returned keys must be stored into <literal>*nkeys</>.
103-
If number of keys is equal to zero then <function>extractQuery</>
104-
should store 0 or -1 into <literal>*nkeys</>. 0 means that any
105-
row matches the <literal>query</> and sequence scan should be
106-
produced. -1 means nothing can satisfy <literal>query</>.
107-
Choice of value should be based on semantics meaning of operation with
108-
given strategy number.
103+
If the query contains no keys then <function>extractQuery</>
104+
should store 0 or -1 into <literal>*nkeys</>, depending on the
105+
semantics of the operator. 0 means that every
106+
value matches the <literal>query</> and a sequential scan should be
107+
produced. -1 means nothing can match the <literal>query</>.
108+
<literal>pmatch</> is an output argument for use when partial match
109+
is supported. To use it, <function>extractQuery</> must allocate
110+
an array of <literal>*nkeys</> booleans and store its address at
111+
<literal>*pmatch</>. Each element of the array should be set to TRUE
112+
if the corresponding key requires partial match, FALSE if not.
113+
If <literal>*pmatch</> is set to NULL then GIN assumes partial match
114+
is not required. The variable is initialized to NULL before call,
115+
so this argument can simply be ignored by operator classes that do
116+
not support partial match.
109117
</para>
118+
110119
</listitem>
111120
</varlistentry>
112121

@@ -133,6 +142,39 @@
133142

134143
</variablelist>
135144

145+
<para>
146+
Optionally, an operator class for
147+
<acronym>GIN</acronym> can supply a fifth method:
148+
</para>
149+
150+
<variablelist>
151+
152+
<varlistentry>
153+
<term>int comparePartial(Datum partial_key, Datum key, StrategyNumber n)</term>
154+
<listitem>
155+
<para>
156+
Compare a partial-match query to an index key. Returns an integer
157+
whose sign indicates the result: less than zero means the index key
158+
does not match the query, but the index scan should continue; zero
159+
means that the index key does match the query; greater than zero
160+
indicates that the index scan should stop because no more matches
161+
are possible. The strategy number <literal>n</> of the operator
162+
that generated the partial match query is provided, in case its
163+
semantics are needed to determine when to end the scan.
164+
</para>
165+
</listitem>
166+
</varlistentry>
167+
168+
</variablelist>
169+
170+
<para>
171+
To support <quote>partial match</> queries, an operator class must
172+
provide the <function>comparePartial</> method, and its
173+
<function>extractQuery</> method must set the <literal>pmatch</>
174+
parameter when a partial-match query is encountered. See
175+
<xref linkend="gin-partial-match"> for details.
176+
</para>
177+
136178
</sect1>
137179

138180
<sect1 id="gin-implementation">
@@ -146,6 +188,33 @@
146188
list of heap pointers (PL, posting list) if the list is small enough.
147189
</para>
148190

191+
<sect2 id="gin-partial-match">
192+
<title>Partial match algorithm</title>
193+
194+
<para>
195+
GIN can support <quote>partial match</> queries, in which the query
196+
does not determine an exact match for one or more keys, but the possible
197+
matches fall within a reasonably narrow range of key values (within the
198+
key sorting order determined by the <function>compare</> support method).
199+
The <function>extractQuery</> method, instead of returning a key value
200+
to be matched exactly, returns a key value that is the lower bound of
201+
the range to be searched, and sets the <literal>pmatch</> flag true.
202+
The key range is then searched using the <function>comparePartial</>
203+
method. <function>comparePartial</> must return zero for an actual
204+
match, less than zero for a non-match that is still within the range
205+
to be searched, or greater than zero if the index key is past the range
206+
that could match.
207+
</para>
208+
209+
<para>
210+
During a partial-match scan, all <literal>itemPointer</>s for matching keys
211+
are OR'ed into a <literal>TIDBitmap</>.
212+
The scan fails if the <literal>TIDBitmap</> becomes lossy.
213+
In this case an error message will be reported with advice
214+
to increase <literal>work_mem</>.
215+
</para>
216+
</sect2>
217+
149218
</sect1>
150219

151220
<sect1 id="gin-tips">
@@ -236,8 +305,14 @@
236305
</para>
237306

238307
<para>
239-
<acronym>GIN</acronym> searches keys only by equality matching. This might
240-
be improved in future.
308+
It is possible for an operator class to circumvent the restriction against
309+
full index scan. To do that, <function>extractValue</> must return at least
310+
one (possibly dummy) key for every indexed value, and
311+
<function>extractQuery</function> must convert an unrestricted search into
312+
a partial-match query that will scan the whole index. This is inefficient
313+
but might be necessary to avoid corner-case failures with operators such
314+
as LIKE. Note however that failure could still occur if the intermediate
315+
<literal>TIDBitmap</> becomes lossy.
241316
</para>
242317
</sect1>
243318

@@ -247,9 +322,11 @@
247322
<para>
248323
The <productname>PostgreSQL</productname> source distribution includes
249324
<acronym>GIN</acronym> operator classes for <type>tsvector</> and
250-
for one-dimensional arrays of all internal types. The following
251-
<filename>contrib</> modules also contain <acronym>GIN</acronym>
252-
operator classes:
325+
for one-dimensional arrays of all internal types. Prefix searching in
326+
<type>tsvector</> is implemented using the <acronym>GIN</> partial match
327+
feature.
328+
The following <filename>contrib</> modules also contain
329+
<acronym>GIN</acronym> operator classes:
253330
</para>
254331

255332
<variablelist>

‎doc/src/sgml/textsearch.sgml

Lines changed: 17 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.43 2008/04/14 17:05:32 tgl Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.44 2008/05/16 16:31:01 tgl Exp $ -->
22

33
<chapter id="textsearch">
44
<title id="textsearch-title">Full Text Search</title>
@@ -754,6 +754,20 @@ SELECT to_tsquery('english', 'Fat | Rats:AB');
754754
'fat' | 'rat':AB
755755
</programlisting>
756756

757+
Also, <literal>*</> can be attached to a lexeme to specify prefix matching:
758+
759+
<programlisting>
760+
SELECT to_tsquery('supern:*A &amp; star:A*B');
761+
to_tsquery
762+
--------------------------
763+
'supern':*A &amp; 'star':*AB
764+
</programlisting>
765+
766+
Such a lexeme will match any word in a <type>tsvector</> that begins
767+
with the given string.
768+
</para>
769+
770+
<para>
757771
<function>to_tsquery</function> can also accept single-quoted
758772
phrases. This is primarily useful when the configuration includes a
759773
thesaurus dictionary that may trigger on such phrases.
@@ -798,7 +812,8 @@ SELECT to_tsquery('''supernovae stars'' &amp; !crab');
798812
</programlisting>
799813

800814
Note that <function>plainto_tsquery</> cannot
801-
recognize either Boolean operators or weight labels in its input:
815+
recognize Boolean operators, weight labels, or prefix-match labels
816+
in its input:
802817

803818
<programlisting>
804819
SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');

‎doc/src/sgml/xindex.sgml

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.62 2008/04/14 17:05:32 tgl Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.63 2008/05/16 16:31:01 tgl Exp $ -->
22

33
<sect1 id="xindex">
44
<title>Interfacing Extensions To Indexes</title>
@@ -444,6 +444,13 @@
444444
<entry>consistent - determine whether value matches query condition</entry>
445445
<entry>4</entry>
446446
</row>
447+
<row>
448+
<entry>comparePartial - (optional method) compare partial key from
449+
query and key from index, and return an integer less than zero, zero,
450+
or greater than zero, indicating whether GIN should ignore this index
451+
entry, treat the entry as a match, or stop the index scan</entry>
452+
<entry>5</entry>
453+
</row>
447454
</tbody>
448455
</tgroup>
449456
</table>

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp