- Notifications
You must be signed in to change notification settings - Fork57
RUM access method - inverted index with additional information in posting lists
License
postgrespro/rum
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Therum module provides an access method to work with aRUM
index. It is basedon theGIN
access method's code.
AGIN
index allows performing fast full-text search usingtsvector
andtsquery
types. But full-text search with a GIN index has several problems:
- Slow ranking. It needs positional information about lexemes to do ranking. A
GIN
index doesn't store positions of lexemes. So after index scanning, we need anadditional heap scan to retrieve lexeme positions. - Slow phrase search with a
GIN
index. This problem relates to the previousproblem. It needs positional information to perform phrase search. - Slow ordering by timestamp. A
GIN
index can't store some related informationin the index with lexemes. So it is necessary to perform an additional heap scan.
RUM
solves these problems by storing additional information in a posting tree.For example, positional information of lexemes or timestamps. You can get anidea ofRUM
with the following diagram:
A drawback ofRUM
is that it has slower build and insert times thanGIN
.This is because we need to store additional information besides keys and becauseRUM
uses generic Write-Ahead Log (WAL) records.
This module is available under thelicense similar toPostgreSQL.
Before building and installingrum, you should ensure following are installed:
- PostgreSQL version is 9.6+.
Typical installation procedure may look like this:
$ git clone https://github.com/postgrespro/rum$ cd rum$ make USE_PGXS=1$ make USE_PGXS=1 install$ make USE_PGXS=1 installcheck$ psql DB -c "CREATE EXTENSION rum;"
$ USE_PGXS=1 pgxn install rum
Important: Don't forget to set the
PG_CONFIG
variable in case you want to testRUM
on a custom build of PostgreSQL. Read morehere.
Therum module provides next operators.
Operator | Returns | Description |
---|---|---|
tsvector <=> tsquery | float4 | Returns distance between tsvector and tsquery. |
timestamp <=> timestamp | float8 | Returns distance between two timestamps. |
timestamp <=| timestamp | float8 | Returns distance only for left timestamps. |
timestamp |=> timestamp | float8 | Returns distance only for right timestamps. |
The last three operations also work for types timestamptz, int2, int4, int8, float4, float8,money and oid.
rum provides the following operator classes.
For type:tsvector
This operator class storestsvector
lexemes with positional information. It supportsordering by the<=>
operator and prefix search. See the example below.
Let us assume we have the table:
CREATETABLEtest_rum(ttext, a tsvector);CREATETRIGGERtsvectorupdateBEFOREUPDATEOR INSERTON test_rumFOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('a','pg_catalog.english','t');INSERT INTO test_rum(t)VALUES ('The situation is most beautiful');INSERT INTO test_rum(t)VALUES ('It is a beautiful');INSERT INTO test_rum(t)VALUES ('It looks like a beautiful place');
To create therum index we need create an extension:
CREATE EXTENSION rum;
Then we can create new index:
CREATEINDEXrumidxON test_rum USING rum (a rum_tsvector_ops);
And we can execute the following queries:
SELECT t, a<=> to_tsquery('english','beautiful | place')AS rankFROM test_rumWHERE a @@ to_tsquery('english','beautiful | place')ORDER BY a<=> to_tsquery('english','beautiful | place'); t | rank---------------------------------+--------- It lookslike a beautiful place |8.22467 The situation is most beautiful |16.4493 It is a beautiful |16.4493(3 rows)SELECT t, a<=> to_tsquery('english','place | situation')AS rankFROM test_rumWHERE a @@ to_tsquery('english','place | situation')ORDER BY a<=> to_tsquery('english','place | situation'); t | rank---------------------------------+--------- The situation is most beautiful |16.4493 It lookslike a beautiful place |16.4493(2 rows)
For type:tsvector
This operator class stores a hash oftsvector
lexemes with positional information.It supports ordering by the<=>
operator. Itdoesn't support prefix search.
For types: int2, int4, int8, float4, float8, money, oid, time, timetz, date,interval, macaddr, inet, cidr, text, varchar, char, bytea, bit, varbit,numeric, timestamp, timestamptz
Supported operations:<
,<=
,=
,>=
,>
for all types and<=>
,<=|
and|=>
for int2, int4, int8, float4, float8, money, oid,timestamp and timestamptz types.
This operator supports ordering by the<=>
,<=|
and|=>
operators. It can be used withrum_tsvector_addon_ops
,rum_tsvector_hash_addon_ops' and
rum_anyarray_addon_ops` operator classes.
For type:tsvector
This operator class storestsvector
lexemes with any supported by modulefield. See the example below.
Let us assume we have the table:
CREATETABLEtsts (idint, t tsvector, dtimestamp);\copy tstsfrom'rum/data/tsts.data'CREATEINDEXtsts_idxON tsts USING rum (t rum_tsvector_addon_ops, d) WITH (attach='d', to='t');
Now we can execute the following queries:
EXPLAIN (costs off)SELECT id, d, d<=>'2016-05-16 14:21:25'FROM tstsWHERE t @@'wr&qh'ORDER BY d<=>'2016-05-16 14:21:25'LIMIT5; QUERY PLAN-----------------------------------------------------------------------------------Limit-> Index Scan using tsts_idxon tsts Index Cond: (t @@'''wr'' &''qh'''::tsquery)Order By: (d<=>'Mon May 16 14:21:25 2016'::timestamp without time zone)(4 rows)SELECT id, d, d<=>'2016-05-16 14:21:25'FROM tstsWHERE t @@'wr&qh'ORDER BY d<=>'2016-05-16 14:21:25'LIMIT5; id | d | ?column?-----+---------------------------------+---------------355 | Mon May1614:21:22.3267242016 |2.673276354 | Mon May1613:21:22.3267242016 |3602.673276371 | Tue May1706:21:22.3267242016 |57597.326724406 | Wed May1817:21:22.3267242016 |183597.326724415 | Thu May1902:21:22.3267242016 |215997.326724(5 rows)
Warning: Currently RUM has bogus behaviour when one creates an index using ordering over pass-by-reference additional information. This is due to the fact that posting trees have fixed length right bound and fixed length non-leaf posting items. It isn't allowed to create such indexes.
For type:tsvector
This operator class stores a hash oftsvector
lexemes with any supported by modulefield.
Itdoesn't support prefix search.
For type:tsquery
It stores branches of query tree in additional information. For example, we have the table:
CREATETABLEquery (q tsquery, tagtext);INSERT INTO queryVALUES ('supernova & star','sn'), ('black','color'), ('big & bang & black & hole','bang'), ('spiral & galaxy','shape'), ('black & hole','color');CREATEINDEXquery_idxON query USING rum(q);
Now we can execute the following fast query:
SELECT*FROM queryWHERE to_tsvector('black holes never exists before we think about them') @@ q; q | tag------------------+-------'black' | color'black' &'hole' | color(2 rows)
For type:anyarray
This operator class storesanyarray
elements with length of the array.It supports operators&&
,@>
,<@
,=
,%
operators. It also supports ordering by<=>
operator.For example, we have the table:
CREATETABLEtest_array (i int2[]);INSERT INTO test_arrayVALUES ('{}'), ('{0}'), ('{1,2,3,4}'), ('{1,2,3}'), ('{1,2}'), ('{1}');CREATEINDEXidx_arrayON test_array USING rum (i rum_anyarray_ops);
Now we can execute the query using index scan:
SET enable_seqscan TO off;EXPLAIN (COSTS OFF)SELECT*FROM test_arrayWHERE i &&'{1}'ORDER BY i<=>'{1}'ASC; QUERY PLAN------------------------------------------ Index Scan using idx_arrayon test_array Index Cond: (i &&'{1}'::smallint[])Order By: (i<=>'{1}'::smallint[])(3 rowsSELECT*FROM test_arrayWHERE i &&'{1}'ORDER BY i<=>'{1}'ASC; i----------- {1} {1,2} {1,2,3} {1,2,3,4}(4 rows)
For type:anyarray
This operator class storesanyarray
elements with any supported by modulefield.
- Allow multiple additional information (lexemes positions + timestamp).
- Improve ranking function to support TF/IDF.
- Improve insert time.
- Improve GENERIC WAL to support shift (PostgreSQL core changes).
Alexander Korotkova.korotkov@postgrespro.ru Postgres Professional Ltd., Russia
Oleg Bartunovo.bartunov@postgrespro.ru Postgres Professional Ltd., Russia
Teodor Sigaevteodor@postgrespro.ru Postgres Professional Ltd., Russia
Arthur Zakirova.zakirov@postgrespro.ru Postgres Professional Ltd., Russia
Pavel Borisovp.borisov@postgrespro.com Postgres Professional Ltd., Russia
Maxim Orlovm.orlov@postgrespro.ru Postgres Professional Ltd., Russia
About
RUM access method - inverted index with additional information in posting lists
Topics
Resources
License
Uh oh!
There was an error while loading.Please reload this page.