|
1 | 1 | Tsearch2 - full text search extension for PostgreSQL |
2 | 2 |
|
3 | | - [10][Online version] of this document is available |
4 | | - |
5 | | - This module is sponsored by Delta-Soft Ltd., Moscow, Russia. |
6 | | - |
7 | | - Notice: This version is fully incompatible with old tsearch (V1), |
8 | | - which was deprecated in 7.4 and obsoleted in 8.0. |
9 | | - |
10 | | - The Tsearch2 contrib module contains an implementation of a new data |
11 | | - type tsvector - a searchable data type with indexed access. In a |
12 | | - nutshell, tsvector is a set of unique words along with their |
13 | | - positional information in the document, organized in a special |
14 | | - structure optimized for fast access and lookup. Actually, each word |
15 | | - entry, besides its position in the document, could have a weight |
16 | | - attribute, describing importance of this word (at a specific) position |
17 | | - in document. A set of bit-signatures of a fixed length, representing |
18 | | - tsvectors, are stored in a search tree (developed using PostgreSQL |
19 | | - GiST), which provides online update of full text index and fast query |
20 | | - lookup. The module provides indexed access methods, queries, |
21 | | - operations and supporting routines for the tsvector data type and easy |
22 | | - conversion of text data to tsvector. Table driven configuration allows |
23 | | - creation of custom configuration optimized for specific searches using |
| 3 | + [1]Online version of this document is available |
| 4 | + |
| 5 | + Tsearch2 is a full text search engine, fully integrated into the |
| 6 | + PostgreSQL RDBMS. |
| 7 | + |
| 8 | +Main features |
| 9 | + |
| 10 | + * Full online update |
| 11 | + * Supports multiple table driven configurations |
| 12 | + * Flexible and rich linguistic support (dictionaries, stop words), |
| 13 | + thesaurus |
| 14 | + * Full multibyte (UTF-8) support |
| 15 | + * Sophisticated ranking functions with support of proximity and |
| 16 | + structure information (rank, rank_cd) |
| 17 | + * Index support (GiST and Gin) with concurrency and recovery support |
| 18 | + * Rich query language with query rewriting support |
| 19 | + * Headline support (text fragments with highlighted search terms) |
| 20 | + * Ability to plug-in custom dictionaries and parsers |
| 21 | + * Template generator for tsearch2 dictionaries with [2]snowball |
| 22 | + stemmer support |
| 23 | + * It is mature (5 years of development) |
| 24 | + |
| 25 | + In a nutshell, Tsearch2 provides an FTS operator (contains) for two new |
| 26 | + data types, representing a document (tsvector) and a query (tsquery). |
| 27 | + Table driven configuration allows creation of custom searches using |
24 | 28 | standard SQL commands. |
25 | | - |
26 | | - Configuration allows you to: |
27 | | - * specify the type of lexemes to be indexed and the way they are |
28 | | - processed. |
29 | | - * specify dictionaries to be used along with stop words recognition. |
30 | | - * specify the parser used to process a document. |
31 | | - |
32 | | - See [11]Documentation Roadmap for links to documentation. |
| 29 | + |
| 30 | + tsvector is a searchable data type representing a document. It is a |
| 31 | + set of unique words along with their positional information in the |
| 32 | + document, organized in a special structure optimized for fast access |
| 33 | + and lookup. Each entry can be labelled to reflect its importance in |
| 34 | + the document. |
| 35 | + |
| 36 | + tsquery is a data type for textual queries with support for boolean |
| 37 | + operators. It consists of lexemes (optionally labelled) with boolean |
| 38 | + operators between them. |
| 39 | + |
| 40 | + Table driven configuration allows you to specify: |
| 41 | + * the parser used to break a document into lexemes |
| 42 | + * which lexemes to index and the way they are processed |
| 43 | + * the dictionaries to be used, along with stop word recognition |
33 | 44 |
|
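The description of tsvector and tsquery above can be illustrated with a small toy model. This is only a sketch of the idea (unique words with positions, plus boolean matching), not the tsearch2 implementation: the function names `to_tsvector_toy` and `matches`, the stop word list and the tuple-based query form are all assumptions made for the example, and no stemming or dictionary lookup is performed.

```python
from collections import defaultdict

# Hypothetical stop word list, chosen only for this example.
STOP_WORDS = frozenset({"a", "the", "of", "on"})


def to_tsvector_toy(text, stop_words=STOP_WORDS):
    """Map each unique non-stop word to the list of its 1-based positions."""
    vector = defaultdict(list)
    for position, word in enumerate(text.lower().split(), start=1):
        if word not in stop_words:
            vector[word].append(position)
    return dict(vector)


def matches(vector, query):
    """Evaluate a boolean query against a toy vector.

    A query is either a lexeme (str) or a tuple:
    ("and", q1, q2), ("or", q1, q2) or ("not", q).
    """
    if isinstance(query, str):
        return query in vector
    op = query[0]
    if op == "and":
        return matches(vector, query[1]) and matches(vector, query[2])
    if op == "or":
        return matches(vector, query[1]) or matches(vector, query[2])
    if op == "not":
        return not matches(vector, query[1])
    raise ValueError(f"unknown operator: {op!r}")


doc = to_tsvector_toy("the cat sat on the mat the cat slept")
print(doc)
print(matches(doc, ("and", "cat", ("not", "dog"))))
```

In this toy model every word, including a stop word, advances the position counter, so the positions of the remaining words still reflect their place in the original text.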
34 | 45 | OpenFTS vs Tsearch2 |
35 | 46 |
|
36 | | - OpenFTS is a middleware between application and database, so it uses |
37 | | - tsearch2 as a storage, while database engine is used as a query executor |
38 | | - (searching). Everything else (parsing of documents, query processing, |
39 | | - linguistics) carry outs on client side. That's why OpenFTS has its own |
40 | | - configuration table (fts_conf) and works with its own set of dictionaries. |
41 | | - OpenFTS is more flexible, because it could be used in multi-server |
42 | | - architecture with separated machines for repository of documents |
43 | | - (documents could be stored in file system), database and query engine. |
| 47 | + [3]OpenFTS is a middleware between application and database. OpenFTS |
| 48 | + uses tsearch2 as storage and the database engine as a query executor |
| 49 | + (searching). Everything else, i.e. parsing of documents, query |
| 50 | + processing and linguistics, is carried out on the client side. That's |
| 51 | + why OpenFTS has its own configuration table (fts_conf) and works with |
| 52 | + its own set of dictionaries. OpenFTS is more flexible, because it can |
| 53 | + be used in a multi-server architecture with separate machines for the |
| 54 | + document repository (documents could be stored in the filesystem), the |
| 55 | + database and the query engine. |
| 56 | + |
| 57 | + See [4]Documentation Roadmap for links to documentation. |
44 | 58 |
|
45 | 59 | Authors |
46 | 60 |
|
47 | 61 | * Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia |
48 | | - * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia |
49 | | -
|
| 62 | + * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Moscow University, Russia |
| 63 | + |
50 | 64 | Contributors |
51 | 65 |
|
52 | | - * Robert John Shepherd and Andrew J. Kopciuch submitted |
53 | | - "Introductionto tsearch" (Robert - tsearch v1, Andrew - tsearch |
| 66 | + * Robert John Shepherd and Andrew J. Kopciuch submitted |
| 67 | + "Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch |
54 | 68 | v2) |
55 | | - * Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2 |
| 69 | + * Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2 |
56 | 70 | Reference" and proposed new naming convention for tsearch V2 |
57 | | - |
58 | | -Features Added with Tsearch2 |
59 | 71 |
|
60 | | - * Relevance ranking of search results |
61 | | - * Table driven configuration |
62 | | - * Morphology support (ispell dictionaries, snowball stemmers) |
63 | | - * Headline support (text fragments with highlighted search terms) |
64 | | - * Ability to plug-in custom dictionaries and parsers |
65 | | - * Synonym dictionary |
66 | | - * Generator of templates for dictionaries (built-in snowball stemmer |
67 | | - support) |
68 | | - * Statistics of indexed words is available |
69 | | - |
| 72 | +Sponsors |
| 73 | + |
| 74 | + * ABC Startsiden - compound words support |
| 75 | + * University of Mannheim for UTF-8 support (in 8.2) |
| 76 | + * jfg:networks ([5]http://www.jfg-networks.com/) for Gin - Generalized |
| 77 | + Inverted index (in 8.2) |
| 78 | + * Georgia Public Library Service and LibLime, Inc. for Thesaurus |
| 79 | + dictionary |
| 80 | + * PostGIS community - GiST Concurrency and Recovery |
| 81 | + |
| 82 | + The authors are grateful to the Russian Foundation for Basic Research |
| 83 | + and Delta-Soft Ltd., Moscow, Russia for support. |
| 84 | + |
70 | 85 | Limitations |
71 | 86 |
|
72 | | - * Lexeme should be not longer than 2048 bytes |
73 | | - * The number of lexemes is limited by 2^32. Note, that actual |
74 | | - capacity of tsvector is depends on whether positional information |
75 | | - is stored or not. |
76 | | - * tsvector - the size is limited by approximately 2^20 bytes. |
77 | | - * tsquery - the number of entries (lexemes and operations) < 32768 |
78 | | - * Positional information |
79 | | - + maximal position of lexeme < 2^14 (16384) |
80 | | - + lexeme could have maximum 256 positions |
81 | | - |
| 87 | + * Length of a lexeme < 2 KB |
| 88 | + * Length of a tsvector (lexemes + positions) < 1 MB |
| 89 | + * The number of lexemes < 4^32 |
| 90 | + * 0 < position of a lexeme <= 16383 |
| 91 | + * No more than 256 positions per lexeme |
| 92 | + * The number of nodes (lexemes + operations) in a tsquery < 32768 |
| 93 | + |
82 | 94 | References |
83 | 95 |
|
84 | 96 | * GiST development site - |
85 | | - [12]http://www.sai.msu.su/~megera/postgres/gist |
86 | | - * OpenFTS home page - [13]http://openfts.sourceforge.net/ |
| 97 | + [6]http://www.sai.msu.su/~megera/postgres/gist |
| 98 | + * GiN development - [7]http://www.sigaev.ru/gin/ |
| 99 | + * OpenFTS home page - [8]http://openfts.sourceforge.net/ |
87 | 100 | * Mailing list - |
88 | | - [14]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gen |
89 | | - eral |
90 | | - |
91 | | - [15]Documentation Roadmap |
92 | | - |
| 101 | + [9]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gene |
| 102 | + ral |
| 103 | + |
93 | 104 | Documentation Roadmap |
94 | 105 |
|
95 | 106 | * Several docs are available from docs/ subdirectory |
96 | 107 | + "Tsearch V2 Introduction" by Andrew Kopciuch |
97 | 108 | + "Tsearch2 Guide" by Brandon Rhodes |
98 | 109 | + "Tsearch2 Reference" by Brandon Rhodes |
99 | 110 | * Readme.gendict in gendict/ subdirectory |
100 | | - + [16][Gendict tutorial] |
101 | | - |
102 | | - Online version of documentation is always available from Tsearch V2 |
103 | | - home page - |
104 | | - [17]http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ |
105 | | - |
| 111 | + + Also, check [10]Gendict tutorial |
| 112 | + * Check [11]tsearch2 Wiki pages for various documentation |
| 113 | + |
106 | 114 | Support |
107 | 115 |
|
108 | | - Authors urgently recommend people to use [18][openfts-general] or |
109 | | - [19][pgsql-general] mailing lists for questions and discussions. |
110 | | - |
111 | | -Caution |
| 116 | + The authors strongly recommend using the [12]openfts-general or |
| 117 | + [13]pgsql-general mailing lists for questions and discussions. |
112 | 118 |
|
113 | | - In spite of apparent easy full text searching with our tsearch module |
114 | | - (authors hope it's so), any serious search engine require profound |
115 | | - study of various aspects, such as stop words, dictionaries, special |
116 | | - parsers. Tsearch module was designed to facilitate both those cases. |
117 | | - |
118 | 119 | Development History |
119 | 120 |
|
| 121 | + Latest news |
| 122 | + |
| 123 | + For the PostgreSQL 8.2 release we added: |
| 124 | + * multibyte (UTF-8) support |
| 125 | + * Thesaurus dictionary |
| 126 | + * Query rewriting |
| 127 | + * rank_cd relevance function now supports different weights of |
| 128 | + lexemes |
| 129 | + * GiN support improves the scalability of tsearch2 |
| 130 | + |
120 | 131 | Pre-tsearch era |
121 | | - Development of OpenFTS began in 2000 after realizing that we |
122 | | -needed asearch engine optimized for online updatesand able to |
123 | | -accessmetadata from the database. This is essential for online |
| 132 | + Development of OpenFTS began in 2000 after realizing that we |
| 133 | + needed a search engine optimized for online updates with access |
| 134 | + to metadata from the database. This is essential for online |
124 | 135 | news agencies, web portals, digital libraries, etc. Most search |
125 | | - engines available utilize an inverted index which is very fast |
126 | | - for searching but very slow for online updates. Incremental |
127 | | - updates of an inverted index is a complex engineering task |
128 | | - while we needed something light, free and with the ability to |
129 | | - access metadata from the database. The last requirement is very |
130 | | - important because in a real life application a search engine |
131 | | - should always consult metadata ( topic, permissions, date |
132 | | - range, version, etc.). We extensively use PostgreSQL as a |
133 | | - database backend and have no intention to move from it, so the |
134 | | - problem was to find a data structure and a fast way to access |
135 | | - it. PostgreSQL has rather unique data type for storing sets |
136 | | - (think about words) - arrays, but lacks index access to them. A |
137 | | - document is parsed into lexemes, which are identified in |
138 | | - various ways (e.g. stemming, morphology, dictionary), and as a |
139 | | - result is reduced to an array of integer numbers. During our |
140 | | - research we found a paper of Joseph Hellerstein which |
141 | | - introduced an interesting data structure suitable for sets - |
142 | | - RD-tree (Russian Doll tree). It looked very attractive, but |
143 | | - implementing it in PostgreSQL seemed difficult because of our |
144 | | - ignorance of database internals. Further research lead us to |
145 | | - the idea to use GiST for implementing RD-tree, but at that time |
146 | | - the GiST code had for a long while remained untouched and |
147 | | - contained several bugs. After work on improving GiST for |
148 | | - version 7.0.3 of PostgreSQL was done, we were able to implement |
149 | | - RD-Tree and use it for index access to arrays of integers. This |
150 | | - implementation was ideally suited for small arrays and |
151 | | - eliminated complex joins, but was practically useless for |
152 | | - indexing large arrays. The next improvement came from an idea |
153 | | - to represent a document by a single bit-signature, a so-called |
154 | | - superimposed signature (see "Index Structures for Databases |
155 | | - Containing Data Items with Set-valued Attributes", 1997, Sven |
156 | | - Helmer for details). We developeded the contrib/intarray module |
157 | | - and used it for full text indexing. |
158 | | - |
| 136 | + engines available utilize an inverted index which is very fast |
| 137 | + for searching but very slow for online updates. Incremental |
| 138 | + updating of an inverted index is a complex engineering task |
| 139 | + while we needed something light, free and with the ability to |
| 140 | + access metadata from the database. The last requirement was |
| 141 | + very important because in a real life application a search engine |
| 142 | + should always consult metadata (topic, permissions, date |
| 143 | + range, version, etc.). We extensively use PostgreSQL as a |
| 144 | + database backend and have no intention to move from it, so the |
| 145 | + problem was to find a data structure and a fast way to access |
| 146 | + it. PostgreSQL has a rather unique data type for storing sets |
| 147 | + (think about words) - arrays, but lacks index access to them. |
| 148 | + During our research we found a paper by Joseph Hellerstein, who |
| 149 | + introduced an interesting data structure suitable for sets - |
| 150 | + RD-tree (Russian Doll tree). Further research led us to the |
| 151 | + idea to use GiST for implementing RD-tree, but at that time the |
| 152 | + GiST code had been untouched for a long time and contained several |
| 153 | + bugs. After work on improving GiST for version 7.0.3 of |
| 154 | + PostgreSQL was done, we were able to implement RD-Tree and use |
| 155 | + it for index access to arrays of integers. This implementation |
| 156 | + was ideally suited for small arrays and eliminated complex |
| 157 | + joins, but was practically useless for indexing large arrays. |
| 158 | + The next improvement came from an idea to represent a document |
| 159 | + by a single bit-signature, a so-called superimposed signature |
| 160 | + (see "Index Structures for Databases Containing Data Items with |
| 161 | + Set-valued Attributes", 1997, Sven Helmer for details). We |
| 162 | + developed the contrib/intarray module and used it for full |
| 163 | + text indexing. |
| 164 | + |
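The superimposed signature idea mentioned above can be sketched in a few lines. This is a generic illustration of superimposed coding, not the contrib/intarray or tsearch2 code: the 64-bit width, the SHA-256 based hash and the names `word_bit`, `signature` and `may_contain` are assumptions for the example.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width, chosen only for illustration


def word_bit(word, nbits=SIGNATURE_BITS):
    """Hash a word to a single bit position inside the signature."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return 1 << (int.from_bytes(digest[:8], "big") % nbits)


def signature(words, nbits=SIGNATURE_BITS):
    """Superimpose (bitwise OR) the bits of all words into one signature."""
    sig = 0
    for word in words:
        sig |= word_bit(word, nbits)
    return sig


def may_contain(doc_sig, query_sig):
    """True if every bit of the query is set in the document signature.

    False means the document certainly lacks some query word. True means
    it may contain all of them and must be rechecked, because different
    words can collide on the same bit.
    """
    return (doc_sig & query_sig) == query_sig


doc_sig = signature(["postgresql", "full", "text", "search"])
print(may_contain(doc_sig, signature(["text", "search"])))
```

The fixed length is what makes such signatures convenient to store in a search tree: a parent entry can hold the OR of its children's signatures, at the cost of false positives that have to be filtered out afterwards.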
159 | 165 | tsearch v1 |
160 | 166 | It was inconvenient to use integer id's instead of words, so we |
161 | | - introduced a new data type called 'txtidx' - a searchable data |
162 | | - type (textual) with indexed access. This was a first step of |
163 | | - our work onan implementation of a built-in PostgreSQL full |
| 167 | + introduced a new data type called 'txtidx' - a searchable data |
| 168 | + type (textual) with indexed access. This was a first step of |
| 169 | + our work on an implementation of a built-in PostgreSQL full |
164 | 170 | text search engine. Even though tsearch v1 had many features of |
165 | | - a search engine it lacked configuration support and relevance |
166 | | - ranking. People were encouraged to use OpenFTS, which provided |
167 | | - relevance ranking based oncoordinate information and flexible |
168 | | - configuration. OpenFTS v.0.34 is the last version based on |
| 171 | + a search engine it lacked configuration support and relevance |
| 172 | + ranking. People were encouraged to use OpenFTS, which provided |
| 173 | + relevance ranking based on positional information and flexible |
| 174 | + configuration. OpenFTS v.0.34 is the last version based on |
169 | 175 | tsearch v1. |
170 | | -
|
| 176 | + |
171 | 177 | tsearch V2 |
172 | | - People recognized tsearch as a powerful tool for full text |
173 | | - searching and insisted on adding ranking support, better |
174 | | - configurability, etc. We already thought about moving most of |
175 | | - the features of OpenFTS to tsearch, and in the early 2003 we |
176 | | - decided to work on a new version of tsearch - tsearch v2. We've |
177 | | - abandoned auxiliary index tables which were used by OpenFTS to |
178 | | - store coordinate information and modified the txtidx type to |
179 | | - store them internally. Also, we've added table-driven |
180 | | - configuration, support of ispell dictionaries, snowball |
181 | | - stemmers and the ability to specify which types of lexemes to |
182 | | - index. Also, it's now possible to generate headlines of |
183 | | - documents with highlighted search terms. These changes make |
184 | | - tsearch more user friendly and turn it into a really powerful |
185 | | - full text search engine. After announcing the alpha version, we |
186 | | - received a proposal from Brandon Rhodes to rename tsearch |
187 | | - functions to be more consistent. So, we have renamed txtidx |
188 | | - type to tsvector and other things as well. |
189 | | - |
190 | | - To allow users of tsearch v1 smooth upgrade, we named the module as |
191 | | - tsearch2. |
192 | | - |
193 | | - Future release of OpenFTS (v.0.35) will be based on tsearch2. Brave |
194 | | - people could download it from OpenFTS CVS (see link from [20][OpenFTS |
195 | | - page] |
| 178 | + People recognized tsearch as a powerful tool for full text |
| 179 | + searching and insisted on adding ranking support, better |
| 180 | + configurability, etc. We already thought about moving most of |
| 181 | + the features of OpenFTS to tsearch, and in early 2003 we |
| 182 | + decided to work on a new version of tsearch. We abandoned |
| 183 | + auxiliary index tables which were used by OpenFTS to store |
| 184 | + positional information and modified the txtidx type to store |
| 185 | + them internally. We added table-driven configuration, support |
| 186 | + of ispell dictionaries, snowball stemmers and the ability to |
| 187 | + specify which types of lexemes to index. Now, it's possible to |
| 188 | + generate headlines of documents with highlighted search terms. |
| 189 | + These changes make tsearch more user friendly and turn it into |
| 190 | + a really powerful full text search engine. Brandon Rhodes |
| 191 | + proposed to rename tsearch functions for consistency and we |
| 192 | + renamed the txtidx type to tsvector, among other things. To |
| 193 | + allow users of tsearch v1 a smooth upgrade, we named the module |
| 194 | + tsearch2. Since version 0.35 OpenFTS uses tsearch2. |
196 | 195 |
|
197 | 196 | References |
198 | 197 |
|
199 | | - 10. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html |
200 | | - 11. http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Readme#Documentation_Roadmap |
201 | | - 12. http://www.sai.msu.su/~megera/postgres/gist |
202 | | - 13. http://openfts.sourceforge.net/ |
203 | | - 14. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
204 | | - 15. http://www.sai.msu.su/~megera/oddmuse/index.cgi?action=anchor&id=Documentation_Roadmap#Documentation_Roadmap |
205 | | - 16. http://www.sai.msu.su/~megera/oddmuse/index.cgi?Gendict |
206 | | - 17. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ |
207 | | - 18. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
208 | | - 19. http://archives.postgresql.org/pgsql-general/ |
209 | | - 20. http://openfts.sourceforge.net/ |
| 198 | + 1. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html |
| 199 | + 2. http://snowball.tartarus.org/ |
| 200 | + 3. http://openfts.sourceforge.net/ |
| 201 | + 4. file://localhost/u/megera/WWW/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme82.html#dm |
| 202 | + 5. http://www.jfg-networks.com/ |
| 203 | + 6. http://www.sai.msu.su/~megera/postgres/gist |
| 204 | + 7. http://www.sigaev.ru/gin/ |
| 205 | + 8. http://openfts.sourceforge.net/ |
| 206 | + 9. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
| 207 | + 10. http://www.sai.msu.su/~megera/wiki/Gendict |
| 208 | + 11. http://www.sai.msu.su/~megera/wiki/Tsearch2 |
| 209 | + 12. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
| 210 | + 13. http://archives.postgresql.org/pgsql-general/ |