Commit 092ed29

New README, forgotten when docs was updated

1 parent 0c96e42, commit 092ed29

1 file changed: +167 -166 lines changed

contrib/tsearch2/README.tsearch2

@@ -1,209 +1,210 @@
11
Tsearch2 - full text search extension for PostgreSQL
22

3-
[10][Online version] of this document is available
4-
5-
This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
6-
7-
Notice: This version is fully incompatible with old tsearch (V1),
8-
which was deprecated in 7.4 and obsoleted in 8.0.
9-
10-
The Tsearch2 contrib module contains an implementation of a new data
11-
type tsvector - a searchable data type with indexed access. In a
12-
nutshell, tsvector is a set of unique words along with their
13-
positional information in the document, organized in a special
14-
structure optimized for fast access and lookup. Actually, each word
15-
entry, besides its position in the document, could have a weight
16-
attribute, describing importance of this word (at a specific) position
17-
in document. A set of bit-signatures of a fixed length, representing
18-
tsvectors, are stored in a search tree (developed using PostgreSQL
19-
GiST), which provides online update of full text index and fast query
20-
lookup. The module provides indexed access methods, queries,
21-
operations and supporting routines for the tsvector data type and easy
22-
conversion of text data to tsvector. Table driven configuration allows
23-
creation of custom configuration optimized for specific searches using
3+
[1]Online version of this document is available
4+
5+
Tsearch2 is a full text search engine, fully integrated into the PostgreSQL
6+
RDBMS.
7+
8+
Main features
9+
10+
* Full online update
11+
* Supports multiple table driven configurations
12+
* flexible and rich linguistic support (dictionaries, stop words),
13+
thesaurus
14+
* full multibyte (UTF-8) support
15+
* Sophisticated ranking functions with support of proximity and
16+
structure information (rank, rank_cd)
17+
* Index support (GiST and Gin) with concurrency and recovery support
18+
* Rich query language with query rewriting support
19+
* Headline support (text fragments with highlighted search terms)
20+
* Ability to plug-in custom dictionaries and parsers
21+
* Template generator for tsearch2 dictionaries with [2]snowball
22+
stemmer support
23+
* It is mature (5 years of development)
24+
25+
Tsearch2, in a nutshell, provides an FTS operator (contains) for the new
26+
data types representing a document (tsvector) and a query (tsquery).
27+
Table driven configuration allows creation of custom searches using
2428
standard SQL commands.
25-
26-
Configuration allows you to:
27-
* specify the type of lexemes to be indexed and the way they are
28-
processed.
29-
* specify dictionaries to be used along with stop words recognition.
30-
* specify the parser used to process a document.
31-
32-
See [11]Documentation Roadmap for links to documentation.
29+
30+
tsvector is a searchable data type representing a document. It is a set
31+
of unique words along with their positional information in the
32+
document, organized in a special structure optimized for fast access
33+
and lookup. Each entry can be labelled to reflect its importance in the
34+
document.
35+
36+
tsquery is a data type for textual queries with support of boolean
37+
operators. It consists of lexemes (optionally labelled) with boolean
38+
operators between.
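
As a rough illustration of the two data types described above, here is a
minimal Python sketch. It is a toy model only, not the storage format or
parser used by tsearch2: the function names to_toy_tsvector and matches_all
are hypothetical, words are split on whitespace, and no stemming, stop words
or weights are applied.

```python
from collections import defaultdict


def to_toy_tsvector(text):
    """Map each unique lowercased word to the list of its 1-based positions."""
    vector = defaultdict(list)
    for position, word in enumerate(text.lower().split(), start=1):
        vector[word].append(position)
    return dict(vector)


def matches_all(vector, words):
    """Toy boolean AND query: true when every query word occurs in the vector."""
    return all(word.lower() in vector for word in words)


doc = to_toy_tsvector("a fat cat sat on a mat")
print(doc["a"])                          # positions of the word "a"
print(matches_all(doc, ["cat", "mat"]))  # both words present
print(matches_all(doc, ["cat", "dog"]))  # "dog" is missing
```

The sketch keeps each word once and records where it occurs, which is the
idea behind "a set of unique words along with their positional information".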
39+
40+
Table driven configuration allows you to specify:
41+
* the parser, which is used to break a document into lexemes
42+
* what lexemes to index and the way they are processed
43+
* dictionaries to be used along with stop words recognition.
3344

3445
OpenFTS vs Tsearch2
3546

36-
OpenFTS is a middleware between application and database, so it uses
37-
tsearch2 as a storage, while database engine is used as a query executor
38-
(searching). Everything else (parsing of documents, query processing,
39-
linguistics) is carried out on the client side. That's why OpenFTS has its own
40-
configuration table (fts_conf) and works with its own set of dictionaries.
41-
OpenFTS is more flexible, because it could be used in multi-server
42-
architecture with separated machines for repository of documents
43-
(documents could be stored in file system), database and query engine.
47+
[3]OpenFTS is a middleware between application and database. OpenFTS
48+
uses tsearch2 for storage and the database engine as a query executor
49+
(searching). Everything else, i.e. parsing of documents, query
50+
processing, linguistics, is carried out on the client side. That's why OpenFTS
51+
has its own configuration table (fts_conf) and works with its own set
52+
of dictionaries. OpenFTS is more flexible, because it could be used in
53+
multi-server architecture with separate machines for repository of
54+
documents (documents could be stored in filesystem), database and
55+
query engine.
56+
57+
See [4]Documentation Roadmap for links to documentation.
4458

4559
Authors
4660

4761
* Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
48-
* Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd., Russia
49-
62+
* Teodor Sigaev <teodor@sigaev.ru>, Moscow, Moscow University, Russia
63+
5064
Contributors
5165

52-
* Robert John Shepherd and Andrew J. Kopciuch submitted
53-
"Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch
66+
* Robert John Shepherd and Andrew J. Kopciuch submitted
67+
"Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch
5468
v2)
55-
* Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2
69+
* Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2
5670
Reference" and proposed new naming convention for tsearch V2
57-
58-
Features Added with Tsearch2
5971

60-
* Relevance ranking of search results
61-
* Table driven configuration
62-
* Morphology support (ispell dictionaries, snowball stemmers)
63-
* Headline support (text fragments with highlighted search terms)
64-
* Ability to plug-in custom dictionaries and parsers
65-
* Synonym dictionary
66-
* Generator of templates for dictionaries (built-in snowball stemmer
67-
support)
68-
* Statistics of indexed words is available
69-
72+
Sponsors
73+
74+
* ABC Startsiden - compound words support
75+
* University of Mannheim for UTF-8 support (in 8.2)
76+
* jfg:networks ([5]http://www.jfg-networks.com/) for Gin - Generalized
77+
Inverted index (in 8.2)
78+
* Georgia Public Library Service and LibLime, Inc. for Thesaurus
79+
dictionary
80+
* PostGIS community - GiST Concurrency and Recovery
81+
82+
The authors are grateful to the Russian Foundation for Basic Research
83+
and Delta-Soft Ltd., Moscow, Russia for support.
84+
7085
Limitations
7186

72-
* Lexeme should be not longer than 2048 bytes
73-
* The number of lexemes is limited by 2^32. Note, that actual
74-
capacity of tsvector depends on whether positional information
75-
is stored or not.
76-
* tsvector - the size is limited by approximately 2^20 bytes.
77-
* tsquery - the number of entries (lexemes and operations) < 32768
78-
* Positional information
79-
+ maximal position of lexeme < 2^14 (16384)
80-
+ lexeme could have maximum 256 positions
81-
87+
* Length of lexeme < 2K
88+
* Length of tsvector (lexemes + positions) < 1 MB
89+
* The number of lexemes < 2^32
90+
* 0< Positional information < 16383
91+
* No more than 256 positions per lexeme
92+
* The number of nodes ( lexemes + operations) in tsquery < 32768
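
The per-lexeme limits listed above can be written down as a small checker.
This is only a sketch that takes the listed numbers at face value: the
function name check_lexeme is hypothetical, and measuring lexeme length in
UTF-8 bytes is an assumption, not something the list states.

```python
MAX_LEXEME_BYTES = 2048         # "Length of lexeme < 2K"
MAX_POSITION = 16383            # "0 < Positional information < 16383"
MAX_POSITIONS_PER_LEXEME = 256  # "No more than 256 positions per lexeme"


def check_lexeme(lexeme, positions):
    """Return the list of limits violated by one lexeme and its positions."""
    problems = []
    # Assumption: length is counted in bytes of the UTF-8 encoding.
    if len(lexeme.encode("utf-8")) >= MAX_LEXEME_BYTES:
        problems.append("lexeme too long")
    if len(positions) > MAX_POSITIONS_PER_LEXEME:
        problems.append("too many positions")
    if any(not (0 < p < MAX_POSITION) for p in positions):
        problems.append("position out of range")
    return problems


print(check_lexeme("cat", [1, 2]))
print(check_lexeme("cat", [16383]))
```

An empty list means the lexeme stays within the limits checked here.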
93+
8294
References
8395

8496
* GiST development site -
85-
[12]http://www.sai.msu.su/~megera/postgres/gist
86-
* OpenFTS home page - [13]http://openfts.sourceforge.net/
97+
[6]http://www.sai.msu.su/~megera/postgres/gist
98+
* GiN development - [7]http://www.sigaev.ru/gin/
99+
* OpenFTS home page - [8]http://openfts.sourceforge.net/
87100
* Mailing list -
88-
[14]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gen
89-
eral
90-
91-
[15]Documentation Roadmap
92-
101+
[9]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gene
102+
ral
103+
93104
Documentation Roadmap
94105

95106
* Several docs are available from docs/ subdirectory
96107
+ "Tsearch V2 Introduction" by Andrew Kopciuch
97108
+ "Tsearch2 Guide" by Brandon Rhodes
98109
+ "Tsearch2 Reference" by Brandon Rhodes
99110
* Readme.gendict in gendict/ subdirectory
100-
+ [16][Gendict tutorial]
101-
102-
Online version of documentation is always available from Tsearch V2
103-
home page -
104-
[17]http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
105-
111+
+ Also, check [10]Gendict tutorial
112+
* Check [11]tsearch2 Wiki pages for various documentation
113+
106114
Support
107115

108-
The authors strongly recommend that people use [18][openfts-general] or
109-
[19][pgsql-general] mailing lists for questions and discussions.
110-
111-
Caution
116+
The authors strongly recommend that people use [12]openfts-general or
117+
[13]pgsql-general mailing lists for questions and discussions.
112118

113-
In spite of apparent easy full text searching with our tsearch module
114-
(authors hope it's so), any serious search engine requires profound
115-
study of various aspects, such as stop words, dictionaries, special
116-
parsers. Tsearch module was designed to facilitate both those cases.
117-
118119
Development History
119120

121+
Latest news
122+
123+
For the PostgreSQL 8.2 release we added:
124+
* multibyte (UTF-8) support
125+
* Thesaurus dictionary
126+
* Query rewriting
127+
* rank_cd relevance function now supports different weights of
128+
lexemes
129+
* GiN support improves the scalability of tsearch2
130+
120131
Pre-tsearch era
121-
Development of OpenFTS began in 2000 after realizing that we
122-
needed a search engine optimized for online updates and able to
123-
access metadata from the database. This is essential for online
132+
Development of OpenFTS began in 2000 after realizing that we
133+
needed a search engine optimized for online updates with access
134+
to metadata from the database. This is essential for online
124135
news agencies, web portals, digital libraries, etc. Most search
125-
engines available utilize an inverted index which is very fast
126-
for searching but very slow for online updates. Incremental
127-
updates of an inverted index are a complex engineering task
128-
while we needed something light, free and with the ability to
129-
access metadata from the database. The last requirement is very
130-
important because in a real life application a search engine
131-
should always consult metadata ( topic, permissions, date
132-
range, version, etc.). We extensively use PostgreSQL as a
133-
database backend and have no intention to move from it, so the
134-
problem was to find a data structure and a fast way to access
135-
it. PostgreSQL has a rather unique data type for storing sets
136-
(think about words) - arrays, but lacks index access to them. A
137-
document is parsed into lexemes, which are identified in
138-
various ways (e.g. stemming, morphology, dictionary), and as a
139-
result is reduced to an array of integer numbers. During our
140-
research we found a paper by Joseph Hellerstein which
141-
introduced an interesting data structure suitable for sets -
142-
RD-tree (Russian Doll tree). It looked very attractive, but
143-
implementing it in PostgreSQL seemed difficult because of our
144-
ignorance of database internals. Further research led us to
145-
the idea to use GiST for implementing RD-tree, but at that time
146-
the GiST code had for a long while remained untouched and
147-
contained several bugs. After work on improving GiST for
148-
version 7.0.3 of PostgreSQL was done, we were able to implement
149-
RD-Tree and use it for index access to arrays of integers. This
150-
implementation was ideally suited for small arrays and
151-
eliminated complex joins, but was practically useless for
152-
indexing large arrays. The next improvement came from an idea
153-
to represent a document by a single bit-signature, a so-called
154-
superimposed signature (see "Index Structures for Databases
155-
Containing Data Items with Set-valued Attributes", 1997, Sven
156-
Helmer for details). We developed the contrib/intarray module
157-
and used it for full text indexing.
158-
136+
engines available utilize an inverted index which is very fast
137+
for searching but very slow for online updates. Incremental
138+
updates of an inverted index are a complex engineering task
139+
while we needed something light, free and with the ability to
140+
access metadata from the database. The last requirement was
141+
very important because in a real life application a search engine
142+
should always consult metadata ( topic, permissions, date
143+
range, version, etc.). We extensively use PostgreSQL as a
144+
database backend and have no intention to move from it, so the
145+
problem was to find a data structure and a fast way to access
146+
it. PostgreSQL has a rather unique data type for storing sets
147+
(think about words) - arrays, but lacks index access to them.
148+
During our research we found a paper by Joseph Hellerstein, who
149+
introduced an interesting data structure suitable for sets -
150+
RD-tree (Russian Doll tree). Further research led us to the
151+
idea to use GiST for implementing RD-tree, but at that time the
152+
GiST code had been untouched for a long time and contained several
153+
bugs. After work on improving GiST for version 7.0.3 of
154+
PostgreSQL was done, we were able to implement RD-Tree and use
155+
it for index access to arrays of integers. This implementation
156+
was ideally suited for small arrays and eliminated complex
157+
joins, but was practically useless for indexing large arrays.
158+
The next improvement came from an idea to represent a document
159+
by a single bit-signature, a so-called superimposed signature
160+
(see "Index Structures for Databases Containing Data Items with
161+
Set-valued Attributes", 1997, Sven Helmer for details). We
162+
developed the contrib/intarray module and used it for full
163+
text indexing.
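
The superimposed signature idea mentioned above can be sketched in a few
lines of Python. This is an illustrative toy, not the code of
contrib/intarray or tsearch2: the 64-bit signature length, the use of
SHA-256 to pick a bit, and the names word_bit, signature and may_contain
are all assumptions made for the example.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed length, chosen only for illustration


def word_bit(word):
    """Pick one bit position for a word from a stable hash."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % SIGNATURE_BITS


def signature(words):
    """Superimpose (bitwise OR) the bits of all words into one integer."""
    sig = 0
    for word in words:
        sig |= 1 << word_bit(word)
    return sig


def may_contain(doc_sig, query_words):
    """True if every query bit is set in the document signature.

    Different words can share a bit, so a True result may be a false
    positive and must be rechecked against the real document; a False
    result is always correct.
    """
    query_sig = signature(query_words)
    return (doc_sig & query_sig) == query_sig


doc_sig = signature(["full", "text", "search"])
print(may_contain(doc_sig, ["text", "search"]))
```

Because the signature has a fixed length regardless of document size, it
fits naturally into a search tree, at the cost of rechecking candidates.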
164+
159165
tsearch v1
160166
It was inconvenient to use integer id's instead of words, so we
161-
introduced a new data type called 'txtidx' - a searchable data
162-
type (textual) with indexed access. This was a first step of
163-
our work on an implementation of a built-in PostgreSQL full
167+
introduced a new data type called 'txtidx' - a searchable data
168+
type (textual) with indexed access. This was a first step of
169+
our work on an implementation of a built-in PostgreSQL full
164170
text search engine. Even though tsearch v1 had many features of
165-
a search engine it lacked configuration support and relevance
166-
ranking. People were encouraged to use OpenFTS, which provided
167-
relevance ranking based on coordinate information and flexible
168-
configuration. OpenFTS v.0.34 is the last version based on
171+
a search engine it lacked configuration support and relevance
172+
ranking. People were encouraged to use OpenFTS, which provided
173+
relevance ranking based on positional information and flexible
174+
configuration. OpenFTS v.0.34 is the last version based on
169175
tsearch v1.
170-
176+
171177
tsearch V2
172-
People recognized tsearch as a powerful tool for full text
173-
searching and insisted on adding ranking support, better
174-
configurability, etc. We already thought about moving most of
175-
the features of OpenFTS to tsearch, and in early 2003 we
176-
decided to work on a new version of tsearch - tsearch v2. We've
177-
abandoned auxiliary index tables which were used by OpenFTS to
178-
store coordinate information and modified the txtidx type to
179-
store them internally. Also, we've added table-driven
180-
configuration, support of ispell dictionaries, snowball
181-
stemmers and the ability to specify which types of lexemes to
182-
index. Also, it's now possible to generate headlines of
183-
documents with highlighted search terms. These changes make
184-
tsearch more user friendly and turn it into a really powerful
185-
full text search engine. After announcing the alpha version, we
186-
received a proposal from Brandon Rhodes to rename tsearch
187-
functions to be more consistent. So, we have renamed txtidx
188-
type to tsvector and other things as well.
189-
190-
To allow users of tsearch v1 a smooth upgrade, we named the module
191-
tsearch2.
192-
193-
Future release of OpenFTS (v.0.35) will be based on tsearch2. Brave
194-
people could download it from OpenFTS CVS (see link from [20][OpenFTS
195-
page]
178+
People recognized tsearch as a powerful tool for full text
179+
searching and insisted on adding ranking support, better
180+
configurability, etc. We already thought about moving most of
181+
the features of OpenFTS to tsearch, and in early 2003 we
182+
decided to work on a new version of tsearch. We abandoned
183+
auxiliary index tables which were used by OpenFTS to store
184+
positional information and modified the txtidx type to store
185+
them internally. We added table-driven configuration, support
186+
of ispell dictionaries, snowball stemmers and the ability to
187+
specify which types of lexemes to index. Now, it's possible to
188+
generate headlines of documents with highlighted search terms.
189+
These changes make tsearch more user friendly and turn it into
190+
a really powerful full text search engine. Brandon Rhodes
191+
proposed to rename tsearch functions for consistency and we
192+
renamed txtidx type to tsvector and other things as well. To
193+
allow users of tsearch v1 a smooth upgrade, we named the module
194+
as tsearch2. Since version 0.35 OpenFTS uses tsearch2.
196195

197196
References
198197

199-
10. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
200-
11. http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Readme#Documentation_Roadmap
201-
12. http://www.sai.msu.su/~megera/postgres/gist
202-
13. http://openfts.sourceforge.net/
203-
14. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
204-
15. http://www.sai.msu.su/~megera/oddmuse/index.cgi?action=anchor&id=Documentation_Roadmap#Documentation_Roadmap
205-
16. http://www.sai.msu.su/~megera/oddmuse/index.cgi?Gendict
206-
17. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
207-
18. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
208-
19. http://archives.postgresql.org/pgsql-general/
209-
20. http://openfts.sourceforge.net/
198+
1. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
199+
2. http://snowball.tartarus.org/
200+
3. http://openfts.sourceforge.net/
201+
4. file://localhost/u/megera/WWW/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme82.html#dm
202+
5. http://www.jfg-networks.com/
203+
6. http://www.sai.msu.su/~megera/postgres/gist
204+
7. http://www.sigaev.ru/gin/
205+
8. http://openfts.sourceforge.net/
206+
9. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
207+
10. http://www.sai.msu.su/~megera/wiki/Gendict
208+
11. http://www.sai.msu.su/~megera/wiki/Tsearch2
209+
12. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
210+
13. http://archives.postgresql.org/pgsql-general/
