
12.5. Parsers

Text search parsers are responsible for splitting raw document text into tokens and identifying each token's type, where the set of possible types is defined by the parser itself. Note that a parser does not modify the text at all; it simply identifies plausible word boundaries. Because of this limited scope, there is less need for application-specific custom parsers than there is for custom dictionaries. At present PostgreSQL provides just one built-in parser, which has been found to be useful for a wide range of applications.

The built-in parser is named pg_catalog.default. It recognizes 23 token types, shown in Table 12.1.

Table 12.1. Default Parser's Token Types

 Alias           | Description                              | Example
-----------------+------------------------------------------+----------------------------------------------------------
 asciiword       | Word, all ASCII letters                  | elephant
 word            | Word, all letters                        | mañana
 numword         | Word, letters and digits                 | beta1
 asciihword      | Hyphenated word, all ASCII               | up-to-date
 hword           | Hyphenated word, all letters             | lógico-matemática
 numhword        | Hyphenated word, letters and digits      | postgresql-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | postgresql in the context postgresql-beta1
 hword_part      | Hyphenated word part, all letters        | lógico or matemática in the context lógico-matemática
 hword_numpart   | Hyphenated word part, letters and digits | beta1 in the context postgresql-beta1
 email           | Email address                            | foo@example.com
 protocol        | Protocol head                            | http://
 url             | URL                                      | example.com/stuff/index.html
 host            | Host                                     | example.com
 url_path        | URL path                                 | /stuff/index.html, in the context of a URL
 file            | File or path name                        | /usr/local/foo.txt, if not within a URL
 sfloat          | Scientific notation                      | -1.234e56
 float           | Decimal notation                         | -1.234
 int             | Signed integer                           | -1234
 uint            | Unsigned integer                         | 1234
 version         | Version number                           | 8.3.0
 tag             | XML tag                                  | <a href="/docs/postgresql/current/dictionaries.html">
 entity          | XML entity                               | &amp;
 blank           | Space symbols                            | (any whitespace or punctuation not otherwise recognized)
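
The token types in Table 12.1 can also be listed from a running server. The following is a minimal sketch, assuming a session on a stock installation where the built-in parser can be referenced as default; the sample string is made up for illustration, and the row order of the second query is not guaranteed.

```sql
-- List every token type the built-in parser can emit.
SELECT tokid, alias, description
FROM ts_token_type('default')
ORDER BY tokid;

-- Run the parser alone (no dictionaries involved) and label each
-- token with its alias. The input text is illustrative only.
SELECT t.alias, p.token
FROM ts_parse('default', 'beta1 at example.com') AS p
JOIN ts_token_type('default') AS t USING (tokid);
```

ts_parse reports only a numeric token type, so joining it to ts_token_type is one way to get readable aliases without the dictionary details that ts_debug adds.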

Note

The parser's notion of a letter is determined by the database's locale setting, specifically lc_ctype. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them. In most European languages, token types word and asciiword should be treated alike.

email does not support all valid email characters as defined by RFC 5322. Specifically, the only non-alphanumeric characters supported for email user names are period, dash, and underscore.

tag does not support all valid tag names as defined by W3C Recommendation, XML. Specifically, the only tag names supported are those starting with an ASCII letter, underscore, or colon, and containing only letters, digits, hyphens, underscores, periods, and colons. tag also includes XML comments starting with <!-- and ending with -->, and XML declarations (but note that this includes anything starting with <?x and ending with >).
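
The restrictions in this note can be explored with ts_debug. The following is a sketch only: the sample strings are made up for illustration, and the tokens reported depend on the server version and, for non-ASCII letters, on the database's lc_ctype, so no output is shown here.

```sql
-- Which character classification locale does this database use?
SELECT datctype FROM pg_database WHERE datname = current_database();

-- A word with a non-ASCII letter next to a plain ASCII word.
SELECT alias, token FROM ts_debug('mañana elephant');

-- A user name using only supported punctuation, then one with a
-- plus sign, which the note says the email type does not cover.
SELECT alias, token FROM ts_debug('first.last@example.com');
SELECT alias, token FROM ts_debug('first+last@example.com');

-- An XML comment, which the note says is reported as a tag.
SELECT alias, token FROM ts_debug('<!-- hidden --> text');
```

Comparing the two address queries side by side shows how an unsupported character changes where the parser places token boundaries.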

It is possible for the parser to produce overlapping tokens from the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component:

SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
      alias      |               description                |     token
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1

This behavior is desirable since it allows searches to work for both the whole compound word and for components. Here is another instructive example:

SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.html
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.html
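
To see why overlapping tokens help searching, the sketch below builds a document vector from the hyphenated example and queries it by one component and by the whole compound. It assumes the stock english configuration, which this section does not name; results may differ under other configurations, so no output is shown.

```sql
-- The document vector should contain the compound and each of its parts.
SELECT to_tsvector('english', 'foo-bar-beta1');

-- Search by a single component of the hyphenated word.
SELECT to_tsvector('english', 'foo-bar-beta1')
       @@ to_tsquery('english', 'bar') AS matches_part;

-- Search by the whole compound; the query text goes through the same parser.
SELECT to_tsvector('english', 'foo-bar-beta1')
       @@ to_tsquery('english', 'foo-bar-beta1') AS matches_whole;
```

The same approach applies to the URL example: a query for the host alone and a query for the full URL can both be tried against a vector built from the full URL.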

