
12.8. Testing and Debugging Text Search

12.8.1. Configuration Testing
12.8.2. Parser Testing
12.8.3. Dictionary Testing

The behavior of a custom text search configuration can easily become confusing. The functions described in this section are useful for testing text search objects. You can test a complete configuration, or test parsers and dictionaries separately.

12.8.1. Configuration Testing

The function ts_debug allows easy testing of a text search configuration.

ts_debug([ config regconfig, ] document text,
         OUT alias text,
         OUT description text,
         OUT token text,
         OUT dictionaries regdictionary[],
         OUT dictionary regdictionary,
         OUT lexemes text[])
         returns setof record

ts_debug displays information about every token of document as produced by the parser and processed by the configured dictionaries. It uses the configuration specified by config, or default_text_search_config if that argument is omitted.

ts_debug returns one row for each token identified in the text by the parser. The columns returned are:

alias text: short name of the token type
description text: description of the token type
token text: text of the token
dictionaries regdictionary[]: the dictionaries selected by the configuration for this token type
dictionary regdictionary: the dictionary that recognized the token, or NULL if none did
lexemes text[]: the lexeme(s) produced by the dictionary that recognized the token, or NULL if none did; an empty array ({}) means it was recognized as a stop word

Here is a simple example:

SELECT * FROM ts_debug('english', 'a fat  cat sat on a mat - it ate a fat rats');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | cat   | {english_stem} | english_stem | {cat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | sat   | {english_stem} | english_stem | {sat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | on    | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | mat   | {english_stem} | english_stem | {mat}
 blank     | Space symbols   |       | {}             |              |
 blank     | Space symbols   | -     | {}             |              |
 asciiword | Word, all ASCII | it    | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | ate   | {english_stem} | english_stem | {ate}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | rats  | {english_stem} | english_stem | {rat}

For a more extensive demonstration, we first create a public.english configuration and an Ispell dictionary for the English language:

CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );

CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);

ALTER TEXT SEARCH CONFIGURATION public.english
   ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;

SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes');
   alias   |   description   |    token    |         dictionaries          |   dictionary   |   lexemes
-----------+-----------------+-------------+-------------------------------+----------------+-------------
 asciiword | Word, all ASCII | The         | {english_ispell,english_stem} | english_ispell | {}
 blank     | Space symbols   |             | {}                            |                |
 asciiword | Word, all ASCII | Brightest   | {english_ispell,english_stem} | english_ispell | {bright}
 blank     | Space symbols   |             | {}                            |                |
 asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem   | {supernova}

In this example, the word Brightest was recognized by the parser as an ASCII word (alias asciiword). For this token type the dictionary list is english_ispell and english_stem. The word was recognized by english_ispell, which reduced it to the base form bright. The word supernovaes is unknown to the english_ispell dictionary, so it was passed to the next dictionary, which recognized it. (In fact, english_stem is a Snowball dictionary, which recognizes everything; that is why it was placed at the end of the dictionary list.)

The word The was recognized by the english_ispell dictionary as a stop word (Section 12.6.1) and will not be indexed. The spaces are discarded too, since the configuration provides no dictionaries at all for them.

You can reduce the width of the output by explicitly specifying which columns you want to see:

SELECT alias, token, dictionary, lexemes
FROM ts_debug('public.english', 'The Brightest supernovaes');
   alias   |    token    |   dictionary   |   lexemes
-----------+-------------+----------------+-------------
 asciiword | The         | english_ispell | {}
 blank     |             |                |
 asciiword | Brightest   | english_ispell | {bright}
 blank     |             |                |
 asciiword | supernovaes | english_stem   | {supernova}
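
Because ts_debug returns an ordinary row set, its rows can be filtered as well as its columns. The following is a minimal sketch rather than a recommended procedure; it assumes the built-in english configuration is available, and the sample sentence is only illustrative:

```sql
-- Hide whitespace tokens so that only tokens handed to a dictionary remain.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('english', 'a fat  cat sat on a mat')
WHERE alias <> 'blank';

-- List only the tokens that were recognized as stop words
-- (an empty lexemes array, as described in the column list above).
SELECT token
FROM ts_debug('english', 'a fat  cat sat on a mat')
WHERE lexemes = '{}';
```

Substitute your own configuration name and document text when checking a custom configuration.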

12.8.2. Parser Testing

The following functions allow direct testing of a text search parser.

ts_parse(parser_name text, document text,
         OUT tokid integer, OUT token text) returns setof record
ts_parse(parser_oid oid, document text,
         OUT tokid integer, OUT token text) returns setof record

ts_parse parses the given document and returns a series of records, one for each token produced by parsing. Each record includes a tokid showing the assigned token type and a token which is the text of the token. For example:

SELECT * FROM ts_parse('default', '123 - a number');
 tokid | token
-------+--------
    22 | 123
    12 |
    12 | -
     1 | a
    12 |
     1 | number

ts_token_type(parser_name text, OUT tokid integer,
              OUT alias text, OUT description text) returns setof record
ts_token_type(parser_oid oid, OUT tokid integer,
              OUT alias text, OUT description text) returns setof record

ts_token_type returns a table which describes each type of token the specified parser can recognize. For each token type, the table gives the integer tokid that the parser uses to label a token of that type, the alias that names the token type in configuration commands, and a short description. For example:

SELECT * FROM ts_token_type('default');
 tokid |      alias      |               description
-------+-----------------+------------------------------------------
     1 | asciiword       | Word, all ASCII
     2 | word            | Word, all letters
     3 | numword         | Word, letters and digits
     4 | email           | Email address
     5 | url             | URL
     6 | host            | Host
     7 | sfloat          | Scientific notation
     8 | version         | Version number
     9 | hword_numpart   | Hyphenated word part, letters and digits
    10 | hword_part      | Hyphenated word part, all letters
    11 | hword_asciipart | Hyphenated word part, all ASCII
    12 | blank           | Space symbols
    13 | tag             | XML tag
    14 | protocol        | Protocol head
    15 | numhword        | Hyphenated word, letters and digits
    16 | asciihword      | Hyphenated word, all ASCII
    17 | hword           | Hyphenated word, all letters
    18 | url_path        | URL path
    19 | file            | File or path name
    20 | float           | Decimal notation
    21 | int             | Signed integer
    22 | uint            | Unsigned integer
    23 | entity          | XML entity
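
Since ts_parse reports only the numeric tokid, it can be convenient to join its output against ts_token_type to see the alias for each token. This is a sketch under the assumption that the default parser is being tested; the table aliases p and t are arbitrary:

```sql
-- Label each parsed token with the alias of its token type.
SELECT p.tokid, t.alias, p.token
FROM ts_parse('default', '123 - a number') AS p
JOIN ts_token_type('default') AS t USING (tokid);
```

Both functions must be given the same parser for the join to be meaningful, because tokid values are assigned by each parser.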

12.8.3. Dictionary Testing

The ts_lexize function facilitates dictionary testing.

ts_lexize(dict regdictionary, token text) returns text[]

ts_lexize returns an array of lexemes if the input token is known to the dictionary, or an empty array if the token is known to the dictionary but it is a stop word, or NULL if it is an unknown word.

Examples:

SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}
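
To check several tokens in one query, ts_lexize can be applied to each element of an array. The sketch below assumes the english_stem dictionary used in the examples above; the word list is only illustrative:

```sql
-- Run each candidate token through the dictionary, one row per token.
SELECT word, ts_lexize('english_stem', word) AS lexemes
FROM unnest(ARRAY['stars', 'a', 'cats']) AS word;
```

Each array element is still passed to the dictionary as a single token, so this form does not parse phrases.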

Note

The ts_lexize function expects a single token, not text. Here is a case where this can be confusing:

SELECT ts_lexize('thesaurus_astro', 'supernovae stars') is null;
 ?column?
----------
 t

The thesaurus dictionary thesaurus_astro does know the phrase supernovae stars, but ts_lexize fails since it does not parse the input text but treats it as a single token. Use plainto_tsquery or to_tsvector to test thesaurus dictionaries, for example:

SELECT plainto_tsquery('supernovae stars');
 plainto_tsquery
-----------------
 'sn'
