Movatterモバイル変換

F.48. unaccent — a text search dictionary which removes diacritics
Prev	Up	Appendix F. Additional Supplied Modules and Extensions	Home	Next

F.48. unaccent — a text search dictionary which removes diacritics#

F.48.1. Configuration
F.48.2. Usage
F.48.3. Functions

unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes. It's a filtering dictionary, which means its output is always passed to the next dictionary (if any), unlike the normal behavior of dictionaries. This allows accent-insensitive processing for full text search.

The current implementation ofunaccent cannot be used as a normalizing dictionary for thethesaurus dictionary.

This module is considered“trusted”, that is, it can be installed by non-superusers who haveCREATE privilege on the current database.

F.48.1. Configuration#

Anunaccent dictionary accepts the following options:

RULES is the base name of the file containing the list of translation rules. This file must be stored in$SHAREDIR/tsearch_data/ (where$SHAREDIR means thePostgreSQL installation's shared-data directory). Its name must end in.rules (which is not to be included in theRULES parameter).

The rules file has the following format:

Each line represents one translation rule, consisting of a character with accent followed by a character without accent. The first is translated into the second. For example,
```
À        AÁ        AÂ        AÃ        AÄ        AÅ        AÆ        AE
```
The two characters must be separated by whitespace, and any leading or trailing whitespace on a line is ignored.
Alternatively, if only one character is given on a line, instances of that character are deleted; this is useful in languages where accents are represented by separate characters.
Actually, each“character” can be any string not containing whitespace, sounaccent dictionaries could be used for other sorts of substring substitutions besides diacritic removal.
Some characters, like numeric symbols, may require whitespaces in their translation rule. It is possible to use double quotes around the translated characters in this case. A double quote needs to be escaped with a second double quote when including one in the translated character. For example:
```
¼      " 1/4"½      " 1/2"¾      " 3/4"“       """"”       """"
```
As with otherPostgreSQL text search configuration files, the rules file must be stored in UTF-8 encoding. The data is automatically translated into the current database's encoding when loaded. Any lines containing untranslatable characters are silently ignored, so that rules files can contain rules that are not applicable in the current encoding.

A more complete example, which is directly useful for most European languages, can be found inunaccent.rules, which is installed in$SHAREDIR/tsearch_data/ when theunaccent module is installed. This rules file translates characters with accents to the same characters without accents, and it also expands ligatures into the equivalent series of simple characters (for example, Æ to AE).

F.48.2. Usage#

Installing theunaccent extension creates a text search templateunaccent and a dictionaryunaccent based on it. Theunaccent dictionary has the default parameter settingRULES='unaccent', which makes it immediately usable with the standardunaccent.rules file. If you wish, you can alter the parameter, for example

mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');

or create new dictionaries based on the template.

To test the dictionary, you can try:

mydb=# select ts_lexize('unaccent','Hôtel'); ts_lexize----------- {Hotel}(1 row)

Here is an example showing how to insert theunaccent dictionary into a text search configuration:

mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );mydb=# ALTER TEXT SEARCH CONFIGURATION fr        ALTER MAPPING FOR hword, hword_part, word        WITH unaccent, french_stem;mydb=# select to_tsvector('fr','Hôtels de la Mer');    to_tsvector------------------- 'hotel':1 'mer':4(1 row)mydb=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels'); ?column?---------- t(1 row)mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));      ts_headline------------------------ <b>Hôtel</b> de la Mer(1 row)

F.48.3. Functions#

Theunaccent() function removes accents (diacritic signs) from a given string. Basically, it's a wrapper aroundunaccent-type dictionaries, but it can be used outside normal text search contexts.

unaccent([dictionaryregdictionary,]stringtext) returnstext

If thedictionary argument is omitted, the text search dictionary namedunaccent and appearing in the same schema as theunaccent() function itself is used.

For example:

SELECT unaccent('unaccent', 'Hôtel');SELECT unaccent('Hôtel');

Prev	Up	Next
F.47. tsm_system_time — the`SYSTEM_TIME` sampling method for`TABLESAMPLE`	Home	F.49. uuid-ossp — a UUID generator

epub pdf