Hunspell token filter

Provides dictionary stemming based on a provided Hunspell dictionary. The hunspell filter requires configuration of one or more language-specific Hunspell dictionaries.

This filter uses Lucene’s HunspellStemFilter.

Tip

If available, we recommend trying an algorithmic stemmer for your language before using the hunspell token filter. In practice, algorithmic stemmers typically outperform dictionary stemmers. See Dictionary stemmers.

Configure Hunspell dictionaries

Hunspell dictionaries are stored in, and detected from, a dedicated hunspell directory on the filesystem: <$ES_PATH_CONF>/hunspell. Each dictionary is expected to have its own directory, named after its associated language and locale (e.g., pt_BR, en_GB). This dictionary directory is expected to hold a single .aff file and one or more .dic files, all of which are picked up automatically. For example, the following directory layout defines the en_US dictionary:

- config
    |-- hunspell
    |    |-- en_US
    |    |    |-- en_US.dic
    |    |    |-- en_US.aff
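The layout rule described above (a single .aff file and at least one .dic file per locale directory) can be checked before starting a node. The following is a minimal Python sketch and not part of Elasticsearch; the function name check_hunspell_dir is hypothetical.

```python
from pathlib import Path


def check_hunspell_dir(locale_dir):
    """Return a list of layout problems for one locale directory (empty if none)."""
    locale_dir = Path(locale_dir)
    if not locale_dir.is_dir():
        return [f"{locale_dir} is not a directory"]

    # Collect the affix and dictionary files directly inside the directory.
    aff_files = sorted(p.name for p in locale_dir.glob("*.aff"))
    dic_files = sorted(p.name for p in locale_dir.glob("*.dic"))

    problems = []
    if len(aff_files) != 1:
        problems.append(f"expected exactly one .aff file, found {len(aff_files)}")
    if not dic_files:
        problems.append("expected at least one .dic file, found none")
    return problems
```

Call it with the path that corresponds to <$ES_PATH_CONF>/hunspell/en_US on your system; an empty list means the directory matches the expected layout.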

Each dictionary can be configured with one setting:

ignore_case

(Static, Boolean) If true, dictionary matching will be case insensitive. Defaults to false.

This setting can be configured globally in elasticsearch.yml using indices.analysis.hunspell.dictionary.ignore_case.

To configure the setting for a specific locale, use the indices.analysis.hunspell.dictionary.<locale>.ignore_case setting (e.g., for the en_US (American English) locale, the setting is indices.analysis.hunspell.dictionary.en_US.ignore_case).

You can also add a settings.yml file under the dictionary directory which holds these settings. This overrides any other ignore_case settings defined in elasticsearch.yml.
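To make the global and per-locale key names concrete, here is a small Python sketch that renders the corresponding elasticsearch.yml lines. The helper name ignore_case_setting is hypothetical, and the values shown are only examples.

```python
def ignore_case_setting(value, locale=None):
    """Build one elasticsearch.yml line for the ignore_case setting."""
    prefix = "indices.analysis.hunspell.dictionary"
    # With a locale the key is locale-specific; without one it is the global key.
    key = f"{prefix}.{locale}.ignore_case" if locale else f"{prefix}.ignore_case"
    return f"{key}: {'true' if value else 'false'}"


print(ignore_case_setting(False))
# indices.analysis.hunspell.dictionary.ignore_case: false
print(ignore_case_setting(True, locale="en_US"))
# indices.analysis.hunspell.dictionary.en_US.ignore_case: true
```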

Example

The following analyze API request uses the hunspell filter to stem "the foxes jumping quickly" to "the fox jump quick".

The request specifies the en_US locale, meaning that the .aff and .dic files in the <$ES_PATH_CONF>/hunspell/en_US directory are used for the Hunspell dictionary.

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hunspell",
      "locale": "en_US"
    }
  ],
  "text": "the foxes jumping quickly"
}

The filter produces the following tokens:

[ the, fox, jump, quick ]
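The same analyze request can be issued from a script. The sketch below uses only the Python standard library and assumes a cluster reachable at a base URL you supply, without authentication; the function names build_analyze_body and analyze are hypothetical.

```python
import json
import urllib.request


def build_analyze_body(text, locale):
    """Build the analyze API body shown above for a given text and locale."""
    return {
        "tokenizer": "standard",
        "filter": [{"type": "hunspell", "locale": locale}],
        "text": text,
    }


def analyze(base_url, body):
    """POST the body to the analyze API and return the token strings."""
    request = urllib.request.Request(
        f"{base_url}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return [t["token"] for t in json.load(response)["tokens"]]
```

For example, analyze("http://localhost:9200", build_analyze_body("the foxes jumping quickly", "en_US")) sends the request to a local node, assuming one is listening on that address and the en_US dictionary is installed.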

Configurable parameters

dictionary

(Optional, string or array of strings) One or more .dic files (e.g., en_US.dic, my_custom.dic) to use for the Hunspell dictionary.

By default, the hunspell filter uses all .dic files in the <$ES_PATH_CONF>/hunspell/<locale> directory specified using the lang, language, or locale parameter.

dedup

(Optional, Boolean) If true, duplicate tokens are removed from the filter’s output. Defaults to true.

lang

(Required*, string) An alias for the locale parameter.

If this parameter is not specified, the language or locale parameter is required.

language

(Required*, string) An alias for the locale parameter.

If this parameter is not specified, the lang or locale parameter is required.

locale

(Required*, string) Locale directory used to specify the .aff and .dic files for a Hunspell dictionary. See Configure Hunspell dictionaries.

If this parameter is not specified, the lang or language parameter is required.

longest_only

(Optional, Boolean) If true, only the longest stemmed version of each token is included in the output. If false, all stemmed versions of the token are included. Defaults to false.
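The parameter rules above (one of locale, lang, or language is required; the rest are optional with the stated defaults) can be captured in a small helper that builds a filter definition. This is a Python sketch under those assumptions; the function name hunspell_filter is hypothetical and the helper only mirrors the documented defaults client-side.

```python
def hunspell_filter(locale=None, lang=None, language=None,
                    dictionary=None, dedup=True, longest_only=False):
    """Build a hunspell filter definition as a plain dict."""
    # Keep whichever of the three interchangeable locale parameters was given.
    given = {
        name: value
        for name, value in (("locale", locale), ("lang", lang), ("language", language))
        if value is not None
    }
    if not given:
        raise ValueError("one of locale, lang, or language is required")

    definition = {"type": "hunspell", **given}
    if dictionary is not None:
        # A single file name or a list of .dic file names.
        definition["dictionary"] = dictionary
    definition["dedup"] = dedup
    definition["longest_only"] = longest_only
    return definition
```

For example, hunspell_filter(locale="en_US", dictionary=["en_US.dic", "my_custom.dic"], longest_only=True) yields a definition that could be placed under the filter section of an index's analysis settings.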

Customize and add to an analyzer

To customize the hunspell filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following create index API request uses a custom hunspell filter, my_en_US_dict_stemmer, to configure a new custom analyzer.

The my_en_US_dict_stemmer filter uses a locale of en_US, meaning that the .aff and .dic files in the <$ES_PATH_CONF>/hunspell/en_US directory are used. The filter also includes a dedup argument of false, meaning that duplicate tokens added from the dictionary are not removed from the filter’s output.

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "en": {
          "tokenizer": "standard",
          "filter": [ "my_en_US_dict_stemmer" ]
        }
      },
      "filter": {
        "my_en_US_dict_stemmer": {
          "type": "hunspell",
          "locale": "en_US",
          "dedup": false
        }
      }
    }
  }
}

Settings

In addition to the ignore_case settings, you can configure the following global settings for the hunspell filter using elasticsearch.yml: