WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Just as the lowercase token filter is a good starting point for many languages but falls short when exposed to the entire Tower of Babel, so the asciifolding token filter requires a more effective Unicode character-folding counterpart for dealing with the many languages of the world.
The icu_folding token filter (provided by the icu plug-in) does the same job as the asciifolding filter, but extends the transformation to scripts that are not ASCII-based, such as Greek, Hebrew, and Han. It also converts numbers in other scripts into their Latin equivalents and applies various other numeric, symbolic, and punctuation transformations.
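To make the idea of folding concrete, here is a rough Python sketch that uses only the standard library's unicodedata module. It is not the ICU algorithm and covers far fewer cases than icu_folding does. The helper name approximate_fold is hypothetical and chosen purely for illustration.

```python
import unicodedata


def approximate_fold(text: str) -> str:
    """Roughly imitate character folding with the standard library only.

    This is an illustrative approximation, NOT what the icu plug-in runs.
    """
    # NFKD splits accented letters into a base letter plus combining marks
    # and replaces compatibility characters with their plain equivalents.
    decomposed = unicodedata.normalize("NFKD", text)
    folded = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            # Drop combining marks such as accents.
            continue
        value = unicodedata.decimal(ch, None)
        if value is not None:
            # Map a decimal digit from any script to ASCII 0-9.
            folded.append(str(value))
        else:
            folded.append(ch)
    # casefold() is a more aggressive form of lowercasing.
    return "".join(folded).casefold()


if __name__ == "__main__":
    for sample in ("\u0661\u0662\u0663\u0664\u0665", "\u00c9sta", "\u03ac\u03bb\u03c6\u03b1"):
        print(sample, "->", approximate_fold(sample))
```

The samples are Arabic-Indic digits, a Latin word with an accent, and a Greek word with a tonos, which shows that the same loop handles scripts beyond ASCII.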
The icu_folding token filter applies Unicode normalization and case folding from nfkc_cf automatically, so the icu_normalizer is not required:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_folder": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "icu_folding" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_folder
١٢٣٤٥

If there are particular characters that you would like to protect from folding, you can use a UnicodeSet (much like a character class in regular expressions) to specify which Unicode characters may be folded. For instance, to exclude the Swedish letters å, ä, ö, Å, Ä, and Ö from folding, you would specify a character class representing all Unicode characters, except for those letters: [^åäöÅÄÖ] (^ means everything except).
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": {
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "swedish_folding", "lowercase" ]
        }
      }
    }
  }
}
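The effect of restricting folding to a set of characters, followed by a separate lowercase step, can be sketched in Python. This is only an approximation built on the standard library, not the ICU implementation. The names SWEDISH_LETTERS, fold_unprotected, and swedish_analyze_token are hypothetical. The sketch takes the set of protected letters, which is the complement of the set that a UnicodeSet such as [^åäöÅÄÖ] describes.

```python
import unicodedata

# Letters to protect from folding. NFC keeps each one as a single code point
# so that membership tests work regardless of how this file was saved.
SWEDISH_LETTERS = frozenset(unicodedata.normalize("NFC", "åäöÅÄÖ"))


def fold_unprotected(text: str, protected=SWEDISH_LETTERS) -> str:
    """Fold every character except those in `protected` (approximation)."""
    out = []
    # Compose first so that "a" + combining ring is seen as a single "å".
    for ch in unicodedata.normalize("NFC", text):
        if ch in protected:
            # Protected letters pass through untouched, including their case.
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFKD", ch)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(stripped.casefold())
    return "".join(out)


def swedish_analyze_token(token: str) -> str:
    """Fold, then lowercase, mirroring the order of the two filters."""
    return fold_unprotected(token).lower()


if __name__ == "__main__":
    word = "\u00c5landsk Caf\u00e9"
    print(fold_unprotected(word))
    print(swedish_analyze_token(word))
```

Because the protected uppercase letters skip folding, they also skip case folding, which is why the analyzer definition places the lowercase filter after swedish_folding.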