WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.

This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.



Unicode Character Folding




In the same way as the `lowercase` token filter is a good starting point for many languages but falls short when exposed to the entire tower of Babel, so the `asciifolding` token filter requires a more effective Unicode character-folding counterpart for dealing with the many languages of the world.

The `icu_folding` token filter (provided by the `icu` plug-in) does the same job as the `asciifolding` filter, but extends the transformation to scripts that are not ASCII-based, such as Greek, Hebrew, and Han. It also converts numbers in other scripts into their Latin equivalents and applies various other numeric, symbolic, and punctuation transformations.

The `icu_folding` token filter applies Unicode normalization and case folding from `nfkc_cf` automatically, so the `icu_normalizer` is not required:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_folder": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "icu_folding" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_folder
١٢٣٤٥
```

The Arabic-Indic digits `١٢٣٤٥` are folded to their Latin equivalent: `12345`.
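To get a feel for what this kind of folding does, the following is a rough sketch that uses only Python's standard `unicodedata` module. It is not how `icu_folding` is implemented, and the ICU filter covers far more transformations. The function name `approx_fold` is hypothetical. It chains NFKC normalization with case folding (a stand-in for `nfkc_cf`), strips combining marks, and maps decimal digits from other scripts to ASCII digits.

```python
import unicodedata


def approx_fold(text: str) -> str:
    """Rough stdlib approximation of character folding (NOT ICU's algorithm)."""
    # NFKC normalization plus case folding, as a stand-in for nfkc_cf.
    normalized = unicodedata.normalize("NFKC", text).casefold()
    # Decompose so that diacritics become separate combining marks.
    decomposed = unicodedata.normalize("NFD", normalized)

    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            # Drop combining marks such as accents and diaereses.
            continue
        if unicodedata.category(ch) == "Nd":
            # Decimal digit in any script: emit its ASCII equivalent.
            out.append(str(unicodedata.digit(ch)))
        else:
            out.append(ch)
    return "".join(out)


# Arabic-Indic digits one through five, written as escapes for clarity.
print(approx_fold("\u0661\u0662\u0663\u0664\u0665"))  # 12345
# Latin letters with diacritics: U-diaeresis, i-diaeresis, o-diaeresis, e-acute.
print(approx_fold("\u00dcn\u00efc\u00f6d\u00e9"))  # unicode
```

The sketch only shows the general idea of normalizing, case folding, and then mapping characters to simpler equivalents. For real analysis chains, rely on the `icu_folding` filter itself.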

If there are particular characters that you would like to protect from folding, you can use a UnicodeSet (much like a character class in regular expressions) to specify which Unicode characters may be folded. For instance, to exclude the Swedish letters `å`, `ä`, `ö`, `Å`, `Ä`, and `Ö` from folding, you would specify a character class representing all Unicode characters, except for those letters: `[^åäöÅÄÖ]` (`^` means everything except).

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": {
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "swedish_folding", "lowercase" ]
        }
      }
    }
  }
}
```

The `swedish_folding` token filter customizes the `icu_folding` token filter to exclude Swedish letters, both uppercase and lowercase.

The `swedish_analyzer` analyzer first tokenizes words, then folds each token by using the `swedish_folding` filter, and then lowercases each token in case it includes some of the uppercase excluded letters: `Å`, `Ä`, or `Ö`.
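The interplay between the protected set and the trailing `lowercase` filter can be illustrated with a small Python sketch. It is an approximation built on the standard `re` and `unicodedata` modules, not the ICU implementation, and the names `fold_char` and `swedish_fold` are hypothetical. A negated regular-expression character class plays the role of the UnicodeSet: only characters that match it are folded, and a final lowercasing step handles the protected uppercase letters.

```python
import re
import unicodedata

# Letters to protect, normalized to NFC so each one is a single code point.
PROTECTED = unicodedata.normalize("NFC", "åäöÅÄÖ")

# Negated character class: matches every character except the protected ones,
# mirroring the role of the UnicodeSet in the filter settings.
FOLDABLE = re.compile(f"[^{PROTECTED}]")


def fold_char(ch: str) -> str:
    """Approximate folding of one character (assumption: NFKC + casefold + strip marks)."""
    folded = unicodedata.normalize("NFKC", ch).casefold()
    decomposed = unicodedata.normalize("NFD", folded)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def swedish_fold(token: str) -> str:
    """Fold everything except the protected letters, then lowercase the result."""
    composed = unicodedata.normalize("NFC", token)
    # Step 1: fold only the characters matched by the negated class.
    folded = FOLDABLE.sub(lambda m: fold_char(m.group()), composed)
    # Step 2: lowercase, which catches protected uppercase letters.
    return folded.lower()


print(swedish_fold("\u00c4lg"))     # protected A-diaeresis survives, lowercased
print(swedish_fold("\u00c9clair"))  # E-acute is not protected, so it folds to "e"
```

Without the final lowercasing step, a token that starts with a protected uppercase letter would keep that capital, because the folding step never touches it. That is the reason the analyzer lists `lowercase` after `swedish_folding`.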
