Unicode equivalence

From Wikipedia, the free encyclopedia

Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with pre-existing standard character sets, which often included similar or identical characters.

Unicode provides two such notions, canonical equivalence and compatibility. Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (n, LATIN SMALL LETTER N) followed by U+0303 (◌̃, COMBINING TILDE) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (ñ, LATIN SMALL LETTER N WITH TILDE) of the Spanish alphabet. Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other. Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.
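The canonical equivalence of the n plus combining tilde example can be observed directly with Python's standard unicodedata module; a minimal sketch:

```python
import unicodedata

# U+006E LATIN SMALL LETTER N followed by U+0303 COMBINING TILDE
decomposed = "n\u0303"
# U+00F1 LATIN SMALL LETTER N WITH TILDE
precomposed = "\u00f1"

# The two sequences are canonically equivalent: NFC composes the pair
# into the single code point, and NFD splits the single point apart.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```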

Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the typographic ligature "ff") is defined to be compatible, but not canonically equivalent, to the sequence U+0066 U+0066 (two Latin "f" letters). Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
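The distinction shows up in the normal forms: canonical normalization preserves the ligature, while compatibility normalization replaces it. A short sketch with Python's unicodedata module:

```python
import unicodedata

ligature = "\ufb00"  # U+FB00 LATIN SMALL LIGATURE FF

# Canonical normalization (NFC) leaves the ligature alone...
assert unicodedata.normalize("NFC", ligature) == ligature
# ...but compatibility normalization (NFKC) maps it to two "f" letters.
assert unicodedata.normalize("NFKC", ligature) == "ff"
```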

The standard also defines atext normalization procedure, calledUnicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called thenormalization form ornormal form of the original text. For each of the two equivalence notions, Unicode defines two normal forms, onefully composed (where multiple code points are replaced by single points whenever possible), and onefully decomposed (where single points are split into multiple ones).

Sources of equivalence


Character duplication

Main article: Duplicate characters in Unicode

For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the letter "A with a ring diacritic above" is encoded as U+00C5 (Å, LATIN CAPITAL LETTER A WITH RING ABOVE), a letter of the alphabet in Swedish and several other languages, or as U+212B (Å, ANGSTROM SIGN). Yet the symbol for angstrom is defined to be that Swedish letter, and most other symbols that are letters (such as ⟨V⟩ for volt) do not have a separate code point for each usage. In general, the code points of truly identical characters are defined to be canonically equivalent.

Combining and precomposed characters


For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters (such as U+00F1 for "ñ" or U+00C5 for "Å") or as combinations of two or more characters (such as U+FB00 for the ligature "ff" or U+0132 for the Dutch letter "ij").

For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements that are not used on their own, but are meant instead to modify or combine with a preceding base character. Examples of these combining characters are U+0303 (◌̃, COMBINING TILDE) and the Japanese diacritic dakuten (U+3099, ◌゙, COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK).

In the context of Unicode, character composition is the process of replacing the code points of a base letter followed by one or more combining characters with a single precomposed character; character decomposition is the opposite process.
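Both directions are exposed by Python's unicodedata module as normalization forms; a minimal round trip for "é":

```python
import unicodedata

# Decomposition: precomposed "é" (U+00E9) -> base letter + combining acute.
assert unicodedata.normalize("NFD", "\u00e9") == "e\u0301"
# Composition: base letter + combining acute -> precomposed "é".
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
```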

In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur.

Example

"Amélie" with its two canonically equivalent Unicode forms (NFC and NFD):

  NFC characters   A     m     é     l     i     e
  NFC code points  0041  006d  00e9  006c  0069  0065
  NFD characters   A     m     e     ◌́     l     i     e
  NFD code points  0041  006d  0065  0301  006c  0069  0065
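The code point sequences above can be reproduced with Python's unicodedata module; `code_points` here is a hypothetical helper, not a library function:

```python
import unicodedata

def code_points(s):
    # Hypothetical helper: the hex code point of each character in s.
    return [f"{ord(ch):04x}" for ch in s]

nfc = unicodedata.normalize("NFC", "Amélie")
nfd = unicodedata.normalize("NFD", "Amélie")
print(code_points(nfc))  # ['0041', '006d', '00e9', '006c', '0069', '0065']
print(code_points(nfd))  # ['0041', '006d', '0065', '0301', '006c', '0069', '0065']
```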

Typographical non-interaction


Some scripts regularly use multiple combining marks that do not, in general, interact typographically, and do not have precomposed characters for the combinations. Pairs of such non-interacting marks can be stored in either order. These alternative sequences are, in general, canonically equivalent. The rules that define their sequencing in the canonical form also define whether they are considered to interact.

Typographic conventions


Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons (such as ligatures, the half-width katakana characters, or the full-width Latin letters for use in Japanese texts), or to add new semantics without losing the original one (such as digits in subscript or superscript positions, or the circled digits (such as "①") inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.

Encoding errors


UTF-8 and UTF-16 (and also some other Unicode encodings) do not allow all possible sequences of code units. Different software will convert invalid sequences into Unicode characters using varying rules, some of which are very lossy (e.g., turning all invalid sequences into the same character). This can be considered a form of normalization and can lead to the same difficulties as others.

Normalization


Text-processing software implementing Unicode string search and comparison functionality must take the presence of equivalent code points into account. Without this feature, users searching for a particular code point sequence would be unable to find other, visually indistinguishable strings that have a different but canonically equivalent code point representation.
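The pitfall is easy to demonstrate with Python's unicodedata module: a naive code-point comparison misses the equivalence, so both sides must be normalized first.

```python
import unicodedata

composed = "Am\u00e9lie"    # "é" as the single code point U+00E9
decomposed = "Ame\u0301lie"  # "e" + U+0301 COMBINING ACUTE ACCENT

# The strings render identically but compare unequal code point by code point...
assert composed != decomposed
# ...so both operands must be normalized (to the same form) before comparing.
norm = unicodedata.normalize
assert norm("NFC", composed) == norm("NFC", decomposed)
```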

Algorithms


Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the representative element of an equivalence class, multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two compatibility criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a canonical ordering on the code point sequence, which is necessary for the normal forms to be unique.

In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance, some typographic ligatures like U+FB03 (ﬃ), Roman numerals like U+2168 (Ⅸ), and even subscripts and superscripts, e.g. U+2075 (⁵), have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into the constituent letters, so a search for U+0066 (f) as a substring would succeed in an NFKC normalization of U+FB03 but not in an NFC normalization of U+FB03. Likewise when searching for the Latin letter I (U+0049) in the precomposed Roman numeral Ⅸ (U+2168). Similarly, the superscript ⁵ (U+2075) is transformed to 5 (U+0035) by compatibility mapping.
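These search outcomes can be checked directly with Python's unicodedata module:

```python
import unicodedata

ligature = "\ufb03"  # U+FB03 LATIN SMALL LIGATURE FFI

# Canonical normalization keeps the ligature, so no "f" substring is found...
assert "f" not in unicodedata.normalize("NFC", ligature)
# ...while compatibility normalization decomposes it into "ffi".
assert "f" in unicodedata.normalize("NFKC", ligature)

# The Roman numeral and the superscript behave the same way:
assert unicodedata.normalize("NFKC", "\u2168") == "IX"  # U+2168 -> "IX"
assert unicodedata.normalize("NFKC", "\u2075") == "5"   # U+2075 -> "5"
```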

Transforming superscripts into baseline equivalents may not be appropriate, however, for rich text software, because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains compatibility formatting tags that provide additional details on the compatibility transformation.[1] In the case of typographic ligatures, this tag is simply <compat>, while for the superscript it is <super>. Rich text standards like HTML take the compatibility tags into account. For instance, HTML uses its own markup to position a U+0035 in a superscript position.[2]

Normal forms


The four Unicode normalization forms and the algorithms (transformations) for obtaining them are listed in the table below.

NFD (Normalization Form Canonical Decomposition)
    Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
NFC (Normalization Form Canonical Composition)
    Characters are decomposed and then recomposed by canonical equivalence.
NFKD (Normalization Form Compatibility Decomposition)
    Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.
NFKC (Normalization Form Compatibility Composition)
    Characters are decomposed by compatibility, then recomposed by canonical equivalence.
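U+1E9B (LATIN SMALL LETTER LONG S WITH DOT ABOVE) is a well-known input on which all four normal forms produce different results, which makes it a convenient check with Python's unicodedata module:

```python
import unicodedata

s = "\u1e9b"  # U+1E9B LATIN SMALL LETTER LONG S WITH DOT ABOVE
results = {form: unicodedata.normalize(form, s)
           for form in ("NFD", "NFC", "NFKD", "NFKC")}

assert results["NFD"] == "\u017f\u0307"  # long s + combining dot above
assert results["NFC"] == "\u1e9b"        # recomposed to the original
assert results["NFKD"] == "s\u0307"      # ordinary s + combining dot above
assert results["NFKC"] == "\u1e61"       # U+1E61, s with dot above
```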

All these algorithms are idempotent transformations, meaning that a string that is already in one of these normalized forms will not be modified if processed again by the same algorithm.
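Idempotence is easy to verify for any sample string; a sketch using Python's unicodedata module:

```python
import unicodedata

s = "A\u030a\ufb03"  # "A" + combining ring above + ffi ligature
for form in ("NFD", "NFC", "NFKD", "NFKC"):
    once = unicodedata.normalize(form, s)
    # Normalizing an already-normalized string changes nothing.
    assert unicodedata.normalize(form, once) == once
```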

The normal forms are not closed under string concatenation.[3] For defective Unicode strings starting with a Hangul vowel or trailing conjoining jamo, concatenation can break composition.
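For example, two strings that are each in NFC form can concatenate to a string that is not; a sketch with Hangul conjoining jamo using Python's unicodedata module:

```python
import unicodedata

lead = "\u1100"   # U+1100 HANGUL CHOSEONG KIYEOK (leading jamo)
vowel = "\u1161"  # U+1161 HANGUL JUNGSEONG A (vowel jamo)

# Each piece is already in NFC form on its own...
assert unicodedata.normalize("NFC", lead) == lead
assert unicodedata.normalize("NFC", vowel) == vowel
# ...but their concatenation is not: NFC composes the jamo pair
# into the precomposed syllable U+AC00 (가).
assert unicodedata.normalize("NFC", lead + vowel) == "\uac00"
```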

However, they are not injective (they map different original glyphs and sequences to the same normalized sequence) and thus also not bijective (the original cannot be restored). For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "◌̊"), which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
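The angstrom example can be replayed with Python's unicodedata module to show that the original distinction is unrecoverable after normalization:

```python
import unicodedata

angstrom_sign = "\u212b"  # U+212B ANGSTROM SIGN
letter = "\u00c5"         # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

# Both inputs decompose to the same sequence under NFD...
assert unicodedata.normalize("NFD", angstrom_sign) == "A\u030a"
assert unicodedata.normalize("NFD", letter) == "A\u030a"
# ...so after any normalization the two originals are indistinguishable.
assert (unicodedata.normalize("NFC", angstrom_sign)
        == unicodedata.normalize("NFC", letter) == letter)
```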

A single character (other than a Hangul syllable block) that will be replaced by another under normalization can be identified in the Unicode tables by having a non-empty compatibility field but lacking a compatibility tag.

Canonical ordering


The canonical ordering is mainly concerned with the ordering of a sequence of combining characters. For the examples in this section we assume these characters to be diacritics, even though in general some diacritics are not combining characters, and some combining characters are not diacritics.

Unicode assigns each character a combining class, which is identified by a numerical value. Non-combining characters have class number 0, while combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a stable sorting algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically, thus the two possible orders are not considered equivalent.

For example, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent. Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent). The combining classes for the two accents are both 230, thus U+1EBF is not equivalent to U+0065 U+0301 U+0302.

Since not all combining sequences have a precomposed equivalent (the last one in the previous example can only be reduced to U+00E9 U+0302), even the normal form NFC is affected by combining characters' behavior.
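Both observations can be checked with Python's unicodedata module:

```python
import unicodedata

# U+1EBF decomposes to e + circumflex + acute, in that order.
assert unicodedata.normalize("NFD", "\u1ebf") == "e\u0302\u0301"
# Both accents have combining class 230, so the reversed order is a distinct
# string; NFC can only compose e + acute into U+00E9 (é), leaving the
# circumflex as a trailing combining mark.
assert unicodedata.normalize("NFC", "e\u0301\u0302") == "\u00e9\u0302"
```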

Errors due to normalization differences


When two applications share Unicode data, but normalize them differently, errors and data loss can result. In one specific instance, OS X normalized Unicode filenames sent from the Netatalk and Samba file- and printer-sharing software. Netatalk and Samba did not recognize the altered filenames as equivalent to the original, leading to data loss.[4][5] Resolving such an issue is non-trivial, as normalization is not losslessly invertible.

Notes

  1. "UAX #44: Unicode Character Database". Unicode.org. Retrieved 20 November 2014.
  2. "Unicode in XML and other Markup Languages". Unicode.org. Retrieved 20 November 2014.
  3. Per "What should be done about concatenation".
  4. "netatalk / Bugs / #349 volcharset:UTF8 doesn't work from Mac". SourceForge. Retrieved 20 November 2014.
  5. "rsync, samba, UTF8, international characters, oh my!". 2009. Archived from the original on January 9, 2010.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Unicode_equivalence&oldid=1285871304#Normalization"