Precomposed character

From Wikipedia, the free encyclopedia

Compound character with single codepoint

A precomposed character (alternatively composite character or decomposable character) is a multi-glyph entity represented in Unicode by a single codepoint. A precomposed character typically represents a letter with a diacritical mark, such as ⟨é⟩ (U+00E9 é LATIN SMALL LETTER E WITH ACUTE).

The same character can also be created using a sequence of codepoints, one for each of the glyphs that make up the character. For example, in Unicode terms, the letter ⟨é⟩ can be represented directly using U+00E9, or it can be decomposed into an equivalent string consisting of the base letter ⟨e⟩ (U+0065 e LATIN SMALL LETTER E) followed by the combining form of the acute accent (U+0301 ◌́ COMBINING ACUTE ACCENT). Similarly, precomposed ligatures are precompositions of their constituent letters or graphemes, for example U+0133 ĳ LATIN SMALL LIGATURE IJ, used in Dutch.
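The two representations of ⟨é⟩ described above can be converted into each other with Unicode normalization. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `code_points` is an illustrative assumption, not part of any library. It also shows that the ĳ ligature is left unchanged by canonical decomposition (NFD) and is only split by compatibility decomposition (NFKD).

```python
import unicodedata

precomposed = "\u00e9"   # é as a single codepoint
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
ligature = "\u0133"      # LATIN SMALL LIGATURE IJ


def code_points(text):
    """Return the codepoints of text in U+XXXX notation (illustrative helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# NFD splits a precomposed character into base letter + combining mark;
# NFC recombines the sequence into the precomposed form.
nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", decomposed)
print(code_points(nfd))  # U+0065 U+0301
print(code_points(nfc))  # U+00E9

# The ligature has only a compatibility decomposition.
print(code_points(unicodedata.normalize("NFD", ligature)))   # U+0133
print(code_points(unicodedata.normalize("NFKD", ligature)))  # U+0069 U+006A
```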

Precomposed characters are the legacy solution for representing many special letters in various character sets. In Unicode, they were included for compatibility with early encoding systems such as the various parts of ISO 8859 or other kinds of "extended ASCII". More recent Unicode policy has been to resist the creation of new precomposed characters if the character can be produced using combining forms.

Comparing precomposed and decomposed characters

In the following example, the common Swedish surname Åström is written in two alternative ways: the first with a precomposed Å (U+00C5) and ö (U+00F6), and the second with a decomposed base letter A (U+0041) followed by a combining ring above (U+030A) and an o (U+006F) followed by a combining diaeresis (U+0308).

Åström (U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D)
Åström (U+0041 U+030A U+0073 U+0074 U+0072 U+006F U+0308 U+006D)

The two sequences are canonically equivalent and should render identically. In practice, however, some Unicode implementations still have difficulties with decomposed characters. In the worst case, combining diacritics may be disregarded or rendered as unrecognized characters after their base letters, because they are not included in all computer fonts. To work around these problems, some applications simply attempt to replace the decomposed characters with the equivalent precomposed characters.
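The replacement of decomposed characters with precomposed ones corresponds to what Unicode calls Normalization Form C (NFC). The sketch below, which assumes Python's standard `unicodedata` module and a hypothetical helper named `same_text`, shows that the two spellings of the surname differ codepoint by codepoint but compare equal once both are normalized.

```python
import unicodedata

precomposed = "\u00c5str\u00f6m"      # Å and ö as single codepoints
decomposed = "A\u030astro\u0308m"     # A + ring above, o + diaeresis


def same_text(a, b):
    """Compare two strings after normalizing both to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)          # False: different codepoint sequences
print(len(precomposed), len(decomposed))  # 6 8
print(same_text(precomposed, decomposed)) # True: canonically equivalent
```

Normalizing to NFD on both sides would work equally well for comparison; NFC is used here only because it matches the replacement strategy described in the text.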

With an incomplete font, however, precomposed characters may also be problematic, especially if they are more exotic, as in the following example (showing the reconstructed Proto-Indo-European word for "dog"):

ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E)
ḱṷṓn (U+006B U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E)

In some situations, the precomposed k, u and o with diacritics on the first line may render as unrecognized characters, or their typographical appearance may differ markedly from that of the final letter n, which has no diacritic. On the second line, the base letters should at least render correctly even if the combining diacritics are not recognized.
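As a rough illustration of that fallback, the sketch below decomposes the word and drops every combining mark, leaving only the base letters that a renderer without support for the diacritics could still display. It assumes Python's standard `unicodedata` module; the function name `base_letters` is hypothetical, and real text renderers do not necessarily work this way.

```python
import unicodedata

precomposed = "\u1e31\u1e77\u1e53n"               # ḱṷṓn, 4 codepoints
decomposed = "k\u0301u\u032do\u0304\u0301n"       # same word, 8 codepoints


def base_letters(text):
    """Decompose text (NFD) and keep only non-combining characters."""
    nfd = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for characters that are not combining marks
    return "".join(ch for ch in nfd if unicodedata.combining(ch) == 0)


# Both spellings normalize to the same sequences.
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# What remains if the diacritics are ignored entirely.
print(base_letters(precomposed))  # kuon
```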

OpenType has the ccmp "feature tag" to define graphemes that are compositions or decompositions involving combining characters.

Chinese characters

In theory, most Chinese characters as encoded by Han unification and similar schemes could be treated as precomposed characters, since they can be reduced (decomposed) to their constituent radical and phonetic components with Chinese character description languages. Such an approach could reduce the number of characters in the character set from tens of thousands to just a few thousand. On the other hand, a decomposed character set would introduce challenges for searching and editing software and would require more bytes of encoding per document. One particular challenge would be the many-to-many mapping between sets of components and precomposed characters: one precomposed character may be decomposed into several different sets of components, while one set of components may combine into several different precomposed characters. There are no strict requirements or constraints on the relative position of components within a character, on the variant forms and transformations (narrowing, widening, stretching, rotation, etc.) applied to components, or on the number of each component.
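The many-to-many problem can be sketched with a toy lookup table. The table below is hand-written for illustration and is an assumption of this example, not data taken from Unicode or from any character description database. Each entry pairs a layout operator (here the ideographic description character ⿱, "above to below") with a list of components. The function name `by_component_set` is likewise hypothetical.

```python
from collections import defaultdict

# Hand-written illustrative table: character -> possible decompositions.
# Each decomposition is (layout operator, component, component, ...).
DECOMPOSITIONS = {
    "杏": [("⿱", "木", "口")],                      # 木 above 口
    "呆": [("⿱", "口", "木")],                      # 口 above 木
    "章": [("⿱", "立", "早"), ("⿱", "音", "十")],  # two plausible analyses
}


def by_component_set(table):
    """Index characters by their components, ignoring layout and order."""
    index = defaultdict(set)
    for char, variants in table.items():
        for _operator, *components in variants:
            # Sort so that the key does not depend on component order,
            # while still keeping repeated components.
            index[tuple(sorted(components))].add(char)
    return index


index = by_component_set(DECOMPOSITIONS)

# One set of components maps to several characters ...
print(index[tuple(sorted(["木", "口"]))])  # contains both 杏 and 呆

# ... and one character maps to several sets of components.
print(len(DECOMPOSITIONS["章"]))  # 2
```

Without the layout operator and component order, the set {木, 口} alone cannot distinguish 杏 from 呆, which is why a decomposed encoding would need additional positional information.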

Sources

The Unicode Standard, Version 5.2: Conformance (see Section 3.7 for Decomposition). The Unicode Consortium, December 2009.
MSDN: Defining a Character Set. April 8, 2010.
Unicode Normalization Forms (Unicode® Standard Annex #15): http://unicode.org/reports/tr15/

External links

Free Idg Serif, a derivative of the FreeSerif font with added declarations of precomposed characters.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Precomposed_character&oldid=1332743165"

Category:

Unicode

Hidden categories:

[8]ページ先頭

Movatterモバイル変換

Comparing precomposed and decomposed characters

Chinese characters

See also

Sources

External links