CESU-8

From Wikipedia, the free encyclopedia
Encoding scheme for Unicode

The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26.[1] A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (3 bytes per surrogate) for each Unicode supplementary character, while UTF-8 needs only four. Though not specified in the technical report, unpaired surrogates are also encoded as 3 bytes each, and CESU-8 is exactly the same as applying an older UCS-2 to UTF-8 converter to UTF-16 data.
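The procedure described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `cesu8_encode` is hypothetical, and the sketch relies on Python's `surrogatepass` error handler to emit the 3-byte form for surrogate code points.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode a string as CESU-8 (illustrative sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code point (or an unpaired surrogate): same bytes as UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary character: split into a UTF-16 surrogate pair,
            # then encode each surrogate as a 3-byte UTF-8 sequence.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    print(cesu8_encode("\U00010400").hex(" "))    # six bytes in CESU-8
    print("\U00010400".encode("utf-8").hex(" "))  # four bytes in UTF-8
```

For text that contains only BMP characters, the output is byte-for-byte identical to UTF-8; the two encodings differ only for supplementary characters and surrogates.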

The encoding of a Unicode non-BMP character works out to 11101101 1010yyyy 10xxxxxx 11101101 1011xxxx 10xxxxxx, where yyyy represents the top five bits of the code point minus one. The byte values 0xF0 through 0xF4 never appear in CESU-8, because they start the 4-byte encodings used by UTF-8.
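The bit layout can also be built directly from the code point without forming the surrogates first. The following is a sketch under that reading of the pattern; the function name `cesu8_supplementary` is hypothetical.

```python
def cesu8_supplementary(cp: int) -> bytes:
    """Build the six CESU-8 bytes for a supplementary code point (sketch)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a code point in U+10000..U+10FFFF")
    yyyy = (cp >> 16) - 1  # top five bits minus one; always fits in 4 bits
    return bytes([
        0xED,                        # 11101101
        0xA0 | yyyy,                 # 1010yyyy
        0x80 | ((cp >> 10) & 0x3F),  # 10xxxxxx (bits 15..10)
        0xED,                        # 11101101
        0xB0 | ((cp >> 6) & 0x0F),   # 1011xxxx (bits 9..6)
        0x80 | (cp & 0x3F),          # 10xxxxxx (bits 5..0)
    ])


if __name__ == "__main__":
    print(cesu8_supplementary(0x10400).hex(" ").upper())
```

Every byte produced this way is either 0xED or a continuation byte in the range 0x80 to 0xBF, which is why the lead bytes 0xF0 through 0xF4 cannot occur.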

CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only.[2] It should be used exclusively for internal processing and never for external data exchange. Supporting CESU-8 in HTML is prohibited by the W3C[3] and WHATWG.[4]

Java's Modified UTF-8 is CESU-8 with a special overlong encoding of NUL (U+0000) as the two-byte sequence C0 80.[5] The Oracle database uses CESU-8 for its UTF8 character set, while standard UTF-8 is called AL32UTF8 (since Oracle version 9.0).[6]
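The difference from plain CESU-8 can be illustrated with a short Python sketch. The function name `modified_utf8_encode` is hypothetical, and the sketch covers only the byte encoding of the characters; it does not add the two-byte length prefix that Java's `DataOutput.writeUTF` writes before the data.

```python
def modified_utf8_encode(text: str) -> bytes:
    """Encode a string in the style of Java's Modified UTF-8 (sketch)."""
    out = bytearray()
    # Walk the string as UTF-16 code units, so that supplementary
    # characters arrive as surrogate pairs, as they do in CESU-8.
    units = text.encode("utf-16-be", "surrogatepass")
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        if unit == 0:
            # U+0000 becomes the overlong two-byte sequence C0 80,
            # so the output never contains a zero byte.
            out += b"\xc0\x80"
        else:
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    print(modified_utf8_encode("A\x00B").hex(" ").upper())
```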

Examples

Code point   U+0045 ⟨E⟩   U+0205 ⟨ȅ⟩   U+10400 ⟨𐐀⟩
UTF-8        45           C8 85        F0 90 90 80
UTF-16       0045         0205         D801 DC00
CESU-8       45           C8 85        ED A0 81 ED B0 80

References

  1. McGowan, Rick (2011-12-19). "Unicode Technical Report #26 - Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)". Unicode Consortium.
  2. "About Unicode Technical Reports - Types of Unicode Technical Reports: UAX, UTS, UTR". Unicode Consortium.
  3. "8.2.2.3. Character encodings". HTML 5.1 Standard. W3C.
  4. "12.2.3.3 Character encodings". HTML Living Standard. WHATWG.
  5. "Java SE documentation for Interface java.io.DataInput, subsection on Modified UTF-8". Oracle Corporation. 2015. Retrieved 2021-04-30.
  6. "Table A-10 Universal Character Sets".
