The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26.[1] A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (three bytes per surrogate) for each Unicode supplementary character, while UTF-8 needs only four. Though not specified in the technical report, unpaired surrogates are also encoded as three bytes each, and CESU-8 is exactly the same as applying an older UCS-2 to UTF-8 converter to UTF-16 data.
The encoding of Unicode non-BMP characters works out to 11101101 1010yyyy 10xxxxxx 11101101 1011xxxx 10xxxxxx, where yyyy is the value of the top five bits of the code point minus one and the x bits are the remaining sixteen bits of the code point in order. The byte values 0xF0 through 0xF4 will not appear in CESU-8, as they start the 4-byte encodings used by UTF-8.
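
The bit layout above can be sketched as a small encoder. This is a minimal illustration, not a reference implementation: the function name `cesu8_encode` is hypothetical, and it relies on Python's `surrogatepass` error handler so that BMP code points, including unpaired surrogates, go through the ordinary UTF-8 rules.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode a Python string as CESU-8 (illustrative sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points are encoded exactly as in UTF-8.
            # "surrogatepass" lets unpaired surrogates through as 3 bytes.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Subtracting 0x10000 leaves 20 bits; this is the
            # "top five bits minus one" step described in the text.
            v = cp - 0x10000
            out += bytes([
                0xED, 0xA0 | (v >> 16), 0x80 | ((v >> 10) & 0x3F),       # high surrogate
                0xED, 0xB0 | ((v >> 6) & 0x0F), 0x80 | (v & 0x3F),       # low surrogate
            ])
    return bytes(out)


if __name__ == "__main__":
    print(cesu8_encode("\U00010400").hex(" "))
```

Writing the six bytes directly, rather than building the surrogate pair first, keeps the correspondence with the bit pattern visible; both routes give the same bytes.
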
CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only.[2] It should be used exclusively for internal processing and never for external data exchange. Supporting CESU-8 in HTML is prohibited by the W3C[3] and WHATWG.[4]
Java's Modified UTF-8 is CESU-8 with a special overlong encoding of NUL (U+0000) as the two-byte sequence C0 80.[5] The Oracle database uses CESU-8 for its UTF8 character set, while standard UTF-8 is called AL32UTF8 (since Oracle version 9.0).[6]
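
As a rough illustration of the difference, the sketch below encodes a string the way the paragraph above describes Modified UTF-8: each UTF-16 code unit is encoded separately, and U+0000 becomes C0 80. The function name `modified_utf8_encode` is hypothetical, and the sketch does not reproduce any length prefix or other framing that a particular Java API may add around the bytes.

```python
def modified_utf8_encode(text: str) -> bytes:
    """Encode a string as CESU-8 with NUL written as C0 80 (sketch)."""
    # View the string as a sequence of 16-bit UTF-16 code units.
    units = text.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        if unit == 0:
            # Overlong two-byte form, so no 0x00 byte appears in the output.
            out += b"\xc0\x80"
        else:
            # Every other code unit, surrogates included, uses UTF-8 rules.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    print(modified_utf8_encode("A\x00B").hex(" "))
```
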
| Code point | U+0045 ⟨E⟩ | U+0205 ⟨ȅ⟩ | U+10400 ⟨𐐀⟩ |
|---|---|---|---|
| UTF-8 | 45 | C8 85 | F0 90 90 80 |
| UTF-16 | 0045 | 0205 | D801 DC00 |
| CESU-8 | 45 | C8 85 | ED A0 81 ED B0 80 |
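
The CESU-8 row of the table can be checked with a small decoder. This is a sketch under stated assumptions: the name `cesu8_decode` is hypothetical, and the approach (decode with `surrogatepass`, then round-trip through UTF-16 so that adjacent surrogates merge) is one convenient way to do it in Python, not a method taken from the technical report.

```python
def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes into a Python string (illustrative sketch)."""
    # Lead bytes 0xF0..0xF4 start 4-byte UTF-8 sequences, which CESU-8 never uses.
    if any(0xF0 <= b <= 0xF4 for b in data):
        raise ValueError("4-byte UTF-8 sequence is not valid CESU-8")
    # Step 1: decode the 1- to 3-byte sequences; surrogates come out
    # as separate code points.
    units = data.decode("utf-8", "surrogatepass")
    # Step 2: a round trip through UTF-16 joins each high/low pair into
    # one supplementary character; unpaired surrogates are kept as they are.
    return units.encode("utf-16-le", "surrogatepass").decode(
        "utf-16-le", "surrogatepass"
    )


if __name__ == "__main__":
    text = cesu8_decode(bytes.fromhex("45 C8 85 ED A0 81 ED B0 80"))
    print([f"U+{ord(c):04X}" for c in text])
```
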