Comparison of Unicode encodings

From Wikipedia, the free encyclopedia


This article compares Unicode encodings in two types of environments: 8-bit clean environments, and environments that forbid the use of byte values with the high bit set. Originally, such prohibitions allowed for links that used only seven data bits, but they remain in some standards, so some standard-conforming software must generate messages that comply with the restrictions. The Standard Compression Scheme for Unicode and the Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.

Compatibility issues


A UTF-8 file that contains only ASCII characters is identical to an ASCII file. Legacy programs can generally handle UTF-8-encoded files, even if they contain non-ASCII characters. For instance, the C printf function can print a UTF-8 string because it only looks for the ASCII '%' character to define a formatting string. All other bytes are printed unchanged.
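
The effect can be checked directly by comparing byte sequences. The following Python sketch (the sample strings are arbitrary) shows that ASCII-only text has identical bytes in both encodings, and that non-ASCII UTF-8 text survives byte-oriented handling:

    # ASCII-only text encodes to the same bytes under ASCII and UTF-8.
    ascii_text = "Hello, world!"
    assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

    # A byte-oriented routine that treats only ASCII bytes specially (as
    # printf does with '%') can pass UTF-8 multi-byte sequences through intact.
    utf8 = "café".encode("utf-8")        # b'caf\xc3\xa9'
    assert utf8.decode("utf-8") == "café"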

UTF-16 and UTF-32 are incompatible with ASCII files, and thus require Unicode-aware programs to display, print, and manipulate them, even if the file is known to contain only characters in the ASCII subset. Because they contain many zero bytes, character strings representing such files cannot be manipulated by common null-terminated string handling logic.[a] The prevalence of string handling using this logic means that, even in the context of UTF-16 systems such as Windows and Java, UTF-16 text files are not commonly used. Rather, older 8-bit encodings such as ASCII or ISO-8859-1 are still used, forgoing Unicode support entirely, or UTF-8 is used for Unicode. One rare counter-example is the "strings" file introduced in Mac OS X 10.3 Panther, which is used by applications to look up internationalized versions of messages. By default, this file is encoded in UTF-16, with "files encoded using UTF-8 ... not guaranteed to work."[1]
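
The embedded zero bytes are easy to observe. A minimal Python illustration (the string "ABC" is arbitrary):

    text = "ABC"
    print(text.encode("utf-16-le"))   # b'A\x00B\x00C\x00' – a NUL after each letter
    print(text.encode("utf-32-le"))   # three NUL bytes per character
    # A C-style null-terminated reader would stop at the first zero byte:
    print(text.encode("utf-16-le").split(b"\x00", 1)[0])   # b'A'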

XML is conventionally encoded as UTF-8, and all XML processors must at least support UTF-8 and UTF-16.[2]

Efficiency


UTF-8 requires 8, 16, 24 or 32 bits (one to four bytes) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character, and UTF-32 always requires 32 bits to encode a character.

The first 128 Unicode code points, U+0000 to U+007F, which are used for the C0 Controls and Basic Latin characters and which correspond to ASCII, are encoded using 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32. The next 1,920 characters, U+0080 to U+07FF, cover the rest of the characters used by almost all Latin-script alphabets as well as Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko. Characters in this range require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. The range U+0800 to U+FFFF holds the remaining characters of the Basic Multilingual Plane and covers the rest of the characters of most of the world's living languages; here UTF-8 needs 24 bits to encode a character, while UTF-16 needs 16 bits and UTF-32 needs 32. Code points U+010000 to U+10FFFF, which represent characters in the supplementary planes, require 32 bits in UTF-8, UTF-16 and UTF-32.
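
These sizes can be verified in any Unicode-capable language; for example, in Python, with one arbitrary sample character from each range:

    for ch in ("A", "é", "中", "😀"):   # U+0041, U+00E9, U+4E2D, U+1F600
        print(f"U+{ord(ch):04X}:",
              len(ch.encode("utf-8")), "bytes UTF-8,",
              len(ch.encode("utf-16-le")), "bytes UTF-16,",
              len(ch.encode("utf-32-le")), "bytes UTF-32")
    # Prints 1/2/4, 2/2/4, 3/2/4 and 4/4/4 bytes respectively.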

A file is shorter in UTF-8 than in UTF-16 if there are more ASCII code points than there are code points in the range U+0800 to U+FFFF. Advocates of UTF-8 as the preferred form argue that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8, due to the extensive use of spaces, digits, punctuation, newlines, HTML or XML markup (e.g. in docx or odt files), and embedded words and acronyms written with Latin letters.[3] UTF-32, by contrast, is always longer than UTF-8 unless the text contains no code points below U+10000.
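
The comparison is easy to run on a concrete document. A Python sketch with an arbitrary sample of Cyrillic text wrapped in ASCII markup:

    line = "<p>один два три</p>\n"        # Cyrillic content, ASCII markup
    doc = line * 100
    print(len(doc.encode("utf-8")))        # 3000 bytes
    print(len(doc.encode("utf-16-le")))    # 4000 bytes – UTF-8 is shorter here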

All printable characters in UTF-EBCDIC use at least as many bytes as in UTF-8, and most use more, due to a decision made to allow encoding the C1 control codes as single bytes. For seven-bit environments, UTF-7 is more space-efficient than the combination of other Unicode encodings with quoted-printable or base64 for almost all types of text (see "Seven-bit environments" below).

Processing time


Text with variable-length encoding such as UTF-8 or UTF-16 is harder to process if there is a need to work with individual code units as opposed to working with code points. Searching is unaffected by whether the characters are variably sized, since a search for a sequence of code units does not care about the divisions. However, it does require that the encoding be self-synchronizing, which both UTF-8 and UTF-16 are. A common misconception is that there is a need to "find the nth character" and that this requires a fixed-length encoding; however, in real use the number n is only derived from examining the n−1 preceding characters, so sequential access is needed anyway.
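
Byte-level search illustrates this. In the Python sketch below (the sample text is arbitrary), the UTF-8 needle is found by plain byte matching, and self-synchronization guarantees that a valid match cannot begin in the middle of another character:

    haystack = "αβγ ДЕЖ abc".encode("utf-8")
    needle = "ДЕЖ".encode("utf-8")
    print(haystack.find(needle))   # 7 – a byte offset, found without decoding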

Processing issues


For processing, a format should be easy to search, truncate, and generally process safely. All normal Unicode encodings use some form of fixed-size code unit. Depending on the format and the code point to be encoded, one or more of these code units will represent a Unicode code point. To allow easy searching and truncation, a sequence must not occur within a longer sequence or across the boundary of two other sequences. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties, but UTF-7 and GB 18030 do not.
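
The difference is observable with GB 18030, where an ASCII digit byte can occur inside a longer sequence. A Python sketch (U+0080 is an arbitrary character with a four-byte GB 18030 form):

    encoded = "\u0080".encode("gb18030")
    print(encoded)                 # b'\x810\x810' – a four-byte sequence
    print(b"0" in encoded)         # True: a naive search for '0' matches inside it
    print(b"0" in "\u0080".encode("utf-8"))   # False: UTF-8 never embeds ASCII bytes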

Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character due to combining characters. Considering these incompatibilities and other quirks among different encoding schemes, handling Unicode data with the same (or compatible) protocol throughout and across the interfaces (e.g. using an API/library, handling Unicode characters in a client/server model, etc.) can in general simplify the whole pipeline while simultaneously eliminating a potential source of bugs.
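
For example, in Python, with a combining acute accent (an arbitrary choice):

    single = "é"                # U+00E9: one code point
    combined = "e\u0301"        # 'e' plus U+0301: two code points, one grapheme
    print(len(single.encode("utf-32-le")),
          len(combined.encode("utf-32-le")))   # 4 8 – same displayed character
    import unicodedata
    print(unicodedata.normalize("NFC", combined) == single)   # True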

UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width (referred to as UCS-2). However, using UTF-16 makes characters outside the Basic Multilingual Plane a special case, which increases the risk of oversights related to their handling. That said, programs that mishandle surrogate pairs probably also have problems with combining sequences, so using UTF-32 is unlikely to solve the more general problem of poor handling of multi-code-unit characters.
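
The special case appears whenever a supplementary-plane character is encoded. A Python sketch with an arbitrary emoji:

    ch = "\U0001F600"                  # U+1F600, outside the BMP
    units = ch.encode("utf-16-le")
    print(len(units) // 2)             # 2 – a surrogate pair of 16-bit code units
    print(units.hex())                 # '3dd800de' – 0xD83D then 0xDE00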

If any stored data is in UTF-8 (such as file contents or names), it is very difficult to write a system that uses UTF-16 or UTF-32 as an API. This is due to the oft-overlooked fact that the byte array used by UTF-8 can physically contain invalid sequences. For instance, it is impossible to fix an invalid UTF-8 filename using a UTF-16 API, as no possible UTF-16 string will translate to that invalid filename. The opposite is not true: it is trivial to translate invalid UTF-16 to a unique (though technically invalid) UTF-8 string, so a UTF-8 API can control both UTF-8 and UTF-16 files and names, making UTF-8 preferred in any such mixed environment. An unfortunate but far more common workaround used by UTF-16 systems is to interpret the UTF-8 as some other encoding such as CP-1252 and ignore the mojibake for any non-ASCII data.
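
Python's surrogateescape error handler is one concrete mechanism in this spirit (not the only one); the filename below is an arbitrary invalid byte string:

    bad = b"caf\xff.txt"                               # not valid UTF-8
    name = bad.decode("utf-8", errors="surrogateescape")
    assert name.encode("utf-8", errors="surrogateescape") == bad  # round-trips
    try:
        name.encode("utf-16-le")          # strict UTF-16 has no code unit
    except UnicodeEncodeError:            # for the stray 0xFF byte
        print("cannot represent the invalid byte in UTF-16")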

For communication and storage


UTF-16 and UTF-32 do not have endianness defined, so a byte order must be selected when receiving them over a byte-oriented network or reading them from byte-oriented storage. This may be achieved by using a byte-order mark at the start of the text or by assuming big-endian (RFC 2781). UTF-8, UTF-16BE, UTF-32BE, UTF-16LE and UTF-32LE are standardised on a single byte order and do not have this problem.
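
The byte-order mark and the explicit-endian variants can be compared in Python:

    import codecs
    print("A".encode("utf-16-le").hex())   # '4100' – low byte first, no BOM
    print("A".encode("utf-16-be").hex())   # '0041' – high byte first, no BOM
    data = "A".encode("utf-16")            # the generic codec prefixes a BOM
    print(data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))  # True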

If the byte stream is subject to corruption then some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard, as they can always resynchronize at the start of the next code point after a corrupt or missing byte; GB 18030 is unable to recover until the next ASCII non-number. UTF-16 can handle altered bytes, but not an odd number of missing bytes, which will garble all the following text (though it will produce uncommon and/or unassigned characters).[b] If bits can be lost, all of these encodings will garble the following text, though UTF-8 can be resynchronized, as incorrect byte boundaries will produce invalid UTF-8 in almost all text longer than a few bytes.
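
UTF-8's resynchronization after a missing byte can be demonstrated directly (a Python sketch with arbitrary Greek text):

    good = "αβγδ".encode("utf-8")     # 8 bytes, 2 per letter
    corrupt = good[:1] + good[2:]     # drop one continuation byte
    print(corrupt.decode("utf-8", errors="replace"))   # '�βγδ' – one letter lost,
                                                       # the rest decodes normally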

In detail

The tables below list numbers of bytes per code point, not per user-visible "character" (or "grapheme cluster"). It can take multiple code points to describe a single grapheme cluster, so even in UTF-32, care must be taken when splitting or concatenating strings.

The tables below list the number of bytes per code point for different Unicode ranges. Any additional comments needed are included in the table. The figures assume that overheads at the start and end of the block of text are negligible.

Eight-bit environments

Code range (hexadecimal) | UTF-8 | UTF-16 | UTF-32 | UTF-EBCDIC | GB 18030
000000 – 00007F | 1 | 2 | 4 | 1 | 1
000080 – 00009F | 2 | 2 | 4 | 1 | 2 for characters inherited from GB 2312/GBK (e.g. most Chinese characters), 4 for everything else
0000A0 – 0003FF | 2 | 2 | 4 | 2 | (as above)
000400 – 0007FF | 2 | 2 | 4 | 3 | (as above)
000800 – 003FFF | 3 | 2 | 4 | 3 | (as above)
004000 – 00FFFF | 3 | 2 | 4 | 4 | (as above)
010000 – 03FFFF | 4 | 4 | 4 | 4 | 4
040000 – 10FFFF | 4 | 4 | 4 | 5 | 4

Seven-bit environments


This table may not cover every special case and so should be used for estimation and comparison only. To accurately determine the size of text in an encoding, see the actual specifications.

Code range (hexadecimal) | UTF-7 | UTF-8 quoted-printable | UTF-8 base64 | UTF-16 q.-p. | UTF-16 base64 | GB 18030 q.-p. | GB 18030 base64
ASCII graphic characters (except U+003D "=") | 1 for "direct characters" (depends on the encoder setting for some code points), 2 for U+002B "+", otherwise same as for 000080 – 00FFFF | 1 | 1+1/3 | 4 | 2+2/3 | 1 | 1+1/3
00003D (equals sign) | (as above) | 3 | 1+1/3 | 6 | 2+2/3 | 3 | 1+1/3
ASCII control characters: 000000 – 00001F and 00007F | (as above) | 1 or 3 depending on directness | 1+1/3 | 6 | 2+2/3 | 1 or 3 depending on directness | 1+1/3
000080 – 0007FF | 5 for an isolated case inside a run of single-byte characters; for runs, 2+2/3 per character plus padding to make it a whole number of bytes, plus two to start and finish the run | 6 | 2+2/3 | 2–6 depending on if the byte values need to be escaped | 2+2/3 | 4–6 for characters inherited from GB 2312/GBK (e.g. most Chinese characters), 8 for everything else | 2+2/3 for characters inherited from GB 2312/GBK (e.g. most Chinese characters), 5+1/3 for everything else
000800 – 00FFFF | (as above) | 9 | 4 | (as above) | 2+2/3 | (as above) | (as above)
010000 – 10FFFF | 8 for an isolated case; 5+1/3 per character plus padding to a whole number of bytes, plus 2 for a run | 12 | 5+1/3 | 8–12 depending on if the low bytes of the surrogates need to be escaped | 5+1/3 | 8 | 5+1/3

Endianness does not affect sizes (UTF-16BE and UTF-32BE have the same size as UTF-16LE and UTF-32LE, respectively). The use of UTF-32 under quoted-printable is highly impractical, but if implemented it will result in 8–12 bytes per code point (about 10 bytes on average); for the BMP, each code point will occupy exactly 6 bytes more than the same code point in quoted-printable/UTF-16. Base64/UTF-32 needs 5+1/3 bytes for any code point.

An ASCII control character under quoted-printable or UTF-7 may be represented either directly or encoded (escaped). The need to escape a given control character depends on many circumstances, but newlines in text data are usually coded directly.
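
The overheads in the table can be reproduced with the standard quoted-printable and Base64 codecs; for example, in Python (the sample word is arbitrary):

    import base64, quopri
    text = "résumé"
    for enc in ("utf-8", "utf-16-le", "gb18030"):
        raw = text.encode(enc)
        print(enc, len(raw),
              len(quopri.encodestring(raw)),   # quoted-printable expansion
              len(base64.b64encode(raw)))      # uniform 4/3 expansion plus padding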

Compression schemes


BOCU-1 and SCSU are two ways to compress Unicode data. Their encoding exploits the fact that most runs of text use characters from the same script: for example, Latin, Cyrillic or Greek. This allows many runs of text to compress down to about 1 byte per code point. Because these encodings are stateful, they make it more difficult to randomly access text at an arbitrary position in a string.

These two compression schemes are not as efficient as general-purpose compression schemes such as zip or bzip2, which can compress longer runs of bytes to just a few bytes. SCSU and BOCU-1 will not compress text below the theoretical 25% of its size as encoded in UTF-8, UTF-16 or UTF-32, whereas general-purpose compression schemes can easily compress text to 10% of its original size. The general-purpose schemes require more complicated algorithms and longer chunks of text for a good compression ratio.
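
The gap can be measured with the general-purpose compressors in the Python standard library (SCSU and BOCU-1 have no standard-library codec; the sample text is arbitrary and ratios vary with input):

    import bz2, zlib
    sample = ("Unicode text, mostly Latin with some Ελληνικά. " * 200).encode("utf-8")
    print(len(sample))                    # original size in bytes
    print(len(zlib.compress(sample, 9)))  # deflate, as used by zip
    print(len(bz2.compress(sample)))      # bzip2 – repetitive text shrinks far below 25%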

Unicode Technical Note #14 contains a more detailed comparison of compression schemes.

Historical: UTF-5 and UTF-6


Proposals have been made for a UTF-5 and UTF-6 for the internationalization of domain names (IDN). The UTF-5 proposal used a base 32 encoding, where Punycode is (among other things, and not exactly) a base 36 encoding. The name UTF-5 for a code unit of 5 bits is explained by the equation 2^5 = 32.[4] The UTF-6 proposal added a run-length encoding to UTF-5; here 6 simply stands for UTF-5 plus 1.[5] The IETF IDN WG later adopted the more efficient Punycode for this purpose.[6]

Not being seriously pursued


UTF-1 never gained serious acceptance. UTF-8 is much more frequently used.

The nonet encodings UTF-9 and UTF-18 are April Fools' Day RFC joke specifications, although UTF-9 is a functioning nonet Unicode transformation format, and UTF-18 is a functioning nonet encoding for all non-Private-Use code points in Unicode 12 and below (though not for the Supplementary Private Use Areas or portions of Unicode 13 and later).

Notes

  1. ^ ASCII software not using null characters to terminate strings would handle UTF-16 and UTF-32 encoded files correctly (such files, if containing only ASCII-subset characters, would appear as normal ASCII padded with null characters), but such software is not common.
  2. ^ An even number of missing bytes in UTF-16, in contrast, will garble at most one character.

References

  1. ^"Apple Developer Connection: Internationalization Programming Topics: Strings Files".
  2. ^"Character Encoding in Entities".Extensible Markup Language (XML) 1.0 (Fifth Edition).World Wide Web Consortium. 2008.
  3. ^"UTF-8 Everywhere".utf8everywhere.org. Retrieved28 August 2022.
  4. ^Seng, James,UTF-5, a transformation format of Unicode and ISO 10646, 28 January 2000
  5. ^Welter, Mark; Spolarich, Brian W. (16 November 2000)."UTF-6 - Yet Another ASCII-Compatible Encoding for ID".Ietf Datatracker.Archived from the original on 23 May 2016. Retrieved9 April 2016.
  6. ^"Internationalized Domain Name (idn)". Internet Engineering Task Force. Retrieved20 March 2023.