| MIME / IANA | ISO-10646-UTF-1 |
|---|---|
| Language | International |
| Current status | Obscure, of mainly historical interest. |
| Classification | Unicode Transformation Format, extended ASCII, variable-width encoding |
| Extends | US-ASCII |
| Transforms / Encodes | ISO/IEC 10646 (Unicode) |
| Succeeded by | UTF-8 |
UTF-1 is an obsolete method of transforming ISO/IEC 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes searching for substrings and error recovery difficult. It reuses the ASCII printing characters for multi-byte encodings, making it unsuited for some uses (for instance, Unix filenames cannot contain the byte value used for forward slash). UTF-1 is also slow to encode or decode due to its use of division and multiplication by a number which is not a power of 2. Due to these issues, it did not gain acceptance and was quickly replaced by UTF-8.
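The reuse of ASCII printing bytes can be illustrated with two byte sequences taken from the comparison table in this article. The following is only a small sketch; the variable names are chosen for this illustration.

```python
# Byte sequences copied from the comparison table in this article.
utf1_u0100 = bytes.fromhex("A1 21")     # UTF-1 for U+0100
utf1_ud7ff = bytes.fromhex("F7 2F C3")  # UTF-1 for U+D7FF
utf8_u0100 = bytes.fromhex("C4 80")     # UTF-8 for U+0100
utf8_ud7ff = bytes.fromhex("ED 9F BF")  # UTF-8 for U+D7FF

# A naive byte search for "!" (0x21) matches inside the UTF-1 form of U+0100,
# even though the text contains no exclamation mark.
false_bang = b"!" in utf1_u0100

# The UTF-1 form of U+D7FF contains 0x2F, the byte a Unix path parser
# treats as a directory separator.
false_slash = b"/" in utf1_ud7ff

# UTF-8 never uses bytes below 0x80 inside a multi-byte sequence.
utf8_clean = b"!" not in utf8_u0100 and b"/" not in utf8_ud7ff

print(false_bang, false_slash, utf8_clean)
```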
Similar to UTF-8, UTF-1 is a variable-width encoding that is backwards-compatible with ASCII. Every Unicode code point is represented by either a single byte, or a sequence of two, three, or five bytes. All ASCII code points are a single byte (the code points U+0080 through U+009F are also single bytes).
UTF-1 does not use the C0 and C1 control codes or the space character in multi-byte encodings: a byte in the range 0x00–0x20 or 0x7F–0x9F always stands for the corresponding code point. This design, with 66 protected characters, was intended to be compatible with ISO/IEC 2022.
UTF-1 uses "modulo 190" arithmetic (256 − 66 = 190). For comparison, UTF-8 protects all 128 ASCII characters and needs one bit for this, and a second bit to make it self-synchronizing, resulting in "modulo 64" arithmetic (8 − 2 = 6; 2⁶ = 64). BOCU-1 protects only the minimal set required for MIME-compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243).
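The three moduli can be recomputed from the protected byte sets described above. This is a quick arithmetic check rather than part of any encoder; the set names are made up for the illustration.

```python
# Bytes that never appear inside a UTF-1 multi-byte sequence:
# 0x00-0x20 (C0 controls and space) and 0x7F-0x9F (DEL and C1 controls).
utf1_protected = set(range(0x00, 0x21)) | set(range(0x7F, 0xA0))

# Bytes protected by BOCU-1: 0x00, 0x07-0x0F, 0x1A-0x1B and 0x20.
bocu1_protected = {0x00, 0x20} | set(range(0x07, 0x10)) | set(range(0x1A, 0x1C))

utf1_modulus = 256 - len(utf1_protected)
bocu1_modulus = 256 - len(bocu1_protected)

# UTF-8 spends two bits of each continuation byte on marking, leaving six.
utf8_modulus = 2 ** (8 - 2)

print(len(utf1_protected), utf1_modulus)    # 66 190
print(len(bocu1_protected), bocu1_modulus)  # 13 243
print(utf8_modulus)                         # 64
```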
| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 |
|---|---|---|---|---|---|---|
| U+0000 | U+009F | 00–9F | | | | |
| U+00A0 | U+00FF | A0 | A0–FF | | | |
| U+0100 | U+4015 | A1–F5 | 21–7E, A0–FF | | | |
| U+4016 | U+38E2D | F6–FB | 21–7E, A0–FF | 21–7E, A0–FF | | |
| U+38E2E | U+7FFFFFFF | FC–FF | 21–7E, A0–FF | 21–7E, A0–FF | 21–7E, A0–FF | 21–7E, A0–FF |
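The table above can be turned into an encoder by treating the offset of a code point within its range as a base-190 number. The following is a minimal sketch derived from the table's ranges, not a reference implementation; the function names `trail_byte` and `utf1_encode` are hypothetical, and the digit-to-byte mapping is an assumption read off the trailing-byte ranges 21–7E and A0–FF.

```python
def trail_byte(z: int) -> int:
    """Map a base-190 digit (0..189) to a byte in 21-7E or A0-FF."""
    if not 0 <= z < 190:
        raise ValueError("digit out of range")
    # Digits 0..93 land in 21-7E; digits 94..189 land in A0-FF.
    return z + 0x21 if z < 0x5E else z + 0x42


def utf1_encode(cp: int) -> bytes:
    """Encode one code point (0..0x7FFFFFFF) following the ranges in the table."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp < 0xA0:                      # one byte, stored as-is
        return bytes([cp])
    if cp < 0x100:                     # two bytes with the fixed lead byte A0
        return bytes([0xA0, cp])
    if cp < 0x4016:                    # two bytes, lead A1-F5
        y = cp - 0x100
        return bytes([0xA1 + y // 190, trail_byte(y % 190)])
    if cp < 0x38E2E:                   # three bytes, lead F6-FB
        y = cp - 0x4016
        return bytes([
            0xF6 + y // 190**2,
            trail_byte((y // 190) % 190),
            trail_byte(y % 190),
        ])
    y = cp - 0x38E2E                   # five bytes, lead FC-FF
    return bytes([
        0xFC + y // 190**4,
        trail_byte((y // 190**3) % 190),
        trail_byte((y // 190**2) % 190),
        trail_byte((y // 190) % 190),
        trail_byte(y % 190),
    ])


print(utf1_encode(0xFEFF).hex())  # f7644c
```

Every multi-byte branch needs integer division and remainder by 190, which is the cost the article refers to when it calls UTF-1 slow compared with the shifts and masks that suffice for UTF-8.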
| code point | UTF-8 | UTF-1 |
|---|---|---|
| U+007F | 7F | 7F |
| U+0080 | C2 80 | 80 |
| U+009F | C2 9F | 9F |
| U+00A0 | C2 A0 | A0 A0 |
| U+00BF | C2 BF | A0 BF |
| U+00C0 | C3 80 | A0 C0 |
| U+00FF | C3 BF | A0 FF |
| U+0100 | C4 80 | A1 21 |
| U+015D | C5 9D | A1 7E |
| U+015E | C5 9E | A1 A0 |
| U+01BD | C6 BD | A1 FF |
| U+01BE | C6 BE | A2 21 |
| U+07FF | DF BF | AA 72 |
| U+0800 | E0 A0 80 | AA 73 |
| U+0FFF | E0 BF BF | B5 48 |
| U+1000 | E1 80 80 | B5 49 |
| U+4015 | E4 80 95 | F5 FF |
| U+4016 | E4 80 96 | F6 21 21 |
| U+D7FF | ED 9F BF | F7 2F C3 |
| U+E000 | EE 80 80 | F7 3A 79 |
| U+F8FF | EF A3 BF | F7 5C 3C |
| U+FDD0 | EF B7 90 | F7 62 BA |
| U+FDEF | EF B7 AF | F7 62 D9 |
| U+FEFF | EF BB BF | F7 64 4C |
| U+FFFD | EF BF BD | F7 65 AD |
| U+FFFE | EF BF BE | F7 65 AE |
| U+FFFF | EF BF BF | F7 65 AF |
| U+10000 | F0 90 80 80 | F7 65 B0 |
| U+38E2D | F0 B8 B8 AD | FB FF FF |
| U+38E2E | F0 B8 B8 AE | FC 21 21 21 21 |
| U+FFFFF | F3 BF BF BF | FC 21 37 B2 7A |
| U+100000 | F4 80 80 80 | FC 21 37 B2 7B |
| U+10FFFF | F4 8F BF BF | FC 21 39 6E 6C |
| U+7FFFFFFF | FD BF BF BF BF BF | FD BD 2B B9 40 |
Although modern Unicode ends at U+10FFFF, both UTF-1 and UTF-8 were designed to encode the complete 31 bits of the original Universal Character Set (UCS-4), and the last entry in this table shows this original final code point.
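The sample encodings in the table can be read back by reversing the base-190 arithmetic. The sketch below decodes a single sequence and assumes the same ranges and trailing-byte mapping as the byte-layout table; `digit` and `utf1_decode_one` are hypothetical names used only for this illustration.

```python
def digit(b: int) -> int:
    """Map a trailing byte back to its base-190 digit: 21-7E -> 0..93, A0-FF -> 94..189."""
    if 0x21 <= b <= 0x7E:
        return b - 0x21
    if 0xA0 <= b <= 0xFF:
        return b - 0x42
    raise ValueError(f"byte {b:#04x} is not valid inside a multi-byte sequence")


def utf1_decode_one(data: bytes) -> int:
    """Decode a byte string that holds exactly one UTF-1 sequence."""
    if not data:
        raise ValueError("empty input")
    lead, tail = data[0], data[1:]
    if lead < 0xA0:                    # single byte, stored as-is
        n_tail, offset, value = 0, 0, lead
    elif lead == 0xA0:                 # A0 followed by the code point itself
        if len(tail) != 1 or tail[0] < 0xA0:
            raise ValueError("malformed two-byte sequence")
        return tail[0]
    elif lead <= 0xF5:                 # two bytes
        n_tail, offset, value = 1, 0x100, lead - 0xA1
    elif lead <= 0xFB:                 # three bytes
        n_tail, offset, value = 2, 0x4016, lead - 0xF6
    else:                              # five bytes
        n_tail, offset, value = 4, 0x38E2E, lead - 0xFC
    if len(tail) != n_tail:
        raise ValueError("wrong sequence length")
    for b in tail:
        # One multiplication by 190 per trailing byte.
        value = value * 190 + digit(b)
    cp = offset + value
    if cp > 0x7FFFFFFF:
        raise ValueError("value exceeds the 31-bit UCS-4 range")
    return cp


print(hex(utf1_decode_one(bytes.fromhex("FD BD 2B B9 40"))))  # 0x7fffffff
```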