Binary Ordered Compression for Unicode

From Wikipedia, the free encyclopedia

MIME-compatible Unicode compression scheme

"BOCU" redirects here. For other uses, see BOCU (disambiguation).

Binary Ordered Compression for Unicode (BOCU) is a MIME-compatible Unicode compression scheme. BOCU-1 combines the wide applicability of UTF-8 with the compactness of the Standard Compression Scheme for Unicode (SCSU). This Unicode encoding is designed to be useful for compressing short strings, and it maintains code point order. BOCU-1 is specified in a Unicode Technical Note.^[1]

For comparison, SCSU was adopted as a standard Unicode compression scheme with a byte/code point ratio similar to that of language-specific code pages. SCSU has not been widely adopted, as it is not suitable for MIME "text" media types; for example, it cannot be used directly in emails and similar protocols. SCSU also requires a complicated encoder design for good performance. For larger amounts of Unicode text, zip, bzip2, and other industry-standard algorithms usually compress more efficiently.^[2]

Both SCSU^[3] and BOCU-1^[4] are IANA-registered charsets.

Details

All numbers in this section are hexadecimal, and all ranges are inclusive.

Code points from U+0000 to U+0020 are encoded in BOCU-1 as the corresponding byte value. All other code points (that is, U+0021 through U+D7FF and U+E000 through U+10FFFF) are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (U+0020). The initial state is U+0040. The normalization mapping is as follows:

Code range	Normalized code point	Notes
`U+3040` to `U+309F`	`U+3070`	Hiragana
`U+4E00` to `U+9FA5`	`U+7711`	Unihan
`U+AC00` to `U+D7A3`	`U+C1D1`	Hangul
`U+0020`	encoder state kept as is	Space
`U+hhhh00` to `U+hhhh7F` (excluding ranges above)	`U+hhhh40`	Middle of the 128-code-point block
`U+hhhh80` to `U+hhhhFF` (excluding ranges above)	`U+hhhhC0`	Middle of the 128-code-point block
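
The mapping in the table above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation; the function name `normalize_prev` is an assumption made for this example, and the space rule (state kept as is) is assumed to be handled by the caller, which simply does not call the function for U+0020.

```python
def normalize_prev(cp: int) -> int:
    """Return the normalized state for a just-encoded code point.

    Hypothetical helper following the normalization table; the caller
    is assumed to skip this call for U+0020 so the state is kept as is.
    """
    if 0x3040 <= cp <= 0x309F:   # Hiragana
        return 0x3070
    if 0x4E00 <= cp <= 0x9FA5:   # Unihan
        return 0x7711
    if 0xAC00 <= cp <= 0xD7A3:   # Hangul
        return 0xC1D1
    # Any other code point: middle of its 128-code-point block,
    # i.e. U+hhhh40 for U+hhhh00..7F and U+hhhhC0 for U+hhhh80..FF.
    return (cp & ~0x7F) | 0x40


print(hex(normalize_prev(0x3042)))  # Hiragana letter A
print(hex(normalize_prev(0x00E9)))  # Latin small letter e with acute
```

With this rule, consecutive characters from the same small script block produce differences in the single-byte range `-40` to `3F`.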

The difference between the current code point and the normalized previous code point is encoded as follows:

Difference range	Byte sequence range (see below)
`-10FF9F` to `-2DD0D`	`21F058D9` to `21FFFFFF`
`-2DD0C` to `-2912`	`220101` to `24FFFF`
`-2911` to `-41`	`2501` to `4FFF`
`-40` to `3F`	`50` to `CF`
`40` to `2910`	`D001` to `FAFF`
`2911` to `2DD0B`	`FB0101` to `FDFFFF`
`2DD0C` to `10FFBF`	`FE010101` to `FE19B454`

Each byte range is lexicographically ordered with the following thirteen byte values excluded: 00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20. For example, the byte sequence FC 06 FF, coding for a difference of 1156B, is immediately followed by the byte sequence FC 10 01, coding for a difference of 1156C.
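
The difference table and the lexicographic ordering can be illustrated with a small table-driven sketch in Python. It is not the reference implementation: the names `EXCLUDED`, `TRAIL`, `ROWS`, `rank` and `encode_diff` are assumptions made for this example, and the approach (treating each byte sequence as a lead byte followed by base-243 digits) is derived only from the table and the excluded-byte list above.

```python
# The thirteen byte values excluded from multi-byte sequences.
EXCLUDED = bytes.fromhex("00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20")
# The 243 permitted trail bytes in ascending order; index = base-243 digit.
TRAIL = [b for b in range(256) if b not in EXCLUDED]
DIGIT = {b: i for i, b in enumerate(TRAIL)}

# (lowest difference, highest difference, byte sequence of the lowest
# difference), copied from the difference table.
ROWS = [
    (-0x10FF9F, -0x2DD0D, bytes.fromhex("21F058D9")),
    (-0x2DD0C, -0x2912, bytes.fromhex("220101")),
    (-0x2911, -0x41, bytes.fromhex("2501")),
    (-0x40, 0x3F, bytes.fromhex("50")),
    (0x40, 0x2910, bytes.fromhex("D001")),
    (0x2911, 0x2DD0B, bytes.fromhex("FB0101")),
    (0x2DD0C, 0x10FFBF, bytes.fromhex("FE010101")),
]


def rank(seq: bytes) -> int:
    """Lexicographic position of seq among sequences of the same length."""
    value = seq[0]
    for b in seq[1:]:
        value = value * 243 + DIGIT[b]
    return value


def encode_diff(diff: int) -> bytes:
    """Encode one difference as a lead byte plus zero to three trail bytes."""
    for lo, hi, first in ROWS:
        if lo <= diff <= hi:
            value = rank(first) + (diff - lo)
            trails = []
            for _ in range(len(first) - 1):
                value, digit = divmod(value, 243)
                trails.append(TRAIL[digit])
            # What remains in value is the lead byte.
            return bytes([value] + trails[::-1])
    raise ValueError("difference out of range")


print(encode_diff(0x1156B).hex(" ").upper())  # FC 06 FF
print(encode_diff(0x1156C).hex(" ").upper())  # FC 10 01
```

The two printed sequences reproduce the example in the text: the trail byte after `06` is `10` because `07` through `0F` are excluded.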

Any ASCII input U+0000 to U+007F excluding space U+0020 resets the encoder to U+0040. Because the line-end code points U+000D and U+000A are among these and are encoded as is (0D 0A), the encoder is in a known state at the beginning of each line. The corruption of a single byte therefore affects at most one line. For comparison, the corruption of a single byte in UTF-8 affects at most one code point, and for SCSU it can affect the entire document.

BOCU-1 offers similar robustness for input texts that contain no such resetting code points by means of the special reset code 0xFF. When a decoder finds this octet, it resets its state to U+0040, as for a line end. The use of 0xFF reset bytes is not recommended in the BOCU-1 specification, because it conflicts with other BOCU-1 design goals, notably the binary order.

The optional use of a signature U+FEFF at the beginning of BOCU-1 encoded texts, i.e. the BOCU-1 byte sequence FB EE 28, changes the initial state U+0040 to U+FEC0. In other words, the signature cannot simply be stripped as in most other Unicode encoding schemes. Adding a reset byte after the signature (FB EE 28 FF) could avoid this effect, but the BOCU-1 specification does not recommend this practice.
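
The state handling described above (the initial state, the reset at line ends, the space rule and the effect of the signature) can be sketched as a compact encoder in Python. This is an illustrative sketch built only from the tables in this section, not the reference implementation; all names (`bocu1_encode`, `encode_diff`, `normalize_prev`, `ROWS`, `TRAIL`) are assumptions, the input is assumed to contain only valid code points (no surrogates), and the 0xFF reset byte is never emitted.

```python
EXCLUDED = bytes.fromhex("00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20")
TRAIL = [b for b in range(256) if b not in EXCLUDED]  # 243 trail bytes

# (lowest difference, byte sequence of that difference) per table row.
ROWS = [(-0x10FF9F, "21F058D9"), (-0x2DD0C, "220101"), (-0x2911, "2501"),
        (-0x40, "50"), (0x40, "D001"), (0x2911, "FB0101"),
        (0x2DD0C, "FE010101")]


def encode_diff(diff: int) -> bytes:
    lo, first = [row for row in ROWS if row[0] <= diff][-1]
    seq = bytes.fromhex(first)
    value = seq[0]
    for b in seq[1:]:                      # rank of the row's first sequence
        value = value * 243 + TRAIL.index(b)
    value += diff - lo
    trails = []
    for _ in seq[1:]:                      # peel off base-243 trail digits
        value, digit = divmod(value, 243)
        trails.append(TRAIL[digit])
    return bytes([value] + trails[::-1])


def normalize_prev(cp: int) -> int:
    if 0x3040 <= cp <= 0x309F:
        return 0x3070
    if 0x4E00 <= cp <= 0x9FA5:
        return 0x7711
    if 0xAC00 <= cp <= 0xD7A3:
        return 0xC1D1
    return (cp & ~0x7F) | 0x40


def bocu1_encode(text: str) -> bytes:
    prev = 0x40                            # initial state
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0x20:
            out.append(cp)                 # encoded as the byte value itself
            if cp != 0x20:
                prev = 0x40                # controls reset; space keeps state
        else:
            out += encode_diff(cp - prev)
            prev = normalize_prev(cp)
    return bytes(out)


print(bocu1_encode("\ufeff").hex(" ").upper())  # FB EE 28
```

In this sketch, encoding `"\u00e9\n\u00e9"` yields the same two bytes for both occurrences of the letter because the line feed resets the state, whereas `"\ufeffa"` does not end in the one-byte encoding of `"a"`, which shows why the signature cannot simply be stripped.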

In theory, UTF-1 and UTF-8 could encode the original UCS-4 set with 31 bits up to 7FFFFFFF. BOCU-1 and UTF-16 can encode the modern Unicode set from U+0000 to U+10FFFF. Excluding the thirteen protected code points encoded as single octets, BOCU-1 can use $256-13=243$ octets in multi-byte encodings. BOCU-1 needs at most four bytes, consisting of a lead byte and one to three trail bytes. The trail bytes encode a remaining "modulo 243" (base 243) difference; the lead byte determines the number of trail bytes and an initial difference. Note that the reset byte 0xFF is not protected and can occur as a trail byte.
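
The role of the lead byte can be sketched from the decoding side in Python. This is a hedged illustration derived from the difference table, not the reference decoder; the names `GROUPS` and `decode_diff` are assumptions, and the base value for lead byte 21 is a computed anchor chosen so that 21 FF FF FF maps to -2DD0D (only the upper part of that lead byte's range is used).

```python
EXCLUDED = bytes.fromhex("00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20")
TRAIL = [b for b in range(256) if b not in EXCLUDED]  # 243 trail bytes

# (first lead byte of the group, number of trail bytes, difference encoded
# by that lead byte followed by the lowest possible trail bytes).
GROUPS = [
    (0x21, 3, -0x2DD0C - 243**3),  # anchor: 21 FF FF FF is -2DD0D
    (0x22, 2, -0x2DD0C),
    (0x25, 1, -0x2911),
    (0x50, 0, -0x40),
    (0xD0, 1, 0x40),
    (0xFB, 2, 0x2911),
    (0xFE, 3, 0x2DD0C),
]


def decode_diff(seq: bytes) -> int:
    """Decode one lead byte plus its trail bytes into a difference."""
    lead = seq[0]
    if lead < 0x21 or lead == 0xFF:
        raise ValueError("not a lead byte")
    first_lead, ntrail, base = [g for g in GROUPS if g[0] <= lead][-1]
    if len(seq) != ntrail + 1:
        raise ValueError("wrong number of trail bytes")
    value = lead - first_lead              # initial difference from the lead
    for b in seq[1:]:
        value = value * 243 + TRAIL.index(b)  # base-243 trail digits
    return base + value


print(hex(decode_diff(bytes.fromhex("FC06FF"))))  # 0x1156b
```

The example sequence ends in FF, which is accepted here as an ordinary trail byte with digit value 242.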

Patent

Prior to 16 November 2022, the general BOCU algorithm was covered by United States Patent #6,737,994, which also mentions the specific BOCU-1 implementation.^[5] This patent has now expired.

IBM, which employed both of the inventors of BOCU-1 at the time it was created, stated in the Unicode Technical Note that implementers of a "fully compliant version of BOCU-1" had to contact IBM to request a royalty-free license.^[6] BOCU-1 is the only Unicode compression scheme described on the Unicode Web site that is known to have been encumbered with intellectual property restrictions.

By contrast, IBM also filed for a patent on UTF-EBCDIC, but it chose in that case to make the documentation and encoding scheme "freely available to anyone concerned towards making the transformation format as part of the UCS standards", instead of requiring implementers to request a license.^[7]

References

1. ^ Markus Scherer, Mark Davis (2006-02-04). "UTN #6: BOCU-1". Retrieved 2008-05-18.
2. ^ Ewell, Doug (2004-01-30). "UTN #14: A survey of Unicode compression" (PDF). Retrieved 2008-06-13.
3. ^ IANA registration record for SCSU
4. ^ IANA registration record for BOCU-1
5. ^ Davis; et al. (2004-05-18). "United States Patent #6,737,994, 'Binary-ordered compression for unicode'". Retrieved 2022-12-28.
6. ^ Markus Scherer, Mark Davis (2006-02-04). "UTN #6: BOCU-1". Retrieved 2014-02-05.
7. ^ V.S. Umamaheswaran (2002-04-16). "UTR #16: UTF-EBCDIC". Retrieved 2008-11-16.
