Binary Ordered Compression for Unicode

From Wikipedia, the free encyclopedia
MIME compatible Unicode compression scheme
"BOCU" redirects here. For other uses, see BOCU (disambiguation).

Binary Ordered Compression for Unicode (BOCU) is a MIME-compatible Unicode compression scheme. BOCU-1 combines the wide applicability of UTF-8 with the compactness of the Standard Compression Scheme for Unicode (SCSU). This Unicode encoding is designed to be useful for compressing short strings, and it maintains code point order. BOCU-1 is specified in a Unicode Technical Note.[1]

For comparison, SCSU was adopted as a standard Unicode compression scheme with a byte/code point ratio similar to that of language-specific code pages. SCSU has not been widely adopted, as it is not suitable for MIME "text" media types. For example, SCSU cannot be used directly in emails and similar protocols. SCSU also requires a complicated encoder design for good performance. Usually, zip, bzip2, and other industry-standard algorithms compact larger amounts of Unicode text more efficiently.[2]

Both SCSU[3] and BOCU-1[4] are IANA-registered charsets.

Details

All numbers in this section are hexadecimal, and all ranges are inclusive.

Code points from U+0000 to U+0020 are encoded in BOCU-1 as the corresponding byte value. All other code points (that is, U+0021 through U+D7FF and U+E000 through U+10FFFF) are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (U+0020). The initial state is U+0040. The normalization mapping is as follows:

Code range                                        Normalized code point       Notes
U+3040 to U+309F                                  U+3070                      Hiragana
U+4E00 to U+9FA5                                  U+7711                      Unihan
U+AC00 to U+D7A3                                  U+C1D1                      Hangul
U+0020                                            encoder state kept as is    Space
U+hhhh00 to U+hhhh7F (excluding ranges above)     U+hhhh40                    middle of 128
U+hhhh80 to U+hhhhFF (excluding ranges above)     U+hhhhC0                    middle of 128
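
The normalization table above can be read as a small state-update function. The following is a minimal sketch in Python, not the reference implementation; the name `next_state` is hypothetical and chosen only for illustration.

```python
def next_state(prev: int, cp: int) -> int:
    """Return the encoder state after code point cp has been encoded.

    prev is the current state; cp is the code point just encoded.
    The name and signature are illustrative assumptions.
    """
    if cp == 0x20:
        return prev                 # space: encoder state kept as is
    if 0x3040 <= cp <= 0x309F:
        return 0x3070               # Hiragana
    if 0x4E00 <= cp <= 0x9FA5:
        return 0x7711               # Unihan
    if 0xAC00 <= cp <= 0xD7A3:
        return 0xC1D1               # Hangul
    # Every other code point maps to the middle of its 128-code-point block:
    # U+hhhh00..U+hhhh7F -> U+hhhh40 and U+hhhh80..U+hhhhFF -> U+hhhhC0.
    return (cp & ~0x7F) | 0x40


# Cyrillic U+0416 lies in the block U+0400..U+047F, so the state becomes U+0440.
print(hex(next_state(0x40, 0x0416)))
```

Under this reading, any code point in U+0000 to U+007F other than the space yields the state U+0040, since all of them share the first 128-code-point block.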

The difference between the current code point and the normalized previous code point is encoded as follows:

Difference range         Byte sequence range (see below)
-10FF9F to -2DD0D        21 F0 58 D9 to 21 FF FF FF
-2DD0C to -2912          22 01 01 to 24 FF FF
-2911 to -41             25 01 to 4F FF
-40 to 3F                50 to CF
40 to 2910               D0 01 to FA FF
2911 to 2DD0B            FB 01 01 to FD FF FF
2DD0C to 10FFBF          FE 01 01 01 to FE 19 B4 54

Each byte range is lexicographically ordered with the following thirteen byte values excluded: 00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20. For example, the byte sequence FC 06 FF, coding for a difference of 1156B, is immediately followed by the byte sequence FC 10 01, coding for a difference of 1156C.
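
To make the ordering rule concrete, here is a minimal Python sketch that decodes only the positive three-byte range (FB 01 01 to FD FF FF) from the table above. The names `TRAILS` and `decode_three_byte` are hypothetical. The sketch assumes that the 243 allowed byte values act as base-243 digits in ascending order, which is how the table and the example read.

```python
# The thirteen byte values that never appear in a multi-byte sequence.
EXCLUDED = {0x00, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C,
            0x0D, 0x0E, 0x0F, 0x1A, 0x1B, 0x20}

# The 243 allowed trail bytes in ascending order; the index is the digit value.
TRAILS = [b for b in range(256) if b not in EXCLUDED]


def decode_three_byte(seq: bytes) -> int:
    """Decode a positive three-byte difference (lead byte FB, FC or FD)."""
    lead, t1, t2 = seq
    if not 0xFB <= lead <= 0xFD:
        raise ValueError("not a positive three-byte lead byte")
    # FB 01 01 encodes 0x2911; each further lead byte adds 243 * 243.
    base = 0x2911 + (lead - 0xFB) * 243 * 243
    # list.index raises ValueError for an excluded (invalid) trail byte.
    return base + TRAILS.index(t1) * 243 + TRAILS.index(t2)


print(hex(decode_three_byte(bytes([0xFC, 0x06, 0xFF]))))  # 0x1156b
print(hex(decode_three_byte(bytes([0xFC, 0x10, 0x01]))))  # 0x1156c
```

The trail byte 06 is followed directly by 10 because 07 through 0F are excluded, which is why the two sequences encode adjacent differences.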

Any ASCII input from U+0000 to U+007F, excluding the space U+0020, resets the encoder to U+0040. Because the directly encoded values include the line-end code points U+000D and U+000A, which are written as is (0D 0A), the encoder is in a known state at the beginning of each line. The corruption of a single byte therefore affects at most one line. For comparison, the corruption of a single byte in UTF-8 affects at most one code point, and for SCSU it can affect the entire document.
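
The per-line reset can be observed with a small encoder assembled from the two tables in this section. This is a sketch under stated assumptions, not the reference implementation: `TIERS`, `encode_diff`, `next_state` and `encode` are hypothetical names, the 0xFF reset byte is never emitted, and surrogate code points are not rejected.

```python
EXCLUDED = {0x00, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C,
            0x0D, 0x0E, 0x0F, 0x1A, 0x1B, 0x20}
TRAILS = [b for b in range(256) if b not in EXCLUDED]   # 243 digit bytes

# (lowest difference, highest difference, first lead byte, trail byte count).
# The first entry starts below -10FF9F so that -2DD0D maps to 21 FF FF FF;
# differences below -10FF9F cannot occur for valid code points.
TIERS = [
    (-0x2DD0D - 243**3 + 1, -0x2DD0D, 0x21, 3),
    (-0x2DD0C, -0x2912, 0x22, 2),
    (-0x2911, -0x41, 0x25, 1),
    (-0x40, 0x3F, 0x50, 0),
    (0x40, 0x2910, 0xD0, 1),
    (0x2911, 0x2DD0B, 0xFB, 2),
    (0x2DD0C, 0x10FFBF, 0xFE, 3),
]


def encode_diff(diff: int) -> bytes:
    """Encode one difference as a lead byte plus base-243 trail bytes."""
    for lo, hi, lead, n_trails in TIERS:
        if lo <= diff <= hi:
            offset = diff - lo
            trails = []
            for _ in range(n_trails):
                offset, digit = divmod(offset, 243)
                trails.append(TRAILS[digit])
            # What remains of offset selects the lead byte within the tier.
            return bytes([lead + offset] + trails[::-1])
    raise ValueError("difference out of range")


def next_state(prev: int, cp: int) -> int:
    """Normalize the code point just encoded into the next encoder state."""
    if cp == 0x20:
        return prev
    for lo, hi, mid in ((0x3040, 0x309F, 0x3070),
                        (0x4E00, 0x9FA5, 0x7711),
                        (0xAC00, 0xD7A3, 0xC1D1)):
        if lo <= cp <= hi:
            return mid
    return (cp & ~0x7F) | 0x40


def encode(text: str) -> bytes:
    prev = 0x40                          # initial state
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0x20:
            out.append(cp)               # U+0000..U+0020 are written as is
        else:
            out += encode_diff(cp - prev)
        prev = next_state(prev, cp)
    return bytes(out)


# The second line encodes the same way with or without the first line.
print(encode("日本語\nabc").endswith(encode("abc")))  # True
```

Because the line feed returns the state to U+0040, the bytes for "abc" do not depend on what preceded the line break, which is the property the paragraph above relies on.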

BOCU-1 offers similar robustness for input texts that lack the above-mentioned values by means of the special reset code 0xFF. When a decoder finds this octet, it resets its state to U+0040, as for a line end. The BOCU-1 specification does not recommend the use of 0xFF reset bytes, because it conflicts with other BOCU-1 design goals, notably the binary order.

The optional use of a signature U+FEFF at the beginning of BOCU-1 encoded texts, i.e. the BOCU-1 byte sequence FB EE 28, changes the initial state from U+0040 to U+FEC0. In other words, the signature cannot simply be stripped as in most other Unicode encoding schemes. Adding a reset byte after the signature (FB EE 28 FF) could avoid this effect, but the BOCU-1 specification does not recommend this practice.
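
The effect of the signature on the following text can be estimated from the difference table alone. The sketch below is illustrative only: `encoded_length` is a hypothetical helper that returns the byte count for a difference according to the ranges listed earlier, and it does not validate that the difference is reachable.

```python
def encoded_length(diff: int) -> int:
    """Number of bytes used for a difference, per the difference table."""
    if -0x40 <= diff <= 0x3F:
        return 1
    if -0x2911 <= diff <= 0x2910:
        return 2
    if -0x2DD0C <= diff <= 0x2DD0B:
        return 3
    return 4


INITIAL_STATE = 0x40
# U+FEFF lies in the block U+FE80..U+FEFF, whose middle is U+FEC0.
after_signature = (0xFEFF & ~0x7F) | 0x40

cp = ord("a")
without_signature = encoded_length(cp - INITIAL_STATE)
with_signature = encoded_length(cp - after_signature)

print(hex(after_signature))                 # 0xfec0
print(without_signature, with_signature)    # 1 3
```

In this sketch, the first letter "a" takes one byte at the start of an unsigned text but three bytes directly after the signature, so removing only the bytes FB EE 28 would leave the decoder in the wrong state.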

In theory, UTF-1 and UTF-8 could encode the original UCS-4 set with 31 bits up to 7FFFFFFF. BOCU-1 and UTF-16 can encode the modern Unicode set from U+0000 to U+10FFFF. Excluding the thirteen protected code points encoded as single octets, BOCU-1 can use $256 - 13 = 243$ octets in multi-byte encodings. BOCU-1 needs at most four bytes, consisting of a lead byte and one to three trail bytes. The trail bytes encode a remaining "modulo 243" (base 243) difference; the lead byte determines the number of trail bytes and an initial difference. Note that the reset byte 0xFF is not protected and can occur as a trail byte.
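
As a consistency check on these figures, the following Python sketch verifies that each one- to three-byte range in the difference table holds exactly as many values as its lead bytes and base-243 trail bytes allow, and that three trail bytes are enough for the largest differences. The variable names are hypothetical. The two extreme differences are derived here from the rules in this section (smallest difference-encoded code point U+0021, largest state U+10FFC0), not quoted from the specification.

```python
TRAIL_VALUES = 256 - 13     # 243 byte values usable as trail bytes

# (lowest diff, highest diff, first lead byte, last lead byte, trail bytes)
FULL_TIERS = [
    (-0x2DD0C, -0x2912, 0x22, 0x24, 2),
    (-0x2911, -0x41, 0x25, 0x4F, 1),
    (-0x40, 0x3F, 0x50, 0xCF, 0),
    (0x40, 0x2910, 0xD0, 0xFA, 1),
    (0x2911, 0x2DD0B, 0xFB, 0xFD, 2),
]


def tier_is_exact(lo, hi, first, last, n_trails):
    """True if the range size equals lead-byte count times 243**n_trails."""
    return hi - lo + 1 == (last - first + 1) * TRAIL_VALUES ** n_trails


all_exact = all(tier_is_exact(*tier) for tier in FULL_TIERS)

# Largest difference: highest code point minus the lowest state (U+0040).
max_diff = 0x10FFFF - 0x40
# Smallest difference: lowest diff-encoded code point minus the highest state.
min_diff = 0x21 - 0x10FFC0

# One four-byte lead byte with three trail bytes covers 243**3 values.
four_byte_room = TRAIL_VALUES ** 3
positive_needed = max_diff - 0x2DD0C + 1
negative_needed = -0x2DD0D - min_diff + 1

print(all_exact, hex(max_diff), hex(min_diff))
print(positive_needed <= four_byte_room, negative_needed <= four_byte_room)
```

The computed extremes match the ends of the difference table (10FFBF and -10FF9F), which suggests why the four-byte ranges stop at FE 19 B4 54 and start at 21 F0 58 D9 rather than using the full lead-byte span.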

Patent

Prior to 16 November 2022, the general BOCU algorithm was covered by United States Patent #6,737,994, which also mentions the specific BOCU-1 implementation.[5] This patent has now expired.

IBM, which employed both of the inventors of BOCU-1 at the time it was created, stated in the Unicode Technical Note that implementers of a "fully compliant version of BOCU-1" had to contact IBM to request a royalty-free license.[6] BOCU-1 is the only Unicode compression scheme described on the Unicode Web site that is known to have been encumbered with intellectual property restrictions.

By contrast, IBM also filed for a patent on UTF-EBCDIC, but it chose in that case to make the documentation and encoding scheme "freely available to anyone concerned towards making the transformation format as part of the UCS standards", instead of requiring implementers to request a license.[7]

References

  1. Markus Scherer, Mark Davis (2006-02-04). "UTN #6: BOCU-1". Retrieved 2008-05-18.
  2. Ewell, Doug (2004-01-30). "UTN #14: A survey of Unicode compression" (PDF). Retrieved 2008-06-13.
  3. IANA registration record for SCSU
  4. IANA registration record for BOCU-1
  5. Davis; et al. (2004-05-18). "United States Patent #6,737,994, 'Binary-ordered compression for unicode'". Retrieved 2022-12-28.
  6. Markus Scherer, Mark Davis (2006-02-04). "UTN #6: BOCU-1". Retrieved 2014-02-05.
  7. V.S. Umamaheswaran (2002-04-16). "UTR #16: UTF-EBCDIC". Retrieved 2008-11-16.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Binary_Ordered_Compression_for_Unicode&oldid=1305341905"