Canonical Huffman code

From Wikipedia, the free encyclopedia

Standardized representation of a Huffman code

In computer science and information theory, a canonical Huffman code is a particular type of Huffman code with unique properties which allow it to be described in a very compact manner. Rather than storing the structure of the code tree explicitly, canonical Huffman codes are ordered in such a way that it suffices to store only the lengths of the codewords, which reduces the overhead of the codebook.

Motivation

Data compressors generally work in one of two ways. Either the decompressor can infer what codebook the compressor has used from previous context, or the compressor must tell the decompressor what the codebook is. Since a canonical Huffman codebook can be stored especially efficiently, most compressors start by generating a "normal" Huffman codebook, and then convert it to canonical Huffman before using it.

In order for a symbol code scheme such as the Huffman code to be decompressed, the same model that the encoding algorithm used to compress the source data must be provided to the decoding algorithm so that it can use it to decompress the encoded data. In standard Huffman coding this model takes the form of a tree of variable-length codes, with the most frequent symbols located at the top of the structure and being represented by the fewest bits.

However, this code tree introduces two critical inefficiencies into an implementation of the coding scheme. Firstly, each node of the tree must store either references to its child nodes or the symbol that it represents. This is expensive in memory usage and if there is a high proportion of unique symbols in the source data then the size of the code tree can account for a significant amount of the overall encoded data. Secondly, traversing the tree is computationally costly, since it requires the algorithm to jump randomly through the structure in memory as each bit in the encoded data is read in.

Canonical Huffman codes address these two issues by generating the codes in a clear standardized format; all the codes for a given length are assigned their values sequentially. This means that instead of storing the structure of the code tree for decompression only the lengths of the codes are required, reducing the size of the encoded data. Additionally, because the codes are sequential, the decoding algorithm can be dramatically simplified so that it is computationally efficient.

Algorithm

The canonical Huffman algorithm converts a standard Huffman codebook into a standardized, or canonical, form. This is achieved by ordering the symbols according to a clear convention: first, sort all symbols by the length of their codeword, from shortest to longest. Second, for any symbols that have the same codeword length, sort them by their alphabetical or numerical value. This creates a definitive, sorted list of symbols.

The normal Huffman coding algorithm assigns a variable length code to every symbol in the alphabet. More frequently used symbols will be assigned a shorter code. For example, suppose we have the following non-canonical codebook:

A = 11
B = 0
C = 101
D = 100

Here the letter A has been assigned 2 bits, B has 1 bit, and C and D both have 3 bits. To make the code a canonical Huffman code, the codes are renumbered. The bit lengths stay the same, with the code book being sorted first by codeword length and secondly by alphabetical value of the letter:

B = 0
A = 11
C = 101
D = 100

Each of the existing codes is replaced with a new one of the same length, using the following algorithm:

The first symbol in the list gets assigned a codeword which is the same length as the symbol's original codeword but all zeros. This will often be a single zero ('0').
Each subsequent symbol is assigned the next binary number in sequence, ensuring that following codes are always higher in value.
When you reach a longer codeword, then after incrementing, append zeros until the length of the new codeword is equal to the length of the old codeword. This can be thought of as a left shift.

By following these three rules, the canonical version of the code book produced will be:

B = 0
A = 10
C = 110
D = 111
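The three rules can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `canonicalize` and the choice to represent codewords as strings of '0' and '1' are assumptions made for this example.

```python
def canonicalize(codebook):
    """Return a canonical codebook with the same codeword lengths.

    `codebook` maps each symbol to its codeword as a string of '0'/'1'
    (a representation assumed for this sketch).
    """
    if not codebook:
        return {}
    # Sort by codeword length first, then by symbol value.
    ordered = sorted(codebook, key=lambda s: (len(codebook[s]), s))
    canonical = {}
    code = 0
    prev_len = len(codebook[ordered[0]])
    for symbol in ordered:
        length = len(codebook[symbol])
        # Append zeros (left shift) when the codeword length grows.
        code <<= length - prev_len
        canonical[symbol] = format(code, f"0{length}b")
        # The next symbol gets the next binary number in sequence.
        code += 1
        prev_len = length
    return canonical


original = {"A": "11", "B": "0", "C": "101", "D": "100"}
print(canonicalize(original))
```

Applied to the non-canonical codebook above, the sketch yields B = 0, A = 10, C = 110, D = 111.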

As a fractional binary number

Another perspective on the canonical codewords is that they are the digits past the radix point (binary point) in a binary representation of a certain series. Specifically, suppose the lengths of the codewords, listed in canonical order, are $l_1, \ldots, l_n$. Then the canonical codeword for symbol $i$ is the first $l_i$ binary digits past the radix point in the binary representation of

$\sum _{j=1}^{i-1}2^{-l_{j}}.$

This perspective is particularly useful in light of Kraft's inequality, which says that the sum above will always be less than or equal to 1 (since the lengths come from a prefix-free code). This shows that adding one in the algorithm above never overflows to create a codeword that is longer than intended.
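The fractional view can be checked numerically. The sketch below is an illustration only: the function name `codeword_from_prefix_sum` is hypothetical, and it assumes the lengths are already listed in canonical order (sorted by length). It uses exact rational arithmetic to form the partial sum and then reads off the first $l_i$ binary digits.

```python
from fractions import Fraction


def codeword_from_prefix_sum(lengths, i):
    """Codeword of the i-th symbol (0-based) from the partial sum.

    `lengths` must be in canonical order, i.e. non-decreasing.
    """
    # Sum of 2^(-l_j) over all symbols that come before symbol i.
    total = sum((Fraction(1, 2 ** l) for l in lengths[:i]), Fraction(0))
    # The first l_i digits past the binary point are floor(total * 2^l_i).
    value = int(total * 2 ** lengths[i])
    return format(value, f"0{lengths[i]}b")


lengths = [1, 2, 3, 3]  # B, A, C, D from the running example
print([codeword_from_prefix_sum(lengths, i) for i in range(len(lengths))])
```

For the running example the partial sums are 0, 1/2, 3/4 and 7/8, whose binary expansions 0.0, 0.10, 0.110 and 0.111 give the same codewords as the incrementing rules.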

Encoding the codebook

The advantage of a canonical Huffman tree is that it can be encoded in fewer bits than an arbitrary tree.

Let us take our original Huffman codebook:

A = 11
B = 0
C = 101
D = 100

There are several ways we could encode this Huffman tree. For example, we could write each symbol followed by the number of bits and code:

('A',2,11), ('B',1,0), ('C',3,101), ('D',3,100)

Since we are listing the symbols in sequential alphabetical order, we can omit the symbols themselves, listing just the number of bits and code:

(2,11), (1,0), (3,101), (3,100)

With our canonical version we have the knowledge that the symbols are in sequential alphabetical order and that a later code will always be higher in value than an earlier one. The only parts left to transmit are the bit-lengths (number of bits) for each symbol. Note that our canonical Huffman tree always has higher values for longer bit lengths and that any symbols of the same bit length (C and D) have higher code values for higher symbols:

A = 10    (code value: 2 decimal, bits: 2)
B = 0     (code value: 0 decimal, bits: 1)
C = 110   (code value: 6 decimal, bits: 3)
D = 111   (code value: 7 decimal, bits: 3)

Since two-thirds of the constraints are known, only the number of bits for each symbol need be transmitted:

2, 1, 3, 3

With knowledge of the canonical Huffman algorithm, it is then possible to recreate the entire table (symbol and code values) from just the bit-lengths. Unused symbols are normally transmitted as having zero bit length.
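A minimal sketch of this reconstruction in Python follows. It assumes the receiver already knows the alphabet order and receives one bit length per symbol, with zero meaning "unused". The function name `codebook_from_lengths` is hypothetical. This sketch uses a counting formulation: it counts the codewords of each length, derives the smallest code value for each length, and then hands out codes in alphabet order.

```python
def codebook_from_lengths(alphabet, lengths):
    """Rebuild canonical codewords from per-symbol bit lengths (0 = unused)."""
    max_len = max(lengths, default=0)
    # count[n] = number of symbols whose codeword is n bits long.
    count = [0] * (max_len + 1)
    for n in lengths:
        if n > 0:
            count[n] += 1
    # next_code[n] = smallest code value among codewords of length n.
    next_code = [0] * (max_len + 1)
    code = 0
    for n in range(1, max_len + 1):
        code = (code + count[n - 1]) << 1
        next_code[n] = code
    # Assign codes in alphabet order, skipping unused symbols.
    codebook = {}
    for symbol, n in zip(alphabet, lengths):
        if n == 0:
            continue
        codebook[symbol] = format(next_code[n], f"0{n}b")
        next_code[n] += 1
    return codebook


print(codebook_from_lengths("ABCD", [2, 1, 3, 3]))
```

With the transmitted lengths 2, 1, 3, 3 this recovers A = 10, B = 0, C = 110, D = 111 without any code values having been sent.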

Another efficient way of representing the codebook is to list all symbols in increasing order by their bit-lengths, and record the number of symbols for each bit-length. For the example mentioned above, the encoding becomes:

(1,1,2), ('B','A','C','D')

This means that the first symbol B is of length 1, then A is of length 2, and the remaining 2 symbols (C and D) are of length 3. Since the symbols are sorted by bit-length, we can efficiently reconstruct the codebook. Pseudocode describing the reconstruction is given in the next section.

This type of encoding is advantageous when only a few symbols in the alphabet are being compressed. For example, suppose the codebook contains only 4 letters C, O, D and E, each of length 2. To represent the letter O using the previous method, we need to either add a lot of zeros (Method 1):

0, 0, 2, 2, 2, 0, ... , 2, ...

or record which 4 letters we have used. Each way makes the description longer than the following (Method 2):

(0,4), ('C','O','D','E')

The JPEG File Interchange Format uses Method 2 of encoding, because at most only 162 symbols out of the 8-bit alphabet, which has size 256, will be in the codebook.
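Reconstruction from a Method 2 description can be sketched in Python as below. This is an illustration under stated assumptions: the function name `codebook_from_counts` is hypothetical, `counts[k]` is taken to be the number of codewords of length k + 1, and `symbols` is taken to be already listed in order of increasing bit length, as in the examples above.

```python
def codebook_from_counts(counts, symbols):
    """Rebuild canonical codewords from per-length counts and a symbol list.

    counts[k] is the number of codewords of length k + 1;
    symbols lists the used symbols in order of increasing length.
    """
    codebook = {}
    code = 0
    remaining = iter(symbols)
    for length, n in enumerate(counts, start=1):
        for _ in range(n):
            codebook[next(remaining)] = format(code, f"0{length}b")
            code += 1
        # Moving to the next length appends one zero bit.
        code <<= 1
    return codebook


print(codebook_from_counts((1, 1, 2), "BACD"))
print(codebook_from_counts((0, 4), "CODE"))
```

The first call reproduces the running example (B = 0, A = 10, C = 110, D = 111); the second gives each of C, O, D and E a distinct 2-bit codeword.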

Pseudocode

Given a list of symbols sorted by bit-length, the following pseudocode will print a canonical Huffman code book:

code := 0
while more symbols do
    print symbol, code
    code := (code + 1) << ((bit length of the next symbol) − (current bit length))

algorithm compute huffman code is
    input:  message ensemble (set of (message, probability)).
            base D.
    output: code ensemble (set of (message, code)).

    1- sort the message ensemble by decreasing probability.
    2- N is the cardinal of the message ensemble (number of different
       messages).
    3- compute the integer $n_{0}$ such that $2\leq n_{0}\leq D$ and $(N-n_{0})/(D-1)$ is an integer.
    4- select the $n_{0}$ least probable messages, and assign them each a
       digit code.
    5- substitute the selected messages by a composite message summing
       their probability, and re-order it.
    6- while there remains more than one message, do steps 7 thru 8.
    7-    select D least probable messages, and assign them each a
          digit code.
    8-    substitute the selected messages by a composite message
          summing their probability, and re-order it.
    9- the code of each message is given by the concatenation of the
       code digits of the aggregate they've been put in.

^[1]^[2]
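For the binary case (D = 2, where $n_{0}$ is always 2), the merging procedure above can be sketched in Python with a priority queue. This is a simplified illustration rather than the general base-D algorithm: the function name `huffman_code_lengths` is hypothetical, and it returns only the codeword lengths, which is all a canonical code needs.

```python
import heapq
from itertools import count


def huffman_code_lengths(weights):
    """Binary Huffman merging; returns {symbol: codeword length}.

    `weights` maps each symbol to its probability or frequency.
    """
    if len(weights) == 1:
        # A lone symbol still needs one bit to be representable.
        return {symbol: 1 for symbol in weights}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), [s]) for s, w in weights.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(weights, 0)
    while len(heap) > 1:
        # Select the two least probable messages.
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        # Every symbol inside the merged group gains one code digit.
        for symbol in group1 + group2:
            lengths[symbol] += 1
        # Substitute them by a composite message with the summed weight.
        heapq.heappush(heap, (w1 + w2, next(tiebreak), group1 + group2))
    return lengths


print(huffman_code_lengths({"A": 2, "B": 4, "C": 1, "D": 1}))
```

With the illustrative weights shown (chosen for this example, not taken from the article), the lengths come out as A: 2, B: 1, C: 3, D: 3, matching the running example; those lengths can then be turned into canonical codewords as described earlier.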

References

1. This algorithm is described in: David A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes", Proceedings of the I.R.E.
2. Managing Gigabytes: a book with an implementation of canonical Huffman codes for word dictionaries.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Canonical_Huffman_code&oldid=1321323725"
