Movatterモバイル変換

[0]ホーム

Jump to content

Grammar-based code

Edit links

From Wikipedia, the free encyclopedia

Lossless data compression algorithm

Straight-line grammar (with start symbol ß) for the second sentence of theUnited States Declaration of Independence. Each blue character denotes anonterminal symbol; they were obtained from agzip-compression of the sentence.

Grammar-based codes orgrammar-based compression arecompression algorithms based on the idea of constructing acontext-free grammar (CFG) for the string to be compressed. Examples include universallossless data compression algorithms.^[1] To compress a data sequence $x=x_{1}\cdots x_{n}$ , a grammar-based code transforms $x {\displaystyle x}$ into a context-free grammar $G {\displaystyle G}$ .The problem of finding a smallest grammar for an input sequence (smallest grammar problem) is known to be NP-hard,^[2] so many grammar-transform algorithms are proposed from theoretical and practical viewpoints.Generally, the produced grammar $G {\displaystyle G}$ is further compressed by statistical encoders likearithmetic coding.

Examples and characteristics

[edit]

The class of grammar-based codes is very broad. It includesblock codes, the multilevel pattern matching (MPM) algorithm,^[3] variations of the incremental parsingLempel-Ziv code,^[4] and many other new universal lossless compression algorithms.Grammar-based codes are universal in the sense that they can achieve asymptotically theentropy rate of any stationary,ergodic source with a finite alphabet.

Practical algorithms

[edit]

The compression programs of the following are available from external links.

Sequitur^[5] is a classical grammar compression algorithm that sequentially translates an input text into a CFG, and then the produced CFG is encoded by an arithmetic coder.
Re-Pair^[6] is a greedy algorithm using the strategy of most-frequent-first substitution. The compressive performance is powerful, although the main memory space requirement is very large.
GLZA,^[7] which constructs a grammar that may be reducible, i.e., contain repeats, where the entropy-coding cost of "spelling out" the repeats is less than the cost creating and entropy-coding a rule to capture them. (In general, the compression-optimal SLG is not irreducible, and the Smallest Grammar Problem is different from the actual SLG compression problem.)

References

[edit]

^Kieffer, J. C.; Yang, E.-H. (2000), "Grammar-based codes: A new class of universal lossless source codes",IEEE Trans. Inf. Theory,46 (3):737–754,doi:10.1109/18.841160
^Charikar, M.; Lehman, E.; Liu, D.; Panigrahy, R.; Prabharakan, M.; Sahai, A.; Shelat, A. (2005), "The Smallest Grammar Problem",IEEE Trans. Inf. Theory,51 (7):2554–2576,doi:10.1109/tit.2005.850116,S2CID 6900082
^Kieffer, J. C.; Yang, E.-H.; Nelson, G.; Cosman, P. (2000),"Universal lossless compression via multilevel pattern matching",IEEE Trans. Inf. Theory,46 (4):1227–1245,doi:10.1109/18.850665,S2CID 8191526
^Ziv, J.; Lempel, A. (1978), "Compression of individual sequences via variable rate coding",IEEE Trans. Inf. Theory,24 (5):530–536,doi:10.1109/TIT.1978.1055934,hdl:10338.dmlcz/142945
^Nevill-Manning, C. G.; Witten, I. H. (1997), "Identifying Hierarchical Structure in Sequences: A linear-time algorithm",Journal of Artificial Intelligence Research,7 (4):67–82,arXiv:cs/9709102,doi:10.1613/jair.374,hdl:10289/1186,S2CID 2957960
^Larsson, N. J.; Moffat, A. (2000),"Offline Dictionary-Based Compression"(PDF),Proceedings of the IEEE,88 (11):1722–1732,doi:10.1109/5.892708
^Conrad, Kennon J.; Wilson, Paul R. (2016). "Grammatical Ziv-Lempel Compression: Achieving PPM-Class Text Compression Ratios with LZ-Class Decompression Speed".2016 Data Compression Conference (DCC). p. 586.doi:10.1109/DCC.2016.119.ISBN 978-1-5090-1853-6.S2CID 3116024.

External links

[edit]

GLZA discussion and paper
Description of grammar-based codes with example
Sequitur codes Archived 2008-10-13 at theWayback Machine
Re-Pair codes
Re-Pair codes a version of Gonzalo Navarro.
GrammarViz 2.0 - implementation of Sequitur, Re-Pair, and parallel Re-Pair in Java.

Data compression methods

Lossless
type

Entropy	Adaptive coding Arithmetic Asymmetric numeral systems Golomb Huffman Adaptive Canonical Modified Range Shannon Shannon–Fano Shannon–Fano–Elias Tunstall Unary Universal Exp-Golomb Fibonacci Gamma Levenshtein
Dictionary	Byte-pair encoding Lempel–Ziv 842 LZ4 LZJB LZO LZRW LZSS LZW LZWL Snappy
Other	BWT CTW CM Delta Incremental DMC DPCM Grammar Re-Pair Sequitur LDCT MTF PAQ PPM RLE
Hybrid	LZ77 + Huffman Deflate LZX LZS LZ77 + ANS LZFSE LZ77 + Huffman + ANS Zstandard LZ77 + Huffman + context Brotli LZSS + Huffman LHA/LZH LZ77 + Range LZMA LZHAM RLE + BWT + MTF + Huffman bzip2

Lossy
type

Transform	Discrete cosine transform DCT MDCT DST FFT Wavelet Daubechies DWT SPIHT
Predictive	DPCM ADPCM LPC ACELP CELP LAR LSP WLPC Motion Compensation Estimation Vector Psychoacoustic

Audio

Concepts	Bit rate ABR CBR VBR Companding Convolution Dynamic range Latency Nyquist–Shannon theorem Sampling Silence compression Sound quality Speech coding Sub-band coding
Codec parts	A-law μ-law DPCM ADPCM DM FT FFT LPC ACELP CELP LAR LSP WLPC MDCT Psychoacoustic model

Image

Concepts	Chroma subsampling Coding tree unit Color space Compression artifact Image resolution Macroblock Pixel PSNR Quantization Standard test image Texture compression
Methods	Chain code DCT Deflate Fractal KLT LP RLE Wavelet Daubechies DWT EZW SPIHT

Video

Concepts	Bit rate ABR CBR VBR Display resolution Frame Frame rate Frame types Interlace Video characteristics Video quality
Codec parts	DCT DPCM Deblocking filter Lapped transform Motion Compensation Estimation Vector Wavelet Daubechies DWT