LZ77 and LZ78

From Wikipedia, the free encyclopedia

Lossless data compression algorithms

LZ77 and LZ78 are the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977^[1] and 1978.^[2] They are also known as Lempel-Ziv 1 (LZ1) and Lempel-Ziv 2 (LZ2) respectively.^[3] These two algorithms form the basis for many variations including LZW, LZSS, LZMA and others. Besides their academic influence, these algorithms formed the basis of several ubiquitous compression schemes, including GIF and the DEFLATE algorithm used in PNG and ZIP.

They are both theoretically dictionary coders. LZ77 maintains a sliding window during compression. This was later shown to be equivalent to the explicit dictionary constructed by LZ78; however, they are only equivalent when the entire data is intended to be decompressed.

Since LZ77 encodes and decodes from a sliding window over previously seen characters, decompression must always start at the beginning of the input. Conceptually, LZ78 decompression could allow random access to the input if the entire dictionary were known in advance. However, in practice the dictionary is created during encoding and decoding by creating a new phrase whenever a token is output.^[4]

The algorithms were named an IEEE Milestone in 2004.^[5] In 2021 Jacob Ziv was awarded the IEEE Medal of Honor for his involvement in their development.^[6]

Theoretical efficiency

In the second of the two papers that introduced these algorithms, they are analyzed as encoders defined by finite-state machines. A measure analogous to information entropy is developed for individual sequences (as opposed to probabilistic ensembles). This measure gives a bound on the data compression ratio that can be achieved. It is then shown that there exist finite lossless encoders for every sequence that achieve this bound as the length of the sequence grows to infinity. In this sense an algorithm based on this scheme produces asymptotically optimal encodings. This result can be proven more directly, as for example in notes by Peter Shor.^[7]

Formally (Theorem 13.5.2^[8]):

LZ78 is universal and entropic: if $X$ is a binary source that is stationary and ergodic, then $\limsup_{n} \frac{1}{n} l_{LZ78}(X_{1:n}) \leq h(X)$ with probability 1. Here $h(X)$ is the entropy rate of the source, and $l_{LZ78}(X_{1:n})$ is the length in bits of the LZ78 encoding of the first $n$ symbols.

Similar theorems apply to other versions of the LZ algorithm.

LZ77

LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement "each of the next length characters is equal to the characters exactly distance characters behind it in the uncompressed stream". (The distance is sometimes called the offset instead.)

To spot matches, the encoder must keep track of some amount of the most recent data, such as the last 2 KB, 4 KB, or 32 KB. The structure in which this data is held is called a sliding window, which is why LZ77 is sometimes called sliding-window compression. The encoder needs to keep this data to look for matches, and the decoder needs to keep this data to interpret the matches the encoder refers to. The larger the sliding window is, the farther back the encoder may search when creating references.

It is not only acceptable but frequently useful to allow length-distance pairs to specify a length that actually exceeds the distance. As a copy command, this is puzzling: "Go back four characters and copy ten characters from that position into the current position". How can ten characters be copied over when only four of them are actually in the buffer? Tackling one byte at a time, there is no problem serving this request, because as a byte is copied over, it may be fed again as input to the copy command. When the copy-from position makes it to the initial destination position, it is consequently fed data that was pasted from the beginning of the copy-from position. The operation is thus equivalent to the statement "copy the data you were given and repetitively paste it until it fits". As this type of pair repeats a single copy of data multiple times, it can be used to incorporate a flexible and easy form of run-length encoding.
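The byte-at-a-time copy described above can be sketched in a few lines of Python. This is only an illustration: the function name `copy_match` and the use of a `bytearray` as the output buffer are assumptions made for the example, not part of any particular LZ77 implementation.

```python
def copy_match(out: bytearray, distance: int, length: int) -> None:
    """Append `length` bytes to `out`, each taken from `distance` bytes back.

    Because bytes are appended one at a time, the source range may overlap
    the bytes being written, so `length` is allowed to exceed `distance`.
    """
    if not 0 < distance <= len(out):
        raise ValueError("distance must point inside the existing output")
    for _ in range(length):
        # out[-distance] moves forward as out grows, so copied bytes
        # become available as input to the same copy command.
        out.append(out[-distance])


buf = bytearray(b"abcd")
copy_match(buf, distance=4, length=10)  # "go back four, copy ten"
print(buf.decode())  # abcdabcdabcdab

run = bytearray(b"x")
copy_match(run, distance=1, length=5)   # behaves like run-length encoding
print(run.decode())  # xxxxxx
```

With a distance of 1 the same routine repeats the last byte, which is the run-length behaviour the paragraph mentions.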

Another way to see things is as follows: While encoding, for the search pointer to continue finding matched pairs past the end of the search window, all characters from the first match at offset D and forward to the end of the search window must have matched input, and these are the (previously seen) characters that compose a single run unit of length L_R, which must equal D. Then as the search pointer proceeds past the search window and forward, as far as the run pattern repeats in the input, the search and input pointers will be in sync and match characters until the run pattern is interrupted. Then L characters have been matched in total, L > D, and the code is [D, L, c].

Upon decoding [D, L, c], again, D = L_R. When the first L_R characters are read to the output, this corresponds to a single run unit appended to the output buffer. At this point, the read pointer could be thought of as only needing to return int(L/L_R) + (1 if L mod L_R ≠ 0) times to the start of that single buffered run unit, read L_R characters (or maybe fewer on the last return), and repeat until a total of L characters are read. But mirroring the encoding process, since the pattern is repetitive, the read pointer need only trail in sync with the write pointer by a fixed distance equal to the run length L_R until L characters have been copied to output in total.
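A minimal decoder sketch for [D, L, c] codes, written in Python, may make the trailing read pointer concrete. The function name `lz77_decode` and the representation of codes as `(distance, length, char)` tuples are assumptions chosen for the example.

```python
from typing import Iterable, List, Tuple


def lz77_decode(codes: Iterable[Tuple[int, int, str]]) -> str:
    """Decode a sequence of (distance, length, char) codes."""
    out: List[str] = []
    for distance, length, char in codes:
        if length and not 0 < distance <= len(out):
            raise ValueError("distance must point inside the existing output")
        read = len(out) - distance  # read pointer starts `distance` back
        for i in range(length):
            # The read pointer trails the write pointer by a fixed distance,
            # so a run unit of `distance` characters repeats automatically.
            out.append(out[read + i])
        out.append(char)  # the literal that follows the match
    return "".join(out)


# D = 2, L = 5: the run unit "ab" is read ceil(5 / 2) = 3 times.
print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (2, 5, "c")]))  # abababac
```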

Considering the above, especially if the compression of data runs is expected to predominate, the window search should begin at the end of the window and proceed backwards. Run patterns, if they exist, will then be found first and allow the search to terminate early: absolutely, if the current maximal matching sequence length is met, or judiciously, if a sufficient length is met. A further, simpler reason is that more recent data may correlate better with the next input.

Pseudocode

The following pseudocode reproduces the sliding-window LZ77 compression algorithm.

    while input is not empty do
        match := longest repeated occurrence of input that begins in window

        if match exists then
            d := distance to start of match
            l := length of match
            c := char following match in input
        else
            d := 0
            l := 0
            c := first char of input
        end if

        output (d, l, c)

        discard l + 1 chars from front of window
        s := pop l + 1 chars from front of input
        append s to back of window
    repeat
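As a rough illustration, the pseudocode above can be turned into a short Python sketch. Several details are assumptions made for the example rather than part of the original algorithm description: the name `lz77_encode`, the default `window_size` of 4096, the brute-force search, and the choice to model the window as an index range over the input instead of a separate buffer. The match length is capped so that one character always remains to serve as the literal c.

```python
from typing import List, Tuple


def lz77_encode(data: str, window_size: int = 4096) -> List[Tuple[int, int, str]]:
    """Encode `data` as a list of (distance, length, char) codes."""
    codes: List[Tuple[int, int, str]] = []
    pos = 0
    while pos < len(data):
        best_dist, best_len = 0, 0
        # Keep one character in reserve for the trailing literal c.
        max_len = len(data) - pos - 1
        # Search the window backwards, most recent candidates first.
        for start in range(pos - 1, max(pos - window_size, 0) - 1, -1):
            length = 0
            # A match may run past `pos`, i.e. length may exceed distance.
            while length < max_len and data[start + length] == data[pos + length]:
                length += 1
            if length > best_len:
                best_dist, best_len = pos - start, length
        codes.append((best_dist, best_len, data[pos + best_len]))
        pos += best_len + 1  # the window slides by l + 1 characters
    return codes


print(lz77_encode("abababac"))  # [(0, 0, 'a'), (0, 0, 'b'), (2, 5, 'c')]
```

The quadratic search is kept for readability; practical encoders typically index the window with hash chains or similar structures.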

Implementations

Even though all LZ77 algorithms work by definition on the same basic principle, they can vary widely in how they encode their compressed data to vary the numerical ranges of a length–distance pair, alter the number of bits consumed for a length–distance pair, and distinguish their length–distance pairs from literals (raw data encoded as itself, rather than as part of a length–distance pair). A few examples:

The algorithm illustrated in Lempel and Ziv's original 1977 article outputs all its data three values at a time: the length and distance of the longest match found in the buffer, and the literal that followed that match. If two successive characters in the input stream could be encoded only as literals, the length of the length–distance pair would be 0.
LZSS improves on LZ77 by using a 1-bit flag to indicate whether the next chunk of data is a literal or a length–distance pair, and using literals if a length–distance pair would be longer.
In the PalmDoc format, a length–distance pair is always encoded by a two-byte sequence. Of the 16 bits that make up these two bytes, 11 bits go to encoding the distance, 3 go to encoding the length, and the remaining two are used to make sure the decoder can identify the first byte as the beginning of such a two-byte sequence.
In the implementation used for many games by Electronic Arts,^[9] the size in bytes of a length–distance pair can be specified inside the first byte of the length–distance pair itself; depending on whether the first byte begins with a 0, 10, 110, or 111 (when read in big-endian bit orientation), the length of the entire length–distance pair can be 1 to 4 bytes.
As of 2008, the most popular LZ77-based compression method is DEFLATE; it combines LZSS with Huffman coding.^[10] Literals, lengths, and a symbol to indicate the end of the current block of data are all placed together into one alphabet. Distances can be safely placed into a separate alphabet; because a distance only occurs just after a length, it cannot be mistaken for another kind of symbol or vice versa.
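To make the bit-level variation concrete, the following Python sketch packs a length–distance pair into two bytes in the manner described for PalmDoc above: 2 marker bits, 11 distance bits, and 3 length bits. The exact layout used here is an assumption based on common descriptions of the format (marker bits `10` in the top positions, and the length stored as length minus 3); the function names `pack_pair` and `unpack_pair` are hypothetical.

```python
from typing import Tuple


def pack_pair(distance: int, length: int) -> bytes:
    """Pack a pair as 2 marker bits, 11 distance bits, 3 length bits."""
    if not 1 <= distance <= 2047:
        raise ValueError("distance must fit in 11 bits")
    if not 3 <= length <= 10:
        raise ValueError("length must be 3..10 (assumed stored as length - 3)")
    value = 0x8000 | (distance << 3) | (length - 3)  # assumed marker bits: 10
    return value.to_bytes(2, "big")


def unpack_pair(pair: bytes) -> Tuple[int, int]:
    """Recover (distance, length) from a two-byte sequence."""
    value = int.from_bytes(pair, "big")
    if (value & 0xC000) != 0x8000:
        raise ValueError("not a length-distance pair")
    return (value >> 3) & 0x7FF, (value & 0x7) + 3


packed = pack_pair(distance=4, length=10)
print(packed.hex())         # 8027
print(unpack_pair(packed))  # (4, 10)
```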

LZ78

The LZ78 algorithms compress sequential data by building a dictionary of token sequences from the input, and then replacing the second and subsequent occurrences of a sequence in the data stream with a reference to the dictionary entry. The observation is that the number of repeated sequences is a good measure of the non-random nature of a sequence. The algorithms represent the dictionary as an n-ary tree where n is the number of tokens used to form token sequences. Each dictionary entry is of the form dictionary[...] = {index, token}, where index is the index to a dictionary entry representing a previously seen sequence, and token is the next token from the input that makes this entry unique in the dictionary. Note how the algorithm is greedy, and so nothing is added to the table until a unique-making token is found.

The algorithm is to initialize last matching index = 0 and next available index = 1 and then, for each token of the input stream, search the dictionary for a match: {last matching index, token}. If a match is found, then last matching index is set to the index of the matching entry and nothing is output, so that last matching index is left representing the input so far. Input is processed until a match is not found. Then a new dictionary entry is created, dictionary[next available index] = {last matching index, token}, and the algorithm outputs last matching index, followed by token, then resets last matching index = 0 and increments next available index.

As an example, consider the sequence of tokens AABBA, which would assemble the dictionary:

0 {0,_}
1 {0,A}
2 {1,B}
3 {0,B}

and the output sequence of the compressed data would be 0A1B0B. Note that the last A is not represented yet, as the algorithm cannot know what comes next. In practice an EOF marker is added to the input, AABBA$ for example. Note also that in this case the output 0A1B0B1$ is longer than the original input, but the compression ratio improves considerably as the dictionary grows, and in binary the indexes need not be represented by any more than the minimum number of bits.^[11]
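The encoding procedure can be sketched in Python as follows. This is an illustrative sketch only: the function name `lz78_encode`, the use of a Python `dict` keyed by `(index, token)` in place of an explicit tree, and the `eof` parameter are assumptions, and the sketch also assumes the EOF marker never occurs in the input itself.

```python
from typing import Dict, List, Tuple


def lz78_encode(tokens: str, eof: str = "$") -> List[Tuple[int, str]]:
    """Encode `tokens` as a list of (last matching index, token) pairs."""
    dictionary: Dict[Tuple[int, str], int] = {}
    output: List[Tuple[int, str]] = []
    last_matching_index = 0
    next_available_index = 1
    for token in tokens + eof:  # the EOF marker flushes any pending match
        key = (last_matching_index, token)
        if key in dictionary:
            # Match found: remember it and emit nothing yet.
            last_matching_index = dictionary[key]
        else:
            # No match: create a new entry and output the pair.
            dictionary[key] = next_available_index
            next_available_index += 1
            output.append(key)
            last_matching_index = 0
    return output


pairs = lz78_encode("AABBA")
print("".join(f"{index}{token}" for index, token in pairs))  # 0A1B0B1$
```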

Decompression consists of rebuilding the dictionary from the compressed sequence. From the sequence 0A1B0B1$ the first entry is always the terminator 0 {...}, and the first from the sequence would be 1 {0,A}. The A is added to the output. The second pair from the input is 1B and results in entry number 2 in the dictionary, {1,B}. The token "B" is output, preceded by the sequence represented by dictionary entry 1. Entry 1 is an 'A' (followed by "entry 0", that is, nothing), so AB is added to the output. Next, 0B is added to the dictionary as the next entry, 3 {0,B}, and B (preceded by nothing) is added to the output. Finally, a dictionary entry for 1$ is created and A$ is output, resulting in A AB B A$, or AABBA after removing the spaces and EOF marker.
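A matching decoder sketch in Python follows the same steps: each pair becomes a new dictionary entry, and the phrase is recovered by following index links back to entry 0. The function name `lz78_decode`, the list-of-tuples input format, and the `eof` parameter are assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def lz78_decode(pairs: Iterable[Tuple[int, str]], eof: str = "$") -> str:
    """Rebuild the dictionary from (index, token) pairs and return the text."""
    dictionary: List[Tuple[int, str]] = [(0, "")]  # entry 0 is the terminator
    out: List[str] = []
    for index, token in pairs:
        dictionary.append((index, token))
        # Walk the chain of entries back to entry 0, collecting tokens.
        phrase = [token]
        while index != 0:
            index, previous = dictionary[index]
            phrase.append(previous)
        out.append("".join(reversed(phrase)))
    text = "".join(out)
    # Drop the EOF marker if the stream ends with it.
    return text[: -len(eof)] if eof and text.endswith(eof) else text


print(lz78_decode([(0, "A"), (1, "B"), (0, "B"), (1, "$")]))  # AABBA
```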

LZW

LZW is an LZ78-based algorithm that uses a dictionary pre-initialized with all possible characters (symbols) or emulation of a pre-initialized dictionary. The main improvement of LZW is that when a match is not found, the current input stream character is assumed to be the first character of an existing string in the dictionary (since the dictionary is initialized with all possible characters), so only the last matching index is output (which may be the pre-initialized dictionary index corresponding to the previous (or the initial) input character). Refer to the LZW article for implementation details.
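The difference from LZ78 can be seen in a small Python sketch of an LZW-style encoder. It is a simplified illustration under stated assumptions: the alphabet is taken to be the 256 byte values, the dictionary is allowed to grow without limit, and codes are returned as plain integers rather than packed into bits; the name `lzw_encode` is hypothetical.

```python
from typing import Dict, List


def lzw_encode(data: bytes) -> List[int]:
    """Encode `data` as a list of dictionary indexes only (no literal tokens)."""
    # Dictionary pre-initialized with every single-byte string.
    dictionary: Dict[bytes, int] = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes: List[int] = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate  # keep extending the match
        else:
            codes.append(dictionary[current])  # output the last matching index
            dictionary[candidate] = next_code
            next_code += 1
            # The unmatched byte starts the next match; it is always in the
            # dictionary, so it never needs to be output as a literal.
            current = bytes([value])
    if current:
        codes.append(dictionary[current])
    return codes


print(lzw_encode(b"ABABABA"))  # [65, 66, 256, 258]
```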

BTLZ is an LZ78-based algorithm that was developed for use in real-time communications systems (originally modems) and standardized by CCITT/ITU as V.42bis. When the trie-structured dictionary is full, a simple re-use/recovery algorithm is used to ensure that the dictionary can keep adapting to changing data. A counter cycles through the dictionary. When a new entry is needed, the counter steps through the dictionary until a leaf node is found (a node with no dependents). This is deleted and the space re-used for the new entry. This is simpler to implement than LRU or LFU and achieves equivalent performance.
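The re-use strategy described above might be sketched in Python as follows. This is not the V.42bis procedure itself, only an illustration of a cycling counter that recycles leaf nodes. The class name `RecyclingDictionary`, the `reserved` range of permanent slots, and the rule that the node being extended is never recycled are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


class RecyclingDictionary:
    """Fixed-size trie dictionary that re-uses leaf slots when it is full."""

    def __init__(self, capacity: int, reserved: int) -> None:
        self.capacity = capacity
        self.reserved = reserved  # slots [0, reserved) are never recycled
        self.entries: List[Optional[Tuple[int, str]]] = [None] * capacity
        self.children: List[int] = [0] * capacity  # dependents per slot
        self.lookup: Dict[Tuple[int, str], int] = {}
        self.counter = reserved  # cycles through the recyclable slots

    def add(self, parent: int, symbol: str) -> int:
        """Store the entry {parent, symbol} and return the slot it occupies."""
        for _ in range(self.capacity - self.reserved):
            slot = self.counter
            self.counter += 1
            if self.counter == self.capacity:
                self.counter = self.reserved  # wrap around
            if slot == parent or self.children[slot] > 0:
                continue  # not a leaf, or the node we are about to extend
            old = self.entries[slot]
            if old is not None:  # occupied leaf: delete it before re-use
                del self.lookup[old]
                self.children[old[0]] -= 1
            self.entries[slot] = (parent, symbol)
            self.lookup[(parent, symbol)] = slot
            self.children[parent] += 1
            return slot
        raise RuntimeError("no reusable leaf found")


d = RecyclingDictionary(capacity=4, reserved=2)
d.add(0, "x")         # fills slot 2
d.add(2, "y")         # fills slot 3, so slot 2 is no longer a leaf
print(d.add(1, "z"))  # 3: the leaf {2, y} is deleted and its slot re-used
```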

References

[1] Ziv, Jacob; Lempel, Abraham (May 1977). "A Universal Algorithm for Sequential Data Compression". IEEE Transactions on Information Theory. 23 (3): 337–343. CiteSeerX 10.1.1.118.8921. doi:10.1109/TIT.1977.1055714. S2CID 9267632.
[2] Ziv, Jacob; Lempel, Abraham (September 1978). "Compression of Individual Sequences via Variable-Rate Coding". IEEE Transactions on Information Theory. 24 (5): 530–536. CiteSeerX 10.1.1.14.2892. doi:10.1109/TIT.1978.1055934.
[3] US Patent No. 5532693. Adaptive data compression system with systolic string matching logic.
[4] "Lossless Data Compression: LZ78". cs.stanford.edu.
[5] "Milestones: Lempel–Ziv Data Compression Algorithm, 1977". IEEE Global History Network. Institute of Electrical and Electronics Engineers. 22 July 2014. Retrieved 9 November 2014.
[6] Joanna, Goodrich. "IEEE Medal of Honor Goes to Data Compression Pioneer Jacob Ziv". IEEE Spectrum: Technology, Engineering, and Science News. Retrieved 18 January 2021.
[7] Peter Shor (14 October 2005). "Lempel–Ziv notes" (PDF). Archived from the original (PDF) on 28 May 2021. Retrieved 9 November 2014.
[8] Cover, Thomas M.; Thomas, Joy A. (2006). Elements of information theory (2nd ed.). Hoboken, N.J: Wiley-Interscience. ISBN 978-0-471-24195-9.
[9] "QFS Compression (RefPack)". Niotso Wiki. Retrieved 9 November 2014.
[10] Feldspar, Antaeus (23 August 1997). "An Explanation of the Deflate Algorithm". comp.compression newsgroup. zlib.net. Retrieved 9 November 2014.
[11] https://math.mit.edu/~goemans/18310S15/lempel-ziv-notes.pdf

External links

Media related to LZ77 algorithm at Wikimedia Commons
Media related to LZ78 algorithm at Wikimedia Commons
"The LZ77 algorithm". Data Compression Reference Center: RASIP working group. Faculty of Electrical Engineering and Computing, University of Zagreb. 1997. Archived from the original on 7 January 2013. Retrieved 22 June 2012.
"The LZ78 algorithm". Data Compression Reference Center: RASIP working group. Faculty of Electrical Engineering and Computing, University of Zagreb. 1997. Archived from the original on 7 January 2013. Retrieved 22 June 2012.
"The LZW algorithm". Data Compression Reference Center: RASIP working group. Faculty of Electrical Engineering and Computing, University of Zagreb. 1997. Archived from the original on 7 January 2013. Retrieved 22 June 2012.
