Data Compression: Multiple transformations
Early text data compression techniques were divided into two very different-seeming approaches: dictionary-style substitution of repeated strings (LZ77, LZW, and their relatives), and statistical entropy coding of individual symbols (Huffman and arithmetic coding). Researchers are searching for ways to combine them, to get the benefits of both.[1]
A typical paragraph of text has several different kinds of patterns. Some archive software applies a series of different transformations, each transformation targeting a different kind of pattern in the text.
Such a transformation is often based on some kind of model of how the file was produced.
Compression software (especially image compression software) often sends the raw data through one or more stages of decorrelation -- such as Karhunen–Loève transform[2], color space transformation, linear filtering, DFT, transform coding, etc. -- the "precompressor" stages -- before feeding the result into an entropy compressor.
Many transforms used in compression, counter-intuitively, produce an intermediate result that requires more bits of dynamic range to store. Transforming raw 24-bit RGB pixels into 26-bit YCoCg-R pixels[2], or transforming raw 16-bit audio samples into 17-bit delta coded differences, may seem to "waste" bits, but it improves net compression performance: in many cases, the entropy compressor gives smaller compressed file sizes when given such "de-correlated" intermediate data than when given the original raw data. (The improvement resulting from such transformations is often called "coding gain".)
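The audio case is easy to see in a few lines of code. The following is only a sketch, with a synthetic sine-wave "signal" chosen purely for illustration: the differences need 17 bits of range in the worst case, but for a smooth waveform almost all of them are tiny, which is exactly what a later entropy coder exploits.

```python
# Sketch: delta-coding 16-bit audio samples into 17-bit differences.
# Differences of 16-bit samples can span [-65535, +65535] (17 bits of range),
# but for a smooth signal most of them stay close to zero.
import math

samples = [int(20000 * math.sin(n / 20.0)) for n in range(1000)]  # fake 16-bit audio

deltas = [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

# Lossless reconstruction: running sum of the differences.
reconstructed = []
acc = 0
for d in deltas:
    acc += d
    reconstructed.append(acc)
assert reconstructed == samples

print("raw range:  ", min(samples), "to", max(samples))
print("delta range:", min(deltas), "to", max(deltas))   # much narrower for smooth input
```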
The JPEG 2000 Reversible Color Transform (RCT) is a YUV-like color space that does not introduce quantization errors, so it is fully reversible.
The compressor converts RGB to YCbCr with:
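$$Y = \left\lfloor \frac{R + 2G + B}{4} \right\rfloor, \qquad C_B = B - G, \qquad C_R = R - G$$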
and then compresses the image.
The decompressor decompresses the image, then takes YCbCr pixels and restores the original RGB image with:
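$$G = Y - \left\lfloor \frac{C_B + C_R}{4} \right\rfloor, \qquad R = C_R + G, \qquad B = C_B + G$$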
A pixel in the JPEG 2000 YCbCr (RCT) format requires 26 bits to represent losslessly: 8 bits for Y plus 9 bits each for Cb and Cr.
The Red, Green, and Blue layers of an image are typically highly correlated. So compressing the R, G, and B layers independently ends up storing much of the same data 3 times.
Even though "decorrelating" a 24-bit-per-pixel RGB image into a YCbCr requires *more* bits (26 bits) to hold each pixel,decorrelating helps later stages in the compression pipeline do a better job.So independently compressing the Y, Cb, and Cr planesusually results in a smaller compressed file than independently compressing the R, G, and B layers of the same file.
Most (but not all!) lossless color space transforms, including the JPEG 2000 RCT, are "expanding": they convert 24-bit RGB pixels to intermediate YCbCr pixels that require *more* than 24 bits to represent losslessly.
A few lossless color space transforms are non-expanding: they convert 24-bit RGB pixels to intermediate pixels (such as YCoCg24) that require the same 24 bits to represent losslessly.
All color space transforms that convert 24-bit RGB pixels to something less than 24 bits per pixel are lossy: there are 2^24 distinct 24-bit RGB values, so by the pigeonhole principle any representation with fewer than 24 bits per pixel must map at least two distinct RGB values to the same code.
To do:
... (fill in more details) ...
Most modern lossy audio codecs use the modified discrete cosine transform (MDCT) as the decorrelation stage (as in Vorbis, AAC, and AC-3) or as one of the decorrelation stages (as in MP3 and ATRAC).
Speech compression often uses a relatively simple model of the human voice. The glottis produces a buzz at some frequency. The vocal tract resonates or dampens that frequency and its harmonics. The tongue and lips occasionally produce hissing sibilants and popping plosive sounds.
Wikipedia has related information at linear predictive coding
A fairly intelligible speech sound can be reconstructed at a low bit rate by updating a linear predictive coding model at 30 to 50 frames per second.
There are several ways to represent the linear predictive filter coefficients.
Wikipedia has related information at line spectral pairs
While in principle the linear predictive filter coefficients could be transmitted directly, typically the system is set up so (after entropy decoding) the received bits represent reflection coefficients,[3] log area ratios, line spectral pairs, or some other representation of the filter coefficients.
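As a very rough sketch of the synthesis side of that model (this is not any particular codec; the filter order, coefficients, and pitch period below are made-up illustrative values), a periodic "buzz" is pushed through an all-pole filter:

```python
# Sketch: LPC synthesis -- an excitation signal (the "buzz" from the glottis)
# pushed through an all-pole filter (the resonances of the vocal tract).
# Coefficients and pitch period here are arbitrary illustrative values.

lpc_coeffs = [1.3, -0.6, 0.1]          # a[1..p]; a real codec updates these every frame
pitch_period = 80                       # samples between glottal pulses (made up)
num_samples = 400

excitation = [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]

output = []
for n in range(num_samples):
    # y[n] = e[n] + sum_k a[k] * y[n-k]
    y = excitation[n]
    for k, a in enumerate(lpc_coeffs, start=1):
        if n - k >= 0:
            y += a * output[n - k]
    output.append(y)

# "output" is now a crude synthetic voiced sound; a speech codec transmits only
# the filter coefficients and a compact description of the excitation per frame.
```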
... (fill in details) ...
DEFLATE combines LZ77 with Huffman encoding. It is specified in RFC 1951.
It is extremely widely used: ZIP files, gzip files, and PNG image files are all decoded with DEFLATE.
The "zlib compressed data format specification" is a standardized way of packaging compressed data in a file, with a simple header that indicates the type of compression and helps detect the most common kinds of data corruption.It is often used to store data compressed with DEFLATE with a window size up to 32K.
The "zlib compressed data format specification" is specified inRFC 1950.
Wikipedia has related information at LZX (algorithm)
The LZX file format and compression algorithm were invented by Jonathan Forbes and Tomi Poutanen. LZX is used in a popular stand-alone Amiga file archiver. Microsoft Compressed HTML Help (CHM) files use LZX compression. Windows Imaging Format (".WIM") files support several data compression methods, including LZX. Xbox Live Avatars use LZX compression.
LZX uses both LZ77-style sliding window compression and dynamic Huffman codes, somewhat like DEFLATE uses both. The LZ77-style window size (WINDOW_SIZE) is some power of two, at least 2^15 and at most 2^21. LZX uses several Huffman tree structures, each of them canonical.
The most important Huffman tree in LZX is the "main tree", composed of 256 elements corresponding to the 256 possible literal byte values, plus 8*NUM_POSITION_SLOTS elements corresponding to matches. (NUM_POSITION_SLOTS is in the range of 30 slots to 50 slots.) The second most important Huffman tree in LZX is the "length tree", composed of 249 elements. Other trees have smaller roles.
The LZX compressor arbitrarily divides up a file into one or more blocks. Each block is one of 3 types: "verbatim block", "aligned offset block", "uncompressed block". The "uncompressed block" has a few more things in the header, and then a literal copy of the uncompressed data.
In the data portion of the "verbatim block" and "aligned offset block" blocks, LZX uses length-limited Huffman codewords; LZX limits each codeword to at most 16 bits wide. Each "verbatim block" and "aligned offset block" has a larger header that gives the information (in a very compact differentially-compressed form) required for the decoder to regenerate the various Huffman trees used in that block.
The header is decoded to get the codeword length of each symbol; each length is 0 to 16 bits (inclusive), where "0 bits" indicates the corresponding symbol never occurs in this block, and "16 bits" is used for symbols that occur extremely rarely. A variety of techniques are used to compress the header into relatively few bits.
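This is what "canonical" buys the decoder: once the header has been decoded into one codeword length per symbol, the actual codewords can be regenerated from the lengths alone. The sketch below is the standard canonical-Huffman construction, not LZX-specific bitstream detail:

```python
# Sketch: rebuilding canonical Huffman codewords from a list of code lengths.
# A length of 0 means "symbol does not occur in this block".

def canonical_codes(lengths):
    # Sort symbols by (length, symbol value); assign consecutive codewords,
    # shifting left each time the length increases.
    symbols = sorted((l, s) for s, l in enumerate(lengths) if l > 0)
    codes = {}
    code = 0
    prev_len = symbols[0][0]
    for length, sym in symbols:
        code <<= (length - prev_len)
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

# Example: symbols 0 and 1 get 2-bit codes, symbols 2..4 get 3-bit codes.
print(canonical_codes([2, 2, 3, 3, 3]))
# {0: '00', 1: '01', 2: '100', 3: '101', 4: '110'}
```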
When the "distance" value is decoded, there are several special values.It happens quite often that two long phrases of English text are *almost* identical, but there is a small typo or substitution in the middle of the otherwise-identical long phrase.With the original LZ77, the result is 3 copy items: [long distance X to first part, length of first part], [distance Y to the insertion, length of insertion], [exactly the same long distance X to start of second part, length of second part].LZX, rather than repeat exactly the same long distance X twice, uses a special "distance" value to indicate that we are "continuing" to copy a previous block after making a substitution in the middle. (The special distance value ends up occupying fewer bits than the full long distance).
After the decompressor uses the block header to set up all the trees, the decompressor's main loop in a verbatim block is:
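Roughly the following, as a simplified sketch only: the helper names below are hypothetical placeholders, and the real LZX specification adds position-slot tables, repeated-offset handling, and other bit-level details omitted here.

```python
# Simplified sketch of the verbatim-block decode loop; helper functions
# (decode_symbol, read_extra_bits, copy_from_window, ...) are hypothetical.
while not end_of_block():
    element = decode_symbol(main_tree)
    if element < 256:
        output_literal(element)                      # literal byte
    else:
        # elements >= 256 describe a match: a short length plus a position slot
        short_length, position_slot = split_match_element(element)
        if short_length == MAX_SHORT_LENGTH:
            length = decode_symbol(length_tree)      # long lengths come from the length tree
        else:
            length = short_length
        # the position slot selects a base distance; a verbatim block reads
        # the remaining distance bits directly from the bit stream
        distance = slot_base[position_slot] + read_extra_bits(slot_extra_bits[position_slot])
        copy_from_window(distance, length)
```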
After the decompressor uses the block header to set up all the trees, the decompressor's main loop in an aligned block is:
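Roughly the same sketch applies, with one difference in how the low bits of the distance are decoded (again with hypothetical helper names, not the exact specification):

```python
# Simplified sketch of the aligned-offset-block decode loop; only the
# distance decoding differs from the verbatim loop above.
while not end_of_block():
    element = decode_symbol(main_tree)
    if element < 256:
        output_literal(element)
    else:
        short_length, position_slot = split_match_element(element)
        length = decode_symbol(length_tree) if short_length == MAX_SHORT_LENGTH else short_length
        extra = slot_extra_bits[position_slot]
        if extra >= 3:
            # high distance bits come from the bit stream, but the low
            # 3 bits are decoded with the separate "aligned offset" tree
            distance = (slot_base[position_slot]
                        + (read_extra_bits(extra - 3) << 3)
                        + decode_symbol(aligned_offset_tree))
        else:
            distance = slot_base[position_slot] + read_extra_bits(extra)
        copy_from_window(distance, length)
```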
The Microsoft "cabinet" (".CAB") compressed archive file formatsupports 3 data compression methods:DEFLATE, LZX, and Quantum compression.The compressor that puts executable files into a ".CAB" archive may optionally"detect "CALL" instructions, converting their operands from relative addressing to absolute addressing, thus calls to the same location resulted in repeated strings that the compressor could match, improving compression of 80x86 binary code."The ".CAB" decompressor, after doing LZX decompression described above, and if the "preprocessing" bit in the header was set to one, must convert those operands back to relative addressing.We will discuss such preprocessor/decorrelators in general, andExecutable Compression specifically, in more detail in a later chapter.
LZX DELTA (LZXD) is a variation of LZX modified to support efficient delta compression (patching).
A LZX DELTA decompressor takes the old version of some plaintext file (this must be exactly the same as the old version used by the compressor) and the LZXD-compressed patch file, and uses them to generate the new version of that plaintext file.
The main difference between LZXD and LZX is that the window is pre-loaded with the reference file, with the decompressor decoding the first byte of the output file into the window at the location immediately after the last byte of the reference file. Normal LZX copy items have a distance that is at most the full distance back to the first byte of the output file; LZXD allows a distance that reaches much further back, to allow copies from the reference file.
LZXD supports many more "slots", up to 290 slots, which are encoded into the main tree.This allows LZXD copy items to grab things much further into the past, allowing it to take advantage of very long reference files.
LZXD also supports much larger "length" values. Much as LZX has a "short length" of 0 to 7, where "7" is used as an escape to get a "long length" from 7 to 255, LZXD uses the same "short length" and "long length" and goes one step further, using a long length of "257" as an escape for an "extra-long length" that can copy 32 KByte in a single copy.
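In generic terms, a pre-loaded window works like the following sketch (illustrative only; this is not LZXD's real copy-item encoding, slot table, or length escape mechanism): copy items may reach back into the reference file as well as into the output produced so far.

```python
# Generic sketch of delta decompression with a pre-loaded window.
def delta_decode(reference, tokens):
    window = bytearray(reference)          # window starts out holding the old file
    new_start = len(reference)             # new output begins right after it
    for token in tokens:
        if isinstance(token, bytes):
            window += token                # literal bytes
        else:
            distance, length = token       # copies may reach back into the reference file
            for _ in range(length):
                window.append(window[-distance])
    return bytes(window[new_start:])

old = b"version 1 of some file with a long common passage."
# Rebuild a new version mostly by copying from the old one:
tokens = [b"version 2", (len(old), len(old) - 9)]
print(delta_decode(old, tokens))
# b'version 2 of some file with a long common passage.'
```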
LZARI uses an LZSS pre-processor followed by arithmetic encoding. LZARI was developed by Haruhiko Okumura.
Wikipedia has related information at LHA (file format)
LZH uses an LZSS pre-processor followed by Huffman coding. LZH was developed by Haruyasu Yoshizaki for the LHA archiver, based on LZARI.[10]
A LZB compressor uses an LZSS pre-processor followed by Elias gamma coding of the "match length" and the "offset". (That extra step gives LZB a better compression ratio than LZSS alone, but makes it run slower.)
LZB gets compression roughly equal to LZH and LZARI, but is simpler and faster,[10] because a universal code such as Elias gamma coding is simpler and faster to decode than Huffman or arithmetic encoding (respectively).
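For reference, Elias gamma coding itself is tiny. Here is a sketch; it codes positive integers, so an LZB-style coder has to map its lengths and offsets onto 1, 2, 3, ... first.

```python
# Sketch: Elias gamma coding of positive integers.
# A value n is sent as (number of bits in n) - 1 zeros, followed by n in binary.

def gamma_encode(n):
    assert n >= 1
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary

def gamma_decode(bits, pos=0):
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    value = int(bits[pos + zeros : pos + 2 * zeros + 1], 2)
    return value, pos + 2 * zeros + 1          # decoded value, next bit position

for n in [1, 2, 3, 4, 9, 100]:
    code = gamma_encode(n)
    assert gamma_decode(code)[0] == n
    print(n, "->", code)
# 1 -> 1, 2 -> 010, 3 -> 011, 4 -> 00100, 9 -> 0001001, 100 -> 0000001100100
```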
The Lempel–Ziv–Markov chain algorithm (w:LZMA) uses a LZ77-like compression algorithm, whose output is encoded with range coding.
LZMA is one of the compression methods used in the open-source 7-Zip file archiver.
The LZ77-like compression part of LZMA has many similarities with the corresponding LZ77-like compression part of LZX. LZMA does not use Huffman codes -- instead, it uses context-coded binary arithmetic codes.
LZHAM (LZ, Huffman, Arithmetic, Markov) is a general purpose lossless data compression library by Richard Geldreich, Jr. Both the unstable/experimental version and the "bitstream compatibility" stable version of LZHAM are released under the MIT license.[11][12]
Some people combine LZW and Huffman encoding.
The decoder for such a system reads a series of Huffman codes from the compressed file. The decoder decodes each variable-length Huffman code into an intermediate fixed-length 12-bit index. With a typical plaintext file, some of those 12-bit numbers are used far more frequently than others, and so are assigned Huffman codes of fewer than 12 bits. The decoder then uses an LZW decoder to recover the original variable-length plaintext of 1 or more characters associated with that intermediate index.
Many compressed file formats must be decompressed using an entropy decoder, followed by a whole series of transformations -- a kind of Wikipedia: Pipeline (software).
There are 4 ways to implement such decompressors (and the corresponding compressors):
Converting software from a "call down to fetch more data" API to a "return up to fetch more data" API is difficult.[14]
Because encrypt-then-compress doesn't make the file any smaller (good ciphertext looks like random noise, which is incompressible), many people do compress-then-encrypt.[15][16][17][18][19][20][21][22][23]
Because modern CPUs process data faster than data can be read off a spinning disk or most external flash drives, it is possible to read and write a compressed, encrypted file in less time than it takes to read and write a normal plain text file containing the same data.[24]
A few people are doing research on algorithms that compress and encrypt simultaneously.[25][26][27][28][29][30]
Some researchers test encryption algorithms by trying to compress the encrypted data. Many encryption algorithms are specifically designed to produce ciphertext that is "indistinguishable from random noise".[31] When testing an implementation of such an algorithm, if compression produces a compressed output file that is smaller than the encrypted input file, that tells the researcher that there is a flaw in the encryption software -- either a bug in the implementation, or a correct implementation of an insecure algorithm.
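That test is easy to approximate in a few lines. In this sketch, os.urandom stands in for well-encrypted ciphertext, since both should look like uniformly random bytes:

```python
# Sketch: good ciphertext should be incompressible, just like random bytes.
import os
import zlib

ciphertext_stand_in = os.urandom(1000000)   # stand-in for the output of a good cipher
compressed = zlib.compress(ciphertext_stand_in, 9)

# For random-looking input, DEFLATE produces slightly *larger* output
# (stored blocks plus framing overhead), never a meaningfully smaller one.
print(len(ciphertext_stand_in), "->", len(compressed))
assert len(compressed) >= len(ciphertext_stand_in)
```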
Some people using lossless compression prefer to use "idempotent" (is there a better term for this?) compression software, even if it takes a little longer to compress.[32] In other words, when they compress a file with "maximum compression" (often "-9n") to create one compressed file, then uncompress it, then re-compress it, they expect the resulting second compressed file to be identical to the first compressed file.
(Is this "idempotent" the same idea as what others call "deterministic"?[33])
This generally requires that each and every transformation implementation (at least when passed the "-9n" option) is idempotent. Some things that are not idempotent (and so should be avoided, at least when passed the "-9n" option) include, for example, embedding the compression timestamp or the original file name in the compressed output -- the header fields that gzip's "-n" option omits.
(FIXME: replace "idempotent" above with "non-drifting lossy", as defined below? Or is there an even better term?)
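Whatever the right term is, the property itself is easy to check. This sketch uses Python's zlib, which happens to pass because its output depends only on the input bytes and the compression level -- no timestamps, file names, or thread-dependent choices:

```python
# Sketch: checking that compress(decompress(compress(x))) == compress(x).
import zlib

data = b"some example plaintext " * 1000

first = zlib.compress(data, 9)
second = zlib.compress(zlib.decompress(first), 9)

assert first == second      # byte-for-byte identical recompression
```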
Some information theory people[34] classify transformations into 4 categories of loss:
Some transforms are known not to completely remove all redundancy. In particular, researchers have designed special LZ77 encoders and LZW encoders that encode extra "redundant bits" in the choice of which copy was selected, in a way that is backwards-compatible: standard LZ77 or LZW decoders can extract the original file, while special LZ77 or LZW decoders can extract the original file plus these extra "redundant bits". These extra bits can be used for Reed-Solomon or other forms of error correction, or to store metadata about the original file such as a public signature or message authentication code.[35][36]
Network sparsification has applications in data compression.[37]