In information theory, an entropy coding (or entropy encoding) is any lossless data compression method that attempts to approach the lower bound declared by Shannon's source coding theorem, which states that any lossless data compression method must have an expected code length greater than or equal to the entropy of the source.[1]
More precisely, the source coding theorem states that for any source distribution, the expected code length satisfies $\mathbb{E}_{x \sim P}[\ell(d(x))] \geq \mathbb{E}_{x \sim P}[-\log_b(P(x))]$, where $\ell$ is the function specifying the number of symbols in a code word, $d$ is the coding function, $b$ is the number of symbols used to make output codes and $P$ is the probability of the source symbol. An entropy coding attempts to approach this lower bound.
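As a rough illustration of the bound, the following Python sketch computes the entropy of a small source and compares it with the expected code length of a binary Huffman code, a standard prefix-code construction. The function names (`entropy_bits`, `huffman_code_lengths`, `expected_length`) and the example distribution are assumptions made for this illustration, and the output alphabet is assumed to be binary (base 2).

```python
import heapq
import math


def entropy_bits(probs):
    """Shannon entropy in bits per symbol (binary output alphabet)."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def huffman_code_lengths(probs):
    """Return {symbol: code word length in bits} for a binary Huffman code."""
    if len(probs) == 1:
        # A single symbol still needs one bit to be written out.
        return {s: 1 for s in probs}
    # Heap entries: (probability, unique tie-breaker, {symbol: depth so far}).
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, next_id, merged))
        next_id += 1
    return heap[0][2]


def expected_length(probs, lengths):
    """Expected code word length under the source distribution."""
    return sum(probs[s] * lengths[s] for s in probs)


# Hypothetical source distribution chosen for illustration.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(probs)
H = entropy_bits(probs)
L = expected_length(probs, lengths)
print(f"entropy = {H:.3f} bits, expected length = {L:.3f} bits")
```

For this distribution every probability is a power of 1/2, so the expected length meets the entropy exactly; for other distributions the expected length of a symbol-by-symbol code is strictly larger than the entropy.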
Two of the most common entropy coding techniques are Huffman coding and arithmetic coding.[2] If the approximate entropy characteristics of a data stream are known in advance (especially for signal compression), a simpler static code may be useful. These static codes include universal codes (such as Elias gamma coding or Fibonacci coding) and Golomb codes (such as unary coding or Rice coding).
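To make the idea of a static code concrete, here is a minimal Python sketch of three of the codes named above: unary, Elias gamma and Rice coding. Code words are returned as strings of "0" and "1" for readability. The function names are chosen for this example, and the unary convention used (ones terminated by a zero) is an assumption, since the opposite convention is also in use.

```python
def unary(n):
    """Unary code for n >= 0: n ones followed by a terminating zero."""
    if n < 0:
        raise ValueError("unary coding requires n >= 0")
    return "1" * n + "0"


def elias_gamma(n):
    """Elias gamma code for n >= 1: (bit length - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma coding requires n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def rice(n, k):
    """Rice code for n >= 0 with parameter k (a Golomb code with M = 2**k).

    The quotient n // 2**k is written in unary, followed by the
    remainder as a fixed-width k-bit binary number.
    """
    if n < 0 or k < 0:
        raise ValueError("Rice coding requires n >= 0 and k >= 0")
    quotient = n >> k
    remainder = n & ((1 << k) - 1)
    remainder_bits = format(remainder, "b").zfill(k) if k > 0 else ""
    return unary(quotient) + remainder_bits


for value in (1, 2, 5, 9):
    print(value, unary(value), elias_gamma(value), rice(value, 2))
```

None of these codes needs a transmitted probability table: the code word for each integer is fixed in advance, which is why they suit sources whose rough distribution (for example, small values being most likely) is already known.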
Since 2014, data compressors have started using the asymmetric numeral systems family of entropy coding techniques, which allows combination of the compression ratio of arithmetic coding with a processing cost similar to Huffman coding.
Besides using entropy coding as a way to compress digital data, an entropy encoder can also be used to measure the amount of similarity between streams of data and already existing classes of data. This is done by generating an entropy coder/compressor for each class of data; unknown data is then classified by feeding the uncompressed data to each compressor and seeing which compressor yields the highest compression. The coder with the best compression is probably the coder trained on the data that was most similar to the unknown data.
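The classification procedure can be sketched in Python as follows. This is only an illustration under simplifying assumptions: instead of running a real entropy coder, each class is represented by a smoothed byte-frequency model, and the ideal code length that an entropy coder driven by that model would approach is used as the compressed size. The function names and the tiny training samples are hypothetical.

```python
import math
from collections import Counter


def train_model(sample):
    """Estimate byte probabilities from a training sample (bytes).

    Add-one smoothing keeps every byte value at a nonzero probability,
    so unseen bytes get a long but finite code length.
    """
    counts = Counter(sample)
    total = len(sample) + 256
    return [(counts.get(b, 0) + 1) / total for b in range(256)]


def code_length_bits(model, data):
    """Ideal code length of data, in bits, under the given model."""
    return -sum(math.log2(model[b]) for b in data)


def classify(models, data):
    """Return the class whose model yields the shortest code length."""
    return min(models, key=lambda name: code_length_bits(models[name], data))


# Hypothetical training data for two classes.
training = {
    "english": b"the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "digits": b"3141592653589793238462643383279502884197169399375105820974944",
}
models = {name: train_model(sample) for name, sample in training.items()}

print(classify(models, b"the dog and the fox"))
print(classify(models, b"271828182845"))
```

The shortest code length corresponds to the highest compression, so the class returned is the one whose training data was statistically closest to the unknown input under this simple model.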