In information theory, an entropy coding (or entropy encoding) is any lossless data compression method that attempts to approach the lower bound declared by Shannon's source coding theorem, which states that any lossless data compression method must have an expected code length greater than or equal to the entropy of the source.[1][2]
More precisely, the source coding theorem states that for any source distribution, the expected code length satisfies $\mathbb{E}_{x\sim P}[\ell(d(x))] \geq \mathbb{E}_{x\sim P}[-\log_b(P(x))]$, where $\ell$ is the function specifying the number of symbols in a code word, $d$ is the coding function, $b$ is the number of symbols used to make output codes and $P$ is the probability of the source symbol. An entropy coding attempts to approach this lower bound.[2][3]
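As an illustration of the bound (a toy example, not taken from the article), consider a four-symbol source with dyadic probabilities and a binary prefix code; a short Python check compares the expected code length against the entropy term on the right-hand side:

```python
import math

# Toy source: four symbols with unequal probabilities (illustrative assumption).
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# A binary prefix code (b = 2); len(code[x]) plays the role of l(d(x)).
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Right-hand side of the theorem: E[-log_b P(x)], the source entropy in bits.
entropy = sum(-p * math.log2(p) for p in P.values())

# Left-hand side: expected code length E[l(d(x))].
expected_len = sum(P[x] * len(code[x]) for x in P)

print(f"entropy         = {entropy:.3f} bits/symbol")
print(f"expected length = {expected_len:.3f} bits/symbol")
assert expected_len >= entropy - 1e-12  # the inequality the theorem guarantees
```

Because the probabilities here are powers of 1/2, this particular code meets the bound with equality; for general distributions the expected length is strictly larger unless the code lengths exactly match $-\log_2 P(x)$.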
Two of the most common entropy coding techniques are Huffman coding and arithmetic coding.[4][5] If the approximate entropy characteristics of a data stream are known in advance (especially for signal compression), a simpler static code may be useful. These static codes include universal codes (such as Elias gamma coding or Fibonacci coding) and Golomb codes (such as unary coding or Rice coding).[5]
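To make the first of these techniques concrete, here is a minimal sketch of binary Huffman code construction (the helper name huffman_code and the "abracadabra" sample are illustrative assumptions, not part of the article):

```python
import heapq
from collections import Counter

def huffman_code(frequencies):
    """Build a binary Huffman code from a {symbol: count} mapping.

    Repeatedly merges the two least-frequent subtrees, prepending '0'
    and '1' to the codes of the symbols each subtree contains.
    """
    heap = [(weight, i, {sym: ""})
            for i, (sym, weight) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    if len(heap) == 1:              # degenerate one-symbol alphabet
        return {sym: "0" for sym in frequencies}
    tie = len(heap)                 # tie-breaker so dicts are never compared
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

text = "abracadabra"
codes = huffman_code(Counter(text))
encoded = "".join(codes[ch] for ch in text)
print(codes)
print(f"{8 * len(text)} bits as ASCII -> {len(encoded)} bits Huffman-coded")
```

Huffman coding assigns whole-bit code lengths per symbol, so it can fall slightly short of the entropy bound; arithmetic coding and ANS avoid this by effectively allowing fractional bits per symbol.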
Since 2014, data compressors have started using the asymmetric numeral systems (ANS) family of entropy coding techniques, which combines the compression ratio of arithmetic coding with a processing cost similar to that of Huffman coding.[6][1] ANS has been adopted by compressors developed by Facebook (Zstandard), Apple (LZFSE), and Google (Draco), among others.[6]
Entropy coding exploits the fact that some symbols occur more frequently than others. When symbol probabilities are unequal, some outcomes are more predictable, and this predictability can be used to represent the data in fewer bits. Conversely, when all symbols are equally likely, each symbol carries the maximum possible amount of information and no compression is possible.[3][2]
When compression is not possible: A stream of independent fair coin flips, where heads and tails each occur with probability 0.5, has an entropy of 1 bit per symbol, exactly the cost of storing one binary digit. Since every symbol already takes the minimum possible space, there is no redundancy to exploit, and no entropy coding method can make the data any smaller on average. The same principle applies to larger alphabets: independent ternary symbols (0, 1, 2) each with probability 1/3 have an entropy of about 1.585 bits per symbol, the maximum for a three-symbol alphabet, and are likewise incompressible.[3][2]
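These figures follow directly from the entropy formula $H = -\sum_i p_i \log_2 p_i$ and can be checked in a few lines of Python (the helper name entropy_bits is only for illustration):

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy_bits([0.5, 0.5]))          # fair coin: 1.0 bit/symbol
print(entropy_bits([1/3, 1/3, 1/3]))     # uniform ternary: ~1.585 bits/symbol
```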
When compression is possible: If the same binary source instead produces 1s with probability 0.9 and 0s with probability 0.1, the entropy drops to about 0.469 bits per symbol. This is well below the 1-bit storage cost, because the predominance of 1s makes each symbol partially predictable. An entropy coder such asarithmetic coding can exploit this predictability to achieve a compression ratio of roughly 2.1:1 by assigning shorter codes to the more common symbol.[3][5]
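The quoted entropy and compression ratio can be reproduced with the same formula (a quick numerical check, not a full arithmetic coder):

```python
import math

p_one, p_zero = 0.9, 0.1

# Entropy of the skewed binary source, in bits per symbol.
h = -(p_one * math.log2(p_one) + p_zero * math.log2(p_zero))
print(f"entropy ~= {h:.3f} bits/symbol")      # ~0.469

# An ideal entropy coder spends h bits per symbol instead of 1,
# so the achievable compression ratio is about 1 / h.
print(f"compression ratio ~= {1 / h:.2f}:1")  # ~2.13:1
```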
Practical example: English text has an alphabet of roughly 27 characters (26 letters plus a space). If all characters occurred equally often, each would require about 4.75 bits. However, because letter frequencies are highly unequal ('e' occurs far more often than 'z') and letters are not independent ('u' almost always follows 'q'), the true entropy of English has been estimated at roughly 1.0 to 1.5 bits per character. This large gap is what makes English text highly compressible.[7][3]
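The numbers in this example amount to a short back-of-the-envelope calculation (the two values in the loop are simply the estimated entropy range quoted above):

```python
import math

# Fixed-length cost for a 27-character alphabet (26 letters + space).
uniform_bits = math.log2(27)
print(f"uniform cost: {uniform_bits:.2f} bits/char")    # ~4.75

# Implied compression ratio if the entropy of English is 1.0-1.5 bits/char.
for h in (1.0, 1.5):
    print(f"H = {h} bits/char -> ratio ~= {uniform_bits / h:.1f}:1")
```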
Besides using entropy coding as a way to compress digital data, an entropy encoder can also be used to measure the amount of similarity between streams of data and already existing classes of data. This is done by generating an entropy coder/compressor for each class of data; unknown data is then classified by feeding the uncompressed data to each compressor and seeing which compressor yields the highest compression. The coder with the best compression is probably the coder trained on the data that was most similar to the unknown data.[8] This approach is grounded in the concept of normalized compression distance, a parameter-free, universal similarity metric based on compression that approximates the uncomputable normalized information distance.[8][9]
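A minimal sketch of this classification idea, under the simplifying assumption that each class is modeled by its symbol frequencies and that "compression" is measured as the ideal entropy-coded length of the unknown data under each model (the class labels, sample data, and helper names are illustrative):

```python
import math
from collections import Counter

def ideal_coded_length(data, model_counts, alphabet_size=256):
    """Bits an ideal entropy coder would spend on `data` using a
    Laplace-smoothed symbol model estimated from one class's training data."""
    total = sum(model_counts.values()) + alphabet_size  # +1 count per symbol
    bits = 0.0
    for byte in data:
        p = (model_counts.get(byte, 0) + 1) / total
        bits += -math.log2(p)
    return bits

def classify(unknown, training_sets):
    """Pick the class whose model compresses `unknown` best (fewest bits)."""
    costs = {
        label: ideal_coded_length(unknown, Counter(sample))
        for label, sample in training_sets.items()
    }
    return min(costs, key=costs.get), costs

# Illustrative (made-up) training data: two "classes" of byte streams.
training = {
    "english": b"the quick brown fox jumps over the lazy dog " * 20,
    "dna":     b"acgtacgtgattacagattacaacgt" * 40,
}
label, costs = classify(b"gattaca gattaca acgt", training)
print(label, {k: round(v, 1) for k, v in costs.items()})
```

Practical systems following this approach typically use full compressors rather than plain symbol-frequency models, since higher-order context capture is what makes the similarity measure discriminative.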