Aprefix code is a type ofcode system distinguished by its possession of the "prefix property", which requires that there is no wholecode word in the system that is aprefix (initial segment) of any other code word in the system. It is trivially true for fixed-length codes, so only a point of consideration forvariable-length codes.
For example, a code with code {9, 55} has the prefix property; a code consisting of {9, 5, 59, 55} does not, because "5" is a prefix of "59" and also of "55". A prefix code is auniquely decodable code: given a complete and accurate sequence, a receiver can identify each word without requiring a special marker between words. However, there are uniquely decodable codes that are not prefix codes; for instance, the reverse of a prefix code is still uniquely decodable (it is a suffix code), but it is not necessarily a prefix code.
Prefix codes are also known asprefix-free codes,prefix condition codes andinstantaneous codes. AlthoughHuffman coding is just one of many algorithms for deriving prefix codes, prefix codes are also widely referred to as "Huffman codes", even when the code was not produced by a Huffman algorithm. The termcomma-free code is sometimes also applied as a synonym for prefix-free codes[1][2] but in most mathematical books and articles (e.g.[3][4]) a comma-free code is used to mean aself-synchronizing code, a subclass of prefix codes.
Using prefix codes, a message can be transmitted as a sequence of concatenated code words, without anyout-of-band markers or (alternatively) special markers between words toframe the words in the message. The recipient can decode the message unambiguously, by repeatedly finding and removing sequences that form valid code words. This is not generally possible with codes that lack the prefix property, for example {0, 1, 10, 11}: a receiver reading a "1" at the start of a code word would not know whether that was the complete code word "1", or merely the prefix of the code word "10" or "11"; so the string "10" could be interpreted either as a single codeword or as the concatenation of the words "1" then "0".
The variable-lengthHuffman codes,country calling codes, the country and publisher parts ofISBNs, the Secondary Synchronization Codes used in theUMTSW-CDMA 3G Wireless Standard, and theinstruction sets (machine language) of most computer microarchitectures are prefix codes.
Prefix codes are noterror-correcting codes. In practice, a message might first be compressed with a prefix code, and then encoded again withchannel coding (including error correction) before transmission.
For anyuniquely decodable code there is a prefix code that has the same code word lengths.[5]Kraft's inequality characterizes the sets of code word lengths that are possible in auniquely decodable code.[6]
If every word in the code has the same length, the code is called afixed-length code, or ablock code (though the termblock code is also used for fixed-sizeerror-correcting codes inchannel coding). For example,ISO 8859-15 letters are always 8 bits long.UTF-32/UCS-4 letters are always 32 bits long.ATM cells are always 424 bits (53 bytes) long. A fixed-length code of fixed lengthk bits can encode up to source symbols.
A fixed-length code is necessarily a prefix code. It is possible to turn any code into a fixed-length code by padding fixed symbols to the shorter prefixes in order to meet the length of the longest prefixes. Alternately, such padding codes may be employed to introduce redundancy that allows autocorrection and/or synchronisation. However, fixed length encodings are inefficient in situations where some words are much more likely to be transmitted than others.
Truncated binary encoding is a straightforward generalization of fixed-length codes to deal with cases where the number of symbolsn is not a power of two. Source symbols are assigned codewords of lengthk andk+1, wherek is chosen so that2k < n ≤ 2k+1.
Huffman coding is a more sophisticated technique for constructing variable-length prefix codes. The Huffman coding algorithm takes as input the frequencies that the code words should have, and constructs a prefix code that minimizes the weighted average of the code word lengths. (This is closely related to minimizing the entropy.) This is a form oflossless data compression based onentropy encoding.
Some codes mark the end of a code word with a special "comma" symbol (also called aSentinel value), different from normal data.[7] This is somewhat analogous to the spaces between words in a sentence; they mark where one word ends and another begins. If every code word ends in a comma, and the comma does not appear elsewhere in a code word, the code is automatically prefix-free. However, reserving an entire symbol only for use as a comma can be inefficient, especially for languages with a small number of symbols.Morse code is an everyday example of a variable-length code with a comma. The long pauses between letters, and the even longer pauses between words, help people recognize where one letter (or word) ends, and the next begins. Similarly,Fibonacci coding uses a "11" to mark the end of every code word.
Self-synchronizing codes are prefix codes that allowframe synchronization.
Asuffix code is a set of words none of which is a suffix of any other; equivalently, a set of words which are the reverse of a prefix code. As with a prefix code, the representation of a string as a concatenation of such words is unique. Abifix code is a set of words which is both a prefix and a suffix code.[8]Anoptimal prefix code is a prefix code with minimal average length. That is, assume an alphabet ofn symbols with probabilities for a prefix codeC. IfC' is another prefix code and are the lengths of the codewords ofC', then.[9]
Examples of prefix codes include:
Commonly used techniques for constructing prefix codes includeHuffman codes and the earlierShannon–Fano codes, anduniversal codes such as: