Brief tutorial on CRC computation¶
A CRC is a long-division remainder. You add the CRC to the message,and the whole thing (message+CRC) is a multiple of the givenCRC polynomial. To check the CRC, you can either check that theCRC matches the recomputed value,or you can check that theremainder computed on the message+CRC is 0. This latter approachis used by a lot of hardware implementations, and is why so manyprotocols put the end-of-frame flag after the CRC.
It’s actually the same long division you learned in school, except that:
We’re working in binary, so the digits are only 0 and 1, and
When dividing polynomials, there are no carries. Rather than add andsubtract, we just xor. Thus, we tend to get a bit sloppy aboutthe difference between adding and subtracting.
Like all division, the remainder is always smaller than the divisor.To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial.Since it’s 33 bits long, bit 32 is always going to be set, so usually theCRC is written in hex with the most significant bit omitted. (If you’refamiliar with the IEEE 754 floating-point format, it’s the same idea.)
Note that a CRC is computed over a string ofbits, so you haveto decide on the endianness of the bits within each byte. To getthe best error-detecting properties, this should correspond to theorder they’re actually sent. For example, standard RS-232 serial islittle-endian; the most significant bit (sometimes used for parity)is sent last. And when appending a CRC word to a message, you shoulddo it in the right order, matching the endianness.
Just like with ordinary division, you proceed one digit (bit) at a time.Each step of the division you take one more digit (bit) of the dividendand append it to the current remainder. Then you figure out theappropriate multiple of the divisor to subtract to bring the remainderback into range. In binary, this is easy - it has to be either 0 or 1,and to make the XOR cancel, it’s just a copy of bit 32 of the remainder.
When computing a CRC, we don’t care about the quotient, so we canthrow the quotient bit away, but subtract the appropriate multiple ofthe polynomial from the remainder and we’re back to where we started,ready to process the next bit.
A big-endian CRC written this way would be coded like:
for (i = 0; i < input_bits; i++) { multiple = remainder & 0x80000000 ? CRCPOLY : 0; remainder = (remainder << 1 | next_input_bit()) ^ multiple;}Notice how, to get at bit 32 of the shifted remainder, we lookat bit 31 of the remainderbefore shifting it.
But also notice how thenext_input_bit() bits we’re shifting intothe remainder don’t actually affect any decision-making until32 bits later. Thus, the first 32 cycles of this are pretty boring.Also, to add the CRC to a message, we need a 32-bit-long hole for it atthe end, so we have to add 32 extra cycles shifting in zeros at theend of every message.
These details lead to a standard trick: rearrange merging in thenext_input_bit() until the moment it’s needed. Then the first 32 cyclescan be precomputed, and merging in the final 32 zero bits to make roomfor the CRC can be skipped entirely. This changes the code to:
for (i = 0; i < input_bits; i++) { remainder ^= next_input_bit() << 31; multiple = (remainder & 0x80000000) ? CRCPOLY : 0; remainder = (remainder << 1) ^ multiple;}With this optimization, the little-endian code is particularly simple:
for (i = 0; i < input_bits; i++) { remainder ^= next_input_bit(); multiple = (remainder & 1) ? CRCPOLY : 0; remainder = (remainder >> 1) ^ multiple;}The most significant coefficient of the remainder polynomial is storedin the least significant bit of the binary “remainder” variable.The other details of endianness have been hidden in CRCPOLY (which mustbe bit-reversed) andnext_input_bit().
As long as next_input_bit is returning the bits in a sensible order, we don’thave to wait until the last possible moment to merge in additional bits.We can do it 8 bits at a time rather than 1 bit at a time:
for (i = 0; i < input_bytes; i++) { remainder ^= next_input_byte() << 24; for (j = 0; j < 8; j++) { multiple = (remainder & 0x80000000) ? CRCPOLY : 0; remainder = (remainder << 1) ^ multiple; }}Or in little-endian:
for (i = 0; i < input_bytes; i++) { remainder ^= next_input_byte(); for (j = 0; j < 8; j++) { multiple = (remainder & 1) ? CRCPOLY : 0; remainder = (remainder >> 1) ^ multiple; }}If the input is a multiple of 32 bits, you can even XOR in a 32-bitword at a time and increase the inner loop count to 32.
You can also mix and match the two loop styles, for example doing thebulk of a message byte-at-a-time and adding bit-at-a-time processingfor any fractional bytes at the end.
To reduce the number of conditional branches, software commonly usesthe byte-at-a-time table method, popularized by Dilip V. Sarwate,“Computation of Cyclic Redundancy Checks via Table Look-Up”, Comm. ACMv.31 no.8 (August 1988) p. 1008-1013.
Here, rather than just shifting one bit of the remainder to decidein the correct multiple to subtract, we can shift a byte at a time.This produces a 40-bit (rather than a 33-bit) intermediate remainder,and the correct multiple of the polynomial to subtract is found usinga 256-entry lookup table indexed by the high 8 bits.
(The table entries are simply the CRC-32 of the given one-byte messages.)
When space is more constrained, smaller tables can be used, e.g. two4-bit shifts followed by a lookup in a 16-entry table.
It is not practical to process much more than 8 bits at a time using thistechnique, because tables larger than 256 entries use too much memory and,more importantly, too much of the L1 cache.
To get higher software performance, a “slicing” technique can be used.See “High Octane CRC Generation with the Intel Slicing-by-8 Algorithm”,ftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf
This does not change the number of table lookups, but does increasethe parallelism. With the classic Sarwate algorithm, each table lookupmust be completed before the index of the next can be computed.
A “slicing by 2” technique would shift the remainder 16 bits at a time,producing a 48-bit intermediate remainder. Rather than doing a singlelookup in a 65536-entry table, the two high bytes are looked up intwo different 256-entry tables. Each contains the remainder requiredto cancel out the corresponding byte. The tables are different because thepolynomials to cancel are different. One has non-zero coefficients fromx^32 to x^39, while the other goes from x^40 to x^47.
Since modern processors can handle many parallel memory operations, thistakes barely longer than a single table look-up and thus performs almosttwice as fast as the basic Sarwate algorithm.
This can be extended to “slicing by 4” using 4 256-entry tables.Each step, 32 bits of data is fetched, XORed with the CRC, and the resultbroken into bytes and looked up in the tables. Because the 32-bit shiftleaves the low-order bits of the intermediate remainder zero, thefinal CRC is simply the XOR of the 4 table look-ups.
But this still enforces sequential execution: a second group of tablelook-ups cannot begin until the previous groups 4 table look-ups have allbeen completed. Thus, the processor’s load/store unit is sometimes idle.
To make maximum use of the processor, “slicing by 8” performs 8 look-upsin parallel. Each step, the 32-bit CRC is shifted 64 bits and XORedwith 64 bits of input data. What is important to note is that 4 ofthose 8 bytes are simply copies of the input data; they do not dependon the previous CRC at all. Thus, those 4 table look-ups may commenceimmediately, without waiting for the previous loop iteration.
By always having 4 loads in flight, a modern superscalar processor canbe kept busy and make full use of its L1 cache.
Two more details about CRC implementation in the real world:
Normally, appending zero bits to a message which is already a multipleof a polynomial produces a larger multiple of that polynomial. Thus,a basic CRC will not detect appended zero bits (or bytes). To enablea CRC to detect this condition, it’s common to invert the CRC beforeappending it. This makes the remainder of the message+crc come out notas zero, but some fixed non-zero value. (The CRC of the inversionpattern, 0xffffffff.)
The same problem applies to zero bits prepended to the message, and asimilar solution is used. Instead of starting the CRC computation witha remainder of 0, an initial remainder of all ones is used. As long asyou start the same way on decoding, it doesn’t make a difference.