TECHNICAL FIELDThe present invention relates to the field of networking, and in particular to data compression techniques in a possibly congested network.
BACKGROUND ARTCompressing packet data can result in less congestion in a network, especially in oversubscribed situations. Congestion is reduced because a compressed packet takes less time on the wire and takes less buffer space to store. On a typical switch configuration of 24 slower speed links trunked onto an oversubscribed pair of higher speed ports, even a 20% compression can make the switch an entirely non-blocking switch. Compression can result in lower end-to-end latency in congested networks, but may increase latency in other situations, such as uncongested networks.
SUMMARY OF INVENTIONThe described techniques allow a network device to compress an output data stream adaptively based on indications of network congestion. These indications can include simple pause notifications or more complex congestion notifications that indicate a quantized level of congestion, allowing finer control over the compression of data by the network device.
In embodiments where the indications are simple pause flow control messages, the compression can be enabled upon receipt of the pause message and disabled when the flow control mechanism allows data to begin to flow across the network.
In embodiments where quantized levels of compression are indicated in congestion messages, when the congestion level increases, the congestion technique can be adapted to increase compression. When the congestion level decreases, the congestion technique can be adapted to decrease compression, or even disable compression altogether. The adaptive compression technique can thus achieve a lower end-to-end latency across the network.
The compression technique according to one embodiment matches strings in a data stream against previously seen strings of data, and replaces matched strings with a code word that is shorter than the matched string. A hash table memory is used for the matching process, and in some embodiments, a history table buffer is also provided for determining the length of the matched string.
The compression engine can respond to an indication of a quantized level of congestion by adapting the compression engine to alter one or more of the size of the string being matched, the hash table memory, and the history buffer, as well as altering the number of repetitions of compression. In addition, the compression engine can selectively perform an entropy encoding of the compressed data, resulting in further compression. All of these adaptations can be performed responsive to a quantized level of congestion, in order to vary the compression ratio, increasing the compression when congestion increases and decreasing the compression when congestion decreases.
At the receiving network device, the decompression technique can generally decompress the compressed data stream at line speed, independent of the adaptations of the compression technique performed by the compression engine.
BRIEF DESCRIPTION OF DRAWINGSThe accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an implementation of apparatus and methods consistent with the present invention and, together with the detailed description, serve to explain advantages and principles consistent with the invention. In the drawings,
FIG. 1 is a high-level block diagram of a network switch according to one embodiment;
FIG. 2 is a diagram illustrating different effects of compression on end-to-end latency in a network;
FIG. 3 is a block diagram illustrating a simple network according to one embodiment;
(FIG. 4 is omitted.)
FIG. 5 is an illustration of a start of segment code word and an end of segment code word according to one embodiment;
FIG. 6 is an illustration of two formats of a literal control word according to one embodiment;
FIG. 7 is an illustration of three formats of a match code word according to one embodiment;
FIG. 8 is a block diagram illustrating logic blocks of a compression engine according to one embodiment;
FIG. 9 is a block diagram illustrating logic blocks of a compression engine according to another embodiment;
FIG. 10 is a timing chart illustrating relative timing of compression events according to the embodiments ofFIGS. 8 and 9;
FIG. 11 is a block diagram illustrating a technique for hashing uncompressed data according to one embodiment;
FIG. 12 is a block diagram illustrating matching hash values according to one embodiment;
FIG. 13 is a block diagram illustrating a technique for creating hash table entries according to one embodiment;
FIG. 14 is a block diagram illustrating a hashing function according to one embodiment;
FIG. 15 is a block diagram illustrating logic blocks of a decompression engine logic according to one embodiment;
FIG. 16 is a block diagram illustrating logic blocks of a decompression engine logic according to another embodiment;
FIG. 17 is a block diagram illustrating a decompression logic according to the embodiments ofFIGS. 15 and 16; and
FIG. 18 is a state diagram for a state machine for decompression according to the embodiments ofFIGS. 15 and 16.
DESCRIPTION OF EMBODIMENTSThe following is described in terms of an network switch in an Ethernet network, but the present invention is not limited to Ethernet networks and can be embodied in any type of networking apparatus in which data is transmitted between a source and a destination across a network infrastructure, whether through wired or wireless communications techniques, and using any desired communications protocol that allows for the delivery of self-contained bundles of information, whether identified as packets, frames, or any other nomenclature.
FIG. 1 is a high-level block diagram illustrating major blocks of one embodiment of anetwork switch100. Thenetwork switch100 can be a Fibre Channel over Ethernet switch or any other type of network switch. Thenetwork switch100 is implemented as a PCIe card and contains aPCIe core110 for handling the PCIe protocol interaction between thenetwork switch100 and ahost105 in a system in which thenetwork switch100 is installed. A CoreLogic120, typically implemented as an application specific integrated circuit (ASIC), handles the processing logic for thenetwork switch100. In the embodiment illustrated inFIG. 1, thenetwork switch100 is capable of connecting to an Ethernetlink160 that can range in speed from 10 Mbps to 10 Gbps. A 10Gbps link160 is handled by a 10 G Media Access Control (MAC)logic130, while lower speeds are handled by a 1 G/100/10MAC logic140. Regardless of the speed of thelink160, serialization and deserialization of data across thelink160 is handled by a 10 G/1 G serializer/deserializer (SERDES)logic150. In some embodiments, thenetwork switch100 can handle multiple Ethernet links, and multiple MAC and SERDES logics can be used for that purpose. The high-level structure illustrated inFIG. 1 is illustrative and by way of example only, and other high-level structures and collections of logic blocks can be used as desired.
Thecore logic120 of thenetwork switch100 is capable of compressing and decompressing data streams that are transmitted or received through thenetwork switch100 across thelink160, using a compression/decompression technique that is suitable for on the fly compression/decompression of data packets being transmitted or received across thelink160.
The compression technique used by thecore logic120 buffers a packet until the compressed data is at least half as much as the entire packet length in order to insure no under runs when transmitting the packet. By delaying the packet for this amount, the compression algorithm guarantees that it has less than half of the uncompressed data left to process. If thecore logic120 can process data at wire speed, no under run can occur from this point onwards, even if the uncompressed data results in a single byte of compressed data. If the data is not compressible, then the compression delay is half the time taken to transmit the packet. For example, if it takes 1 μs to transmit a packet, then the delay for an uncompressible packet would be 500 ns. If the data is compressible, then the delay can be longer, with a worst-case delay of 1 μs, assuming the algorithm can operate at wire speed.
FIG. 2 illustrates some scenarios where compression can either decrease or increase latency. If the network, for purposes of this example comprised of asource1, aningress switch2, across bar switch3, anegress switch4, and adestination5, is uncongested and the switches operate in a cut thru mode, as inscenario200, then compression can result in an increased end-to-end latency, as illustrated inscenario210. However, if the network is congested and the switches operate in store and forward mode, as inscenario220, then the end-to-end latency can actually decrease because the compressed packet takes less time on the wire and hence gets to the destination faster, as illustrated inscenario230.
In conventional network switches, compression is not related to network congestion. In one embodiment described below, thenetwork switch100 can determine that a network is uncongested and disable compression or that a network is congested and enable compression. In another embodiment described below, thenetwork switch100 can adjust compression techniques based on quantized network congestion information. The compression technique used is such that the receiving switch can decompress at wire speed, regardless of the amount of compression achieved by the transmitting switch.
FIG. 3 illustrates asimple example network300 with thenetwork switch100 ofFIG. 1 connected to anothernetwork switch310. For purposes of the discussion below,switch100 is connected to the source of a stream of packets or frames of information, and switch310 is connected to the destination of the stream. Although an actual network typically contains numerous other nodes, for clarity of this example, only the twonetwork switches100 and310 are shown inFIG. 3. In addition, although illustrated as directly connected bylink160 for simplicity of the drawing, the network switches100 and310 can be network connected with intermediary nodes.
In one embodiment, network switches100 and310 communicate using a flow control technique, such as found in IEEE 802.1Qbb, “Priority-based Flow Control,” which is incorporated by reference herein in its entirety for all purposes, which allowsnetwork switch310 to signalnetwork switch100 to pause the transmission of data across thelink160 tonetwork switch310. The specific pause signaling technique employed is illustrative and by way of example only, and any convenient technique for one network switch to signal another network switch to pause transmission can be used. In this embodiment, thenetwork switch100 can enable or disable compression based on the receipt of pause signals. When a pause signal is received, thenetwork switch100 can enable compression so that when it is allowed to restart transmitting, it can transmit the compressed the data stream, at which time it can disable compression if desired.
In another embodiment, network switches100 and310 communicate using a technique for sending congestion notifications, such as found in IEEE 802.1Qau, “Quantized Congestion Notifications” (QCN), which is incorporated by reference herein in its entirety for all purposes. QCN allows a receivingnetwork switch310 to provide feedback in the form of a congestion message to a transmittingnetwork switch100 regarding congestion in thenetwork300. As with the transmission pausing flow control technique described above, the QCN technique allows thenetwork switch100 to attempt to reduce congestion by reducing the flow rate.
The QCN technique allows the receivingnetwork switch310 to indicate quantized network congestion levels in a message of a pre-defined format sent through thenetwork300 to the transmittingnetwork switch100. The transmittingnetwork switch100 typically responds by reducing (or increasing) the data flow rate based on the contents of the message. The format of the message used for the congestion notification is not significant for the purposes of this disclosure and is therefore not set out in detail herein.
The congestion level information is described herein as quantized, because a large range of congestion values is approximated by a relative small set of values. The concept of quantization is well known in the art and is not discussed further herein. The specific quantized values for the levels of congestion are not significant for purposes of this application are therefore not discussed herein.
The QCN congestion notification technique outlined above is illustrative and by way of example only, and any convenient technique for one network switch to signal another network switch regarding quantized levels of network congestion can be used.
When thenetwork switch100 receives a congestion notification from thenetwork switch310, indicating congestion onlink160 has occurred or has increased, thenetwork switch100 responds by decreasing the data flow rate across the network. Thenetwork switch100 can also adapt the compression technique responsive to the congestion notification, to attempt to decrease end-to-end latency. By requesting higher compression of the data stream, so that the same uncompressed input can traverse the network in less time.
Similarly, when thenetwork switch100 receives a congestion notification from thenetwork switch310, indicating that the congestion onlink160 has cleared or decreased, then thenetwork switch100 can respond by increasing the data flow rate across the network. In that situation, thenetwork switch100 can also adapt the compression technique used on thelink160 responsive to the congestion notification, to attempt to decrease end-to-end latency by requesting less compression or even suspending compression of the data stream.
How thenetwork switch100 adapts the compression technique can depend on the compression technique used. Some compression techniques, e.g., the JPEG technique commonly used on still images, are considered lossy compression techniques, i.e., some of the original data is lost when compressed and cannot be recreated when the compressed data stream is decompressed. Because the compression technique is used in a data communication application, the compression technique chosen is preferably a lossless data compression technique; i.e., a compression technique that allows for decompression of the original uncompressed content without the loss of any of the source data.
In one embodiment, described in more detail below, thenetwork switch100 employs a form of Liv-Zempel compression, such as a Liv-Zempel-Oberhumer (LZO) compression technique, that is designed to allow fast compression and decompression that can be performed entirely in hardware in a compression engine of thecore logic120, trading compression ratio (the difference in the size of the compressed data stream and the size of the uncompressed data stream) for speed of compression and decompression. There are numerous LZO compression techniques. In one embodiment, the network switches100 and310 implement a McLZO compression and decompression technique. The use of McLZO compression and decompression techniques is illustrative and by way of example only, and any convenient compression technique that provides for lossless compression and decompression at wire speeds can be used. Preferably, the compression technique is implemented in hardware, to maximize the compression and decompression speed.
Before discussing how the disclosed embodiments adapt the compression based on congestion, a brief overview of how McLZO compression and decompression works may be useful.
McLZO is a block compression algorithm, i.e., it compresses and decompresses a block of data at a time, rather than one byte at a time. The block size must be the same for compression and decompression. McLZO compresses a block of data into matches, using a sliding dictionary, and runs of non-matching literals. The technique produces a good compression ratio (the ratio between the uncompressed and compressed data) on highly redundant data and deals acceptably with non-compressible data, only expanding uncompressible data by a maximum of 1/64th of the original size when measured over a block size of at least 1 kB.
The McLZO concept is to find repeating strings in a data stream, replacing later occurrences of the string with a distance and length vector that points to the previous occurrence of the string in the original input data. Because the distance and length vector are smaller than the string that it replaces, the string is compressed by the difference between the size of the original string and the size of the distance and length vector. If the distance field is at least one byte long and the length field is at least one byte long, then the smallest string that can be compressed is 3 bytes long, as the result is a two-byte {distance, length} code. For example, consider the following data streams (where the first byte transmitted is at the left):
Example 1a b c d e f g x y z a b c l m n
could be replaced with:
a b c d e f g x y z {10,3} l m n
where {10,3} indicates that the same 3 byte pattern occurred 10 bytes earlier in the data stream.
Example 2a b c d a b c d a b c d
could be replaced with:
a b c d {4,8}
In this example, the length field is longer than the distance field, indicating that the pattern repeats from the start after every four bytes.
Example 3a a a a a a a a a a a a
could be replaced with: a {1,11}
Example 4a b a b a b a b a b c d
could be replaced by
a b {2,8} c d
In certain situations, a pattern may have occurred multiple times, and may have repeated in variable lengths. For example,
a b c d e f g h i d e f g h a b c d e f g h i
could be replaced by
a b c d e f g h i {6,5} {14,3} {8,5}i
or
a b c d e f g h i {6,5} {14,9}
Both solutions are correct but the second one results in better compression.
The distance and length vector typically comprises a match code word. The decompression technique finds the match codes, reads the location that is pointed to by the distance field, and then reads the number of bytes indicated by the length field. The decompression logic needs a way of identifying whether a particular byte is part of a literal (uncompressed) data, or is part of a match code word. In addition, code words can be inserted into the byte pattern to help the decompression logic determine the string boundaries. Preferably, the decompression technique checks for data corruption to verify that the data after compression and decompression actually matches the original data by including an error checking code, such as a cyclic redundancy check (CRC), in the data stream. For example,
a b c d e f g h i d e f g h a b c d e f g h i
could be replaced by
{l9} a b c d e f g h i {m6,5} {m14,3} {m8,5} {l1} i
where {l9} indicates that there are 9 literal bytes following the literal code word, and {m6,5} indicates a match code word withdistance 6 andlength 5.
The format for the various code words described below is illustrative and by way of example only, and any desired format for the code words can be used.
In one embodiment, illustrated inFIGS. 5-7, a prefix value in the code word indicates the type of code word that follows. InFIG. 5, a start ofsegment code word500 is a two-byte string that indicates the start of a data segment. The first field of all packets or segments of a packet in a data stream according to this embodiment start with this code. A data stream can be a frame or any set of bytes which are to be compressed as a unit. The V bits indicate a version number of the compressor, to allow the decompressor to handle differences in the compression based on the compressor version. The C bit indicates whether the data transmitted in the current segment contains a CRC on the uncompressed data at the end. The E field indicates whether this segment is the last segment for the stream sent to the compressor. It is useful when the size of the data stream sent to the compressor is larger than what can be enumerated in the UncompressedData Size field505. TheUncompressed Data Size505 is the size of the segment before compression and allows determining how much buffering the uncompressed data will take before the decompression is done.
An End ofSegment code word510 is used to mark the end of a data segment can also signal the end of the packet or EOP.
Turning toFIG. 6, two forms of a Literal code word are illustrated.Literal code word600 allows indicating that a string of length 1-32 bytes follows, andLiteral code word610 allows indicating that a string of length 1-2KB follows. In either code word, thelength field605 or615 is expressed as the actual length minus one, because a zero length literal field is not allowed. Aprefix602 or612 allows distinguishing between the two forms of a literal code word.
FIG. 7 illustrates three forms of a Match code word. Inform700, a single Match code byte includes thedistance field702 andlength field704, where thedistance field702 can point backwards in the data stream by up to 64 bytes, and thelength field702 can indicate a length of 3 or 4 bytes. In two-byte form710, alonger distance field712 allows for specifying a matching string up to 4096 bytes away in the data stream, but only allows for a 3-6 byte matching string. Three-byte form720 also allows for adistance field722 indicating the matching string is up to 4096 bytes away and alength field724 allowing a matching length of between 3 and 258 bytes. All three forms of the match code word can be distinguished byunique prefix bits701,711, and721.
In addition to the McLZO compression described above, an entropy encoding, such as Huffman encoding, can be used to increase the compression ratio of the data stream. An entropy encoding technique tries to exploit the fact that certain encodings of a byte occur more often than others. For example, in an ASCII file the encodings that correspond to certain characters occur more often. Commonly occurring bytes can be represented by fewer than 8 bits, and less frequently occurring bytes can be represented by more bits than the commonly occurring bytes. However, since the commonly occurring bytes occur more often, the overall result is a compressed data stream.
Huffman encoding can use a preloaded library or a calculated library of most commonly occurring encodings. The library indicates which encodings happen more often than others and assigns the smallest number of bit representation to the most commonly occurring encoding. In one embodiment, a predetermined library is used. In another embodiment, a calculated library technique analyzes the current data to find which encodings happen most often, and then transmits the calculated library along with the data so that the destination can use it. Both the source and destination must have the same library in order to correctly encode and decode the data, and in a calculated library embodiment, the calculated library is typically transmitted at the beginning of the encoded data.
In one embodiment, thenetwork switch100 can transmit the data stream across thelink160 uncompressed, compressed by the McLZO technique, and compressed by both the McLZO technique and Huffman encoding, depending on the congestion of the network. If the network is uncongested, then the network switch can transmit the data stream uncompressed. If thenetwork switch100 receives a flow control indication of network congestion, then the network switch can transmit the data compressed. If multiple quantized levels of network congestion are distinguishable, thenetwork switch100 can use McLZO compression at lower congestion levels, and at higher congestion levels, the network switch can also include Huffman encoding, in addition to the McLZO compression, all to vary the resulting compression ratio responsive to the quantized level of network congestion. The Huffman encoding can potentially compress the bytes that contain the code words described above, in addition to the literal strings that remain in the data stream after the McLZO compression. In another embodiment, thenetwork switch100 can further change vary the compression ratio responsive to the quantized congestion level information by changing aspects of the compression technique, such as the way in which the length of the matching string is determined, as described below. Thus, compression of the data stream by thenetwork switch100 varies generally proportionally with the level of congestion of the network.
Turning now toFIG. 8, a block diagram illustrates one embodiment of acompression engine logic800 contained in thecore logic120 for compressing a transmitteddata stream805. Thecompression engine800 is part of thecore logic120 ofFIG. 1. Thedata stream805 is received by ahash logic810 and a cyclic redundancy check (CRC)logic820, which adds data to the end of the data segment for error detection, before passing the additional data to thehash logic810. Thehash logic810 then performs a hash function on the data stream in three bytes chunks, as described below, on portions of thedata stream805. Thehash logic810 then passes the hash values to both a writehash table logic830, which adds those values to ahash table memory850, and to a readhash table logic840, which attempts to match three-byte strings to previously identified three-byte strings. The data stream then passes to theencoder logic860, which insertsmatch code words700,710, or720 andliteral code words600 or610 as appropriate, before putting the compressed data stream in aFIFO buffer870. Compressed data is then pulled from theFIFO buffer870 intoRAM880 for transmission over thelink160. As indicated by the dashed line inFIG. 8, when no compression is to be performed, for example when the network is uncongested, then the ReadHash Table logic840 is bypassed, but the hashing and writing to the hash table functions are still performed. Thehash logic810, the readhash table logic840, the writehash table logic830, and the hash table memory comprise a matching logic for thecompression engine800.
In another embodiment, thecompression engine800 illustrated inFIG. 9 additionally uses ahistory buffer FIFO900 as part of the matching logic in the compression of the data. Thehistory buffer900 allows potentially better compression of thedata stream805, as described below.
Consider the following data stream:
{ABCDEFGHIJKLEFGPQABCDEFGHIJXYZ}
The compression achieved in the embodiment ofFIG. 8, without thehistory buffer900, is
{A B C D E F G H I J K L [8,3] P Q [17,4] [9,3] [17,3] XYZ}
The compression achieved in the embodiment ofFIG. 9, withhistory buffer900, is
{A B C D E F G H I J K L [8,3] P Q [17,10] XYZ}
The three match code words of theFIG. 8 embodiment ([17,4], [9,3], [17,3]) are combined in theFIG. 9 embodiment into a single match code word [17,10]. In one embodiment, where all of the match code words in theFIG. 8 embodiment are 2-byte match code words as discussed above regardingFIG. 7, even if the match code word of theFIG. 9 embodiment is a 3-byte match code word, the resulting compressed string is shorter by three bytes.
Thus, in one embodiment, thenetwork switch100 can adapt the compression technique by enabling or disabling the use of thehistory buffer900 to vary the compression ratio responsive to a quantized level of network congestion. If thehistory buffer900 is enabled, in a further embodiment, the compression technique can adaptively select all or a portion of the history buffer as available for use. By using less than the entire history buffer, the compression technique lessens the potential for a match, because less of the input data is available for matching. Therefore, at a first congestion level, a first portion of theavailable history buffer900 can be used, and at a second congestion level, a second portion of theavailable history buffer900 can be used. If the first portion is smaller than the second portion, then less compression is potentially achievable. In such an embodiment, the higher the congestion level, the greater the portion of the available history buffer that is used, to increase the potential for compression of the data stream.
A further embodiment of theencoding logic860 can increase the compression at even higher congestion levels by enabling an entropy encoding, such as Huffman encoding, to compress the data stream further, including the literal and match code words inserted by theencoder logic860.
If the compression encoding described above is repeated, the compression ratio is also increased in a typical data stream. In one embodiment thecompression engine800 can perform multiple iterations of the compression encoding described above responsive to the quantized level of network congestion, thus further varying the compression ratio responsive to that quantized level of network congestion.
FIG. 10 is a timing chart illustrating how one embodiment of the compression technique overlaps processing to improve latency of the compression. InCycle1, the data stream is hashed by thehash logic810 and a CRC is calculated by theCRC logic820. Then incycle2, theRead Hash logic840 reads the hashes computed from the hash table and determines whether any matches are found inCycle3. Also inCycle3, theWrite Hash logic830 writes the hash values calculated inCycle1 to the hash table850, and in the embodiment with thehistory FIFO buffer900, writes the corresponding uncompressed data to thehistory buffer900. Then inCycle4, the Encoder logic generates the necessary code words and inserts them into thetransmission FIFO870. Finally, in Cycle n, when theFIFO870 has accumulated 8 bytes of output data, it drives the output data intoRAM880 for transmission onlink160.
We now turn to the hashing technique in more detail. In one embodiment, at the beginning of a data stream, a location pointer into the history buffer is initialized to 0, and then incremented per byte of successive data in thedata stream805. For the current byte being processed, a hash over that byte and the next 2 bytes is calculated. Alternately, the hash value can be calculated over the current byte and the two previous bytes in thedata stream805. The hash is used as an address to thehash table memory850, and the location pointer along with a valid bit is written as a hash table entry. For every subsequent byte, the hash location in thehash table memory850 is read to determine whether that hash was seen before in the same data stream. If it was, then the location pointer into thehistory buffer900 is used as the address into thehistory buffer900 to read the contents at that location. If the data stored at that location matches the current pattern, then the difference between the current location pointer and the stored location pointer is calculated as the distance vector. Since the hash is calculated over 3 bytes of data, the initial length will be 3. The hash is calculated over 3 bytes because the distance and length vectors typically take 2 bytes to represent, so any repeating pattern oflength 2 bytes typically does not result in any compression.
Once a match is found in thehistory buffer900, the subsequent locations of thehistory buffer900 are read to see if the match extends beyond 3 bytes. If it does, then the length vector is incremented. If it does not, then the hash is calculated and checked again as before.
EXAMPLESExample 1a b c d e f g x y z a b c l m n
The contents of the history buffer and hash table after 10 bytes is:
history buffer: a b c d e f g x y z hash table:
H{abc}=v,0
H{bcd}=v,1
H{cde}=v,2
H{def}=v,3
H{efg}=v,4
H{fgx}=v,5
H{gxy}=v,6
H{xyz}=v,7
H{yza}=v,8
H{zab}=v,9
where v is a valid bit and is set to indicate a valid hash location. In the next cycle, the hash H{abc} is calculated and then that location is read.
The data at location (v,0) indicates that the same hash occurred before atlocation pointer 0. Therefore, thehistory buffer location 0 is read. Since the first byte matches (a), the next two locations are read, finding (b,c) which also match. Meanwhile, the hash table is updated with the new entries. Now, whenbyte1 occurs in the data stream, the hash H{bcl} location indicates not valid, therefore the match is only a 3 byte match, and the distance is Current byte position−0=Current byte position. In another embodiment, such as the embodiment ofFIG. 8, where no history buffer is available, instead of a pointer to the history buffer, the hash table entry can contain a copy of at least a portion of the data used to calculate the hash value, which is used for matching the incoming data string.
Considering the logic blocks ofFIGS. 8 and 9 in more detail, thehash logic810 in one embodiment takes a window of 8-byte chunks of input data and computes a 10-bit hash value for each 3 bytes of input data, resulting in 8 hash codes in a single clock cycle. As illustrated inFIG. 11, thehash logic810 in one embodiment computes a hash value based on the current byte and the previous two bytes. Thus for example, if the data received inClock 0 is {x, y, z, -, -, -, -, -} (where the “-” indicates bytes of no interest, and the data received inClock 1 is {h, g, f, h, d, c, b, a}, then, the 8 hashes computed inclock 1 are as follows:
hash0=Fn(a, x, y}
hash1=Fn(b, a, x}
hash2=Fn(c, b, a}
hash3=Fn(d, c, b}
hash4=Fn(e, d, c}
hash5=Fn(f, e, d}
hash6=Fn(g, f, e}
hash7=Fn(h, g, f}
The hashes computed are passed to the ReadHash Table logic830 and the WriteHash Table logic820 along with a valid indication. At the end of the packet (EOP), a CRC is appended by theCRC logic820 and a hash is computed on the CRC as if it were part of the originalinput data stream805.
If the data EOP does not align with a 64-bit boundary, the CRC appending is done slightly differently. For example, if the data received in the last clock before the EOP is {-, -, -, -, w, x, y, z}, then the CRC is computed in the same clock from the previous 8 bytes {a, b, c, d, e, f, g, h}. Then the data for hash computation in the last 2 clocks from theHash logic810 would be:
{a, b, c, d, w, x, y, z}
{a, b, c, d, e, f, g, h}
The hash valid bits will be sent accordingly to ReadHash Table logic840 and WriteHash Table logic830. Once the EOP is received and after the CRC hash is sent, theHash logic810 will generate a reset pulse to reset the valid bits in the hash tables850. The EOP given out from thehash logic810 will thus include the CRC/Parity bytes.
Thehash logic810 informs the ReadHash Table logic840 about the last data and bytes valid.
In one embodiment, thehistory buffer900 is a 64-bit wide 512-entry buffer. Each hash table comprises eight 27-bit wide entries, and thehash table memory850 can store1024 hash tables. Thehash logic810 maintains a 12-bit counter to count 4 KB blocks and to compute the History Buffer Address and reset theHash Table memory850 at 4 KB boundaries.
In an embodiment in which nohistory buffer900 is available or in which the history buffer has been disabled responsive to quantized level of network congestion, theHash logic810 behaves the same as in embodiments in which thehistory buffer900 is available and enabled. In that embodiment, theHash Write logic830 writes an index into the hash table entries that simulates an offset into the history buffer. When determining the length of the matched string, theread hash logic840 uses the index values themselves to determine the matched length. As each new byte is processed, if the value in the hash table entry is either invalid (indicating a match has never been seen) or is not incremented by 1 from the previous hash table entry, then the match terminates. Thus, in an embodiment such as is illustrated inFIG. 8, with no history buffer, the previous example uncompressed data stream:
{ABCDEFGHIJKLEFGPQABCDEFGHIJXYZ}
matches the second occurrence of the string “EFG,” replacing it with {8,3}, but because the hash table entries for “EFG” are overwritten to point to the most recent occurrence of that string, the hash table value for “EFG” is not incremented sequentially from the hash table entry for “DEF,” and will limit the match to “ABCD,” resulting in the compression
{A B C D E F G H I J K L [8,3] P Q [17,4] [9,3] [17,3] XYZ}
as described above. Thus, the embodiment ofFIG. 9 can achieve a higher compression ratio than the embodiment ofFIG. 8.
In one embodiment, the hash function is illustrated inFIG. 14 and XORs the 24 bits of the input data to produce a 10-bit hash value as follows:
H0=D0^D8^D16
H1=D1^D9
H2=D2^D17
H3=D3^D10^D18
H4=D4^D11
H5=D12^D19
H6=D5^D13^D20
H7=D6^D21
H8=D14^D22
H9=D7^D15^D23
The 10-bit result is then used as an index into the hash tables850.
TheCRC logic820 in one embodiment calculates parity of all the incoming bits and gives the result to HASHlogic810 in the next clock cycle, as follows:
crc_new=crc_old^data_in
In another embodiment, theCRC block820 generates the 4-byte CRC used for Ethernet MAC packets.
Thewrite hash logic830 according to one embodiment registers the data, hash, and valid bits to adjust the latency as shown inFIG. 13.
Each of the 8 hash values are used to create a hash table entry. In one embodiment, only 14 bits of the 24 data bits used for hashing are extracted to be written into the respective hash table of thehash table memory850 indicated by the 10-bit value calculated by thehash logic810. The write hash logic then concatenates the 14 data bits, the 12-bit address into thehistory buffer900 of the current three input data bytes, and a 1-bit valid field, writing the 27-bit entry intoHash Table memory850, for each of the 8 hash values calculated by thehash logic810. Thewrite hash logic820 then passes the 8 input data bytes to Encoder logic for further processing.
At EOP thewrite hash logic820 writes less than 8 hash table entries, only writing into some hash tables based on the data bytes valid signal.
Theread hash logic840 receives the hashes computed by thehash logic810. Each of the 8 hash values received is checked for a match in the hash tables ofhash table memory850. When a match is found, theread hash logic840 indicates that toEncoder logic860, indicating the number of bytes matched. The minimum match length needs to be 3 bytes. As illustrated inFIG. 12, at most three non-overlapping matches can be found in a single clock cycle in the 8 bytes that were hashed by thehash logic810, including a match with the last two bytes of the previous clock cycle. Theread hash logic840 can indicate all three matches to theencoder logic860.
The above description has been written with a window size of 8 bytes of data for hashing and encoding. The window size is illustrative and by way of example only, and other window sizes can be used. In one embodiment, thecompression engine800 can vary the window size to vary the compression ratio responsive to the congestion level of thenetwork300. In such an embodiment, thehash logic810, the writehash table logic830, the readhash table logic840, and theencoder860 can be configured to select a window size of less than 8 bytes for hashing the data stream and reading and writing the hash tables850. Where the window size is less than the full 8 bytes, the writehash table logic830 inserts fewer than 8 hash table entries and the readhash table logic840 reads fewer than 8 hash table entries from the hash tables850.
In an embodiment such as the embodiment ofFIG. 8, with nohistory buffer900, the match is done with hashes and the corresponding input data. In an embodiment such as the embodiment ofFIG. 9, where ahistory buffer900 is available, the match is identified by theread hash logic840 and data matching but the length of the match is determined by reading data from thehistory buffer900.
In one embodiment, thenetwork switch100 can select whether to use thehistory buffer900 depending on an indication received by thecore logic120 of congestion of thenetwork300, enabling use of thehistory buffer900 at higher congestion levels, and disabling thehistory buffer900 at lower congestion levels. In a further embodiment, thecompression engine800 can adaptively select how much of theavailable history buffer900 to use, based on the congestion level. By varying the effective size of the history buffer, thecompression engine800 can vary the compression ratio of the data stream. A larger effective size of thehistory buffer900 allows an increased potential for compressing of the data stream. In such an embodiment, if the effective size of thehistory buffer900 is decreased, then hash table entries in the hash tables850 that point to locations no longer in the effective size of the history buffer are marked as invalid.
TheEncoder logic860 is responsible for the insertion of literal and match code words into the output data stream, replacing the compressed input data when inserting the match code words. Theencoder logic860 writes the uncompressed data received from the writehash table logic830 8 bytes at a time into a 4 KBprivate RAM865 with a gap for a literal code word at the beginning. Once the first match is found, the length of the uncompressed string is written into a Literal code word in that gap. Once a match indication is received from the ReadHash Table logic840, theencoder logic860 stops writing the data received from the Write Hash Table logic until the {dist, length} vector is specified from the ReadHash Table logic840. Then theencoder logic860 updates the match code into theprivate RAM865. The encoder logic also rotates and adjusts the input data to fit correctly after any of the match code words if the match length is not a multiple of 8 bytes.
In one embodiment, theencoder logic860 can selectively enable or disable an entropy encoding technique, such as Huffman encoding, selectively increasing or decreasing, respectively, compression of the output data stream, including the code words described above, responsive to the network congestion level.
TheFIFO logic870 orients the data back into a 64-byte buffer and in one embodiment is implemented as a multi-byte input and 8-byte output FIFO. The FIFO buffer used by theFIFO logic870 has a depth of 16 bytes. TheFIFO logic870 waits for 8 bytes to accumulate, then drives the data into theRAM logic880.
TheRAM logic880 stores the output data until the first match is received. Once the first match is received and the match code is updated, data is ready to be sent out. If no match is found, then the RAM logic sends out the data when 4 KB of data (the maximum length of a segment) or EOP is received.
For example, if thenetwork switch100 is notified by a QCN congestion message that network congestion has decreased, thecompression engine800 can restrict the window size to 2 bytes, resulting in a lower level of compression. Similarly, if thenetwork switch100 is notified by a QCN congestion message that network congestion has increased, thecompression engine800 can increase the window size from 2 to a larger window size, depending on the amount of increase of congestion, to attempt to achieve a higher level of compression.
Turning back toFIG. 3, we now consider the decompression by thenetwork switch310 of the data compressed by thenetwork switch100. As with the compression, the decompression should be done at line rate in hardware. Preferably, the decompression should be implemented in a pipelined fashion and the latency of the logic should be kept at a minimum. Preferably, the decompression logic is capable of a 40 Gb rate, but 10 Gb rate can be acceptable in some embodiments. In one embodiment, the decompression logic supports a bypass mode in which the input data can be passed directly to output with no or minimal latency when the input data is not compressed. If the compression technique used by thenetwork switch100 included a CRC data, then the CRC added by compression should be stripped off. If a CRC error is found, then the decompression logic should indicate the same.
In one embodiment, adecompression logic1500 is illustrated inFIG. 15. A multi-byte output block (MBO)1510 receives 8 bytes of input data across a 64-bit data interface into a FIFO buffer. The output from theMBO1510 is an 8-byte wide output interface, with a signal indicating how many of the 8 bytes are valid. TheMBO1510 aligns the input data to an 8-byte boundary based on the different byte reads requested.
Ahistory buffer1520 allows for processing the match codes in the data stream. In one embodiment, thehistory buffer1520 has a capacity of 4 KB, which matches the maximum block size in compression encoder. The data width of thebuffer1520 is 64 bits with a buffer depth of 512 entries.
A Multi Byte Input FIFO (MBI)1540 buffers output from thedecompression logic1500. TheMBI1540 provides 1 to 8 bytes of data with a signal indicating the number of valid bytes. Whenever there is 8 bytes of data available, the MBI sends it out on its output interface. If less than 8 bytes are available and the data is EOP, then the data is sent out with an EOP signal and the number of bytes specified. The MBI can accept data input with various byte widths and aligns them into 8-byte boundary.
Thedecoder1530 is the main decompression engine of thedecompression logic1500. Thedecoder1530 is responsible for stripping of literal and match code words. After stripping a literal code word, the following data is written to theMBI1540 and into thehistory buffer1520. The decoder in one embodiment is implemented as a state machine, as described in more detail below. Based on the literal code, the state machine of thedecoder1530 first reads only data bytes from theMBO1510 and writes them into thehistory buffer1520 and theMBI1540. After that, it pre-fetches the match code and reads the whole match code. Based on the match code, data is read from thehistory buffer1520 and written back in the new location of thehistory buffer1520 and to theMBI1540. After all the data for a code word is process, the decoder reads the next match or literal code word and repeats the above steps until an EOP indication is received. On EOP, the last set of data bytes are written intoMBI1540 and indicated as an EOP. The Start of Segment and End ofSegment code words500 and510 are only for decoder's internal consumption. The End ofSegment code word510 should be used to indicate EOP if the current section is the last section of the packet.
Thedecompression logic1500 does not need to adapt to the level of McLZO compression performed by thecompression logic800. In some embodiments, where an entropy encoding such as Huffman encoding is used for additional compression at some congestion levels, thedecompression logic1500 can detect the presence of entropy encoding and decode the entropy encoded data stream prior to decompressing the McLZO-compressed data. In embodiments were a calculated entropy encoding library is used, thedecompression logic1500 can recognize the calculated library that is sent from thecompression engine800 of thenetwork switch100 and use that calculated library for decoding. A bypass mode can be provided that bypasses decompression when the input data stream is not compressed.
TheCRC logic1550 calculates the CRC over the 64-bit data received from theMBI1540 and indicates with an error signal if the CRC/Parity does not match.
In another embodiment, illustrated inFIG. 16, instead of thedecoder1530 writing data to theMBI1540, anoutput block1600 pulls the output data from thehistory buffer1520.
In bothFIGS. 15 and 16, the dashed line indicates the data path when no match code words are found in the input data stream.
FIG. 17 is a higher-level view of thedecompression logic1500 as asingle logic block1700. The data_in line provides 8-byte wide input data to thedecompression logic block1700, which decompresses the data and outputs the decompressed data on the data_out line. A bytes_valid_in line indicates how many of the 8 input bytes are valid, and a bytes_valid line indicates how many of the 8 output bytes are valid. Other lines provide other useful signals to and from thedecompression logic block1700, including EOP.
FIG. 18 illustrates astate machine1800 for thedecoder1520 according to one embodiment.
Thestate machine1800 begins in the IDLE state1840. In this state, the state machine waits for theMBO1510 to de-assert the empty signal. Once the empty signal is de-asserted, the state machine moves to ReadSOS state1810.
In the ReadSOS state1810, the decoder reads and analyzes a Start of Segment code. Based on the Start of Segment code, the length of the uncompressed data to be read is determined. Thestate machine1800 then transits to theReadUData state1820.
In theReadUData state1820, the uncompressed data is read from theMBO1510 and written to theMBI1540 and theHistory buffer1530 simultaneously, if in the embodiment ofFIG. 15; or just to theHistory buffer1530, if in the embodiment ofFIG. 16. Based on the number of bytes read, the read data counter is decremented. Once, the data count reaches zero, the state changes to theReadCode state1830.
TheReadCode state1830 reads the next code from theMBO1510. The code can be a match, literal, or End of Segment code word. If the code is a literal code, the state changes to theReadUData1820. If the code is a match code, then the state changes to theReadHData stage1850; if an end of segment code, thedecoder1530 moves to the IDLE state1840. If the code is a match code, the distance and length counters are updated.
In theReadHData state1850, the address of theHistory buffer1520 is calculated and data is read fromHistory buffer1520 and pushed to theMBI1540. The counter for length is decremented while the data transfer is happening. The data is also written back intoHistory buffer1520 at the current location. Once the length counter reaches zero, the state changes to theReadCode state1830.
In conclusion, the described techniques allow anetwork switch100 to compress an output data stream adaptively based on indications of network congestion. These indications can include simple pause notifications or more complex congestion notifications that indicate a quantized level of congestion, allowing finer control over the response of thenetwork switch100.
In embodiments where the indications are simple pause flow control messages, the compression can be enabled upon receipt of the pause message and disabled when the flow control mechanism allows data to begin to flow across the network.
In embodiments where multiple levels of compression can be indicated in congestion messages, when the congestion level increases, the congestion technique can be adapted to increase compression. When the congestion level decreases, the congestion technique can be adapted to decrease compression, or even disable compression altogether. The adaptive compression technique can thus achieve a lower end-to-end latency across the network.
At the receiving network switch, the decompression technique can generally decompress the compressed data stream at line speed.
While certain exemplary embodiments have been described in details and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not devised without departing from the basic scope thereof, which is determined by the claims that follow.