Movatterモバイル変換

[RFC Home] [TEXT|PDF|HTML] [Tracker] [IPR] [Info page]

Obsoleted by:7805 HISTORIC

Network Working Group                                  John NagleRequest For Comments:  896                         6 January 1984                    Ford Aerospace and Communications Corporation           Congestion Control in IP/TCP InternetworksThis memo discusses some aspects of congestion control in  IP/TCPInternetworks.   It  is intended to stimulate thought and furtherdiscussion of this topic.   While some specific  suggestions  aremade for improved congestion  control  implementation,  this memodoes not specify any standards.                          IntroductionCongestion control is a recognized problem in  complex  networks.We have discovered that the Department of Defense's Internet Pro-tocol (IP) , a pure datagram protocol, and  Transmission  ControlProtocol  (TCP),  a transport layer protocol, when used together,are subject to unusual congestion problems caused by interactionsbetween  the  transport  and  datagram layers.  In particular, IPgateways are vulnerable to a phenomenon we call  "congestion col-lapse",  especially when such gateways connect networks of widelydifferent bandwidth.  We have developed  solutions  that  preventcongestion collapse.These problems are not generally recognized because these  proto-cols  are used most often on networks built on top of ARPANET IMPtechnology.  ARPANET IMP based networks traditionally  have  uni-form  bandwidth and identical switching nodes, and are sized withsubstantial excess capacity.  This excess capacity, and the abil-ity  of the IMP system to throttle the transmissions of hosts hasfor most IP / TCP hosts and  networks  been  adequate  to  handlecongestion.  With the recent split of the ARPANET into two inter-connected networks and the growth of other networks with  differ-ing properties connected to the ARPANET, however, reliance on thebenign properties of the IMP system is no longer enough to  allowhosts  to  communicate rapidly and reliably. Improved handling ofcongestion is now  mandatory  for  successful  network  operationunder load.Ford Aerospace and Communications  Corporation,  and  its  parentcompany,  Ford  Motor  Company,  operate  the only private IP/TCPlong-haul network in existence today.  This network connects fourfacilities  (one  in Michigan, two in California, and one in Eng-land) some with extensive local networks.  This net is cross-tiedto  the  ARPANET  but  uses  its  own long-haul circuits; trafficbetween Ford  facilities  flows  over  private  leased  circuits,including  a  leased  transatlantic  satellite  connection.   Allswitching nodes are pure IP datagram switches  with  no  node-to-node  flow  control, and all hosts run software either written orheavily modified by Ford or Ford Aerospace.  Bandwidth  of  linksin  this  network varies widely, from 1200 to 10,000,000 bits persecond.  In general, we have not been able to afford  the  luxuryof excess long-haul bandwidth that the ARPANET possesses, and ourlong-haul links are heavily loaded during peak periods.   Transittimes of several seconds are thus common in our network.

RFC 896 Congestion Control in IP/TCP Internetworks 1/6/84Because of our pure datagram orientation, heavy loading, and widevariation in bandwidth, we have had to solve problems that theARPANET / MILNET community is just beginning to recognize. Ournetwork is sensitive to suboptimal behavior by host TCP implemen-tations, both on and off our own net. We have devoted consider-able effort to examining TCP behavior under various conditions,and have solved some widely prevalent problems with TCP. Wepresent here two problems and their solutions. Many TCP imple-mentations have these problems; if throughput is worse through anARPANET / MILNET gateway for a given TCP implementation thanthroughput across a single net, there is a high probability thatthe TCP implementation has one or both of these problems. Congestion collapseBefore we proceed with a discussion of the two specific problemsand their solutions, a description of what happens when theseproblems are not addressed is in order. In heavily loaded puredatagram networks with end to end retransmission, as switchingnodes become congested, the round trip time through the netincreases and the count of datagrams in transit within the netalso increases. This is normal behavior under load. As long asthere is only one copy of each datagram in transit, congestion isunder control. Once retransmission of datagrams not yetdelivered begins, there is potential for serious trouble.Host TCP implementations are expected to retransmit packetsseveral times at increasing time intervals until some upper limiton the retransmit interval is reached. Normally, this mechanismis enough to prevent serious congestion problems. Even with thebetter adaptive host retransmission algorithms, though, a suddenload on the net can cause the round-trip time to rise faster thanthe sending hosts measurements of round-trip time can be updated.Such a load occurs when a new bulk transfer, such a filetransfer, begins and starts filling a large window. Should theround-trip time exceed the maximum retransmission interval forany host, that host will begin to introduce more and more copiesof the same datagrams into the net. The network is now in seri-ous trouble. Eventually all available buffers in the switchingnodes will be full and packets must be dropped. The round-triptime for packets that are delivered is now at its maximum. Hostsare sending each packet several times, and eventually some copyof each packet arrives at its destination. This is congestioncollapse.This condition is stable. Once the saturation point has beenreached, if the algorithm for selecting packets to be dropped isfair, the network will continue to operate in a degraded condi-tion. In this condition every packet is being transmittedseveral times and throughput is reduced to a small fraction ofnormal. We have pushed our network into this condition experi-mentally and observed its stability. It is possible for round-trip time to become so large that connections are broken because

RFC 896 Congestion Control in IP/TCP Internetworks 1/6/84the hosts involved time out.Congestion collapse and pathological congestion are not normallyseen in the ARPANET / MILNET system because these networks havesubstantial excess capacity. Where connections do not passthrough IP gateways, the IMP-to host flow control mechanisms usu-ally prevent congestion collapse, especially since TCP implemen-tations tend to be well adjusted for the time constants associ-ated with the pure ARPANET case. However, other than ICMP SourceQuench messages, nothing fundamentally prevents congestion col-lapse when TCP is run over the ARPANET / MILNET and packets arebeing dropped at gateways. Worth noting is that a few badly-behaved hosts can by themselves congest the gateways and preventother hosts from passing traffic. We have observed this problemrepeatedly with certain hosts (with whose administrators we havecommunicated privately) on the ARPANET.Adding additional memory to the gateways will not solve the prob-lem. The more memory added, the longer round-trip times mustbecome before packets are dropped. Thus, the onset of congestioncollapse will be delayed but when collapse occurs an even largerfraction of the packets in the net will be duplicates andthroughput will be even worse. The two problemsTwo key problems with the engineering of TCP implementations havebeen observed; we call these the small-packet problem and thesource-quench problem. The second is being addressed by severalimplementors; the first is generally believed (incorrectly) to besolved. We have discovered that once the small-packet problemhas been solved, the source-quench problem becomes much moretractable. We thus present the small-packet problem and oursolution to it first. The small-packet problemThere is a special problem associated with small packets. WhenTCP is used for the transmission of single-character messagesoriginating at a keyboard, the typical result is that 41 bytepackets (one byte of data, 40 bytes of header) are transmittedfor each byte of useful data. This 4000% overhead is annoyingbut tolerable on lightly loaded networks. On heavily loaded net-works, however, the congestion resulting from this overhead canresult in lost datagrams and retransmissions, as well as exces-sive propagation time caused by congestion in switching nodes andgateways. In practice, throughput may drop so low that TCP con-nections are aborted.This classic problem is well-known and was first addressed in theTymnet network in the late 1960s. The solution used there was toimpose a limit on the count of datagrams generated per unit time.This limit was enforced by delaying transmission of small packets

RFC 896 Congestion Control in IP/TCP Internetworks 1/6/84until a short (200-500ms) time had elapsed, in hope that anothercharacter or two would become available for addition to the samepacket before the timer ran out. An additional feature toenhance user acceptability was to inhibit the time delay when acontrol character, such as a carriage return, was received.This technique has been used in NCP Telnet, X.25 PADs, and TCPTelnet. It has the advantage of being well-understood, and is nottoo difficult to implement. Its flaw is that it is hard to comeup with a time limit that will satisfy everyone. A time limitshort enough to provide highly responsive service over a 10M bitsper second Ethernet will be too short to prevent congestion col-lapse over a heavily loaded net with a five second round-triptime; and conversely, a time limit long enough to handle theheavily loaded net will produce frustrated users on the Ethernet. The solution to the small-packet problemClearly an adaptive approach is desirable. One would expect aproposal for an adaptive inter-packet time limit based on theround-trip delay observed by TCP. While such a mechanism couldcertainly be implemented, it is unnecessary. A simple andelegant solution has been discovered.The solution is to inhibit the sending of new TCP segments whennew outgoing data arrives from the user if any previouslytransmitted data on the connection remains unacknowledged. Thisinhibition is to be unconditional; no timers, tests for size ofdata received, or other conditions are required. Implementationtypically requires one or two lines inside a TCP program.At first glance, this solution seems to imply drastic changes inthe behavior of TCP. This is not so. It all works out right inthe end. Let us see why this is so.When a user process writes to a TCP connection, TCP receives somedata. It may hold that data for future sending or may send apacket immediately. If it refrains from sending now, it willtypically send the data later when an incoming packet arrives andchanges the state of the system. The state changes in one of twoways; the incoming packet acknowledges old data the distant hosthas received, or announces the availability of buffer space inthe distant host for new data. (This last is referred to as"updating the window"). Each time data arrives on a connec-tion, TCP must reexamine its current state and perhaps send somepackets out. Thus, when we omit sending data on arrival from theuser, we are simply deferring its transmission until the nextmessage arrives from the distant host. A message must alwaysarrive soon unless the connection was previously idle or communi-cations with the other end have been lost. In the first case,the idle connection, our scheme will result in a packet beingsent whenever the user writes to the TCP connection. Thus we donot deadlock in the idle condition. In the second case, where

RFC 896 Congestion Control in IP/TCP Internetworks 1/6/84the distant host has failed, sending more data is futile anyway.Note that we have done nothing to inhibit normal TCP retransmis-sion logic, so lost messages are not a problem.Examination of the behavior of this scheme under various condi-tions demonstrates that the scheme does work in all cases. Thefirst case to examine is the one we wanted to solve, that of thecharacter-oriented Telnet connection. Let us suppose that theuser is sending TCP a new character every 200ms, and that theconnection is via an Ethernet with a round-trip time includingsoftware processing of 50ms. Without any mechanism to preventsmall-packet congestion, one packet will be sent for each charac-ter, and response will be optimal. Overhead will be 4000%, butthis is acceptable on an Ethernet. The classic timer scheme,with a limit of 2 packets per second, will cause two or threecharacters to be sent per packet. Response will thus be degradedeven though on a high-bandwidth Ethernet this is unnecessary.Overhead will drop to 1500%, but on an Ethernet this is a badtradeoff. With our scheme, every character the user types willfind TCP with an idle connection, and the character will be sentat once, just as in the no-control case. The user will see novisible delay. Thus, our scheme performs as well as the no-control scheme and provides better responsiveness than the timerscheme.The second case to examine is the same Telnet test but over along-haul link with a 5-second round trip time. Without anymechanism to prevent small-packet congestion, 25 new packetswould be sent in 5 seconds.* Overhead here is 4000%. With theclassic timer scheme, and the same limit of 2 packets per second,there would still be 10 packets outstanding and contributing tocongestion. Round-trip time will not be improved by sending manypackets, of course; in general it will be worse since the packetswill contend for line time. Overhead now drops to 1500%. Withour scheme, however, the first character from the user would findan idle TCP connection and would be sent immediately. The next24 characters, arriving from the user at 200msintervals, wouldbe held pending a message from the distant host. When an ACKarrived for the first packet at the end of 5 seconds, a singlepacket with the 24 queued characters would be sent. Our schemethus results in an overhead reduction to 320% with no penalty inresponse time. Response time will usually be improved with ourscheme because packet overhead is reduced, here by a factor of4.7 over the classic timer scheme.Congestion will be reduced bythis factor and round-trip delay will decrease sharply. For this________ * This problem is not seen in the pure ARPANET case because the IMPs will block the host when the count of packets outstanding becomes excessive, but in the case where a pure datagram local net (such as an Ethernet) or a pure datagram gateway (such as an ARPANET / MILNET gateway) is involved, it is possible to have large numbers of tiny packets outstanding.

RFC 896 Congestion Control in IP/TCP Internetworks 1/6/84case, our scheme has a striking advantage over either of theother approaches.We use our scheme for all TCP connections, not just Telnet con-nections. Let us see what happens for a file transfer data con-nection using our technique. The two extreme cases will again beconsidered.As before, we first consider the Ethernet case. The user is nowwriting data to TCP in 512 byte blocks as fast as TCP will acceptthem. The user's first write to TCP will start things going; ourfirst datagram will be 512+40 bytes or 552 bytes long. Theuser's second write to TCP will not cause a send but will causethe block to be buffered. Assume that the user fills up TCP'soutgoing buffer area before the first ACK comes back. Then whenthe ACK comes in, all queued data up to the window size will besent. From then on, the window will be kept full, as each ACKinitiates a sending cycle and queued data is sent out. Thus,after a one round-trip time initial period when only one block issent, our scheme settles down into a maximum-throughput condi-tion. The delay in startup is only 50ms on the Ethernet, so thestartup transient is insignificant. All three schemes provideequivalent performance for this case.Finally, let us look at a file transfer over the 5-second roundtrip time connection. Again, only one packet will be sent untilthe first ACK comes back; the window will then be filled and keptfull. Since the round-trip time is 5 seconds, only 512 bytes ofdata are transmitted in the first 5 seconds. Assuming a 2K win-dow, once the first ACK comes in, 2K of data will be sent and asteady rate of 2K per 5 seconds will be maintained thereafter.Only for this case is our scheme inferior to the timer scheme,and the difference is only in the startup transient; steady-statethroughput is identical. The naive scheme and the timer schemewould both take 250 seconds to transmit a 100K byte file underthe above conditions and our scheme would take 254 seconds, adifference of 1.6%.Thus, for all cases examined, our scheme provides at least 98% ofthe performance of both other schemes, and provides a dramaticimprovement in Telnet performance over paths with long round triptimes. We use our scheme in the Ford Aerospace SoftwareEngineering Network, and are able to run screen editors over Eth-ernet and talk to distant TOPS-20 hosts with improved performancein both cases. Congestion control with ICMPHaving solved the small-packet congestion problem and with it theproblem of excessive small-packet congestion within our own net-work, we turned our attention to the problem of general conges-tion control. Since our own network is pure datagram with nonode-to-node flow control, the only mechanism available to us

RFC 896    Congestion Control in IP/TCP Internetworks      1/6/84under  the  IP standard was the ICMP Source Quench message.  Withcareful handling,  we  find  this  adequate  to  prevent  seriouscongestion problems.  We do find it necessary to be careful aboutthe behavior of our hosts and switching  nodes  regarding  SourceQuench messages.               When to send an ICMP Source QuenchThe present ICMP standard* specifies that an ICMP  Source  Quenchmessage  should  be  sent whenever a packet is dropped, and addi-tionally may be sent when a gateway finds itself  becoming  shortof  resources.   There is some ambiguity here but clearly it is aviolation of the standard to drop a  packet  without  sending  anICMP message.Our basic assumption is that packets ought not to be dropped dur-ing  normal  network  operation.   We  therefore want to throttlesenders back before they overload switching nodes  and  gateways.All  our  switching  nodes  send ICMP Source Quench messages wellbefore buffer space is exhausted; they do not wait  until  it  isnecessary to drop a message before sending an ICMP Source Quench.As demonstrated in our  analysis  of  the  small-packet  problem,merely  providing  large  amounts of buffering is not a solution.In general, our experience is that Source Quench should  be  sentwhen  about  half  the  buffering space is exhausted; this is notbased on extensive experimentation but appears to be a reasonableengineering  decision.   One  could  argue for an adaptive schemethat adjusted the quench generation  threshold  based  on  recentexperience; we have not found this necessary as yet.There exist other gateway implementations  that  generate  SourceQuenches  only after more than one packet has been discarded.  Weconsider this approach undesirable since any system for  control-ling congestion based on the discarding of packets is wasteful ofbandwidth and may be susceptible  to  congestion  collapse  underheavy  load.   Our understanding is that the decision to generateSource Quenches with great reluctance stems from a fear that ack-nowledge  traffic  will  be quenched and that this will result inconnection failure.  As will be shown below, appropriate handlingof  Source  Quench in host implementations eliminates this possi-bility.        What to do when an ICMP Source Quench is receivedWe inform TCP or any other  protocol  at  that  layer  when  ICMPreceives  a Source Quench.  The basic action of our TCP implemen-tations is to reduce the amount of data  outstanding  on  connec-tions to the host mentioned in the Source Quench. This control is________  * ARPANETRFC 792 is the present standard.  We are advised by    the Defense Communications Agency that the description of    ICMP in MIL-STD-1777 is incomplete and will be deleted from    future revision of that standard.

RFC 896 Congestion Control in IP/TCP Internetworks 1/6/84applied by causing the sending TCP to behave as if the distanthost's window size has been reduced. Our first implementationwas simplistic but effective; once a Source Quench has beenreceived our TCP behaves as if the window size is zero wheneverthe window isn't empty. This behavior continues until somenumber (at present 10) of ACKs have been received, at that timeTCP returns to normal operation.* David Mills of Linkabit Cor-poration has since implemented a similar but more elaboratethrottle on the count of outstanding packets in his DCN systems.The additional sophistication seems to produce a modest gain inthroughput, but we have not made formal tests. Both implementa-tions effectively prevent congestion collapse in switching nodes.Source Quench thus has the effect of limiting the connection to alimited number (perhaps one) of outstanding messages. Thus, com-munication can continue but at a reduced rate, that is exactlythe effect desired.This scheme has the important property that Source Quench doesn'tinhibit the sending of acknowledges or retransmissions. Imple-mentations of Source Quench entirely within the IP layer are usu-ally unsuccessful because IP lacks enough information to throttlea connection properly. Holding back acknowledges tends to pro-duce retransmissions and thus unnecessary traffic. Holding backretransmissions may cause loss of a connection by a retransmis-sion timeout. Our scheme will keep connections alive undersevere overload but at reduced bandwidth per connection.Other protocols at the same layer as TCP should also be respon-sive to Source Quench. In each case we would suggest that newtraffic should be throttled but acknowledges should be treatednormally. The only serious problem comes from the User DatagramProtocol, not normally a major traffic generator. We have notimplemented any throttling in these protocols as yet; all arepassed Source Quench messages by ICMP but ignore them. Self-defense for gatewaysAs we have shown, gateways are vulnerable to host mismanagementof congestion. Host misbehavior by excessive traffic generationcan prevent not only the host's own traffic from getting through,but can interfere with other unrelated traffic. The problem canbe dealt with at the host level but since one malfunctioning hostcan interfere with others, future gateways should be capable ofdefending themselves against such behavior by obnoxious or mali-cious hosts. We offer some basic self-defense techniques.On one occasion in late 1983, a TCP bug in an ARPANET host causedthe host to frantically generate retransmissions of the samedatagram as fast as the ARPANET would accept them. The gateway________ * This follows the control engineering dictum "Never bother with proportional control unless bang-bang doesn't work".

RFC 896 Congestion Control in IP/TCP Internetworks 1/6/84that connected our net with the ARPANET was saturated and littleuseful traffic could get through, since the gateway had morebandwidth to the ARPANET than to our net. The gateway busilysent ICMP Source Quench messages but the malfunctioning hostignored them. This continued for several hours, until the mal-functioning host crashed. During this period, our network waseffectively disconnected from the ARPANET.When a gateway is forced to discard a packet, the packet isselected at the discretion of the gateway. Classic techniquesfor making this decision are to discard the most recentlyreceived packet, or the packet at the end of the longest outgoingqueue. We suggest that a worthwhile practical measure is to dis-card the latest packet from the host that originated the mostpackets currently queued within the gateway. This strategy willtend to balance throughput amongst the hosts using the gateway.We have not yet tried this strategy, but it seems a reasonablestarting point for gateway self-protection.Another strategy is to discard a newly arrived packet if thepacket duplicates a packet already in the queue. The computa-tional load for this check is not a problem if hashing techniquesare used. This check will not protect against malicious hostsbut will provide some protection against TCP implementations withpoor retransmission control. Gateways between fast local net-works and slower long-haul networks may find this check valuableif the local hosts are tuned to work well with the local network.Ideally the gateway should detect malfunctioning hosts andsquelch them; such detection is difficult in a pure datagram sys-tem. Failure to respond to an ICMP Source Quench message,though, should be regarded as grounds for action by a gateway todisconnect a host. Detecting such failure is non-trivial but isa worthwhile area for further research. ConclusionThe congestion control problems associated with pure datagramnetworks are difficult, but effective solutions exist. If IP /TCP networks are to be operated under heavy load, TCP implementa-tions must address several key issues in ways at least as effec-tive as the ones described here.

[8]ページ先頭