RFC: 817

MODULARITY AND EFFICIENCY IN PROTOCOL IMPLEMENTATION

   David D. Clark
   MIT Laboratory for Computer Science
   Computer Systems and Communications Group
   July, 1982

1. Introduction

Many protocol implementers have made the unpleasant discovery that their packages do not run quite as fast as they had hoped. The blame for this widely observed problem has been attributed to a variety of causes, ranging from details in the design of the protocol to the underlying structure of the host operating system. This RFC will discuss some of the commonly encountered reasons why protocol implementations seem to run slowly.

Experience suggests that one of the most important factors in determining the performance of an implementation is the manner in which that implementation is modularized and integrated into the host operating system. For this reason, it is useful to discuss the question of how an implementation is structured at the same time that we consider how it will perform. In fact, this RFC will argue that modularity is one of the chief villains in attempting to obtain good performance, so that the designer is faced with a delicate and inevitable tradeoff between good structure and good performance. Further, the single factor which most strongly determines how well this conflict can be resolved is not the protocol but the operating system.
2. Efficiency Considerations

There are many aspects to efficiency. One aspect is sending data at minimum transmission cost, which is a critical aspect of common carrier communications, if not of local area network communications. Another aspect is sending data at a high rate, which may not be possible at all if the net is very slow, but which may be the one central design constraint when taking advantage of a local net with high raw bandwidth. The final consideration is doing the above with minimum expenditure of computer resources. This last may be necessary to achieve high speed, but in the case of the slow net may be important only in that the resources used up, for example cpu cycles, are costly or otherwise needed. It is worth pointing out that these different goals often conflict; for example, it is often possible to trade off efficient use of the computer against efficient use of the network. Thus, there may be no such thing as a successful general purpose protocol implementation.

The simplest measure of performance is throughput, measured in bits per second. It is worth doing a few simple computations in order to get a feeling for the magnitude of the problems involved. Assume that data is being sent from one machine to another in packets of 576 bytes, the maximum generally acceptable internet packet size. Allowing for header overhead, this packet size permits 4288 bits in each packet. If a useful throughput of 10,000 bits per second is desired, then a data bearing packet must leave the sending host about every 430 milliseconds, a little over two per second. This is clearly not difficult to achieve. However, if one wishes to achieve 100 kilobits per second throughput,
the packet must leave the host every 43 milliseconds, and to achieve one megabit per second, which is not at all unreasonable on a high-speed local net, the packets must be spaced no more than 4.3 milliseconds apart.
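As an illustration, the following fragment of C reproduces these figures directly. It is a sketch only; the figure of 4288 data bits per packet is taken from the arithmetic above, and corresponds to assuming 40 bytes of combined TCP and IP header in each 576-byte packet.

    #include <stdio.h>

    /* Packet spacing required to sustain a target throughput,
     * assuming 576-byte packets carrying 4288 data bits each
     * (576 bytes less an assumed 40 bytes of TCP/IP header).
     */
    int main(void)
    {
        const double bits_per_packet = 4288.0;
        const double targets[] = { 10e3, 100e3, 1e6 };  /* bits/sec */
        int i;

        for (i = 0; i < 3; i++) {
            double pkts_per_sec = targets[i] / bits_per_packet;
            printf("%8.0f bps: %6.2f packets/sec, one every %5.1f ms\n",
                   targets[i], pkts_per_sec, 1000.0 / pkts_per_sec);
        }
        return 0;
    }

Run, this prints intervals of about 430, 43, and 4.3 milliseconds for the three target rates.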
These latter numbers are a slightly more alarming goal for which to set one's sights. Many operating systems take a substantial fraction of a millisecond just to service an interrupt. If the protocol has been structured as a process, it is necessary to go through a process scheduling before the protocol code can even begin to run. If any piece of a protocol package or its data must be fetched from disk, real-time delays of 30 to 100 milliseconds can be expected. If the protocol must compete for cpu resources with other processes of the system, it may be necessary to wait a scheduling quantum before the protocol can run. Many systems have a scheduling quantum of 100 milliseconds or more. Considering these sorts of numbers, it becomes immediately clear that the protocol must be fitted into the operating system in a thorough and effective manner if anything like reasonable throughput is to be achieved.

There is one obvious conclusion immediately suggested by even this simple analysis. Except in very special circumstances, when many packets are being processed at once, the cost of processing a packet is dominated by factors, such as cpu scheduling, which are independent of the packet size. This suggests two general rules which any implementation ought to obey. First, send data in large packets. Obviously, if processing time per packet is a constant, then throughput will be directly proportional to the packet size. Second, never send an unneeded packet. Unneeded packets use up just as many resources as a packet full of data, but perform no useful function. RFC 813, "Window and Acknowledgement Strategy in TCP", discusses one aspect of reducing the number of packets sent per useful data byte. This document will mention other attacks on the same problem.

The above analysis suggests that there are two main parts to the problem of achieving good protocol performance. The first has to do with how the protocol implementation is integrated into the host operating system. The second has to do with how the protocol package itself is organized internally. This document will consider each of these topics in turn.

3. The Protocol vs. the Operating System

There are normally three reasonable ways in which to add a protocol to an operating system. The protocol can be in a process that is provided by the operating system, or it can be part of the kernel of the operating system itself, or it can be put in a separate communications processor or front-end machine. This decision is strongly influenced by details of hardware architecture and operating system design; each of these three approaches has its own advantages and disadvantages.

The "process" is the abstraction which most operating systems use to provide the execution environment for user programs. A very simple path for implementing a protocol is to obtain a process from the operating system and implement the protocol to run in it. Superficially, this approach has a number of advantages. Since
modifications to the kernel are not required, the job can be done by someone who is not an expert in the kernel structure. Since it is often impossible to find somebody who is experienced both in the structure of the operating system and the structure of the protocol, this path, from a management point of view, is often extremely appealing. Unfortunately, putting a protocol in a process has a number of disadvantages, related to both structure and performance. First, as was discussed above, process scheduling can be a significant source of real-time delay. There is not only the actual cost of going through the scheduler, but the problem that the operating system may not have the right sort of priority tools to bring the process into execution quickly whenever there is work to be done.

Structurally, the difficulty with putting a protocol in a process is that the protocol may be providing services, for example support of data streams, which are normally obtained by going to special kernel entry points. Depending on the generality of the operating system, it may be impossible to take a program which is accustomed to reading through a kernel entry point, and redirect it so it is reading the data from a process. The most extreme example of this problem occurs when implementing server telnet. In almost all systems, the device handler for the locally attached teletypes is located inside the kernel, and programs read and write from their teletype by making kernel calls. If server telnet is implemented in a process, it is then necessary to take the data streams provided by server telnet and somehow get them back down inside the kernel so that they mimic the interface provided by local teletypes. It is usually the case that special kernel
modification is necessary to achieve this structure, which somewhat defeats the benefit of having removed the protocol from the kernel in the first place.

Clearly, then, there are advantages to putting the protocol package in the kernel. Structurally, it is reasonable to view the network as a device, and device drivers are traditionally contained in the kernel. Presumably, the problems associated with process scheduling can be sidestepped, at least to a certain extent, by placing the code inside the kernel. And it is obviously easier to make the server telnet channels mimic the local teletype channels if they are both realized at the same level in the kernel.

However, implementation of protocols in the kernel has its own set of pitfalls. First, network protocols have a characteristic which is shared by almost no other device: they require rather complex actions to be performed as a result of a timeout. The problem with this requirement is that the kernel often has no facility by which a program can be brought into execution as a result of the timer event. What is really needed, of course, is a special sort of process inside the kernel. Most systems lack this mechanism. Failing that, the only execution mechanism available is to run at interrupt time.

There are substantial drawbacks to implementing a protocol to run at interrupt time. First, the actions performed may be somewhat complex and time consuming, compared to the maximum amount of time that the operating system is prepared to spend servicing an interrupt. Problems can arise if interrupts are masked for too long. This is particularly
bad when running as a result of a clock interrupt, which can imply that the clock interrupt is masked. Second, the environment provided by an interrupt handler is usually extremely primitive compared to the environment of a process. There are usually a variety of system facilities which are unavailable while running in an interrupt handler. The most important of these is the ability to suspend execution pending the arrival of some event or message. It is a cardinal rule of almost every known operating system that one must not invoke the scheduler while running in an interrupt handler. Thus, the programmer who is forced to implement all or part of his protocol package as an interrupt handler must be the best sort of expert in the operating system involved, and must be prepared for development sessions filled with obscure bugs which crash not just the protocol package but the entire operating system.

A final problem with processing at interrupt time is that the system scheduler has no control over the percentage of system time used by the protocol handler. If a large number of packets arrive, from a foreign host that is either malfunctioning or fast, all of the time may be spent in the interrupt handler, effectively killing the system.
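One discipline which limits this exposure, sketched below in illustrative C, is to have the interrupt handler do nothing but record the arrival and wake a process-level worker, so that the time spent with interrupts masked is bounded. The queue, its size, and all of the names here are assumptions of the example, not features of any particular system.

    #include <stdio.h>

    /* Illustrative sketch: keep interrupt-time work minimal by
     * queueing the event and deferring real protocol processing
     * to process level.
     */

    #define QSIZE 32

    static int event_queue[QSIZE];   /* e.g., packet identifiers */
    static int q_head, q_tail;

    /* Runs at interrupt time: short, and must never block. */
    void net_interrupt(int packet_id)
    {
        int next = (q_tail + 1) % QSIZE;
        if (next == q_head)
            return;                  /* queue full: drop the packet */
        event_queue[q_tail] = packet_id;
        q_tail = next;
        /* a real system would do: wakeup(protocol_process) */
    }

    /* Runs at process level: may take time, may call the scheduler. */
    void protocol_process(void)
    {
        while (q_head != q_tail) {
            int id = event_queue[q_head];
            q_head = (q_head + 1) % QSIZE;
            printf("processing packet %d (timers, ACKs, user wakeup)\n", id);
        }
    }

    int main(void)
    {
        net_interrupt(1);            /* simulate two arrivals */
        net_interrupt(2);
        protocol_process();
        return 0;
    }

Note that even this discipline does not restore scheduler control over a flood of arrivals; it only shortens each individual interrupt.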
There are other problems associated with putting protocols into an operating system kernel. The simplest problem often encountered is that the kernel address space is simply too small to hold the piece of code in question. This is a rather artificial sort of problem, but it is a severe problem nonetheless on many machines. It is an appallingly unpleasant experience to do an implementation with the knowledge that for every byte of new feature put in one must find some other byte of old feature to throw out. It is hopeless to expect an effective and general implementation under this kind of constraint. Another problem is that the protocol package, once it is thoroughly entwined in the operating system, may need to be redone every time the operating system changes. If the protocol and the operating system are not maintained by the same group, this makes maintenance of the protocol package a perpetual headache.

The third option for protocol implementation is to take the protocol package and move it outside the machine entirely, onto a separate processor dedicated to this kind of task. Such a machine is often described as a communications processor or a front-end processor. There are several advantages to this approach. First, the operating system on the communications processor can be tailored for precisely this kind of task. This makes the job of implementation much easier. Second, one does not need to redo the task for every machine to which the protocol is to be added. It may be possible to reuse the same front-end machine on different host computers. Since the task need not be done as many times, one might hope that more attention could be paid to doing it right. Given a careful implementation in an environment which is optimized for this kind of task, the resulting package should turn out to be very efficient. Unfortunately, there are also problems with this approach. There is, of course, a financial problem associated with buying an additional computer. In many cases, this is not a problem at all, since the cost is negligible compared to what the programmer would cost to do the job in the mainframe itself. More
fundamentally, the communications processor approach does not completely sidestep any of the problems raised above. The reason is that the communications processor, since it is a separate machine, must be attached to the mainframe by some mechanism. Whatever that mechanism, code is required in the mainframe to deal with it. It can be argued that the program to deal with the communications processor is simpler than the program to implement the entire protocol package. Even if that is so, the communications processor interface package is still a protocol in nature, with all of the same structural problems. Thus, all of the issues raised above must still be faced. In addition to those problems, there are some other, more subtle problems associated with an outboard implementation of a protocol. We will return to these problems later.

There is a way of attaching a communications processor to a mainframe host which sidesteps all of the mainframe implementation problems, which is to use some preexisting interface on the host machine as the port by which the communications processor is attached. This strategy is often used as a last stage of desperation when the software on the host computer is so intractable that it cannot be changed in any way. Unfortunately, it is almost inevitably the case that all of the available interfaces are totally unsuitable for this purpose, so the result is unsatisfactory at best. The most common way in which this form of attachment occurs is when a network connection is being used to mimic local teletypes. In this case, the front-end processor can be attached to the mainframe by simply providing a number of wires out of the front-end processor, each corresponding to a connection, which are
plugged into teletype ports on the mainframe computer. (Because of the appearance of the physical configuration which results from this arrangement, Michael Padlipsky has described this as the "milking machine" approach to computer networking.) This strategy solves the immediate problem of providing remote access to a host, but it is extremely inflexible. The channels being provided to the host are restricted by the host software to one purpose only, remote login. It is impossible to use them for any other purpose, such as file transfer or sending mail, so the host is integrated into the network environment in an extremely limited and inflexible manner. If this is the best that can be done, then it should be tolerated. Otherwise, implementors should be strongly encouraged to take a more flexible approach.

4. Protocol Layering

The previous discussion suggested that there was a decision to be made as to where a protocol ought to be implemented. In fact, the decision is much more complicated than that, for the goal is not to implement a single protocol, but to implement a whole family of protocol layers, starting with a device driver or local network driver at the bottom, then IP and TCP, and eventually reaching the application specific protocols, such as Telnet, FTP and SMTP, on the top. Clearly, the bottommost of these layers is somewhere within the kernel, since the physical device driver for the net is almost inevitably located there. Equally clearly, the top layers of this package, which provide the user his ability to perform the remote login function or to send mail, are not entirely contained within the kernel. Thus, the question is not
whether the protocol family shall be inside or outside the kernel, but how it shall be sliced in two between that part inside and that part outside.

Since protocols come nicely layered, an obvious proposal is that one of the layer interfaces should be the point at which the inside and outside components are sliced apart. Most systems have been implemented in this way, and many have been made to work quite effectively. One obvious place to slice is at the upper interface of TCP. Since TCP provides a bidirectional byte stream, which is somewhat similar to the I/O facility provided by most operating systems, it is possible to make the interface to TCP almost mimic the interface to other existing devices. Except in the matter of opening a connection, and dealing with peculiar failures, the software using TCP need not know that it is a network connection, rather than a local I/O stream, that is providing the communications function. This approach does put TCP inside the kernel, which raises all the problems addressed above. It also raises the problem that the interface to the IP layer can, if the programmer is not careful, become excessively buried inside the kernel. It must be remembered that things other than TCP are expected to run on top of IP. The IP interface must be made accessible, even if TCP sits on top of it inside the kernel.

Another obvious place to slice is above Telnet. The advantage of slicing above Telnet is that it solves the problem of having remote login channels emulate local teletype channels. The disadvantage of putting Telnet into the kernel is that the amount of code which has now
been included there is getting remarkably large. In some early implementations, the size of the network package, when one includes protocols at the level of Telnet, rivals the size of the rest of the supervisor. This leads to vague feelings that all is not right.

Any attempt to slice through a lower layer boundary, for example between internet and TCP, reveals one fundamental problem. The TCP layer, as well as the IP layer, performs a demultiplexing function on incoming datagrams. Until the TCP header has been examined, it is not possible to know for which user the packet is ultimately destined. Therefore, if TCP, as a whole, is moved outside the kernel, it is necessary to create one separate process called the TCP process, which performs the TCP multiplexing function, and probably all of the rest of TCP processing as well. This means that incoming data destined for a user process involves not just a scheduling of the user process, but scheduling the TCP process first.

This suggests an alternative structuring strategy which slices through the protocols, not along an established layer boundary, but along a functional boundary having to do with demultiplexing. In this approach, certain parts of IP and certain parts of TCP are placed in the kernel. The amount of code placed there is sufficient so that when an incoming datagram arrives, it is possible to know for which process that datagram is ultimately destined. The datagram is then routed directly to the final process, where additional IP and TCP processing is performed on it. This removes from the kernel any requirement for timer based actions, since they can be done by the process provided by the
user. This structure has the additional advantage of reducing the amount of code required in the kernel, so that it is suitable for systems where kernel space is at a premium. RFC 814, titled "Names, Addresses, Ports, and Routes," discusses this rather orthogonal slicing strategy in more detail.

A related discussion of protocol layering and multiplexing can be found in Cohen and Postel [1].
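A minimal sketch of such a kernel-resident demultiplexer appears below, in illustrative C. All of the structure and field names are assumptions of this example; the point is only how little header inspection is needed before the destination process is known.

    #include <stdio.h>

    /* Illustrative sketch of kernel demultiplexing: just enough of
     * IP and TCP header examination to pick the destination process,
     * which then does the rest of the protocol processing itself.
     */

    struct conn {
        unsigned short local_port, remote_port;
        unsigned long  remote_addr;
        int            owner_pid;   /* process doing the rest of TCP/IP */
    };

    #define NCONN 8
    struct conn conn_table[NCONN];

    /* Return the pid that should receive this datagram, or -1. */
    int demux(unsigned long src_addr, unsigned short src_port,
              unsigned short dst_port)
    {
        int i;
        for (i = 0; i < NCONN; i++) {
            struct conn *c = &conn_table[i];
            if (c->owner_pid != 0 &&
                c->local_port  == dst_port &&
                c->remote_port == src_port &&
                c->remote_addr == src_addr)
                return c->owner_pid;
        }
        return -1;                  /* no match: reject or discard */
    }

    int main(void)
    {
        conn_table[0].local_port  = 23;   /* a telnet connection */
        conn_table[0].remote_port = 1024;
        conn_table[0].remote_addr = 0x0A000001;
        conn_table[0].owner_pid   = 42;

        printf("deliver to pid %d\n", demux(0x0A000001, 1024, 23));
        return 0;
    }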
5. Breaking Down the Barriers

In fact, the implementor should be sensitive to the possibility of even more peculiar slicing strategies in dividing up the various protocol layers between the kernel and the one or more user processes. The result of the strategy proposed above was that part of TCP should execute in the process of the user. In other words, instead of having one TCP process for the system, there is one TCP process per connection. Given this architecture, it is no longer necessary to imagine that all of the TCPs are identical. One TCP could be optimized for high throughput applications, such as file transfer. Another TCP could be optimized for small, low delay applications such as Telnet. In fact, it would be possible to produce a TCP which was somewhat integrated with the Telnet or FTP on top of it. Such an integration is extremely important, for it can lead to a kind of efficiency which more traditional structures are incapable of producing. Earlier, this paper pointed out that one of the important rules for achieving efficiency was to send the minimum number of packets for a given amount of data. The idea of protocol layering interacts very strongly (and poorly) with this goal, because independent layers have independent ideas about when packets should be sent, and unless these layers can somehow be brought into cooperation, additional packets will flow. The best example of this is the operation of server telnet in a character-at-a-time remote echo mode on top of TCP. When a packet containing a character arrives at a server host, each layer has a different response to that packet. TCP has an obligation to acknowledge the packet. Either server telnet or the application layer above has an obligation to echo the character received in the packet. If the character is a Telnet control sequence, then Telnet has additional actions which it must perform in response to the packet. The result of this, in most implementations, is that several packets are sent back in response to the one arriving packet. Combining all of these return messages into one packet is important for several reasons. First, of course, it reduces the number of packets being sent over the net, which directly reduces the charges incurred under many common carrier tariff structures. Second, it reduces the number of scheduling actions which will occur inside both hosts, which, as was discussed above, is extremely important in improving throughput.

The way to achieve this goal of packet sharing is to break down the barrier between the layers of the protocols, in a very restrained and careful manner, so that a limited amount of information can leak across the barrier to enable one layer to optimize its behavior with respect to the desires of the layers above and below it. For example, it would represent an improvement if TCP, when it received a packet, could ask the layer above whether or not it would be worth pausing for a few milliseconds before sending an acknowledgement in order to see if the
upper layer would have any outgoing data to send. Dallying before sending the acknowledgement produces precisely the right sort of optimization if the client of TCP is server Telnet. However, dallying before sending an acknowledgement is absolutely unacceptable if TCP is being used for file transfer, for in file transfer there is almost never data flowing in the reverse direction, and the delay in sending the acknowledgement probably translates directly into a delay in obtaining the next packets. Thus, TCP must know a little about the layers above it to adjust its performance as needed.

It would be possible to imagine a general purpose TCP which was equipped with all sorts of special mechanisms by which it would query the layer above and modify its behavior accordingly. In the structures suggested above, in which there is not one but several TCPs, the TCP can simply be modified so that it produces the correct behavior as a matter of course. This structure has the disadvantage that there will be several implementations of TCP existing on a single machine, which can mean more maintenance headaches if a problem is found where TCP needs to be changed. However, it is probably the case that each of the TCPs will be substantially simpler than the general purpose TCP which would otherwise have been built. There are some experimental projects currently under way which suggest that this approach may make designing of a TCP, or almost any other layer, substantially easier, so that the total effort involved in bringing up a complete package is actually less if this approach is followed. This approach is by no means generally accepted, but deserves some consideration.
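The following illustrative C fragment shows the shape of such a dallying decision. The policy flag and the five millisecond figure are assumptions of this example; with one TCP per connection, as proposed above, the choice can simply be compiled into each TCP rather than queried at run time.

    #include <stdio.h>

    /* Illustrative sketch of the "dally" decision described above:
     * hold the ACK briefly on an interactive connection, hoping to
     * piggyback it on the echo; never hold it on a bulk transfer.
     */

    struct tcb {
        int  interactive;    /* 1: remote-echo telnet; 0: file transfer */
        long dally_ms;       /* how long to hold an ACK for piggybacking */
    };

    /* Decide how long to withhold the ACK for an arriving segment. */
    long ack_delay(struct tcb *t)
    {
        if (t->interactive)
            return t->dally_ms;  /* echo data will likely carry the ACK */
        return 0;                /* bulk transfer: ACK at once */
    }

    int main(void)
    {
        struct tcb telnet_conn = { 1, 5 };  /* 5 ms is an assumed figure */
        struct tcb ftp_conn    = { 0, 5 };

        printf("telnet: hold ACK %ld ms\n", ack_delay(&telnet_conn));
        printf("ftp:    hold ACK %ld ms\n", ack_delay(&ftp_conn));
        return 0;
    }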
The general conclusion to be drawn from this sort of consideration is that a layer boundary has both a benefit and a penalty. A visible layer boundary, with a well specified interface, provides a form of isolation between two layers which allows one to be changed with the confidence that the other one will not stop working as a result. However, a firm layer boundary almost inevitably leads to inefficient operation. This can easily be seen by analogy with other aspects of operating systems. Consider, for example, file systems. A typical operating system provides a file system, which is a highly abstracted representation of a disk. The interface is highly formalized, and presumed to be highly stable. This makes it very easy for naive users to have access to disks without having to write a great deal of software. The existence of a file system is clearly beneficial. On the other hand, it is clear that the restricted interface to a file system almost inevitably leads to inefficiency. If the interface is organized as a sequential read and write of bytes, then there will be people who wish to do high throughput transfers who cannot achieve their goal. If the interface is a virtual memory interface, then other users will regret the necessity of building a byte stream interface on top of the memory mapped file. The most objectionable inefficiency results when a highly sophisticated package, such as a data base management package, must be built on top of an existing operating system. Almost inevitably, the implementors of the database system attempt to reject the file system and obtain direct access to the disks. They have sacrificed modularity for efficiency.

The same conflict appears in networking, in a rather extreme form.
The concept of a protocol is still unknown and frightening to most naive programmers. The idea that they might have to implement a protocol, or even part of a protocol, as part of some application package, is a dreadful thought. And thus there is great pressure to hide the function of the net behind a very hard barrier. On the other hand, the kind of inefficiency which results from this is a particularly undesirable sort of inefficiency, for it shows up, among other things, in increasing the cost of the communications resource used up to achieve the application goal. In cases where one must pay for one's communications costs, they usually turn out to be the dominant cost within the system. Thus, doing an excessively good job of packaging up the protocols in an inflexible manner has a direct impact on increasing the cost of the critical resource within the system. This is a dilemma which will probably only be solved when programmers become somewhat less alarmed about protocols, so that they are willing to weave a certain amount of protocol structure into their application program, much as application programs today weave parts of database management systems into the structure of their application program.

An extreme example of putting the protocol package behind a firm layer boundary occurs when the protocol package is relegated to a front-end processor. In this case the interface to the protocol is some other protocol. It is difficult to imagine how to build close cooperation between layers when they are that far separated. Realistically, one of the prices which must be associated with an implementation so physically modularized is that the performance will suffer as a result. Of course, a separate processor for protocols could be very closely integrated into
the mainframe architecture, with interprocessor co-ordination signals, shared memory, and similar features. Such a physical modularity might work very well, but there is little documented experience with this closely coupled architecture for protocol support.

6. Efficiency of Protocol Processing

To this point, this document has considered how a protocol package should be broken into modules, and how those modules should be distributed between free-standing machines, the operating system kernel, and one or more user processes. It is now time to consider the other half of the efficiency question, which is what can be done to speed the execution of those programs that actually implement the protocols. We will make some specific observations about TCP and IP, and then conclude with a few generalities.

IP is a simple protocol, especially with respect to the processing of normal packets, so it should be easy to get it to perform efficiently. The only area of any complexity related to actual packet processing has to do with fragmentation and reassembly. The reader is referred to RFC 815, titled "IP Datagram Reassembly Algorithms", for specific consideration of this point.

Most costs in the IP layer come from table look up functions, as opposed to packet processing functions. An outgoing packet requires two translation functions to be performed. The internet address must be translated to a target gateway, and a gateway address must be translated to a local network number (if the host is attached to more than one
network). It is easy to build a simple implementation of these table look up functions that in fact performs very poorly. The programmer should keep in mind that there may be as many as a thousand network numbers in a typical configuration. Linear searching of a thousand entry table on every packet is extremely unsuitable. In fact, it may be worth asking TCP to cache a hint for each connection, which can be handed down to IP each time a packet is sent, to try to avoid the overhead of a table look up.
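A sketch of such a hint, in illustrative C, appears below. The structure and names are assumptions of this example; the essential idea is only that the common case, sending another packet on an established connection, never touches the large table.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative sketch of the per-connection routing hint
     * suggested above: TCP remembers the gateway IP chose last
     * time, and hands it back on each send so that IP can skip
     * the full table search.
     */

    struct route_hint {
        unsigned long dest_net;   /* network this hint is valid for */
        unsigned long gateway;    /* first-hop gateway address */
        int           valid;
    };

    unsigned long route_table_search(unsigned long dest_net)
    {
        /* stand-in for the expensive search over ~1000 networks */
        return 0x0A0000FE;
    }

    unsigned long ip_route(unsigned long dest_net, struct route_hint *hint)
    {
        if (hint->valid && hint->dest_net == dest_net)
            return hint->gateway;         /* fast path: no table search */
        hint->gateway  = route_table_search(dest_net);
        hint->dest_net = dest_net;
        hint->valid    = 1;
        return hint->gateway;
    }

    int main(void)
    {
        struct route_hint h;
        memset(&h, 0, sizeof h);
        ip_route(0x0A000000, &h);         /* first send: full search */
        printf("cached gateway %#lx\n", ip_route(0x0A000000, &h));
        return 0;
    }

The hint must, of course, be discarded if the routing information changes; a hint is a cache, not a copy of the truth.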
TCP is a more complex protocol, and presents many more opportunities for getting things wrong. There is one area which is generally accepted as causing noticeable and substantial overhead as part of TCP processing. This is computation of the checksum. It would be nice if this cost could be avoided somehow, but the idea of an end-to-end checksum is absolutely central to the functioning of TCP. No host implementor should think of omitting the validation of a checksum on incoming data.

Various clever tricks have been used to try to minimize the cost of computing the checksum. If it is possible to add additional microcoded instructions to the machine, a checksum instruction is the most obvious candidate. Since computing the checksum involves picking up every byte of the segment and examining it, it is possible to combine the operation of computing the checksum with the operation of copying the segment from one location to another. Since a number of data copies are probably already required as part of the processing structure, this kind of sharing might conceivably pay off if it didn't cause too much trouble to the modularity of the program. Finally, computation of the checksum seems to be one place where careful attention to the details of the algorithm used can make a drastic difference in the throughput of the program. The Multics system provides one of the best case studies of this, since Multics is about as poorly organized to perform this function as any machine implementing TCP. Multics is a 36-bit word machine, with four 9-bit bytes per word. The eight-bit bytes of a TCP segment are laid down packed in memory, ignoring word boundaries. This means that when it is necessary to pick up the data as a set of 16-bit units for the purpose of adding them to compute checksums, horrible masking and shifting is required for each 16-bit value. An early version of a program using this strategy required 6 milliseconds to checksum a 576-byte segment. Obviously, at this point, checksum computation was becoming the central bottleneck to throughput. A more careful recoding of this algorithm reduced the checksum processing time to less than one millisecond. The strategy used was extremely dirty. It involved adding up carefully selected words of the area in which the data lay, knowing that for those particular words, the 16-bit values were properly aligned inside the words. Only after the addition had been done were the various sums shifted, and finally added to produce the eventual checksum. This kind of highly specialized programming is probably not acceptable if used everywhere within an operating system. It is clearly appropriate for one highly localized function which can be clearly identified as an extreme performance bottleneck.
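For reference, a straightforward (and deliberately unoptimized) rendering of the 16-bit one's complement checksum in C follows; it is exactly this inner loop that the microcode, copy-combining, and alignment tricks above are trying to speed up. The byte-at-a-time structure is chosen here for clarity, not speed.

    #include <stdio.h>

    /* Straightforward 16-bit one's-complement sum over a segment,
     * taking the data two bytes at a time and folding carries back
     * into the low 16 bits at the end.
     */
    unsigned short cksum(const unsigned char *data, int len)
    {
        unsigned long sum = 0;
        int i;

        for (i = 0; i + 1 < len; i += 2)
            sum += ((unsigned long)data[i] << 8) | data[i + 1];
        if (len & 1)
            sum += (unsigned long)data[len - 1] << 8;  /* pad odd byte */

        while (sum >> 16)               /* fold carries back in */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (unsigned short)~sum;
    }

    int main(void)
    {
        unsigned char seg[6] = { 0x45, 0x00, 0x00, 0x1c, 0xab, 0xcd };
        printf("checksum = %#06x\n", cksum(seg, 6));
        return 0;
    }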
Another area of TCP processing which may cause performance problems is the overhead of examining all of the possible flags and options which occur in each incoming packet. One paper, by Bunch and Day [2], asserts that the overhead of packet header processing is actually an important limiting factor in throughput computation. Not all measurement experiments have tended to support this result. To whatever extent it is true, however, there is an obvious strategy which the implementor ought to use in designing his program. He should build his program to optimize the expected case. It is easy, especially when first designing a program, to pay equal attention to all of the possible outcomes of every test. In practice, however, few of these will ever happen. A TCP should be built on the assumption that the next packet to arrive will have absolutely nothing special about it, and will be the next one expected in the sequence space. One or two tests are sufficient to determine that the expected set of control flags are on. (The ACK flag should be on; the Push flag may or may not be on. No other flags should be on.) One test is sufficient to determine that the sequence number of the incoming packet is one greater than the last sequence number received. In almost every case, that will be the actual result.
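In illustrative C, that fast path test might look like the sketch below. The flag values are the standard TCP code bits (ACK = 0x10, PSH = 0x08); the structure and names are assumptions of this example.

    #include <stdio.h>

    /* Illustrative sketch of testing the expected case first: one
     * test on the flags, one test on the sequence number, and the
     * segment can take the fast path.
     */

    #define TH_ACK 0x10
    #define TH_PSH 0x08

    struct tcb { unsigned long rcv_nxt; };  /* next sequence expected */

    /* Returns 1 if the segment can take the fast path. */
    int expected_case(struct tcb *t, unsigned char flags, unsigned long seq)
    {
        /* one test on the flags: ACK on, nothing but PSH besides it */
        if ((flags & ~TH_PSH) != TH_ACK)
            return 0;
        /* one test on the sequence number: exactly the one expected */
        if (seq != t->rcv_nxt)
            return 0;
        return 1;
    }

    int main(void)
    {
        struct tcb t = { 1000 };
        printf("%d\n", expected_case(&t, TH_ACK | TH_PSH, 1000)); /* 1 */
        printf("%d\n", expected_case(&t, TH_ACK, 999));           /* 0 */
        return 0;
    }

Segments which fail these tests fall through to the slow, fully general processing path; the point is that almost none do.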
Again, using the Multics system as an example, failure to optimize the case of receiving the expected sequence number had a detectable effect on the performance of the system. The particular problem arose when a number of packets arrived at once. TCP attempted to process all of these packets before awaking the user. As a result, by the time the last packet arrived, there was a threaded list of packets which had several items on it. When a new packet arrived, the list was searched to find the location into which the packet should be inserted. Obviously, the list should be searched from highest sequence number to lowest sequence number, because one is expecting to receive a packet which comes after those already received. By mistake, the list was searched from front to back, starting with the packets with the lowest sequence number. The amount of time spent searching this list backwards was easily detectable in the metering measurements.

Other data structures can be organized to optimize the action which is normally taken on them. For example, the retransmission queue is very seldom actually used for retransmission, so it should not be organized to optimize that action. In fact, it should be organized to optimize the discarding of things from it when the acknowledgement arrives. In many cases, the easiest way to do this is not to save the packet at all, but to reconstruct it only if it needs to be retransmitted, starting from the data as it was originally buffered by the user.
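The following illustrative sketch shows that organization: the "queue" holds no packets at all, only sequence numbers, so that the common operation, discarding on acknowledgement, is a single comparison and store, and a packet is rebuilt from the user's buffer only on the rare timeout. All names here are assumptions of the example.

    #include <stdio.h>

    /* Illustrative sketch: keep only the sequence range of data
     * sent but not yet acknowledged, rather than saved packets.
     */

    struct send_state {
        const char   *user_buf;   /* data as originally buffered by user */
        unsigned long snd_una;    /* oldest unacknowledged sequence number */
        unsigned long snd_nxt;    /* next sequence number to send */
    };

    /* Common case: an ACK arrives; discarding is one store. */
    void ack_arrived(struct send_state *s, unsigned long ack)
    {
        if (ack > s->snd_una && ack <= s->snd_nxt)
            s->snd_una = ack;
    }

    /* Rare case: timeout; rebuild the segment from the user's data. */
    void retransmit(struct send_state *s)
    {
        printf("rebuild and resend bytes %lu..%lu from user buffer\n",
               s->snd_una, s->snd_nxt - 1);
    }

    int main(void)
    {
        struct send_state s = { "hello world", 1000, 1011 };
        ack_arrived(&s, 1005);    /* discards 5 bytes with no list walk */
        retransmit(&s);           /* rebuilds and resends 1005..1010 */
        return 0;
    }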
There is another generality, at least as important as optimizing the common case, which is to avoid copying data any more times than necessary. One more result from the Multics TCP may prove enlightening here. Multics takes between two and three milliseconds within the TCP layer to process an incoming packet, depending on its size. For a 576-byte packet, the three milliseconds is used up approximately as follows. One millisecond is used computing the checksum. Six hundred microseconds is spent copying the data. (The data is copied twice, at .3 milliseconds a copy.) One of those copy operations could correctly be included as part of the checksum cost, since it is done to get the data on a known word boundary to optimize the checksum algorithm. However, the copy also performs another necessary transfer at the same time. Header processing and packet resequencing takes .7 milliseconds. The rest of the time is used in miscellaneous processing, such as removing packets from the retransmission queue which are acknowledged by this packet. Data copying is the second most expensive single operation after data checksumming. Some implementations, often because of an excessively layered modularity, end up copying the data around a great deal. Other implementations end up copying the data because there is no shared memory between processes, and the data must be moved from process to process via a kernel operation. Unless the amount of this activity is kept strictly under control, it will quickly become the major performance bottleneck.

7. Conclusions

This document has addressed two aspects of obtaining performance from a protocol implementation, the way in which the protocol is layered and integrated into the operating system, and the way in which the detailed handling of the packet is optimized. It would be nice if one or the other of these costs would completely dominate, so that all of one's attention could be concentrated there. Regrettably, this is not so. Depending on the particular sort of traffic one is getting, for example, whether Telnet one-byte packets or file transfer maximum size packets at maximum speed, one can expect to see one or the other cost being the major bottleneck to throughput. Most implementors who have studied their programs in an attempt to find out where the time was going have reached the unsatisfactory conclusion that it is going
equally to all parts of their program. With the possible exception of checksum processing, very few people have ever found that their performance problems were due to a single, horrible bottleneck which they could fix by a single stroke of inventive programming. Rather, the performance was something which was improved by painstaking tuning of the entire program.

Most discussions of protocols begin by introducing the concept of layering, which tends to suggest that layering is a fundamentally wonderful idea which should be a part of every consideration of protocols. In fact, layering is a mixed blessing. Clearly, a layer interface is necessary whenever more than one client of a particular layer is to be allowed to use that same layer. But an interface, precisely because it is fixed, inevitably leads to a lack of complete understanding as to what one layer wishes to obtain from another. This has to lead to inefficiency. Furthermore, layering is a potential snare in that one is tempted to think that a layer boundary, which was an artifact of the specification procedure, is in fact the proper boundary to use in modularizing the implementation. Again, in certain cases, an architected layer must correspond to an implemented layer, precisely so that several clients can have access to that layer in a reasonably straightforward manner. In other cases, cunning rearrangement of the implemented module boundaries to match with various functions, such as the demultiplexing of incoming packets, or the sending of asynchronous outgoing packets, can lead to unexpected performance improvements compared to more traditional implementation strategies. Finally, good performance is something which is difficult to retrofit onto an existing
program. Since performance is influenced, not just by the fine detail, but by the gross structure, it is sometimes the case that in order to obtain a substantial performance improvement, it is necessary to completely redo the program from the bottom up. This is a great disappointment to programmers, especially those doing a protocol implementation for the first time. Programmers who are somewhat inexperienced and unfamiliar with protocols are sufficiently concerned with getting their program logically correct that they do not have the capacity to think at the same time about the performance of the structure they are building. Only after they have achieved a logically correct program do they discover that they have done so in a way which has precluded real performance. Clearly, it is more difficult to design a program thinking from the start about both logical correctness and performance. With time, as implementors as a group learn more about the appropriate structures to use for building protocols, it will be possible to proceed with an implementation project having more confidence that the structure is rational, that the program will work, and that the program will work well. Those of us now implementing protocols have the privilege of being on the forefront of this learning process. It should be no surprise that our programs sometimes suffer from the uncertainty we bring to bear on them.
Citations

[1] Cohen and Postel, "On Protocol Multiplexing", Sixth Data Communications Symposium, ACM/IEEE, November 1979.

[2] Bunch and Day, "Control Structure Overhead in TCP", Trends and Applications: Computer Networking, NBS Symposium, May 1980.