RFC:  817

MODULARITY AND EFFICIENCY IN PROTOCOL IMPLEMENTATION

David D. Clark
MIT Laboratory for Computer Science
Computer Systems and Communications Group

July, 1982

1.  Introduction

Many protocol implementers have made the unpleasant discovery that their packages do not run quite as fast as they had hoped. The blame for this widely observed problem has been attributed to a variety of causes, ranging from details in the design of the protocol to the underlying structure of the host operating system. This RFC will discuss some of the commonly encountered reasons why protocol implementations seem to run slowly.

Experience suggests that one of the most important factors in determining the performance of an implementation is the manner in which that implementation is modularized and integrated into the host operating system. For this reason, it is useful to discuss the question of how an implementation is structured at the same time that we consider how it will perform. In fact, this RFC will argue that modularity is one of the chief villains in attempting to obtain good performance, so that the designer is faced with a delicate and inevitable tradeoff between good structure and good performance. Further, the single factor which most strongly determines how well this conflict can be resolved is not the protocol but the operating system.

2.  Efficiency Considerations

There are many aspects to efficiency. One aspect is sending data at minimum transmission cost, which is a critical aspect of common carrier communications, if not in local area network communications. Another aspect is sending data at a high rate, which may not be possible at all if the net is very slow, but which may be the one central design constraint when taking advantage of a local net with high raw bandwidth. The final consideration is doing the above with minimum expenditure of computer resources. This last may be necessary to achieve high speed, but in the case of the slow net may be important only in that the resources used up, for example cpu cycles, are costly or otherwise needed. It is worth pointing out that these different goals often conflict; for example it is often possible to trade off efficient use of the computer against efficient use of the network. Thus, there may be no such thing as a successful general purpose protocol implementation.

The simplest measure of performance is throughput, measured in bits per second. It is worth doing a few simple computations in order to get a feeling for the magnitude of the problems involved. Assume that data is being sent from one machine to another in packets of 576 bytes, the maximum generally acceptable internet packet size. Allowing for header overhead, this packet size permits 4288 bits in each packet. If a useful throughput of 10,000 bits per second is desired, then a data bearing packet must leave the sending host about every 430 milliseconds, a little over two per second. This is clearly not difficult to achieve. However, if one wishes to achieve 100 kilobits per second throughput, the packet must leave the host every 43 milliseconds, and to achieve one megabit per second, which is not at all unreasonable on a high-speed local net, the packets must be spaced no more than 4.3 milliseconds.
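The arithmetic behind these figures is simple to reproduce. A minimal sketch in C, assuming 40 bytes of combined IP and TCP header per packet (the text above gives only the resulting 4288 data bits):

    #include <stdio.h>

    int main(void) {
        const int packet_bytes = 576;  /* maximum generally acceptable internet packet size */
        const int header_bytes = 40;   /* assumed: 20 bytes IP + 20 bytes TCP header */
        const int data_bits = (packet_bytes - header_bytes) * 8;   /* 4288 bits */
        const double rates[] = { 10e3, 100e3, 1e6 };  /* target throughputs, bits/sec */

        for (int i = 0; i < 3; i++) {
            /* interval between data-bearing packets needed to sustain the rate */
            double interval_ms = 1000.0 * data_bits / rates[i];
            printf("%7.0f bps -> one packet every %5.1f ms\n", rates[i], interval_ms);
        }
        return 0;
    }

Run, this prints intervals of about 429, 43, and 4.3 milliseconds, in line with the figures above.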

These latter numbers are a slightly more alarming goal for which to set one's sights. Many operating systems take a substantial fraction of a millisecond just to service an interrupt. If the protocol has been structured as a process, it is necessary to go through a process scheduling before the protocol code can even begin to run. If any piece of a protocol package or its data must be fetched from disk, real time delays of between 30 to 100 milliseconds can be expected. If the protocol must compete for cpu resources with other processes of the system, it may be necessary to wait a scheduling quantum before the protocol can run. Many systems have a scheduling quantum of 100 milliseconds or more. Considering these sorts of numbers, it becomes immediately clear that the protocol must be fitted into the operating system in a thorough and effective manner if anything like reasonable throughput is to be achieved.

There is one obvious conclusion immediately suggested by even this simple analysis. Except in very special circumstances, when many packets are being processed at once, the cost of processing a packet is dominated by factors, such as cpu scheduling, which are independent of the packet size. This suggests two general rules which any implementation ought to obey. First, send data in large packets. Obviously, if processing time per packet is a constant, then throughput will be directly proportional to the packet size. Second, never send an unneeded packet. Unneeded packets use up just as many resources as a packet full of data, but perform no useful function. RFC 813, "Window and Acknowledgement Strategy in TCP", discusses one aspect of reducing the number of packets sent per useful data byte. This document will mention other attacks on the same problem.

The above analysis suggests that there are two main parts to the problem of achieving good protocol performance. The first has to do with how the protocol implementation is integrated into the host operating system. The second has to do with how the protocol package itself is organized internally. This document will consider each of these topics in turn.

3.  The Protocol vs. the Operating System

There are normally three reasonable ways in which to add a protocol to an operating system. The protocol can be in a process that is provided by the operating system, or it can be part of the kernel of the operating system itself, or it can be put in a separate communications processor or front end machine. This decision is strongly influenced by details of hardware architecture and operating system design; each of these three approaches has its own advantages and disadvantages.

The "process" is the abstraction which most operating systems use to provide the execution environment for user programs. A very simple path for implementing a protocol is to obtain a process from the operating system and implement the protocol to run in it. Superficially, this approach has a number of advantages. Since modifications to the kernel are not required, the job can be done by someone who is not an expert in the kernel structure. Since it is often impossible to find somebody who is experienced both in the structure of the operating system and the structure of the protocol, this path, from a management point of view, is often extremely appealing. Unfortunately, putting a protocol in a process has a number of disadvantages, related to both structure and performance. First, as was discussed above, process scheduling can be a significant source of real-time delay. There is not only the actual cost of going through the scheduler, but the problem that the operating system may not have the right sort of priority tools to bring the process into execution quickly whenever there is work to be done.

Structurally, the difficulty with putting a protocol in a process is that the protocol may be providing services, for example support of data streams, which are normally obtained by going to special kernel entry points. Depending on the generality of the operating system, it may be impossible to take a program which is accustomed to reading through a kernel entry point, and redirect it so it is reading the data from a process. The most extreme example of this problem occurs when implementing server telnet. In almost all systems, the device handler for the locally attached teletypes is located inside the kernel, and programs read and write from their teletype by making kernel calls. If server telnet is implemented in a process, it is then necessary to take the data streams provided by server telnet and somehow get them back down inside the kernel so that they mimic the interface provided by local teletypes. It is usually the case that special kernel modification is necessary to achieve this structure, which somewhat defeats the benefit of having removed the protocol from the kernel in the first place.

Clearly, then, there are advantages to putting the protocol package in the kernel. Structurally, it is reasonable to view the network as a device, and device drivers are traditionally contained in the kernel. Presumably, the problems associated with process scheduling can be sidestepped, at least to a certain extent, by placing the code inside the kernel. And it is obviously easier to make the server telnet channels mimic the local teletype channels if they are both realized in the same level in the kernel.

However, implementation of protocols in the kernel has its own set of pitfalls. First, network protocols have a characteristic which is shared by almost no other device: they require rather complex actions to be performed as a result of a timeout. The problem with this requirement is that the kernel often has no facility by which a program can be brought into execution as a result of the timer event. What is really needed, of course, is a special sort of process inside the kernel. Most systems lack this mechanism. Failing that, the only execution mechanism available is to run at interrupt time.
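A minimal sketch in C of the division of labor this forces (the names are illustrative assumptions, not from this document): the interrupt handler can safely do little more than note that the timer fired, while the complex retransmission action wants the full environment of a process.

    struct tcb;                              /* per-connection state; details omitted */

    static volatile int retransmit_pending;  /* set at interrupt time, consumed later */

    /* Runs at clock-interrupt time.  Interrupts may be masked and the
     * scheduler must not be invoked, so only record that work is pending. */
    void clock_interrupt_handler(void) {
        retransmit_pending = 1;
    }

    /* The "special sort of process inside the kernel" the text asks for:
     * a context in which blocking, buffer allocation, and sending are safe. */
    void protocol_timer_process(struct tcb *connection) {
        if (retransmit_pending) {
            retransmit_pending = 0;
            /* ... rebuild the unacknowledged segment for `connection`
             *     and retransmit it ... */
        }
    }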

There are substantial drawbacks to implementing a protocol to run at interrupt time. First, the actions performed may be somewhat complex and time consuming, compared to the maximum amount of time that the operating system is prepared to spend servicing an interrupt. Problems can arise if interrupts are masked for too long. This is particularly bad when running as a result of a clock interrupt, which can imply that the clock interrupt is masked. Second, the environment provided by an interrupt handler is usually extremely primitive compared to the environment of a process. There are usually a variety of system facilities which are unavailable while running in an interrupt handler. The most important of these is the ability to suspend execution pending the arrival of some event or message. It is a cardinal rule of almost every known operating system that one must not invoke the scheduler while running in an interrupt handler. Thus, the programmer who is forced to implement all or part of his protocol package as an interrupt handler must be the best sort of expert in the operating system involved, and must be prepared for development sessions filled with obscure bugs which crash not just the protocol package but the entire operating system.

A final problem with processing at interrupt time is that the system scheduler has no control over the percentage of system time used by the protocol handler. If a large number of packets arrive, from a foreign host that is either malfunctioning or fast, all of the time may be spent in the interrupt handler, effectively killing the system.

There are other problems associated with putting protocols into an operating system kernel. The simplest problem often encountered is that the kernel address space is simply too small to hold the piece of code in question. This is a rather artificial sort of problem, but it is a severe problem none the less in many machines. It is an appallingly unpleasant experience to do an implementation with the knowledge that for every byte of new feature put in one must find some other byte of old feature to throw out. It is hopeless to expect an effective and general implementation under this kind of constraint. Another problem is that the protocol package, once it is thoroughly entwined in the operating system, may need to be redone every time the operating system changes. If the protocol and the operating system are not maintained by the same group, this makes maintenance of the protocol package a perpetual headache.

The third option for protocol implementation is to take the protocol package and move it outside the machine entirely, on to a separate processor dedicated to this kind of task. Such a machine is often described as a communications processor or a front-end processor. There are several advantages to this approach. First, the operating system on the communications processor can be tailored for precisely this kind of task. This makes the job of implementation much easier. Second, one does not need to redo the task for every machine to which the protocol is to be added. It may be possible to reuse the same front-end machine on different host computers. Since the task need not be done as many times, one might hope that more attention could be paid to doing it right. Given a careful implementation in an environment which is optimized for this kind of task, the resulting package should turn out to be very efficient. Unfortunately, there are also problems with this approach. There is, of course, a financial problem associated with buying an additional computer. In many cases, this is not a problem at all since the cost is negligible compared to what the programmer would cost to do the job in the mainframe itself. More fundamentally, the communications processor approach does not completely sidestep any of the problems raised above. The reason is that the communications processor, since it is a separate machine, must be attached to the mainframe by some mechanism. Whatever that mechanism, code is required in the mainframe to deal with it. It can be argued that the program to deal with the communications processor is simpler than the program to implement the entire protocol package. Even if that is so, the communications processor interface package is still a protocol in nature, with all of the same structural problems. Thus, all of the issues raised above must still be faced. In addition to those problems, there are some other, more subtle problems associated with an outboard implementation of a protocol. We will return to these problems later.

There is a way of attaching a communications processor to a mainframe host which sidesteps all of the mainframe implementation problems, which is to use some preexisting interface on the host machine as the port by which a communications processor is attached. This strategy is often used as a last stage of desperation when the software on the host computer is so intractable that it cannot be changed in any way. Unfortunately, it is almost inevitably the case that all of the available interfaces are totally unsuitable for this purpose, so the result is unsatisfactory at best. The most common way in which this form of attachment occurs is when a network connection is being used to mimic local teletypes. In this case, the front-end processor can be attached to the mainframe by simply providing a number of wires out of the front-end processor, each corresponding to a connection, which are plugged into teletype ports on the mainframe computer. (Because of the appearance of the physical configuration which results from this arrangement, Michael Padlipsky has described this as the "milking machine" approach to computer networking.) This strategy solves the immediate problem of providing remote access to a host, but it is extremely inflexible. The channels being provided to the host are restricted by the host software to one purpose only, remote login. It is impossible to use them for any other purpose, such as file transfer or sending mail, so the host is integrated into the network environment in an extremely limited and inflexible manner. If this is the best that can be done, then it should be tolerated. Otherwise, implementors should be strongly encouraged to take a more flexible approach.

4.  Protocol Layering

The previous discussion suggested that there was a decision to be made as to where a protocol ought to be implemented. In fact, the decision is much more complicated than that, for the goal is not to implement a single protocol, but to implement a whole family of protocol layers, starting with a device driver or local network driver at the bottom, then IP and TCP, and eventually reaching the application specific protocol, such as Telnet, FTP and SMTP on the top. Clearly, the bottommost of these layers is somewhere within the kernel, since the physical device driver for the net is almost inevitably located there. Equally clearly, the top layers of this package, which provide the user his ability to perform the remote login function or to send mail, are not entirely contained within the kernel. Thus, the question is not whether the protocol family shall be inside or outside the kernel, but how it shall be sliced in two between that part inside and that part outside.

Since protocols come nicely layered, an obvious proposal is that one of the layer interfaces should be the point at which the inside and outside components are sliced apart. Most systems have been implemented in this way, and many have been made to work quite effectively. One obvious place to slice is at the upper interface of TCP. Since TCP provides a bidirectional byte stream, which is somewhat similar to the I/O facility provided by most operating systems, it is possible to make the interface to TCP almost mimic the interface to other existing devices. Except in the matter of opening a connection, and dealing with peculiar failures, the software using TCP need not know that it is a network connection, rather than a local I/O stream, that is providing the communications function. This approach does put TCP inside the kernel, which raises all the problems addressed above. It also raises the problem that the interface to the IP layer can, if the programmer is not careful, become excessively buried inside the kernel. It must be remembered that things other than TCP are expected to run on top of IP. The IP interface must be made accessible, even if TCP sits on top of it inside the kernel.

Another obvious place to slice is above Telnet. The advantage of slicing above Telnet is that it solves the problem of having remote login channels emulate local teletype channels. The disadvantage of putting Telnet into the kernel is that the amount of code which has now been included there is getting remarkably large. In some early implementations, the size of the network package, when one includes protocols at the level of Telnet, rivals the size of the rest of the supervisor. This leads to vague feelings that all is not right.

Any attempt to slice through a lower layer boundary, for example between internet and TCP, reveals one fundamental problem. The TCP layer, as well as the IP layer, performs a demultiplexing function on incoming datagrams. Until the TCP header has been examined, it is not possible to know for which user the packet is ultimately destined. Therefore, if TCP, as a whole, is moved outside the kernel, it is necessary to create one separate process called the TCP process, which performs the TCP multiplexing function, and probably all of the rest of TCP processing as well. This means that incoming data destined for a user process involves not just a scheduling of the user process, but scheduling the TCP process first.

This suggests an alternative structuring strategy which slices through the protocols, not along an established layer boundary, but along a functional boundary having to do with demultiplexing. In this approach, certain parts of IP and certain parts of TCP are placed in the kernel. The amount of code placed there is sufficient so that when an incoming datagram arrives, it is possible to know for which process that datagram is ultimately destined. The datagram is then routed directly to the final process, where additional IP and TCP processing is performed on it. This removes from the kernel any requirement for timer based actions, since they can be done by the process provided by the user. This structure has the additional advantage of reducing the amount of code required in the kernel, so that it is suitable for systems where kernel space is at a premium. RFC 814, titled "Names, Addresses, Ports, and Routes", discusses this rather orthogonal slicing strategy in more detail.

A related discussion of protocol layering and multiplexing can be found in Cohen and Postel [1].
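A minimal sketch in C of the kernel-resident demultiplexing step described above: only enough of the headers is examined to identify the destination process, and everything else is deferred. The four-tuple lookup is an assumed connection table, not something specified here, and byte order and option processing are glossed over.

    #include <stdint.h>
    #include <stddef.h>

    struct process;                      /* opaque handle on the destination process */

    /* Assumed connection table, keyed on the address and port four-tuple. */
    struct process *find_connection(uint32_t src_addr, uint32_t dst_addr,
                                    uint16_t src_port, uint16_t dst_port);

    /* The kernel-resident slice: decide for which process a datagram is
     * destined, so it can be routed there for the remaining IP and TCP
     * processing.  Fields are read at their offsets in the IP and TCP headers. */
    struct process *demultiplex(const uint8_t *pkt) {
        size_t ihl = (pkt[0] & 0x0f) * 4;          /* IP header length in bytes */
        if (pkt[9] != 6)                           /* protocol field: 6 is TCP */
            return NULL;
        uint32_t src_addr = (uint32_t)pkt[12] << 24 | pkt[13] << 16 | pkt[14] << 8 | pkt[15];
        uint32_t dst_addr = (uint32_t)pkt[16] << 24 | pkt[17] << 16 | pkt[18] << 8 | pkt[19];
        uint16_t src_port = (uint16_t)(pkt[ihl] << 8 | pkt[ihl + 1]);
        uint16_t dst_port = (uint16_t)(pkt[ihl + 2] << 8 | pkt[ihl + 3]);
        return find_connection(src_addr, dst_addr, src_port, dst_port);
    }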

5.  Breaking Down the Barriers

In fact, the implementor should be sensitive to the possibility of even more peculiar slicing strategies in dividing up the various protocol layers between the kernel and the one or more user processes. The result of the strategy proposed above was that part of TCP should execute in the process of the user. In other words, instead of having one TCP process for the system, there is one TCP process per connection. Given this architecture, it is no longer necessary to imagine that all of the TCPs are identical. One TCP could be optimized for high throughput applications, such as file transfer. Another TCP could be optimized for small low delay applications such as Telnet. In fact, it would be possible to produce a TCP which was somewhat integrated with the Telnet or FTP on top of it. Such an integration is extremely important, for it can lead to a kind of efficiency which more traditional structures are incapable of producing. Earlier, this paper pointed out that one of the important rules to achieving efficiency was to send the minimum number of packets for a given amount of data. The idea of protocol layering interacts very strongly (and poorly) with this goal, because independent layers have independent ideas about when packets should be sent, and unless these layers can somehow be brought into cooperation, additional packets will flow. The best example of this is the operation of server telnet in a character at a time remote echo mode on top of TCP. When a packet containing a character arrives at a server host, each layer has a different response to that packet. TCP has an obligation to acknowledge the packet. Either server telnet or the application layer above has an obligation to echo the character received in the packet. If the character is a Telnet control sequence, then Telnet has additional actions which it must perform in response to the packet. The result of this, in most implementations, is that several packets are sent back in response to the one arriving packet. Combining all of these return messages into one packet is important for several reasons. First, of course, it reduces the number of packets being sent over the net, which directly reduces the charges incurred for many common carrier tariff structures. Second, it reduces the number of scheduling actions which will occur inside both hosts, which, as was discussed above, is extremely important in improving throughput.

The way to achieve this goal of packet sharing is to break down the barrier between the layers of the protocols, in a very restrained and careful manner, so that a limited amount of information can leak across the barrier to enable one layer to optimize its behavior with respect to the desires of the layers above and below it. For example, it would represent an improvement if TCP, when it received a packet, could ask the layer above whether or not it would be worth pausing for a few milliseconds before sending an acknowledgement in order to see if the upper layer would have any outgoing data to send. Dallying before sending the acknowledgement produces precisely the right sort of optimization if the client of TCP is server Telnet. However, dallying before sending an acknowledgement is absolutely unacceptable if TCP is being used for file transfer, for in file transfer there is almost never data flowing in the reverse direction, and the delay in sending the acknowledgement probably translates directly into a delay in obtaining the next packets. Thus, TCP must know a little about the layers above it to adjust its performance as needed.
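As an illustrative sketch in C (an assumption about how such a hint might look, not a mechanism from this document), the leaked information might be no more than a per-connection predicate consulted at acknowledgement time:

    #include <stdbool.h>

    struct tcp_conn;                        /* connection state; details omitted */

    /* Assumed primitives of the surrounding TCP. */
    void send_ack_now(struct tcp_conn *c);
    void start_ack_timer(struct tcp_conn *c, int ms);

    /* Hint leaked from the layer above: is a reply (an echo, a response)
     * likely to follow shortly, so the ACK could ride along with it? */
    bool client_may_reply(struct tcp_conn *c);

    void acknowledge(struct tcp_conn *c) {
        if (client_may_reply(c))
            start_ack_timer(c, 5);   /* dally a few ms: right for server Telnet */
        else
            send_ack_now(c);         /* never dally: right for file transfer */
    }

The five millisecond figure is arbitrary here; the point is only that the decision is driven by knowledge of the client above.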

It would be possible to imagine a general purpose TCP which was equipped with all sorts of special mechanisms by which it would query the layer above and modify its behavior accordingly. In the structures suggested above, in which there is not one but several TCPs, the TCP can simply be modified so that it produces the correct behavior as a matter of course. This structure has the disadvantage that there will be several implementations of TCP existing on a single machine, which can mean more maintenance headaches if a problem is found where TCP needs to be changed. However, it is probably the case that each of the TCPs will be substantially simpler than the general purpose TCP which would otherwise have been built. There are some experimental projects currently under way which suggest that this approach may make designing of a TCP, or almost any other layer, substantially easier, so that the total effort involved in bringing up a complete package is actually less if this approach is followed. This approach is by no means generally accepted, but deserves some consideration.

The general conclusion to be drawn from this sort of consideration is that a layer boundary has both a benefit and a penalty. A visible layer boundary, with a well specified interface, provides a form of isolation between two layers which allows one to be changed with the confidence that the other one will not stop working as a result. However, a firm layer boundary almost inevitably leads to inefficient operation. This can easily be seen by analogy with other aspects of operating systems. Consider, for example, file systems. A typical operating system provides a file system, which is a highly abstracted representation of a disk. The interface is highly formalized, and presumed to be highly stable. This makes it very easy for naive users to have access to disks without having to write a great deal of software. The existence of a file system is clearly beneficial. On the other hand, it is clear that the restricted interface to a file system almost inevitably leads to inefficiency. If the interface is organized as a sequential read and write of bytes, then there will be people who wish to do high throughput transfers who cannot achieve their goal. If the interface is a virtual memory interface, then other users will regret the necessity of building a byte stream interface on top of the memory mapped file. The most objectionable inefficiency results when a highly sophisticated package, such as a data base management package, must be built on top of an existing operating system. Almost inevitably, the implementors of the database system attempt to reject the file system and obtain direct access to the disks. They have sacrificed modularity for efficiency.

The same conflict appears in networking, in a rather extreme form. The concept of a protocol is still unknown and frightening to most naive programmers. The idea that they might have to implement a protocol, or even part of a protocol, as part of some application package, is a dreadful thought. And thus there is great pressure to hide the function of the net behind a very hard barrier. On the other hand, the kind of inefficiency which results from this is a particularly undesirable sort of inefficiency, for it shows up, among other things, in increasing the cost of the communications resource used up to achieve the application goal. In cases where one must pay for one's communications costs, they usually turn out to be the dominant cost within the system. Thus, doing an excessively good job of packaging up the protocols in an inflexible manner has a direct impact on increasing the cost of the critical resource within the system. This is a dilemma which will probably only be solved when programmers become somewhat less alarmed about protocols, so that they are willing to weave a certain amount of protocol structure into their application program, much as application programs today weave parts of database management systems into the structure of their application program.

An extreme example of putting the protocol package behind a firm layer boundary occurs when the protocol package is relegated to a front-end processor. In this case the interface to the protocol is some other protocol. It is difficult to imagine how to build close cooperation between layers when they are that far separated. Realistically, one of the prices which must be associated with an implementation so physically modularized is that the performance will suffer as a result. Of course, a separate processor for protocols could be very closely integrated into the mainframe architecture, with interprocessor co-ordination signals, shared memory, and similar features. Such a physical modularity might work very well, but there is little documented experience with this closely coupled architecture for protocol support.

6.  Efficiency of Protocol Processing

To this point, this document has considered how a protocol package should be broken into modules, and how those modules should be distributed between free standing machines, the operating system kernel, and one or more user processes. It is now time to consider the other half of the efficiency question, which is what can be done to speed the execution of those programs that actually implement the protocols. We will make some specific observations about TCP and IP, and then conclude with a few generalities.

IP is a simple protocol, especially with respect to the processing of normal packets, so it should be easy to get it to perform efficiently. The only area of any complexity related to actual packet processing has to do with fragmentation and reassembly. The reader is referred to RFC 815, titled "IP Datagram Reassembly Algorithms", for specific consideration of this point.

Most costs in the IP layer come from table look up functions, as opposed to packet processing functions. An outgoing packet requires two translation functions to be performed. The internet address must be translated to a target gateway, and a gateway address must be translated to a local network number (if the host is attached to more than one network). It is easy to build a simple implementation of these table look up functions that in fact performs very poorly. The programmer should keep in mind that there may be as many as a thousand network numbers in a typical configuration. Linear searching of a thousand entry table on every packet is extremely unsuitable. In fact, it may be worth asking TCP to cache a hint for each connection, which can be handed down to IP each time a packet is sent, to try to avoid the overhead of a table look up.
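A minimal sketch in C of such a per-connection hint (the names and structure are assumptions for illustration): TCP remembers the result of the previous lookup and hands it back to IP, which checks it before falling back to the full search.

    #include <stdint.h>

    struct route { uint32_t gateway; uint32_t local_net; /* ... */ };

    /* The expensive path: search of a table that may hold a thousand networks. */
    struct route *route_table_lookup(uint32_t dest_addr);

    /* Hint cached in the connection state and handed down on every send. */
    struct route_hint { struct route *route; uint32_t for_addr; };

    struct route *route_for(struct route_hint *hint, uint32_t dest_addr) {
        /* Fast path: a connection talks to a single address, so the
         * previous answer is almost always still the right one. */
        if (hint->route != 0 && hint->for_addr == dest_addr)
            return hint->route;
        /* Slow path: do the real lookup once and remember the answer. */
        hint->route = route_table_lookup(dest_addr);
        hint->for_addr = dest_addr;
        return hint->route;
    }

The hint would of course have to be invalidated when the routing table changes; that bookkeeping is omitted here.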

TCP is a more complex protocol, and presents many more opportunities for getting things wrong. There is one area which is generally accepted as causing noticeable and substantial overhead as part of TCP processing. This is computation of the checksum. It would be nice if this cost could be avoided somehow, but the idea of an end-to-end checksum is absolutely central to the functioning of TCP. No host implementor should think of omitting the validation of a checksum on incoming data.

Various clever tricks have been used to try to minimize the cost of computing the checksum. If it is possible to add additional microcoded instructions to the machine, a checksum instruction is the most obvious candidate. Since computing the checksum involves picking up every byte of the segment and examining it, it is possible to combine the operation of computing the checksum with the operation of copying the segment from one location to another. Since a number of data copies are probably already required as part of the processing structure, this kind of sharing might conceivably pay off if it didn't cause too much trouble to the modularity of the program. Finally, computation of the checksum seems to be one place where careful attention to the details of the algorithm used can make a drastic difference in the throughput of the program. The Multics system provides one of the best case studies of this, since Multics is about as poorly organized to perform this function as any machine implementing TCP. Multics is a 36-bit word machine, with four 9-bit bytes per word. The eight-bit bytes of a TCP segment are laid down packed in memory, ignoring word boundaries. This means that when it is necessary to pick up the data as a set of 16-bit units for the purpose of adding them to compute checksums, horrible masking and shifting is required for each 16-bit value. An early version of a program using this strategy required 6 milliseconds to checksum a 576-byte segment. Obviously, at this point, checksum computation was becoming the central bottleneck to throughput. A more careful recoding of this algorithm reduced the checksum processing time to less than one millisecond. The strategy used was extremely dirty. It involved adding up carefully selected words of the area in which the data lay, knowing that for those particular words, the 16-bit values were properly aligned inside the words. Only after the addition had been done were the various sums shifted, and finally added to produce the eventual checksum. This kind of highly specialized programming is probably not acceptable if used everywhere within an operating system. It is clearly appropriate for one highly localized function which can be clearly identified as an extreme performance bottleneck.
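For reference, the straightforward form of the copy-and-checksum combination on a conventional byte-addressed machine is easy to write down: the TCP checksum is the 16-bit one's complement of the one's complement sum of the 16-bit units of the data, and folding it into the copy touches each byte only once. A sketch in C:

    #include <stdint.h>
    #include <stddef.h>

    /* Copy `len` bytes from src to dst, accumulating the 16-bit one's
     * complement sum in the same pass so the data is touched only once. */
    uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t len) {
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i + 1 < len; i += 2) {
            dst[i] = src[i];
            dst[i + 1] = src[i + 1];
            sum += (uint32_t)(src[i] << 8 | src[i + 1]);  /* 16-bit big-endian units */
        }
        if (i < len) {                      /* odd trailing byte, padded with zero */
            dst[i] = src[i];
            sum += (uint32_t)src[i] << 8;
        }
        while (sum >> 16)                   /* fold carries back in (end-around carry) */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;              /* one's complement of the sum */
    }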

Another area of TCP processing which may cause performance problems is the overhead of examining all of the possible flags and options which occur in each incoming packet. One paper, by Bunch and Day [2], asserts that the overhead of packet header processing is actually an important limiting factor in throughput computation. Not all measurement experiments have tended to support this result. To whatever extent it is true, however, there is an obvious strategy which the implementor ought to use in designing his program. He should build his program to optimize the expected case. It is easy, especially when first designing a program, to pay equal attention to all of the possible outcomes of every test. In practice, however, few of these will ever happen. A TCP should be built on the assumption that the next packet to arrive will have absolutely nothing special about it, and will be the next one expected in the sequence space. One or two tests are sufficient to determine that the expected set of control flags are on. (The ACK flag should be on; the Push flag may or may not be on. No other flags should be on.) One test is sufficient to determine that the sequence number of the incoming packet is one greater than the last sequence number received. In almost every case, that will be the actual result. Again, using the Multics system as an example, failure to optimize the case of receiving the expected sequence number had a detectable effect on the performance of the system. The particular problem arose when a number of packets arrived at once. TCP attempted to process all of these packets before awaking the user. As a result, by the time the last packet arrived, there was a threaded list of packets which had several items on it. When a new packet arrived, the list was searched to find the location into which the packet should be inserted. Obviously, the list should be searched from highest sequence number to lowest sequence number, because one is expecting to receive a packet which comes after those already received. By mistake, the list was searched from front to back, starting with the packets with the lowest sequence number. The amount of time spent searching this list backwards was easily detectable in the metering measurements.
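The expected-case tests described above compile down to very little. A sketch in C, using the standard TCP flag bits:

    #include <stdint.h>
    #include <stdbool.h>

    /* TCP control flag bits, per the TCP header format. */
    #define TCP_FIN 0x01
    #define TCP_SYN 0x02
    #define TCP_RST 0x04
    #define TCP_PSH 0x08
    #define TCP_ACK 0x10
    #define TCP_URG 0x20

    /* Expected case: ACK set, PSH optional, nothing else; and the segment
     * is the next one expected in the sequence space.  Two cheap tests
     * route the common packet onto the fast path. */
    bool fast_path(uint8_t flags, uint32_t seq, uint32_t next_expected) {
        if ((flags & ~TCP_PSH) != TCP_ACK)  /* one test covers all six flags */
            return false;
        return seq == next_expected;        /* one test for the sequence space */
    }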

Other data structures can be organized to optimize the action which is normally taken on them. For example, the retransmission queue is very seldom actually used for retransmission, so it should not be organized to optimize that action. In fact, it should be organized to optimize the discarding of things from it when the acknowledgement arrives. In many cases, the easiest way to do this is not to save the packet at all, but to reconstruct it only if it needs to be retransmitted, starting from the data as it was originally buffered by the user.

There is another generality, at least as important as optimizing the common case, which is to avoid copying data any more times than necessary. One more result from the Multics TCP may prove enlightening here. Multics takes between two and three milliseconds within the TCP layer to process an incoming packet, depending on its size. For a 576-byte packet, the three milliseconds is used up approximately as follows. One millisecond is used computing the checksum. Six hundred microseconds is spent copying the data. (The data is copied twice, at 0.3 milliseconds a copy.) One of those copy operations could correctly be included as part of the checksum cost, since it is done to get the data on a known word boundary to optimize the checksum algorithm. However, the copy also performs another necessary transfer at the same time. Header processing and packet resequencing takes 0.7 milliseconds. The rest of the time is used in miscellaneous processing, such as removing packets from the retransmission queue which are acknowledged by this packet. Data copying is the second most expensive single operation after data checksumming. Some implementations, often because of an excessively layered modularity, end up copying the data around a great deal. Other implementations end up copying the data because there is no shared memory between processes, and the data must be moved from process to process via a kernel operation. Unless the amount of this activity is kept strictly under control, it will quickly become the major performance bottleneck.

7.  Conclusions

This document has addressed two aspects of obtaining performance from a protocol implementation, the way in which the protocol is layered and integrated into the operating system, and the way in which the detailed handling of the packet is optimized. It would be nice if one or the other of these costs would completely dominate, so that all of one's attention could be concentrated there. Regrettably, this is not so. Depending on the particular sort of traffic one is getting, for example, whether Telnet one-byte packets or file transfer maximum size packets at maximum speed, one can expect to see one or the other cost being the major bottleneck to throughput. Most implementors who have studied their programs in an attempt to find out where the time was going have reached the unsatisfactory conclusion that it is going equally to all parts of their program. With the possible exception of checksum processing, very few people have ever found that their performance problems were due to a single, horrible bottleneck which they could fix by a single stroke of inventive programming. Rather, the performance was something which was improved by painstaking tuning of the entire program.

Most discussions of protocols begin by introducing the concept of layering, which tends to suggest that layering is a fundamentally wonderful idea which should be a part of every consideration of protocols. In fact, layering is a mixed blessing. Clearly, a layer interface is necessary whenever more than one client of a particular layer is to be allowed to use that same layer. But an interface, precisely because it is fixed, inevitably leads to a lack of complete understanding as to what one layer wishes to obtain from another. This has to lead to inefficiency. Furthermore, layering is a potential snare in that one is tempted to think that a layer boundary, which was an artifact of the specification procedure, is in fact the proper boundary to use in modularizing the implementation. Again, in certain cases, an architected layer must correspond to an implemented layer, precisely so that several clients can have access to that layer in a reasonably straightforward manner. In other cases, cunning rearrangement of the implemented module boundaries to match with various functions, such as the demultiplexing of incoming packets, or the sending of asynchronous outgoing packets, can lead to unexpected performance improvements compared to more traditional implementation strategies. Finally, good performance is something which is difficult to retrofit onto an existing program. Since performance is influenced, not just by the fine detail, but by the gross structure, it is sometimes the case that in order to obtain a substantial performance improvement, it is necessary to completely redo the program from the bottom up. This is a great disappointment to programmers, especially those doing a protocol implementation for the first time. Programmers who are somewhat inexperienced and unfamiliar with protocols are sufficiently concerned with getting their program logically correct that they do not have the capacity to think at the same time about the performance of the structure they are building. Only after they have achieved a logically correct program do they discover that they have done so in a way which has precluded real performance. Clearly, it is more difficult to design a program thinking from the start about both logical correctness and performance. With time, as implementors as a group learn more about the appropriate structures to use for building protocols, it will be possible to proceed with an implementation project having more confidence that the structure is rational, that the program will work, and that the program will work well. Those of us now implementing protocols have the privilege of being on the forefront of this learning process. It should be no surprise that our programs sometimes suffer from the uncertainty we bring to bear on them.

Citations

[1]  Cohen and Postel, "On Protocol Multiplexing", Sixth Data Communications Symposium, ACM/IEEE, November 1979.

[2]  Bunch and Day, "Control Structure Overhead in TCP", Trends and Applications: Computer Networking, NBS Symposium, May 1980.
