RFC:  817

MODULARITY AND EFFICIENCY IN PROTOCOL IMPLEMENTATION

David D. Clark
MIT Laboratory for Computer Science
Computer Systems and Communications Group

July, 1982

1.  Introduction

Many protocol implementers have made the unpleasant discovery that their packages do not run quite as fast as they had hoped. The blame for this widely observed problem has been attributed to a variety of causes, ranging from details in the design of the protocol to the underlying structure of the host operating system. This RFC will discuss some of the commonly encountered reasons why protocol implementations seem to run slowly.

Experience suggests that one of the most important factors in determining the performance of an implementation is the manner in which that implementation is modularized and integrated into the host operating system. For this reason, it is useful to discuss the question of how an implementation is structured at the same time that we consider how it will perform. In fact, this RFC will argue that modularity is one of the chief villains in attempting to obtain good performance, so that the designer is faced with a delicate and inevitable tradeoff between good structure and good performance. Further, the single factor which most strongly determines how well this conflict can be resolved is not the protocol but the operating system.

2.  Efficiency Considerations

There are many aspects to efficiency. One aspect is sending data at minimum transmission cost, which is a critical aspect of common carrier communications, if not in local area network communications. Another aspect is sending data at a high rate, which may not be possible at all if the net is very slow, but which may be the one central design constraint when taking advantage of a local net with high raw bandwidth. The final consideration is doing the above with minimum expenditure of computer resources. This last may be necessary to achieve high speed, but in the case of the slow net may be important only in that the resources used up, for example cpu cycles, are costly or otherwise needed. It is worth pointing out that these different goals often conflict; for example it is often possible to trade off efficient use of the computer against efficient use of the network. Thus, there may be no such thing as a successful general purpose protocol implementation.

The simplest measure of performance is throughput, measured in bits per second. It is worth doing a few simple computations in order to get a feeling for the magnitude of the problems involved. Assume that data is being sent from one machine to another in packets of 576 bytes, the maximum generally acceptable internet packet size. Allowing for header overhead, this packet size permits 4288 bits in each packet. If a useful throughput of 10,000 bits per second is desired, then a data bearing packet must leave the sending host about every 430 milliseconds, a little over two per second. This is clearly not difficult to achieve. However, if one wishes to achieve 100 kilobits per second throughput, the packet must leave the host every 43 milliseconds, and to achieve one megabit per second, which is not at all unreasonable on a high-speed local net, the packets must be spaced no more than 4.3 milliseconds.
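The arithmetic behind these figures is simple to reproduce. A minimal sketch in C, assuming 40 bytes of combined IP and TCP header per packet (the text above gives only the resulting 4288 data bits):

    #include <stdio.h>

    int main(void) {
        const int packet_bytes = 576;  /* maximum generally acceptable internet packet size */
        const int header_bytes = 40;   /* assumed: 20 bytes IP + 20 bytes TCP header */
        const int data_bits = (packet_bytes - header_bytes) * 8;   /* 4288 bits */
        const double rates[] = { 10e3, 100e3, 1e6 };  /* target throughputs, bits/sec */

        for (int i = 0; i < 3; i++) {
            /* interval between data-bearing packets needed to sustain the rate */
            double interval_ms = 1000.0 * data_bits / rates[i];
            printf("%7.0f bps -> one packet every %5.1f ms\n", rates[i], interval_ms);
        }
        return 0;
    }

Run, this prints intervals of about 429, 43, and 4.3 milliseconds, in line with the figures above.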

These latter numbers are a slightly more alarming goal for which to set one's sights. Many operating systems take a substantial fraction of a millisecond just to service an interrupt. If the protocol has been structured as a process, it is necessary to go through a process scheduling before the protocol code can even begin to run. If any piece of a protocol package or its data must be fetched from disk, real time delays of between 30 to 100 milliseconds can be expected. If the protocol must compete for cpu resources with other processes of the system, it may be necessary to wait a scheduling quantum before the protocol can run. Many systems have a scheduling quantum of 100 milliseconds or more. Considering these sorts of numbers, it becomes immediately clear that the protocol must be fitted into the operating system in a thorough and effective manner if anything like reasonable throughput is to be achieved.

There is one obvious conclusion immediately suggested by even this simple analysis. Except in very special circumstances, when many packets are being processed at once, the cost of processing a packet is dominated by factors, such as cpu scheduling, which are independent of the packet size. This suggests two general rules which any implementation ought to obey. First, send data in large packets. Obviously, if processing time per packet is a constant, then throughput will be directly proportional to the packet size. Second, never send an unneeded packet. Unneeded packets use up just as many resources as a packet full of data, but perform no useful function. RFC 813, "Window and Acknowledgement Strategy in TCP", discusses one aspect of reducing the number of packets sent per useful data byte. This document will mention other attacks on the same problem.

The above analysis suggests that there are two main parts to the problem of achieving good protocol performance. The first has to do with how the protocol implementation is integrated into the host operating system. The second has to do with how the protocol package itself is organized internally. This document will consider each of these topics in turn.

3.  The Protocol vs. the Operating System

There are normally three reasonable ways in which to add a protocol to an operating system. The protocol can be in a process that is provided by the operating system, or it can be part of the kernel of the operating system itself, or it can be put in a separate communications processor or front end machine. This decision is strongly influenced by details of hardware architecture and operating system design; each of these three approaches has its own advantages and disadvantages.

The "process" is the abstraction which most operating systems use to provide the execution environment for user programs. A very simple path for implementing a protocol is to obtain a process from the operating system and implement the protocol to run in it. Superficially, this approach has a number of advantages. Since modifications to the kernel are not required, the job can be done by someone who is not an expert in the kernel structure. Since it is often impossible to find somebody who is experienced both in the structure of the operating system and the structure of the protocol, this path, from a management point of view, is often extremely appealing. Unfortunately, putting a protocol in a process has a number of disadvantages, related to both structure and performance. First, as was discussed above, process scheduling can be a significant source of real-time delay. There is not only the actual cost of going through the scheduler, but the problem that the operating system may not have the right sort of priority tools to bring the process into execution quickly whenever there is work to be done.

Structurally, the difficulty with putting a protocol in a process is that the protocol may be providing services, for example support of data streams, which are normally obtained by going to special kernel entry points. Depending on the generality of the operating system, it may be impossible to take a program which is accustomed to reading through a kernel entry point, and redirect it so it is reading the data from a process. The most extreme example of this problem occurs when implementing server telnet. In almost all systems, the device handler for the locally attached teletypes is located inside the kernel, and programs read and write from their teletype by making kernel calls. If server telnet is implemented in a process, it is then necessary to take the data streams provided by server telnet and somehow get them back down inside the kernel so that they mimic the interface provided by local teletypes. It is usually the case that special kernel modification is necessary to achieve this structure, which somewhat defeats the benefit of having removed the protocol from the kernel in the first place.

Clearly, then, there are advantages to putting the protocol package in the kernel. Structurally, it is reasonable to view the network as a device, and device drivers are traditionally contained in the kernel. Presumably, the problems associated with process scheduling can be sidestepped, at least to a certain extent, by placing the code inside the kernel. And it is obviously easier to make the server telnet channels mimic the local teletype channels if they are both realized in the same level in the kernel.

However, implementation of protocols in the kernel has its own set of pitfalls. First, network protocols have a characteristic which is shared by almost no other device: they require rather complex actions to be performed as a result of a timeout. The problem with this requirement is that the kernel often has no facility by which a program can be brought into execution as a result of the timer event. What is really needed, of course, is a special sort of process inside the kernel. Most systems lack this mechanism. Failing that, the only execution mechanism available is to run at interrupt time.
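A minimal sketch in C of the division of labor this forces (the names are illustrative assumptions, not from this document): the interrupt handler can safely do little more than note that the timer fired, while the complex retransmission action wants the full environment of a process.

    struct tcb;                              /* per-connection state; details omitted */

    static volatile int retransmit_pending;  /* set at interrupt time, consumed later */

    /* Runs at clock-interrupt time.  Interrupts may be masked and the
     * scheduler must not be invoked, so only record that work is pending. */
    void clock_interrupt_handler(void) {
        retransmit_pending = 1;
    }

    /* The "special sort of process inside the kernel" the text asks for:
     * a context in which blocking, buffer allocation, and sending are safe. */
    void protocol_timer_process(struct tcb *connection) {
        if (retransmit_pending) {
            retransmit_pending = 0;
            /* ... rebuild the unacknowledged segment for `connection`
             *     and retransmit it ... */
        }
    }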

There are substantial drawbacks to implementing a protocol to run at interrupt time. First, the actions performed may be somewhat complex and time consuming, compared to the maximum amount of time that the operating system is prepared to spend servicing an interrupt. Problems can arise if interrupts are masked for too long. This is particularly bad when running as a result of a clock interrupt, which can imply that the clock interrupt is masked. Second, the environment provided by an interrupt handler is usually extremely primitive compared to the environment of a process. There are usually a variety of system facilities which are unavailable while running in an interrupt handler. The most important of these is the ability to suspend execution pending the arrival of some event or message. It is a cardinal rule of almost every known operating system that one must not invoke the scheduler while running in an interrupt handler. Thus, the programmer who is forced to implement all or part of his protocol package as an interrupt handler must be the best sort of expert in the operating system involved, and must be prepared for development sessions filled with obscure bugs which crash not just the protocol package but the entire operating system.

A final problem with processing at interrupt time is that the system scheduler has no control over the percentage of system time used by the protocol handler. If a large number of packets arrive, from a foreign host that is either malfunctioning or fast, all of the time may be spent in the interrupt handler, effectively killing the system.

There are other problems associated with putting protocols into an operating system kernel. The simplest problem often encountered is that the kernel address space is simply too small to hold the piece of code in question. This is a rather artificial sort of problem, but it is a severe problem none the less in many machines. It is an appallingly unpleasant experience to do an implementation with the knowledge that for every byte of new feature put in one must find some other byte of old feature to throw out. It is hopeless to expect an effective and general implementation under this kind of constraint. Another problem is that the protocol package, once it is thoroughly entwined in the operating system, may need to be redone every time the operating system changes. If the protocol and the operating system are not maintained by the same group, this makes maintenance of the protocol package a perpetual headache.

The third option for protocol implementation is to take the protocol package and move it outside the machine entirely, on to a separate processor dedicated to this kind of task. Such a machine is often described as a communications processor or a front-end processor. There are several advantages to this approach. First, the operating system on the communications processor can be tailored for precisely this kind of task. This makes the job of implementation much easier. Second, one does not need to redo the task for every machine to which the protocol is to be added. It may be possible to reuse the same front-end machine on different host computers. Since the task need not be done as many times, one might hope that more attention could be paid to doing it right. Given a careful implementation in an environment which is optimized for this kind of task, the resulting package should turn out to be very efficient. Unfortunately, there are also problems with this approach. There is, of course, a financial problem associated with buying an additional computer. In many cases, this is not a problem at all since the cost is negligible compared to what the programmer would cost to do the job in the mainframe itself. More fundamentally, the communications processor approach does not completely sidestep any of the problems raised above. The reason is that the communications processor, since it is a separate machine, must be attached to the mainframe by some mechanism. Whatever that mechanism, code is required in the mainframe to deal with it. It can be argued that the program to deal with the communications processor is simpler than the program to implement the entire protocol package. Even if that is so, the communications processor interface package is still a protocol in nature, with all of the same structural problems. Thus, all of the issues raised above must still be faced. In addition to those problems, there are some other, more subtle problems associated with an outboard implementation of a protocol. We will return to these problems later.

There is a way of attaching a communications processor to a mainframe host which sidesteps all of the mainframe implementation problems, which is to use some preexisting interface on the host machine as the port by which a communications processor is attached. This strategy is often used as a last stage of desperation when the software on the host computer is so intractable that it cannot be changed in any way. Unfortunately, it is almost inevitably the case that all of the available interfaces are totally unsuitable for this purpose, so the result is unsatisfactory at best. The most common way in which this form of attachment occurs is when a network connection is being used to mimic local teletypes. In this case, the front-end processor can be attached to the mainframe by simply providing a number of wires out of the front-end processor, each corresponding to a connection, which are plugged into teletype ports on the mainframe computer. (Because of the appearance of the physical configuration which results from this arrangement, Michael Padlipsky has described this as the "milking machine" approach to computer networking.) This strategy solves the immediate problem of providing remote access to a host, but it is extremely inflexible. The channels being provided to the host are restricted by the host software to one purpose only, remote login. It is impossible to use them for any other purpose, such as file transfer or sending mail, so the host is integrated into the network environment in an extremely limited and inflexible manner. If this is the best that can be done, then it should be tolerated. Otherwise, implementors should be strongly encouraged to take a more flexible approach.

4.  Protocol Layering

The previous discussion suggested that there was a decision to be made as to where a protocol ought to be implemented. In fact, the decision is much more complicated than that, for the goal is not to implement a single protocol, but to implement a whole family of protocol layers, starting with a device driver or local network driver at the bottom, then IP and TCP, and eventually reaching the application specific protocol, such as Telnet, FTP and SMTP on the top. Clearly, the bottommost of these layers is somewhere within the kernel, since the physical device driver for the net is almost inevitably located there. Equally clearly, the top layers of this package, which provide the user his ability to perform the remote login function or to send mail, are not entirely contained within the kernel. Thus, the question is not whether the protocol family shall be inside or outside the kernel, but how it shall be sliced in two between that part inside and that part outside.

Since protocols come nicely layered, an obvious proposal is that one of the layer interfaces should be the point at which the inside and outside components are sliced apart. Most systems have been implemented in this way, and many have been made to work quite effectively. One obvious place to slice is at the upper interface of TCP. Since TCP provides a bidirectional byte stream, which is somewhat similar to the I/O facility provided by most operating systems, it is possible to make the interface to TCP almost mimic the interface to other existing devices. Except in the matter of opening a connection, and dealing with peculiar failures, the software using TCP need not know that it is a network connection, rather than a local I/O stream, that is providing the communications function. This approach does put TCP inside the kernel, which raises all the problems addressed above. It also raises the problem that the interface to the IP layer can, if the programmer is not careful, become excessively buried inside the kernel. It must be remembered that things other than TCP are expected to run on top of IP. The IP interface must be made accessible, even if TCP sits on top of it inside the kernel.

Another obvious place to slice is above Telnet. The advantage of slicing above Telnet is that it solves the problem of having remote login channels emulate local teletype channels. The disadvantage of putting Telnet into the kernel is that the amount of code which has now been included there is getting remarkably large. In some early implementations, the size of the network package, when one includes protocols at the level of Telnet, rivals the size of the rest of the supervisor. This leads to vague feelings that all is not right.

Any attempt to slice through a lower layer boundary, for example between internet and TCP, reveals one fundamental problem. The TCP layer, as well as the IP layer, performs a demultiplexing function on incoming datagrams. Until the TCP header has been examined, it is not possible to know for which user the packet is ultimately destined. Therefore, if TCP, as a whole, is moved outside the kernel, it is necessary to create one separate process called the TCP process, which performs the TCP multiplexing function, and probably all of the rest of TCP processing as well. This means that incoming data destined for a user process involves not just a scheduling of the user process, but scheduling the TCP process first.

This suggests an alternative structuring strategy which slices through the protocols, not along an established layer boundary, but along a functional boundary having to do with demultiplexing. In this approach, certain parts of IP and certain parts of TCP are placed in the kernel. The amount of code placed there is sufficient so that when an incoming datagram arrives, it is possible to know for which process that datagram is ultimately destined. The datagram is then routed directly to the final process, where additional IP and TCP processing is performed on it. This removes from the kernel any requirement for timer based actions, since they can be done by the process provided by the user. This structure has the additional advantage of reducing the amount of code required in the kernel, so that it is suitable for systems where kernel space is at a premium. RFC 814, titled "Names, Addresses, Ports, and Routes", discusses this rather orthogonal slicing strategy in more detail.

A related discussion of protocol layering and multiplexing can be found in Cohen and Postel [1].
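A minimal sketch in C of the kernel-resident demultiplexing step described above: only enough of the headers is examined to identify the destination process, and everything else is deferred. The four-tuple lookup is an assumed connection table, not something specified here, and byte order and option processing are glossed over.

    #include <stdint.h>
    #include <stddef.h>

    struct process;                      /* opaque handle on the destination process */

    /* Assumed connection table, keyed on the address and port four-tuple. */
    struct process *find_connection(uint32_t src_addr, uint32_t dst_addr,
                                    uint16_t src_port, uint16_t dst_port);

    /* The kernel-resident slice: decide for which process a datagram is
     * destined, so it can be routed there for the remaining IP and TCP
     * processing.  Fields are read at their offsets in the IP and TCP headers. */
    struct process *demultiplex(const uint8_t *pkt) {
        size_t ihl = (pkt[0] & 0x0f) * 4;          /* IP header length in bytes */
        if (pkt[9] != 6)                           /* protocol field: 6 is TCP */
            return NULL;
        uint32_t src_addr = (uint32_t)pkt[12] << 24 | pkt[13] << 16 | pkt[14] << 8 | pkt[15];
        uint32_t dst_addr = (uint32_t)pkt[16] << 24 | pkt[17] << 16 | pkt[18] << 8 | pkt[19];
        uint16_t src_port = (uint16_t)(pkt[ihl] << 8 | pkt[ihl + 1]);
        uint16_t dst_port = (uint16_t)(pkt[ihl + 2] << 8 | pkt[ihl + 3]);
        return find_connection(src_addr, dst_addr, src_port, dst_port);
    }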

5.  Breaking Down the Barriers

In fact, the implementor should be sensitive to the possibility of even more peculiar slicing strategies in dividing up the various protocol layers between the kernel and the one or more user processes. The result of the strategy proposed above was that part of TCP should execute in the process of the user. In other words, instead of having one TCP process for the system, there is one TCP process per connection. Given this architecture, it is no longer necessary to imagine that all of the TCPs are identical. One TCP could be optimized for high throughput applications, such as file transfer. Another TCP could be optimized for small low delay applications such as Telnet. In fact, it would be possible to produce a TCP which was somewhat integrated with the Telnet or FTP on top of it. Such an integration is extremely important, for it can lead to a kind of efficiency which more traditional structures are incapable of producing. Earlier, this paper pointed out that one of the important rules to achieving efficiency was to send the minimum number of packets for a given amount of data. The idea of protocol layering interacts very strongly (and poorly) with this goal, because independent layers have independent ideas about when packets should be sent, and unless these layers can somehow be brought into cooperation, additional packets will flow. The best example of this is the operation of server telnet in a character at a time remote echo mode on top of TCP. When a packet containing a character arrives at a server host, each layer has a different response to that packet. TCP has an obligation to acknowledge the packet. Either server telnet or the application layer above has an obligation to echo the character received in the packet. If the character is a Telnet control sequence, then Telnet has additional actions which it must perform in response to the packet. The result of this, in most implementations, is that several packets are sent back in response to the one arriving packet. Combining all of these return messages into one packet is important for several reasons. First, of course, it reduces the number of packets being sent over the net, which directly reduces the charges incurred for many common carrier tariff structures. Second, it reduces the number of scheduling actions which will occur inside both hosts, which, as was discussed above, is extremely important in improving throughput.

The way to achieve this goal of packet sharing is to break down the barrier between the layers of the protocols, in a very restrained and careful manner, so that a limited amount of information can leak across the barrier to enable one layer to optimize its behavior with respect to the desires of the layers above and below it. For example, it would represent an improvement if TCP, when it received a packet, could ask the layer above whether or not it would be worth pausing for a few milliseconds before sending an acknowledgement in order to see if the upper layer would have any outgoing data to send. Dallying before sending the acknowledgement produces precisely the right sort of optimization if the client of TCP is server Telnet. However, dallying before sending an acknowledgement is absolutely unacceptable if TCP is being used for file transfer, for in file transfer there is almost never data flowing in the reverse direction, and the delay in sending the acknowledgement probably translates directly into a delay in obtaining the next packets. Thus, TCP must know a little about the layers above it to adjust its performance as needed.
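As an illustrative sketch in C (an assumption about how such a hint might look, not a mechanism from this document), the leaked information might be no more than a per-connection predicate consulted at acknowledgement time:

    #include <stdbool.h>

    struct tcp_conn;                        /* connection state; details omitted */

    /* Assumed primitives of the surrounding TCP. */
    void send_ack_now(struct tcp_conn *c);
    void start_ack_timer(struct tcp_conn *c, int ms);

    /* Hint leaked from the layer above: is a reply (an echo, a response)
     * likely to follow shortly, so the ACK could ride along with it? */
    bool client_may_reply(struct tcp_conn *c);

    void acknowledge(struct tcp_conn *c) {
        if (client_may_reply(c))
            start_ack_timer(c, 5);   /* dally a few ms: right for server Telnet */
        else
            send_ack_now(c);         /* never dally: right for file transfer */
    }

The five millisecond figure is arbitrary here; the point is only that the decision is driven by knowledge of the client above.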

It would be possible to imagine a general purpose TCP which was equipped with all sorts of special mechanisms by which it would query the layer above and modify its behavior accordingly. In the structures suggested above, in which there is not one but several TCPs, the TCP can simply be modified so that it produces the correct behavior as a matter of course. This structure has the disadvantage that there will be several implementations of TCP existing on a single machine, which can mean more maintenance headaches if a problem is found where TCP needs to be changed. However, it is probably the case that each of the TCPs will be substantially simpler than the general purpose TCP which would otherwise have been built. There are some experimental projects currently under way which suggest that this approach may make designing of a TCP, or almost any other layer, substantially easier, so that the total effort involved in bringing up a complete package is actually less if this approach is followed. This approach is by no means generally accepted, but deserves some consideration.

The general conclusion to be drawn from this sort of consideration is that a layer boundary has both a benefit and a penalty. A visible layer boundary, with a well specified interface, provides a form of isolation between two layers which allows one to be changed with the confidence that the other one will not stop working as a result. However, a firm layer boundary almost inevitably leads to inefficient operation. This can easily be seen by analogy with other aspects of operating systems. Consider, for example, file systems. A typical operating system provides a file system, which is a highly abstracted representation of a disk. The interface is highly formalized, and presumed to be highly stable. This makes it very easy for naive users to have access to disks without having to write a great deal of software. The existence of a file system is clearly beneficial. On the other hand, it is clear that the restricted interface to a file system almost inevitably leads to inefficiency. If the interface is organized as a sequential read and write of bytes, then there will be people who wish to do high throughput transfers who cannot achieve their goal. If the interface is a virtual memory interface, then other users will regret the necessity of building a byte stream interface on top of the memory mapped file. The most objectionable inefficiency results when a highly sophisticated package, such as a data base management package, must be built on top of an existing operating system. Almost inevitably, the implementors of the database system attempt to reject the file system and obtain direct access to the disks. They have sacrificed modularity for efficiency.

The same conflict appears in networking, in a rather extreme form. The concept of a protocol is still unknown and frightening to most naive programmers. The idea that they might have to implement a protocol, or even part of a protocol, as part of some application package, is a dreadful thought. And thus there is great pressure to hide the function of the net behind a very hard barrier. On the other hand, the kind of inefficiency which results from this is a particularly undesirable sort of inefficiency, for it shows up, among other things, in increasing the cost of the communications resource used up to achieve the application goal. In cases where one must pay for one's communications costs, they usually turn out to be the dominant cost within the system. Thus, doing an excessively good job of packaging up the protocols in an inflexible manner has a direct impact on increasing the cost of the critical resource within the system. This is a dilemma which will probably only be solved when programmers become somewhat less alarmed about protocols, so that they are willing to weave a certain amount of protocol structure into their application program, much as application programs today weave parts of database management systems into the structure of their application program.

An extreme example of putting the protocol package behind a firm layer boundary occurs when the protocol package is relegated to a front-end processor. In this case the interface to the protocol is some other protocol. It is difficult to imagine how to build close cooperation between layers when they are that far separated. Realistically, one of the prices which must be associated with an implementation so physically modularized is that the performance will suffer as a result. Of course, a separate processor for protocols could be very closely integrated into the mainframe architecture, with interprocessor co-ordination signals, shared memory, and similar features. Such a physical modularity might work very well, but there is little documented experience with this closely coupled architecture for protocol support.

6.  Efficiency of Protocol Processing

To this point, this document has considered how a protocol package should be broken into modules, and how those modules should be distributed between free standing machines, the operating system kernel, and one or more user processes. It is now time to consider the other half of the efficiency question, which is what can be done to speed the execution of those programs that actually implement the protocols. We will make some specific observations about TCP and IP, and then conclude with a few generalities.

IP is a simple protocol, especially with respect to the processing of normal packets, so it should be easy to get it to perform efficiently. The only area of any complexity related to actual packet processing has to do with fragmentation and reassembly. The reader is referred to RFC 815, titled "IP Datagram Reassembly Algorithms", for specific consideration of this point.

Most costs in the IP layer come from table look up functions, as opposed to packet processing functions. An outgoing packet requires two translation functions to be performed. The internet address must be translated to a target gateway, and a gateway address must be translated to a local network number (if the host is attached to more than one network). It is easy to build a simple implementation of these table look up functions that in fact performs very poorly. The programmer should keep in mind that there may be as many as a thousand network numbers in a typical configuration. Linear searching of a thousand entry table on every packet is extremely unsuitable. In fact, it may be worth asking TCP to cache a hint for each connection, which can be handed down to IP each time a packet is sent, to try to avoid the overhead of a table look up.
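A minimal sketch in C of such a per-connection hint (the names and structure are assumptions for illustration): TCP remembers the result of the previous lookup and hands it back to IP, which checks it before falling back to the full search.

    #include <stdint.h>

    struct route { uint32_t gateway; uint32_t local_net; /* ... */ };

    /* The expensive path: search of a table that may hold a thousand networks. */
    struct route *route_table_lookup(uint32_t dest_addr);

    /* Hint cached in the connection state and handed down on every send. */
    struct route_hint { struct route *route; uint32_t for_addr; };

    struct route *route_for(struct route_hint *hint, uint32_t dest_addr) {
        /* Fast path: a connection talks to a single address, so the
         * previous answer is almost always still the right one. */
        if (hint->route != 0 && hint->for_addr == dest_addr)
            return hint->route;
        /* Slow path: do the real lookup once and remember the answer. */
        hint->route = route_table_lookup(dest_addr);
        hint->for_addr = dest_addr;
        return hint->route;
    }

The hint would of course have to be invalidated when the routing table changes; that bookkeeping is omitted here.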

TCP is a more complex protocol, and presents many more opportunities for getting things wrong. There is one area which is generally accepted as causing noticeable and substantial overhead as part of TCP processing. This is computation of the checksum. It would be nice if this cost could be avoided somehow, but the idea of an end-to-end checksum is absolutely central to the functioning of TCP. No host implementor should think of omitting the validation of a checksum on incoming data.

Various clever tricks have been used to try to minimize the cost of computing the checksum. If it is possible to add additional microcoded instructions to the machine, a checksum instruction is the most obvious candidate. Since computing the checksum involves picking up every byte of the segment and examining it, it is possible to combine the operation of computing the checksum with the operation of copying the segment from one location to another. Since a number of data copies are probably already required as part of the processing structure, this kind of sharing might conceivably pay off if it didn't cause too much trouble to the modularity of the program. Finally, computation of the checksum seems to be one place where careful attention to the details of the algorithm used can make a drastic difference in the throughput of the program. The Multics system provides one of the best case studies of this, since Multics is about as poorly organized to perform this function as any machine implementing TCP. Multics is a 36-bit word machine, with four 9-bit bytes per word. The eight-bit bytes of a TCP segment are laid down packed in memory, ignoring word boundaries. This means that when it is necessary to pick up the data as a set of 16-bit units for the purpose of adding them to compute checksums, horrible masking and shifting is required for each 16-bit value. An early version of a program using this strategy required 6 milliseconds to checksum a 576-byte segment. Obviously, at this point, checksum computation was becoming the central bottleneck to throughput. A more careful recoding of this algorithm reduced the checksum processing time to less than one millisecond. The strategy used was extremely dirty. It involved adding up carefully selected words of the area in which the data lay, knowing that for those particular words, the 16-bit values were properly aligned inside the words. Only after the addition had been done were the various sums shifted, and finally added to produce the eventual checksum. This kind of highly specialized programming is probably not acceptable if used everywhere within an operating system. It is clearly appropriate for one highly localized function which can be clearly identified as an extreme performance bottleneck.
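For reference, the straightforward form of the copy-and-checksum combination on a conventional byte-addressed machine is easy to write down: the TCP checksum is the 16-bit one's complement of the one's complement sum of the 16-bit units of the data, and folding it into the copy touches each byte only once. A sketch in C:

    #include <stdint.h>
    #include <stddef.h>

    /* Copy `len` bytes from src to dst, accumulating the 16-bit one's
     * complement sum in the same pass so the data is touched only once. */
    uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t len) {
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i + 1 < len; i += 2) {
            dst[i] = src[i];
            dst[i + 1] = src[i + 1];
            sum += (uint32_t)(src[i] << 8 | src[i + 1]);  /* 16-bit big-endian units */
        }
        if (i < len) {                      /* odd trailing byte, padded with zero */
            dst[i] = src[i];
            sum += (uint32_t)src[i] << 8;
        }
        while (sum >> 16)                   /* fold carries back in (end-around carry) */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;              /* one's complement of the sum */
    }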

Another area of TCP processing which may cause performance problems is the overhead of examining all of the possible flags and options which occur in each incoming packet. One paper, by Bunch and Day [2], asserts that the overhead of packet header processing is actually an important limiting factor in throughput computation. Not all measurement experiments have tended to support this result. To whatever extent it is true, however, there is an obvious strategy which the implementor ought to use in designing his program. He should build his program to optimize the expected case. It is easy, especially when first designing a program, to pay equal attention to all of the possible outcomes of every test. In practice, however, few of these will ever happen. A TCP should be built on the assumption that the next packet to arrive will have absolutely nothing special about it, and will be the next one expected in the sequence space. One or two tests are sufficient to determine that the expected set of control flags are on. (The ACK flag should be on; the Push flag may or may not be on. No other flags should be on.) One test is sufficient to determine that the sequence number of the incoming packet is one greater than the last sequence number received. In almost every case, that will be the actual result. Again, using the Multics system as an example, failure to optimize the case of receiving the expected sequence number had a detectable effect on the performance of the system. The particular problem arose when a number of packets arrived at once. TCP attempted to process all of these packets before awaking the user. As a result, by the time the last packet arrived, there was a threaded list of packets which had several items on it. When a new packet arrived, the list was searched to find the location into which the packet should be inserted. Obviously, the list should be searched from highest sequence number to lowest sequence number, because one is expecting to receive a packet which comes after those already received. By mistake, the list was searched from front to back, starting with the packets with the lowest sequence number. The amount of time spent searching this list backwards was easily detectable in the metering measurements.
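The expected-case tests described above compile down to very little. A sketch in C, using the standard TCP flag bits:

    #include <stdint.h>
    #include <stdbool.h>

    /* TCP control flag bits, per the TCP header format. */
    #define TCP_FIN 0x01
    #define TCP_SYN 0x02
    #define TCP_RST 0x04
    #define TCP_PSH 0x08
    #define TCP_ACK 0x10
    #define TCP_URG 0x20

    /* Expected case: ACK set, PSH optional, nothing else; and the segment
     * is the next one expected in the sequence space.  Two cheap tests
     * route the common packet onto the fast path. */
    bool fast_path(uint8_t flags, uint32_t seq, uint32_t next_expected) {
        if ((flags & ~TCP_PSH) != TCP_ACK)  /* one test covers all six flags */
            return false;
        return seq == next_expected;        /* one test for the sequence space */
    }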

Other data structures can be organized to optimize the action which is normally taken on them. For example, the retransmission queue is very seldom actually used for retransmission, so it should not be organized to optimize that action. In fact, it should be organized to optimize the discarding of things from it when the acknowledgement arrives. In many cases, the easiest way to do this is not to save the packet at all, but to reconstruct it only if it needs to be retransmitted, starting from the data as it was originally buffered by the user.

There is another generality, at least as important as optimizing the common case, which is to avoid copying data any more times than necessary. One more result from the Multics TCP may prove enlightening here. Multics takes between two and three milliseconds within the TCP layer to process an incoming packet, depending on its size. For a 576-byte packet, the three milliseconds is used up approximately as follows. One millisecond is used computing the checksum. Six hundred microseconds is spent copying the data. (The data is copied twice, at 0.3 milliseconds a copy.) One of those copy operations could correctly be included as part of the checksum cost, since it is done to get the data on a known word boundary to optimize the checksum algorithm. However, the copy also performs another necessary transfer at the same time. Header processing and packet resequencing takes 0.7 milliseconds. The rest of the time is used in miscellaneous processing, such as removing packets from the retransmission queue which are acknowledged by this packet. Data copying is the second most expensive single operation after data checksumming. Some implementations, often because of an excessively layered modularity, end up copying the data around a great deal. Other implementations end up copying the data because there is no shared memory between processes, and the data must be moved from process to process via a kernel operation. Unless the amount of this activity is kept strictly under control, it will quickly become the major performance bottleneck.

7.  Conclusions

This document has addressed two aspects of obtaining performance from a protocol implementation, the way in which the protocol is layered and integrated into the operating system, and the way in which the detailed handling of the packet is optimized. It would be nice if one or the other of these costs would completely dominate, so that all of one's attention could be concentrated there. Regrettably, this is not so. Depending on the particular sort of traffic one is getting, for example, whether Telnet one-byte packets or file transfer maximum size packets at maximum speed, one can expect to see one or the other cost being the major bottleneck to throughput. Most implementors who have studied their programs in an attempt to find out where the time was going have reached the unsatisfactory conclusion that it is going equally to all parts of their program. With the possible exception of checksum processing, very few people have ever found that their performance problems were due to a single, horrible bottleneck which they could fix by a single stroke of inventive programming. Rather, the performance was something which was improved by painstaking tuning of the entire program.

Most discussions of protocols begin by introducing the concept of layering, which tends to suggest that layering is a fundamentally wonderful idea which should be a part of every consideration of protocols. In fact, layering is a mixed blessing. Clearly, a layer interface is necessary whenever more than one client of a particular layer is to be allowed to use that same layer. But an interface, precisely because it is fixed, inevitably leads to a lack of complete understanding as to what one layer wishes to obtain from another. This has to lead to inefficiency. Furthermore, layering is a potential snare in that one is tempted to think that a layer boundary, which was an artifact of the specification procedure, is in fact the proper boundary to use in modularizing the implementation. Again, in certain cases, an architected layer must correspond to an implemented layer, precisely so that several clients can have access to that layer in a reasonably straightforward manner. In other cases, cunning rearrangement of the implemented module boundaries to match with various functions, such as the demultiplexing of incoming packets, or the sending of asynchronous outgoing packets, can lead to unexpected performance improvements compared to more traditional implementation strategies. Finally, good performance is something which is difficult to retrofit onto an existing program. Since performance is influenced, not just by the fine detail, but by the gross structure, it is sometimes the case that in order to obtain a substantial performance improvement, it is necessary to completely redo the program from the bottom up. This is a great disappointment to programmers, especially those doing a protocol implementation for the first time. Programmers who are somewhat inexperienced and unfamiliar with protocols are sufficiently concerned with getting their program logically correct that they do not have the capacity to think at the same time about the performance of the structure they are building. Only after they have achieved a logically correct program do they discover that they have done so in a way which has precluded real performance. Clearly, it is more difficult to design a program thinking from the start about both logical correctness and performance. With time, as implementors as a group learn more about the appropriate structures to use for building protocols, it will be possible to proceed with an implementation project having more confidence that the structure is rational, that the program will work, and that the program will work well. Those of us now implementing protocols have the privilege of being on the forefront of this learning process. It should be no surprise that our programs sometimes suffer from the uncertainty we bring to bear on them.

Citations

[1]  Cohen and Postel, "On Protocol Multiplexing", Sixth Data Communications Symposium, ACM/IEEE, November 1979.

[2]  Bunch and Day, "Control Structure Overhead in TCP", Trends and Applications: Computer Networking, NBS Symposium, May 1980.
