There have been a fair number of questions recently in BeDevTalk and BeUserTalk regarding the networking rewrite. I thought I'd use my shift in the newsletter sweatshop to describe the new architecture and give some details on its status.
This article is an attempt to roughly describe the stack internals and give developers who might wish to create new protocols or NIC drivers some tasty tidbits of info. For sane, normal developers who have no desire to work on networking internals, be assured that the sockets API will be just as it is on your favorite BSD clone and skip down to the $ales Pitch.
BeOS networking is being completely replaced by a new architecture, called the BeOS Networking Environment, or BONE. None of the existing R4.x networking will survive the change; it's either being ported over to the new architecture (in the case of drivers), or discarded completely (in the case of the net server, net kit, netdev kit/net server add-on architecture, PPP, Netscript, etc.). The new architecture focuses on performance, scalability, maintainability, and extensibility, in no specific order. It is simpler than the current net_server, yet far more flexible.
The BONE architecture is a modular design that allows for easy removal or replacement of any of its individual parts, by users or by Be. In this regard, BONE is an API spec for a networking architecture, and a description of how those modules interoperate. The implementation Be will ship can have parts replaced by users at will if they so desire, provided that they adhere to the specification.
                        _______________
                       |               |
                       | libsocket.so  |
                       |_______________|
user land                      |
- - - - - - - - - - - - - - - -+- - - - - - - - - - - - -
kernel land                    |
                       ________|________
                      |                 |        ______________
                      | net api driver  |       |              |
                      |_________________|       |  bone_util   |
                               |                |______________|
                       ________|________
                      |                 |
transport layer       | protocol module |  (e.g., udp)
                      |_________________|
                               |
                       ________|________
                      |                 |
network layer         | protocol module |  (e.g., ipv4)
                      |_________________|
                               |
                       ________|________
                      |                 |  (contains
data link layer       | datalink module |   routing, ARP,
                      |_________________|   etc.)
                          /          \
            _____________/___       _\_______________
           |    loopback     |     |      802.3      |
           | framing module  |     | framing module  |
           |_________________|     |_________________|
                    |                       |
physical layer      |                       |
            ________|________       ________|________
           |                 |     |                 |
           | loopback driver |     | ethernet driver |
           |_________________|     |_________________|
Let's look at each driver and module type in the architecture.
libsocket/Net API driver/kernel sockets module
All networking functionality visible to user programs is provided by libsocket, which is a very thin library that opens a driver (which provides the socket "file descriptor") and communicates with it via ioctls to provide the networking API. The net API driver instantiates the internal data structures associated with a socket (the bone_endpoint_t), sets up the protocol stack for each socket, and handles all communication between the socket and the stack.
Other networking APIs besides the BSD sockets interface could be implemented to talk to the net API driver using the same ioctls that libsocket does.
And finally, for the truly ambitious amongst you who are developing networked file systems and such, you'll be happy to hear that there is a kernel module interface to the sockets API so you'll be able to use networking from kernel land.
The bone_util module contains functionality that the other modules need and/or that doesn't fit elsewhere. bone_data (see below) manipulation, benaphores, fifos, masked data copying, and other "generic" utilities are provided here. All parts of the BONE system use this important module. It defines operations for several data types.
A bone_data_t is a data type that is used in BONE as a container for transient networking data. While it fulfills the same requirements as mbufs do under a BSD networking architecture, bone_data_t structures are quite different from mbufs and suffer from none of mbufs' limitations or problems.
Central to the efficiency of a networking stack is reducing the number of data copies. Unlike mbufs, bone_data_t structures are containers of lists of iovecs. A bone_data_t contains two such lists: a "freelist," which contains pointers to actual memory addresses that need to be freed, and a "datalist," which contains a virtual "view" of networking memory that can not only be accessed very efficiently, but also easily modified.
Consider the following scenario. A user calls "sendto" with a buffer containing a udp datagram that is 2000 bytes long. This results in a bone_data_t with the following layout:
    bone_data_t {
        datalist: {iov_base = &buffer, iov_len = 2000}
        freelist: {&buffer, 2000}
    }
(Note: the freelist entry wouldn't actually be here in this case, since on datagram sends BONE is zero-copy and would pass the user's buffer directly to the NIC driver rather than allocating a new buffer that would later need freeing. But we'll leave it here for demonstration purposes.)
The udp layer would then add a header to the data. This is easily done by simply adding an iovec to the chain:
    bone_data_t {
        datalist: {&udp_header, 8}, {&buffer, 2000}
        freelist: {&udp_header, 8}, {&buffer, 2000}
    }
(Again, the udp_header would not *really* be added to the free list, since the udp layer would be using a local buffer for it that would not need freeing, but we'll use it as an example, as with the IP header below.)
Now, suppose the interface it's being sent on has an MTU of 1500 bytes. IP would need to fragment the data and add an IP header to each frag.
On other systems (especially BSD-based systems that use mbufs), multiple copies would need to be done here. BONE simply manipulates iovecs in their lists:
    bone_data_t {
        datalist: {&ip_header, 20}, {&udp_header, 8}, {&buffer, 1472},
                  {&ip_header_2, 20}, {&buffer + 1472, 528}
        freelist: {&ip_header_2, 20}, {&ip_header, 20}, {&udp_header, 8},
                  {&buffer, 2000}
    }
By manipulating the logical view of the data rather than copying, BONE will see a big scalability and performance win when using large datagrams (such as during bulk data transfer of things like large image files).
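The iovec manipulation described above can be sketched in portable C. This is only an illustration of the idea, not BONE's code: view_t, view_append, view_prepend, view_split, and VIEW_MAX are hypothetical names invented for this sketch, and only struct iovec (from <sys/uio.h>) is a real type. It models the datalist alone and leaves the freelist out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

#define VIEW_MAX 16   /* arbitrary capacity chosen for this sketch */

/* Hypothetical stand-in for a bone_data_t datalist: an ordered list of
 * iovecs describing a logical packet without owning or copying the bytes. */
typedef struct {
    struct iovec iov[VIEW_MAX];
    int count;
} view_t;

/* Total number of bytes visible through the view. */
static size_t view_length(const view_t *v)
{
    size_t total = 0;
    for (int i = 0; i < v->count; i++)
        total += v->iov[i].iov_len;
    return total;
}

/* Append a region to the end of the view (no payload is copied). */
static void view_append(view_t *v, void *base, size_t len)
{
    assert(v->count < VIEW_MAX);
    v->iov[v->count].iov_base = base;
    v->iov[v->count].iov_len = len;
    v->count++;
}

/* Prepend a header by shifting iovec entries, not payload bytes. */
static void view_prepend(view_t *v, void *base, size_t len)
{
    assert(v->count < VIEW_MAX);
    memmove(&v->iov[1], &v->iov[0], (size_t)v->count * sizeof v->iov[0]);
    v->iov[0].iov_base = base;
    v->iov[0].iov_len = len;
    v->count++;
}

/* Split src so that head sees the first `at` bytes and tail sees the rest.
 * An iovec straddling the boundary is referenced by both views, each with
 * its own base and length. head and tail must not alias src. */
static void view_split(const view_t *src, size_t at, view_t *head, view_t *tail)
{
    head->count = tail->count = 0;
    for (int i = 0; i < src->count; i++) {
        char *base = src->iov[i].iov_base;
        size_t len = src->iov[i].iov_len;
        size_t take = len < at ? len : at;   /* bytes that belong to head */
        if (take > 0)
            view_append(head, base, take);
        if (len > take)
            view_append(tail, base + take, len - take);
        at -= take;
    }
}
```

To mirror the scenario above, append the 2000-byte buffer to an empty view, prepend the 8-byte udp header, split at 1480 bytes (the 1500-byte MTU minus a 20-byte IP header), and prepend an IP header to each half. The first view then covers 1500 bytes and the second 548, and the user's buffer is never copied.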
bone_proto_info_t
All the protocols are implemented as instances of bone_proto_info_t. These are chained together as appropriate in structures called bone_proto_node_t for each networking endpoint instance when it is created. A driver_settings configuration file specifies which protocols to put in a socket's stack when the socket is created.
(If you're afraid of looking at a config file for efficiency reasons, don't be; the bone_util module contains optimized functions for reading the BONE settings. On average, opening a socket under BONE takes on the order of 300 usec (microseconds).)
When networking operations occur, the net api driver calls the appropriate function in the bone_proto_info_t module on top of its protocol stack. The protocol then performs all necessary protocol-specific operations and calls the next protocol in the chain, on down to the network layer protocol, which passes the final data on to the datalink layer.
To add a new protocol to BONE, one essentially creates a bone_proto_info_t "subclass" for the protocol, and adds entries for it to the BONE configuration file. It will be loaded at runtime by either the API driver (for new sockets) or the datalink layer (for inbound data).
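The chaining idea can be sketched as a list of nodes, each pointing at a table of functions. The real bone_proto_info_t and bone_proto_node_t are defined in bone_proto.h and are not reproduced here; proto_info_t, proto_node_t, generic_send, and endpoint_send below are simplified, hypothetical stand-ins that only track how many bytes each layer adds on the way down.

```c
#include <assert.h>
#include <stddef.h>

typedef struct proto_node proto_node_t;

/* Hypothetical, simplified protocol description (not the real
 * bone_proto_info_t). */
typedef struct {
    const char *name;
    size_t header_len;   /* bytes this protocol adds on send */
    /* Handle `len` bytes, then hand the result to the next layer down.
     * Returns the number of bytes that reached the bottom of the chain. */
    size_t (*send)(proto_node_t *self, size_t len);
} proto_info_t;

/* One link in an endpoint's chain (the role the article gives to
 * bone_proto_node_t). */
struct proto_node {
    const proto_info_t *proto;
    proto_node_t *next;  /* next protocol toward the datalink layer, or NULL */
};

/* Generic send: account for this layer's header, then call the next layer. */
static size_t generic_send(proto_node_t *self, size_t len)
{
    len += self->proto->header_len;
    if (self->next == NULL)
        return len;      /* this is where data would go to the datalink layer */
    return self->next->proto->send(self->next, len);
}

static const proto_info_t udp_proto  = { "udp",  8,  generic_send };
static const proto_info_t ipv4_proto = { "ipv4", 20, generic_send };

/* What the net API driver does conceptually: call the top of the chain. */
static size_t endpoint_send(proto_node_t *top, size_t payload_len)
{
    assert(top != NULL);
    return top->proto->send(top, payload_len);
}
```

With a udp node chained to an ipv4 node, sending 100 bytes of payload yields 128 bytes at the bottom of the chain. In this model, adding a protocol means supplying another proto_info_t and linking a node for it into the chain, which is the part the configuration file controls in BONE.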
bone_datalink
The datalink module is the center of the BONE architecture.
Each time an interface is brought up (via ifconfig, etc.), the datalink module spawns off a thread which blocks on the interface module's receive method. When new data arrives on the interface, it's read by that thread, demuxed, and pushed up the appropriate protocol stack to the receive queue of the appropriate bone_endpoint_t.
The fact that each interface has its own reader thread associated with it, in addition to the fact that multiple user-level threads will be pushing data simultaneously through the system, should provide BONE with greater scalability than other systems, particularly in the area of stack latency. Multiple-interface BeOS systems perform quite well under BONE.
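The body of such a reader thread might look roughly like the loop below. This is a sketch under stated assumptions, not the datalink module's code: packet_t, endpoint_t, interface_t, demux, and reader_loop are hypothetical, the "receive queue" is reduced to two counters, and demuxing is done on protocol number and port only. The replay_t interface is a stand-in that replays packets from an array so the loop can run without hardware; in BONE the loop would run in its own thread, one per interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, already-parsed summary of an inbound packet. */
typedef struct {
    uint8_t  protocol;   /* IP protocol number, e.g. 17 for UDP, 6 for TCP */
    uint16_t dst_port;
    size_t   len;
} packet_t;

/* Hypothetical endpoint; the counters stand in for a real receive queue. */
typedef struct {
    uint8_t  protocol;
    uint16_t port;
    size_t   queued_packets;
    size_t   queued_bytes;
} endpoint_t;

typedef struct interface interface_t;
struct interface {
    /* Blocks until a packet arrives; returns 0 once the interface is down. */
    int (*receive)(interface_t *self, packet_t *out);
    void *cookie;
};

/* Find the endpoint a packet belongs to, or NULL if nobody wants it. */
static endpoint_t *demux(endpoint_t *eps, size_t n, const packet_t *p)
{
    for (size_t i = 0; i < n; i++)
        if (eps[i].protocol == p->protocol && eps[i].port == p->dst_port)
            return &eps[i];
    return NULL;
}

/* Reader loop for one interface. Returns the number of dropped packets. */
static size_t reader_loop(interface_t *ifc, endpoint_t *eps, size_t n)
{
    packet_t pkt;
    size_t dropped = 0;

    assert(ifc != NULL && ifc->receive != NULL);
    while (ifc->receive(ifc, &pkt)) {
        endpoint_t *ep = demux(eps, n, &pkt);
        if (ep == NULL) {
            dropped++;
            continue;
        }
        ep->queued_packets++;          /* "push" onto the receive queue */
        ep->queued_bytes += pkt.len;
    }
    return dropped;
}

/* Stand-in interface that replays packets from an array. */
typedef struct {
    const packet_t *pkts;
    size_t count, next;
} replay_t;

static int replay_receive(interface_t *self, packet_t *out)
{
    replay_t *r = self->cookie;
    if (r->next >= r->count)
        return 0;
    *out = r->pkts[r->next++];
    return 1;
}
```

Because each interface gets its own loop, a slow or idle interface blocks only its own reader and never delays packets arriving elsewhere.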
Networking interfaces are represented using the traditional BSD struct ifnet data structure, modified for BeOS. This structure contains much info about an interface, including the various addresses associated with it, volatile statistics, the bone_interface_info_t module to use for the interface, and the bone_frame_info_t module to use for framing the data.
bone_frame_info_t
Since many different interfaces use the same link-level framing types, these were isolated out into modules to facilitate reuse. For example, any number of ethernet card driver modules can load the single bone_802.3 module for their framing needs.
Similarly, by decoupling framing from the rest of the link layer, a single NIC driver module can use different types of framing. For example, a HiPPI interface can be configured to use either HiPPI physical layer framing or its logical layer framing. Another example would be an ethernet interface that wants to send jumbograms rather than 1500-byte ethernet frames.
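A small sketch of that decoupling follows. The real bone_frame_info_t interface is not shown in this article, so frame_ops_t, nic_t, and nic_frame are hypothetical, and the header-less raw_framing is an assumption added for contrast. The ethernet header layout itself (6-byte destination, 6-byte source, 2-byte big-endian type, with 0x0800 meaning IPv4) is the standard Ethernet II one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETHER_ADDR_LEN 6
#define ETHER_HDR_LEN  14        /* 6 destination + 6 source + 2 type */
#define ETHERTYPE_IPV4 0x0800

/* Hypothetical framing interface; the real bone_frame_info_t may differ. */
typedef struct {
    const char *name;
    /* Writes the link-level header into `out`; returns bytes written. */
    size_t (*build_header)(uint8_t *out, const uint8_t *dst,
                           const uint8_t *src, uint16_t type);
} frame_ops_t;

static size_t ether_build_header(uint8_t *out, const uint8_t *dst,
                                 const uint8_t *src, uint16_t type)
{
    assert(out != NULL && dst != NULL && src != NULL);
    memcpy(out, dst, ETHER_ADDR_LEN);
    memcpy(out + ETHER_ADDR_LEN, src, ETHER_ADDR_LEN);
    out[12] = (uint8_t)(type >> 8);      /* type field is big-endian */
    out[13] = (uint8_t)(type & 0xff);
    return ETHER_HDR_LEN;
}

/* Assumed for contrast: a framing type that adds no header at all. */
static size_t raw_build_header(uint8_t *out, const uint8_t *dst,
                               const uint8_t *src, uint16_t type)
{
    (void)out; (void)dst; (void)src; (void)type;
    return 0;
}

static const frame_ops_t ether_framing = { "ethernet", ether_build_header };
static const frame_ops_t raw_framing   = { "raw",      raw_build_header };

/* Hypothetical interface record: driver and framing are chosen
 * independently, so many drivers can share one framing module. */
typedef struct {
    const char *driver_name;
    const frame_ops_t *framing;
} nic_t;

static size_t nic_frame(const nic_t *nic, uint8_t *out, const uint8_t *dst,
                        const uint8_t *src, uint16_t type)
{
    return nic->framing->build_header(out, dst, src, type);
}
```

Two different card drivers can point their nic_t at the same ether_framing table, and one driver can switch framing simply by pointing at a different table.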
bone_interface_info_t
A networking-oriented interface to device drivers is added in BONE, to be used in writing NIC drivers. If desired, a traditional device driver can also export a bone_interface_info_t module interface, which makes porting existing drivers easy.
In the way of sample code, I have included the current snapshot of bone_proto.h and bone_interface.h, the two headers most useful to the majority of you who will be writing BONE modules. I have also included a snapshot of the bone_util.h BONE utilities header file, since the other files use it so much. Finally, I've included the source code to the BONE loopback interface module to illustrate how to write a network interface module. To get the code: <ftp://ftp.be.com/pub/samples/bone/bone.zip>.
Note that these files should be considered alpha-level software. They are likely to change in the future. The loopback module is (purposely) nonoptimized and provided as an illustration; real loopback operations are heavily optimized in BONE and bypass this module entirely.
While these files aren't everything you need to start developing for BONE, they should give you an idea of the directions you should be heading in.
In addition to the traditional BeOS GUI-based tools, all of your favorite UNIX networking utilities are either already ported or will port readily. Examples include:
BIND 8.2 tools: addr, dnsquery, irpd, named-bootconf, nslookup, dig, host, mkservdb, named-xfer, nsupdate, dnskeygen, named, ndc
Configuration Tools: route, ifconfig, etc.
Utilities: telnet, ping, ftp, traceroute, tcpdump, libpcap, etc.
and many more.
Almost every feature that BeOS net developers have been asking for is there; sockets are file descriptors, the sockets API is much more compliant, raw sockets are there, it's relatively easy to add new protocols, there is a kernel networking interface, and so on.
Net performance has improved massively; there are no hard numbers (and we haven't finished optimizing) but our benchmarks are putting BONE around twenty times (2000%) the speed of the current net_server; BONE is in the same league as Linux and FreeBSD, though not fully competitive with their speed yet. Yet. :-)
As one of the new guys at Be, I've been able to avoid the task of writing a newsletter article for a while. Today though, my fellow DTS engineers finally located my secret hideout, and promptly assigned me the task of writing this week's Engineering Insights article. So, as we say in the Netherlands: "here it is."
In this article I'll present a small application, written a long time ago when I wanted to measure cpu usage of the SoundPlay mp3 player (you may have heard of it). Specifically, I wanted to compare its cpu usage with that of the other players available at the time: I wanted to see which threads used the most cpu and when they used it. The resulting app is (not surprisingly) called cputime. You can download the source code at <ftp://ftp.be.com/pub/samples/kernel_kit/cputime.zip>.
You can start from the command line this way:
cputime application arguments
where "application" is the application you want to measure and "arguments" are the arguments (if any) you want to pass to the application. cputime will launch the specified application with the given arguments. While the application is running, cputime will continuously monitor its cpu usage, either until the application exits or cputime's sampling buffer is full. When either of these happens, cputime will open a window and graphically display the cpu usage over time of each thread of the application.
Because of the way cputime works, threads that are created and destroyed within one sampling interval are "lost" and don't show up in the display. If the application you want to monitor rapidly spawns new threads that run for only a very short time and then die, you may want to modify cputime to use a higher sampling rate. This is left as an exercise for the reader.
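The sampling approach can be sketched as follows. This is not taken from the cputime source: it assumes the tool records each thread's cumulative cpu time at fixed intervals (on BeOS such totals are exposed through the user_time and kernel_time fields of thread_info) and converts successive samples into a percentage per interval. The function name cpu_usage_percent and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert cumulative cpu-time samples (microseconds) for one thread into
 * per-interval usage percentages.
 *
 *   cpu_time_us  cumulative cpu time at each sampling instant
 *   nsamples     number of entries in cpu_time_us
 *   interval_us  wall-clock time between two samples
 *   usage        output array with room for nsamples - 1 values
 *
 * Returns the number of intervals written. A thread seen in fewer than two
 * samples produces no intervals at all, which is why short-lived threads
 * never show up. */
static size_t cpu_usage_percent(const int64_t *cpu_time_us, size_t nsamples,
                                int64_t interval_us, double *usage)
{
    assert(interval_us > 0);
    if (nsamples < 2)
        return 0;
    for (size_t i = 0; i + 1 < nsamples; i++) {
        int64_t delta = cpu_time_us[i + 1] - cpu_time_us[i];
        usage[i] = 100.0 * (double)delta / (double)interval_us;
    }
    return nsamples - 1;
}
```

For example, with a 10000 microsecond interval, cumulative samples of 0, 5000, 5000, and 15000 microseconds give three intervals at 50%, 0%, and 100% usage. Shortening the interval raises the chance that a brief thread is caught in two consecutive samples, at the cost of filling the sampling buffer sooner.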
Not surprisingly, last week's column brought more questions and one or two arguments to my BeMail mailbox. I appreciate these responses; they help me understand where we need to make ourselves "even" clearer. More than one reader noted that I've become more cautious, shall we say, a little less "colorful" in my observations. It's not clear whether this is seen as regrettable or, au contraire, a welcome change.
And, yes, my column is vetted by our legal counsel for statements that would in fact or appearance cause trouble with the powers that oversee the stock market. It's one thing to believe in emerging opportunities, but one's optimism can too easily be interpreted as promoting not the product, but the company's securities (notwithstanding the potential entendre-doubling the latter noun might suggest). The fact that our company's stock is publicly traded impacts what I can and cannot say.
Second, several readers proclaimed they don't care for Internet Appliances, don't need them, and feel we're misguided for getting into them. This suggests that we haven't made a good enough case yet for this emerging genre of connected devices, both for their benefits in general and for the role our technology can play in this emerging area. Readers object that "my PC does everything I need, so I have no use for any of these Internet devices." But this is not a PC versus IA (Internet Appliance) debate. It's PC and IA, not PC versus IA.
In theory, a PC is capable of infinite mutability through software and hardware add-ons that can simulate any experience. I'm exaggerating, of course, but the idea that the PC is a simulation engine is fundamentally correct. So is the idea that dedicated, specialized devices always appear in a prosperous ecological niche and coexist with the general purpose ones.
Let's put Swiss Army knives and screwdrivers aside and look at automobiles. The last 20 years have brought us specialized vehicles such as minivans and SUVs. The same kind of specialization is appearing in the digital realm, with telephones, Palm VIIs, TVs, stereos, alarm systems, WebPads, Tivo video recorders, pagers, Web Minitels, and similar information appliances and, yes, PCs. And for everything that we can see or imagine now, there are many other kinds of devices our inevitably derivative thinking prevents us from intuiting in the present moment.
As for myself, whether or not I can make good guesses as to which Internet Appliances will survive beyond the concept stage, I know we're onto something. Go buy a Tivo hard disk video recorder. Today, it connects to the Net through a slow telephone line. Watch your children use it and quickly forget that there was a time when Tivo didn't exist. Now let's muster our available derivative thinking and "picture" what will happen when the Net connection is always open, say via cable modem or DSL. We'll be able to program the recorder from anywhere in the house, or in the world, through a browser, with no need to download the TV schedule at night as the Tivo box does today.
No, it's not video on demand; that will come later with a beefier Net infrastructure. And no, we don't really need these appliances. Nor do we really need PCs, or cars, or chocolate. Personally, I'd rather give up PCs than books. But just as we don't seem to want to live without personal transportation, there's a good chance that we'll find these appliances exciting or liberating enough to want them, regardless of what we call need.
Next week, I'll describe how one of us uses PCs, appliances, and wireless technology for home entertainment.