While most computer use focuses on creating, storing and moving single files, many application domains exist where efficiently handling operations on entire disks is important. Classic system administration tasks, such as operating system installation, catastrophe recovery, and forensics, as well as new research such as mobile work environments [18] and computing utility farms [9, 14], benefit greatly from the ability to quickly read, transfer, and write entire disk partitions.
There are two basic disk-level distribution strategies. Differential update, represented by tools such as rsync [17], operates above the filesystem and compares what is already on the disk with the desired contents, replacing only what is necessary. Disk imaging, used by tools such as Ghost [7], operates below the filesystem, unconditionally replacing the contents of a disk.
Differential techniques are very effective at synchronizing file hierarchies within a filesystem and are extremely bandwidth efficient. However, for distributing entire disks, disk imaging offers a number of important advantages:
Generality: Creating a disk image requires no knowledge of the filesystem being imaged. We show how limited knowledge can be beneficial in Section 3.1, but it is not required. Synchronization above the filesystem, however, requires detailed understanding of the filesystem, such as directory structure, file ownership, access controls, and times.
Robustness: Disk images have no dependence on the existing contents of the target disk; in contrast, file-based synchronization tools cannot, for example, be used on a corrupted filesystem.
Versatility: One filesystem type can easily be replaced with another using a disk image. This cannot be done with a file-based synchronizer.
Speed: Writing an entire disk image can be faster than determining which files need to be updated. Section 5.4 demonstrates that Frisbee runs much faster than rsync in our target environment.
Full-disk imaging does have drawbacks. It is less bandwidth-efficient than differential techniques: no matter how small the difference between the source and destination, the entire disk is copied. The client will also most likely have to be entirely dedicated to the task, since raw disk contents are changed rather than files being updated above the filesystem layer. The advantages outweigh the drawbacks, however, and we demonstrate that, by taking advantage of characteristics of the target domain, disk imaging can be used to implement an efficient and scalable disk loading system.
Adopting the strategy of complete disk reloading due to our application’s requirements, we have designed, implemented, and evolved Frisbee, a disk image generation and distribution system that is fast and highly scalable in a LAN environment. While designed for our network emulation testbed, Frisbee offers functionality and techniques useful in a variety of production and research environments.
Five aspects of Frisbee’s design are key to its success: filesystem-aware compression, two-level data segmentation, a custom receiver-driven multicast protocol, specialization for the resources available in its target environment, and a high degree of concurrency in the client.
This paper makes the following contributions: (1) It shows that bulk disk imaging can be extremely fast and scalable, making it a practical approach to disk loading, and frequently a superior one. Furthermore, our performance results indicate that disk imaging is so fast that it can be applied in qualitatively new ways. (2) It presents the detailed design, implementation, and experimental evaluation of Frisbee and identifies the design aspects most important to its performance. (3) The disk imaging system it describes is, to our knowledge, the fastest such system extant. In addition, versions of the system have been proven in production use for over 18 months by hundreds of external users, and it is available in open source form. (4) It discusses the design tradeoffs in disk imaging systems.
In the rest of this paper we first outline Netbed, the network testbed system that drove our need for disk imaging. We then outline the design tradeoffs in a disk imaging system, following with sections on Frisbee’s design and implementation, performance evaluation and analysis, related and future work, and conclusion.
As Frisbee was developed primarily for use in the Emulab portion of Netbed, some background will be useful in understanding its motivation and target environment. Netbed [20] is a time- and space-shared facility for networking and distributed systems research and education. It has been in use by the community since April 2000. Emulab is one of Netbed’s primary hardware resources—it consists of a cluster of commodity PC “nodes” with configurable network interconnectivity. Emulab is space-shared in the sense that it can be arbitrarily partitioned for use by multiple experimenters at the same time. Some resources in the system, such as nodes, can only be used in one experiment at a time; in this sense, Emulab is also time-shared. Experimenters are generally given full root access to nodes, meaning that they are free to reconfigure the host operating system in any way they wish, or even install their own.
In light of its time-shared nature, and the degree of control experimenters are given, it is critical that Emulab nodes be returned to a known state between experiments. Without this, experimenters have no guarantee that their results are not affected by configuration changes made by the previous user. Even worse, a malicious user could modify disk contents to facilitate compromise of the node once it has been allocated to another experimenter. To ensure a node is in a known state, its disk must be entirely reloaded.
Disk contents on Emulab are considered to be soft state; hard state, such as user accounts and information about network configuration, is kept off-node in a central database. This separation simplifies things by allowing the same disk image to be used on many nodes, with each node self-configuring from the central database at boot time. It also relieves users of the need to preserve configuration data. Thus, if an experimenter corrupts disk contents, say by introducing a kernel change that corrupts a filesystem, the disk can simply be reloaded, preserving any hard state and losing only soft state. This forgiving environment encourages aggressive experimentation.
Disk images can also be loaded automatically at experiment creation time. An experimenter who wishes to install their own custom operating system or make substantial changes to the default FreeBSD or Linux images provided by Emulab can create an image containing their customizations. They can then load this image on an arbitrary number of other nodes without manual intervention.
Since Emulab has a large number of nodes (currently 170), and users have run experiments that use more than 120 nodes, speed and scaling are critical to enable these new uses of disk imaging. Waiting for scores of nodes to load serially would leave them unavailable for a long period of time, dramatically reducing the throughput of Emulab, so image distribution must be done in parallel. There are, however, distinct classes of nodes of differing speeds, so it is important that our disk imaging solution work well with heterogeneous clients.
There are three phases of a disk imaging system: image creation, image distribution, and image installation. Each phase has aspects which must be balanced to fulfill a desired goal. We consider each phase in turn.
Source availability: While it is possible for the source of the snapshot to be active during the image creation process, it is more common that it be quiescent to ensure consistency. Quiescence may be achieved either by using a separate partition or disk for the image source, or by running the image creation tool in a standalone environment which doesn’t use the source partition. Whatever the technique, the time that the image source is “offline” may be a consideration. For example, an image creation tool which compresses the data as it reads it from the disk may take much longer than one that just reads the raw data and compresses later. However, the former will require much less space to store the initial image.
Degree of compression and data segmentation: Another factor is how much (if any) and what kind of compression is used when creating the image. While compression would seem to be an obvious optimization, there are trade-offs. As mentioned, the time and CPU resources required to create an image are greater when compressing. Compression also impacts the distribution and decompression process. If a disk image is compressed as a single unit and even a single byte is lost during distribution, the decompression process will stall until the byte is acquired successfully. Thus, depending on the distribution medium, images may need to be broken into smaller pieces, each of which is compressed independently. This can make image distribution more robust and image installation more efficient at the expense of sub-optimal compression.
Filesystem-aware compression: A stated advantage of disk imaging over techniques that operate at the file level is that imaging requires no knowledge of the contents or semantics of the data being imaged. This matches well with typical file compression tools and algorithms, which are likewise ignorant of the data being compressed. However, most disk images contain filesystems, and most filesystems have a large amount of available (free) space in them, space that will dutifully be compressed even though the contents are irrelevant. Thus, the trade-off for being able to handle any content is wasted time and space creating the image and wasted time decompressing the image. One common workaround is to zero all the free space in filesystems on the disk prior to imaging, for example, by creating and then deleting a large file full of zeros. This at least ensures maximum compressibility of the free space. A better solution is to perform filesystem-aware compression. A filesystem-aware compression tool understands the layout of a disk, identifying filesystems and differentiating the important, allocated blocks from the unimportant, free blocks. The allocated blocks are compressed while the free blocks are skipped. Of course, a disk imaging tool using filesystem-aware compression requires even more intimate knowledge of a filesystem than a file-level tool, but the imaging tool need not understand all filesystems it may encounter: it can always fall back on naive compression.
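To make the idea concrete, the following sketch (ours, in Python; it is not taken from any imaging tool) turns a filesystem’s free-block bitmap into the list of allocated ranges that would actually be compressed:

# Illustrative sketch of filesystem-aware compression: given a free-block
# bitmap read from filesystem metadata, emit only the allocated block ranges.

def allocated_ranges(free_bitmap):
    """Yield (start, length) runs of allocated blocks.
    free_bitmap[i] is True when block i is free."""
    start = None
    for i, is_free in enumerate(free_bitmap):
        if not is_free and start is None:
            start = i                      # a run of allocated blocks begins
        elif is_free and start is not None:
            yield (start, i - start)       # the run ends
            start = None
    if start is not None:
        yield (start, len(free_bitmap) - start)

# Example: blocks 0-2 and 5 are allocated, 3-4 and 6-7 are free.
print(list(allocated_ranges([False, False, False, True, True, False, True, True])))
# -> [(0, 3), (5, 1)]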
Network bandwidth and latency: Perhaps the most important aspect of network distribution is bandwidth utilization. The availability of bandwidth affects how images are created (the degree of compression) as well as how many clients can be supported by a server (scaling). Bandwidth requirements are reduced significantly by using compression. Increased compression not only reduces the amount of data that needs to be transferred, it also slows the consumption rate of the client due to the need to decompress the data before writing it to disk. If image distribution is serialized, only one client at a time, then compression alone may be sufficient to achieve a target bandwidth. However, if the goal is to distribute an image to multiple clients simultaneously, then typical unicast protocols will need to be replaced with broadcast or multicast. Broadcast works well in environments where all clients in the broadcast domain are involved in the image distribution. If the network is shared, then multicast is more appropriate, ensuring that unaffiliated machines are not affected. Just as in all data transfer protocols, the delay-bandwidth product affects how much data needs to be en route in order to keep clients busy, and the bandwidth and latency influence the granularity of the error recovery protocol.
Network reliability: As alluded to earlier, the error rate of the network may affect how compression is performed. Smaller compression units may limit the effectiveness of the compression, but increase the ability of clients to remain busy in the face of lost packets. More generally, in lossy networks it is desirable to sub-divide an image into “chunks” and include with each chunk additional information to make that chunk self-describing. In a highly reliable network, or if using a reliable transport protocol that provides in-order delivery beneath the image distribution protocol, this additional overhead would be unnecessary.
Network security: If the distribution network is not “secure,” additional measures will need to be taken to ensure the integrity and privacy of image data. If the image contains sensitive data, then encryption can be used to protect it. This encryption can be done either in the network transport using, for example, SSL, or the image itself could be encrypted as part of the creation process. The latter requires more CPU resources when creating the image but provides privacy of the stored image and is compatible with existing multicast protocols. Ensuring that the image is not corrupted during distribution due to injection of forged data into the communication channel is also an issue. This requires that clients authenticate the source of the image. Again, many solutions exist in the unicast space, such as using an SSH tunnel to distribute images. For multicast, the problem is harder and the focus of much research [2]. Note that security is not just a wide-area network concern. Even in a LAN, untrusted parties may be able to snoop on or spoof traffic unless countermeasures are taken. However, in the LAN case, switch technologies such as virtual LANs can provide some or all of the necessary protection.
Receiver- vs. sender-driven protocol: A final issue in image distribution is whether the protocol is sender- or receiver-driven [19]. A simple sender-driven protocol might require that all clients synchronize their startup and operate in lock-step as the server doles out pieces of an image as it sees fit. Such a strategy would scale well in a highly reliable network with homogeneous client machines, as little extraneous communication is required. However, if a client does miss a piece of the image for any reason, it might be forced to abort or wait until the entire image has been sent out and then request a resend. A client-driven protocol allows each client to join the distribution process at any time, requesting the chunks it needs to complete its copy, and then leaving. The process completes when all clients have left. The downside is more control traffic and the potential for significant redundant data transfers, either of which can affect scaling.
The final element of disk imaging is the installation of the transferred image on a client. As with image creation, the disk or partition involved must be quiescent, with the image installer either operating on a second disk or partition or running standalone. Since image installation is typically concerned with installing or restoring the “primary” disk on a machine, we restrict the remaining discussion to the standalone case.
Resource utilization: In a standalone environment, the disk installation tool is in the enviable position of being able to consume every available local resource on the target machine. For example, it can use hundreds of megabytes of memory for caching image data incoming from the network or for decompressed data waiting to be written to disk. Likewise, it can spawn multiple threads to handle separate tasks and maximize overlap of CPU and I/O operations.
Overlapping computation and I/O: CPU is of particular interest since, on reasonably fast current processors, substantial computing can be performed while waiting for incoming network packets and disk write completions. The most obvious use of the cycles is to decompress data. However, on unreliable transports the time could also be used for computing checksums, CRCs, or forward-error-correction codes. On insecure transports, CPU resources may be needed for decrypting and authenticating incoming data.
Optimizing disk I/O: If disk I/O is the bottleneck when installing an image, the installation tool may be able to exploit client resources or characteristics of the disk image to minimize disk write operations. On machines with large physical memories, memory can be used to buffer disk writes allowing for fewer and larger sequential IO operations. If the image format uses a filesystem-aware compression strategy which distinguishes allocated and free blocks, the installation tool can seek over ranges of free blocks, thereby reducing the number of disk writes. Note that this method has security implications, since it has the potential to “leak” information from the previous disk user to the new user.
Optimizing network I/O: If network bandwidth is the bottleneck then it may be possible to take advantage of similarities between the old disk contents and the desired contents, as is done in the LBFS filesystem [15], designed for the wide-area. In this technique, acquiring a disk image would be a two-phase process. Whenever a client needs a block of data, it would first ask for a unique identifier, such as a collision-resistant hash [5], for that block which it could then compare to blocks on the local disk. If the block already exists on the local disk, it need only be copied to the correct location. Only if the block is not found, would the client request the actual block data. Such hashing techniques can place a heavy burden on the CPU as well as the disk, if local hashes must be computed at run time.
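The following hypothetical sketch illustrates such a two-phase, hash-based fetch; the server interface (get_hash, get_data) is assumed purely for illustration and is not part of Frisbee:

# Hypothetical sketch of the two-phase fetch described above: ask for a
# block's hash first and reuse a matching local block, transferring the
# actual data only on a miss.
import hashlib

def fetch_block(server, local_blocks, block_id):
    """server is assumed to expose get_hash(id) and get_data(id);
    local_blocks maps hash digests to block contents already on disk."""
    digest = server.get_hash(block_id)          # phase 1: cheap hash request
    if digest in local_blocks:
        return local_blocks[digest]             # reuse an identical local block
    data = server.get_data(block_id)            # phase 2: transfer the real data
    local_blocks[hashlib.sha1(data).hexdigest()] = data
    return data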
The previous section outlined a variety of issues and trade-offs in the design of a disk imaging system. In this section we describe our design choices, their rationale, and their mapping to implementation.
Creation and compression: To create an image, we first boot the source machine into a special memory filesystem-based version of Unix. This satisfies the need for disk quiescence, and allows us to create images without porting the image creator to run on all operating systems for which it can create images. Since image creation is much less frequent than image installation, we do not aggressively optimize the time spent creating images. To save on server disk space and bandwidth, the image is compressed on the client before being written to the server. Filesystem/OS-specific compression is used, including skipping swap partitions; generic zlib-based [4, 21] compression is used on allocated blocks. Partitions that contain unknown filesystem types are either compressed generically or, optionally, entirely skipped.
Multicast: Our distribution mechanism uses a custom application-level, receiver-initiated multicast protocol with NAK-avoidance [10]. In turn, this protocol relies on IP multicast support in the network switches to provide one-to-many delivery at the link level. The number of control messages is kept under control by multicasting client repair requests (NAKs), so that other clients can suppress duplicate requests. In terms of the multicast design space put forth in RFC 2887 [8], our application design requires only scalability and total reliability. We do not require other constraints, in particular ordered data or server knowledge of which clients have received data.
Two-level segmentation: We now address the issue of the granularity of data segments. Since we will need to resend lost multicast packets, yet do not need to preserve ordering, it is clear that we want the client-side decompression routines to process data segments out of order. Therefore, each data segment must be self-describing and stand alone as a decompressable unit. Since compression routines optimize their dictionaries based on the distribution of their input data, they achieve better compression ratios when given longer input to sample. That argues for longer segments.
Since Frisbee’s basic job is I/O (copying disks through memory over networks), the classic hardware and OS architecture reasons that favor sequential I/O for its speed and efficiency also favor long segments. However, to preserve network and machine resources, we want our multicast loss recovery algorithm to use selective retransmission, which requires relatively short segments. Finally, we want a small segment size that fits into the Ethernet MTU.
The fundamental problem is that we need to follow the principle of Application Level Framing (ALF) [3], yet have conflicting application requirements. We address these conflicting demands by using a two-level segmentation scheme. The unit of compression is the self-describing 1MB chunk, composed of 1024 1KB blocks. For the initial network transmission, the server multicasts an entire chunk, capping its rate by pausing every N (currently 16) blocks for a tunable period. Receivers selectively request missing blocks via partial-request messages. In this way we achieve long segments that can efficiently be randomly accessed, composed of small blocks that are typically, but not always, accessed sequentially. Subsets of blocks, as specified in receiver partial-request messages, give us a flexible mechanism to request intermediate lengths without undue complexity.
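The sketch below summarizes the resulting message granularity; the field names are illustrative rather than Frisbee’s actual wire format:

# Sketch of the two-level segmentation: a 1MB chunk of 1024 1KB blocks is
# the unit of compression; a partial REQUEST names a chunk plus a block
# range within it.
from dataclasses import dataclass

BLOCK_SIZE = 1024          # one block per packet; fits in an Ethernet MTU
BLOCKS_PER_CHUNK = 1024    # 1024 x 1KB blocks = 1MB unit of compression

@dataclass
class BlockMsg:            # one multicast BLOCK message
    chunk: int             # which 1MB chunk
    block: int             # 0..1023, position within the chunk
    data: bytes            # BLOCK_SIZE bytes of compressed image data

@dataclass
class PartialRequest:      # a receiver asks only for the blocks it is missing
    chunk: int
    first_block: int
    count: int

# A full-chunk request is simply PartialRequest(chunk, 0, BLOCKS_PER_CHUNK).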
Specialization for resource availability: Since most modern Ethernet networks are switched, and the dedicated clients do not need bandwidth for other purposes, the main place that bandwidth must be saved is on the server and the server’s network link, a situation for which multicast is ideally suited. The protocol is client-driven; this way, a server can be running at all times, but naturally falls idle when no clients are present. Additionally, this client-side control provides a high degree of robustness in the face of client failure and reduces server-side bookkeeping.
Receiver concurrency control: To install an image, we boot into a small, memory filesystem-based Unix system similar to the one used when creating the image. Using multiple threads, our disk loader client program takes care to overlap the computationally expensive decompression with the slow disk I/O. Using filesystem-specific compression turns out to give the biggest performance improvements at this stage—once compression is used to reduce the data transferred on the network, and maximal processor/IO overlap is achieved, the bottleneck in performance becomes disk writes. We thus obtain a huge savings by not having to write unnecessary disk blocks. The ability to write free areas of the disk is still available because, as discussed in Section 4.4, this may be needed for confidentiality reasons.
Security: Since users must register to use Netbed, our threat model does not encompass determined malicious local adversaries. We have not yet needed to ensure security in the face of malicious users. We do expect to provide more security eventually, perhaps by signing image data or with VLAN technology. In Emulab, images are distributed via the “control” network, a single switched virtual LAN connecting all nodes. Thus there is the potential for an experiment to observe data on, or inject data into, a Frisbee multicast stream for another experiment. Our focus is on preventing accidental interference between experiments, in particular ensuring the integrity of image data. We use a simple check of the source IP address of incoming block data, which, although maliciously spoofable, suggests that it comes from the approved host. We do not currently provide an option for ensuring privacy of distributed images. User-created images are protected via filesystem permissions while stored on the Frisbee server’s disk.
In the Frisbee system, the imagezip application is responsible for creating images of either entire disks or single partitions. The images are compressed using both conventional and filesystem-aware techniques in a two-stage process. In the optional first phase, the partitions of interest are analyzed to determine if filesystem-aware compression can be done. Partitions are identified either explicitly by command line options or implicitly by reading the partition type field in the DOS partition table (on x86-based systems). Frisbee currently handles BSD FFS, Linux ext2fs and Windows NTFS filesystems as well as BSD and Linux swap partitions. If a partition is recognized, a filesystem-specific module is invoked to scan the filesystem free list and build up a list of free blocks in the partition. If a partition is not recognized, imagezip treats all blocks as allocated.
imagezip also has a limited ability to associate “relocation” information with data in a created image. This information allows it to create single-partition images that can be loaded onto a disk that has a different partition layout. This facility is needed for filesystem types that contain absolute rather than partition-relative sector numbers. Notable examples of this are FreeBSD disklabels and LILO boot blocks.
In the second phase, the allocated blocks are read sequentially and compressed, producing 1MB chunks. Each chunk has a fixed-size header with index information identifying the ranges of allocated blocks contained within it. Since the degree of compression is unpredictable, it is impossible to know exactly how much input data is required to fill the remaining space in a chunk. We counter with a simple algorithm that compresses smaller and smaller pieces as the chunk gets close to full; we then pad the chunk out to exactly 1MB. The padding typically runs around 20KB.
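A simplified sketch of this chunk-filling strategy is shown below; it is not imagezip’s actual code or on-disk format (imagezip is written in C), but it captures the shrinking-piece idea:

# Sketch: compress progressively smaller pieces of the input as the 1MB
# chunk fills, then pad the chunk body out to a fixed size.
import zlib

CHUNK_SIZE = 1024 * 1024            # output chunks are exactly 1MB
HEADER_SIZE = 1024                  # assumed size of the fixed chunk header

def fill_chunk(data, offset=0):
    """Compress data[offset:] into one chunk body; return (body, new_offset).
    Pieces shrink as the chunk fills so that the last piece still fits."""
    out, space, piece = [], CHUNK_SIZE - HEADER_SIZE, 128 * 1024
    while space > 0 and offset < len(data):
        comp = zlib.compress(data[offset:offset + piece], 4)
        if len(comp) > space:       # would overflow the chunk:
            if piece <= 4096:       # give up and pad what we have
                break
            piece //= 2             # retry with a smaller input piece
            continue
        out.append(comp)
        offset += min(piece, len(data) - offset)
        space -= len(comp)
    body = b"".join(out)
    return body + b"\0" * (CHUNK_SIZE - HEADER_SIZE - len(body)), offset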
In summary, as shown in Figure 1, imagezip uses knowledge of filesystem types as well as conventional zlib compression to compress disk images. Images are segmented into self-describing 1MB chunks, each with independently compressed data.
A compressed disk image in the Frisbee system is just a regular, albeit potentially very large, file and thus can be distributed in any number of ways, such as via scp or NFS, and then installed using the imageunzip command line program described in Section 4.4.
In a local area network environment, a more efficient and scalable way of image distribution is to use the Frisbee protocol as implemented in the frisbeed server and frisbee client. Frisbeed accepts request messages from multiple Frisbee clients and uses UDP on IP multicast to transfer an image. Each frisbee client uses the multicast channel to request pieces of the desired image as needed until all pieces have been received, decompressed, and written to the target disk. In the current implementation, each instance of frisbeed serves up a specific disk image using a unique multicast address. The information about which disk image and multicast address to use is communicated out-of-band to the server and clients. In Emulab’s case, the client learns this from the central Netbed database.
The frisbeed server has two threads, one which receives incoming requests and one which processes those requests and multicasts image data to the network. The server receive thread fields three types of messages from clients. JOIN and LEAVE messages bracket a client’s participation in a multicast session. The server’s response to a JOIN message includes the number of blocks in the image.
Clients issue data REQUEST messages containing a chunk number and a block range; typically they request the entire chunk. The server receive thread places block requests on a FIFO work queue, after first merging with any already queued request that overlaps the requested data range.
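The following sketch illustrates how such merging might look; it is not frisbeed’s actual code:

# Sketch of merging an incoming REQUEST with an overlapping request
# already in the FIFO work queue.

def enqueue_request(work_queue, chunk, start, count):
    """work_queue holds [chunk, start, count] entries awaiting transmission."""
    end = start + count
    for req in work_queue:
        if req[0] != chunk:
            continue
        r_start, r_end = req[1], req[1] + req[2]
        if start <= r_end and r_start <= end:        # ranges overlap or touch
            req[1] = min(r_start, start)             # widen the queued request
            req[2] = max(r_end, end) - req[1]
            return
    work_queue.append([chunk, start, count])         # otherwise queue a new entry

queue = []
enqueue_request(queue, chunk=3, start=0, count=512)
enqueue_request(queue, chunk=3, start=256, count=512)    # merged, not appended
print(queue)                                             # [[3, 0, 768]]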
The frisbeed transmit thread loops, pulling requests from the work queue, reading the indicated data from the compressed image file, and multicasting it to the network in Frisbee BLOCK messages. BLOCK messages contain a single 1KB block of data along with identifying chunk and block numbers. Since a request allows for multiple blocks to be specified, a single request from the work queue may generate multiple BLOCK messages.
In our current production system, the server’s network bandwidth consumption is controlled by placing a simple cap on the maximum bandwidth used. Two parameters are used to implement the cap: a burst size and a burst gap. The burst size is the number of BLOCK messages that can be transmitted consecutively without pausing, while the burst gap is the duration of that pause. Ideally, just an inter-packet delay could be used to pace data to the network, but the resolution of UNIX sleep mechanisms is dictated by the resolution of the scheduling clock, which is typically too coarse (1-10ms). Our current values of burst size (16) and gap (2ms) were empirically tuned for our environment. Clearly, this capping mechanism is adequate only on a dedicated server machine in a switched LAN environment, as the server does not adjust its transmission rate in response to network load. The effect of this is shown, and an alternative mechanism discussed, in Section 5.3.
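The pacing loop amounts to the following sketch (illustrative only). With the production values of 16 blocks and a 2ms gap, the payload rate is bounded at a little under 16KB/2ms, or roughly 65Mbps of image data, consistent with the approximately 70Mbps cap once per-packet overhead is included:

# Sketch of the static burst-size/burst-gap bandwidth cap.
import time

BURST_SIZE = 16      # BLOCK messages sent back-to-back
BURST_GAP = 0.002    # seconds of pause between bursts

def transmit_loop(next_block, send):
    """next_block() is assumed to return the next queued BLOCK message,
    or None when the work queue is empty; send() multicasts one message.
    (A real server would block on the work queue instead of polling.)"""
    while True:
        for _ in range(BURST_SIZE):
            msg = next_block()
            if msg is None:
                break
            send(msg)
        time.sleep(BURST_GAP)   # resolution limited by the scheduling clock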
The frisbee client is structured as three threads in order to overlap network I/O, disk I/O, and decompression. The network thread, whose basic operation is shown in Figure 2, is responsible for retrieving BLOCK messages multicast by the server, accumulating the contained data blocks into complete chunks, and queuing those chunks for processing by the decompression thread. The network thread also ensures that data arrives in a timely fashion by issuing REQUEST messages for needed chunks and blocks. The decompression thread dequeues completed chunks, decompresses the data and, using the index information from the chunk header, queues variable-sized disk write requests. The disk thread dequeues those requests and performs the actual disk write operations. Once all chunks have been written to disk, the client exits. The remainder of this section focuses on the acquisition of data via the Frisbee protocol.
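Structurally, the client pipeline resembles the following Python sketch (the real client is a multithreaded C program):

# Sketch of the three-stage client pipeline: queues decouple the stages so
# decompression overlaps the slow, blocking disk writes.
import queue
import threading
import zlib

chunk_q = queue.Queue(maxsize=64)    # completed chunks awaiting decompression
write_q = queue.Queue(maxsize=256)   # (disk offset, raw data) awaiting disk writes

def decompress_thread():
    while True:
        chunk = chunk_q.get()             # assumed: list of (offset, zlib data)
        if chunk is None:                 # end-of-image sentinel
            write_q.put(None)
            return
        for offset, comp in chunk:
            write_q.put((offset, zlib.decompress(comp)))

def disk_thread(disk):
    while True:
        item = write_q.get()
        if item is None:
            return
        offset, data = item
        disk.seek(offset)                 # free ranges between writes are skipped
        disk.write(data)

# Wiring, with a hypothetical raw disk device:
#   threading.Thread(target=decompress_thread).start()
#   threading.Thread(target=disk_thread, args=(open("/dev/da0", "r+b"),)).start()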
A frisbee client will of course receive not only blocks it has explicitly requested, but those that other clients have requested as well. Ideally, frisbee would be able to save all such blocks. However, since blocks for a given chunk must be kept until the entire 1MB chunk has been received, and a compressed image may be hundreds to thousands of megabytes, this is not practical. Thus, frisbee maintains a cache of chunks for which it has received one or more blocks, discarding incoming data for other chunks when the cache is full. Currently, the size of this cache (typically 64MB) is configured via a command line parameter and is fixed for the duration of the client run.
The client keeps a timestamp for each outstanding chunk in the image, recording when it last issued a request for the chunk or observed another client’s request for it. The timestamp prevents the client from re-requesting data too soon. Before a partial or full chunk request is made, the client verifies that no client has requested the same chunk recently. Frisbee can track other clients’ requests because all client-initiated messages (JOIN, LEAVE, and REQUEST) are multicast.
After a client joins a session, it sends one or more REQUEST messages to start the transfer process. Instead of having each client request chunks in sequential order, clients randomize their initial request list. This prevents the clients from synchronizing, requesting the same chunks at the same time, which would cause Frisbee’s NAK-avoidance to perform less well. Each client is allowed to “request ahead” a fixed number of 1MB chunks.
Once a client has started and made its initial chunk requests, there are two situations in which it may make additional requests: when it has just completed a chunk and handed it off to the decompression thread, or when it hasn’t seen any packets (messages) for some period of time. The former represents the normal operation cycle: a client receives chunks, decompresses and writes them out, and makes further requests. When requesting new data following chunk completion, the first priority is completing any chunks for which some blocks have already been received. For each incomplete chunk currently in the client’s cache, that chunk’s timestamp is checked and, if it has been long enough since the chunk was last requested, the client issues a partial-chunk request to retrieve missing data for that chunk. Prioritizing partial-chunk requests over those for new chunks helps keep the decompression and disk threads busy and flushes data from the cache buffers sooner, making space available for new chunks. After handling partial-chunk requests, the client may also issue one or more full chunk requests to fill its request-ahead window.
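This request logic can be sketched as follows; the data structures and names are ours, for illustration only:

# Sketch of the chunk-completion request path: re-request missing blocks
# of partially cached chunks first (respecting the timestamp check), then
# top up the request-ahead window with new chunks.
import time

RESEND_LIMIT = 1.0        # seconds before a chunk may be re-requested
REQUEST_AHEAD = 2         # full chunks kept outstanding

def on_chunk_completed(cache, last_requested, pending, todo, send_request):
    """cache maps chunk -> list of missing block ranges; last_requested maps
    chunk -> time of the last request seen for it (ours or another client's);
    pending is the set of outstanding full-chunk requests; todo is the
    client's randomized list of chunks not yet requested."""
    now = time.time()
    for chunk, missing in cache.items():             # partially received chunks
        if missing and now - last_requested.get(chunk, 0) > RESEND_LIMIT:
            send_request(chunk, missing)             # partial-chunk request
            last_requested[chunk] = now
    while len(pending) < REQUEST_AHEAD and todo:     # top up request-ahead window
        chunk = todo.pop()
        send_request(chunk, None)                    # full-chunk request
        pending.add(chunk)
        last_requested[chunk] = now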
If the client is initiating a request due to a receiver timeout, the request process is similar to the chunk-completed case: partial-chunk requests followed by request-ahead chunk requests. The difference is that, in the timeout case, the chunk timestamp is not consulted; the requests are made regardless of when the chunks were last requested. The reasoning is that a timeout indicates a significant packet loss event between this client and the server (and other clients), so even recent requests are likely to have been lost. To prevent flooding the network with requests in the event of a prolonged disconnection from the server, for example a server crash, clients back off exponentially on their requests.
Images are installed on a disk by one of two client programs. One is the frisbee client discussed above; the other is a simple program called imageunzip. They differ only in how they obtain the image: imageunzip reads an image out of a file while frisbee uses the Frisbee protocol to obtain it from the network. Both clients share the code used to decompress the data and write it out to disk. This section describes the operation of that common code.
Since the disk image is broken into independent 1MB chunks, the decompression code is invoked repeatedly, once for each chunk. For each chunk, the header is read to obtain ranges of allocated blocks contained in the chunk. For each allocated range, the indicated amount of data is decompressed from the chunk and queued for the disk writer thread to write to the appropriate location. The separation of decompression and disk I/O allows a great deal of overlap since raw disk I/O in FreeBSD is blocking. For free areas between ranges, the client can either skip them or fill them with zeros. The former is the default behavior and speeds the installation process dramatically in images with a large proportion of free space. However, this method may be inappropriate in some environments since it can “leak” information from the previous disk image to the current. For example, in Emulab where machines are time shared between experiments, some users may wish to have all their data “wiped” from their machines when they are done. For these environments, the installation client can be directed to zero-fill free space.
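The skip-versus-zero-fill choice reduces to something like the following sketch (illustrative; the sector size and interfaces are assumptions):

# Sketch: write the allocated ranges from a decompressed chunk, and either
# seek over the free gaps between them (fast, may leak old data) or zero them.

SECTOR = 512   # assumed sector size

def install_ranges(disk, ranges, zero_free=False):
    """ranges is assumed to be a sorted list of (start_sector, data) pairs,
    with data a whole number of sectors; disk is a seekable file object."""
    pos = 0
    for start, data in ranges:
        if zero_free and start > pos:
            disk.seek(pos * SECTOR)
            disk.write(b"\0" * (start - pos) * SECTOR)   # wipe the free gap
        disk.seek(start * SECTOR)
        disk.write(data)
        pos = start + len(data) // SECTOR

# Default (zero_free=False): free gaps are simply seeked over.
# e.g. install_ranges(open("/dev/da0s1", "r+b"), ranges, zero_free=True)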
In this section, we evaluate the performance of our disk imaging and loading system, testing the speed of individual parts, as well as the entire system, with a variety of disk image properties, client counts and network conditions. Furthermore, we compare the performance of our system with a similar popular commercial offering and with a differential update program.
For our tests, we use one or more of a standard set of three test images. Our “small” image is a typical clean installation of FreeBSD on the FFS filesystem, which uses 642MB (21%) of a 3067MB filesystem. Our “large” image is a similar installation of FreeBSD that contains additional files typically found on a desktop workstation, such as several large source trees, compressed source archives, build trees, and additional binary packages; this image uses 1776MB (58%) of the available filesystem space. For comparison with Symantec Ghost, which performs best with NTFS filesystems, we used our “XP” image, which is a typical clean installation of Microsoft Windows XP Professional Edition. It uses 990MB of data with a 384MB swap file and 520MB “hibernation” file for a total of 1894MB (46%) of disk space in a 4094MB filesystem. All tests were performed on Emulab.
By comparing the first two columns of Table 2, we see significant savings from using filesystem-aware compression: the small image, with 80% free space, sees a time savings of 78% over the naively compressed image. From the last two rows, where there is only a single thread and thus no overlap of decompression and disk writing, we see that disk write speed is the limiting factor. Writing to disk accounts for 55-60% of the total time. Since disk writes are synchronous and the majority of the time is spent waiting, decompressing in parallel effectively hides much of its cost. This is demonstrated in the difference between the single- and multi-threaded results, in which the multithreaded case is up to 40% faster.
Scaling: To show Frisbee’s speed and scalability, we ran a number of tests, reloading sets of clients ranging in number from 1 to 80. During these tests, all clients began loading at the same time. Figure 3 shows the average client runtime for the small and large images using both naive and filesystem-aware compression. The minimum and maximum times are indicated with error bars, but the variance is so low that they are not identifiable at the default magnification. Frisbee is fast and scalable: it loads the small image onto one node in 33.8 seconds, and onto 80 nodes in 33.6 seconds. It loads the naively-compressed images in nearly constant time. For the filesystem-aware large image, the runtime does increase slowly: Frisbee loads 1 node in 94 seconds and 80 nodes in 102 seconds. The reason for this difference remains to be explored; we suspect that the fraction of partial requests may be increasing, or the clients’ chunk buffers may be filling up.
Across all runs, Frisbee’s network efficiency is very good; the number of duplicate blocks transmitted due to packet loss or duplicate requests did not exceed 8% of the total blocks sent. Note the nearly identical runtimes for the two naively compressed images, despite the nearly 50% difference in their compressed sizes. Since both must write a full 3GB of data to disk, this demonstrates that the disk is indeed the bottleneck on these machines.
Figure 4 shows the average number of control messages (JOIN, REQUEST, and LEAVE messages) received by the server per second. Since the control messages are at most 152 bytes, the peak number of messages per second shown in this graph, 127, represents at most 154Kbps of upstream traffic to the server. If the linear scaling shown in this graph holds for larger client counts, control message traffic should not run into packet rate or bandwidth limitations on a 100Mbps LAN until we reach tens of thousands of clients.
One thing to note in these graphs is that the maximum node runtime remains flat even as the control message traffic rises. This is because the frisbeed server merges redundant requests in its work queue. For example, in the worst case at 127 messages per second, over 93% of the REQUEST messages were at least partially redundant. This indicates that there is considerable opportunity for improvement in the NAK avoidance strategy, a topic discussed later.
Another important result is that load times with Frisbee are very similar to the load times reported in Table 2 for imageunzip: Frisbee is able to keep the disk-writing thread supplied with data at a high enough rate that network transfer rate is not the bottleneck. With respect to supplying the disk writer, Frisbee’s multicast distribution provides nearly the same level of performance as reading from local RAM on the client, and maintains this performance for a large number of clients.
As one might expect, skewing requests results in redundant block transfers. Late-joining clients miss the blocks requested by earlier clients and thus request them again. This, in turn, stalls the early-joining clients when the server sends redundant data. As a result, late-joining clients tend to finish significantly faster. The Frisbee client could be made more fair by making its request behavior less aggressive. Currently, the client issues its own requests even if it is constantly receiving sufficient data to keep it busy. Making the client more passive, requesting data only when not making forward progress, would restore fairness. That change risks further increasing a client’s runtime if the client falls idle because it does not transition quickly enough to making its own requests. However, even in the current state, we consider the ability to introduce clients into a Frisbee run at any time to be well worth the increases in client runtime and resource consumption.
Packet Loss: In general, we do not expect Frisbee to have to contend with packet loss, since its target environment is a switched LAN in which the receivers are dedicated clients. However, packet loss can still occur if the server or switch is overloaded. To investigate Frisbee’s behavior in the presence of packet loss, we loaded the large image on 1 to 80 clients with packet loss rates of 0%, 1% and 10%. Packet drops were done at the server; since Frisbee clients are running only Frisbee, there will be no contention for their links. Figure 5 summarizes the results. Packet loss makes Frisbee more sensitive to the number of clients, and there is definitely room for improvement, since with a large number of nodes, the 1% packet loss case performs similarly to the 10% case. Still, performance is clearly acceptable for what we expect to be a rare occurrence. It is interesting to note that a single client performs worse with loss than do multiple clients. When there is a single client, and a REQUEST message it sends to the server is lost, a timeout must pass before it will ask again. During that time, the client is idle. When there are multiple clients, blocks sent as the result of their requests will enable the first client to make progress until its timeout period expires.
The top half of the table shows runs using the default server network bandwidth of 70Mbps, a value tuned to efficiently support the 600-850MHz class of machines in Emulab. Here we see that runtimes are very similar for all combinations. However, the lack of improvement by the pc2000s is because they are throttled by the network bandwidth, not by the presence of slower machines. This is illustrated in the lower half of the table, where the server bandwidth was increased to 90Mbps. At this rate, a set of four pc2000s is able to load the image 23% faster, while a set of four pc600s takes 12% longer. In this configuration, we do see an effect when combining the two types. Combining a single pc600 with three pc2000s slows the faster machines, increasing their runtime to 81.6 seconds, while the slower machine’s runtime remains unchanged. With three pc600s and a single pc2000, the latter is further slowed to 93.7 seconds, with little change for the pc600s. This slowdown is directly attributable to the increase in duplicate data caused by the slower machines’ re-request messages. While not shown in the table, the duplicate data rate tops out at 35% with eight pc600s. At this rate, the pc2000 continues to run faster than the pc600s, taking 102 seconds versus 112 for the slower machines.
NAK Avoidance: We ran Frisbee with its NAK avoidance features, snooping on control messages and time-limiting of re-requests, disabled. With 80 clients, the message received rate at the server increased dramatically, from 85 per second to 264 for the small image, and from 146 per second to 639 for the large image.
As noted earlier, the NAK avoidance features still seem to allow a large number of spurious control messages, which are then ignored by the server. These messages are the result of using a static time limit (one second) for re-requests. When the limit is changed to two seconds, the request rate is reduced to 47 per second for the small image and 84 for the large image. However, blindly increasing the static value can result in increased client runtime when small numbers of nodes are involved and messages are truly lost. Ideally, we need to take into account the transfer rate of the server and the length of the server’s work queue (which varies with the number of active clients), both of which affect the latency of an individual request. A dynamic time limit could be implemented by having the server piggyback current bandwidth and queue length information on BLOCK messages. Clients would use that information to calculate a more appropriate re-request rate.
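One possible formulation of such a dynamic limit is sketched below; this is a hypothetical calculation, not something the current implementation does:

# Hypothetical dynamic re-request limit: wait at least as long as the
# server would need to drain its current work queue, plus some slack.

def rerequest_limit(queue_len_chunks, server_rate_bytes, chunk_size=1 << 20,
                    slack=1.5):
    """Estimate how long a queued request could legitimately take to be
    served (queue length over send rate), plus slack, before re-requesting."""
    drain_time = queue_len_chunks * chunk_size / server_rate_bytes
    return slack * max(drain_time, 0.5)     # never drop below half a second

# e.g. 20 queued chunks at 7.7MB/s of image data -> wait about 4 seconds
print(round(rerequest_limit(20, 7.7e6), 1))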
As frisbeed essentially just moves data from the disk to the network, we would expect the use of all three resources to increase with the number of BLOCK messages processed. As seen in the client performance measurements, increased requests most commonly occur when client startup is skewed or there is significant packet loss, causing the server to resend data. Table 5 details the run time, CPU use and amount of data transferred from disk to network for the skewed client experiment reported in Table 3. The rate of CPU and disk use is bounded by the network send rate which, as mentioned in Section 4.3.1, is controlled by a simple static bandwidth cap. The value of 70Mbits/sec used in our evaluation, which includes all network overhead, translates to 7.7MB per second of image data. In the table we can see that as the data transfer rate approaches this value, CPU use does not exceed 15%.
More problematic is the multiple server scenario. With no provision for dynamically altering bandwidth consumption, resource use is additive in the number of running servers. Table 6 demonstrates this effect as we run from one to eight frisbeed instances to load 80 client nodes. Even with a 1000Mbps link from the server, at two frisbeeds we are near the 100Mbps limit of the client links and the switch begins to drop packets. By eight frisbeeds, the CPU is saturated. Moreover, not shown in this table is the lack of fairness between servers. For example, in the four-server case one finished in 35 seconds while the other three took longer than 60 seconds.
While we can tolerate this behavior in the current Emulab, where server and switch resources are plentiful, a better solution is needed. Recently we have prototyped a rate-based pacing mechanism so that frisbeed will adapt to network load. We use a simple additive-increase, multiplicative-decrease algorithm which dynamically adjusts the burst size based on the number of lost blocks. The key to calculating the latter is that the server can treat any partial chunk request as indicating a lost packet. Results for this version of frisbeed are mixed, with two servers quickly adapting to each take half the 100Mbps bandwidth, but with four and eight servers wildly oscillating. We believe the latter is merely a consequence of our simplistic rate equation and not a reflection on the Frisbee protocol and its ability to detect loss events.
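The burst-size adjustment itself is simple; the following sketch shows the general shape of such an AIMD rule, with illustrative constants rather than the prototype’s actual parameters:

def adjust_burst(burst, lost_blocks, min_burst=1, max_burst=64):
    """AIMD adjustment of the server burst size; blocks named in partial-chunk
    requests since the last adjustment are counted as lost."""
    if lost_blocks > 0:
        return max(min_burst, burst // 2)   # multiplicative decrease on loss
    return min(max_burst, burst + 1)        # additive increase when loss-free

burst = 16
for lost in (0, 0, 5, 0):                   # hypothetical loss observations
    burst = adjust_burst(burst, lost)
print(burst)                                # 16 -> 17 -> 18 -> 9 -> 10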
To get some idea of the speed of our disk imaging approach compared to differential file-based approaches, we ran rsync on the three filesystems in our small image, with essentially no changes between source and target machines. We configured rsync to identify changed files but not to update any. This is a best-case test for rsync, since its runtime strongly depends on the amount of data it must copy.
We found that, when identifying changed files based solely on timestamps, rsync is approximately three times faster than Frisbee: it took 12 seconds to compare two machines vs. Frisbee’s 34 seconds to blindly write the same image. Security and robustness concerns, however, prevent us from using timestamps as an accurate way of comparing files, since they are not reliable when experimenters have full root access. When rsync performs MD4 hashes on all files to find differences, its runtime increases to 170 seconds, five times longer than Frisbee. Given our static disk distribution needs, some domain-specific optimizations to rsync should be possible. For example, while it must always hash the target disk’s files, the server can cache the hashes of its unchanging source disk. In the above test, server-side hashing accounted for approximately 60 seconds of rsync’s runtime and is serialized with client-side processing. Therefore, this optimization should reduce rsync’s runtime to three times Frisbee’s.
However, these small tests still demonstrate that, on a fast distribution network where bandwidth is not the major bottleneck, and with disk contents such as ours, it is unnecessary to spend time identifying changed files. It is faster simply to copy the entire disk.
We compared Frisbee to one of the most popular commercial disk imaging packages, Symantec Ghost. Ghost has a similar feature set, including filesystem-specific compression and multicast distribution. Ghost’s “high” compression setting (level four out of nine) appears substantially similar to imagezip’s compression (using zlib level four). We used the Windows XP disk image for this comparison, which imagezip compressed to 575MB and Ghost compressed to 594MB. Both Ghost and Frisbee have the ability to skip the swap and hibernation files, whose contents do not need to be preserved.
Figure 6 shows Frisbee and Ghost load times on 1 to 25 clients, with no packet loss and with a 1% loss rate. Since Ghost is a commercial product with per-client licensing, the maximum number of clients we tested was limited by licensing costs. Still, clear trends are visible: Ghost’s base (one-client) time of 156 seconds is nearly twice as high as Frisbee’s 81 seconds, and it increases with the number of clients to 369 seconds, while Frisbee’s grows only 5% to 85 seconds. Frisbee exhibits excellent tolerance to 1% packet loss. The extremely poor behavior of Ghost in the presence of packet loss is remarkable, and bears further investigation.
An important difference between Frisbee and Ghost is that Frisbee allows new clients to connect while other clients are in the process of receiving an image. Ghost, on the other hand, requires all clients to start simultaneously. This substantially impacts the latency of the system, as all clients must wait for the slowest to begin, and clients that wish to join after a session has been started must first wait for the ongoing session to finish. One can work around this restriction by starting a new Ghost session for the same image, with the downside of unnecessarily increasing network traffic.
Partition Image [16] is an open-source program for creating and restoring disk partition images. Like Frisbee, it uses filesystem-aware compression in conjunction with conventional compression to reduce the size of the image and accelerate image distribution and installation. Partition Image currently supports a larger set of recognized filesystem types. Unlike Frisbee, images are compressed as a single unit and thus the image must be decompressed sequentially. Partition Image also does not support creating complete disk images with multiple partitions. Partition Image uses a stream-oriented unicast protocol with optional encryption. Thus it will not scale as well as Frisbee’s multicast protocol, but will work unchanged in a wide-area network environment. The Partition Image client can both save and restore images over the network, whereas Frisbee currently has no built-in mechanism for saving images across the network.
HCP [18] is a hybrid technique for synchronizing disks, using a form of differential updating, but below the file level. HCP is one method used in Stanford’s Collective project to copy virtual disks (“capsules”) between machines. In HCP, a cryptographic hash is used to identify blocks in the client and server disks. To synchronize a block between the two, the client first requests the hash for the desired block and compares that to the hash for all blocks in all local virtual disks. If any local block matches the hash, that block is used to provide the data; otherwise the actual block data is obtained from the server. HCP takes advantage of the high degree of similarity between the multiple virtual disks that could reside on any client and the fact that the same virtual disk will tend to migrate back and forth between a small set of machines. Still, as noted by the authors, HCP is only appropriate in environments where the network is the bottleneck, due to increased disk seek activity on the client.
Numerous other multicast protocols for bulk data transfer have been proposed, such as SRM [6] and RMTP [11]. Frisbee’s target environment, high-speed, low packet loss, low-latency LANs, allows a much simpler protocol, which can be optimized for very high throughput. In the taxonomy of known multicast protocols presented in [10], the Frisbee protocol is considered a RINA (Receiver Initiated with NAK-Avoidance) protocol.
Extending the Frisbee system from a LAN environment into the wide area presents an interesting challenge. In addition to its Emulab cluster, Netbed manages a number of nodes at sites around the world. Currently, images compressed by imagezip are distributed via unicast and installed with imageunzip, but this will clearly not scale for a large number of nodes or frequent image distribution. Extending disk loading to the wide area will assuredly raise issues that are not present in our LAN environment. Some of these issues, such as differing client bandwidths and TCP-friendliness, have been the subjects of extensive research, and we will undoubtedly be able to leverage this work. For example, techniques such as those employed by Digital Fountain [1] or WEBRC [12] may be useful. Digital Fountain uses a multicast protocol based on erasure codes [13] to create a large-scale software distribution system. WEBRC obtains an estimate of multicast RTT for flow control and TCP friendliness, and uses multiple multicast streams and a fluid model to serve clients of differing bandwidths.
When sending data in the wide area, security is also a concern—while it is acceptable to send images unencrypted and unauthenticated on a tightly-controlled LAN, care will have to be taken in the wide area to ensure that eavesdroppers cannot obtain a copy of sensitive data on the image, or alter disk contents.
We have presented Frisbee, a fast and scalable system for disk image generation, distribution in local area networks, and installation. We summarized our target application domain and have shown how aspects of that domain governed our choices in designing the system. As well as discussing our use of established techniques, we have explained our methods of filesystem-aware compression and two-level segmentation, and how they are particularly well-suited to our multicast file transfer protocol. Finally, we have shown that this system exceeds our performance requirements and scales remarkably well to a large number of clients.
Many thanks to Kirk Webb for gathering important performance results, to Russ Christensen for implementing NTFS compression, to Dave Andersen for implementing an early unicast disk imager in the OSKit, and to the anonymous reviewers for their useful feedback.
[1] J. W. Byers, M. Luby, M. Mitzenmacher, and A. Rege. A Digital Fountain Approach to Reliable Distribution of Bulk Data. In Proc. of ACM SIGCOMM ’98, pages 56-67, Vancouver, BC, 1998.
[2] R. Canetti, J. Garay, G. Itkis, D. Micciancio, M. Naor, and B. Pinkas. Multicast Security: A Taxonomy and Some Efficient Constructions. In Proc. of INFOCOM ’99, pages 708-716, Mar. 1999.
[3] D. D. Clark and D. L. Tennenhouse. Architectural Considerations for a New Generation of Protocols. In Proc. of SIGCOMM ’90, pages 200-208, 1990.
[4] P. L. Deutsch and J.-L. Gailly. ZLIB Compressed Data Format Specification version 3.3. Internet Request for Comments 1950, IETF, May 1996.
[5] FIPS 180-1. Secure Hash Standard. U.S. Department of Commerce/N.I.S.T., National Technical Information Service, Springfield, VA, Apr. 1995.
[6] S. Floyd, V. Jacobson, C.-G. Liu, S. McCanne, and L. Zhang. A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing. IEEE/ACM Transactions on Networking, 5(6):783-803, Dec. 1997.
[7] Symantec Ghost. http://www.symantec.com/sabu/ghost/.
[8] M. Handley et al. The Reliable Multicast Design Space for Bulk Data Transfer. Internet Request For Comments 2887, IETF, Aug. 2000.
[9] IBM Corp. The Océano Project. http://www.research.ibm.com/oceanoproject/.
[10] B. N. Levine and J. J. Garcia-Luna-Aceves. A Comparison of Known Classes of Reliable Multicast Protocols. In Proc. of IEEE ICNP ’96, pages 112-123, Oct. 1996.
[11] J. C. Lin and S. Paul. RMTP: A Reliable Multicast Transport Protocol. In Proc. of INFOCOM ’96, pages 1414-1424, San Francisco, CA, Mar. 1996.
[12] M. Luby, V. K. Goyal, S. Skaria, and G. B. Horn. Wave and Equation Based Rate Control Using Multicast Round Trip Time. In Proc. of ACM SIGCOMM ’02, pages 191-204, Aug. 2002.
[13] A. J. McAuley. Reliable Broadband Communication Using a Burst Erasure Correcting Code. In Proc. of ACM SIGCOMM ’90, pages 297-306, Philadelphia, PA, Sept. 1990.
[14] J. Moore and J. Chase. Cluster On Demand. Technical Report CS-2002-07, Duke University, Dept. of Computer Science, May 2002.
[15] A. Muthitacharoen, B. Chen, and D. Mazières. A Low-bandwidth Network File System. In Proc. of the 18th ACM Symposium on Operating Systems Principles, pages 174-187, Banff, Alberta, Canada, Oct. 2001.
[16] Partition Image for Linux. http://www.partimage.org/.
[17] rsync. http://rsync.samba.org/.
[18] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M. Rosenblum. Optimizing the Migration of Virtual Computers. In Proc. of OSDI ’02, pages 377-390, Boston, MA, Dec. 2002.
[19] D. Towsley, J. Kurose, and S. Pingali. A Comparison of Sender-Initiated and Receiver-Initiated Reliable Multicast Protocols. IEEE Journal on Selected Areas in Communications, 13(3), Apr. 1997.
[20] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar. An Integrated Experimental Environment for Distributed Systems and Networks. In Proc. of OSDI ’02, pages 255-270, Boston, MA, Dec. 2002.
[21] zlib: A Massively Spiffy Yet Delicately Unobtrusive Compression Library. http://www.gzip.org/zlib/.