
Enhancements to the Fast Filesystem To Support Multi-Terabyte Storage Systems

Marshall Kirk McKusick
Author and Consultant

Abstract

This paper describes a new version of the fast filesystem, UFS2, designed to run on multi-terabyte storage systems. It gives the motivation behind coming up with a new on-disk format rather than trying to continue enhancing the existing fast-filesystem format. It describes the new features and capabilities in UFS2 including extended attributes, new and higher resolution time stamps, dynamically allocated inodes, and an expanded boot block area. It also describes the features and capabilities that were considered but rejected, giving the reasons for their rejection. Next it covers changes that were made to the soft updates code to support the new capabilities and to enable it to work more smoothly with existing filesystems. The paper covers enhancements made to support live dumps and changes made to filesystem snapshots needed to avoid deadlocks and to enable them to work efficiently with multi-terabyte filesystems. Similarly, it describes changes that needed to be made to the filesystem check program to work with large filesystems. The paper gives some comments about performance, and describes areas for future work including an extent-based allocation mechanism and indexed directory structures. The paper concludes with the current status and availability of UFS2.

1. Background and Introduction

Traditionally, the BSD fast filesystem (which we shall refer to in this paper as UFS1) [McKusick et al, 1996; McKusick, Joy et al, 1984] and its derivatives have used 32-bit pointers to reference the blocks used by a file on the disk. The UFS1 filesystem was designed in the early 1980s when the largest disks were 330 megabytes. There was debate at the time whether it was worth squandering 32 bits per block pointer rather than using the 24-bit block pointers of the filesystem that it replaced. Luckily the futurist view prevailed and the design used 32-bit block pointers. Over the twenty years that it has been deployed, storage systems have grown to hold over a terabyte of data. Depending on the block size configuration, the 32-bit block pointers of UFS1 run out of space in the 1 to 4 terabyte range. While some stop-gap measures can be used to extend the maximum size of storage systems supported by UFS1, by 2002 it became clear that the only long-term solution was to use 64-bit block pointers. Thus, we decided to build a new filesystem, UFS2, that would use 64-bit block pointers.

We considered the alternatives of trying to make incremental changes to the existing UFS1 filesystem versus importing another existing filesystem such as XFS [Sweeney et al, 1996] or ReiserFS [Reiser, 2001]. We also considered writing a new filesystem from scratch so that we could take advantage of recent filesystem research and experience. We chose to extend the UFS1 filesystem as this approach allowed us to reuse most of the existing UFS1 code base. The benefits of this decision were that UFS2 was developed and deployed quickly, it became stable and reliable rapidly, and the same code base could be used to support both UFS1 and UFS2 filesystem formats. Over 90 percent of the code base is shared, thus bug fixes and feature or performance enhancements usually apply to both filesystem formats.

Sections 2, 3, and 4 discuss the UFS2 filesystem itself. Sections 5 and 6 discuss enhancements that were made during the development of UFS2 but which transfer over to UFS1 as well. Sections 7 and 8 describe how we overcame problems of scale brought on by the enormous size of filesystems supported by UFS2. The last three sections conclude with discussions of performance, future work, and current status.

2. The UFS2 Filesystem

The on-disk inodes used by UFS1 are 128 bytes in size and have only two unused 32-bit fields. It would not be possible to convert to 64-bit block pointers without reducing the number of direct block pointers from twelve to five. Doing so would dramatically increase the amount of wasted space as only direct block pointers can reference fragments. So, the only viable alternative is to increase the size of the on-disk inode to 256 bytes.

Once one is committed to changing to a new on-disk format for the inodes, it is possible to include other inode-related changes that were not possible within the constraints of the old inodes. While it is tempting to throw in everything that has ever been suggested over the last twenty years, we feel that it is best to limit the addition of new capabilities to those that are likely to have a clear benefit. Every new addition adds complexity which has a cost both in maintainability and performance. Obscure or little-used features may add conditional checks in frequently executed code paths such as read and write, slowing down the overall performance of the filesystem even if they are not used.

Although we decided to come up with a new on-disk inode format, we chose not to change the format of the superblock, the cylinder group maps, or the directories. Additional information needed for the UFS2 superblock and cylinder groups is stored in spare fields of the UFS1 superblock and cylinder groups. Maintaining the same format for these data structures allows a single code base to be used for both UFS1 and UFS2. Because the only difference between the two filesystems is in the format of their inodes, code can dereference pointers to superblocks, cylinder groups, and directory entries without needing to check what type of filesystem is being accessed. To minimize conditional checking of code that references inodes, the on-disk inode is converted to a common in-core format when the inode is first read in from the disk, and converted back to its on-disk format when it is written back. The effect of this decision is that there are only nine out of several hundred routines that are specific to UFS1 versus UFS2. The benefit of having a single code base for both filesystems is that it dramatically reduces the maintenance cost. Outside of the nine filesystem-format-specific functions, fixing a bug in the code fixes it for both filesystem types. A common code base also means that as symmetric multiprocessing support gets added, it only needs to be done once for the UFS family of filesystems.

Although we still use the same data structure to describe cylinder groups, the practical definition of them has changed. In the era of UFS1, the filesystem could get an accurate view of the disk geometry including the cylinder and track boundaries and could accurately compute the rotational location of every sector. Modern disks hide this information, providing fictitious numbers of blocks per track, tracks per cylinder, and cylinders per disk. Indeed, in modern RAID arrays, the "disk" that is presented to the filesystem may really be composed from a collection of disks in the RAID array. While some research has been done to figure out the true geometry of a disk [Griffin et al, 2002; Lumb et al, 2002; Schindler et al, 2002], the complexity of using such information effectively is quite high. Modern disks have greater numbers of sectors per track on the outer part of the disk than the inner part, which makes the rotational position of any given sector quite complex to calculate. So, for UFS2, we decided to get rid of all the rotational layout code found in UFS1 and simply assume that laying out files with numerically close block numbers (sequential being viewed as optimal) would give the best performance. Thus, the cylinder group structure is retained in UFS2, but it is used only as a convenient way to manage logically close groups of blocks. The rotational layout code had been disabled in UFS1 since the late 1980s, so as part of the code base cleanup it was removed entirely.

The UFS1 filesystem uses 32-bit inode numbers. While it is very tempting to increase these inode numbers to 64 bits in UFS2, doing so would require that the directory format be changed. There is a lot of code that works directly on directory entries.

Changing directory formats would entail creating many more filesystem-specific functions, which would increase the complexity of the code and make it harder to maintain. Furthermore, the current APIs for referencing directory entries use 32-bit inode numbers. So, even if the underlying filesystem supported 64-bit inode numbers, they could not currently be made visible to user applications. In the short term, applications are not running into the four-billion-files-per-filesystem limit that 32-bit inode numbers impose. If we assume that the growth rate in the number of files per filesystem over the last twenty years will continue at the same rate, we estimate that the 32-bit inode number should be sufficient for another ten to twenty years. However, that limit will be reached before the 64-bit block limit of UFS2 is reached. So, the UFS2 filesystem has reserved a flag in the superblock to indicate that it is a filesystem with 64-bit inode numbers. When the time comes to begin using 64-bit inode numbers, the flag can be turned on and the new directory format can be used. Kernels that predate the introduction of 64-bit inode numbers check this flag and will know that they cannot mount such filesystems.

Another change that was contemplated was changing to a more complex directory structure such as one that uses B-trees to speed up access for large directories. This technique is used in many modern filesystems such as XFS [Sweeney et al, 1996], JFS [Best & Kleikamp, 2003], ReiserFS [Reiser, 2001], and in later versions of Ext2 [Phillips, 2001]. We decided not to make the change at this time for several reasons. First, we had limited time and resources and we wanted to get something working and stable that could be used in the time frame of FreeBSD 5.0. By keeping the same directory format, we were able to reuse all the directory code from UFS1, did not have to change numerous filesystem utilities to understand and maintain a new directory format, and were able to produce a stable and reliable filesystem in the time frame available to us. The other reason that we felt we could retain the existing directory structure is the dynamic directory hashing that was added to FreeBSD [Dowse & Malone, 2002]. The dynamic directory hashing retrofits a directory indexing system to UFS. To avoid repeated linear searches of large directories, the dynamic directory hashing builds a hash table of directory entries on the fly when the directory is first accessed. This table avoids directory scans on subsequent lookups, creates, and deletes. Unlike filesystems originally designed with large directories in mind, these indices are not saved on disk and so the system is backwards compatible. The effect of the dynamic directory hashing is that large directories in UFS cause minimal performance problems.

Borrowing the technique used by the Ext2 filesystem, a flag was also added to indicate that an on-disk indexing structure is supported for directories [Phillips, 2001]. This flag is unconditionally turned off by the existing implementation of UFS. In the future, if an implementation of an on-disk directory-indexing structure is added, kernels that support it will maintain the indices and leave the flag on. If an old non-index-supporting kernel is run, it will turn off the flag so that when the filesystem is once again run under a new kernel, the new kernel will discover that the indexing flag has been turned off, and will know that the indices may be out of date and have to be rebuilt before being used. The only constraint on an implementation of the indices is that they have to be an auxiliary data structure that references the old linear directory format.

3. Extended Attributes

A major addition in UFS2 is support for extended attributes. Extended attributes are a piece of auxiliary storage associated with an inode that can be used to hold data separate from the contents of the file. The idea is similar to the concept of data forks used in the Apple filesystem [Apple, 2003]. By integrating the extended attributes into the inode itself, it is possible to provide the same integrity guarantees as are made for the contents of the file itself. Specifically, the successful completion of an "fsync" system call ensures that the file data, the extended attributes, and all names and paths leading to the names of the file are in stable store.

The current implementation has space in the inode to store up to two blocks of extended attributes. The new UFS2 inode format had room for up to five additional 64-bit pointers. Thus, the number of extended attribute blocks could have been in the range of one to five blocks. We chose to allocate two blocks to the extended attributes and to leave the other three as spares for future use. Because there are two, all the code has to be prepared to deal with an array of pointers, so if the number is expanded into the remaining spares in the future, the existing implementation will work without change. By saving three spares, we provided a reasonable amount of space for future needs. And, if the decision to allow only two blocks proves to give too little space, one or more of the spares can be used to expand the size of the extended attributes in the future. If vastly more extended attribute space is needed, one of the spares could be used as an indirect pointer to extended-attribute data blocks.


Figure 1: Format of Extended Attributes

Figure 1 shows the format used for the extended attributes. The header of each attribute has a 4-byte length, 1-byte name space class, 1-byte content pad length, 1-byte name length, and name. The name is padded so that the contents start on an 8-byte boundary. The contents are padded to the size shown by the "content pad length" field. Applications that do not understand the name space or name can simply skip over the unknown attribute by adding the length to their current position to get to the next attribute. Thus, many different applications can share the usage of the extended attribute space, even if they do not understand each other's data types.

The first of two initial uses for extended attributes is to support access control lists, generally referred to as ACLs. An ACL replaces the group permissions for a file with a more specific list of the users that are permitted to access the file along with a list of the permissions that they are granted. These permissions include the traditional read, write, and execute permissions along with other properties such as the right to rename or delete the file [Rhodes, 2003].

Earlier implementations of ACLs were done with a single auxiliary file per filesystem that was indexed by inode number and had a small and fixed-size area to store the ACL permissions. The size was kept small to keep the size of the auxiliary file reasonable, since it had to have space for every possible inode in the filesystem. There were two problems with this implementation. The fixed size of the space per inode to store the ACL information meant that it was not possible to give access to long lists of users. The second problem was that it was difficult to atomically commit changes to the ACL list for a file, since an update requires that both the file inode and the ACL file be written to have the update take effect [Watson, 2000].

Both problems with the auxiliary-file implementation of ACLs are fixed by storing the ACL information directly in the extended-attribute data area of the inode. Because of the large size of the extended attribute data area (a minimum of 8 kilobytes and typically 32 kilobytes), long lists of ACL information can be easily stored. Space used to store extended attribute information is proportional to the number of inodes with extended attributes and the size of the ACL lists that they use. Atomic update of the information is much easier since writing the inode will update the inode attributes and the set of data that it references, including the extended attributes, in one disk operation. While it would be possible to update the old auxiliary file on every "fsync" system call done on the filesystem, the cost of doing so would be prohibitive. Here, the kernel knows if the extended attribute data block for an inode is dirty and can write just that data block during an "fsync" call on the inode.

The second use for extended attributes is for data labeling. Data labels are used to provide permissions for mandatory access controls (MACs). The kernel provides a MAC framework that permits dynamically introduced system-security modules to modify system security functionality. This framework can be used to support a variety of new security services, including traditional labeled mandatory access control models. The framework provides a series of entry points that are called by code supporting various kernel services, especially with respect to access control points and object creation. The framework then calls out to security modules to offer them the opportunity to modify security behavior at those MAC entry points. Thus, the filesystem does not codify how the labels are used or enforced. It simply stores the labels associated with the inode and produces them when a security module needs to query them to make a permission check [Watson, 2001; Watson et al, 2003].

We considered storing symbolic links in the extended attribute area. We chose not to do this for three reasons. First, the time to access an extended storage block is the same as the time to access a regular data block. Second, since symbolic links rarely have any extended attributes, there would be no savings in storage, since a filesystem fragment would be needed whether the link was stored in a regular data block or in an extended storage block. Third, if the link were stored in the extended storage area, it would take more time to traverse down the attribute list to find it.

4. New Filesystem Capabilities

Several other improvements were made when the enlarged inode format was created. We decided to get an early jump on the year 2038 problem (specifically, Tue Jan 19 03:14:08 2038 GMT, which could be a really ugly way to usher in my 84th birthday). We expanded the time fields (which hold seconds-since-1970) for access, modification, and inode-modification times from 32 bits to 64 bits. At plus or minus 136 billion years, that should carry us from well before the universe was created until long after our Sun has burned itself out. We left the nanoseconds fields for these times at 32 bits as we did not feel that added resolution was going to be useful in the foreseeable future. We considered expanding the time to only 48 bits. We chose to go to 64 bits as 64 bits is a native size that can be easily manipulated with existing and likely future architectures. Using 48 bits would have required an extra unpacking or packing step each time the field was read or written. Also, going to 64 bits ensures enough bits for all likely measured time, so the field will not have to be enlarged.

At the same time we also added a new time field (also 64-bit) to hold the birth time (also commonly called the creation time) of the file. The birth time is set when the inode is first allocated and is not changed thereafter. It has been added to the structure returned by the "stat" system call so that applications can determine its value and so that archiving programs such as dump, tar, and pax can save this value along with the other file times. The birth time was added to a previously spare field in the "stat" system call structure so that the size of the structure did not change. Thus, old versions of programs that use the "stat" call continue to work.

To date, only the dump program has been changed to save the birth time value. This new version of dump, which can dump both UFS1 and UFS2 filesystems, creates a new dump format which is not readable by older versions of restore. The updated version of restore can identify and restore from both old and new dump formats. The birth times are only available and settable from the new dump format.

The "utimes" system call sets the access and modification times of a file to a specified set of values. It is used primarily by archive retrieval programs to set a newly extracted file's times back to those associated with the file in the archive. With the addition of the birth time, we added a new system call that allows the setting of access, modification, and birth times. However, we realized that many existing applications will not be changed to use the new "utimes" system call. The result will be that the files that they retrieve from archives will have a newer birth time than access or modification times.

To provide a sensible birth time for applications that are unaware of the birth time attribute, we changed the semantics of the "utimes" system call so that if the birth time is newer than the value of the modification time that it is setting, it sets the birth time to the same time as the modification time. An application that is aware of the birth time attribute can set both the birth time and the modification time by making two calls to "utimes". First it calls "utimes" with a modification time equal to the saved birth time, then it calls "utimes" a second time with a modification time equal to the (presumably newer) saved modification time. For filesystems that do not store birth times, the second call will overwrite the first call, resulting in the same values for access and modification times as they would have previously gotten. For filesystems that support birth times, the birth time will be properly set. And most happily for the application writers, they will not have to conditionally compile the name of "utimes" for BSD and non-BSD systems. They just write their applications to call the standard interface twice, knowing that the right thing will happen on all systems and filesystems. Applications that value speed of execution over portability can use the new version of the "utimes" system call that allows all time values to be set with one call.

Another incremental change to the inode format was to split the flags field into two separate 32-bit fields, one for flags that can be set by applications (as in UFS1) and a new field for flags maintained strictly by the kernel. An example of a kernel flag is the SNAPSHOT flag used to label a file as being a snapshot. Another kernel-only flag is OPAQUE, which is used by the union filesystem to mark a directory that should not make the layers below it visible. By moving these kernel flags into a separate field, they will not be accidentally set or cleared by a naive or malicious application.

4.1. Dynamic Inodes

One of the common complaints about the UFS1 filesystem is that it preallocates all its inodes at the time that the filesystem is created. For filesystems with millions of files, the initialization of the filesystem can take several hours. Additionally, the filesystem creation program, newfs, had to assume that every filesystem would be filled with many small files and allocate a lot more inodes than were likely to ever be used. If a UFS1 filesystem uses up all its inodes, the only way to get more is to dump, rebuild, and restore the filesystem. The UFS2 filesystem resolves these problems by dynamically allocating its inodes. The usual implementation of dynamically allocated inodes requires a separate filesystem data structure (typically referred to as the inode file) that tracks the current set of inodes. The management and maintenance of this extra data structure adds overhead and complexity and often degrades performance.

To avoid these costs, UFS2 preallocates a range of inode numbers and a set of blocks for each cylinder group. Initially each cylinder group has a single block of inodes allocated (a typical block holds 32 or 64 inodes). When the block fills up, the next block of inodes in the set is allocated and initialized. The set of blocks that may be allocated to inodes is held as part of the free-space reserve until all other space in the filesystem is allocated. Only then can it be used for file data.

In theory a filesystem could fill up, using all the blocks set aside for inodes. Later, after large files had been removed and many small files created to replace them, the filesystem might find itself unable to allocate the needed inodes because all the space set aside for inodes was still in use. Here, it would be necessary to reallocate existing files to move them to new locations outside of the inode area. Such code has not been written as we do not anticipate that this condition will arise in practice, since the free space reserve used on most filesystems (8%) exceeds the amount of space needed for inodes (typically 2-6%). On these systems, only a process running with root privileges would ever be able to allocate the inode blocks. Should the code prove necessary in actual use, it can be written at that time. Until it is written, filesystems hitting this condition will return an "out of inodes" error on attempts to create new files.

One of the side benefits of dynamically allocating inodes is that the time to create a new filesystem in UFS2 is about 1 percent of the time that it takes in UFS1. A filesystem that would take one hour to build in a UFS1 format can be built in under a minute in the UFS2 format. While filesystem creations are not a common operation, having them build quickly does matter to the system administrators that have to do such tasks with some regularity.

The cost of dynamically allocating inodes is one extra disk write for every 64 new inodes that are created. Although this cost is quite low compared to the other costs of creating 64 new files, some system administrators might want to preallocate more than the minimal number of inodes. If such a demand arises, it would be trivial to add a flag to the newfs program to preallocate additional inodes at the time that the filesystem is created.

4.2. Boot Blocks

The UFS1 filesystem reserved an 8 kilobyte space at the beginning of the filesystem in which to put a boot block. While this space seemed huge compared to the 1 kilobyte boot block that it replaced, over time it has gotten increasingly difficult to cram the needed boot code into this space. Consequently we decided to revisit the boot block size in UFS2.

The boot code has a list of locations to check for boot blocks. A boot block can be defined to start at any 8 kilobyte boundary. We set up an initial list with four possible boot block sizes: none, 8 kilobytes, 64 kilobytes, and 256 kilobytes. Each of these sizes was selected for a particular purpose. Filesystems other than the root filesystem do not need to be bootable, and so can use a boot block size of zero. Also, filesystems on tiny media that need every block that they can get, such as floppy disks, can use a zero-size boot block. For architectures with simple boot blocks, the traditional UFS1 8 kilobyte boot block can be used. More typically the 64 kilobyte boot block is used (for example on the PC architecture with its need to support booting from a myriad of busses and disk drivers).

We added the 256 kilobyte boot block in case some architecture or application needs to set aside a particularly large boot area. While this was not strictly necessary, as new sizes can be added to the list at any time, it can take a long time before an updated list gets propagated to all the boot programs and loaders out on existing systems. By adding the option for a huge boot area now, we can ensure it will be readily available should it be needed on short notice in the future.

One of the unexpected side effects of using a 64 kilobyte boot block for UFS2 is that if the partition had previously had a UFS1 filesystem on it, the superblock for the former UFS1 filesystem may not be overwritten. If an old version of fsck that does not first look for a UFS2 filesystem is run and finds the UFS1 superblock, it can incorrectly try to rebuild the UFS1 filesystem, destroying the UFS2 filesystem in the process. So, when building UFS2 filesystems, the newfs utility looks for old UFS1 superblocks and zeros them out.

5. Changes and Enhancements to Soft Updates

Traditionally, filesystem consistency has been maintained across system failures either by using synchronous writes to sequence dependent metadata updates or by using write-ahead logging to atomically group them [Seltzer et al, 2000]. Soft updates, an alternative to these approaches, is an implementation mechanism that tracks and enforces metadata update dependencies to ensure that the disk image is always kept consistent. The use of soft updates obviates the need for a separate log or for most synchronous writes [McKusick & Ganger, 1999].

The addition of extended attribute data to the inode required that the soft updates code be extended so that it could ensure the integrity of these new data blocks. As with the file data blocks, it ensures that the extended data blocks and the bitmaps that show that they are in use are written to disk before they are claimed by the inode. Soft updates also ensure that any updated extended attribute data is committed to disk as part of an "fsync" of the file.

Two important enhancements were made to the existing soft updates implementation. These enhancements were initially made for UFS2, but because of the shared code base with UFS1 they were trivially integrated to work with UFS1 filesystems as well.

When a file is removed on a filesystem running with soft updates, the removal appears to happen very quickly, but the process of removing the file and returning its blocks to the free list may take up to several minutes. Prior to UFS2, the space held by the file did not show up in the filesystem statistics until the removal of the file had been completed. Thus, applications that clean up disk space, such as the news expiration program, would often vastly overshoot their goal. They work by removing files and then checking to see if enough free space has shown up. Because of the time lag in having the free space recorded, they would remove far too many files. To resolve problems of this sort, the soft updates code now maintains a counter that keeps track of the amount of space that is held by the files that it is in the process of removing. This counter of pending space is added to the actual amount of free space as reported by the kernel (and thus by utilities like df). The result of this change is that free space appears immediately after the "unlink" system call returns or the rm utility finishes.

The second and related change to soft updates has to do with avoiding false out-of-space errors. When running with soft updates on a nearly full filesystem with a high turnover rate (for example when installing a whole new set of binaries on a root partition), the filesystem can return a filesystem-full error even though it reports that it has plenty of free space. The filesystem-full message happens because soft updates has not managed to free the space from the old binaries in time for it to be available for the new binaries.

The initial attempt to correct this problem was to simply have the process that wished to allocate space wait for the free space to show up. The problem with this approach is that it often had to wait for up to a minute. In addition to making the application seem intolerably slow, it usually held a locked vnode which could cause other applications to get blocked waiting for it to become available (often referred to as a lock race to the root of the filesystem). Although the condition would clear in a minute or two, users often assumed that their system had hung and would reboot.

To remedy this problem, the solution devised for UFS2 is to co-opt the process that would otherwise be blocked and put it to work helping soft updates process the files to be freed. The more processes trying to allocate space, the more help is available to soft updates and the faster free blocks begin to appear. Usually in under one second enough space shows up that the processes can return to their original task and proceed to completion. The effect of this change is that soft updates can now be used on small nearly full filesystems with high turnover.

6. Enhancements for Live Dumps

A filesystem snapshot is a frozen image of a filesystem at a given instant in time. Snapshots support several important features: the ability to provide back-ups of the filesystem at several times during the day, the ability to do reliable dumps of live filesystems, and the ability to run a filesystem check program on an active system to reclaim lost blocks and inodes [McKusick, 2002].

With the advent of filesystem snapshots, the dump program has been enhanced to safely dump live filesystems. When given the -L flag, dump verifies that it is being asked to dump a mounted filesystem, then takes a snapshot of the filesystem and dumps the snapshot instead of the live filesystem. When dump completes, it releases the snapshot.

The initial implementation of live dumps had the dump program do the "mount" system call itself to take the snapshot. However, most systems require root privilege to use the "mount" system call. Since dumps are often done by the operator user rather than root, an attempt to take a snapshot will fail.

To get around this problem, a new set-user-identifier root program was written called mksnap_ffs. The mksnap_ffs command creates a snapshot with a given name on a specified filesystem. The snapshot file must be contained within the filesystem being snapshotted. The group ownership of the file is set to operator; the owner of the file remains root. The mode of the snapshot is set to be readable by the owner or members of the operator group.

The dump program now invokes mksnap_ffs to create the snapshot rather than trying to create it directly. The result is that anyone with operator privileges can now reliably take live dumps. Allowing operator group access to the snapshot does not open any new security holes, since the raw disk is also readable by members of the operator group (for the benefit of traditional dump). Thus, the information that is available in the snapshot can also be accessed directly through the disk device.

7. Large Filesystem Snapshots

Creating and using a snapshot requires random access to the snapshot file. The creation of a snapshot requires the inspection and copying of all the cylinder group maps. Once in operation, every write operation to the filesystem must check whether the block being written needs to be copied. The information on whether a block needs to be copied is contained in the snapshot file metadata (its indirect blocks). Ideally, this metadata would be resident in kernel memory throughout the lifetime of the snapshot. In FreeBSD, the entire physical memory on the machine can be used to cache file data pages if the memory is not needed for other purposes. Unfortunately, data pages associated with disks can only be cached in pages mapped into the kernel physical memory. Only about 10 megabytes of kernel memory is dedicated to such purposes. Assuming that we allow up to half of this space to be used for any single snapshot, the largest snapshot whose metadata we can hold in memory is 11 gigabytes. Without help, such a tiny cache would be hopeless in trying to support a multi-terabyte snapshot.

In an effort to support multi-terabyte snapshots with the tiny metadata cache available, it is necessary to observe the access patterns on typical filesystems. The snapshot is only consulted for files that are being written. The filesystem is organized around cylinder groups, which map small contiguous areas of the disk. Within a directory, the filesystem tries to allocate all the inodes and files in the same cylinder group. When moving between directories, different cylinder groups are usually inspected. Thus, the widely random behavior occurs from movement between cylinder groups. Once file-writing activity settles down into a cylinder group, only a small amount of snapshot metadata needs to be consulted. That metadata will easily fit in even the tiny kernel metadata cache. So, the need is to find a way to avoid thrashing the cache when moving between cylinder groups.

The technique used to avoid thrashing when moving between cylinder groups is to build a look-aside table of all the blocks that were copied during the time that the snapshot was made. This table lists the blocks associated with all the snapshot metadata blocks, the cylinder group maps, the super block, and blocks that contain active inodes. When a copy-on-write fault occurs for a block, the first step is to consult this table. If the block is found in the table, then no further searching needs to be done in any of the snapshots. If the block is not found, then the metadata of each active snapshot on the filesystem must be consulted to see if a copy is needed. This table lookup saves time as it not only avoids faulting in metadata for widely scattered blocks, but it also avoids the need to consult potentially many snapshots.
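The fast path can be illustrated with a small Python sketch; the table contents and the per-snapshot lookup are invented stand-ins for the real indirect-block metadata.

```python
# Sketch of the copy-on-write fast path. The look-aside table holds blocks
# already copied at snapshot-creation time; only on a miss must each active
# snapshot's (expensive) metadata be consulted.

lookaside = {5, 17, 42}          # blocks copied when the snapshots were made

def snapshot_needs_copy(snap, blkno):
    # stand-in for the real per-snapshot metadata lookup
    return blkno not in snap["copied"]

def copy_on_write(blkno, snapshots):
    if blkno in lookaside:       # fast path: no snapshot need be consulted
        return False
    # slow path: ask each active snapshot whether it needs a copy
    return any(snapshot_needs_copy(s, blkno) for s in snapshots)

snaps = [{"copied": {7}}, {"copied": set()}]
assert copy_on_write(17, snaps) is False   # resolved by the look-aside table
assert copy_on_write(7, snaps) is True     # second snapshot still needs a copy
```

One set-membership test thus short-circuits what would otherwise be a walk over every snapshot's metadata.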


Figure 2: Snapshot deadlock scenario

Another problem with snapshots on large filesystems is that they aggravated existing deadlock problems. When there are multiple snapshots associated with a filesystem, they are kept in a list ordered from oldest to youngest. When a copy-on-write fault occurs, the list is traversed, letting each snapshot decide if it needs to make a copy of the block that is about to be written. Originally, each snapshot inode had its own lock. A deadlock could occur between two processes, each trying to do a write. Consider the example in Fig. 2. It shows a filesystem with two snapshots, snap1 and snap2. Process A holds snapshot 1 locked and process B holds snapshot 2 locked. Both snap1 and snap2 have decided that they need to allocate a new block in which to hold a copy of the block being written by the process that holds them locked. The writing of the new block in snapshot 1 will cause the kernel running in the context of process A to scan the list of snapshots, which will get blocked at snapshot 2 because it is held locked by process B. Meanwhile, the writing of the new block in snapshot 2 will cause the kernel running in the context of process B to scan the list of snapshots, which will get blocked at snapshot 1 because it is held locked by process A.

The resolution of the deadlock problem is to allocate a single lock that is used for all the snapshots on a filesystem. When a new snapshot is created, the kernel checks whether there are any other snapshots on the filesystem. If there are, the per-file lock associated with the new snapshot inode is released and replaced with the lock used for the other snapshots. With only a single lock, access to the snapshots as a whole is serialized. Thus, in Fig. 2, process B will hold the lock for all the snapshots and will be able to make the necessary checks and updates while process A is held waiting. Once process B completes its scan, process A will be able to get access to all the snapshots and will be able to run successfully to completion. Because of the added serialization of the snapshot lookups, the look-aside table described earlier is important for ensuring reasonable performance of snapshots. In gathering statistics on our running systems, we found that the look-aside table resolves nearly half of the snapshot copy-on-write lookups. Thus, we found that the look-aside table keeps the contention for the snapshot lock at a reasonable level.
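The lock-sharing scheme can be sketched as follows (illustrative Python with threading.Lock, not the kernel's vnode locking): the second and later snapshots simply adopt the first snapshot's lock.

```python
# Sketch: all snapshots on a filesystem share one lock, so traversing the
# snapshot list needs a single acquisition and the Fig. 2 deadlock cannot
# arise. Class and field names are invented for illustration.
import threading

class Snapshot:
    def __init__(self, fs_snapshots):
        if fs_snapshots:                       # other snapshots exist:
            self.lock = fs_snapshots[0].lock   # adopt their shared lock
        else:
            self.lock = threading.Lock()       # first snapshot: new lock
        fs_snapshots.append(self)

snapshots = []
snap1 = Snapshot(snapshots)
snap2 = Snapshot(snapshots)
assert snap1.lock is snap2.lock    # one lock serializes all snapshot access

with snap1.lock:                   # a copy-on-write fault takes the one lock,
    for s in snapshots:            # then lets each snapshot decide whether
        pass                       # it needs a copy of the block
```

Because no process ever holds one snapshot's lock while waiting for another's, the circular wait of Fig. 2 is impossible by construction.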

8. Running Fsck on Large Filesystems

Traditionally, after an unclean system shutdown, the filesystem check program, fsck, has had to be run over all the inodes in a filesystem to ascertain which inodes and blocks are in use and to correct the bitmaps. The current implementation of soft updates guarantees the consistency of all filesystem resources, including the inode and block bitmaps. With soft updates, the only inconsistency that can arise in the filesystem (barring software bugs and media failures) is that some unreferenced blocks may not appear in the bitmaps and some inodes may have to have overly high link counts reduced. Thus, it is completely safe to begin using the filesystem after a crash without first running fsck. However, some filesystem space may be lost after each crash. Thus, there is a version of fsck that can run in the background on an active filesystem to find and recover any lost blocks and adjust inodes with overly high link counts. A special case of the overly high link count is one that should be zero. Such an inode will be freed as part of reducing its link count to zero. This garbage collection task is less difficult than it might at first appear, since this version of fsck only needs to identify resources that are not in use and cannot be allocated or accessed by the running system [McKusick & Ganger, 1999].

With the addition of snapshots, the task becomes simple, requiring only minor modifications to the standard fsck. When run in background cleanup mode, fsck starts by taking a snapshot of the filesystem to be checked. Fsck then runs over the snapshot filesystem image doing its usual calculations just as in its normal operation. The only other change comes at the end of its run, when it wants to write out the updated versions of the bitmaps. Here, the modified fsck takes the set of blocks that it finds were in use and removes this set from the set marked as in use in the bitmaps at the time of the snapshot; the difference is the set of lost blocks. It also constructs the list of inodes whose counts need to be adjusted. Fsck then uses a new system call to notify the filesystem of the identified lost blocks so that it can replace them in its bitmaps. It also gives the set of inodes whose link counts need to be adjusted; those inodes whose link count is reduced to zero are truncated to zero length and freed. When fsck completes, it releases its snapshot [McKusick, 2002].
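The end-of-run computation is essentially a set difference; a toy version with invented block numbers:

```python
# Sketch of the lost-block computation: blocks marked in use in the
# snapshot's bitmaps, minus the blocks fsck actually found referenced,
# are the lost blocks to hand back to the filesystem.

marked_in_use = {3, 4, 9, 12, 20}    # per the bitmaps at snapshot time
found_in_use = {3, 4, 12}            # reachable per fsck's scan

lost_blocks = marked_in_use - found_in_use
assert lost_blocks == {9, 20}        # reported via the new system call
```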

As filesystems have gotten bigger, the time to run either a foreground or a background fsck has increased to multiple hours. Being able to run fsck in the background has largely mitigated the running-time issue because it allows normal system operation to proceed in parallel.

Another problem with running fsck on large filesystems is that the memory that it consumes grows in proportion to the size of the filesystem being checked. The main consumers of memory are four bytes per regular inode, 40 to 50 bytes per directory inode, and one bit per filesystem fragment. On a typical UFS2 filesystem with 16-kilobyte blocks and 2-kilobyte fragments, the data-block map requires 64 megabytes of memory per terabyte of filesystem. Because UFS2 does not preallocate inodes, but rather allocates the inodes as they are needed, the memory required is dependent on the number of files that are created in the filesystem.
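The 64-megabyte figure follows directly from the map granularity, assuming one bit per 2-kilobyte fragment:

```python
# Checking the data-block map cost quoted above: one bit per 2-kilobyte
# fragment works out to 64 megabytes of map per terabyte of filesystem.

TB = 2**40
fragment = 2 * 2**10                  # 2-kilobyte fragments
map_bytes = (TB // fragment) // 8     # one bit per fragment
assert map_bytes == 64 * 2**20        # 64 megabytes per terabyte
```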

Filesystem   Files per TB   Dirs per TB   fsck memory per TB   Maximum checkable filesystem
/usr         93M            15M           1200M                3Tb
/jukebox     243K           18K           66M                  60Tb

Table 1: Maximum filesystem sizes checkable by fsck on a 32-bit architecture

The number of files and directories in a filesystem makes a huge difference in the amount of memory required by fsck. Table 1 shows the two ends of the spectrum. At one end is a typical FreeBSD /usr filesystem, assuming that it grew at its current file and directory mix to fill a 1-terabyte disk. The memory footprint of fsck is dominated by the memory to manage the inodes and would require the entire address space of a 32-bit processor for a filesystem of about 3 terabytes. At the other extreme is the author's /jukebox filesystem, assuming that it grew at its current file and directory mix to fill a 1-terabyte disk. The memory footprint of fsck is dominated by the memory to manage the data blocks and would require the entire address space of a 32-bit processor for a filesystem of about 60 terabytes. My expectation is that as disks get larger, they will tend to be filled with larger files of audio and video. Thus, in practice fsck will run out of space on 32-bit architectures at about 30-terabyte filesystems. Hopefully, by the time that such filesystems are common, they will be running on 64-bit architectures.
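Table 1's memory figures can be roughly reproduced from the stated per-resource costs (taking 45 bytes as an assumed midpoint of the 40-to-50-byte directory-inode range):

```python
# Rough check of Table 1: fsck memory per terabyte is the inode costs
# plus the 64-megabyte data-block map; the maximum checkable filesystem
# is a 32-bit (4 GB) address space divided by the per-terabyte cost.

def fsck_mem_per_tb(files, dirs):
    return files * 4 + dirs * 45 + 64 * 2**20

usr = fsck_mem_per_tb(93_000_000, 15_000_000)       # ~1200M per TB
jukebox = fsck_mem_per_tb(243_000, 18_000)          # ~66M per TB
assert 3 <= 2**32 // usr <= 4        # /usr: ~3 TB checkable
assert 55 <= 2**32 // jukebox <= 65  # /jukebox: ~60 TB checkable
```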

In the event that fsck hits the memory limit of 32-bit architectures, Julian Elischer has suggested that one solution is to implement an "offline, non-in-place" version of fsck using all those techniques we learned in CS101 relating to mag-tape merge sorts. Fsck would have to have a small (20-gigabyte) disk partition set aside to hold working files, to which it would write files of records detailing block numbers, etc. It would then do merge or block sorts on those files to get them into various orders (depending on fields in the records). Fsck would then recombine them to find such things as multiply referenced blocks and other file inconsistencies. It would be slow, but at least it could be used to check a 100-terabyte array, where the in-memory version would need a process VM space of 13 gigabytes, which is clearly impossible on a 32-bit PC.
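The core of that pass, shown in miniature and entirely in memory (on disk the sort would be an external merge sort over the working files): sort the claimed block numbers, then a single linear scan exposes blocks claimed by more than one inode.

```python
# Miniature of the sort-then-scan pass. In the real proposal the records
# would live in working files and be merge-sorted externally; the scan
# for adjacent duplicates is the same either way.

claims = [12, 7, 88, 7, 3, 12, 12]   # block numbers referenced by inodes
claims.sort()
duplicated = {b for a, b in zip(claims, claims[1:]) if a == b}
assert duplicated == {7, 12}         # multiply referenced blocks
```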

Journalling filesystems provide much faster state recovery than fsck. For this reason, there is ongoing work to provide a journalling option for UFS2. However, even journalling filesystems need to have a filesystem recovery program such as fsck. In the event of media or software failure, the filesystem can be damaged in ways that the journal cannot fix. Thus, the size of the recovery program is an issue for all filesystems. Indeed, the fact that UFS needs to use fsck in its general operation ensures that fsck is kept in good working order and is known to work even on very large filesystems.

9. Performance

The performance of UFS2 is nearly identical to that of UFS1. This similarity in performance is hardly surprising since the two filesystems share most of the same code base and use the same allocation algorithms. The purpose of UFS2 was not to try to improve on the performance of UFS1, which is already within 80-95% of the bandwidth of the disk. Rather, it was to support multi-terabyte filesystems and to provide new capabilities such as extended attributes without losing performance. It has been successful in that goal.

10. Future Work



Figure 3: Alternative file metadata representations

With the addition of dynamic block reallocation in the early 1990s [Seltzer & Smith, 1996], the UFS1 filesystem has had the ability to allocate most files contiguously on the disk. The metadata describing a large file consists of indirect blocks with long runs of sequential block numbers; see Fig. 3-(a). For quick access while a file is active, the kernel tries to keep all of a file's metadata in memory. With UFS2, the space required to hold the metadata for a file is doubled, as every block pointer grows from 32 bits to 64 bits. To provide a more compact representation, many filesystems use an extent-based representation. A typical extent-based representation uses pairs of block numbers and lengths. Figure 3-(b) represents the same set of block numbers as Fig. 3-(a) in an extent-based format. Provided that the file can be laid out nearly contiguously, this representation provides a very compact description. However, randomly or slowly written files can end up with many non-contiguous block allocations, which will produce a representation that requires more space than the one used by UFS1. This representation also has the drawback that it can require a lot of computation to do random access to the file, since the block number needs to be computed by adding up the sizes starting from the beginning of the file until the desired seek offset is reached.
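The random-access cost difference is easy to see in a sketch (invented block numbers; both representations describe the same layout): fixed-size block pointers allow a single division, while extents must be walked and their lengths summed.

```python
# Sketch contrasting the lookups of Fig. 3-(a) and Fig. 3-(b).
BLKSIZE = 16 * 1024

def lookup_fixed(pointers, offset):
    # fixed-size blocks: the pointer index is one division, O(1)
    return pointers[offset // BLKSIZE]

def lookup_extent(extents, offset):
    # extents as (start_block, length) pairs: walk and sum, O(extents)
    blkno = offset // BLKSIZE
    for start, length in extents:
        if blkno < length:
            return start + blkno
        blkno -= length
    raise ValueError("offset past end of file")

pointers = [100, 101, 102, 200, 201]
extents = [(100, 3), (200, 2)]        # same layout in extent form
for off in range(0, 5 * BLKSIZE, BLKSIZE):
    assert lookup_fixed(pointers, off) == lookup_extent(extents, off)
```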

To gain most of the efficiencies of extents without the random-access inefficiencies, UFS2 has added a field to the inode that allows that inode to use a larger block size. Small, slowly growing, or sparse files set this value to the regular filesystem block size and represent their data in the traditional way shown in Fig. 3-(a). However, when the filesystem detects a large dense file, it can set this inode-block-size field to a value two to sixteen times the filesystem block size. Figure 3-(c) represents the same set of block numbers as Fig. 3-(a) with the inode-block-size field set to four times the filesystem block size. Each block pointer references a piece of disk storage that is four times larger, which reduces the metadata storage requirement by 75 percent. Since every block pointer other than possibly the last one references an equal-sized block, computation of random-access offsets is just as fast as in the traditional metadata representation. It also cannot degrade to a larger representation than the traditional metadata representation.
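The 75 percent figure is simple arithmetic; a sketch with an assumed 16-kilobyte filesystem block and 8-byte (64-bit) pointers:

```python
# Quadrupling the per-inode block size quarters the number of 8-byte
# block pointers, a 75 percent reduction in metadata.

def metadata_bytes(file_size, block_size, pointer_size=8):
    pointers = -(-file_size // block_size)   # ceiling division
    return pointers * pointer_size

fs_block = 16 * 1024
std = metadata_bytes(1 << 30, fs_block)          # 1 GB file, normal blocks
big = metadata_bytes(1 << 30, 4 * fs_block)      # inode-block-size = 4x
assert big * 4 == std                            # 75 percent less metadata
```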

The drawback to this approach is that once a file has committed to using a larger block size, it can only utilize blocks of that size. If the filesystem runs out of big blocks, then the file can no longer grow, and either the application will get an "out-of-space" error or the filesystem has to recreate the metadata with the standard filesystem block size. My current plan is to write the code to recreate the metadata. While recreating the metadata will usually cause a long pause, I expect that condition to be quite rare and not a noticeable problem in actual use.

11. Current Status

The UFS2 filesystem was developed for the FreeBSD Project by the author under contract to Network Associates Laboratories, the Security Research Division of Network Associates, Inc., under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the DARPA CHATS research program. Under the terms of that contract, the software must be released under a Berkeley-style copyright. The UFS2 filesystem was written in 2002 and first released in FreeBSD 5.0. Extensive user feedback on that release has been helpful in shaking out latent shortcomings, particularly in the ability of UFS2 to smoothly handle the really big filesystems for which it was designed. The biggest current limitation is that the disk labels used in FreeBSD 5.0 can only describe 2-terabyte disks. We are hoping that the new larger disk labels will be available by the time FreeBSD 5.1 is released.

12. References

Apple, 2003.
Apple, Mac OS X Essentials, Chapter 9 Filesystem, Section 12 Resource Forks, https://developer.apple.com/documentation/MacOSX/Conceptual/SystemOverview/FileSystem/index.html (2003).
Best & Kleikamp, 2003.
S. Best & D. Kleikamp, "How the Journaled File System handles the on-disk layout," https://www-106.ibm.com/developerworks/linux/library/l-jfslayout/ (2003).
Dowse & Malone, 2002.
I. Dowse & D. Malone, "Recent Filesystem Optimizations on FreeBSD," Proceedings of the Freenix Track at the 2002 Usenix Annual Technical Conference, p. 245-258 (June 2002).
Griffin et al, 2002.
J. L. Griffin, J. Schindler, S. W. Schlosser, J. S. Bucy, & G. R. Ganger, "Timing-accurate Storage Emulation," Proceedings of the Usenix Conference on File and Storage Technologies, p. 75-88 (January 2002).
Lumb et al, 2002.
C. R. Lumb, J. Schindler, & G. R. Ganger, "Freeblock Scheduling Outside of Disk Firmware," Proceedings of the Usenix Conference on File and Storage Technologies, p. 275-288 (January 2002).
McKusick, 2002.
M. McKusick, "Running Fsck in the Background," Proceedings of the BSDCon 2002 Conference, p. 55-64 (February 2002).
McKusick et al, 1996.
M. McKusick, K. Bostic, M. Karels, & J. Quarterman, The Design and Implementation of the 4.4BSD Operating System, p. 269-271, Addison Wesley Publishing Company, Reading, MA (1996).
McKusick & Ganger, 1999.
M. McKusick & G. Ganger, "Soft Updates: A Technique for Eliminating Most Synchronous Writes in the Fast Filesystem," Proceedings of the Freenix Track at the 1999 Usenix Annual Technical Conference, p. 1-17 (June 1999).
McKusick et al, 1984.
M. McKusick, W. Joy, S. Leffler, & R. Fabry, "A Fast File System for UNIX," ACM Transactions on Computer Systems, 2, 3, p. 181-197 (August 1984).
Phillips, 2001.
D. Phillips, "A Directory Index for Ext2," Proceedings of the Usenix Fifth Annual Linux Showcase and Conference (November 2001).
Reiser, 2001.
H. Reiser, "The Reiser File System," https://www.namesys.com/res_whol.shtml (January 2001).
Rhodes, 2003.
T. Rhodes, "FreeBSD Handbook, Chapter 3, Section 3.3 File System Access Control Lists," https://www.FreeBSD.org/doc/en_US.ISO8859-1/books/handbook/fs-acl.html (2003).
Schindler et al, 2002.
J. Schindler, J. L. Griffin, C. R. Lumb, & G. R. Ganger, "Track-aligned Extents: Matching Access Patterns to Disk Drive Characteristics," Proceedings of the Usenix Conference on File and Storage Technologies, p. 259-274 (January 2002).
Seltzer et al, 2000.
M. Seltzer, G. Ganger, M. McKusick, K. Smith, C. Soules, & C. Stein, "Journaling versus Soft Updates: Asynchronous Meta-data Protection in File Systems," Proceedings of the San Diego Usenix Conference, p. 71-84 (June 2000).
Seltzer & Smith, 1996.
M. Seltzer & K. Smith, "A Comparison of FFS Disk Allocation Algorithms," Winter USENIX Conference, p. 15-25 (January 1996).
Sweeney et al, 1996.
A. Sweeney, D. Doucette, C. Anderson, W. Hu, M. Nishimoto, & G. Peck, "Scalability in the XFS File System," Proceedings of the 1996 Usenix Annual Technical Conference, p. 1-14 (January 1996).
Watson, 2000.
R. Watson, "Introducing Supporting Infrastructure for Trusted Operating System Support in FreeBSD," Proceedings of the BSDCon 2000 Conference (September 2000).
Watson, 2001.
R. Watson, "TrustedBSD: Adding Trusted Operating System Features to FreeBSD," Proceedings of the Freenix Track at the 2001 Usenix Annual Technical Conference (June 2001).
Watson et al, 2003.
R. Watson, W. Morrison, C. Vance, & B. Feldman, "The TrustedBSD MAC Framework: Extensible Kernel Access Control for FreeBSD 5.0," Proceedings of the Freenix Track at the 2003 Usenix Annual Technical Conference (June 2003).

13. Biography

Dr. Marshall Kirk McKusick writes books and articles, consults, and teaches classes on UNIX- and BSD-related subjects. While at the University of California at Berkeley, he implemented the 4.2BSD fast filesystem and was the Research Computer Scientist at the Berkeley Computer Systems Research Group (CSRG), overseeing the development and release of 4.3BSD and 4.4BSD. His particular areas of interest are the virtual-memory system and the filesystem. One day, he hopes to see them merged seamlessly. He earned his undergraduate degree in Electrical Engineering from Cornell University and did his graduate work at the University of California at Berkeley, where he received Masters degrees in Computer Science and Business Administration and a doctoral degree in Computer Science. He is president of the Usenix Association, and is a member of ACM and IEEE.

In his spare time, he enjoys swimming, scuba diving, and wine collecting. The wine is stored in a specially constructed wine cellar (accessible from the web at https://www.mckusick.com/~mckusick/) in the basement of the house that he shares with Eric Allman, his domestic partner of 24-and-some-odd years. You can contact him via email at [email protected].







