ONLINE CHECKING OF DATA STRUCTURES OF A FILE SYSTEM
BACKGROUND
File systems organize and track where data is stored in memory and where free or available space exists in memory. Distributed or clustered file systems can store thousands or even millions of files. These files and the corresponding metadata are distributed across a large number of storage devices, such as disk drives.
In order to assist in managing the integrity of file systems, software tools are used to check all file system data structures for consistency and, if necessary, make changes to bring them into a consistent state. These tools, however, typically require the file system to be taken off-line for the duration of the checking process (or, in some cases, be made read-only).
Given the large size of some file systems, this approach has become increasingly untenable. First, many tools use algorithms whose runtimes grow at least linearly with the size of the file system, which can produce running times of hours or even days on large file systems. Second, many file systems include important data that cannot be taken off-line without significantly impacting users.
Methods and systems to improve the efficiency of checking the integrity of large file systems are needed.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 shows a cluster file system in accordance with an exemplary embodiment of the present invention.
Figure 2 shows a hierarchical directory for a file system in accordance with an exemplary embodiment of the present invention.
Figure 3 shows a flow diagram for online checking the integrity of data structures of a file system in accordance with an exemplary embodiment of the present invention.
Figure 4 shows an exemplary computer system for implementing one or more of the methods in accordance with an exemplary embodiment of the present invention.
DETAILED DESCRIPTION
Exemplary embodiments relate to systems and methods for checking the integrity of a file system while the file system remains online. In one embodiment, the data structures of the file system are incrementally checked for consistency, and errors or inconsistencies discovered during the check are fixed while the file system is running and in use.
In one exemplary embodiment, the file system uses a hierarchical tree to track and store data. Directories are represented as tree structures on storage devices (such as disk drives). Entries in the tree structure refer to separate blocks of memory that include the corresponding data for the files in the directory.
In some instances, the file system includes errors or mistakes, such as computer bugs, lost or misplaced data, corrupt data, etc. For example, an entry in the tree will point to a specific block of memory, but this block will not actually contain the correct data. In this instance, data or inodes can become lost in the file system. As another example, file systems use free space bitmaps to track where available or unused space exists in memory. Errors can occur in these maps, so that the file system cannot accurately track which blocks of memory are in use and which blocks of memory are available for storage.
In order to discover and correct errors, the file system uses a file checker (fsck) software tool or utility. The file checker is a program that navigates through the file system, discovers errors, and then corrects such errors. For example, the file checker repairs errors in the tree structure, such as errors in specifying the location of files, errors in the free space bitmaps, etc.
Embodiments in accordance with the present invention provide new data structures or ways to store data in the file system to enable the file checker to run without taking the entire file system offline. In other words, the file checker navigates through the file system, discovers errors, and then corrects such errors while the file system remains mounted and available for use. User operations continue to execute through the file system without hanging or failing. In one exemplary embodiment, the file system is divided or separated into fragments, segments, or small chunks. These segments or chunks are searched with the file checker while the remaining portions of the file system remain online. For example, each segment defines a number of blocks that are examined to find and correct corruption.
With exemplary embodiments, part of the file system is checked for consistency while access to the file system data remains possible. One exemplary method reads a contiguous segment of the file system, checks internal references within that segment, and then checks external references to verify the accuracy or consistency of such references. Since many references are internal, the check or verification process proceeds without generating large numbers of expensive random searches through the storage devices.
One exemplary method provides a benefit unrelated to on-line file system verification. By making it easy to find any physical references that exist to a given file system block, this method also makes it easy to relocate a block to a new location. In turn, this provides the ability to improve file system performance by removing fragmentation and migrating data among different classes of storage devices (for example, between expensive high-performance disks and cheaper, slower disks, or between power-hungry disks and lower-power disks).
One exemplary embodiment is a clustered file system in which many computers simultaneously access disk data. Figure 1 shows a distributed or cluster file storage system 100 in accordance with an exemplary embodiment of the present invention. By way of example, the system is a cluster storage network and/or a storage area network (SAN) that includes a plurality of client computers, nodes, or host computers 102A to 102N and one or more storage devices or arrays 103A to 103N that include one or more storage controllers 104 (shown by way of example as an array controller), a plurality of storage devices 106 (shown by way of example as disk array 1 to disk array N), a file system checker (fsck) 107, and a file system manager (FSM) 108 in communication with the storage controllers and devices. The filesystem manager 108 (such as a server or storage device) stores and organizes computer files so the files and corresponding data can be managed and discovered for the hosts 102A to 102N. In one exemplary embodiment, the filesystem manager 108 is replicated on all cluster nodes. The fsck 107 (such as run in a server or computer) is a tool for checking for consistency in the clustered file system.
The host computers are grouped to form one or more clusters (shown as cluster 114A to 114N). For example, hosts 102A are grouped to form one cluster 114A which includes a plurality of host computers (shown as host 1 to host N). Hosts 102N are grouped to form another cluster 114N.
The clusters 114A to 114N and file system manager 108 are coupled to the array controllers 104 through one or more fabrics or networks 110, and the storage devices or arrays 103 are coupled to the storage devices 106 through one or more fabrics or networks 111. For instance, the hosts communicate with an array controller using a Small Computer System Interface (SCSI) or other interface/commands over a fiber channel (FC). By way of example, networks 110 and 111 include one or more of the Ethernet, fibre channel (FC), serial attached SCSI (SAS), iSCSI, internet, local area network (LAN), wide area network (WAN), public and/or private networks, etc. Communications links 112 are shown in the figure to represent communication paths or couplings between the hosts, controllers, and storage devices.
In one exemplary embodiment, the storage devices (such as array controller 104 and disk arrays 106) are network attached devices providing random access memory (RAM) and/or disk space (for storage and as virtual RAM) and/or some other form of storage or storage device, such as magnetic memory (example, tapes), micromechanical systems (MEMS), or optical disks, to name a few examples. Typically, storage devices include larger amounts of RAM and/or disk space and one or more specialized devices, such as network disk drives or disk drive arrays (example, redundant array of independent disks (RAID)), high speed tape, magnetic random access memory (MRAM) systems or other devices, and combinations thereof. In one exemplary embodiment, the storage devices include one or more servers.
The storage controller 104 manages various data storage and retrieval operations. Storage controller 104 receives I/O requests or commands from the host computers 102A to 102N, such as data read requests, data write requests, maintenance requests, etc. Storage controller 104 handles the storage and retrieval of data on the multiple disk arrays 106 and disk groups. In one exemplary embodiment, storage controller 104 is a separate device or may be part of a computer system, such as a server. Additionally, the storage controller 104 may be located with, proximate to, or a great geographical distance from the disk arrays 106 or from each other.
The array controller 104 includes numerous electronic devices, circuit boards, electronic components, etc. By way of example, the array controller 104 includes firmware 120, an input/output (I/O) scheduler 122, a queue 124, one or more interfaces 126, one or more processors 128 (shown by way of example as a CPU, central processing unit), and memory 130 (including read and write cache). CPU 128 performs operations and tasks necessary to manage the various data storage and data retrieval requests received from host computers 102A to 102N. For instance, processor 128 is coupled to a host interface 126A that provides bidirectional data communications to one or more host computers 102A to 102N. Processor 128 is also coupled to an array interface 126B that provides bidirectional data communications to the disk arrays 106. Memory 130 is also coupled to processor 128 and stores various information used by the processor when carrying out its tasks. By way of example, memory 130 includes one or more of volatile memory, non-volatile memory, or a combination of volatile and non-volatile memory. The memory 130, for example, stores applications, data, control programs, algorithms (including software to implement or assist in implementing embodiments in accordance with the present invention), and other data associated with the storage device (example, state data such as mapping metadata, configuration metadata, and cached user data). The processor 128 communicates with memory 130, interfaces 126, and the other components via one or more buses 132.
Figure 2 shows a partial directory or hierarchical tree structure 200 for a file system in accordance with an exemplary embodiment of the present invention. The directory includes a root node 210 with branches leading to a plurality of directories 220A, 220B, to 220N (shown as directory A to directory N). Directory 220A includes plural subdirectories 230A to 230N (shown as subdirectory 1 to subdirectory N). Each subdirectory further includes one or more files. For example, directory 220B has file 240A, and subdirectory 230A has files 240B to 240N (shown as file 2 to file N).
In one exemplary embodiment, each file contains a reference to a separate data structure that stores data or metadata about the file. For example, each file contains a storage location (for example, a block location for data or an inode number that refers to a storage device storing the inodes).
Figure 3 shows a flow diagram for online checking the integrity of data structures of a file system in accordance with an exemplary embodiment of the present invention.
According to block 300, the file system is divided into discrete segments. By way of example, the file system is divided into a discrete number of storage blocks. According to block 310, the integrity of the file system is checked with a file system checker while the file system remains online. The file system checker navigates through a selected segment and related portions of the file system to discover errors, bugs, inconsistencies, lost/misplaced data, inaccuracies, etc.
According to block 320, the file system checker corrects inconsistencies or errors discovered in block 310.
According to block 330, a question is asked whether another segment in the file system will be checked with the file system checker. If the answer to this question is "yes" then flow proceeds back to block 310. If the answer to this question is "no" then flow proceeds to block 340. Here, the process ends since the file system checker is finished checking all of the segments.
Figure 4 shows an exemplary computer system 410 for implementing one or more of the methods in accordance with an exemplary embodiment of the present invention.
The computer system 410 includes a computer 420 coupled to storage devices 430. The computer 420 comprises a processing unit 450 (such as one or more processors or central processing units, CPUs) for controlling the overall operation of memory 460 (such as random access memory (RAM) for temporary data storage and read only memory (ROM) for permanent data storage) and one or more algorithms or programs 480 (such as a file system manager and/or a file system checker, fsck). The memory 460 stores data, control programs, and other data associated with the computer 420. Although shown separately, the memory 460 can store the programs 480. The processing unit 450 communicates with memory 460, storage devices 430, programs 480, and many other components via buses 490. Embodiments in accordance with the present invention are not limited to any particular type or number of storage devices and/or computers. The computer system, for example, includes various portable and non-portable computers and/or electronic devices. Exemplary computers include, but are not limited to, servers, main frame computers, distributed computing devices, laptops, and other electronic devices and systems, whether such devices and systems are portable or non-portable.
In order to facilitate an explanation of the data structures of the file system in accordance with exemplary embodiments, the description is divided into various headings.
Modules
Modules are coded to perform low-cost consistency checking and to correct corruption, inconsistencies, or errors in the file system. One aspect of online fsck is to access all parts of the filesystem via the normal interfaces of the applicable modules, which implicitly will perform these checks.
The modules provide routines for performing module-specific consistency checks that are beyond the scope of run-time low-cost consistency checking (e.g., checking the consistency of a set of global and local quota items for a particular quota owner). These routines are executed by an online fsck. Furthermore, module-specific routines are designed to allow code sharing with user-level utilities (offline fsck).
In one embodiment, the filesystem data structures do not include one-way links between structures in separate filesystem blocks. Instead, two-way links are used to allow detection of inconsistencies and orphaned link targets without scanning the entire filesystem. Data structure fields that represent a count of objects, or a summation of a set of values which reside across a large or unbounded number of filesystem blocks, are represented by a Recoverable Distributed Counter (RDC) or some other structure that supports breaking down consistency checking and correction into bounded operations.
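By way of illustration only, the following sketch (in Python) models the RDC concept. The names (RDC, check_and_repair_range, recount) are hypothetical and do not describe an actual on-disk layout; the point is that each partial count can be recounted and repaired as a bounded operation while the rest of the counter remains untouched.

    # Minimal model of a Recoverable Distributed Counter (RDC): the global
    # total is the sum of bounded per-range partial counts, so consistency
    # checking and correction can proceed range by range.
    class RDC:
        def __init__(self, num_ranges):
            self.partial = [0] * num_ranges   # one bounded count per range

        def add(self, range_id, delta):
            self.partial[range_id] += delta   # an update touches one range

        def total(self):
            return sum(self.partial)

        def check_and_repair_range(self, range_id, recount):
            # Recount one range (a bounded operation) and repair its partial
            # count if inconsistent; no whole-filesystem scan is needed.
            actual = recount(range_id)
            if actual != self.partial[range_id]:
                self.partial[range_id] = actual
                return True
            return False

    # Example: the partial count for range 2 is corrupt; only that range
    # is rescanned to repair the counter.
    rdc = RDC(num_ranges=4)
    rdc.add(2, 7)                              # corrupt count; true count is 5
    assert rdc.check_and_repair_range(2, recount=lambda r: 5)
    assert rdc.total() == 5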
Consistency Checking: fsck
Fsck checks the consistency and integrity of data in the file system. This consistency checking includes one or more of the following:
Fsck performs checks through the performance of operations in the file system. These checks indirectly cause module-specific checks of all filesystem structures (e.g., a read of quota limits for each filesystem owner will result in the consistency checking which is part of a quota data access).
Fsck performs checks of intra-module consistencies that are beyond the scope of low-cost checks done by the module (e.g., checking consistency of an item in an inode tree with the stat data in the tree's root). Fsck also checks inter-module consistency (e.g., checking that inodes are included in appropriate lists, such as quota owner inode lists and the unique security descriptor table).
Additionally, fsck performs block allocation checks. The division of this check into a set of discrete bounded operations depends on the near and far pointer designs for user data.
Fsck also checks RDCs and any other structures designed to facilitate dividing checks of unbounded objects into a set of discrete bounded operations. The specific requirements of this check depend on the design of the RDCs and other structures. When inconsistencies are found, an administrative notification will be generated with details of the error.

Locking
Online fsck operations follow the locking rules followed by other runtime modules to allow concurrent operations and maintain consistency. Online fsck uses low-level routines, such as those which perform tree traversal, and depends on these routines to perform consistency checking. The locking necessary to maintain the appropriate stability within the filesystem while each fsck operation executes depends on the locking implemented for non-fsck operations.
Since online fsck does not depend on the filesystem remaining in a static state, it tolerates concurrent changes in the filesystem.
Corrective Action
Some types of corruption, such as a missing root directory, will require an offline fsck (and will result in an appropriate error indication when detected).
Corrective actions leave the filesystem in a state that will satisfy fsck consistency checks. Consistency checks that are defined by non-fsck modules have associated routines for correcting any corruption (these will usually be part of the same routine).
Corruption found by module-specific runtime consistency checking routines indicates a corruption that interferes with correct operation of the filesystem and is corrected immediately when possible. If corrective action is required which is outside of the scope of the module which detected the corruption, the module will convert the corruption to a state that allows further operations to identify and work around the corruption (for example, converting a corrupt pointer to a LOST pointer).
Corruption found by fsck operations which is determined to be non-critical to filesystem operation is corrected as determined by the mode of fsck operation being executed. Examples of such non-critical corruption include incorrect RDC values and unreachable inodes. By contrast, corruption found by fsck operations which is determined to be critical to filesystem operation is immediately corrected (with the caveat of the next requirement).
When corruption is found and is not immediately correctable, the corrupted structure is tagged as corrupt (such as with a LOST pointer) or the structure is isolated from future operations. Isolation could be performed on an inode tree by including some type of tag in directory entries that point to it. Overall, the methods used to tag or isolate are specific to the module and data affected. Corrective actions are completed through journaled transactions.
In one embodiment, online fsck provides routines to perform targeted corrective actions based on the specific corruption found. For example, a lost+found directory is created to link inodes which have lost the linkage to their parent directories. The lost+found directory is readable only by the root/administrator to avoid potential security compromises. When corrective actions are taken, an administrative notification is generated with details of the correction.
Initiation of fsck Operations

Fsck operations are initiated by a user-level psfsck utility. Each call into the kernel requests one or more atomic operations in the sequence of fsck tasks. Supported operations are designed to support one or more of the following:
a. Analyze-only psfsck. Descriptions of any corruptions found are communicated to the user-level utility.

b. Analyze and fix psfsck. Corruption is fixed as it is found, when possible.
Descriptions of any corruptions found and the related corrective actions are communicated to the user-level utility. The user-level utility maintains a record of unresolved corruption which depends on further analysis operations to determine final resolution. For example, any LOST pointers will be recorded, and if a block is found that links back to the LOST pointer location, the pointer will be restored. Any LOST pointers that have not been resolved following a complete psfsck analysis of the filesystem are communicated back to the kernel to be removed/replaced as appropriate.
Rebuild tree type operations are a normal corrective action for psfsck, when necessary, not a separate invocation. The operations are performed only on metadata blocks that remain orphaned after other corrective actions have been performed.
Progress of all long-running operations is reported directly or through an administrative interface.
Run-time Changes

Some code changes related to remaining online during the detection and repair of corruption do not affect the on-disk format. These changes, in one exemplary embodiment, do require a universal discipline throughout the code, as follows:
First, blocks read from disk are treated as untrusted. Any value read from disk that affects an in-memory calculation is bounds-checked or otherwise validated before its use. For example, there are many places in the current code base in which items are moved with memory copies. The number of items in a block is used as a multiplier to determine the offset within an in-memory block to use as the source or destination of the memory copy. Usually, the number of items is not bounds-checked to ensure that the resulting offset still falls within the block. Therefore, in the current code, an incorrect item number on disk could cause random kernel memory corruption. That corruption could prevent the filesystem from remaining online while corrupt.
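A minimal sketch of this first rule follows, in Python for brevity. The block layout (a 4-byte item count followed by fixed-size items) is hypothetical; the point is that the on-disk count is validated before it becomes a multiplier in an offset calculation.

    import struct

    BLOCK_SIZE = 4096
    ITEM_SIZE = 32
    HEADER = struct.Struct("<I")    # hypothetical: item count in first 4 bytes

    def item_offset(block, item_no):
        # Compute an in-block offset from untrusted on-disk values,
        # bounds-checking every value before it is used.
        if len(block) != BLOCK_SIZE:
            raise IOError("short read: corrupt block")
        (count,) = HEADER.unpack_from(block, 0)
        max_items = (BLOCK_SIZE - HEADER.size) // ITEM_SIZE
        if count > max_items or item_no >= count:
            raise IOError("corrupt item count: refusing to compute offset")
        offset = HEADER.size + item_no * ITEM_SIZE
        assert offset + ITEM_SIZE <= BLOCK_SIZE   # guaranteed by the checks
        return offset

An operation that receives such an error does not continue; it rolls back its transaction and returns the error to user space, per the second rule below.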
Second, operations that encounter corruption do not continue (which could lead to further corruption) or self-evict (which would prevent the filesystem from being available while it was corrupt). Instead, they roll back their current transactions and return errors to user space.
Third, operations execute consistency checks on the data, with such checks being performed cheaply (which will usually mean "without incurring extra input/output or I/O"). For example, if a lookup follows a directory entry from a directory to another directory, and the root block of the child directory is then read to continue the lookup, it will probably be common (although it does depend on the on-disk format) for the child directory's backlink to the parent directory to be in the root block, and therefore to have been read already. Therefore it would be cheap to verify that the backlink is present and points to the expected parent directory. This requirement is vague because "cheap" is not precisely defined, and there will be trade-offs between the cost of the checks and their completeness, but the on-disk format will provide more opportunities for run-time checks cheap enough to be worth executing.
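The backlink check just described can be sketched as follows. The DirRoot structure and read_root interface are hypothetical stand-ins; the check costs no extra I/O because the child's root block was read by the lookup anyway.

    class CorruptionFound(Exception):
        pass

    class DirRoot:
        # Hypothetical in-memory image of a directory's root block; the
        # backlink to the parent is stored in the root block itself.
        def __init__(self, inode_no, parent_backlink, entries):
            self.inode_no = inode_no
            self.parent_backlink = parent_backlink
            self.entries = entries            # name -> child inode number

    def lookup_child(read_root, parent, name):
        # Follow a directory entry, then verify the child's backlink using
        # only the root block that the lookup had to read anyway.
        child = read_root(parent.entries[name])
        if child.parent_backlink != parent.inode_no:
            raise CorruptionFound("backlink does not point to expected parent")
        return child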
Design: On-disk Structures
Exemplary embodiments include various on-disk structures that enable the fsck to check the filesystem while the filesystem remains online. Two examples are provided below.
As a first example, the filesystem is divided or segmented into zones with some bounded size in such a way that each zone is consistency-checked mostly independently. For example, references across zones carry more information than a single pointer. Further, a consistency check on a zone renders the zone inaccessible during the consistency check, but the time of the check has a constant bound because the size of the zone has a constant bound. The cross-zone references carry enough information to prevent the consistency check of a single zone from exceeding any bound less than the size of the filesystem.

As a second example, every global consistency condition is converted to a universal quantifier over local consistency conditions such that each local consistency condition is checked atomically. Here, no operation that involved greater than some constant number of blocks would need to be atomic. The atomic operations, which have some constant upper bound in size, are performed in the kernel under kernel locks, and could therefore be done while the filesystem was operating normally. For example, it clearly requires a scan of the whole filesystem to say "the entire filesystem is consistent". However, the idea of this scheme is that an administrator who wanted to verify that the entire filesystem was consistent (i.e., run a full fsck) would run a user-level program that would drive the kernel to do a series of bounded atomic consistency checks over the entire filesystem, while the filesystem was mounted. The state of the filesystem would be changing underneath the user-level program, so it might miss objects that were added while it was running.
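By way of illustration, such a user-level driver can be sketched as follows. The check_zone kernel interface is hypothetical; each call stands for one bounded atomic consistency check performed under kernel locks.

    # User-level driver for an online full fsck: a series of bounded atomic
    # checks, one zone at a time, while the filesystem stays mounted.
    def online_fsck(num_zones, check_zone, fix=False):
        findings = []
        for zone in range(num_zones):
            # Each call is atomic and bounded by the constant zone size, so
            # the rest of the filesystem remains available while it runs.
            # Objects created after their zone was checked may be missed.
            findings.extend(check_zone(zone, fix=fix))
        return findings

    # Usage with a stub kernel interface that reports no corruption:
    report = online_fsck(num_zones=8,
                         check_zone=lambda zone, fix: [],
                         fix=True)
    assert report == []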
The structures for online fsck will also facilitate efficient offline fsck operations. For example, a separate (and additional) "rebuild tree" operation is unnecessary, as backlinks will allow detecting unreferenced blocks; and only trees associated with those unreferenced blocks will require any rebuilding.
Converting Global Consistency Conditions to Local Conditions

In one exemplary embodiment, the on-disk structures convert global consistency conditions to local conditions. To perform this conversion, one embodiment uses two-way links instead of one-way links. Two examples in the filesystem are as follows:
As one example, each allocated block conceptually has a pointer to its parent block.
As another example, each inode has pointers to its parent directory entry or entries. Even for directories, which have parent pointers, this requires more structure than a traditional filesystem provides. A pointer to the parent directory only tells you which directory is the parent, whereas a pointer to the parent directory entry tells you where. For example, such a pointer indicates under which hash value or which filename in the parent directory the child directory is located.
On-disk Structures
The following on-disk structures are contained within various filesystem trees and specifically support online fsck.
First, each inode's B-tree contains (either embedded in the root, like stat data in our current on-disk format, or looked up by key in a leaf, like an indirect or directory entry item) exactly one backlink item for each hard link to the inode. Directories therefore contain exactly one backlink item each. A backlink item specifies the inode number of the directory containing the hard link to the inode, and the key of the directory entry representing the hard link to the inode. This effectively makes the logical links comprising a filesystem's path structure two-way.
Second, metadata logical backpointers, consisting of the containing object ID and a contained key stored in every metadata B-tree block, allow online fsck to search for a metadata block to confirm that it is linked into the B-tree where it believes it is linked.
Third, a 64-bit value called LOST is reserved to represent that a pointer was discovered to be inconsistent with the rest of the filesystem and should no longer be used. If an operation encounters a LOST pointer, it returns an error as if it had discovered the corruption itself. LOST may appear anywhere that a sector pointer might appear in the filesystem, and is different from a NULL pointer or the absence of a pointer -- those lead to "object not found" return values. By contrast, LOST leads to a "corruption found" return value and indicates that the data were not merely absent, or formerly present but deleted, but were present and were destroyed by corruption.
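These semantics can be sketched as follows. The concrete sentinel value is illustrative only (the design reserves some 64-bit value), and the two exceptions stand in for the two distinct return values described above.

    LOST = 0xFFFF_FFFF_FFFF_FFFE     # hypothetical reserved 64-bit sentinel
    NULL = 0

    class ObjectNotFound(Exception):
        pass

    class CorruptionFound(Exception):
        pass

    def follow_pointer(read_block, ptr):
        if ptr == NULL:
            # Absence of a pointer: the object is merely not present.
            raise ObjectNotFound("object not found")
        if ptr == LOST:
            # The pointer was invalidated by fsck: the data were present
            # and were destroyed by corruption.
            raise CorruptionFound("corruption found")
        return read_block(ptr)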
Fourth, a cycle detection field in the directory structure is used to detect directory cycles.
In one exemplary embodiment, an fsck data tree is referenced from the super inode. The tree will generally be empty, but is capable of having items inserted to represent module-specific corruption items and items inserted to store the status of corruption.
The module-specific corruption items indicate to fsck (offline or online) the location and characteristics of detected corruption.
Items inserted by online fsck store a status of the corruption found in one operation which may be relevant to subsequent independent operations. This reduces the necessity for the orchestrating user-level fsck utility to maintain state and provides more capability to resume a partially completed online fsck operation.
In one exemplary embodiment, the fsck data tree supports partial fsck and online fsck operations. The tree's reference is inserted into the super inode tree of existing filesystems. Correct operation of a full filesystem fsck will not depend on items in the fsck data tree, but certain targeted fsck operations may.
Global Consistency Conditions
This section provides global consistency conditions that can be checked and enforced (i.e., repaired when they are not met) online. Many of these conditions depend on details of the on-disk format; this section is intended to address common conditions that will apply regardless of the format. The idea of the design is that, with the on-disk format, each of the conditions can be checked or repaired using atomic operations whose maximum size is some constant number of blocks (and hence locks), independent of the size of the filesystem.
1. If a block is marked allocated, then it is reachable through some inode B-tree or per-filesystem metadata structure.

2. A given block is reachable through at most one inode B-tree or per-filesystem metadata structure (it is not doubly referenced).

3. A regular file's nlink field is equal to the number of hard links to the file in all directories reachable from the root of the filesystem. The nlink field is useful for more than just reporting a number to users in response to a stat(2) system call. The filesystem internally uses this field to determine when a regular file becomes freeable. An incorrect nlink value could lead to effectively arbitrary filesystem corruption if the blocks of an inode were incorrectly freed, reused, and then later interpreted as the original inode by a lookup through a forgotten hard link.

4. A directory's nlink field is equal to the number of subdirectories of the directory.

5. The amount of free space reported by the filesystem is equal to the number of blocks in the filesystem that are not marked as allocated.

6. There are no cycles in the directory structure.

7. No inodes have been leaked (i.e., none are unreachable from the root).

8. The amount of space charged to the quota of a user or group, either in the filesystem or in a per-directory per-user quota if those are supported, is equal to the amount of space taken up by files owned by that user or group in the filesystem or directory.

Local Consistency Conditions
This section examines how the new on-disk structures allow the global consistency conditions to be checked using only local atomic operations. In some cases, a description is provided of how to handle detected inconsistencies.
1. In some on-disk structures, the only way to verify this condition for a given block is to lock the entire filesystem and search it, starting from the root, for a reference to the block. If the filesystem were not locked throughout this operation, it would be incorrect to declare that the block was incorrectly marked allocated even if it were not found. This occurs because there would be no way to know that the block had not been freed and reused during the scan, so that the scan simply missed the allocation of the block. Using the supporting on-disk structures, the various types of allocated blocks can be verified using bounded atomic operations as shown below (a sketch follows the list):
Allocated as metadata:

Tree internal nodes and leaves: Using the metadata logical backpointer, a search for each block can be performed in its containing tree.

Tree roots: Verify against the appropriate reference via a backlink in the root block.

Allocated as NEAR user data: Perform a bounded scan of the region to find its reference.

Allocated as FAR user data: Follow the FAR data pointer to verify the forward link.
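The per-state dispatch above can be sketched as follows. The state names and the fs helper callbacks are hypothetical stand-ins for the bounded operations just listed, not an actual kernel interface.

    def verify_allocated_block(block, fs):
        # Verify one allocated block with a bounded operation chosen by its
        # allocation state; returns True if a valid reference was found.
        if block.state == "metadata":
            if block.is_tree_root:
                # Tree root: verify against the reference named by the
                # backlink stored in the root block.
                return fs.backlink_target_references(block)
            # Internal node or leaf: the metadata logical backpointer names
            # the containing tree; search that tree for this block.
            return fs.tree_contains(block.backpointer.object_id, block)
        if block.state == "near_data":
            # NEAR user data: its reference must lie in the same bounded
            # region, so a region scan suffices.
            return fs.region_scan_finds_reference(block)
        if block.state == "far_data":
            # FAR user data: follow the FAR data pointer back to verify
            # the forward link.
            return fs.far_pointer_matches_forward_link(block)
        return True   # unallocated: nothing to verify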
2. This is a similar check to #1, above. For metadata and FAR user data, only one reference can match the stored backpointer or FAR data pointer for the block. Any additional references would be recognized as inconsistent. For NEAR user data, the bounded region scan will detect multiple references within the region, and any reference from outside the region is automatically inconsistent.
3. Since all inodes contain backlinks to the directory entries referencing them, nlink can be checked by comparing it to the number of backlinks stored in the inode. The backlinks themselves can then each be checked in turn to confirm that there are directory entries corresponding to each. Finally, a semantic scan of the filesystem (walking the directory tree), calling on the kernel to perform bounded atomic operations, would eventually discover any directory entries pointing to the inode that did not have corresponding backlinks in the inode. The count of the backlink items in an inode will be stored in an RDC on which an online fsck comprised of bounded atomic operations can be performed.
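By way of illustration of this condition, a bounded nlink check might look as follows; the inode and directory structures are hypothetical stand-ins for the backlink items described above.

    def check_nlink(inode, read_directory):
        # Compare nlink to the inode's backlink items, then confirm each
        # backlink has a matching directory entry. Every step is bounded;
        # no whole-filesystem scan is required for this part of the check.
        problems = []
        if inode.nlink != len(inode.backlinks):
            problems.append("nlink %d != %d backlink items"
                            % (inode.nlink, len(inode.backlinks)))
        for bl in inode.backlinks:
            # A backlink names the directory inode and the key of the
            # directory entry representing one hard link.
            directory = read_directory(bl.dir_inode_no)
            entry = directory.entries.get(bl.entry_key)
            if entry != inode.inode_no:
                problems.append("backlink %r has no matching directory entry"
                                % (bl.entry_key,))
        return problems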
4. This is a similar issue to #3, above. The backlinks of the linked directories can be checked against the corresponding directory entries. The nlink count will be stored in an RDC.
5. Filesystem free space is stored in an RDC on which an online fsck comprised of bounded atomic operations can be performed.
6. There are two cases:
First, the cycle is reachable from the root. In this case there will be at least one directory that is a child of two directories, so a semantic scan will discover the inconsistency when it finds either a directory with two backlinks or a directory entry without a corresponding backlink.
Second, the cycle is not reachable from the root. There will be a cycle detection field in the directory structure which will facilitate detection of these cycles. There will be some bound on how far we check for a cycle atomically. If that limit is hit, then one embodiment uses monkey bar locking to traverse up to the root. In each directory traversed, the cycle detection field is updated as being part of a cycle detection. Relinking a directory inode clears this field. If the process hits a directory already seen, the process traverses back. If all directories are still tagged (they have not moved), a cycle exists. If they are not all tagged, something in the middle has moved and the process starts over. The downside is that it potentially produces a lot of writes. Only one of these cycle checks could execute on the filesystem at once, so it would need to be protected by some global lock.
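A simplified, single-threaded rendering of this tagging walk is sketched below; in the real procedure the tags are persisted through journaled writes, and a cleared tag on the way back signals a concurrent relink, forcing a restart.

    CYCLE_CHECK_LIMIT = 64            # bound on the atomic upward walk

    class Dir:
        def __init__(self, name, parent=None):
            self.name, self.parent, self.cycle_tag = name, parent, False

    def has_cycle(start_dir, read_parent):
        # Tag each directory on the way toward the root; revisiting a
        # directory we tagged means a cycle exists.
        tagged, d, steps = [], start_dir, 0
        try:
            while d is not None:      # None means we passed the root
                if d.cycle_tag:
                    return True
                d.cycle_tag = True
                tagged.append(d)
                steps += 1
                if steps > CYCLE_CHECK_LIMIT:
                    raise RuntimeError("limit hit: retry under monkey bar locking")
                d = read_parent(d)
            return False
        finally:
            for t in tagged:          # traverse back, clearing the tags
                t.cycle_tag = False

    # Usage: directories a and b have been corrupted into a cycle.
    root = Dir("root"); a = Dir("a", root); b = Dir("b", a)
    a.parent = b
    assert has_cycle(b, lambda d: d.parent)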
7. There are two cases: First, the inode is part of a cycle which is unreachable from the root.
Second, there is some inode which has no hard link to it. In this case the inode will be discovered during a block scan to be the root of an inode that either has no backlink items or that has a backlink item with no corresponding directory entry.
When an inode is found to have been leaked, it is traditionally placed in a directory called /lost+found. In other filesystems, files in /lost+found are accessible only by root. In one exemplary embodiment, because the operation may occur online, other users may have open file handles on the files when they are placed in /lost+found. This situation is not problematic, because if they have open file handles, they have already done access checks on the path by which they reached the inode.
8. When quota accounting is enabled, space used by each owner will be stored in an RDC. Operations for checking will be further bounded by the associated quota ownage item, which will identify the specific inode objects owned by the owner.
Design: Work flow

In one exemplary embodiment, similar work flow designs apply to both offline and online fsck operations. The work flow is designed to provide some guarantee that, when each component is checked, the trees that are referenced for verification have already been checked enough to support the current operation. This guarantee cannot be depended on absolutely, but when it is detected to be broken, the fsck routine aborts its current operation and reverts to an earlier step in the process, which will reprocess the component where the corruption was detected.
1. Full filesystem fsck

When checking the full filesystem, the work will be broken down into the following phases:
Preparatory phase
This phase performs the minimal checks necessary to locate the block allocator metadata and other metadata necessary to perform the allocation phase, which traverses and checks the block allocator metadata trees. The checks done in this phase will not check for correct representation in the block allocator metadata.
Superinode / Per-filesystem extensible metadata

Here, check the tree structure of the superinode and locate items with pointers to other metadata which will be accessed during the preparatory phase.
Per node extensible metadata
Check the tree structure of the tree and locate items with pointers to other metadata which will be accessed during the preparatory phase.
Super-journal / per-node journals
For online fsck, these checks are limited to those that are not adequately performed during normal runtime journal operations. Check tree structure. Check consistency of journal contents. (Offline fsck only) Create a record of journal contents which need to be replayed, and map them to the blocks which they represent. The user-level buffer cache code will use this information to substitute journal contents for filesystem blocks as necessary.
Relocation Table

Check tree structure.
Quota Index (and any other trees that map object IDs)

Check tree structure.
RDC trees
Check the tree structure of the global and local RDC trees, and check the consistency of RDC items associated with the block allocator.
2. Allocation phase

This phase checks the structure of block allocator trees. It scans the block allocator bitmaps by region. Verify pool entries against their targets.
Check bitmaps
Scan bitmaps by region. As each region is scanned, accumulate counts tracked by RDCs.
(Online fsck only) A region lock will exist for each filesystem region of some specific size. Online fsck will obtain the lock exclusively. Operations that modify user data block allocation within the region, specifically NEAR pointers within the region and FAR pointers to blocks within the region, obtain the lock shared for the appropriate region. Other checks done during the region scan do not need to be covered by the region lock, as they do not have region-wide dependencies, and will obtain locks for the specific blocks involved in the operation. The region scan will traverse the block allocation tree for the region, scanning the bitmap for allocated blocks and maintaining two bitmaps of the region: 1- NEAR pointers from leaf nodes in the region; 2- blocks allocated as NEAR user data blocks. When the scan of a region is complete, the consistency of the two maps is verified (a sketch of this cross-check follows the list below). For each allocated block, action will be taken based on the block allocation state:
Unallocated: No action.
Allocated as metadata:
For tree nodes (roots, internal nodes, leaves), follow the backlink to the block's parent and verify the forward link. (Internal nodes and leaves) Verify all forward links against backlinks in the referenced node. Verify tree consistency with all directly linked nodes in the tree. Verify key order of items and consistency with sibling leaves (internal nodes and leaves). Verify a path exists from the root of the tree to the block (this may require referring to the OID index).
For roots of non-inode trees, verify against appropriate reference. This requires the ability to uniquely identify each tree.
For inode tree roots, verify backlink(s) against directory entry(s).
For inode tree leaf nodes, scan indirect items for NEAR pointers, tracking those that have been found, verifying that they haven't been found already.
For allocated as NEAR user data, update bitmap for region scan.
For allocated/has extra state: Subblock data: For each allocated subblock, follow the backlink to the block's parent and verify the forward link.
FAR data: Follow FAR data pointer to verify the forward link.
For blocks marked as allocated object IDs, which are not inode root blocks, find and verify the entry in the relocation table. Verify accumulated counts against RDC data after the completion of each region.
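The two-bitmap cross-check performed at the end of each region scan can be sketched as follows; the block descriptions are illustrative (the real scan walks the block allocation tree under the exclusive region lock).

    def scan_region(region_blocks):
        # Accumulate (1) NEAR pointers found in leaf nodes in the region and
        # (2) blocks allocated as NEAR user data, then cross-check the maps.
        referenced = set()            # map 1: NEAR pointers from leaf nodes
        allocated = set()             # map 2: blocks allocated as NEAR data
        for block in region_blocks:
            if block["state"] == "leaf":
                referenced.update(block["near_pointers"])
            elif block["state"] == "near_data":
                allocated.add(block["addr"])
        orphaned = allocated - referenced   # allocated but never referenced
        dangling = referenced - allocated   # referenced but not allocated
        return orphaned, dangling

    # Usage on a toy region: block 7 is NEAR data with no in-region reference.
    region = [{"state": "leaf", "addr": 1, "near_pointers": {5}},
              {"state": "near_data", "addr": 5},
              {"state": "near_data", "addr": 7}]
    orphaned, dangling = scan_region(region)
    assert orphaned == {7} and dangling == set()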
After the completion of this phase, all block allocator metadata is consistent. There may still be inconsistencies (e.g., orphaned blocks) that were discovered during this phase. The related portions of the block allocator metadata are consistent such that subsequent corrective action operations can update the block allocation as applicable.
3. Tree phase

All metadata residing in trees is checked to verify the tree structure and contents are consistent. Some trees are related to others, such that repairs in one tree require updates in others, or consistency checks in one tree require accessing others. The order in which the trees are checked is important to accommodate these relationships and avoid multiple checks of metadata.
The trees checked during the preparatory phase need to be rechecked, performing any checks that were not done previously, and verifying that all blocks are correctly accounted by the block allocator.
Superinode / Per filesystem extensible metadata
Perform a more thorough check, verifying all items and verifying that all blocks are correctly accounted by the block allocator.
Per node extensible metadata

Perform a more thorough check, verifying all items and verifying that all blocks are correctly accounted by the block allocator.
Super-journal / per-node journals
Perform a more thorough check, verifying all items and verifying that all blocks are correctly accounted by the block allocator.

Quota Index (and any other trees that map object IDs)
Perform a more thorough check, verifying all items and verifying that all blocks are correctly accounted by the block allocator. For each item, verify a corresponding item in the relocation table.
Relocation Table
This table preferably contains an entry for each allocated non-inode object ID, and sufficient data to allow finding the mapped object. Here, perform a more thorough check, verifying all items and verifying that all blocks are correctly accounted by the block allocator. For each item in the table, verify the object ID is allocated by the block allocator and that the correct object exists.
RDC trees
Perform a more thorough check, verifying all items and verifying that all blocks are correctly accounted by the block allocator. The RDCs are not checked against the trees they represent, but they are checked for internal consistency (e.g., the multiple items of a divided RDC are checked to confirm that the values at different levels add up correctly).
Unique security descriptor table
Check tree structure and contained items. Verify that all blocks are correctly accounted by the block allocator. Verify validity of security descriptors. For each associated inode, check that the inode has this security descriptor's unique ID assigned. This requires security descriptor references to be in the inode root, as the process should not be traversing inode trees yet. Otherwise, the unique security descriptor table can be traversed again after inode trees have been checked, verifying inode references at that time. The preliminary check still needs to occur here to allow searching for individual security descriptors while checking inode trees.
Quota trees

Quota accounting will be checked after inode tree checks are completed. However, each inode needs to have its ownership checked against the appropriate quota ownage item. This check allows the quota ownage items to be searched for verification against the ownership stored in the stat data. This provides motivation for the design option of maintaining quota ownage items in separate trees from other quota items.
Deferred work tree / Unlinked open file tree
Check tree structure and contained items. Verify that all blocks are correctly accounted by the block allocator. Check tree items, including consistency with inodes in the tree. Check unlinked inode trees (see the inode checking section for more details).
Inode trees

Check tree structure and contained items. Verify that all blocks are correctly accounted by the block allocator. Check stat data and other root block items. For the assigned security descriptor ID, verify the inode is listed in the unique security descriptor table. Find the mapping for each owner in the quota index tree and verify the reference in the quota ownage item. Verify any data streams with their mapped auxiliary inodes. For auxiliary inodes, verify with the associated main inode mapping. For directory inodes, check for cycles. For online fsck, this needs to be a bounded operation; in the absence of a better solution:
Put some bound on how far the process checks for a cycle atomically. If this limit is hit, use monkey bar locking to traverse up to the root. Each node traversed would be tagged as being part of a cycle detection. Relinking a directory inode clears this tag. If a node already seen is hit, traverse back. If all nodes are still tagged (they have not moved), a cycle exists. If they are not all tagged, something in the middle has moved and the process starts over. The downside is that it potentially produces a lot of writes. Only one of these cycle checks could execute on the filesystem at once, so it would need to be protected by some global lock. How/where to store the tag in the inode root is an open issue, but only one bit is needed.
Traverse the tree to leaf nodes. Small inode trees will be locked during this process. Large trees will be locked and checked in sections. Locking which supports intrafile concurrency, if implemented, may be used. Otherwise a separate locking design may be needed.
Directory inode tree leaf node: Verify item types and contents are consistent with the inode. For each directory entry, verify that the linked inode has a corresponding backlink. If the inode's first backlink corresponds to the directory entry, run the fsck on the inode tree.
Regular file inode tree leaf node: Verify item types and contents are consistent with inode. For each indirect item, verify FAR pointers with backlinks in block allocator tree.
As the tree is traversed, accumulate RDC counts and values for the tree (or tree section, for large trees). For divided RDCs, sections should coincide with RDC divisions; verify counts and values with corresponding subvalues in the RDC tree. For undivided RDCs, update the recovery progress key and recovery deltas following completion of each section.
Snapshot / branch description tables (dead and alive)
Check tree structure and contained items. Verify that all blocks are correctly accounted by the block allocator. Find mapping for owner in quota index tree and verify reference in quota ownage item. Check consistency of table data with snapshot / branch contents.
Modified inode table

Check tree structure and contained items. Verify that all blocks are correctly accounted by the block allocator. Check consistency of table data. Snapshots and branches of inodes will be treated similarly to unmodified inode trees. Each tree is traversed (which will involve multiple traversals of nodes shared between branches), and associated RDCs are verified, as the RDC values will be independent for each branch.
Quota trees
Global Quota Index: Check tree structure and contained items. Verify that all blocks are correctly accounted by the block allocator. For each leaf node item, verify the owner ID to object ID mapping in each entry matches an item in the global quota tree, and verify the object ID reference in the relocation table.
Global Quota Tree: Check tree structure and contained items. Verify that all blocks are correctly accounted by the block allocator.
For each leaf node quota ownage item:
1. Verify there is a matching entry in the global quota index.
2. (online fsck only) Consolidate with local inode lists for same owner.
3. Verify each object in an ownage list item is owned by the item owner. Accumulate quota accounting information for each object as it is accessed. If the list contains a large number of owned objects, perform the check in sections corresponding with divisions in the associated RDC (if divided; otherwise update the recovery progress key and recovery deltas).
4. Verify a corresponding global quota item. Find any corresponding local quota items. Verify consistency of accumulated/RDC values with global allocation and locai reservation (online fsck may hoard local reservations). Verify consistency of other quota item data.
Local Quota Tree: Check tree structure and contained items. Verify that all blocks are correctly accounted by the block allocator. For each local quota ownage item and quota item, verify there is a matching entry in the global quota index. The items have already been checked in conjunction with the corresponding global item.
Corruption cases / Corrective actions
Any corrective actions by online fsck should take into account filesystem data which may be cached by running processes. These will be handled on an individual basis, but will present challenges to making the unconventional modifications needed by fsck to an online filesystem.
1. Individual filesystem items
Corruption of filesystem items that can be corrected without reference to items of other modules is corrected via module-specific routines. For example, a stat data item without the directory bit set in i_mode, in an inode which contains only directory items, can have i_mode corrected without referencing items outside of the inode.
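A sketch of this module-local repair follows; the inode representation is hypothetical, with the mode bits borrowed from the POSIX stat module for concreteness.

    import stat

    def repair_i_mode(inode):
        # Correct i_mode using only information inside the inode itself;
        # no other module's items are consulted. Returns True on repair.
        only_dir_items = all(item["type"] == "directory"
                             for item in inode["items"])
        if only_dir_items and not stat.S_ISDIR(inode["i_mode"]):
            inode["i_mode"] = stat.S_IFDIR | stat.S_IMODE(inode["i_mode"])
            return True
        return False

    # Usage: a directory inode whose type bits were corrupted to "regular file".
    inode = {"i_mode": stat.S_IFREG | 0o755,
             "items": [{"type": "directory", "name": "x"}]}
    assert repair_i_mode(inode) and stat.S_ISDIR(inode["i_mode"])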
2. Tree corruption
Isolated inconsistencies in the tree structure may be correctable by local correction, but pervasive corruption within a tree may require the rebuilding of an individual tree, or a section, based on the leaf items. Any potential orphaned leaf items should be identified during fsck's allocation phase. If such corruption is detected in a tree prior to the allocation phase, it is rebuilt with LOST pointers, as appropriate. Those trees are rechecked, and potentially rebuilt with orphaned leaves, after the allocation phase.
3. Superinode

Any corruption to the superinode is difficult to repair; therefore the superinode is mirrored. When a mismatch is detected between superinode copies, the specific course of recovery will depend on finding one or more copies which pass consistency checks and copying the consistent information to the corrupt superinode copy. Depending on the level of mismatch, this check and repair may affect one or more items or the entire superinode tree.
Corruption that is not correctable through use of the superinode mirrors may render the filesystem unusable, or at least require an offline fsck. One possibility is to cache a copy of the superinode in memory at mount time, so that online fsck would have that copy to use when corruption is detected. However, the in-memory copies would need to be kept in sync with the on-disk versions, and cases where this would be useful would be very rare.
4. Per node extensible metadata
This tree is needed for some operations, but may require the allocation phase to complete before it can be completely corrected. One option is to allocate a new tree for the node, and treat the corrupted tree as that of a node which no longer has the filesystem mounted.
5. Super-journal / per-node journals
The journal contents are relatively independent of other metadata. Since they are not statically allocated, corrections may require updates to the block allocator.
The potential exists here for corruption which requires an offline fsck. Even repairs that occur before the journal is checked will depend on a functional journal.
6. Relocation table

This table can be rebuilt entirely or in part based on the allocation phase and on scanning trees which map object IDs to non-inode objects, such as the quota index.
7. Quota index
This table can be rebuilt entirely or in part based on the global quota tree.
8. RDC trees
RDC items that are inconsistent with the filesystem items that they represent are corrected to be consistent with the represented tree. Other corruption would require rebuilding the affected RDCs. This can be done in a bounded fashion by using the recovery progress key / recovery delta mechanism.
9. Block allocator

During the allocation phase, all corruption should be correctable such that the bitmaps are in a consistent state. Fsck will maintain a record of orphaned blocks, which will remain marked as allocated and will be freed or linked to lost+found if a parent is not found during the tree phase. Corrective actions during the tree phase may require updates to the block allocator, and will depend on its consistency to make these updates.
10. Unique security descriptor table
On some file systems, when an invalid security descriptor is found, it is replaced with a default security descriptor. In one exemplary embodiment, since the default security descriptor may already be in the table, all inodes containing the security descriptor (a list of these is maintained) are updated to map to the default security descriptor.
11. Inode trees

Corruption and repairs to inode trees will require updates to RDCs, block allocator bitmaps, and potentially other metadata. Quota trees are checked after inode trees, so they can be ignored here and corrected during their pass.
12. Quota trees
Accounted objects are made consistent before the quota trees are checked, so no operations should follow which require updates to the quota items. Inconsistencies may be found in the quota data due to repairs made to inodes, and will be corrected in the quota items.
Module-specific consistency checks
Due to the diversity of the various filesystem trees, a unique method of traversing each tree and checking the contained items is preferred. Runtime checks performed by module-specific routines to detect corruptions and inconsistencies will be incorporated into the online fsck procedures or the procedures will be designed to implicitly invoke the module-specific routines.
Partial fscks
The operations that make up a complete filesystem fsck will be sufficiently independent to allow fscking of isolated pieces of the filesystem.
Examples:
1. fsck inode I: The tree for inode I is traversed and checked as described above, including RDCs. All backlinks are checked for consistency with forward links. No quotas are checked for the inode's owners. Only blocks allocated to the inode tree are checked for proper status in the block allocator metadata.
2. fsck quota_owner O: The quota index entry for quota owner O is checked. The global and local quota ownage items and quota items are checked for self-consistency. The owned objects indicated by the quota ownage items are referenced to check consistency with the quota items and associated RDC. Repairs will be limited in some instances in partial fscks, due to dependencies on checks of other portions of the filesystem. Because of this, some partial fscks may be provided only as a non-modifying fsck, and any corruption found would be evidence that a full filesystem fsck is needed.
Definitions
As used herein and in the claims, the following words are defined as follows:
A "B-tree" is a tree data structure in a database that sorts data and enables searches, insertions, and deletions to occur in logarithmic amortized time. Internal nodes have a variable number of child nodes with all leaf nodes being maintained at a same depth for balance.
A "block" is a sequence of bytes or bits having a specified length or block size. Data is blocked (i.e., placed into blocks) to facilitate handling streams of data since data is typically read as a single block at a time. Most file systems are based on block devices that store and retrieve specified blocks of data. For example, a single file can be stored in multiple blocks.
A "block device" refers to a device through which the file system moves data in the form of blocks. Block devices are addressable device nodes, such as hard disks, CD-ROM, or memory regions.
A "checker" or "file system checker" of "fsck" refers to a software program that checks the consistency of a file system. The fsck traverses through the file system to discover inconsistent states and automatically corrects or fixes inconsistencies, problems, errors, etc. A system administrator can also manually run fsck. A "duster" is a group of two or more computers that work closely together so that in many respects form a single computer. For examples, a cluster is formed by linking multiple computers through a fast local area network (LAN) to improve performance and/or availability over a single computer.
A "data storage block" or "storage block" refers to specific areas in memory, such as a hard disk. For example, one data storage block is a collection of eight sectors or 4,096 bytes, referred to as 4 K bytes.
A "directory" is an entity in a file system that contains a group of files or other directories. Related files are typically stored in a same directory. A directory contained inside a directory is a subdirectory. Together, multiple directories form a hierarchy or tree structure.
A "filesystem" or "file system" is a collection of file data, maintained by a filesystem implementation which is a set of data types, methods, and algorithms (typically implemented within an operating system instance) that store, organize, and maintain file data, frequently in some kind of file and/or directory hierarchy (albeit various alternatives and choices exist in the exact organizational structure made manifest by the filesystem implementation to the consumers of the file data). The actual file data and associated filesystem meta-data which describe the location, layout, directory organization, etc. of all file data within the filesystem is in turned stored on a data storage device (e.g., single hard disk, CD-ROM, disk storage array, network attached storage (NAS), etc.).
An "inode" is a data structure that contains information about files (such as basic information about a regular file, directory, or other file system object), lnodes include information on files, such as, but not limited to, user ownership, access mode (read, write, execute permissions) and type. In one exemplary embodiment, each file has an inode and is identified by an inode number (i-number) in the file system where it resides, Inodes contain metadata (i.e., data about data) about the file.
The term 'inode number" refers to an i-number that identifies where a file resides in the file system. The inode number indexes a table of inodes in a known location on the storage device. The kernel uses the inode number to access the contents of the inode. For example, the inode number for a file is found using a 1s -i command (compared with a 1s -1 command which retrieves information of time of last modification). Some file systems do not implement a table of inodes but store such data to locate the inodes. The data can be caiied stat data (in reference to a stat system call that provides the inode information to a requesting program).
The term "metadata"' refers to data about data. Metadata can be stored in a separate file apart from the data itself. For example, fiie systems can store metadata in directory entries or in specialized structures, such as inodes. By way of example, metadata can include length of data stored as the number of blocks allocated for the file, a time and date when the file was modified, created, or last accessed, ownership identification, access permissions, etc.
The term "Recoverable Distributed Counters" or "RDC" refers to a structure that keeps track of a total count of some type of object within a large range of the fiiesystem. The RDC breaks down the total counts into partial counts of smali ranges of the fiiesystem so that it can be checked and corrected piece-by-piece.
A "stat call" or "stat system call" is a cail or request made to enumerate files in a directory or tree of directories. For example, in Linux or Unix systems, a program is executed to traverse a file system with a stat system call. This call extracts information such as time of last modification, time of last status change, and time of last access. Such information is used to detect whether or when a file changed. A "storage device" refers to any data storage device capable of storing data including, but not limited to, one or more of a disk array, a disk drive, a tape drive, optical drive, a SCSI device, or a fiber channel device Further, a "disk array" or '"array" is a storage system that includes plural disk doves, a cache, and controller. Arrays include, but are not limited to, networked attached storage (NAS) arrays, modular SAN arrays, monolithic SAN arrays, utility SAN arrays, and storage virtual ization.
In one exemplary embodiment, one or more blocks or steps discussed herein are automated. In other words, apparatus, systems, and methods occur automatically. The terms "automated" or "automatically" (and like variations thereof) mean controlled operation of an apparatus, system, and/or process using computers and/or mechanical/electrical devices without the necessity of human intervention, observation, effort, and/or decision.
The methods in accordance with exemplary embodiments of the present invention are provided as examples and should not be construed to limit other embodiments within the scope of the invention. Further, methods or steps discussed within different figures can be added to or exchanged with methods or steps in other figures. Further yet, specific numerical data values (such as specific quantities, numbers, categories, etc.) or other specific information should be interpreted as illustrative for discussing exemplary embodiments. Such specific information is not provided to limit the invention.
In the various embodiments in accordance with the present invention, embodiments are implemented as a method, system, and/or apparatus. As one example, exemplary embodiments and steps associated therewith are implemented as one or more computer software programs to implement the methods described herein. For example, the software is implemented as one or more modules. The location of the software will differ for the various alternative embodiments. The software programming code, for example, is accessed by a processor or processors of the computer or server from long-term storage media of some type, such as a CD-ROM drive or hard drive. The software programming code is embodied or stored on any of a variety of known media for use with a data processing system or in any memory device such as semiconductor, magnetic and optical devices, including a disk, hard drive, CD-ROM, ROM, etc. The code is distributed on such media, or is distributed to users from the memory or storage of one computer system over a network of some type to other computer systems for use by users of such other systems. Alternatively, the programming code is embodied in the memory and accessed by the processor using the bus. The techniques and methods for embodying software programming code in memory, on physical media, and/or distributing software code via networks are well known and will not be further discussed herein.
The above discussion is meant to be illustrative of the principles and various embodiments of the present invention. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.