Making Filesystems Exportable
Overview
All filesystem operations require a dentry (or two) as a starting point. Local applications have a reference-counted hold on suitable dentries via open file descriptors or cwd/root. However, remote applications that access a filesystem via a remote filesystem protocol such as NFS may not be able to hold such a reference, and so need a different way to refer to a particular dentry. As the alternative form of reference needs to be stable across renames, truncates, and server reboots (among other things, though these tend to be the most problematic), there is no simple answer like ‘filename’.
The mechanism discussed here allows each filesystem implementation to specify how to generate an opaque (outside of the filesystem) byte string for any dentry, and how to find an appropriate dentry for any given opaque byte string. This byte string will be called a “filehandle fragment” as it corresponds to part of an NFS filehandle.
A filesystem which supports the mapping between filehandle fragments and dentries will be termed “exportable”.
Dcache Issues
The dcache normally contains a proper prefix of any given filesystem tree. This means that if any filesystem object is in the dcache, then all of the ancestors of that filesystem object are also in the dcache. As normal access is by filename, this prefix is created naturally and maintained easily (by each object maintaining a reference count on its parent).
However, when objects are included into the dcache by interpreting a filehandle fragment, there is no automatic creation of a path prefix for the object. This leads to two related but distinct features of the dcache that are not needed for normal filesystem access.
The dcache must sometimes contain objects that are not part of the proper prefix, i.e. that are not connected to the root.
The dcache must be prepared for a newly found (via ->lookup) directory to already have a (non-connected) dentry, and must be able to move that dentry into place (based on the parent and name in the ->lookup). This is particularly needed for directories as it is a dcache invariant that directories only have one dentry.
To implement these features, the dcache has:
A dentry flag DCACHE_DISCONNECTED which is set on any dentry that might not be part of the proper prefix. This is set when anonymous dentries are created, and cleared when a dentry is noticed to be a child of a dentry which is in the proper prefix. If the refcount on a dentry with this flag set becomes zero, the dentry is immediately discarded, rather than being kept in the dcache. If a dentry that is not already in the dcache is repeatedly accessed by filehandle (as NFSD might do), a new dentry will be allocated for each access, and discarded at the end of the access.
Note that such a dentry can acquire children, name, ancestors, etc. without losing DCACHE_DISCONNECTED - that flag is only cleared when the subtree is successfully reconnected to root. Until then, dentries in such a subtree are retained only as long as there are references; refcount reaching zero means immediate eviction, same as for unhashed dentries. That guarantees that we won’t need to hunt them down upon umount.
A primitive for creation of secondary roots - d_obtain_root(inode). Those do _not_ bear DCACHE_DISCONNECTED. They are placed on the per-superblock list (->s_roots), so they can be located at umount time for eviction purposes.
Helper routines to allocate anonymous dentries, and to help attach loose directory dentries at lookup time. They are:
- d_obtain_alias(inode) will return a dentry for the given inode.
If the inode already has a dentry, one of those is returned.
If it doesn’t, a new anonymous (IS_ROOT and DCACHE_DISCONNECTED) dentry is allocated and attached.
In the case of a directory, care is taken that only one dentry can ever be attached.
- d_splice_alias(inode, dentry) will introduce a new dentry into the tree;
either the passed-in dentry or a preexisting alias for the given inode (such as an anonymous one created by d_obtain_alias), if appropriate. It returns NULL when the passed-in dentry is used, following the calling convention of ->lookup.
Filesystem Issues
For a filesystem to be exportable it must:
provide the filehandle fragment routines described below.
make sure that d_splice_alias is used rather than d_add when ->lookup finds an inode for a given parent and name.
If inode is NULL, d_splice_alias(inode, dentry) is equivalent to:
    d_add(dentry, inode), NULL
Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)
Typically the ->lookup routine will simply end with a:
        return d_splice_alias(inode, dentry);
}
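The return convention described above can be modelled in userspace. The following is only a toy sketch, not kernel code: every `toy_` name, the struct layouts, and the single-alias bookkeeping are assumptions made for illustration, and locking, reference counting, hashing, and loop detection are left out entirely. It shows the three documented outcomes: an error pointer is passed straight through, a NULL inode yields a negative dentry and a NULL return, and an existing anonymous alias of a directory is moved into place and returned instead of the passed-in dentry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_ERRNO 4095

struct toy_dentry;

/* Toy stand-ins; these are not the kernel's real structures. */
struct toy_inode {
    int is_dir;
    struct toy_dentry *alias;    /* this toy tracks at most one alias */
};

struct toy_dentry {
    struct toy_dentry *parent;   /* points to itself while anonymous */
    char name[32];
    struct toy_inode *inode;     /* NULL means a negative dentry */
    int hashed;                  /* 1 once the dentry is in the tree */
};

/* Userspace imitation of the ERR_PTR()/IS_ERR() idiom. */
static void *toy_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static int toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Returns NULL when the passed-in dentry was used, otherwise the
 * preexisting alias (or the error pointer that was handed in). */
static struct toy_dentry *toy_splice_alias(struct toy_inode *inode,
                                           struct toy_dentry *dentry)
{
    assert(dentry != NULL);

    if (toy_is_err(inode))
        return (struct toy_dentry *)inode;   /* error passes through */

    if (inode != NULL && inode->is_dir && inode->alias != NULL) {
        /* A directory may have only one dentry: move the existing
         * alias to the looked-up parent and name. */
        struct toy_dentry *alias = inode->alias;

        alias->parent = dentry->parent;
        memcpy(alias->name, dentry->name, sizeof(alias->name));
        alias->hashed = 1;
        return alias;
    }

    /* Behaves like d_add(): a NULL inode gives a negative dentry. */
    dentry->inode = inode;
    dentry->hashed = 1;
    if (inode != NULL && inode->alias == NULL)
        inode->alias = dentry;
    return NULL;
}
```

Note that the toy leaves any "disconnected" state alone; as described earlier, that flag is only cleared once the subtree is known to be reconnected to the root.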
A file system implementation declares that instances of the filesystem are exportable by setting the s_export_op field in the struct super_block. This field must point to a “struct export_operations” struct which has the following members:
- encode_fh (mandatory)
Takes a dentry and creates a filehandle fragment which may later be used to find or create a dentry for the same object.
- fh_to_dentry (mandatory)
Given a filehandle fragment, this should find the implied object and create a dentry for it (possibly with d_obtain_alias).
- fh_to_parent (optional but strongly recommended)
Given a filehandle fragment, this should find the parent of the implied object and create a dentry for it (possibly with d_obtain_alias). May fail if the filehandle fragment is too small.
- get_parent (optional but strongly recommended)
When given a dentry for a directory, this should return a dentry for the parent. Quite possibly the parent dentry will have been allocated by d_obtain_alias. The default get_parent function just returns an error, so any filehandle lookup that requires finding a parent will fail. ->lookup(“..”) is not used as a default as it can leave “..” entries in the dcache which are too messy to work with.
- get_name (optional)
When given a parent dentry and a child dentry, this should find a name in the directory identified by the parent dentry, which leads to the object identified by the child dentry. If no get_name function is supplied, a default implementation is provided which uses vfs_readdir to find potential names, and matches inode numbers to find the correct match.
- flags
Some filesystems may need to be handled differently than others. The export_operations struct also includes a flags field that allows the filesystem to communicate such information to nfsd. See the Export Operations Flags section below for more explanation.
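The fallback strategy described for get_name (walk the parent directory and match inode numbers) has a direct userspace analogue. The sketch below is an illustration using POSIX directory calls, not the kernel's implementation; `toy_get_name` and its parameters are hypothetical names, and it compares the device number as well so that entries on other mounted filesystems cannot match by accident.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: find the name under parent_path that refers to
 * the same object as child_path. Returns 0 and fills `name` on success,
 * or a negative errno value on failure. */
static int toy_get_name(const char *parent_path, const char *child_path,
                        char *name, size_t name_len)
{
    struct stat want, st;
    struct dirent *de;
    DIR *dir;
    int ret = -ENOENT;

    assert(name != NULL && name_len > 0);

    /* Identify the child by device and inode number. */
    if (lstat(child_path, &want) != 0)
        return -errno;

    dir = opendir(parent_path);
    if (dir == NULL)
        return -errno;

    /* Scan every entry, skipping "." and "..", until one matches. */
    while ((de = readdir(dir)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        if (fstatat(dirfd(dir), de->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0)
            continue;
        if (st.st_ino != want.st_ino || st.st_dev != want.st_dev)
            continue;

        if (strlen(de->d_name) >= name_len) {
            ret = -ENAMETOOLONG;
        } else {
            strcpy(name, de->d_name);
            ret = 0;
        }
        break;
    }

    closedir(dir);
    return ret;
}
```

The scan is linear in the size of the directory, which is one reason a filesystem that can find the name more cheaply may prefer to supply its own get_name.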
A filehandle fragment consists of an array of 1 or more 4-byte words, together with a one-byte “type”. The decode routines (fh_to_dentry and fh_to_parent) should not depend on the stated size that is passed to them. This size may be larger than the original filehandle generated by encode_fh, in which case it will have been padded with nuls. Rather, the encode_fh routine should choose a “type” which tells the decode routines how much of the filehandle is valid, and how it should be interpreted.
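To make the role of the “type” byte concrete, here is a minimal userspace sketch of an encoder and two decoders. It is loosely modelled on an inode-number-plus-generation layout, but the `toy_` names, the type values, and the function signatures are assumptions for illustration rather than the kernel's definitions. The point it demonstrates is that the decoders derive the number of valid words from the type and treat the stated length only as an upper bound, so trailing padding is harmless.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fragment types used only by this sketch. */
enum {
    TOY_FID_INO_GEN        = 1,    /* 2 words: inode number, generation */
    TOY_FID_INO_GEN_PARENT = 2,    /* 4 words: as above plus the parent's pair */
    TOY_FID_INVALID        = 0xff  /* buffer too small to encode */
};

struct toy_object {
    uint32_t ino, gen;
    uint32_t parent_ino, parent_gen;
};

/* *max_len is the buffer capacity in 4-byte words on entry and the
 * number of words used (or needed) on return. Returns the type. */
static int toy_encode_fh(const struct toy_object *obj, uint32_t *fh,
                         int *max_len, int want_parent)
{
    int need = want_parent ? 4 : 2;

    assert(obj != NULL && fh != NULL && max_len != NULL);

    if (*max_len < need) {
        *max_len = need;
        return TOY_FID_INVALID;
    }

    fh[0] = obj->ino;
    fh[1] = obj->gen;
    if (want_parent) {
        fh[2] = obj->parent_ino;
        fh[3] = obj->parent_gen;
    }
    *max_len = need;
    return want_parent ? TOY_FID_INO_GEN_PARENT : TOY_FID_INO_GEN;
}

/* Number of valid words implied by a type, or 0 for an unknown type. */
static int toy_valid_words(int fh_type)
{
    switch (fh_type) {
    case TOY_FID_INO_GEN:        return 2;
    case TOY_FID_INO_GEN_PARENT: return 4;
    default:                     return 0;
    }
}

/* Decode the object itself. fh_len may exceed the valid length
 * because of padding; the extra words are ignored. */
static int toy_decode_object(const uint32_t *fh, int fh_len, int fh_type,
                             uint32_t *ino, uint32_t *gen)
{
    int valid = toy_valid_words(fh_type);

    if (valid == 0 || fh_len < valid)
        return -1;
    *ino = fh[0];
    *gen = fh[1];
    return 0;
}

/* Decode the parent. Fails when the fragment carries no parent data. */
static int toy_decode_parent(const uint32_t *fh, int fh_len, int fh_type,
                             uint32_t *ino, uint32_t *gen)
{
    if (fh_type != TOY_FID_INO_GEN_PARENT || fh_len < 4)
        return -1;
    *ino = fh[2];
    *gen = fh[3];
    return 0;
}
```

A fragment encoded without parent information decodes fine as an object but cannot yield a parent, which mirrors the note that fh_to_parent may fail if the filehandle fragment is too small.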
Export Operations Flags
In addition to the operation vector pointers, struct export_operations also contains a “flags” field that allows the filesystem to communicate to nfsd that it may want to do things differently when dealing with it. The following flags are defined:
- EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem
RFC 1813 recommends that servers always send weak cache consistency (WCC) data to the client after each operation. The server should atomically collect attributes about the inode, do an operation on it, and then collect the attributes afterward. This allows the client to skip issuing GETATTRs in some situations but means that the server is calling vfs_getattr for almost all RPCs. On some filesystems (particularly those that are clustered or networked) this is expensive and atomicity is difficult to guarantee. This flag indicates to nfsd that it should skip providing WCC attributes to the client in NFSv3 replies when doing operations on this filesystem. Consider enabling this on filesystems that have an expensive ->getattr inode operation, or when atomicity between pre and post operation attribute collection is impossible to guarantee.
- EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs
Many NFS operations deal with filehandles, which the server must then vet to ensure that they live inside of an exported tree. When the export consists of an entire filesystem, this is trivial: nfsd can just ensure that the filehandle lives on the filesystem. When only part of a filesystem is exported, however, nfsd must walk the ancestors of the inode to ensure that it’s within an exported subtree. This is an expensive operation and not all filesystems can support it properly. This flag exempts the filesystem from subtree checking and causes exportfs to get back an error if it tries to enable subtree checking on it.
- EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking
On some exportable filesystems (such as NFS) unlinking a file that is still open can cause a fair bit of extra work. For instance, the NFS client will do a “sillyrename” to ensure that the file sticks around while it’s still open. When reexporting, that open file is held by nfsd so we usually end up doing a sillyrename, and then immediately deleting the sillyrenamed file just afterward when the link count actually goes to zero. Sometimes this delete can race with other operations (for instance an rmdir of the parent directory). This flag causes nfsd to close any open files for this inode _before_ calling into the vfs to do an unlink or a rename that would replace an existing file.
- EXPORT_OP_REMOTE_FS - Backing storage for this filesystem is remote
PF_LOCAL_THROTTLE exists for loopback NFSD, where a thread needs to write to one bdi (the final bdi) in order to free up writes queued to another bdi (the client bdi). Such threads get a private balance of dirty pages so that dirty pages for the client bdi do not impact the daemon writing to the final bdi. For filesystems whose durable storage is not local (such as exported NFS filesystems), this constraint has negative consequences. EXPORT_OP_REMOTE_FS enables an export to disable writeback throttling.
- EXPORT_OP_NOATOMIC_ATTR - Filesystem does not update attributes atomically
EXPORT_OP_NOATOMIC_ATTR indicates that the exported filesystem cannot provide the semantics required by the “atomic” boolean in NFSv4’s change_info4. This boolean indicates to a client whether the returned before and after change attributes were obtained atomically with respect to the requested metadata operation (UNLINK, OPEN/CREATE, MKDIR, etc).
- EXPORT_OP_FLUSH_ON_CLOSE - Filesystem flushes file data on close(2)
On most filesystems, inodes can remain under writeback after the file is closed. NFSD relies on client activity or local flusher threads to handle writeback. Certain filesystems, such as NFS, flush all of an inode’s dirty data on last close. Exports that behave this way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip waiting for writeback when closing such files.
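As a compact summary of the flags above, the following userspace sketch shows how a consumer of the flags field might turn each bit into a decision. It is a toy model: the `TOY_` constants reuse the flag names but their bit values are placeholders chosen for this example, the struct is a stand-in for struct export_operations, and the helper functions are hypothetical rather than anything nfsd actually exports.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for this sketch; consult the kernel's
 * exportfs header for the real definitions. */
#define TOY_EXPORT_OP_NOWCC               0x01
#define TOY_EXPORT_OP_NOSUBTREECHK        0x02
#define TOY_EXPORT_OP_CLOSE_BEFORE_UNLINK 0x04
#define TOY_EXPORT_OP_REMOTE_FS           0x08
#define TOY_EXPORT_OP_NOATOMIC_ATTR       0x10
#define TOY_EXPORT_OP_FLUSH_ON_CLOSE      0x20

/* Stand-in for the flags member of struct export_operations. */
struct toy_export_operations {
    unsigned long flags;
};

/* Should NFSv3 replies carry WCC attributes? */
static bool toy_send_wcc(const struct toy_export_operations *ops)
{
    return !(ops->flags & TOY_EXPORT_OP_NOWCC);
}

/* Returns 0 if the requested subtree-check setting is acceptable,
 * or -1 if the filesystem disallows subtree checking. */
static int toy_set_subtree_check(const struct toy_export_operations *ops,
                                 bool enable)
{
    if (enable && (ops->flags & TOY_EXPORT_OP_NOSUBTREECHK))
        return -1;
    return 0;
}

/* Close cached open files before an unlink or replacing rename? */
static bool toy_close_before_unlink(const struct toy_export_operations *ops)
{
    return (ops->flags & TOY_EXPORT_OP_CLOSE_BEFORE_UNLINK) != 0;
}

/* Keep the local writeback throttle for server threads? */
static bool toy_use_local_throttle(const struct toy_export_operations *ops)
{
    return !(ops->flags & TOY_EXPORT_OP_REMOTE_FS);
}

/* May the "atomic" boolean in change_info4 be reported as true? */
static bool toy_change_info_atomic(const struct toy_export_operations *ops)
{
    return !(ops->flags & TOY_EXPORT_OP_NOATOMIC_ATTR);
}

/* Wait for writeback when closing a file? */
static bool toy_wait_for_writeback_on_close(const struct toy_export_operations *ops)
{
    return !(ops->flags & TOY_EXPORT_OP_FLUSH_ON_CLOSE);
}
```

A re-exported NFS mount, for example, would plausibly combine several of these bits, while a simple local filesystem would leave the field at zero and get the default behaviour everywhere.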