Overview of the Linux Virtual File System
Original author: Richard Gooch <rgooch@atnf.csiro.au>
Copyright (C) 1999 Richard Gooch
Copyright (C) 2005 Pekka Enberg
Introduction
The Virtual File System (also known as the Virtual Filesystem Switch) is the software layer in the kernel that provides the filesystem interface to userspace programs. It also provides an abstraction within the kernel which allows different filesystem implementations to coexist.
VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so on are called from a process context. Filesystem locking is described in the document Locking.
Directory Entry Cache (dcache)
The VFS implements the open(2), stat(2), chmod(2), and similar system calls. The pathname argument that is passed to them is used by the VFS to search through the directory entry cache (also known as the dentry cache or dcache). This provides a very fast look-up mechanism to translate a pathname (filename) into a specific dentry. Dentries live in RAM and are never saved to disc: they exist only for performance.
The dentry cache is meant to be a view into your entire filespace. As most computers cannot fit all dentries in the RAM at the same time, some bits of the cache are missing. In order to resolve your pathname into a dentry, the VFS may have to resort to creating dentries along the way, and then loading the inode. This is done by looking up the inode.
The Inode Object
An individual dentry usually has a pointer to an inode. Inodes are filesystem objects such as regular files, directories, FIFOs and other beasts. They live either on the disc (for block device filesystems) or in the memory (for pseudo filesystems). Inodes that live on the disc are copied into the memory when required and changes to the inode are written back to disc. A single inode can be pointed to by multiple dentries (hard links, for example, do this).
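The claim that several dentries can point at one inode can be observed from userspace. The following is a minimal sketch, assuming a Linux system and a filesystem that supports hard links; the function name hard_link_shares_inode() is made up for this illustration and is not a kernel or libc interface.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create an empty file at `orig`, hard-link it to `alias`, and check
 * that both names resolve to the same inode.
 *
 * Returns 1 if both names share one inode with a link count of 2,
 * 0 if they do not, and -1 if any step fails.
 */
int hard_link_shares_inode(const char *orig, const char *alias)
{
    struct stat so, sa;
    FILE *f = fopen(orig, "w");

    if (!f)
        return -1;
    fclose(f);

    /* link(2) adds a second directory entry for the same inode. */
    if (link(orig, alias) != 0)
        return -1;
    if (stat(orig, &so) != 0 || stat(alias, &sa) != 0)
        return -1;

    /* Same device and inode number means the same filesystem object. */
    return so.st_dev == sa.st_dev && so.st_ino == sa.st_ino &&
           so.st_nlink == 2;
}
```

Calling it with two fresh paths in the same directory is expected to return 1 on common Linux filesystems; a return of -1 usually means the paths already exist or the filesystem refuses hard links.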
To look up an inode requires that the VFS calls the lookup() method of the parent directory inode. This method is installed by the specific filesystem implementation that the inode lives in. Once the VFS has the required dentry (and hence the inode), we can do all those boring things like open(2) the file, or stat(2) it to peek at the inode data. The stat(2) operation is fairly simple: once the VFS has the dentry, it peeks at the inode data and passes some of it back to userspace.
The File Object
Opening a file requires another operation: allocation of a file structure (this is the kernel-side implementation of file descriptors). The freshly allocated file structure is initialized with a pointer to the dentry and a set of file operation member functions. These are taken from the inode data. The open() file method is then called so the specific filesystem implementation can do its work. You can see that this is another switch performed by the VFS. The file structure is placed into the file descriptor table for the process.
Reading, writing and closing files (and other assorted VFS operations) is done by using the userspace file descriptor to grab the appropriate file structure, and then calling the required file structure method to do whatever is required. For as long as the file is open, it keeps the dentry in use, which in turn means that the VFS inode is still in use.
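The distinction between a file descriptor and the file structure behind it can be seen from userspace: descriptors created with dup(2) share one open file, so a write through one descriptor moves the offset seen through the other. The sketch below assumes a POSIX system; offset_seen_by_dup() is a hypothetical helper name used only here.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Write `len` bytes through `fd`, then return the file offset observed
 * through a dup(2)'ed copy of that descriptor.  Both descriptors refer
 * to the same open file, so the offset advanced by the write is visible
 * through the copy.  Returns -1 on failure.
 */
off_t offset_seen_by_dup(int fd, const char *buf, size_t len)
{
    int copy = dup(fd);
    off_t pos;

    if (copy < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len) {
        close(copy);
        return -1;
    }
    pos = lseek(copy, 0, SEEK_CUR);   /* query the offset via the copy */
    close(copy);
    return pos;
}
```

For a freshly created file and a five-byte write, the function returns 5, even though nothing was ever written through the duplicated descriptor.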
Registering and Mounting a Filesystem
To register and unregister a filesystem, use the following API functions:

    #include <linux/fs.h>

    extern int register_filesystem(struct file_system_type *);
    extern int unregister_filesystem(struct file_system_type *);

The passed struct file_system_type describes your filesystem. When a request is made to mount a filesystem onto a directory in your namespace, the VFS will call the appropriate mount() method for the specific filesystem. A new vfsmount referring to the tree returned by ->mount() will be attached to the mountpoint, so that when pathname resolution reaches the mountpoint it will jump into the root of that vfsmount.
You can see all filesystems that are registered to the kernel in the file /proc/filesystems.
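As an illustration of the last point, a userspace program can check whether a filesystem type is registered by scanning /proc/filesystems. This is a small sketch, assuming procfs is mounted at /proc and that each line has the form of an optional "nodev" flag, a tab, and the filesystem name; fs_is_registered() is a name invented for this example.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if `name` is listed in /proc/filesystems, 0 if it is not,
 * and -1 if the file cannot be opened.
 */
int fs_is_registered(const char *name)
{
    char line[256];
    int found = 0;
    FILE *f = fopen("/proc/filesystems", "r");

    if (!f)
        return -1;
    while (!found && fgets(line, sizeof(line), f)) {
        /* The name follows the first tab; the flag column may be empty. */
        char *tab = strchr(line, '\t');

        if (!tab)
            continue;
        tab[1 + strcspn(tab + 1, "\n")] = '\0';   /* strip the newline */
        found = strcmp(tab + 1, name) == 0;
    }
    fclose(f);
    return found;
}
```

The comparison is against the whole name, so asking for "ext" does not match "ext4".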
struct file_system_type
This describes the filesystem. The following members are defined:

    struct file_system_type {
        const char *name;
        int fs_flags;
        int (*init_fs_context)(struct fs_context *);
        const struct fs_parameter_spec *parameters;
        struct dentry *(*mount)(struct file_system_type *, int,
                                const char *, void *);
        void (*kill_sb)(struct super_block *);
        struct module *owner;
        struct file_system_type *next;
        struct hlist_head fs_supers;

        struct lock_class_key s_lock_key;
        struct lock_class_key s_umount_key;
        struct lock_class_key s_vfs_rename_key;
        struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];

        struct lock_class_key i_lock_key;
        struct lock_class_key i_mutex_key;
        struct lock_class_key invalidate_lock_key;
        struct lock_class_key i_mutex_dir_key;
    };

name
    the name of the filesystem type, such as “ext2”, “iso9660”, “msdos” and so on
fs_flags
    various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE)
init_fs_context
    Initializes ‘struct fs_context’ ->ops and ->fs_private fields with filesystem-specific data.
parameters
    Pointer to the array of filesystem parameters descriptors ‘struct fs_parameter_spec’. More info in Filesystem Mount API.
mount
    the method to call when a new instance of this filesystem should be mounted
kill_sb
    the method to call when an instance of this filesystem should be shut down
owner
    for internal VFS use: you should initialize this to THIS_MODULE in most cases.
next
    for internal VFS use: you should initialize this to NULL
fs_supers
    for internal VFS use: hlist of filesystem instances (superblocks)
s_lock_key, s_umount_key, s_vfs_rename_key, s_writers_key, i_lock_key, i_mutex_key, invalidate_lock_key, i_mutex_dir_key
    lockdep-specific
The mount() method has the following arguments:
struct file_system_type *fs_type
    describes the filesystem, partly initialized by the specific filesystem code
int flags
    mount flags
const char *dev_name
    the device name we are mounting.
void *data
    arbitrary mount options, usually comes as an ASCII string (see “Mount Options” section)
The mount() method must return the root dentry of the tree requested by the caller. An active reference to its superblock must be grabbed and the superblock must be locked. On failure it should return ERR_PTR(error).
The arguments match those of mount(2) and their interpretation depends on filesystem type. E.g. for block filesystems, dev_name is interpreted as a block device name; that device is opened and, if it contains a suitable filesystem image, the method creates and initializes struct super_block accordingly, returning its root dentry to the caller.
->mount() may choose to return a subtree of an existing filesystem - it doesn’t have to create a new one. The main result from the caller’s point of view is a reference to the dentry at the root of the (sub)tree to be attached; creation of a new superblock is a common side effect.
The most interesting member of the superblock structure that the mount() method fills in is the “s_op” field. This is a pointer to a “struct super_operations” which describes the next level of the filesystem implementation.
For more information on mounting (and the new mount API), see Filesystem Mount API.
The Superblock Object
A superblock object represents a mounted filesystem.
struct super_operations
This describes how the VFS can manipulate the superblock of your filesystem. The following members are defined:

    struct super_operations {
        struct inode *(*alloc_inode)(struct super_block *sb);
        void (*destroy_inode)(struct inode *);
        void (*free_inode)(struct inode *);

        void (*dirty_inode)(struct inode *, int flags);
        int (*write_inode)(struct inode *, struct writeback_control *wbc);
        int (*drop_inode)(struct inode *);
        void (*evict_inode)(struct inode *);
        void (*put_super)(struct super_block *);
        int (*sync_fs)(struct super_block *sb, int wait);
        int (*freeze_super)(struct super_block *sb, enum freeze_holder who);
        int (*freeze_fs)(struct super_block *);
        int (*thaw_super)(struct super_block *sb, enum freeze_holder who);
        int (*unfreeze_fs)(struct super_block *);
        int (*statfs)(struct dentry *, struct kstatfs *);
        int (*remount_fs)(struct super_block *, int *, char *);
        void (*umount_begin)(struct super_block *);

        int (*show_options)(struct seq_file *, struct dentry *);
        int (*show_devname)(struct seq_file *, struct dentry *);
        int (*show_path)(struct seq_file *, struct dentry *);
        int (*show_stats)(struct seq_file *, struct dentry *);

        ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
        ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
        struct dquot **(*get_dquots)(struct inode *);

        long (*nr_cached_objects)(struct super_block *, struct shrink_control *);
        long (*free_cached_objects)(struct super_block *, struct shrink_control *);
    };

All methods are called without any locks being held, unless otherwise noted. This means that most methods can block safely. All methods are only called from a process context (i.e. not from an interrupt handler or bottom half).
alloc_inode
    this method is called by alloc_inode() to allocate memory for struct inode and initialize it. If this function is not defined, a simple ‘struct inode’ is allocated. Normally alloc_inode will be used to allocate a larger structure which contains a ‘struct inode’ embedded within it.
destroy_inode
    this method is called by destroy_inode() to release resources allocated for struct inode. It is only required if ->alloc_inode was defined and simply undoes anything done by ->alloc_inode.
free_inode
    this method is called from RCU callback. If you use call_rcu() in ->destroy_inode to free ‘struct inode’ memory, then it’s better to release memory in this method.
dirty_inode
    this method is called by the VFS when an inode is marked dirty. This is specifically for the inode itself being marked dirty, not its data. If the update needs to be persisted by fdatasync(), then I_DIRTY_DATASYNC will be set in the flags argument. I_DIRTY_TIME will be set in the flags in case lazytime is enabled and struct inode has times updated since the last ->dirty_inode call.
write_inode
    this method is called when the VFS needs to write an inode to disc. The second parameter indicates whether the write should be synchronous or not; not all filesystems check this flag.
drop_inode
    called when the last access to the inode is dropped, with the inode->i_lock spinlock held.
    This method should be either NULL (normal UNIX filesystem semantics) or “inode_just_drop” (for filesystems that do not want to cache inodes - causing “delete_inode” to always be called regardless of the value of i_nlink)
    The “inode_just_drop()” behavior is equivalent to the old practice of using “force_delete” in the put_inode() case, but does not have the races that the “force_delete()” approach had.
evict_inode
    called when the VFS wants to evict an inode. Caller does not evict the pagecache or inode-associated metadata buffers; the method has to use truncate_inode_pages_final() to get rid of those. Caller makes sure async writeback cannot be running for the inode while (or after) ->evict_inode() is called. Optional.
put_super
    called when the VFS wishes to free the superblock (i.e. unmount). This is called with the superblock lock held
sync_fs
    called when VFS is writing out all dirty data associated with a superblock. The second parameter indicates whether the method should wait until the write out has been completed. Optional.
freeze_super
    Called instead of ->freeze_fs callback if provided. Main difference is that ->freeze_super is called without taking down_write(&sb->s_umount). If filesystem implements it and wants ->freeze_fs to be called too, then it has to call ->freeze_fs explicitly from this callback. Optional.
freeze_fs
    called when VFS is locking a filesystem and forcing it into a consistent state. This method is currently used by the Logical Volume Manager (LVM) and ioctl(FIFREEZE). Optional.
thaw_super
    called when VFS is unlocking a filesystem and making it writable again after ->freeze_super. Optional.
unfreeze_fs
    called when VFS is unlocking a filesystem and making it writable again after ->freeze_fs. Optional.
statfs
    called when the VFS needs to get filesystem statistics.
remount_fs
    called when the filesystem is remounted. This is called with the kernel lock held
umount_begin
    called when the VFS is unmounting a filesystem.
show_options
    called by the VFS to show mount options for /proc/<pid>/mounts and /proc/<pid>/mountinfo. (see “Mount Options” section)
show_devname
    Optional. Called by the VFS to show device name for /proc/<pid>/{mounts,mountinfo,mountstats}. If not provided then ‘(struct mount).mnt_devname’ will be used.
show_path
    Optional. Called by the VFS (for /proc/<pid>/mountinfo) to show the mount root dentry path relative to the filesystem root.
show_stats
    Optional. Called by the VFS (for /proc/<pid>/mountstats) to show filesystem-specific mount statistics.
quota_read
    called by the VFS to read from filesystem quota file.
quota_write
    called by the VFS to write to filesystem quota file.
get_dquots
    called by quota to get ‘struct dquot’ array for a particular inode. Optional.
nr_cached_objects
    called by the sb cache shrinking function for the filesystem to return the number of freeable cached objects it contains. Optional.
free_cached_objects
    called by the sb cache shrinking function for the filesystem to scan the number of objects indicated to try to free them. Optional, but any filesystem implementing this method needs to also implement ->nr_cached_objects for it to be called correctly.
    We can’t do anything with any errors that the filesystem might have encountered, hence the void return type. This will never be called if the VM is trying to reclaim under GFP_NOFS conditions, hence this method does not need to handle that situation itself.
    Implementations must include conditional reschedule calls inside any scanning loop that is done. This allows the VFS to determine appropriate scan batch sizes without having to worry about whether implementations will cause holdoff problems due to large scan batch sizes.
Whoever sets up the inode is responsible for filling in the “i_op” field. This is a pointer to a “struct inode_operations” which describes the methods that can be performed on individual inodes.
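The alloc_inode description above mentions allocating a larger structure with a ‘struct inode’ embedded within it. The following userspace sketch shows that embedding pattern under stated assumptions: struct inode here is a one-field toy stand-in, and struct myfs_inode, MYFS_I(), myfs_alloc_inode() and myfs_free_inode() are hypothetical names for illustration, not code from any real filesystem. The container_of() macro is a simplified userspace equivalent of the kernel helper of the same name.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for the kernel's struct inode; only the shape matters. */
struct inode {
    unsigned long i_ino;
};

/* Hypothetical filesystem-private inode with the generic inode embedded. */
struct myfs_inode {
    int myfs_flags;
    struct inode vfs_inode;
};

/* Simplified userspace version of the kernel's container_of() helper. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the private structure from a pointer to the embedded inode. */
static inline struct myfs_inode *MYFS_I(struct inode *inode)
{
    return container_of(inode, struct myfs_inode, vfs_inode);
}

/* Models ->alloc_inode: allocate the larger structure, hand out the
 * embedded generic part. */
struct inode *myfs_alloc_inode(void)
{
    struct myfs_inode *mi = calloc(1, sizeof(*mi));

    if (!mi)
        return NULL;
    mi->myfs_flags = 42;   /* arbitrary filesystem-private state */
    return &mi->vfs_inode;
}

/* Models the matching release step: free what myfs_alloc_inode() made. */
void myfs_free_inode(struct inode *inode)
{
    free(MYFS_I(inode));
}
```

Generic code only ever sees the struct inode pointer, while the filesystem can get back to its private fields with MYFS_I() at no extra allocation cost.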
struct xattr_handler
On filesystems that support extended attributes (xattrs), the s_xattr superblock field points to a NULL-terminated array of xattr handlers. Extended attributes are name:value pairs.
name
    Indicates that the handler matches attributes with the specified name (such as “system.posix_acl_access”); the prefix field must be NULL.
prefix
    Indicates that the handler matches all attributes with the specified name prefix (such as “user.”); the name field must be NULL.
list
    Determine if attributes matching this xattr handler should be listed for a particular dentry. Used by some listxattr implementations like generic_listxattr.
get
    Called by the VFS to get the value of a particular extended attribute. This method is called by the getxattr(2) system call.
set
    Called by the VFS to set the value of a particular extended attribute. When the new value is NULL, called to remove a particular extended attribute. This method is called by the setxattr(2) and removexattr(2) system calls.
When none of the xattr handlers of a filesystem match the specified attribute name or when a filesystem doesn’t support extended attributes, the various *xattr(2) system calls return -EOPNOTSUPP.
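The name/prefix matching rule and the -EOPNOTSUPP fallback can be modelled in a few lines of userspace C. This is a toy sketch, not the kernel's resolver: struct toy_xattr_handler and toy_xattr_resolve() are invented for illustration, only the two matching fields are modelled, and details such as how the real code treats the part of the name after a prefix are deliberately left out.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy handler: exactly one of name/prefix is set, as described above. */
struct toy_xattr_handler {
    const char *name;     /* exact attribute name, or NULL */
    const char *prefix;   /* attribute name prefix, or NULL */
};

/*
 * Walk a NULL-terminated handler array and return the first handler
 * that matches `attr`.  On no match (or no handler table at all),
 * return NULL and store -EOPNOTSUPP in *err.
 */
const struct toy_xattr_handler *
toy_xattr_resolve(const struct toy_xattr_handler *const *handlers,
                  const char *attr, int *err)
{
    *err = 0;
    for (; handlers && *handlers; handlers++) {
        const struct toy_xattr_handler *h = *handlers;

        if (h->name && strcmp(attr, h->name) == 0)
            return h;
        if (h->prefix && strncmp(attr, h->prefix, strlen(h->prefix)) == 0)
            return h;
    }
    *err = -EOPNOTSUPP;
    return NULL;
}
```

With a table holding one exact-name handler for “system.posix_acl_access” and one prefix handler for “user.”, a lookup of “user.comment” finds the prefix handler, while “trusted.foo” matches nothing and yields -EOPNOTSUPP.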
The Inode Object
An inode object represents an object within the filesystem.
struct inode_operations
This describes how the VFS can manipulate an inode in your filesystem. As of kernel 2.6.22, the following members are defined:

    struct inode_operations {
        int (*create)(struct mnt_idmap *, struct inode *, struct dentry *, umode_t, bool);
        struct dentry *(*lookup)(struct inode *, struct dentry *, unsigned int);
        int (*link)(struct dentry *, struct inode *, struct dentry *);
        int (*unlink)(struct inode *, struct dentry *);
        int (*symlink)(struct mnt_idmap *, struct inode *, struct dentry *, const char *);
        struct dentry *(*mkdir)(struct mnt_idmap *, struct inode *, struct dentry *, umode_t);
        int (*rmdir)(struct inode *, struct dentry *);
        int (*mknod)(struct mnt_idmap *, struct inode *, struct dentry *, umode_t, dev_t);
        int (*rename)(struct mnt_idmap *, struct inode *, struct dentry *,
                      struct inode *, struct dentry *, unsigned int);
        int (*readlink)(struct dentry *, char __user *, int);
        const char *(*get_link)(struct dentry *, struct inode *, struct delayed_call *);
        int (*permission)(struct mnt_idmap *, struct inode *, int);
        struct posix_acl *(*get_inode_acl)(struct inode *, int, bool);
        int (*setattr)(struct mnt_idmap *, struct dentry *, struct iattr *);
        int (*getattr)(struct mnt_idmap *, const struct path *, struct kstat *, u32, unsigned int);
        ssize_t (*listxattr)(struct dentry *, char *, size_t);
        void (*update_time)(struct inode *, struct timespec *, int);
        int (*atomic_open)(struct inode *, struct dentry *, struct file *,
                           unsigned open_flag, umode_t create_mode);
        int (*tmpfile)(struct mnt_idmap *, struct inode *, struct file *, umode_t);
        struct posix_acl *(*get_acl)(struct mnt_idmap *, struct dentry *, int);
        int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
        int (*fileattr_set)(struct mnt_idmap *idmap, struct dentry *dentry, struct file_kattr *fa);
        int (*fileattr_get)(struct dentry *dentry, struct file_kattr *fa);
        struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
    };

Again, all methods are called without any locks being held, unless otherwise noted.
create
    called by the open(2) and creat(2) system calls. Only required if you want to support regular files. The dentry you get should not have an inode (i.e. it should be a negative dentry). Here you will probably call d_instantiate() with the dentry and the newly created inode
lookup
    called when the VFS needs to look up an inode in a parent directory. The name to look for is found in the dentry. This method must call d_add() to insert the found inode into the dentry. The “i_count” field in the inode structure should be incremented. If the named inode does not exist a NULL inode should be inserted into the dentry (this is called a negative dentry). Returning an error code from this routine must only be done on a real error, otherwise creating inodes with system calls like creat(2), mknod(2), mkdir(2) and so on will fail. If you wish to overload the dentry methods then you should initialise the “d_op” field in the dentry; this is a pointer to a struct “dentry_operations”. This method is called with the directory inode semaphore held
link
    called by the link(2) system call. Only required if you want to support hard links. You will probably need to call d_instantiate() just as you would in the create() method
unlink
    called by the unlink(2) system call. Only required if you want to support deleting inodes
symlink
    called by the symlink(2) system call. Only required if you want to support symlinks. You will probably need to call d_instantiate() just as you would in the create() method
mkdir
    called by the mkdir(2) system call. Only required if you want to support creating subdirectories. You will probably need to call d_instantiate_new() just as you would in the create() method.
    If d_instantiate_new() is not used and if the fh_to_dentry() export operation is provided, or if the storage might be accessible by another path (e.g. with a network filesystem) then more care may be needed. Importantly d_instantiate() should not be used with an inode that is no longer I_NEW if there is any chance that the inode could already be attached to a dentry. This is because of a hard rule in the VFS that a directory must only ever have one dentry.
    For example, if an NFS filesystem is mounted twice the new directory could be visible on the other mount before it is on the original mount, and a pair of name_to_handle_at(), open_by_handle_at() calls could instantiate the directory inode with an IS_ROOT() dentry before the first mkdir returns.
    If there is any chance this could happen, then the new inode should be d_drop()ed and attached with d_splice_alias(). The returned dentry (if any) should be returned by ->mkdir().
rmdir
    called by the rmdir(2) system call. Only required if you want to support deleting subdirectories
mknod
    called by the mknod(2) system call to create a device (char, block) inode or a named pipe (FIFO) or socket. Only required if you want to support creating these types of inodes. You will probably need to call d_instantiate() just as you would in the create() method
rename
    called by the rename(2) system call to rename the object to have the parent and name given by the second inode and dentry.
    The filesystem must return -EINVAL for any unsupported or unknown flags. Currently the following flags are implemented:
    (1) RENAME_NOREPLACE: this flag indicates that if the target of the rename exists the rename should fail with -EEXIST instead of replacing the target. The VFS already checks for existence, so for local filesystems the RENAME_NOREPLACE implementation is equivalent to plain rename.
    (2) RENAME_EXCHANGE: exchange source and target. Both must exist; this is checked by the VFS. Unlike plain rename, source and target may be of different type.
get_link
    called by the VFS to follow a symbolic link to the inode it points to. Only required if you want to support symbolic links. This method returns the symlink body to traverse (and possibly resets the current position with nd_jump_link()). If the body won’t go away until the inode is gone, nothing else is needed; if it needs to be otherwise pinned, arrange for its release by having get_link(..., ..., done) do set_delayed_call(done, destructor, argument). In that case destructor(argument) will be called once VFS is done with the body you’ve returned. May be called in RCU mode; that is indicated by NULL dentry argument. If request can’t be handled without leaving RCU mode, have it return ERR_PTR(-ECHILD).
    If the filesystem stores the symlink target in ->i_link, the VFS may use it directly without calling ->get_link(); however, ->get_link() must still be provided. ->i_link must not be freed until after an RCU grace period. Writing to ->i_link post-iget() time requires a ‘release’ memory barrier.
readlink
    this is now just an override for use by readlink(2) for the cases when ->get_link uses nd_jump_link() or object is not in fact a symlink. Normally filesystems should only implement ->get_link for symlinks and readlink(2) will automatically use that.
permission
    called by the VFS to check for access rights on a POSIX-like filesystem.
    May be called in rcu-walk mode (mask & MAY_NOT_BLOCK). If in rcu-walk mode, the filesystem must check the permission without blocking or storing to the inode.
    If a situation is encountered that rcu-walk cannot handle, return -ECHILD and it will be called again in ref-walk mode.
setattr
    called by the VFS to set attributes for a file. This method is called by chmod(2) and related system calls.
getattr
    called by the VFS to get attributes of a file. This method is called by stat(2) and related system calls.
listxattr
    called by the VFS to list all extended attributes for a given file. This method is called by the listxattr(2) system call.
update_time
    called by the VFS to update a specific time or the i_version of an inode. If this is not defined the VFS will update the inode itself and call mark_inode_dirty_sync.
atomic_open
    called on the last component of an open. Using this optional method the filesystem can look up, possibly create and open the file in one atomic operation. If it wants to leave actual opening to the caller (e.g. if the file turned out to be a symlink, device, or just something filesystem won’t do atomic open for), it may signal this by returning finish_no_open(file, dentry). This method is only called if the last component is negative or needs lookup. Cached positive dentries are still handled by f_op->open(). If the file was created, FMODE_CREATED flag should be set in file->f_mode. In case of O_EXCL the method must only succeed if the file didn’t exist and hence FMODE_CREATED shall always be set on success.
tmpfile
    called in the end of O_TMPFILE open(). Optional, equivalent to atomically creating, opening and unlinking a file in given directory. On success needs to return with the file already open; this can be done by calling finish_open_simple() right at the end.
fileattr_get
    called on ioctl(FS_IOC_GETFLAGS) and ioctl(FS_IOC_FSGETXATTR) to retrieve miscellaneous file flags and attributes. Also called before the relevant SET operation to check what is being changed (in this case with i_rwsem locked exclusive). If unset, then fall back to f_op->ioctl().
fileattr_set
    called on ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) to change miscellaneous file flags and attributes. Callers hold i_rwsem exclusive. If unset, then fall back to f_op->ioctl().
get_offset_ctx
    called to get the offset context for a directory inode. A filesystem must define this operation to use simple_offset_dir_operations.
The Address Space Object
The address space object is used to group and manage pages in the page cache. It can be used to keep track of the pages in a file (or anything else) and also track the mapping of sections of the file into process address spaces.
There are a number of distinct yet related services that an address-space can provide. These include communicating memory pressure, page lookup by address, and keeping track of pages tagged as Dirty or Writeback.
The first can be used independently of the others. The VM can try to release clean pages in order to reuse them. To do this it can call ->release_folio on clean folios with the private flag set. Clean pages without PagePrivate and with no external references will be released without notice being given to the address_space.
To achieve this functionality, pages need to be placed on an LRU with lru_cache_add and mark_page_active needs to be called whenever the page is used.
Pages are normally kept in a radix tree indexed by ->index. This tree maintains information about the PG_Dirty and PG_Writeback status of each page, so that pages with either of these flags can be found quickly.
The Dirty tag is primarily used by mpage_writepages - the default ->writepages method. It uses the tag to find dirty pages to write back. If mpage_writepages is not used (i.e. the address space provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is almost unused. write_inode_now and sync_inode do use it (through __sync_single_inode) to check if ->writepages has been successful in writing out the whole address_space.
The Writeback tag is used by filemap*wait* and sync_page* functions, via filemap_fdatawait_range, to wait for all writeback to complete.
An address_space handler may attach extra information to a page, typically using the ‘private’ field in the ‘struct page’. If such information is attached, the PG_Private flag should be set. This will cause various VM routines to make extra calls into the address_space handler to deal with that data.
An address space acts as an intermediary between storage and application. Data is read into the address space a whole page at a time, and provided to the application either by copying of the page, or by memory-mapping the page. Data is written into the address space by the application, and then written back to storage, typically in whole pages; however, the address_space has finer control of write sizes.
The read process essentially only requires ‘read_folio’. The write process is more complicated and uses write_begin/write_end or dirty_folio to write data into the address_space, and writepages to write data back to storage.
Removing pages from an address_space requires holding the inode’s i_rwsem exclusively, while adding pages to the address_space requires holding the inode’s i_mapping->invalidate_lock exclusively.
When data is written to a page, the PG_Dirty flag should be set. It typically remains set until writepages asks for it to be written. This should clear PG_Dirty and set PG_Writeback. It can be actually written at any point after PG_Dirty is clear. Once it is known to be safe, PG_Writeback is cleared.
Writeback makes use of a writeback_control structure to direct the operations. This gives the writepages operation some information about the nature of and reason for the writeback request, and the constraints under which it is being done. It is also used to return information back to the caller about the result of a writepages request.
Handling errors during writeback
Most applications that do buffered I/O will periodically call a file synchronization call (fsync, fdatasync, msync or sync_file_range) to ensure that data written has made it to the backing store. When there is an error during writeback, they expect that error to be reported when a file sync request is made. After an error has been reported on one request, subsequent requests on the same file descriptor should return 0, unless further writeback errors have occurred since the previous file synchronization.
Ideally, the kernel would report errors only on file descriptions on which writes were done that subsequently failed to be written back. The generic pagecache infrastructure does not track the file descriptions that have dirtied each individual page however, so determining which file descriptors should get back an error is not possible.
Instead, the generic writeback error tracking infrastructure in the kernel settles for reporting errors to fsync on all file descriptions that were open at the time that the error occurred. In a situation with multiple writers, all of them will get back an error on a subsequent fsync, even if all of the writes done through that particular file descriptor succeeded (or even if there were no writes on that file descriptor at all).
Filesystems that wish to use this infrastructure should call mapping_set_error to record the error in the address_space when it occurs. Then, after writing back data from the pagecache in their file->fsync operation, they should call file_check_and_advance_wb_err to ensure that the struct file’s error cursor has advanced to the correct point in the stream of errors emitted by the backing device(s).
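The "error cursor" idea can be sketched as a toy model in userspace C. This is a deliberate simplification under stated assumptions, not the kernel's implementation: the mapping keeps only the latest error and a counter, each open file remembers the counter value it has already reported, and all toy_* names are hypothetical. toy_set_error() plays the role of mapping_set_error and toy_check_and_advance() the role of file_check_and_advance_wb_err.

```c
#include <errno.h>

/* Toy mapping: the latest writeback error and how many were recorded. */
struct toy_mapping {
    int last_err;        /* most recent error, as a negative errno */
    unsigned long seq;   /* bumped every time an error is recorded */
};

/* Toy open file: its cursor is the seq value it has already reported. */
struct toy_file {
    struct toy_mapping *mapping;
    unsigned long seen;
};

/* Sample the cursor at open time, so that in this model errors recorded
 * before the open are not reported to this file. */
void toy_open(struct toy_file *f, struct toy_mapping *m)
{
    f->mapping = m;
    f->seen = m->seq;
}

/* Record a writeback failure in the mapping. */
void toy_set_error(struct toy_mapping *m, int err)
{
    m->last_err = err;
    m->seq++;
}

/* Report a pending error once per file, then advance the cursor so that
 * later calls return 0 until a new error is recorded. */
int toy_check_and_advance(struct toy_file *f)
{
    if (f->seen == f->mapping->seq)
        return 0;
    f->seen = f->mapping->seq;
    return f->mapping->last_err;
}
```

In this model, two files opened before an -EIO is recorded each see -EIO exactly once from toy_check_and_advance(), and then 0 until another error arrives, which mirrors the reporting behaviour described above.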
struct address_space_operations
This describes how the VFS can manipulate mapping of a file to page cache in your filesystem. The following members are defined:

    struct address_space_operations {
        int (*read_folio)(struct file *, struct folio *);
        int (*writepages)(struct address_space *, struct writeback_control *);
        bool (*dirty_folio)(struct address_space *, struct folio *);
        void (*readahead)(struct readahead_control *);
        int (*write_begin)(const struct kiocb *, struct address_space *mapping,
                           loff_t pos, unsigned len,
                           struct page **pagep, void **fsdata);
        int (*write_end)(const struct kiocb *, struct address_space *mapping,
                         loff_t pos, unsigned len, unsigned copied,
                         struct folio *folio, void *fsdata);
        sector_t (*bmap)(struct address_space *, sector_t);
        void (*invalidate_folio)(struct folio *, size_t start, size_t len);
        bool (*release_folio)(struct folio *, gfp_t);
        void (*free_folio)(struct folio *);
        ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
        int (*migrate_folio)(struct mapping *, struct folio *dst,
                             struct folio *src, enum migrate_mode);
        int (*launder_folio)(struct folio *);

        bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
        void (*is_dirty_writeback)(struct folio *, bool *, bool *);
        int (*error_remove_folio)(struct mapping *mapping, struct folio *);
        int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span);
        int (*swap_deactivate)(struct file *);
        int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
    };

read_folio
Called by the page cache to read a folio from the backing store. The ‘file’ argument supplies authentication information to network filesystems, and is generally not used by block based filesystems. It may be NULL if the caller does not have an open file (e.g. if the kernel is performing a read for itself rather than on behalf of a userspace process with an open file).
If the mapping does not support large folios, the folio will contain a single page. The folio will be locked when read_folio is called. If the read completes successfully, the folio should be marked uptodate. The filesystem should unlock the folio once the read has completed, whether it was successful or not. The filesystem does not need to modify the refcount on the folio; the page cache holds a reference count and that will not be released until the folio is unlocked.
Filesystems may implement ->read_folio() synchronously. In normal operation, folios are read through the ->readahead() method. Only if this fails, or if the caller needs to wait for the read to complete, will the page cache call ->read_folio(). Filesystems should not attempt to perform their own readahead in the ->read_folio() operation.
If the filesystem cannot perform the read at this time, it can unlock the folio, do whatever action it needs to ensure that the read will succeed in the future and return AOP_TRUNCATED_PAGE. In this case, the caller should look up the folio, lock it, and call ->read_folio again.
Callers may invoke the ->read_folio() method directly, but using read_mapping_folio() will take care of locking, waiting for the read to complete and handling cases such as AOP_TRUNCATED_PAGE.
writepages
called by the VM to write out pages associated with the address_space object. If wbc->sync_mode is WB_SYNC_ALL, then the writeback_control will specify a range of pages that must be written out. If it is WB_SYNC_NONE, then a nr_to_write is given and that many pages should be written if possible. If no ->writepages is given, then mpage_writepages is used instead. This will choose pages from the address space that are tagged as DIRTY and will write them back.
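The difference between the two writeback modes described for writepages can be illustrated with a small userspace model. This is only a sketch: `model_wbc`, `model_writepages` and the `MODEL_WB_SYNC_*` names are hypothetical stand-ins, the range is expressed in page indexes rather than the byte offsets the kernel uses, and "writing" a page merely clears its dirty tag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for WB_SYNC_NONE / WB_SYNC_ALL. */
enum model_sync_mode { MODEL_WB_SYNC_NONE, MODEL_WB_SYNC_ALL };

/* Hypothetical, reduced writeback_control. */
struct model_wbc {
    enum model_sync_mode sync_mode;
    long   nr_to_write;   /* budget used for MODEL_WB_SYNC_NONE */
    size_t first, last;   /* inclusive page range used for MODEL_WB_SYNC_ALL */
};

/* Clears the dirty tag of each page it "writes"; returns how many were written. */
long model_writepages(bool *dirty, size_t npages, struct model_wbc *wbc)
{
    long written = 0;
    size_t start = 0, end = npages;

    assert(dirty != NULL && wbc != NULL);
    if (wbc->sync_mode == MODEL_WB_SYNC_ALL) {
        /* Data integrity writeback: the whole range must be written. */
        start = wbc->first;
        if (wbc->last < npages)
            end = wbc->last + 1;
    }
    for (size_t i = start; i < end; i++) {
        if (!dirty[i])
            continue;                       /* only pages tagged dirty */
        if (wbc->sync_mode == MODEL_WB_SYNC_NONE && wbc->nr_to_write <= 0)
            break;                          /* budget exhausted, stop early */
        dirty[i] = false;
        written++;
        wbc->nr_to_write--;
    }
    return written;
}
```

In this model a MODEL_WB_SYNC_NONE call with a budget of two stops after two dirty pages, whereas a MODEL_WB_SYNC_ALL call ignores the budget and cleans every dirty page in the requested range.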
dirty_folio
called by the VM to mark a folio as dirty. This is particularly needed if an address space attaches private data to a folio, and that data needs to be updated when a folio is dirtied. This is called, for example, when a memory mapped page gets modified. If defined, it should set the folio dirty flag, and the PAGECACHE_TAG_DIRTY search mark in i_pages.
readahead
Called by the VM to read pages associated with the address_space object. The pages are consecutive in the page cache and are locked. The implementation should decrement the page refcount after starting I/O on each page. Usually the page will be unlocked by the I/O completion handler. The set of pages is divided into some sync pages followed by some async pages; rac->ra->async_size gives the number of async pages. The filesystem should attempt to read all sync pages but may decide to stop once it reaches the async pages. If it does decide to stop attempting I/O, it can simply return. The caller will remove the remaining pages from the address space, unlock them and decrement the page refcount. Set PageUptodate if the I/O completes successfully.
write_begin
Called by the generic buffered write code to ask the filesystem to prepare to write len bytes at the given offset in the file. The address_space should check that the write will be able to complete, by allocating space if necessary and doing any other internal housekeeping. If the write will update parts of any basic-blocks on storage, then those blocks should be pre-read (if they haven’t been read already) so that the updated blocks can be written out properly.
The filesystem must return the locked pagecache folio for the specified offset, in *foliop, for the caller to write into. It must be able to cope with short writes (where the length passed to write_begin is greater than the number of bytes copied into the folio).
A void * may be returned in fsdata, which then gets passed into write_end.
Returns 0 on success; < 0 on failure (which is the error code), in which case write_end is not called.
write_end
After a successful write_begin, and data copy, write_end must be called. len is the original len passed to write_begin, and copied is the amount that was able to be copied.
The filesystem must take care of unlocking the folio, decrementing its refcount, and updating i_size.
Returns < 0 on failure, otherwise the number of bytes (<= ‘copied’) that were able to be copied into pagecache.
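The pairing of write_begin, data copy and write_end, including the short-write case, can be modelled in userspace as follows. This is a sketch under stated assumptions, not kernel code: `model_file`, `model_write_begin`, `model_write_end` and `model_buffered_write` are hypothetical names, a single fixed-size buffer stands in for the page cache folio, and the `can_copy` parameter artificially limits how much data the copy step manages to transfer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_FOLIO_SIZE 16   /* tiny "folio" so the limits are easy to hit */

/* Hypothetical stand-in for one page cache folio plus the inode size. */
struct model_file {
    char   data[MODEL_FOLIO_SIZE];
    bool   locked;
    int    refcount;
    size_t i_size;
};

/* write_begin: hand back the locked folio; 0 on success, < 0 on failure. */
int model_write_begin(struct model_file *f, size_t pos, size_t len,
                      struct model_file **foliop)
{
    if (pos >= MODEL_FOLIO_SIZE || len > MODEL_FOLIO_SIZE - pos)
        return -1;            /* on failure write_end is never called */
    f->locked = true;
    f->refcount++;
    *foliop = f;
    return 0;
}

/* write_end: update i_size, unlock, drop the reference, report bytes accepted. */
int model_write_end(struct model_file *f, size_t pos, size_t len, size_t copied)
{
    assert(f->locked && copied <= len);
    if (pos + copied > f->i_size)
        f->i_size = pos + copied;
    f->locked = false;
    f->refcount--;
    return (int)copied;
}

/* The caller's side: begin, copy (possibly short), end. */
int model_buffered_write(struct model_file *f, size_t pos,
                         const char *src, size_t len, size_t can_copy)
{
    struct model_file *folio;
    int err = model_write_begin(f, pos, len, &folio);

    if (err < 0)
        return err;
    size_t copied = can_copy < len ? can_copy : len;
    memcpy(folio->data + pos, src, copied);
    return model_write_end(folio, pos, len, copied);
}
```

Note that write_end still receives the original len even when fewer bytes were copied, which is what lets a filesystem tell a short write from a complete one.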
bmap
called by the VFS to map a logical block offset within object to physical block number. This method is used by the FIBMAP ioctl and for working with swap-files. To be able to swap to a file, the file must have a stable mapping to a block device. The swap system does not go through the filesystem but instead uses bmap to find out where the blocks in the file are and uses those addresses directly.
invalidate_folio
If a folio has private data, then invalidate_folio will be called when part or all of the folio is to be removed from the address space. This generally corresponds to either a truncation, punch hole or a complete invalidation of the address space (in the latter case ‘offset’ will always be 0 and ‘length’ will be folio_size()). Any private data associated with the folio should be updated to reflect this truncation. If offset is 0 and length is folio_size(), then the private data should be released, because the folio must be able to be completely discarded. This may be done by calling the ->release_folio function, but in this case the release MUST succeed.
release_folio
release_folio is called on folios with private data to tell the filesystem that the folio is about to be freed. ->release_folio should remove any private data from the folio and clear the private flag. If release_folio() fails, it should return false. release_folio() is used in two distinct though related cases. The first is when the VM wants to free a clean folio with no active users. If ->release_folio succeeds, the folio will be removed from the address_space and be freed.
The second case is when a request has been made to invalidate some or all folios in an address_space. This can happen through the fadvise(POSIX_FADV_DONTNEED) system call or by the filesystem explicitly requesting it as nfs and 9p do (when they believe the cache may be out of date with storage) by calling invalidate_inode_pages2(). If the filesystem makes such a call, and needs to be certain that all folios are invalidated, then its release_folio will need to ensure this. Possibly it can clear the uptodate flag if it cannot free private data yet.
free_folio
free_folio is called once the folio is no longer visible in the page cache in order to allow the cleanup of any private data. Since it may be called by the memory reclaimer, it should not assume that the original address_space mapping still exists, and it should not block.
direct_IO
called by the generic read/write routines to perform direct_IO - that is IO requests which bypass the page cache and transfer data directly between the storage and the application’s address space.
migrate_folio
This is used to compact the physical memory usage. If the VM wants to relocate a folio (maybe from a memory device that is signalling imminent failure) it will pass a new folio and an old folio to this function. migrate_folio should transfer any private data across and update any references that it has to the folio.
launder_folio
Called before freeing a folio - it writes back the dirty folio. To prevent redirtying the folio, it is kept locked during the whole operation.
is_partially_uptodate
Called by the VM when reading a file through the pagecache when the underlying blocksize is smaller than the size of the folio. If the required block is up to date then the read can complete without needing I/O to bring the whole page up to date.
is_dirty_writeback
Called by the VM when attempting to reclaim a folio. The VM uses dirty and writeback information to determine if it needs to stall to allow flushers a chance to complete some IO. Ordinarily it can use folio_test_dirty and folio_test_writeback but some filesystems have more complex state (unstable folios in NFS prevent reclaim) or do not set those flags due to locking problems. This callback allows a filesystem to indicate to the VM if a folio should be treated as dirty or writeback for the purposes of stalling.
error_remove_folio
normally set to generic_error_remove_folio if truncation is ok for this address space. Used for memory failure handling. Setting this implies you deal with pages going away under you, unless you have them locked or reference counts increased.
swap_activate
Called to prepare the given file for swap. It should perform any validation and preparation necessary to ensure that writes can be performed with minimal memory allocation. It should call add_swap_extent(), or the helper iomap_swapfile_activate(), and return the number of extents added. If IO should be submitted through ->swap_rw(), it should set SWP_FS_OPS, otherwise IO will be submitted directly to the block device sis->bdev.
swap_deactivate
Called during swapoff on files where swap_activate was successful.
swap_rw
Called to read or write swap pages when SWP_FS_OPS is set.
The File Object¶
A file object represents a file opened by a process. This is also known as an “open file description” in POSIX parlance.
struct file_operations¶
This describes how the VFS can manipulate an open file. As of kernel 4.18, the following members are defined:
    struct file_operations {
            struct module *owner;
            fop_flags_t fop_flags;
            loff_t (*llseek)(struct file *, loff_t, int);
            ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
            ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
            ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);
            ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);
            int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *, unsigned int flags);
            int (*iterate_shared)(struct file *, struct dir_context *);
            __poll_t (*poll)(struct file *, struct poll_table_struct *);
            long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
            long (*compat_ioctl)(struct file *, unsigned int, unsigned long);
            int (*mmap)(struct file *, struct vm_area_struct *);
            int (*open)(struct inode *, struct file *);
            int (*flush)(struct file *, fl_owner_t id);
            int (*release)(struct inode *, struct file *);
            int (*fsync)(struct file *, loff_t, loff_t, int datasync);
            int (*fasync)(int, struct file *, int);
            int (*lock)(struct file *, int, struct file_lock *);
            unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long,
                                               unsigned long, unsigned long);
            int (*check_flags)(int);
            int (*flock)(struct file *, int, struct file_lock *);
            ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *,
                                    size_t, unsigned int);
            ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *,
                                   size_t, unsigned int);
            void (*splice_eof)(struct file *file);
            int (*setlease)(struct file *, int, struct file_lease **, void **);
            long (*fallocate)(struct file *file, int mode, loff_t offset, loff_t len);
            void (*show_fdinfo)(struct seq_file *m, struct file *f);
    #ifndef CONFIG_MMU
            unsigned (*mmap_capabilities)(struct file *);
    #endif
            ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t,
                                       size_t, unsigned int);
            loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
                                       struct file *file_out, loff_t pos_out,
                                       loff_t len, unsigned int remap_flags);
            int (*fadvise)(struct file *, loff_t, loff_t, int);
            int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
            int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
                                    unsigned int poll_flags);
            int (*mmap_prepare)(struct vm_area_desc *);
    };
Again, all methods are called without any locks being held, unless otherwise noted.
llseek
called when the VFS needs to move the file position index
read
called by read(2) and related system calls
read_iter
possibly asynchronous read with iov_iter as destination
write
called by write(2) and related system calls
write_iter
possibly asynchronous write with iov_iter as source
iopoll
called when aio wants to poll for completions on HIPRI iocbs
iterate_shared
called when the VFS needs to read the directory contents
poll
called by the VFS when a process wants to check if there is activity on this file and (optionally) go to sleep until there is activity. Called by the select(2) and poll(2) system calls
unlocked_ioctl
called by the ioctl(2) system call.
compat_ioctl
called by the ioctl(2) system call when 32 bit system calls are used on 64 bit kernels.
mmap
called by the mmap(2) system call. Deprecated in favour of mmap_prepare.
open
called by the VFS when an inode should be opened. When the VFS opens a file, it creates a new “struct file”. It then calls the open method for the newly allocated file structure. You might think that the open method really belongs in “struct inode_operations”, and you may be right. I think it’s done the way it is because it makes filesystems simpler to implement. The open() method is a good place to initialize the “private_data” member in the file structure if you want to point to a device structure
flush
called by the close(2) system call to flush a file
release
called when the last reference to an open file is closed
fsync
called by the fsync(2) system call. Also see the section above entitled “Handling errors during writeback”.
fasync
called by the fcntl(2) system call when asynchronous (non-blocking) mode is enabled for a file
lock
called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW commands
get_unmapped_area
called by the mmap(2) system call
check_flags
called by the fcntl(2) system call for F_SETFL command
flock
called by the flock(2) system call
splice_write
called by the VFS to splice data from a pipe to a file. This method is used by the splice(2) system call
splice_read
called by the VFS to splice data from file to a pipe. This method is used by the splice(2) system call
setlease
called by the VFS to set or release a file lock lease. setlease implementations should call generic_setlease to record or remove the lease in the inode after setting it.
fallocate
called by the VFS to preallocate blocks or punch a hole.
copy_file_range
called by the copy_file_range(2) system call.
remap_file_range
called by the ioctl(2) system call for FICLONERANGE and FICLONE and FIDEDUPERANGE commands to remap file ranges. An implementation should remap len bytes at pos_in of the source file into the dest file at pos_out. Implementations must handle callers passing in len == 0; this means “remap to the end of the source file”. The return value should be the number of bytes remapped, or the usual negative error code if errors occurred before any bytes were remapped. The remap_flags parameter accepts REMAP_FILE_* flags. If REMAP_FILE_DEDUP is set then the implementation must only remap if the requested file ranges have identical contents. If REMAP_FILE_CAN_SHORTEN is set, the caller is ok with the implementation shortening the request length to satisfy alignment or EOF requirements (or any other reason).
fadvise
possibly called by the fadvise64() system call.
mmap_prepare
Called by the mmap(2) system call. Allows a VFS to set up a file-backed memory mapping, most notably establishing relevant private state and VMA callbacks.
If further action such as pre-population of page tables is required, this can be specified by the vm_area_desc->action field and related parameters.
Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special support routines in the VFS which will locate the required device driver information. These support routines replace the filesystem file operations with those for the device driver, and then proceed to call the new open() method for the file. This is how opening a device file in the filesystem eventually ends up calling the device driver open() method.
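The hand-over from filesystem file operations to device driver file operations can be sketched with a small userspace model. All names here (`model_file`, `model_file_operations`, `model_special_open`, `model_vfs_open` and so on) are hypothetical, only the open() method is modelled, and a plain flag stands in for the inode type check that the kernel performs.

```c
#include <assert.h>
#include <stddef.h>

struct model_file;

/* Hypothetical, reduced file_operations: only open() is modelled. */
struct model_file_operations {
    int (*open)(struct model_file *file);
};

struct model_file {
    const struct model_file_operations *f_op;
    int   is_device_node;   /* stands in for "character or block special" */
    void *private_data;
};

int model_driver_state = 42;   /* pretend per-device structure */

/* The device driver's open(): a natural place to set private_data. */
int model_driver_open(struct model_file *file)
{
    file->private_data = &model_driver_state;
    return 0;
}

const struct model_file_operations model_driver_fops = {
    .open = model_driver_open,
};

/* Support routine: swap in the driver's operations, then call its open(). */
int model_special_open(struct model_file *file)
{
    file->f_op = &model_driver_fops;
    return file->f_op->open ? file->f_op->open(file) : 0;
}

/* The filesystem's own open() for regular files: nothing to do here. */
int model_fs_open(struct model_file *file)
{
    (void)file;
    return 0;
}

const struct model_file_operations model_fs_fops = { .open = model_fs_open };
const struct model_file_operations model_special_fops = { .open = model_special_open };

/* Open-time dispatch: pick the initial table, then call its open(). */
int model_vfs_open(struct model_file *file)
{
    assert(file != NULL);
    file->f_op = file->is_device_node ? &model_special_fops : &model_fs_fops;
    return file->f_op->open(file);
}
```

After model_vfs_open() returns for a device node, every later call through file->f_op reaches the driver's table, which mirrors why read(2) or ioctl(2) on a device file end up in the driver.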
Directory Entry Cache (dcache)¶
struct dentry_operations¶
This describes how a filesystem can overload the standard dentry operations. Dentries and the dcache are the domain of the VFS and the individual filesystem implementations. Device drivers have no business here. These methods may be set to NULL, as they are either optional or the VFS uses a default. As of kernel 2.6.22, the following members are defined:
    struct dentry_operations {
            int (*d_revalidate)(struct inode *, const struct qstr *,
                                struct dentry *, unsigned int);
            int (*d_weak_revalidate)(struct dentry *, unsigned int);
            int (*d_hash)(const struct dentry *, struct qstr *);
            int (*d_compare)(const struct dentry *,
                             unsigned int, const char *, const struct qstr *);
            int (*d_delete)(const struct dentry *);
            int (*d_init)(struct dentry *);
            void (*d_release)(struct dentry *);
            void (*d_iput)(struct dentry *, struct inode *);
            char *(*d_dname)(struct dentry *, char *, int);
            struct vfsmount *(*d_automount)(struct path *);
            int (*d_manage)(const struct path *, bool);
            struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
            bool (*d_unalias_trylock)(const struct dentry *);
            void (*d_unalias_unlock)(const struct dentry *);
    };
d_revalidate
called when the VFS needs to revalidate a dentry. This is called whenever a name look-up finds a dentry in the dcache. Most local filesystems leave this as NULL, because all their dentries in the dcache are valid. Network filesystems are different since things can change on the server without the client necessarily being aware of it.
This function should return a positive value if the dentry is still valid, and zero or a negative error code if it isn’t.
d_revalidate may be called in rcu-walk mode (flags & LOOKUP_RCU). If in rcu-walk mode, the filesystem must revalidate the dentry without blocking or storing to the dentry; d_parent and d_inode should not be used without care (because they can change and, in d_inode case, even become NULL under us).
If a situation is encountered that rcu-walk cannot handle, return -ECHILD and it will be called again in ref-walk mode.
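The fallback from rcu-walk to ref-walk can be modelled in userspace as follows. This is a sketch, not kernel code: `model_dentry`, `model_d_revalidate`, `model_revalidate_with_fallback` and the value of `MODEL_LOOKUP_RCU` are hypothetical, and "asking the server" is reduced to reading a flag and counting the call.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_LOOKUP_RCU 0x1   /* hypothetical flag value, not the kernel's */

struct model_dentry {
    bool cached_valid;       /* validity known without asking the server */
    bool needs_server;       /* revalidation would have to block */
    bool server_says_valid;
    int  blocking_calls;     /* how often we had to "block" */
};

/* > 0: still valid, 0: invalid, -ECHILD: call me again in ref-walk mode. */
int model_d_revalidate(struct model_dentry *d, unsigned int flags)
{
    if (!d->needs_server)
        return d->cached_valid ? 1 : 0;
    if (flags & MODEL_LOOKUP_RCU)
        return -ECHILD;            /* blocking is not allowed in rcu-walk */
    d->blocking_calls++;           /* ref-walk: blocking is allowed */
    return d->server_says_valid ? 1 : 0;
}

/* Caller side: try rcu-walk first, fall back to ref-walk on -ECHILD. */
int model_revalidate_with_fallback(struct model_dentry *d)
{
    int ret = model_d_revalidate(d, MODEL_LOOKUP_RCU);

    if (ret == -ECHILD)
        ret = model_d_revalidate(d, 0);
    assert(ret != -ECHILD);        /* ref-walk must give a definite answer */
    return ret;
}
```

A dentry whose validity is known locally is answered in the first pass; only the one that needs the server pays for the second, blocking call.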
d_weak_revalidate
called when the VFS needs to revalidate a “jumped” dentry. This is called when a path-walk ends at a dentry that was not acquired by doing a lookup in the parent directory. This includes “/”, “.” and “..”, as well as procfs-style symlinks and mountpoint traversal.
In this case, we are less concerned with whether the dentry is still fully correct, but rather that the inode is still valid. As with d_revalidate, most local filesystems will set this to NULL since their dcache entries are always valid.
This function has the same return code semantics as d_revalidate.
d_weak_revalidate is only called after leaving rcu-walk mode.
d_hash
called when the VFS adds a dentry to the hash table. The first dentry passed to d_hash is the parent directory that the name is to be hashed into.
Same locking and synchronisation rules as d_compare regarding what is safe to dereference etc.
d_compare
called to compare a dentry name with a given name. The first dentry is the parent of the dentry to be compared, the second is the child dentry. len and name string are properties of the dentry to be compared. qstr is the name to compare it with.
Must be constant and idempotent, and should not take locks if possible, and should not store into the dentry. Should not dereference pointers outside the dentry without lots of care (e.g. d_parent, d_inode, d_name should not be used).
However, our vfsmount is pinned, and RCU held, so the dentries and inodes won’t disappear, neither will our sb or filesystem module. ->d_sb may be used.
It is a tricky calling convention because it needs to be called under “rcu-walk”, i.e. without any locks or references on things.
d_delete
called when the last reference to a dentry is dropped and the dcache is deciding whether or not to cache it. Return 1 to delete immediately, or 0 to cache the dentry. Default is NULL which means to always cache a reachable dentry. d_delete must be constant and idempotent.
d_init
called when a dentry is allocated
d_release
called when a dentry is really deallocated
d_iput
called when a dentry loses its inode (just prior to its being deallocated). The default when this is NULL is that the VFS calls iput(). If you define this method, you must call iput() yourself
d_dname
called when the pathname of a dentry should be generated. Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay pathname generation. (Instead of doing it when dentry is created, it’s done only when the path is needed.) Real filesystems probably don’t want to use it, because their dentries are present in global dcache hash, so their hash should be an invariant. As no lock is held, d_dname() should not try to modify the dentry itself, unless appropriate SMP safety is used.
CAUTION: d_path() logic is quite tricky. The correct way to return for example “Hello” is to put it at the end of the buffer, and return a pointer to the first char. dynamic_dname() helper function is provided to take care of this.
Example:
    static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
    {
            return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]",
                                 dentry->d_inode->i_ino);
    }
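The "put it at the end of the buffer and return a pointer to the first char" convention can be tried out in userspace. The helper below is a hypothetical stand-in written for illustration, not the kernel's dynamic_dname(): its name, the 64-byte scratch buffer and the NULL return for names that do not fit are all assumptions of this sketch.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a name, place it (with its terminating NUL) at the END of
 * buffer, and return a pointer to its first character.
 * Returns NULL if the formatted name does not fit.
 */
char *model_dynamic_name(char *buffer, int buflen, const char *fmt, ...)
{
    char temp[64];
    va_list args;
    int len;

    assert(buffer != NULL && buflen >= 0);

    va_start(args, fmt);
    len = vsnprintf(temp, sizeof(temp), fmt, args);
    va_end(args);

    if (len < 0 || (size_t)len >= sizeof(temp) || len + 1 > buflen)
        return NULL;

    /* The name occupies the last len + 1 bytes of the buffer. */
    char *start = buffer + buflen - (len + 1);
    memcpy(start, temp, (size_t)len + 1);
    return start;
}
```

The returned pointer generally differs from buffer itself, so a caller has to use the return value rather than assume the name starts at the beginning of the buffer.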
d_automount
called when an automount dentry is to be traversed (optional). This should create a new VFS mount record and return the record to the caller. The caller is supplied with a path parameter giving the automount directory to describe the automount target and the parent VFS mount record to provide inheritable mount parameters. NULL should be returned if someone else managed to make the automount first. If the vfsmount creation failed, then an error code should be returned. If -EISDIR is returned, then the directory will be treated as an ordinary directory and returned to pathwalk to continue walking.
If a vfsmount is returned, the caller will attempt to mount it on the mountpoint and will remove the vfsmount from its expiration list in the case of failure.
This function is only used if DCACHE_NEED_AUTOMOUNT is set on the dentry. This is set by __d_instantiate() if S_AUTOMOUNT is set on the inode being added.
d_manage
called to allow the filesystem to manage the transition from a dentry (optional). This allows autofs, for example, to hold up clients waiting to explore behind a ‘mountpoint’ while letting the daemon go past and construct the subtree there. 0 should be returned to let the calling process continue. -EISDIR can be returned to tell pathwalk to use this directory as an ordinary directory and to ignore anything mounted on it and not to check the automount flag. Any other error code will abort pathwalk completely.
If the ‘rcu_walk’ parameter is true, then the caller is doing a pathwalk in RCU-walk mode. Sleeping is not permitted in this mode, and the caller can be asked to leave it and call again by returning -ECHILD. -EISDIR may also be returned to tell pathwalk to ignore d_automount or any mounts.
This function is only used if DCACHE_MANAGE_TRANSIT is set on the dentry being transited from.
d_real
overlay/union type filesystems implement this method to return one of the underlying dentries of a regular file hidden by the overlay. The ‘type’ argument takes the values D_REAL_DATA or D_REAL_METADATA for returning the real underlying dentry that refers to the inode hosting the file’s data or metadata respectively.
For non-regular files, the ‘dentry’ argument is returned.
d_unalias_trylock
if present, will be called by d_splice_alias() before moving a preexisting attached alias. Returning false prevents __d_move(), making d_splice_alias() fail with -ESTALE.
Rationale: setting FS_RENAME_DOES_D_MOVE will prevent d_move() and d_exchange() calls from the outside of filesystem methods; however, it does not guarantee that attached dentries won’t be renamed or moved by d_splice_alias() finding a preexisting alias for a directory inode. Normally we would not care; however, something that wants to stabilize the entire path to root over a blocking operation might need that. See 9p for one (and hopefully only) example.
d_unalias_unlock
should be paired with d_unalias_trylock; that one is called after __d_move() call in __d_unalias().
Each dentry has a pointer to its parent dentry, as well as a hash list of child dentries. Child dentries are basically like files in a directory.
Directory Entry Cache API¶
There are a number of functions defined which permit a filesystem to manipulate dentries:
dget
open a new handle for an existing dentry (this just increments the usage count)
dput
close a handle for a dentry (decrements the usage count). If the usage count drops to 0, and the dentry is still in its parent’s hash, the “d_delete” method is called to check whether it should be cached. If it should not be cached, or if the dentry is not hashed, it is deleted. Otherwise cached dentries are put into an LRU list to be reclaimed on memory shortage.
d_drop
this unhashes a dentry from its parent’s hash list. A subsequent call to dput() will deallocate the dentry if its usage count drops to 0
d_delete
delete a dentry. If there are no other open references to the dentry then the dentry is turned into a negative dentry (the d_iput() method is called). If there are other references, then d_drop() is called instead
d_add
add a dentry to its parent’s hash list and then calls d_instantiate()
d_instantiate
add a dentry to the alias hash list for the inode and updates the “d_inode” member. The “i_count” member in the inode structure should be set/incremented. If the inode pointer is NULL, the dentry is called a “negative dentry”. This function is commonly called when an inode is created for an existing negative dentry
d_lookup
look up a dentry given its parent and path name component. It looks up the child of that given name from the dcache hash table. If it is found, the reference count is incremented and the dentry is returned. The caller must use dput() to free the dentry when it finishes using it.
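The interplay of dget, dput, d_drop and the d_delete method described above can be sketched as a userspace reference-counting model. Everything here is hypothetical (`model_dentry`, `model_dget`, `model_dput`, `model_d_drop`, `model_never_cache`); the LRU list and deallocation are reduced to a state field so that the decision taken by the final dput is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_state { MODEL_IN_USE, MODEL_ON_LRU, MODEL_FREED };

struct model_dentry {
    int  count;                                     /* usage count */
    bool hashed;                                    /* still in the parent's hash */
    int  (*d_delete)(const struct model_dentry *);  /* optional, may be NULL */
    enum model_state state;
};

/* Open a new handle: just increments the usage count. */
struct model_dentry *model_dget(struct model_dentry *d)
{
    d->count++;
    d->state = MODEL_IN_USE;
    return d;
}

/* Unhash the dentry from its parent's hash list. */
void model_d_drop(struct model_dentry *d)
{
    d->hashed = false;
}

/* Close a handle; the last dput decides between caching and freeing. */
void model_dput(struct model_dentry *d)
{
    assert(d->count > 0);
    if (--d->count > 0)
        return;
    if (!d->hashed || (d->d_delete && d->d_delete(d)))
        d->state = MODEL_FREED;     /* not cacheable: deallocate */
    else
        d->state = MODEL_ON_LRU;    /* keep until memory runs short */
}

/* A d_delete method that never wants its dentries cached. */
int model_never_cache(const struct model_dentry *d)
{
    (void)d;
    return 1;   /* 1 = delete immediately, 0 = cache */
}
```

With a NULL d_delete method a hashed dentry ends up on the LRU list after its last dput, while an unhashed dentry or one whose d_delete returns 1 is freed straight away.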
Mount Options¶
Parsing options¶
On mount and remount the filesystem is passed a string containing a comma separated list of mount options. The options can have either of these forms:
    option
    option=value
The <linux/parser.h> header defines an API that helps parse these options. There are plenty of examples on how to use it in existing filesystems.
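To make the two option forms concrete, here is a userspace sketch that splits such a string into name and value pairs. It deliberately does not use the <linux/parser.h> API, which is only available inside the kernel; `model_opt`, `model_parse_options` and `MODEL_MAX_OPTS` are hypothetical names, and the input string is modified in place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_OPTS 8

struct model_opt {
    const char *name;
    const char *value;   /* NULL for the bare "option" form */
};

/*
 * Split a writable, comma separated option string in place.
 * Returns the number of options stored, or -1 if there are more than max.
 */
int model_parse_options(char *options, struct model_opt *out, int max)
{
    int n = 0;

    assert(out != NULL && max > 0);
    while (options != NULL && *options != '\0') {
        char *next = strchr(options, ',');

        if (next)
            *next++ = '\0';             /* terminate this field */
        if (*options != '\0') {         /* skip empty fields such as ",," */
            if (n == max)
                return -1;
            char *eq = strchr(options, '=');
            if (eq)
                *eq++ = '\0';           /* terminate the name, point at the value */
            out[n].name = options;
            out[n].value = eq;
            n++;
        }
        options = next;
    }
    return n;
}
```

For the input "ro,uid=1000,,noatime" this yields three entries: "ro" and "noatime" with a NULL value, and "uid" with the value "1000".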
Showing options¶
If a filesystem accepts mount options, it must define show_options() to show all the currently active options. The rules are:
- options MUST be shown which are not default or their values differ from the default
- options MAY be shown which are enabled by default or have their default value
Options used only internally between a mount helper and the kernel (such as file descriptors), or which only have an effect during the mounting (such as ones controlling the creation of a journal) are exempt from the above rules.
The underlying reason for the above rules is to make sure that a mount can be accurately replicated (e.g. umounting and mounting again) based on the information found in /proc/mounts.
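The MUST rule above can be sketched in userspace: only options that differ from their defaults are emitted. The option set (`noatime`, `commit`), the default value and the names `model_mount_opts` and `model_show_options` are invented for this illustration, and a plain character buffer stands in for the seq_file that a real show_options() writes to. Each option is emitted with a leading comma on the assumption that the output is appended to text that is already present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MODEL_DEFAULT_COMMIT 5u   /* hypothetical default value */

struct model_mount_opts {
    bool     noatime;      /* default: false */
    unsigned commit_secs;  /* default: MODEL_DEFAULT_COMMIT */
};

/*
 * Write ",name" or ",name=value" for every option that differs from its
 * default. Returns the number of characters the output needs.
 */
int model_show_options(char *buf, size_t size, const struct model_mount_opts *o)
{
    int len = 0;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';
    if (o->noatime)
        len += snprintf(buf + len, size - (size_t)len, ",noatime");
    if (o->commit_secs != MODEL_DEFAULT_COMMIT && (size_t)len < size)
        len += snprintf(buf + len, size - (size_t)len, ",commit=%u",
                        o->commit_secs);
    return len;
}
```

Feeding the emitted string back into a mount request reproduces the same non-default settings, which is the replication property the rules aim for.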
Resources¶
(Note some of these resources are not up-to-date with the latest kernel version.)
- Creating Linux virtual filesystems. 2002
- The Linux Virtual File-system Layer by Neil Brown. 1999
<http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
- A tour of the Linux VFS by Michael K. Johnson. 1996
<https://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
- A small trail through the Linux kernel by Andries Brouwer. 2001