Overview of the Linux Virtual File System

Original author: Richard Gooch <rgooch@atnf.csiro.au>

  • Copyright (C) 1999 Richard Gooch

  • Copyright (C) 2005 Pekka Enberg

Introduction

The Virtual File System (also known as the Virtual Filesystem Switch) is the software layer in the kernel that provides the filesystem interface to userspace programs. It also provides an abstraction within the kernel which allows different filesystem implementations to coexist.

VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so on are called from a process context. Filesystem locking is described in the document Locking.

Directory Entry Cache (dcache)

The VFS implements the open(2), stat(2), chmod(2), and similar system calls. The pathname argument that is passed to them is used by the VFS to search through the directory entry cache (also known as the dentry cache or dcache). This provides a very fast look-up mechanism to translate a pathname (filename) into a specific dentry. Dentries live in RAM and are never saved to disc: they exist only for performance.

The dentry cache is meant to be a view into your entire filespace. As most computers cannot fit all dentries in the RAM at the same time, some bits of the cache are missing. In order to resolve your pathname into a dentry, the VFS may have to resort to creating dentries along the way, and then loading the inode. This is done by looking up the inode.

The Inode Object

An individual dentry usually has a pointer to an inode. Inodes are filesystem objects such as regular files, directories, FIFOs and other beasts. They live either on the disc (for block device filesystems) or in the memory (for pseudo filesystems). Inodes that live on the disc are copied into the memory when required and changes to the inode are written back to disc. A single inode can be pointed to by multiple dentries (hard links, for example, do this).

To look up an inode requires that the VFS calls the lookup() method of the parent directory inode. This method is installed by the specific filesystem implementation that the inode lives in. Once the VFS has the required dentry (and hence the inode), we can do all those boring things like open(2) the file, or stat(2) it to peek at the inode data. The stat(2) operation is fairly simple: once the VFS has the dentry, it peeks at the inode data and passes some of it back to userspace.

The File Object

Opening a file requires another operation: allocation of a file structure (this is the kernel-side implementation of file descriptors). The freshly allocated file structure is initialized with a pointer to the dentry and a set of file operation member functions. These are taken from the inode data. The open() file method is then called so the specific filesystem implementation can do its work. You can see that this is another switch performed by the VFS. The file structure is placed into the file descriptor table for the process.

Reading, writing and closing files (and other assorted VFS operations) is done by using the userspace file descriptor to grab the appropriate file structure, and then calling the required file structure method to do whatever is required. For as long as the file is open, it keeps the dentry in use, which in turn means that the VFS inode is still in use.
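The dispatch described above (descriptor, then file structure, then filesystem-supplied method) can be sketched in userspace C. This is only a simplified model under stated assumptions: every name prefixed with toy_ is hypothetical and not part of the kernel API, and the "file contents" are just a string held by the inode.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Toy model of the VFS "switch": all toy_ names are hypothetical. */
struct toy_file;

struct toy_file_operations {
    int (*open)(struct toy_file *f);
    ssize_t (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_inode {
    const struct toy_file_operations *i_fop; /* installed by the filesystem */
    const char *data;                        /* stand-in for file contents */
};

struct toy_dentry {
    struct toy_inode *d_inode;
};

struct toy_file {
    struct toy_dentry *f_dentry;
    const struct toy_file_operations *f_op;
    size_t f_pos;
};

#define TOY_MAX_FDS 8
static struct toy_file toy_files[TOY_MAX_FDS];
static struct toy_file *toy_fd_table[TOY_MAX_FDS];

/* A filesystem-specific read method: copy from the inode's data. */
static ssize_t toy_mem_read(struct toy_file *f, char *buf, size_t len)
{
    const char *data = f->f_dentry->d_inode->data;
    size_t size = strlen(data);
    size_t left = f->f_pos < size ? size - f->f_pos : 0;

    if (len > left)
        len = left;
    memcpy(buf, data + f->f_pos, len);
    f->f_pos += len;
    return (ssize_t)len;
}

static int toy_mem_open(struct toy_file *f)
{
    f->f_pos = 0;
    return 0;
}

static const struct toy_file_operations toy_mem_fops = {
    .open = toy_mem_open,
    .read = toy_mem_read,
};

/* "open": set up a file, take the methods from the inode, call ->open(). */
static int toy_open(struct toy_dentry *dentry)
{
    for (int fd = 0; fd < TOY_MAX_FDS; fd++) {
        if (toy_fd_table[fd])
            continue;
        struct toy_file *f = &toy_files[fd];
        f->f_dentry = dentry;
        f->f_op = dentry->d_inode->i_fop;
        if (f->f_op->open && f->f_op->open(f) != 0)
            return -1;
        toy_fd_table[fd] = f;   /* place it in the descriptor table */
        return fd;
    }
    return -1;
}

/* "read": map the descriptor to its file, then call the file's method. */
static ssize_t toy_read(int fd, char *buf, size_t len)
{
    if (fd < 0 || fd >= TOY_MAX_FDS || !toy_fd_table[fd])
        return -1;
    return toy_fd_table[fd]->f_op->read(toy_fd_table[fd], buf, len);
}
```

The point of the sketch is that toy_read() never knows which filesystem it is talking to; it only follows the method pointer that was copied from the inode at open time.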

Registering and Mounting a Filesystem

To register and unregister a filesystem, use the following API functions:

    #include <linux/fs.h>

    extern int register_filesystem(struct file_system_type *);
    extern int unregister_filesystem(struct file_system_type *);

The passed struct file_system_type describes your filesystem. When a request is made to mount a filesystem onto a directory in your namespace, the VFS will call the appropriate mount() method for the specific filesystem. New vfsmount referring to the tree returned by ->mount() will be attached to the mountpoint, so that when pathname resolution reaches the mountpoint it will jump into the root of that vfsmount.

You can see all filesystems that are registered to the kernel in the file /proc/filesystems.
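As a rough illustration of what registration amounts to, the following userspace sketch keeps a singly linked list of filesystem types keyed by name. It is a model only: the toy_ names are hypothetical, there is no locking, and the specific error codes returned are an assumption of this sketch rather than something stated in this document.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Userspace model; toy_ names are hypothetical, not the kernel's code. */
struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next; /* must be NULL before registration */
};

static struct toy_fs_type *toy_fs_list;

/* Return the link pointing at the entry named `name`, or the tail link. */
static struct toy_fs_type **toy_find(const char *name)
{
    struct toy_fs_type **p;

    for (p = &toy_fs_list; *p; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

static int toy_register_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    if (fs->next)
        return -EBUSY;          /* already linked into a list */
    p = toy_find(fs->name);
    if (*p)
        return -EBUSY;          /* a type with this name exists */
    *p = fs;                    /* append at the tail */
    return 0;
}

static int toy_unregister_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    for (p = &toy_fs_list; *p; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;      /* unlink the entry */
            fs->next = NULL;
            return 0;
        }
    }
    return -EINVAL;             /* was never registered */
}
```

Walking toy_fs_list and printing each name would be this model's counterpart of reading /proc/filesystems.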

struct file_system_type

This describes the filesystem. The following members are defined:

    struct file_system_type {
            const char *name;
            int fs_flags;
            int (*init_fs_context)(struct fs_context *);
            const struct fs_parameter_spec *parameters;
            struct dentry *(*mount)(struct file_system_type *, int,
                                    const char *, void *);
            void (*kill_sb)(struct super_block *);
            struct module *owner;
            struct file_system_type *next;
            struct hlist_head fs_supers;

            struct lock_class_key s_lock_key;
            struct lock_class_key s_umount_key;
            struct lock_class_key s_vfs_rename_key;
            struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];

            struct lock_class_key i_lock_key;
            struct lock_class_key i_mutex_key;
            struct lock_class_key invalidate_lock_key;
            struct lock_class_key i_mutex_dir_key;
    };

name

the name of the filesystem type, such as “ext2”, “iso9660”, “msdos” and so on

fs_flags

various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)

init_fs_context

Initializes ‘struct fs_context’ ->ops and ->fs_private fields with filesystem-specific data.

parameters

Pointer to the array of filesystem parameters descriptors ‘struct fs_parameter_spec’. More info in Filesystem Mount API.

mount

the method to call when a new instance of this filesystem should be mounted

kill_sb

the method to call when an instance of this filesystem should be shut down

owner

for internal VFS use: you should initialize this to THIS_MODULE in most cases.

next

for internal VFS use: you should initialize this to NULL

fs_supers

for internal VFS use: hlist of filesystem instances (superblocks)

s_lock_key, s_umount_key, s_vfs_rename_key, s_writers_key, i_lock_key, i_mutex_key, invalidate_lock_key, i_mutex_dir_key: lockdep-specific

The mount() method has the following arguments:

struct file_system_type *fs_type

describes the filesystem, partly initialized by the specific filesystem code

int flags

mount flags

const char *dev_name

the device name we are mounting.

void *data

arbitrary mount options, usually comes as an ASCII string (see “Mount Options” section)

The mount() method must return the root dentry of the tree requested by caller. An active reference to its superblock must be grabbed and the superblock must be locked. On failure it should return ERR_PTR(error).

The arguments match those of mount(2) and their interpretation depends on filesystem type. E.g. for block filesystems, dev_name is interpreted as block device name, that device is opened and if it contains a suitable filesystem image the method creates and initializes struct super_block accordingly, returning its root dentry to caller.

->mount() may choose to return a subtree of existing filesystem - it doesn’t have to create a new one. The main result from the caller’s point of view is a reference to dentry at the root of (sub)tree to be attached; creation of new superblock is a common side effect.

The most interesting member of the superblock structure that the mount() method fills in is the “s_op” field. This is a pointer to a “struct super_operations” which describes the next level of the filesystem implementation.

For more information on mounting (and the new mount API), see Filesystem Mount API.

The Superblock Object

A superblock object represents a mounted filesystem.

struct super_operations

This describes how the VFS can manipulate the superblock of your filesystem. The following members are defined:

    struct super_operations {
            struct inode *(*alloc_inode)(struct super_block *sb);
            void (*destroy_inode)(struct inode *);
            void (*free_inode)(struct inode *);

            void (*dirty_inode)(struct inode *, int flags);
            int (*write_inode)(struct inode *, struct writeback_control *wbc);
            int (*drop_inode)(struct inode *);
            void (*evict_inode)(struct inode *);
            void (*put_super)(struct super_block *);
            int (*sync_fs)(struct super_block *sb, int wait);
            int (*freeze_super)(struct super_block *sb, enum freeze_holder who);
            int (*freeze_fs)(struct super_block *);
            int (*thaw_super)(struct super_block *sb, enum freeze_holder who);
            int (*unfreeze_fs)(struct super_block *);
            int (*statfs)(struct dentry *, struct kstatfs *);
            int (*remount_fs)(struct super_block *, int *, char *);
            void (*umount_begin)(struct super_block *);

            int (*show_options)(struct seq_file *, struct dentry *);
            int (*show_devname)(struct seq_file *, struct dentry *);
            int (*show_path)(struct seq_file *, struct dentry *);
            int (*show_stats)(struct seq_file *, struct dentry *);

            ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
            ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
            struct dquot **(*get_dquots)(struct inode *);

            long (*nr_cached_objects)(struct super_block *, struct shrink_control *);
            long (*free_cached_objects)(struct super_block *, struct shrink_control *);
    };

All methods are called without any locks being held, unless otherwise noted. This means that most methods can block safely. All methods are only called from a process context (i.e. not from an interrupt handler or bottom half).

alloc_inode

this method is called by alloc_inode() to allocate memory for struct inode and initialize it. If this function is not defined, a simple ‘struct inode’ is allocated. Normally alloc_inode will be used to allocate a larger structure which contains a ‘struct inode’ embedded within it.

destroy_inode

this method is called by destroy_inode() to release resources allocated for struct inode. It is only required if ->alloc_inode was defined and simply undoes anything done by ->alloc_inode.

free_inode

this method is called from RCU callback. If you use call_rcu() in ->destroy_inode to free ‘struct inode’ memory, then it’s better to release memory in this method.

dirty_inode

this method is called by the VFS when an inode is marked dirty. This is specifically for the inode itself being marked dirty, not its data. If the update needs to be persisted by fdatasync(), then I_DIRTY_DATASYNC will be set in the flags argument. I_DIRTY_TIME will be set in the flags in case lazytime is enabled and struct inode has times updated since the last ->dirty_inode call.

write_inode

this method is called when the VFS needs to write an inode to disc. The second parameter indicates whether the write should be synchronous or not; not all filesystems check this flag.

drop_inode

called when the last access to the inode is dropped, with the inode->i_lock spinlock held.

This method should be either NULL (normal UNIX filesystem semantics) or “inode_just_drop” (for filesystems that do not want to cache inodes - causing “delete_inode” to always be called regardless of the value of i_nlink)

The “inode_just_drop()” behavior is equivalent to the old practice of using “force_delete” in the put_inode() case, but does not have the races that the “force_delete()” approach had.

evict_inode

called when the VFS wants to evict an inode. Caller does not evict the pagecache or inode-associated metadata buffers; the method has to use truncate_inode_pages_final() to get rid of those. Caller makes sure async writeback cannot be running for the inode while (or after) ->evict_inode() is called. Optional.

put_super

called when the VFS wishes to free the superblock (i.e. unmount). This is called with the superblock lock held

sync_fs

called when VFS is writing out all dirty data associated with a superblock. The second parameter indicates whether the method should wait until the write out has been completed. Optional.

freeze_super

Called instead of ->freeze_fs callback if provided. Main difference is that ->freeze_super is called without taking down_write(&sb->s_umount). If filesystem implements it and wants ->freeze_fs to be called too, then it has to call ->freeze_fs explicitly from this callback. Optional.

freeze_fs

called when VFS is locking a filesystem and forcing it into a consistent state. This method is currently used by the Logical Volume Manager (LVM) and ioctl(FIFREEZE). Optional.

thaw_super

called when VFS is unlocking a filesystem and making it writable again after ->freeze_super. Optional.

unfreeze_fs

called when VFS is unlocking a filesystem and making it writable again after ->freeze_fs. Optional.

statfs

called when the VFS needs to get filesystem statistics.

remount_fs

called when the filesystem is remounted. This is called with the kernel lock held

umount_begin

called when the VFS is unmounting a filesystem.

show_options

called by the VFS to show mount options for /proc/<pid>/mounts and /proc/<pid>/mountinfo. (see “Mount Options” section)

show_devname

Optional. Called by the VFS to show device name for /proc/<pid>/{mounts,mountinfo,mountstats}. If not provided then ‘(struct mount).mnt_devname’ will be used.

show_path

Optional. Called by the VFS (for /proc/<pid>/mountinfo) to show the mount root dentry path relative to the filesystem root.

show_stats

Optional. Called by the VFS (for /proc/<pid>/mountstats) to show filesystem-specific mount statistics.

quota_read

called by the VFS to read from filesystem quota file.

quota_write

called by the VFS to write to filesystem quota file.

get_dquots

called by quota to get ‘struct dquot’ array for a particular inode. Optional.

nr_cached_objects

called by the sb cache shrinking function for the filesystem to return the number of freeable cached objects it contains. Optional.

free_cached_objects

called by the sb cache shrinking function for the filesystem to scan the number of objects indicated to try to free them. Optional, but any filesystem implementing this method needs to also implement ->nr_cached_objects for it to be called correctly.

We can’t do anything with any errors that the filesystem might have encountered, hence the void return type. This will never be called if the VM is trying to reclaim under GFP_NOFS conditions, hence this method does not need to handle that situation itself.

Implementations must include conditional reschedule calls inside any scanning loop that is done. This allows the VFS to determine appropriate scan batch sizes without having to worry about whether implementations will cause holdoff problems due to large scan batch sizes.
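The ->alloc_inode description above mentions allocating a larger structure with a ‘struct inode’ embedded in it. The following userspace sketch shows that embedding pattern and how the containing structure is recovered from the embedded member. All toy_ names, the first_block field and its value are hypothetical, and calloc()/free() stand in for whatever allocator a real filesystem would use.

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace model; every toy_ name is hypothetical. */
struct toy_inode {
    unsigned long i_ino;
};

/* The filesystem's private inode embeds the generic one. */
struct toy_fs_inode {
    unsigned int first_block;   /* filesystem-specific state */
    struct toy_inode vfs_inode;
};

/* Recover the containing structure from a pointer to the embedded member. */
static struct toy_fs_inode *TOY_I(struct toy_inode *inode)
{
    return (struct toy_fs_inode *)((char *)inode -
                                   offsetof(struct toy_fs_inode, vfs_inode));
}

/* Counterpart of ->alloc_inode: allocate the larger structure and
 * hand back only the embedded generic part. */
static struct toy_inode *toy_alloc_inode(void)
{
    struct toy_fs_inode *fi = calloc(1, sizeof(*fi));

    if (!fi)
        return NULL;
    fi->first_block = 42;       /* arbitrary illustrative value */
    return &fi->vfs_inode;
}

/* Counterpart of ->destroy_inode/->free_inode: undo toy_alloc_inode(). */
static void toy_free_inode(struct toy_inode *inode)
{
    free(TOY_I(inode));
}
```

Generic code only ever sees the struct toy_inode pointer; the filesystem converts back with TOY_I() whenever it needs its private fields.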

Whoever sets up the inode is responsible for filling in the “i_op” field. This is a pointer to a “struct inode_operations” which describes the methods that can be performed on individual inodes.

struct xattr_handler

On filesystems that support extended attributes (xattrs), the s_xattr superblock field points to a NULL-terminated array of xattr handlers. Extended attributes are name:value pairs.

name

Indicates that the handler matches attributes with the specified name (such as “system.posix_acl_access”); the prefix field must be NULL.

prefix

Indicates that the handler matches all attributes with the specified name prefix (such as “user.”); the name field must be NULL.

list

Determine if attributes matching this xattr handler should be listed for a particular dentry. Used by some listxattr implementations like generic_listxattr.

get

Called by the VFS to get the value of a particular extended attribute. This method is called by the getxattr(2) system call.

set

Called by the VFS to set the value of a particular extended attribute. When the new value is NULL, called to remove a particular extended attribute. This method is called by the setxattr(2) and removexattr(2) system calls.

When none of the xattr handlers of a filesystem match the specified attribute name or when a filesystem doesn’t support extended attributes, the various *xattr(2) system calls return -EOPNOTSUPP.
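The name/prefix matching rule and the -EOPNOTSUPP fallback can be sketched as follows. This is a userspace model, not the kernel's resolver: the toy_ names are hypothetical, the id field merely stands in for the get/set methods, and a plain leading-substring comparison is assumed for prefix matching.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Userspace model; toy_ names are hypothetical. */
struct toy_xattr_handler {
    const char *name;    /* exact match; prefix must be NULL */
    const char *prefix;  /* prefix match; name must be NULL */
    int id;              /* stand-in for the get/set methods */
};

/* Walk a NULL-terminated handler array and return the first match. */
static const struct toy_xattr_handler *
toy_xattr_resolve(const struct toy_xattr_handler *const *handlers,
                  const char *attr)
{
    if (!handlers)
        return NULL;    /* filesystem has no xattr support */
    for (; *handlers; handlers++) {
        const struct toy_xattr_handler *h = *handlers;

        if (h->name && strcmp(attr, h->name) == 0)
            return h;
        if (h->prefix && strncmp(attr, h->prefix, strlen(h->prefix)) == 0)
            return h;
    }
    return NULL;
}

/* Syscall-level result in this model: 0 on a match, else -EOPNOTSUPP. */
static int toy_getxattr(const struct toy_xattr_handler *const *handlers,
                        const char *attr)
{
    return toy_xattr_resolve(handlers, attr) ? 0 : -EOPNOTSUPP;
}
```

With one handler named “system.posix_acl_access” and one with prefix “user.”, an attribute such as “user.comment” resolves to the second handler, while “trusted.foo” matches nothing and yields -EOPNOTSUPP.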

The Inode Object

An inode object represents an object within the filesystem.

struct inode_operations

This describes how the VFS can manipulate an inode in your filesystem. The following members are defined:

    struct inode_operations {
            int (*create)(struct mnt_idmap *, struct inode *, struct dentry *, umode_t, bool);
            struct dentry *(*lookup)(struct inode *, struct dentry *, unsigned int);
            int (*link)(struct dentry *, struct inode *, struct dentry *);
            int (*unlink)(struct inode *, struct dentry *);
            int (*symlink)(struct mnt_idmap *, struct inode *, struct dentry *, const char *);
            struct dentry *(*mkdir)(struct mnt_idmap *, struct inode *, struct dentry *, umode_t);
            int (*rmdir)(struct inode *, struct dentry *);
            int (*mknod)(struct mnt_idmap *, struct inode *, struct dentry *, umode_t, dev_t);
            int (*rename)(struct mnt_idmap *, struct inode *, struct dentry *,
                          struct inode *, struct dentry *, unsigned int);
            int (*readlink)(struct dentry *, char __user *, int);
            const char *(*get_link)(struct dentry *, struct inode *,
                                    struct delayed_call *);
            int (*permission)(struct mnt_idmap *, struct inode *, int);
            struct posix_acl *(*get_inode_acl)(struct inode *, int, bool);
            int (*setattr)(struct mnt_idmap *, struct dentry *, struct iattr *);
            int (*getattr)(struct mnt_idmap *, const struct path *, struct kstat *, u32, unsigned int);
            ssize_t (*listxattr)(struct dentry *, char *, size_t);
            void (*update_time)(struct inode *, struct timespec *, int);
            int (*atomic_open)(struct inode *, struct dentry *, struct file *,
                               unsigned open_flag, umode_t create_mode);
            int (*tmpfile)(struct mnt_idmap *, struct inode *, struct file *, umode_t);
            struct posix_acl *(*get_acl)(struct mnt_idmap *, struct dentry *, int);
            int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
            int (*fileattr_set)(struct mnt_idmap *idmap,
                                struct dentry *dentry, struct file_kattr *fa);
            int (*fileattr_get)(struct dentry *dentry, struct file_kattr *fa);
            struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
    };

Again, all methods are called without any locks being held, unless otherwise noted.

create

called by the open(2) and creat(2) system calls. Only required if you want to support regular files. The dentry you get should not have an inode (i.e. it should be a negative dentry). Here you will probably call d_instantiate() with the dentry and the newly created inode

lookup

called when the VFS needs to look up an inode in a parent directory. The name to look for is found in the dentry. This method must call d_add() to insert the found inode into the dentry. The “i_count” field in the inode structure should be incremented. If the named inode does not exist a NULL inode should be inserted into the dentry (this is called a negative dentry). Returning an error code from this routine must only be done on a real error, otherwise creating inodes with system calls like creat(2), mknod(2), mkdir(2) and so on will fail. If you wish to overload the dentry methods then you should initialise the “d_op” field in the dentry; this is a pointer to a struct “dentry_operations”. This method is called with the directory inode semaphore held

link

called by the link(2) system call. Only required if you want to support hard links. You will probably need to call d_instantiate() just as you would in the create() method

unlink

called by the unlink(2) system call. Only required if you want to support deleting inodes

symlink

called by the symlink(2) system call. Only required if you want to support symlinks. You will probably need to call d_instantiate() just as you would in the create() method

mkdir

called by the mkdir(2) system call. Only required if you want to support creating subdirectories. You will probably need to call d_instantiate_new() just as you would in the create() method.

If d_instantiate_new() is not used and if the fh_to_dentry() export operation is provided, or if the storage might be accessible by another path (e.g. with a network filesystem) then more care may be needed. Importantly d_instantiate() should not be used with an inode that is no longer I_NEW if there is any chance that the inode could already be attached to a dentry. This is because of a hard rule in the VFS that a directory must only ever have one dentry.

For example, if an NFS filesystem is mounted twice the new directory could be visible on the other mount before it is on the original mount, and a pair of name_to_handle_at(), open_by_handle_at() calls could instantiate the directory inode with an IS_ROOT() dentry before the first mkdir returns.

If there is any chance this could happen, then the new inode should be d_drop()ed and attached with d_splice_alias(). The returned dentry (if any) should be returned by ->mkdir().

rmdir

called by the rmdir(2) system call. Only required if you want to support deleting subdirectories

mknod

called by the mknod(2) system call to create a device (char, block) inode or a named pipe (FIFO) or socket. Only required if you want to support creating these types of inodes. You will probably need to call d_instantiate() just as you would in the create() method

rename

called by the rename(2) system call to rename the object to have the parent and name given by the second inode and dentry.

The filesystem must return -EINVAL for any unsupported or unknown flags. Currently the following flags are implemented:

(1) RENAME_NOREPLACE: this flag indicates that if the target of the rename exists the rename should fail with -EEXIST instead of replacing the target. The VFS already checks for existence, so for local filesystems the RENAME_NOREPLACE implementation is equivalent to plain rename.

(2) RENAME_EXCHANGE: exchange source and target. Both must exist; this is checked by the VFS. Unlike plain rename, source and target may be of different type.

get_link

called by the VFS to follow a symbolic link to the inode it points to. Only required if you want to support symbolic links. This method returns the symlink body to traverse (and possibly resets the current position with nd_jump_link()). If the body won’t go away until the inode is gone, nothing else is needed; if it needs to be otherwise pinned, arrange for its release by having get_link(..., ..., done) do set_delayed_call(done, destructor, argument). In that case destructor(argument) will be called once VFS is done with the body you’ve returned. May be called in RCU mode; that is indicated by NULL dentry argument. If request can’t be handled without leaving RCU mode, have it return ERR_PTR(-ECHILD).

If the filesystem stores the symlink target in ->i_link, the VFS may use it directly without calling ->get_link(); however, ->get_link() must still be provided. ->i_link must not be freed until after an RCU grace period. Writing to ->i_link post-iget() time requires a ‘release’ memory barrier.

readlink

this is now just an override for use by readlink(2) for the cases when ->get_link uses nd_jump_link() or object is not in fact a symlink. Normally filesystems should only implement ->get_link for symlinks and readlink(2) will automatically use that.

permission

called by the VFS to check for access rights on a POSIX-like filesystem.

May be called in rcu-walk mode (mask & MAY_NOT_BLOCK). If in rcu-walk mode, the filesystem must check the permission without blocking or storing to the inode.

If a situation is encountered that rcu-walk cannot handle, return -ECHILD and it will be called again in ref-walk mode.

setattr

called by the VFS to set attributes for a file. This method is called by chmod(2) and related system calls.

getattr

called by the VFS to get attributes of a file. This method is called by stat(2) and related system calls.

listxattr

called by the VFS to list all extended attributes for a given file. This method is called by the listxattr(2) system call.

update_time

called by the VFS to update a specific time or the i_version of an inode. If this is not defined the VFS will update the inode itself and call mark_inode_dirty_sync.

atomic_open

called on the last component of an open. Using this optional method the filesystem can look up, possibly create and open the file in one atomic operation. If it wants to leave actual opening to the caller (e.g. if the file turned out to be a symlink, device, or just something filesystem won’t do atomic open for), it may signal this by returning finish_no_open(file, dentry). This method is only called if the last component is negative or needs lookup. Cached positive dentries are still handled by f_op->open(). If the file was created, FMODE_CREATED flag should be set in file->f_mode. In case of O_EXCL the method must only succeed if the file didn’t exist and hence FMODE_CREATED shall always be set on success.

tmpfile

called in the end of O_TMPFILE open(). Optional, equivalent to atomically creating, opening and unlinking a file in given directory. On success needs to return with the file already open; this can be done by calling finish_open_simple() right at the end.

fileattr_get

called on ioctl(FS_IOC_GETFLAGS) and ioctl(FS_IOC_FSGETXATTR) to retrieve miscellaneous file flags and attributes. Also called before the relevant SET operation to check what is being changed (in this case with i_rwsem locked exclusive). If unset, then fall back to f_op->ioctl().

fileattr_set

called on ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) to change miscellaneous file flags and attributes. Callers hold i_rwsem exclusive. If unset, then fall back to f_op->ioctl().

get_offset_ctx

called to get the offset context for a directory inode. A filesystem must define this operation to use simple_offset_dir_operations.
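The flag rules given for ->rename in the list above can be condensed into a small decision function. The sketch below is a userspace model that folds the VFS-side existence checks and the filesystem-side flag check into one place for illustration. The TOY_ constants are redefined locally so the block stands alone, and returning -ENOENT when RENAME_EXCHANGE finds no target is an assumption of this sketch; the text only says that both must exist.

```c
#include <errno.h>

/* Local stand-ins for the rename flags; hypothetical names. */
#define TOY_RENAME_NOREPLACE (1u << 0)
#define TOY_RENAME_EXCHANGE  (1u << 1)

/*
 * Decide the outcome of a rename request given its flags and whether
 * the target already exists. Returns 0 if the rename may proceed.
 */
static int toy_rename_check(unsigned int flags, int target_exists)
{
    if (flags & ~(TOY_RENAME_NOREPLACE | TOY_RENAME_EXCHANGE))
        return -EINVAL;     /* unsupported or unknown flag */
    if ((flags & TOY_RENAME_NOREPLACE) && target_exists)
        return -EEXIST;     /* refuse to replace the target */
    if ((flags & TOY_RENAME_EXCHANGE) && !target_exists)
        return -ENOENT;     /* assumption: error for a missing target */
    return 0;
}
```

A filesystem that supports neither flag would, in this model, shrink the accepted mask to zero so that any non-zero flags value yields -EINVAL.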

The Address Space Object

The address space object is used to group and manage pages in the page cache. It can be used to keep track of the pages in a file (or anything else) and also track the mapping of sections of the file into process address spaces.

There are a number of distinct yet related services that an address-space can provide. These include communicating memory pressure, page lookup by address, and keeping track of pages tagged as Dirty or Writeback.

The first can be used independently of the others. The VM can try to release clean pages in order to reuse them. To do this it can call ->release_folio on clean folios with the private flag set. Clean pages without PagePrivate and with no external references will be released without notice being given to the address_space.

To achieve this functionality, pages need to be placed on an LRU with lru_cache_add and mark_page_active needs to be called whenever the page is used.

Pages are normally kept in a radix tree indexed by ->index. This tree maintains information about the PG_Dirty and PG_Writeback status of each page, so that pages with either of these flags can be found quickly.

The Dirty tag is primarily used by mpage_writepages - the default ->writepages method. It uses the tag to find dirty pages to write back. If mpage_writepages is not used (i.e. the address_space provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is almost unused. write_inode_now and sync_inode do use it (through __sync_single_inode) to check if ->writepages has been successful in writing out the whole address_space.

The Writeback tag is used by filemap*wait* and sync_page* functions, via filemap_fdatawait_range, to wait for all writeback to complete.

An address_space handler may attach extra information to a page, typically using the ‘private’ field in the ‘struct page’. If such information is attached, the PG_Private flag should be set. This will cause various VM routines to make extra calls into the address_space handler to deal with that data.

An address space acts as an intermediate between storage and application. Data is read into the address space a whole page at a time, and provided to the application either by copying of the page, or by memory-mapping the page. Data is written into the address space by the application, and then written-back to storage typically in whole pages, however the address_space has finer control of write sizes.

The read process essentially only requires ‘read_folio’. The write process is more complicated and uses write_begin/write_end or dirty_folio to write data into the address_space, and writepages to writeback data to storage.

Removing pages from an address_space requires holding the inode’s i_rwsem exclusively, while adding pages to the address_space requires holding the inode’s i_mapping->invalidate_lock exclusively.

When data is written to a page, the PG_Dirty flag should be set. It typically remains set until writepages asks for it to be written. This should clear PG_Dirty and set PG_Writeback. It can be actually written at any point after PG_Dirty is clear. Once it is known to be safe, PG_Writeback is cleared.

Writeback makes use of a writeback_control structure to direct the operations. This gives the writepages operation some information about the nature of and reason for the writeback request, and the constraints under which it is being done. It is also used to return information back to the caller about the result of a writepages request.
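The PG_Dirty and PG_Writeback life cycle described above can be pictured as a tiny state machine. The sketch below is a userspace model with hypothetical toy_ names; it uses two booleans in place of the real page flags and ignores locking and the page cache tags entirely.

```c
#include <stdbool.h>

/* Userspace model of the two flags; toy_ names are hypothetical. */
struct toy_page {
    bool dirty;      /* stands in for PG_Dirty */
    bool writeback;  /* stands in for PG_Writeback */
};

/* Data was written into the page: mark it dirty. */
static void toy_write_to_page(struct toy_page *p)
{
    p->dirty = true;
}

/*
 * Writeback picks the page up: clear dirty, set writeback.
 * Returns false if there was nothing to write.
 */
static bool toy_start_writeback(struct toy_page *p)
{
    if (!p->dirty)
        return false;
    p->dirty = false;
    p->writeback = true;
    return true;
}

/* The data is known to be safe on storage: clear writeback. */
static void toy_end_writeback(struct toy_page *p)
{
    p->writeback = false;
}
```

Because the dirty flag is cleared before the I/O is issued, a write that arrives while writeback is still in flight simply sets dirty again, and the page is picked up on the next pass.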

Handling errors during writeback

Most applications that do buffered I/O will periodically call a file synchronization call (fsync, fdatasync, msync or sync_file_range) to ensure that data written has made it to the backing store. When there is an error during writeback, they expect that error to be reported when a file sync request is made. After an error has been reported on one request, subsequent requests on the same file descriptor should return 0, unless further writeback errors have occurred since the previous file synchronization.

Ideally, the kernel would report errors only on file descriptions on which writes were done that subsequently failed to be written back. The generic pagecache infrastructure does not track the file descriptions that have dirtied each individual page however, so determining which file descriptors should get back an error is not possible.

Instead, the generic writeback error tracking infrastructure in the kernel settles for reporting errors to fsync on all file descriptions that were open at the time that the error occurred. In a situation with multiple writers, all of them will get back an error on a subsequent fsync, even if all of the writes done through that particular file descriptor succeeded (or even if there were no writes on that file descriptor at all).

Filesystems that wish to use this infrastructure should call mapping_set_error to record the error in the address_space when it occurs. Then, after writing back data from the pagecache in their file->fsync operation, they should call file_check_and_advance_wb_err to ensure that the struct file’s error cursor has advanced to the correct point in the stream of errors emitted by the backing device(s).
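The "error cursor" idea can be illustrated with a userspace sketch: the mapping keeps the latest error plus a sequence counter, and each open file remembers the sequence it last saw. This is a deliberately simplified model that follows the behaviour as described in this section. The toy_ names are hypothetical, and the kernel's actual encoding of the error and counter is more compact and differs in detail.

```c
#include <errno.h>

/* Userspace model; toy_ names are hypothetical. */
struct toy_mapping {
    int wb_err;           /* most recent writeback error (negative errno) */
    unsigned int wb_seq;  /* bumped each time an error is recorded */
};

struct toy_wb_file {
    struct toy_mapping *f_mapping;
    unsigned int f_wb_seq; /* error cursor: last sequence this file saw */
};

/* Counterpart of mapping_set_error(): record an error in the mapping. */
static void toy_mapping_set_error(struct toy_mapping *m, int err)
{
    if (err) {
        m->wb_err = err;
        m->wb_seq++;
    }
}

/* At open time the file samples the current position in the error stream. */
static void toy_file_open(struct toy_wb_file *f, struct toy_mapping *m)
{
    f->f_mapping = m;
    f->f_wb_seq = m->wb_seq;
}

/*
 * Counterpart of file_check_and_advance_wb_err(): report an error the
 * file has not seen yet, then advance the cursor so it is reported once.
 */
static int toy_file_check_and_advance(struct toy_wb_file *f)
{
    struct toy_mapping *m = f->f_mapping;

    if (f->f_wb_seq == m->wb_seq)
        return 0;
    f->f_wb_seq = m->wb_seq;
    return m->wb_err;
}
```

In this model two files that were open when -EIO was recorded each get -EIO from their next check and 0 from the one after, regardless of which of them did the failing write.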

struct address_space_operations

This describes how the VFS can manipulate mapping of a file to page cache in your filesystem. The following members are defined:

    struct address_space_operations {
            int (*read_folio)(struct file *, struct folio *);
            int (*writepages)(struct address_space *, struct writeback_control *);
            bool (*dirty_folio)(struct address_space *, struct folio *);
            void (*readahead)(struct readahead_control *);
            int (*write_begin)(const struct kiocb *, struct address_space *mapping,
                               loff_t pos, unsigned len,
                               struct page **pagep, void **fsdata);
            int (*write_end)(const struct kiocb *, struct address_space *mapping,
                             loff_t pos, unsigned len, unsigned copied,
                             struct folio *folio, void *fsdata);
            sector_t (*bmap)(struct address_space *, sector_t);
            void (*invalidate_folio)(struct folio *, size_t start, size_t len);
            bool (*release_folio)(struct folio *, gfp_t);
            void (*free_folio)(struct folio *);
            ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
            int (*migrate_folio)(struct mapping *, struct folio *dst,
                                 struct folio *src, enum migrate_mode);
            int (*launder_folio)(struct folio *);

            bool (*is_partially_uptodate)(struct folio *, size_t from,
                                          size_t count);
            void (*is_dirty_writeback)(struct folio *, bool *, bool *);
            int (*error_remove_folio)(struct mapping *mapping, struct folio *);
            int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span);
            int (*swap_deactivate)(struct file *);
            int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
    };

read_folio

Called by the page cache to read a folio from the backing store. The ‘file’ argument supplies authentication information to network filesystems, and is generally not used by block based filesystems. It may be NULL if the caller does not have an open file (e.g. if the kernel is performing a read for itself rather than on behalf of a userspace process with an open file).

If the mapping does not support large folios, the folio will contain a single page. The folio will be locked when read_folio is called. If the read completes successfully, the folio should be marked uptodate. The filesystem should unlock the folio once the read has completed, whether it was successful or not. The filesystem does not need to modify the refcount on the folio; the page cache holds a reference count and that will not be released until the folio is unlocked.

Filesystems may implement ->read_folio() synchronously. In normal operation, folios are read through the ->readahead() method. Only if this fails, or if the caller needs to wait for the read to complete, will the page cache call ->read_folio(). Filesystems should not attempt to perform their own readahead in the ->read_folio() operation.

If the filesystem cannot perform the read at this time, it can unlock the folio, do whatever action it needs to ensure that the read will succeed in the future and return AOP_TRUNCATED_PAGE. In this case, the caller should look up the folio, lock it, and call ->read_folio() again.

Callers may invoke the ->read_folio() method directly, but using read_mapping_folio() will take care of locking, waiting for the read to complete and handling cases such as AOP_TRUNCATED_PAGE.
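
The locking contract described above can be illustrated with a small userspace model. This is only a sketch: struct mock_folio and mock_read_folio() are hypothetical stand-ins invented for illustration, not kernel types or functions, and the backing store is modelled as a plain string.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for struct folio; not the kernel type. */
struct mock_folio {
	bool locked;
	bool uptodate;
	char data[16];
};

/*
 * Mock ->read_folio(): the folio arrives locked, is marked uptodate only
 * if the read succeeded, and is unlocked in both cases.
 * 'backing' models the backing store; NULL simulates an I/O error.
 */
int mock_read_folio(struct mock_folio *folio, const char *backing)
{
	int err = 0;

	if (!folio->locked)
		return -EINVAL;	/* caller violated the locking contract */

	if (backing) {
		snprintf(folio->data, sizeof(folio->data), "%s", backing);
		folio->uptodate = true;
	} else {
		err = -EIO;
	}

	folio->locked = false;	/* unlock whether or not the read worked */
	return err;
}
```

Note that the mock never touches a reference count, mirroring the statement that the page cache holds the reference until the folio is unlocked.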

writepages

called by the VM to write out pages associated with the address_space object. If wbc->sync_mode is WB_SYNC_ALL, then the writeback_control will specify a range of pages that must be written out. If it is WB_SYNC_NONE, then a nr_to_write is given and that many pages should be written if possible. If no ->writepages is given, then mpage_writepages is used instead. This will choose pages from the address space that are tagged as DIRTY and will write them back.

dirty_folio

called by the VM to mark a folio as dirty. This is particularly needed if an address space attaches private data to a folio, and that data needs to be updated when a folio is dirtied. This is called, for example, when a memory mapped page gets modified. If defined, it should set the folio dirty flag, and the PAGECACHE_TAG_DIRTY search mark in i_pages.

readahead

Called by the VM to read pages associated with the address_space object. The pages are consecutive in the page cache and are locked. The implementation should decrement the page refcount after starting I/O on each page. Usually the page will be unlocked by the I/O completion handler. The set of pages is divided into some sync pages followed by some async pages; rac->ra->async_size gives the number of async pages. The filesystem should attempt to read all sync pages but may decide to stop once it reaches the async pages. If it does decide to stop attempting I/O, it can simply return. The caller will remove the remaining pages from the address space, unlock them and decrement the page refcount. Set PageUptodate if the I/O completes successfully.
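
The sync/async split can be pictured with a few lines of userspace C. This is a rough sketch under stated assumptions: struct mock_readahead and mock_readahead() are hypothetical, and the "congested" flag is merely one invented reason a filesystem might stop at the async boundary.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a readahead request; not struct readahead_control. */
struct mock_readahead {
	unsigned int nr_pages;		/* pages in this request */
	unsigned int async_size;	/* trailing, purely speculative pages */
};

/*
 * Returns how many pages the mock filesystem starts I/O on.
 * All sync pages are always attempted; the async pages are attempted
 * only when the (assumed) 'congested' condition is false. Whatever is
 * left over is for the caller to remove, unlock and release.
 */
unsigned int mock_readahead(const struct mock_readahead *rac, bool congested)
{
	unsigned int sync_pages = 0;

	if (rac->async_size < rac->nr_pages)
		sync_pages = rac->nr_pages - rac->async_size;

	return congested ? sync_pages : rac->nr_pages;
}
```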

write_begin

Called by the generic buffered write code to ask the filesystem to prepare to write len bytes at the given offset in the file. The address_space should check that the write will be able to complete, by allocating space if necessary and doing any other internal housekeeping. If the write will update parts of any basic-blocks on storage, then those blocks should be pre-read (if they haven’t been read already) so that the updated blocks can be written out properly.

The filesystem must return the locked pagecache folio for the specified offset, in *foliop, for the caller to write into.

It must be able to cope with short writes (where the length passed to write_begin is greater than the number of bytes copied into the folio).

A void * may be returned in fsdata, which then gets passed into write_end.

Returns 0 on success; < 0 on failure (which is the error code), in which case write_end is not called.

write_end

After a successful write_begin, and data copy, write_end must be called. len is the original len passed to write_begin, and copied is the amount that was able to be copied.

The filesystem must take care of unlocking the folio, decrementing its refcount, and updating i_size.

Returns < 0 on failure, otherwise the number of bytes (<= ‘copied’) that were able to be copied into pagecache.
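
The write_begin/write_end pairing, including a short write, can be sketched in userspace as follows. All names here (mock_folio, mock_inode, mock_write_begin, mock_write_end, MOCK_FOLIO_SIZE) are hypothetical, and the file is assumed to consist of a single fixed-size folio to keep the model small.

```c
#include <errno.h>
#include <stdbool.h>

#define MOCK_FOLIO_SIZE 64	/* assumed size of the single mock folio */

/* Hypothetical stand-ins; not the kernel's struct folio / struct inode. */
struct mock_folio {
	bool locked;
	int refcount;
	char data[MOCK_FOLIO_SIZE];
};

struct mock_inode {
	long i_size;
	struct mock_folio folio;
};

/* Mock ->write_begin(): check the write can complete, return a locked folio. */
int mock_write_begin(struct mock_inode *inode, long pos, unsigned len,
		     struct mock_folio **foliop)
{
	if (pos < 0 || pos + (long)len > MOCK_FOLIO_SIZE)
		return -ENOSPC;	/* on failure write_end is never called */

	inode->folio.locked = true;
	inode->folio.refcount++;
	*foliop = &inode->folio;
	return 0;
}

/*
 * Mock ->write_end(): 'copied' may be smaller than 'len' (a short write).
 * Unlock the folio, drop the reference and extend i_size if needed.
 */
int mock_write_end(struct mock_inode *inode, long pos, unsigned len,
		   unsigned copied, struct mock_folio *folio)
{
	(void)len;	/* the original length is not needed by this mock */

	if (pos + (long)copied > inode->i_size)
		inode->i_size = pos + (long)copied;

	folio->locked = false;
	folio->refcount--;
	return (int)copied;
}
```

A caller in this model would copy data into the returned folio between the two calls and pass the number of bytes it actually managed to copy as "copied".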

bmap

called by the VFS to map a logical block offset within the object to a physical block number. This method is used by the FIBMAP ioctl and for working with swap-files. To be able to swap to a file, the file must have a stable mapping to a block device. The swap system does not go through the filesystem but instead uses bmap to find out where the blocks in the file are and uses those addresses directly.

invalidate_folio

If a folio has private data, then invalidate_folio will be called when part or all of the folio is to be removed from the address space. This generally corresponds to either a truncation, punch hole or a complete invalidation of the address space (in the latter case ‘offset’ will always be 0 and ‘length’ will be folio_size()). Any private data associated with the folio should be updated to reflect this truncation. If offset is 0 and length is folio_size(), then the private data should be released, because the folio must be able to be completely discarded. This may be done by calling the ->release_folio function, but in this case the release MUST succeed.
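
The distinction between a partial and a complete invalidation can be modelled as below. This is a simplified sketch: struct mock_folio, its valid_bytes field and mock_invalidate_folio() are hypothetical, and the partial case only models truncation (not hole punching in the middle of a folio).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio carrying filesystem-private state. */
struct mock_folio {
	size_t size;		/* plays the role of folio_size() */
	bool has_private;	/* private data is attached */
	size_t valid_bytes;	/* invented private state: bytes still tracked */
};

void mock_invalidate_folio(struct mock_folio *folio, size_t offset,
			   size_t length)
{
	if (!folio->has_private)
		return;

	if (offset == 0 && length == folio->size) {
		/* Whole folio is going away: private data must be released. */
		folio->has_private = false;
		folio->valid_bytes = 0;
		return;
	}

	/* Partial invalidation: update private data to reflect truncation. */
	if (offset < folio->valid_bytes)
		folio->valid_bytes = offset;
}
```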

release_folio

release_folio is called on folios with private data to tell the filesystem that the folio is about to be freed. ->release_folio should remove any private data from the folio and clear the private flag. If release_folio() fails, it should return false. release_folio() is used in two distinct though related cases. The first is when the VM wants to free a clean folio with no active users. If ->release_folio succeeds, the folio will be removed from the address_space and be freed.

The second case is when a request has been made to invalidate some or all folios in an address_space. This can happen through the fadvise(POSIX_FADV_DONTNEED) system call or by the filesystem explicitly requesting it as nfs and 9p do (when they believe the cache may be out of date with storage) by calling invalidate_inode_pages2(). If the filesystem makes such a call, and needs to be certain that all folios are invalidated, then its release_folio will need to ensure this. Possibly it can clear the uptodate flag if it cannot free private data yet.

free_folio

free_folio is called once the folio is no longer visible in the page cache in order to allow the cleanup of any private data. Since it may be called by the memory reclaimer, it should not assume that the original address_space mapping still exists, and it should not block.

direct_IO

called by the generic read/write routines to perform direct_IO - that is IO requests which bypass the page cache and transfer data directly between the storage and the application’s address space.

migrate_folio

This is used to compact the physical memory usage. If the VM wants to relocate a folio (maybe from a memory device that is signalling imminent failure) it will pass a new folio and an old folio to this function. migrate_folio should transfer any private data across and update any references that it has to the folio.

launder_folio

Called before freeing a folio - it writes back the dirty folio. To prevent redirtying the folio, it is kept locked during the whole operation.

is_partially_uptodate

Called by the VM when reading a file through the pagecache when the underlying blocksize is smaller than the size of the folio. If the required block is up to date then the read can complete without needing I/O to bring the whole page up to date.

is_dirty_writeback

Called by the VM when attempting to reclaim a folio. The VM uses dirty and writeback information to determine if it needs to stall to allow flushers a chance to complete some IO. Ordinarily it can use folio_test_dirty and folio_test_writeback but some filesystems have more complex state (unstable folios in NFS prevent reclaim) or do not set those flags due to locking problems. This callback allows a filesystem to indicate to the VM if a folio should be treated as dirty or writeback for the purposes of stalling.

error_remove_folio

normally set to generic_error_remove_folio if truncation is ok for this address space. Used for memory failure handling. Setting this implies you deal with pages going away under you, unless you have them locked or reference counts increased.

swap_activate

Called to prepare the given file for swap. It should perform any validation and preparation necessary to ensure that writes can be performed with minimal memory allocation. It should call add_swap_extent(), or the helper iomap_swapfile_activate(), and return the number of extents added. If IO should be submitted through ->swap_rw(), it should set SWP_FS_OPS, otherwise IO will be submitted directly to the block device sis->bdev.

swap_deactivate

Called during swapoff on files where swap_activate was successful.

swap_rw

Called to read or write swap pages when SWP_FS_OPS is set.

The File Object

A file object represents a file opened by a process. This is also known as an “open file description” in POSIX parlance.

struct file_operations

This describes how the VFS can manipulate an open file. As of kernel 4.18, the following members are defined:

    struct file_operations {
        struct module *owner;
        fop_flags_t fop_flags;
        loff_t (*llseek)(struct file *, loff_t, int);
        ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);
        ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);
        int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *,
                      unsigned int flags);
        int (*iterate_shared)(struct file *, struct dir_context *);
        __poll_t (*poll)(struct file *, struct poll_table_struct *);
        long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
        long (*compat_ioctl)(struct file *, unsigned int, unsigned long);
        int (*mmap)(struct file *, struct vm_area_struct *);
        int (*open)(struct inode *, struct file *);
        int (*flush)(struct file *, fl_owner_t id);
        int (*release)(struct inode *, struct file *);
        int (*fsync)(struct file *, loff_t, loff_t, int datasync);
        int (*fasync)(int, struct file *, int);
        int (*lock)(struct file *, int, struct file_lock *);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long,
                                           unsigned long, unsigned long,
                                           unsigned long);
        int (*check_flags)(int);
        int (*flock)(struct file *, int, struct file_lock *);
        ssize_t (*splice_write)(struct pipe_inode_info *, struct file *,
                                loff_t *, size_t, unsigned int);
        ssize_t (*splice_read)(struct file *, loff_t *,
                               struct pipe_inode_info *, size_t, unsigned int);
        void (*splice_eof)(struct file *file);
        int (*setlease)(struct file *, int, struct file_lease **, void **);
        long (*fallocate)(struct file *file, int mode, loff_t offset,
                          loff_t len);
        void (*show_fdinfo)(struct seq_file *m, struct file *f);
    #ifndef CONFIG_MMU
        unsigned (*mmap_capabilities)(struct file *);
    #endif
        ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
                                   loff_t, size_t, unsigned int);
        loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
                                   struct file *file_out, loff_t pos_out,
                                   loff_t len, unsigned int remap_flags);
        int (*fadvise)(struct file *, loff_t, loff_t, int);
        int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
        int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
                                unsigned int poll_flags);
        int (*mmap_prepare)(struct vm_area_desc *);
    };

Again, all methods are called without any locks being held, unless otherwise noted.

llseek

called when the VFS needs to move the file position index

read

called by read(2) and related system calls

read_iter

possibly asynchronous read with iov_iter as destination

write

called by write(2) and related system calls

write_iter

possibly asynchronous write with iov_iter as source

iopoll

called when aio wants to poll for completions on HIPRI iocbs

iterate_shared

called when the VFS needs to read the directory contents

poll

called by the VFS when a process wants to check if there is activity on this file and (optionally) go to sleep until there is activity. Called by the select(2) and poll(2) system calls.

unlocked_ioctl

called by the ioctl(2) system call.

compat_ioctl

called by the ioctl(2) system call when 32 bit system calls are used on 64 bit kernels.

mmap

called by the mmap(2) system call. Deprecated in favour of mmap_prepare.

open

called by the VFS when an inode should be opened. When the VFS opens a file, it creates a new “struct file”. It then calls the open method for the newly allocated file structure. You might think that the open method really belongs in “struct inode_operations”, and you may be right. I think it’s done the way it is because it makes filesystems simpler to implement. The open() method is a good place to initialize the “private_data” member in the file structure if you want to point to a device structure.

flush

called by the close(2) system call to flush a file

release

called when the last reference to an open file is closed

fsync

called by the fsync(2) system call. Also see the section above entitled “Handling errors during writeback”.

fasync

called by the fcntl(2) system call when asynchronous (non-blocking) mode is enabled for a file

lock

called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW commands

get_unmapped_area

called by the mmap(2) system call

check_flags

called by the fcntl(2) system call for F_SETFL command

flock

called by the flock(2) system call

splice_write

called by the VFS to splice data from a pipe to a file. This method is used by the splice(2) system call

splice_read

called by the VFS to splice data from a file to a pipe. This method is used by the splice(2) system call

setlease

called by the VFS to set or release a file lock lease. setlease implementations should call generic_setlease to record or remove the lease in the inode after setting it.

fallocate

called by the VFS to preallocate blocks or punch a hole.

copy_file_range

called by the copy_file_range(2) system call.

remap_file_range

called by the ioctl(2) system call for FICLONERANGE and FICLONE and FIDEDUPERANGE commands to remap file ranges. An implementation should remap len bytes at pos_in of the source file into the dest file at pos_out. Implementations must handle callers passing in len == 0; this means “remap to the end of the source file”. The return value should be the number of bytes remapped, or the usual negative error code if errors occurred before any bytes were remapped. The remap_flags parameter accepts REMAP_FILE_* flags. If REMAP_FILE_DEDUP is set then the implementation must only remap if the requested file ranges have identical contents. If REMAP_FILE_CAN_SHORTEN is set, the caller is ok with the implementation shortening the request length to satisfy alignment or EOF requirements (or any other reason).
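
The length rules above (len == 0 and optional shortening) can be sketched as a small helper. This is an illustrative policy only, not the kernel's validation logic: mock_remap_len() and MOCK_REMAP_CAN_SHORTEN are hypothetical names, the flag value is invented, and rejecting an over-long request with -EINVAL is an assumption made for the example.

```c
#include <errno.h>

#define MOCK_REMAP_CAN_SHORTEN 0x1u	/* hypothetical value, not the kernel's */

/*
 * Work out how many bytes a mock remap would cover.
 * Returns the effective length, or a negative error code.
 */
long long mock_remap_len(long long src_size, long long pos_in,
			 long long len, unsigned int flags)
{
	long long avail;

	if (pos_in < 0 || len < 0 || pos_in > src_size)
		return -EINVAL;

	avail = src_size - pos_in;
	if (len == 0)
		return avail;	/* "remap to the end of the source file" */
	if (len <= avail)
		return len;
	if (flags & MOCK_REMAP_CAN_SHORTEN)
		return avail;	/* caller accepts a shortened request */
	return -EINVAL;		/* assumed policy for this sketch */
}
```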

fadvise

possibly called by the fadvise64() system call.

mmap_prepare

Called by the mmap(2) system call. Allows a VFS to set up a file-backed memory mapping, most notably establishing relevant private state and VMA callbacks.

If further action such as pre-population of page tables is required, this can be specified by the vm_area_desc->action field and related parameters.

Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special support routines in the VFS which will locate the required device driver information. These support routines replace the filesystem file operations with those for the device driver, and then proceed to call the new open() method for the file. This is how opening a device file in the filesystem eventually ends up calling the device driver open() method.
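
The operation-table swap described in the previous paragraph can be modelled in userspace with function pointers. This is a sketch only: every mock_* name is hypothetical, and the driver lookup (which the kernel performs from the device number) is reduced here to a single hard-wired table.

```c
#include <stddef.h>

struct mock_file;

/* Hypothetical, cut-down stand-in for struct file_operations. */
struct mock_file_operations {
	int (*open)(struct mock_file *);
};

struct mock_file {
	const struct mock_file_operations *f_op;
	int opened_by_driver;	/* set by the mock driver's open() */
};

/* The device driver's own open() method. */
int mock_driver_open(struct mock_file *f)
{
	f->opened_by_driver = 1;
	return 0;
}

const struct mock_file_operations mock_driver_fops = {
	.open = mock_driver_open,
};

/*
 * Support routine used for device nodes: replace the file operations
 * with the driver's, then call the new open() method.
 */
int mock_special_open(struct mock_file *f)
{
	f->f_op = &mock_driver_fops;	/* assumed result of the driver lookup */
	if (f->f_op->open)
		return f->f_op->open(f);
	return 0;
}

/* What the filesystem installs on a device node before it is opened. */
const struct mock_file_operations mock_special_fops = {
	.open = mock_special_open,
};

/* The VFS side: call whatever open() the file currently has. */
int mock_vfs_open(struct mock_file *f)
{
	return f->f_op->open ? f->f_op->open(f) : 0;
}
```

After mock_vfs_open() returns, the file's f_op points at the driver's table, so later operations on the file would go straight to the driver.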

Directory Entry Cache (dcache)

struct dentry_operations

This describes how a filesystem can overload the standard dentry operations. Dentries and the dcache are the domain of the VFS and the individual filesystem implementations. Device drivers have no business here. These methods may be set to NULL, as they are either optional or the VFS uses a default. As of kernel 2.6.22, the following members are defined:

    struct dentry_operations {
        int (*d_revalidate)(struct inode *, const struct qstr *,
                            struct dentry *, unsigned int);
        int (*d_weak_revalidate)(struct dentry *, unsigned int);
        int (*d_hash)(const struct dentry *, struct qstr *);
        int (*d_compare)(const struct dentry *,
                         unsigned int, const char *, const struct qstr *);
        int (*d_delete)(const struct dentry *);
        int (*d_init)(struct dentry *);
        void (*d_release)(struct dentry *);
        void (*d_iput)(struct dentry *, struct inode *);
        char *(*d_dname)(struct dentry *, char *, int);
        struct vfsmount *(*d_automount)(struct path *);
        int (*d_manage)(const struct path *, bool);
        struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
        bool (*d_unalias_trylock)(const struct dentry *);
        void (*d_unalias_unlock)(const struct dentry *);
    };

d_revalidate

called when the VFS needs to revalidate a dentry. This is called whenever a name look-up finds a dentry in the dcache. Most local filesystems leave this as NULL, because all their dentries in the dcache are valid. Network filesystems are different since things can change on the server without the client necessarily being aware of it.

This function should return a positive value if the dentry is still valid, and zero or a negative error code if it isn’t.

d_revalidate may be called in rcu-walk mode (flags & LOOKUP_RCU). If in rcu-walk mode, the filesystem must revalidate the dentry without blocking or storing to the dentry; d_parent and d_inode should not be used without care (because they can change and, in the d_inode case, even become NULL under us).

If a situation is encountered that rcu-walk cannot handle, return -ECHILD and it will be called again in ref-walk mode.
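
The return-value convention, including the -ECHILD fallback, can be sketched in userspace as follows. Everything here is hypothetical: struct mock_dentry, mock_d_revalidate(), the "version" comparison standing in for a server check, and the value chosen for MOCK_LOOKUP_RCU.

```c
#include <errno.h>
#include <stdbool.h>

#define MOCK_LOOKUP_RCU 0x1u	/* hypothetical flag value for this sketch */

/* Hypothetical dentry that remembers the version it was validated against. */
struct mock_dentry {
	long cached_version;
};

/*
 * Mock ->d_revalidate(): returns 1 if still valid, 0 if stale.
 * 'must_block' models a check that cannot be done without sleeping
 * (for example a network round trip); in rcu-walk mode that case is
 * refused with -ECHILD so the caller retries in ref-walk mode.
 */
int mock_d_revalidate(const struct mock_dentry *dentry, long server_version,
		      bool must_block, unsigned int flags)
{
	if ((flags & MOCK_LOOKUP_RCU) && must_block)
		return -ECHILD;

	return dentry->cached_version == server_version ? 1 : 0;
}
```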

d_weak_revalidate

called when the VFS needs to revalidate a “jumped” dentry. This is called when a path-walk ends at a dentry that was not acquired by doing a lookup in the parent directory. This includes “/”, “.” and “..”, as well as procfs-style symlinks and mountpoint traversal.

In this case, we are less concerned with whether the dentry is still fully correct, but rather that the inode is still valid. As with d_revalidate, most local filesystems will set this to NULL since their dcache entries are always valid.

This function has the same return code semantics as d_revalidate.

d_weak_revalidate is only called after leaving rcu-walk mode.

d_hash

called when the VFS adds a dentry to the hash table. The first dentry passed to d_hash is the parent directory that the name is to be hashed into.

Same locking and synchronisation rules as d_compare regarding what is safe to dereference etc.

d_compare

called to compare a dentry name with a given name. The first dentry is the parent of the dentry to be compared, the second is the child dentry. len and name string are properties of the dentry to be compared. qstr is the name to compare it with.

Must be constant and idempotent, should not take locks if possible, and should not store into the dentry. Should not dereference pointers outside the dentry without lots of care (e.g. d_parent, d_inode, d_name should not be used).

However, our vfsmount is pinned, and RCU held, so the dentries and inodes won’t disappear, neither will our sb or filesystem module. ->d_sb may be used.

It is a tricky calling convention because it needs to be called under “rcu-walk”, i.e. without any locks or references on things.

d_delete

called when the last reference to a dentry is dropped and the dcache is deciding whether or not to cache it. Return 1 to delete immediately, or 0 to cache the dentry. Default is NULL which means to always cache a reachable dentry. d_delete must be constant and idempotent.

d_init

called when a dentry is allocated

d_release

called when a dentry is really deallocated

d_iput

called when a dentry loses its inode (just prior to its being deallocated). The default when this is NULL is that the VFS calls iput(). If you define this method, you must call iput() yourself.

d_dname

called when the pathname of a dentry should be generated. Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay pathname generation. (Instead of doing it when the dentry is created, it’s done only when the path is needed.) Real filesystems probably don’t want to use it, because their dentries are present in the global dcache hash, so their hash should be an invariant. As no lock is held, d_dname() should not try to modify the dentry itself, unless appropriate SMP safety is used. CAUTION: d_path() logic is quite tricky. The correct way to return for example “Hello” is to put it at the end of the buffer, and return a pointer to the first char. The dynamic_dname() helper function is provided to take care of this.

Example:

    static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
    {
        return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]",
                             dentry->d_inode->i_ino);
    }
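
The "put it at the end of the buffer" convention from the CAUTION note can be tried out in userspace. The following is a sketch written in the spirit of dynamic_dname(), not a copy of it: mock_dynamic_dname() is a hypothetical name, the 64-byte scratch buffer is an arbitrary choice, and returning NULL when the name does not fit is a simplification for this example.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a name into a scratch buffer, then copy it (including the
 * terminating NUL) to the END of 'buffer' and return a pointer to its
 * first character. Returns NULL if the name does not fit.
 */
char *mock_dynamic_dname(char *buffer, int buflen, const char *fmt, ...)
{
	char temp[64];
	va_list args;
	int sz;

	va_start(args, fmt);
	sz = vsnprintf(temp, sizeof(temp), fmt, args) + 1; /* count the NUL */
	va_end(args);

	if (sz <= 0 || sz > (int)sizeof(temp) || sz > buflen)
		return NULL;

	buffer += buflen - sz;	/* start so that the NUL lands on the last byte */
	return memcpy(buffer, temp, (size_t)sz);
}
```

With a 32-byte buffer and the format "pipe:[%lu]" applied to inode number 42, the returned pointer refers to the last 10 bytes of the buffer, which hold "pipe:[42]" and its terminator.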

d_automount

called when an automount dentry is to be traversed (optional). This should create a new VFS mount record and return the record to the caller. The caller is supplied with a path parameter giving the automount directory to describe the automount target and the parent VFS mount record to provide inheritable mount parameters. NULL should be returned if someone else managed to make the automount first. If the vfsmount creation failed, then an error code should be returned. If -EISDIR is returned, then the directory will be treated as an ordinary directory and returned to pathwalk to continue walking.

If a vfsmount is returned, the caller will attempt to mount it on the mountpoint and will remove the vfsmount from its expiration list in the case of failure.

This function is only used if DCACHE_NEED_AUTOMOUNT is set on the dentry. This is set by __d_instantiate() if S_AUTOMOUNT is set on the inode being added.

d_manage

called to allow the filesystem to manage the transition from a dentry (optional). This allows autofs, for example, to hold up clients waiting to explore behind a ‘mountpoint’ while letting the daemon go past and construct the subtree there. 0 should be returned to let the calling process continue. -EISDIR can be returned to tell pathwalk to use this directory as an ordinary directory and to ignore anything mounted on it and not to check the automount flag. Any other error code will abort pathwalk completely.

If the ‘rcu_walk’ parameter is true, then the caller is doing a pathwalk in RCU-walk mode. Sleeping is not permitted in this mode, and the caller can be asked to leave it and call again by returning -ECHILD. -EISDIR may also be returned to tell pathwalk to ignore d_automount or any mounts.

This function is only used if DCACHE_MANAGE_TRANSIT is set on the dentry being transited from.

d_real

overlay/union type filesystems implement this method to return one of the underlying dentries of a regular file hidden by the overlay.

The ‘type’ argument takes the values D_REAL_DATA or D_REAL_METADATA for returning the real underlying dentry that refers to the inode hosting the file’s data or metadata respectively.

For non-regular files, the ‘dentry’ argument is returned.

d_unalias_trylock

if present, will be called by d_splice_alias() before moving a preexisting attached alias. Returning false prevents __d_move(), making d_splice_alias() fail with -ESTALE.

Rationale: setting FS_RENAME_DOES_D_MOVE will prevent d_move() and d_exchange() calls from the outside of filesystem methods; however, it does not guarantee that attached dentries won’t be renamed or moved by d_splice_alias() finding a preexisting alias for a directory inode. Normally we would not care; however, something that wants to stabilize the entire path to root over a blocking operation might need that. See 9p for one (and hopefully only) example.

d_unalias_unlock

should be paired with d_unalias_trylock; that one is called after the __d_move() call in __d_unalias().

Each dentry has a pointer to its parent dentry, as well as a hash list of child dentries. Child dentries are basically like files in a directory.

Directory Entry Cache API

There are a number of functions defined which permit a filesystem to manipulate dentries:

dget

open a new handle for an existing dentry (this just increments the usage count)

dput

close a handle for a dentry (decrements the usage count). If the usage count drops to 0, and the dentry is still in its parent’s hash, the “d_delete” method is called to check whether it should be cached. If it should not be cached, or if the dentry is not hashed, it is deleted. Otherwise cached dentries are put into an LRU list to be reclaimed on memory shortage.

d_drop

this unhashes a dentry from its parent’s hash list. A subsequent call to dput() will deallocate the dentry if its usage count drops to 0.

d_delete

delete a dentry. If there are no other open references to the dentry then the dentry is turned into a negative dentry (the d_iput() method is called). If there are other references, then d_drop() is called instead.

d_add

add a dentry to its parent’s hash list and then call d_instantiate()

d_instantiate

add a dentry to the alias hash list for the inode and update the “d_inode” member. The “i_count” member in the inode structure should be set/incremented. If the inode pointer is NULL, the dentry is called a “negative dentry”. This function is commonly called when an inode is created for an existing negative dentry.

d_lookup

look up a dentry given its parent and path name component. It looks up the child of that given name from the dcache hash table. If it is found, the reference count is incremented and the dentry is returned. The caller must use dput() to free the dentry when it finishes using it.
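
The interplay of dget(), dput(), d_drop() and the d_delete method described in this list can be modelled with a small userspace state machine. This is a simplified sketch: all mock_* names and the three-state enum are hypothetical, and there is no locking, hashing or real LRU list here.

```c
#include <stdbool.h>
#include <stddef.h>

enum mock_state { MOCK_IN_USE, MOCK_ON_LRU, MOCK_FREED };

/* Hypothetical, heavily reduced stand-in for struct dentry. */
struct mock_dentry {
	int count;					/* usage count */
	bool hashed;					/* still in parent's hash */
	int (*d_delete)(const struct mock_dentry *);	/* optional method */
	enum mock_state state;
};

/* Example d_delete method: never keep the dentry cached. */
int mock_always_delete(const struct mock_dentry *dentry)
{
	(void)dentry;
	return 1;
}

struct mock_dentry *mock_dget(struct mock_dentry *dentry)
{
	dentry->count++;
	dentry->state = MOCK_IN_USE;
	return dentry;
}

void mock_d_drop(struct mock_dentry *dentry)
{
	dentry->hashed = false;	/* unhash from the parent's hash list */
}

void mock_dput(struct mock_dentry *dentry)
{
	if (--dentry->count > 0)
		return;

	/* Unhashed, or d_delete says "do not cache": delete it. */
	if (!dentry->hashed || (dentry->d_delete && dentry->d_delete(dentry)))
		dentry->state = MOCK_FREED;
	else
		dentry->state = MOCK_ON_LRU;	/* cached until memory is short */
}
```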

Mount Options

Parsing options

On mount and remount the filesystem is passed a string containing a comma separated list of mount options. The options can have either of these forms:

    option
    option=value

The <linux/parser.h> header defines an API that helps parse these options. There are plenty of examples on how to use it in existing filesystems.
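
To make the two option forms concrete, here is a userspace sketch that splits such a string with plain C library calls. It deliberately does not use the <linux/parser.h> API; struct mock_opts, mock_parse_options() and the option names "quiet", "uid" and "mode" are all invented for illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical set of options understood by the mock filesystem. */
struct mock_opts {
	int quiet;	/* flag form:   "quiet"    */
	int uid;	/* value form:  "uid=1000" */
	char mode[8];	/* value form:  "mode=0755" */
};

/* Parse "a,b=c,..." in place (the string is modified). */
int mock_parse_options(char *options, struct mock_opts *out)
{
	char *p = options;

	while (p && *p) {
		char *next = strchr(p, ',');
		char *value;

		if (next)
			*next++ = '\0';		/* terminate this token */
		value = strchr(p, '=');
		if (value)
			*value++ = '\0';	/* split "option=value" */

		if (*p == '\0' && !value) {
			/* empty token, as in "a,,b": skip it */
		} else if (strcmp(p, "quiet") == 0 && !value) {
			out->quiet = 1;
		} else if (strcmp(p, "uid") == 0 && value) {
			if (sscanf(value, "%d", &out->uid) != 1)
				return -EINVAL;
		} else if (strcmp(p, "mode") == 0 && value) {
			snprintf(out->mode, sizeof(out->mode), "%s", value);
		} else {
			return -EINVAL;		/* unknown or malformed option */
		}
		p = next;
	}
	return 0;
}
```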

Showing options

If a filesystem accepts mount options, it must define show_options() to show all the currently active options. The rules are:

  • options MUST be shown which are not default or their values differ from the default

  • options MAY be shown which are enabled by default or have their default value

Options used only internally between a mount helper and the kernel (such as file descriptors), or which only have an effect during the mounting (such as ones controlling the creation of a journal) are exempt from the above rules.

The underlying reason for the above rules is to make sure that a mount can be accurately replicated (e.g. umounting and mounting again) based on the information found in /proc/mounts.
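
The "show what differs from the default" rule can be sketched in userspace as below. This is only an illustration: struct mock_opts, MOCK_DEFAULT_UID, mock_show_options() and the option names are hypothetical, and the output goes into a plain character buffer rather than the seq_file a real show_options() would write to.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

#define MOCK_DEFAULT_UID 0	/* assumed default for this sketch */

/* Hypothetical active options of a mounted mock filesystem. */
struct mock_opts {
	int quiet;	/* off by default */
	int uid;	/* MOCK_DEFAULT_UID by default */
};

/*
 * Write every non-default option into buf, each prefixed with a comma.
 * Returns 0 on success or a negative error code if buf is too small.
 */
int mock_show_options(char *buf, size_t len, const struct mock_opts *opts)
{
	size_t used = 0;
	int n;

	if (len == 0)
		return -EINVAL;
	buf[0] = '\0';

	if (opts->quiet) {
		n = snprintf(buf + used, len - used, ",quiet");
		if (n < 0 || (size_t)n >= len - used)
			return -ENOSPC;
		used += (size_t)n;
	}
	if (opts->uid != MOCK_DEFAULT_UID) {
		n = snprintf(buf + used, len - used, ",uid=%d", opts->uid);
		if (n < 0 || (size_t)n >= len - used)
			return -ENOSPC;
		used += (size_t)n;
	}
	return 0;
}
```

Because only non-default settings are emitted, feeding the resulting string back to a parser that starts from the defaults would reproduce the same configuration, which is the replication property the rules aim for.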

Resources

(Note some of these resources are not up-to-date with the latest kernel version.)

Creating Linux virtual filesystems. 2002

<https://lwn.net/Articles/13325/>

The Linux Virtual File-system Layer by Neil Brown. 1999

<http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>

A tour of the Linux VFS by Michael K. Johnson. 1996

<https://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>

A small trail through the Linux kernel by Andries Brouwer. 2001

<https://www.win.tue.nl/~aeb/linux/vfs/trail.html>