Changes since 2.5.0:

---

recommended

New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(), sb_set_blocksize() and sb_min_blocksize().

Use them.

(sb_find_get_block() replaces 2.4’s get_hash_table())

---

recommended

New methods: ->alloc_inode() and ->destroy_inode().

Remove inode->u.foo_inode_i

Declare:

    struct foo_inode_info {
            /* fs-private stuff */
            struct inode vfs_inode;
    };
    static inline struct foo_inode_info *FOO_I(struct inode *inode)
    {
            return list_entry(inode, struct foo_inode_info, vfs_inode);
    }

Use FOO_I(inode) instead of &inode->u.foo_inode_i;

Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate foo_inode_info and return the address of ->vfs_inode, the latter should free FOO_I(inode) (see in-tree filesystems for examples).

Make them ->alloc_inode and ->destroy_inode in your super_operations.

Keep in mind that now you need explicit initialization of private data, typically between calling iget_locked() and unlocking the inode.

At some point that will become mandatory.

mandatory

The foo_inode_info should always be allocated through alloc_inode_sb() rather than kmem_cache_alloc() or kmalloc(), so that the inode reclaim context is set up correctly.
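The embedding described in this entry can be tried outside the kernel. What follows is a minimal userspace sketch, not kernel code: `struct inode` is a one-field stand-in, `calloc()`/`free()` stand in for the real allocators (in the kernel the allocation goes through alloc_inode_sb(), as stated above), `container_of()` is defined locally, and `foo_private` and the argument-less `foo_alloc_inode()` are hypothetical simplifications. In the kernel, list_entry() is a thin wrapper around container_of(), so both spellings of FOO_I() compute the same address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the VFS inode; the real definition is in include/linux/fs.h. */
struct inode {
	unsigned long i_ino;
};

/* Defined locally for this sketch; the kernel provides its own version. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct foo_inode_info {
	int foo_private;		/* fs-private stuff (hypothetical field) */
	struct inode vfs_inode;		/* embedded VFS inode */
};

static inline struct foo_inode_info *FOO_I(struct inode *inode)
{
	return container_of(inode, struct foo_inode_info, vfs_inode);
}

/* Allocate the containing object and hand back the embedded inode. */
struct inode *foo_alloc_inode(void)
{
	struct foo_inode_info *fi = calloc(1, sizeof(*fi));

	if (!fi)
		return NULL;
	return &fi->vfs_inode;
}

/* Free the containing object, never the embedded inode itself. */
void foo_destroy_inode(struct inode *inode)
{
	assert(inode != NULL);
	free(FOO_I(inode));
}
```

The point of the layout is that the VFS only ever sees `struct inode *`, while the filesystem recovers its private part with pointer arithmetic and no extra lookup.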

---

mandatory

Change of file_system_type method (->read_super to ->get_sb)

->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.

Turn your foo_read_super() into a function that would return 0 in case of success and negative number in case of error (-EINVAL unless you have more informative error value to report). Call it foo_fill_super(). Now declare:

    int foo_get_sb(struct file_system_type *fs_type,
            int flags, const char *dev_name, void *data, struct vfsmount *mnt)
    {
            return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super,
                               mnt);
    }

(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of filesystem).

Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as foo_get_sb.

---

mandatory

Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames. Most likely there is no need to change anything, but if you relied on global exclusion between renames for some internal purpose - you need to change your internal locking. Otherwise exclusion warranties remain the same (i.e. parents and victim are locked, etc.).

---

informational

Now we have the exclusion between ->lookup() and directory removal (by ->rmdir() and ->rename()). If you used to need that exclusion and do it by internal locking (most of filesystems couldn’t care less) - you can relax your locking.

---

mandatory

->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(), ->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename() and ->readdir() are called without BKL now. Grab it on entry, drop upon return - that will guarantee the same locking you used to have. If your method or its parts do not need BKL - better yet, now you can shift lock_kernel() and unlock_kernel() so that they would protect exactly what needs to be protected.

---

mandatory

BKL is also moved from around sb operations. BKL should have been shifted into individual fs sb_op functions. If you don’t need it, remove it.

---

informational

check for ->link() target not being a directory is done by callers. Feel free to drop it...

---

informational

->link() callers hold ->i_mutex on the object we are linking to. Some of your problems might be over...

---

mandatory

new file_system_type method - kill_sb(superblock). If you are converting an existing filesystem, set it according to ->fs_flags:

    FS_REQUIRES_DEV         -       kill_block_super
    FS_LITTER               -       kill_litter_super
    neither                 -       kill_anon_super

FS_LITTER is gone - just remove it from fs_flags.

---

mandatory

FS_SINGLE is gone (actually, that had happened back when ->get_sb() went in - and hadn’t been documented ;-/). Just remove it from fs_flags (and see ->get_sb() entry for other actions).

---

mandatory

->setattr() is called without BKL now. Caller _always_ holds ->i_mutex, so watch for ->i_mutex-grabbing code that might be used by your ->setattr(). Callers of notify_change() need ->i_mutex now.

---

recommended

New super_block field struct export_operations *s_export_op for explicit support for exporting, e.g. via NFS. The structure is fully documented at its declaration in include/linux/fs.h, and in Making Filesystems Exportable.

Briefly it allows for the definition of decode_fh and encode_fh operations to encode and decode filehandles, and allows the filesystem to use a standard helper function for decode_fh, and provide file-system specific support for this helper, particularly get_parent.

It is planned that this will be required for exporting once the code settles down a bit.

mandatory

s_export_op is now required for exporting a filesystem. isofs, ext2, ext3, fat can be used as examples of very different filesystems.

---

mandatory

iget4() and the read_inode2 callback have been superseded by iget5_locked() which has the following prototype:

    struct inode *iget5_locked(struct super_block *sb, unsigned long ino,
                               int (*test)(struct inode *, void *),
                               int (*set)(struct inode *, void *),
                               void *data);

‘test’ is an additional function that can be used when the inode number is not sufficient to identify the actual file object. ‘set’ should be a non-blocking function that initializes those parts of a newly created inode to allow the test function to succeed. ‘data’ is passed as an opaque value to both test and set functions.

When the inode has been created by iget5_locked(), it will be returned with the I_NEW flag set and will still be locked. The filesystem then needs to finalize the initialization. Once the inode is initialized it must be unlocked by calling unlock_new_inode().

The filesystem is responsible for setting (and possibly testing) i_ino when appropriate. There is also a simpler iget_locked function that just takes the superblock and inode number as arguments and does the test and set for you.

e.g.:

    inode = iget_locked(sb, ino);
    if (inode_state_read_once(inode) & I_NEW) {
            err = read_inode_from_disk(inode);
            if (err < 0) {
                    iget_failed(inode);
                    return err;
            }
            unlock_new_inode(inode);
    }

Note that if the process of setting up a new inode fails, then iget_failed() should be called on the inode to render it dead, and an appropriate error should be passed back to the caller.

---

recommended

->getattr() finally getting used. See instances in nfs, minix, etc.

---

mandatory

->revalidate() is gone. If your filesystem had it - provide ->getattr() and let it call whatever you had as ->revalidate() + (for symlinks that had ->revalidate()) add calls in ->follow_link()/->readlink().

---

mandatory

->d_parent changes are not protected by BKL anymore. Read access is safe if at least one of the following is true:

* filesystem has no cross-directory rename()
* we know that parent had been locked (e.g. we are looking at ->d_parent of ->lookup() argument).
* we are called from ->rename().
* the child’s ->d_lock is held

Audit your code and add locking if needed. Notice that any place that is not protected by the conditions above is risky even in the old tree - you had been relying on BKL and that’s prone to screwups. Old tree had quite a few holes of that kind - unprotected access to ->d_parent leading to anything from oops to silent memory corruption.

---

mandatory

FS_NOMOUNT is gone. If you use it - just set SB_NOUSER in flags (see rootfs for one kind of solution and bdev/socket/pipe for another).

---

recommended

Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter is still alive, but only because of the mess in drivers/s390/block/dasd.c. As soon as it gets fixed is_read_only() will die.

---

mandatory

->permission() is called without BKL now. Grab it on entry, drop upon return - that will guarantee the same locking you used to have. If your method or its parts do not need BKL - better yet, now you can shift lock_kernel() and unlock_kernel() so that they would protect exactly what needs to be protected.

---

mandatory

->statfs() is now called without BKL held. BKL should have been shifted into individual fs sb_op functions where it’s not clear that it’s safe to remove it. If you don’t need it, remove it.

---

mandatory

is_read_only() is gone; use bdev_read_only() instead.

---

mandatory

destroy_buffers() is gone; use invalidate_bdev().

---

mandatory

fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is deliberate; as soon as struct block_device * is propagated in a reasonable way by that code fixing will become trivial; until then nothing can be done.

mandatory

block truncation on error exit from ->write_begin, and ->direct_IO moved from generic methods (block_write_begin, cont_write_begin, nobh_write_begin, blockdev_direct_IO*) to callers. Take a look at ext2_write_failed and callers for an example.

mandatory

->truncate is gone. The whole truncate sequence needs to be implemented in ->setattr, which is now mandatory for filesystems implementing on-disk size changes. Start with a copy of the old inode_setattr and vmtruncate, and then reorder the vmtruncate + foofs_vmtruncate sequence to be in order of zeroing blocks using block_truncate_page or similar helpers, size update, and finally on-disk truncation, which should not fail. setattr_prepare (which used to be inode_change_ok) now includes the size checks for ATTR_SIZE and must be called in the beginning of ->setattr unconditionally.

mandatory

->clear_inode() and ->delete_inode() are gone; ->evict_inode() should be used instead. It gets called whenever the inode is evicted, whether it has remaining links or not. Caller does not evict the pagecache or inode-associated metadata buffers; the method has to use truncate_inode_pages_final() to get rid of those. Caller makes sure async writeback cannot be running for the inode while (or after) ->evict_inode() is called.

->drop_inode() returns int now; it’s called on final iput() with inode->i_lock held and it returns true if the filesystem wants the inode to be dropped. As before, inode_generic_drop() is still the default and it’s been updated appropriately. inode_just_drop() is also alive and it consists simply of return 1. Note that all actual eviction work is done by caller after ->drop_inode() returns.

As before, clear_inode() must be called exactly once on each call of ->evict_inode() (as it used to be for each call of ->delete_inode()). Unlike before, if you are using inode-associated metadata buffers (i.e. mark_buffer_dirty_inode()), it’s your responsibility to call invalidate_inode_buffers() before clear_inode().

NOTE: checking i_nlink in the beginning of ->write_inode() and bailing out if it’s zero is not and never had been enough. Final unlink() and iput() may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly free the on-disk inode, you may end up doing that while ->write_inode() is writing to it.

---

mandatory

.d_delete() now only advises the dcache as to whether or not to cache unreferenced dentries, and is now only called when the dentry refcount goes to 0. Even on 0 refcount transition, it must be able to tolerate being called 0, 1, or more times (e.g. constant, idempotent).

---

mandatory

.d_compare() calling convention and locking rules are significantly changed. Read updated documentation in Overview of the Linux Virtual File System (and look at examples of other filesystems) for guidance.

---

mandatory

.d_hash() calling convention and locking rules are significantly changed. Read updated documentation in Overview of the Linux Virtual File System (and look at examples of other filesystems) for guidance.

---

mandatory

dcache_lock is gone, replaced by fine grained locks. See fs/dcache.c for details of what locks to replace dcache_lock with in order to protect particular things. Most of the time, a filesystem only needs ->d_lock, which protects all the dcache state of a given dentry.

---

mandatory

Filesystems must RCU-free their inodes, if they can have been accessed via rcu-walk path walk (basically, if the file can have had a path name in the vfs namespace).

Even though i_dentry and i_rcu share storage in a union, we will initialize the former in inode_init_always(), so just leave it alone in the callback. It used to be necessary to clean it there, but not anymore (starting at 3.2).

---

recommended

vfs now tries to do path walking in “rcu-walk mode”, which avoids atomic operations and scalability hazards on dentries and inodes (see Pathname lookup). d_hash and d_compare changes (above) are examples of the changes required to support this. For more complex filesystem callbacks, the vfs drops out of rcu-walk mode before the fs call, so no changes are required to the filesystem. However, this is costly and loses the benefits of rcu-walk mode. We will begin to add filesystem callbacks that are rcu-walk aware, shown below. Filesystems should take advantage of this where possible.

---

mandatory

d_revalidate is a callback that is made on every path element (if the filesystem provides it), which requires dropping out of rcu-walk mode. This may now be called in rcu-walk mode (nd->flags & LOOKUP_RCU). -ECHILD should be returned if the filesystem cannot handle rcu-walk. See Overview of the Linux Virtual File System for more details.

permission is an inode permission check that is called on many or all directory inodes on the way down a path walk (to check for exec permission). It must now be rcu-walk aware (mask & MAY_NOT_BLOCK). See Overview of the Linux Virtual File System for more details.

---

mandatory

In ->fallocate() you must check the mode option passed in. If your filesystem does not support hole punching (deallocating space in the middle of a file) you must return -EOPNOTSUPP if FALLOC_FL_PUNCH_HOLE is set in mode. Currently you can only have FALLOC_FL_PUNCH_HOLE with FALLOC_FL_KEEP_SIZE set, so the i_size should not change when hole punching, even when punching the end of a file off.
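The mode check can be sketched in isolation. The helper below is hypothetical (`foo_fallocate_check_mode()` is not a kernel function) and runs in userspace; the two flag values are the ones from the UAPI header <linux/falloc.h>, repeated so the sketch builds without kernel headers. Rejecting every other unknown bit is an assumption of this sketch, not something the rule above requires.

```c
#include <assert.h>
#include <errno.h>

/* Same values as <linux/falloc.h>; guarded in case a header defines them. */
#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE	0x01
#endif
#ifndef FALLOC_FL_PUNCH_HOLE
#define FALLOC_FL_PUNCH_HOLE	0x02
#endif

/*
 * Hypothetical helper for a filesystem that supports preallocation with or
 * without FALLOC_FL_KEEP_SIZE, but cannot punch holes.
 * Returns 0 if ->fallocate() may proceed, a negative errno otherwise.
 */
int foo_fallocate_check_mode(int mode)
{
	if (mode & FALLOC_FL_PUNCH_HOLE)
		return -EOPNOTSUPP;	/* required by the rule above */
	if (mode & ~FALLOC_FL_KEEP_SIZE)
		return -EOPNOTSUPP;	/* assumption: reject unknown bits too */
	return 0;
}

/* Usage checks for the helper above. */
void foo_fallocate_selftest(void)
{
	assert(foo_fallocate_check_mode(0) == 0);
	assert(foo_fallocate_check_mode(FALLOC_FL_KEEP_SIZE) == 0);
	assert(foo_fallocate_check_mode(FALLOC_FL_PUNCH_HOLE |
					FALLOC_FL_KEEP_SIZE) == -EOPNOTSUPP);
}
```

A real ->fallocate() instance would perform this check first, before touching any on-disk state.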

---

mandatory

->get_sb() is gone. Switch to use of ->mount(). Typically it’s just a matter of switching from calling get_sb_... to mount_... and changing the function type. If you were doing it manually, just switch from setting ->mnt_root to some pointer to returning that pointer. On errors return ERR_PTR(...).

---

mandatory

->permission() and generic_permission() have lost flags argument; instead of passing IPERM_FLAG_RCU we add MAY_NOT_BLOCK into mask.

generic_permission() has also lost the check_acl argument; ACL checking has been taken to VFS and filesystems need to provide a non-NULL ->i_op->get_inode_acl to read an ACL from disk.

---

mandatory

If you implement your own ->llseek() you must handle SEEK_HOLE and SEEK_DATA. You can handle this by returning -EINVAL, but it would be nicer to support it in some way. The generic handler assumes that the entire file is data and there is a virtual hole at the end of the file. So if the provided offset is less than i_size and SEEK_DATA is specified, return the same offset. If the above is true for the offset and you are given SEEK_HOLE, return the end of the file. If the offset is i_size or greater return -ENXIO in either case.
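The "whole file is data, virtual hole at the end" rule fits in a few lines. This is a userspace model with a hypothetical name (`foo_seek_hole_data()`), not the generic handler itself; treating negative offsets as -ENXIO and other whence values as -EINVAL are assumptions of the sketch.

```c
#include <assert.h>
#include <errno.h>

/* Linux values; guarded because some libcs already define them. */
#ifndef SEEK_DATA
#define SEEK_DATA	3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE	4
#endif

/*
 * Model of the simplest SEEK_HOLE/SEEK_DATA policy: every byte below
 * i_size is data, and the only hole starts at i_size.
 * Returns the resulting offset, or a negative errno.
 */
long long foo_seek_hole_data(long long offset, long long i_size, int whence)
{
	assert(i_size >= 0);

	if (whence != SEEK_DATA && whence != SEEK_HOLE)
		return -EINVAL;		/* other whence values are out of scope */
	if (offset < 0 || offset >= i_size)
		return -ENXIO;		/* nothing at or past EOF */
	return whence == SEEK_DATA ? offset : i_size;
}
```

For a 100-byte file, `foo_seek_hole_data(10, 100, SEEK_DATA)` yields 10 and `foo_seek_hole_data(10, 100, SEEK_HOLE)` yields 100, while any offset of 100 or more yields -ENXIO.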

mandatory

If you have your own ->fsync() you must make sure to call filemap_write_and_wait_range() so that all dirty pages are synced out properly. You must also keep in mind that ->fsync() is not called with i_mutex held anymore, so if you require i_mutex locking you must make sure to take it and release it yourself.

---

mandatory

d_alloc_root() is gone, along with a lot of bugs caused by code misusing it. Replacement: d_make_root(inode). On success d_make_root(inode) allocates and returns a new dentry instantiated with the passed in inode. On failure NULL is returned and the passed in inode is dropped so the reference to inode is consumed in all cases and failure handling need not do any cleanup for the inode. If d_make_root(inode) is passed a NULL inode it returns NULL and also requires no further error handling. Typical usage is:

    inode = foofs_new_inode(....);
    s->s_root = d_make_root(inode);
    if (!s->s_root)
            /* Nothing needed for the inode cleanup */
            return -ENOMEM;
    ...

---

mandatory

The witch is dead! Well, 2/3 of it, anyway. ->d_revalidate() and ->lookup() do not take struct nameidata anymore; just the flags.

---

mandatory

->create() doesn’t take struct nameidata *; unlike the previous two, it gets “is it an O_EXCL or equivalent?” boolean argument. Note that local filesystems can ignore this argument - they are guaranteed that the object doesn’t exist. It’s remote/distributed ones that might care...

---

mandatory

FS_REVAL_DOT is gone; if you used to have it, add ->d_weak_revalidate() in your dentry operations instead.

---

mandatory

vfs_readdir() is gone; switch to iterate_dir() instead

---

mandatory

->readdir() is gone now; switch to ->iterate_shared()

mandatory

vfs_follow_link has been removed. Filesystems must use nd_set_link from ->follow_link for normal symlinks, or nd_jump_link for magic /proc/<pid> style links.

---

mandatory

iget5_locked()/ilookup5()/ilookup5_nowait() test() callback used to be called with both ->i_lock and inode_hash_lock held; the former is not taken anymore, so verify that your callbacks do not rely on it (none of the in-tree instances did). inode_hash_lock is still held, of course, so they are still serialized wrt removal from inode hash, as well as wrt set() callback of iget5_locked().

---

mandatory

d_materialise_unique() is gone; d_splice_alias() does everything you need now. Remember that they have opposite orders of arguments ;-/

---

mandatory

f_dentry is gone; use f_path.dentry, or, better yet, see if you can avoid it entirely.

---

mandatory

never call ->read() and ->write() directly; use __vfs_{read,write} or wrappers; instead of checking for ->write or ->read being NULL, look for FMODE_CAN_{WRITE,READ} in file->f_mode.

---

mandatory

do _not_ use new_sync_{read,write} for ->read/->write; leave it NULL instead.

---

mandatory

->aio_read/->aio_write are gone. Use ->read_iter/->write_iter.

---

recommended

for embedded (“fast”) symlinks just set inode->i_link to wherever the symlink body is and use simple_follow_link() as ->follow_link().

---

mandatory

calling conventions for ->follow_link() have changed. Instead of returning cookie and using nd_set_link() to store the body to traverse, we return the body to traverse and store the cookie using explicit void ** argument. nameidata isn’t passed at all - nd_jump_link() doesn’t need it and nd_[gs]et_link() is gone.

---

mandatory

calling conventions for ->put_link() have changed. It gets inode instead of dentry, it does not get nameidata at all and it gets called only when cookie is non-NULL. Note that link body isn’t available anymore, so if you need it, store it as cookie.

---

mandatory

any symlink that might use page_follow_link_light/page_put_link() must have inode_nohighmem(inode) called before anything might start playing with its pagecache. No highmem pages should end up in the pagecache of such symlinks. That includes any preseeding that might be done during symlink creation. page_symlink() will honour the mapping gfp flags, so once you’ve done inode_nohighmem() it’s safe to use, but if you allocate and insert the page manually, make sure to use the right gfp flags.

---

mandatory

->follow_link() is replaced with ->get_link(); same API, except that

* ->get_link() gets inode as a separate argument
* ->get_link() may be called in RCU mode - in that case NULL dentry is passed

---

mandatory

->get_link() gets struct delayed_call *done now, and should do set_delayed_call() where it used to set *cookie.

->put_link() is gone - just give the destructor to set_delayed_call() in ->get_link().

---

mandatory

->getxattr() and xattr_handler.get() get dentry and inode passed separately. dentry might be yet to be attached to inode, so do _not_ use its ->d_inode in the instances. Rationale: !@#!@# security_d_instantiate() needs to be called before we attach dentry to inode.

---

mandatory

symlinks are no longer the only inodes that do not have i_bdev/i_cdev/i_pipe/i_link union zeroed out at inode eviction. As the result, you can’t assume that non-NULL value in ->i_link at ->destroy_inode() implies that it’s a symlink. Checking ->i_mode is really needed now. In-tree we had to fix shmem_destroy_callback() that used to take that kind of shortcut; watch out, since that shortcut is no longer valid.

---

mandatory

->i_mutex is replaced with ->i_rwsem now. inode_lock() et al. work as they used to - they just take it exclusive. However, ->lookup() may be called with parent locked shared. Its instances must not:

* use d_instantiate() and d_rehash() separately - use d_add() or d_splice_alias() instead.
* use d_rehash() alone - call d_add(new_dentry, NULL) instead.
* in the unlikely case when (read-only) access to filesystem data structures needs exclusion for some reason, arrange it yourself. None of the in-tree filesystems needed that.
* rely on ->d_parent and ->d_name not changing after dentry has been fed to d_add() or d_splice_alias(). Again, none of the in-tree instances relied upon that.

We are guaranteed that lookups of the same name in the same directory will not happen in parallel (“same” in the sense of your ->d_compare()). Lookups on different names in the same directory can and do happen in parallel now.

---

mandatory

->iterate_shared() is added. Exclusion on struct file level is still provided (as well as that between it and lseek on the same struct file), but if your directory has been opened several times, you can get these called in parallel. Exclusion between that method and all directory-modifying ones is still provided, of course.

If you have any per-inode or per-dentry in-core data structures modified by ->iterate_shared(), you might need something to serialize the access to them. If you do dcache pre-seeding, you’ll need to switch to d_alloc_parallel() for that; look for in-tree examples.

---

mandatory

->atomic_open() calls without O_CREAT may happen in parallel.

---

mandatory

->setxattr() and xattr_handler.set() get dentry and inode passed separately. The xattr_handler.set() gets passed the user namespace of the mount the inode is seen from so filesystems can idmap the i_uid and i_gid accordingly. dentry might be yet to be attached to inode, so do _not_ use its ->d_inode in the instances. Rationale: !@#!@# security_d_instantiate() needs to be called before we attach dentry to inode and !@#!@##!@$!$#!@#$!@$!@$ smack ->d_instantiate() uses not just ->getxattr() but ->setxattr() as well.

---

mandatory

->d_compare() doesn’t get parent as a separate argument anymore. If you used it for finding the struct super_block involved, dentry->d_sb will work just as well; if it’s something more complicated, use dentry->d_parent. Just be careful not to assume that fetching it more than once will yield the same value - in RCU mode it could change under you.

---

mandatory

->rename() has an added flags argument. Any flags not handled by the filesystem should result in EINVAL being returned.
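A common way to follow this rule is to mask off the flags the filesystem implements and fail if anything is left. The sketch below is a userspace model with hypothetical names (`foo_rename_check_flags()`, a filesystem that only implements RENAME_NOREPLACE); the flag values match the UAPI definitions, and the error is returned as a negative errno, as kernel methods conventionally do.

```c
#include <assert.h>
#include <errno.h>

/* Same values as the UAPI definitions; guarded in case libc has them. */
#ifndef RENAME_NOREPLACE
#define RENAME_NOREPLACE	(1 << 0)
#endif
#ifndef RENAME_EXCHANGE
#define RENAME_EXCHANGE		(1 << 1)
#endif

/* Hypothetical filesystem that implements RENAME_NOREPLACE and nothing else. */
int foo_rename_check_flags(unsigned int flags)
{
	if (flags & ~RENAME_NOREPLACE)
		return -EINVAL;		/* some flag we do not handle */
	return 0;
}

/* Usage checks for the helper above. */
void foo_rename_selftest(void)
{
	assert(foo_rename_check_flags(0) == 0);
	assert(foo_rename_check_flags(RENAME_NOREPLACE) == 0);
	assert(foo_rename_check_flags(RENAME_EXCHANGE) == -EINVAL);
	assert(foo_rename_check_flags(RENAME_NOREPLACE |
				      RENAME_EXCHANGE) == -EINVAL);
}
```

Masking against the supported set, rather than testing for specific unsupported flags, means flags added later are rejected without further changes.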

---

recommended

->readlink is optional for symlinks. Don’t set, unless filesystem needs to fake something for readlink(2).

---

mandatory

->getattr() is now passed a struct path rather than a vfsmount and dentry separately, and it now has request_mask and query_flags arguments to specify the fields and sync type requested by statx. Filesystems not supporting any statx-specific features may ignore the new arguments.

---

mandatory

->atomic_open() calling conventions have changed. Gone is int *opened, along with FILE_OPENED/FILE_CREATED. In place of those we have FMODE_OPENED/FMODE_CREATED, set in file->f_mode. Additionally, return value for ‘called finish_no_open(), open it yourself’ case has become 0, not 1. Since finish_no_open() itself is returning 0 now, that part does not need any changes in ->atomic_open() instances.

---

mandatory

alloc_file() has become static now; two wrappers are to be used instead. alloc_file_pseudo(inode, vfsmount, name, flags, ops) is for the cases when dentry needs to be created; that’s the majority of old alloc_file() users. Calling conventions: on success a reference to new struct file is returned and caller’s reference to inode is subsumed by that. On failure, ERR_PTR() is returned and no caller’s references are affected, so the caller needs to drop the inode reference it held. alloc_file_clone(file, flags, ops) does not affect any caller’s references. On success you get a new struct file sharing the mount/dentry with the original, on failure - ERR_PTR().

---

mandatory

->clone_file_range() and ->dedupe_file_range have been replaced with ->remap_file_range(). See Overview of the Linux Virtual File System for more information.

---

recommended

->lookup() instances doing an equivalent of:

    if (IS_ERR(inode))
            return ERR_CAST(inode);
    return d_splice_alias(inode, dentry);

don’t need to bother with the check - d_splice_alias() will do the right thing when given ERR_PTR(...) as inode. Moreover, passing NULL inode to d_splice_alias() will also do the right thing (equivalent of d_add(dentry, NULL); return NULL;), so that kind of special cases also doesn’t need a separate treatment.

---

strongly recommended

take the RCU-delayed parts of ->destroy_inode() into a new method - ->free_inode(). If ->destroy_inode() becomes empty - all the better, just get rid of it. Synchronous work (e.g. the stuff that can’t be done from an RCU callback, or any WARN_ON() where we want the stack trace) might be movable to ->evict_inode(); however, that goes only for the things that are not needed to balance something done by ->alloc_inode(). IOW, if it’s cleaning up the stuff that might have accumulated over the life of in-core inode, ->evict_inode() might be a fit.

Rules for inode destruction:

* if ->destroy_inode() is non-NULL, it gets called
* if ->free_inode() is non-NULL, it gets scheduled by call_rcu()
* combination of NULL ->destroy_inode and NULL ->free_inode is treated as NULL/free_inode_nonrcu, to preserve the compatibility.

Note that the callback (be it via ->free_inode() or explicit call_rcu() in ->destroy_inode()) is NOT ordered wrt superblock destruction; as the matter of fact, the superblock and all associated structures might be already gone. The filesystem driver is guaranteed to be still there, but that’s it. Freeing memory in the callback is fine; doing more than that is possible, but requires a lot of care and is best avoided.
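The rules for inode destruction can be read as a small decision table. The sketch below is one possible reading, modelled in userspace: it records what would happen rather than calling call_rcu(), and `struct foo_super_ops`, `struct foo_destroy_plan`, `foo_plan_destroy()` and `foo_noop()` are all hypothetical names. The behaviour for a non-NULL ->destroy_inode() combined with a NULL ->free_inode() (nothing is scheduled, since freeing is then ->destroy_inode()'s own job) is an assumption that the rules above do not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct inode;				/* opaque stand-in for the VFS inode */

/* Hypothetical subset of super_operations: just the two methods at issue. */
struct foo_super_ops {
	void (*destroy_inode)(struct inode *);
	void (*free_inode)(struct inode *);
};

/* What would happen to an inode; the sketch records it instead of doing it. */
struct foo_destroy_plan {
	bool calls_destroy_inode;	/* ->destroy_inode() runs synchronously */
	bool schedules_rcu_free;	/* a callback is queued with call_rcu() */
	bool rcu_free_is_default;	/* queued callback is the VFS default */
};

struct foo_destroy_plan foo_plan_destroy(const struct foo_super_ops *ops)
{
	struct foo_destroy_plan plan = { false, false, false };

	assert(ops != NULL);
	if (ops->destroy_inode) {
		plan.calls_destroy_inode = true;
		if (!ops->free_inode)
			return plan;	/* assumption: freeing is its own job */
	}
	plan.schedules_rcu_free = true;
	plan.rcu_free_is_default = (ops->free_inode == NULL);
	return plan;
}

/* Do-nothing method used to populate an ops table in examples. */
void foo_noop(struct inode *inode)
{
	(void)inode;
}
```

With both methods NULL the plan is "RCU-delayed default free", which corresponds to the NULL/free_inode_nonrcu case in the list above.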

---

mandatory

DCACHE_RCUACCESS is gone; having an RCU delay on dentry freeing is the default. DCACHE_NORCU opts out, and only d_alloc_pseudo() has any business doing so.

---

mandatory

d_alloc_pseudo() is internal-only; uses outside of alloc_file_pseudo() are very suspect (and won’t work in modules). Such uses are very likely to be misspelled d_alloc_anon().

---

mandatory

[should’ve been added in 2016] stale comment in finish_open() notwithstanding, failure exits in ->atomic_open() instances should NOT fput() the file, no matter what. Everything is handled by the caller.

---

mandatory

clone_private_mount() returns a longterm mount now, so the proper destructor of its result is kern_unmount() or kern_unmount_array().

---

mandatory

zero-length bvec segments are disallowed, they must be filtered out before being passed on to an iterator.

---

mandatory

For bvec based iterators bio_iov_iter_get_pages() now doesn’t copy bvecs but uses the one provided. Anyone issuing kiocb-I/O should ensure that the bvec and page references stay until I/O has completed, i.e. until ->ki_complete() has been called or returned with non -EIOCBQUEUED code.

---

mandatory

mnt_want_write_file() can now only be paired with mnt_drop_write_file(), whereas previously it could be paired with mnt_drop_write() as well.

---

mandatory

iov_iter_copy_from_user_atomic() is gone; use copy_page_from_iter_atomic(). The difference is copy_page_from_iter_atomic() advances the iterator and you don’t need iov_iter_advance() after it. However, if you decide to use only a part of obtained data, you should do iov_iter_revert().

---

mandatory

Calling conventions for file_open_root() changed; now it takes struct path * instead of passing mount and dentry separately. For callers that used to pass <mnt, mnt->mnt_root> pair (i.e. the root of given mount), a new helper is provided - file_open_root_mnt(). In-tree users adjusted.

---

mandatory

no_llseek is gone; don’t set .llseek to that - just leave it NULL instead. Checks for “does that file have llseek(2), or should it fail with ESPIPE” should be done by looking at FMODE_LSEEK in file->f_mode.

---

mandatory

filldir_t (readdir callbacks) calling conventions have changed. Instead of returning 0 or -E... it returns bool now. false means “no more” (as -E... used to) and true - “keep going” (as 0 in old calling conventions). Rationale: callers never looked at specific -E... values anyway. ->iterate_shared() instances require no changes at all, all filldir_t ones in the tree converted.
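The bool convention can be illustrated with a reduced userspace model. Everything here is hypothetical: `foo_filldir_t` takes only a context and a name (the real filldir_t carries more arguments), and `foo_emit_all()` stands in for whatever walks the directory and feeds entries to the callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified callback type for this sketch. */
typedef bool (*foo_filldir_t)(void *ctx, const char *name);

/* Caller-side state for the example callback. */
struct foo_collect {
	size_t count;	/* entries accepted so far */
	size_t limit;	/* stop once this many have been accepted */
};

/* New convention: true means "keep going", false means "no more". */
bool foo_collect_actor(void *ctx, const char *name)
{
	struct foo_collect *c = ctx;

	(void)name;
	if (c->count >= c->limit)
		return false;	/* used to be a -E... return */
	c->count++;
	return true;		/* used to be 0 */
}

/*
 * Feed names to the actor, stopping as soon as it says "no more".
 * Returns the number of entries the actor accepted.
 */
size_t foo_emit_all(const char *const *names, size_t n,
		    foo_filldir_t actor, void *ctx)
{
	size_t i;

	assert(actor != NULL);
	for (i = 0; i < n; i++)
		if (!actor(ctx, names[i]))
			break;
	return i;
}
```

The caller only distinguishes "continue" from "stop", which is exactly the information the old 0 versus -E... return values were used for.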

---

mandatory

Calling conventions for ->tmpfile() have changed. It now takes a struct file pointer instead of struct dentry pointer. d_tmpfile() is similarly changed to simplify callers. The passed file is in a non-open state and on success must be opened before returning (e.g. by calling finish_open_simple()).

---

mandatory

Calling convention for ->huge_fault has changed. It now takes a page order instead of an enum page_entry_size, and it may be called without the mmap_lock held. All in-tree users have been audited and do not seem to depend on the mmap_lock being held, but out of tree users should verify for themselves. If they do need it, they can return VM_FAULT_RETRY to be called with the mmap_lock held.

---

mandatory

The order of opening block devices and matching or creating superblocks has changed.

The old logic opened block devices first and then tried to find a suitable superblock to reuse based on the block device pointer.

The new logic tries to find a suitable superblock first based on the device number, and opens the block device afterwards.

Since opening block devices cannot happen under s_umount because of lock ordering requirements s_umount is now dropped while opening block devices and reacquired before calling fill_super().

In the old logic concurrent mounters would find the superblock on the list of superblocks for the filesystem type. Since the first opener of the block device would hold s_umount they would wait until the superblock became either born or was discarded due to initialization failure.

Since the new logic drops s_umount concurrent mounters could grab s_umount and would spin. Instead they are now made to wait using an explicit wait-wake mechanism without having to hold s_umount.

---

mandatory

The holder of a block device is now the superblock.

The holder of a block device used to be the file_system_type which wasn’t particularly useful. It wasn’t possible to go from block device to owning superblock without matching on the device pointer stored in the superblock. This mechanism would only work for a single device so the block layer couldn’t find the owning superblock of any additional devices.

In the old mechanism reusing or creating a superblock for a racing mount(2) and umount(2) relied on the file_system_type as the holder. This was severely underdocumented however:

(1) Any concurrent mounter that managed to grab an active reference on an existing superblock was made to wait until the superblock either became ready or until the superblock was removed from the list of superblocks of the filesystem type. If the superblock is ready the caller would simply reuse it.
(2) If the mounter came after deactivate_locked_super() but before the superblock had been removed from the list of superblocks of the filesystem type the mounter would wait until the superblock was shutdown, reuse the block device and allocate a new superblock.
(3) If the mounter came after deactivate_locked_super() and after the superblock had been removed from the list of superblocks of the filesystem type the mounter would reuse the block device and allocate a new superblock (the bd_holder pointer may still be set to the filesystem type).

Because the holder of the block device was the file_system_type any concurrent mounter could open the block devices of any superblock of the same file_system_type without risking seeing EBUSY because the block device was still in use by another superblock.

Making the superblock the owner of the block device changes this as the holder is now a unique superblock and thus block devices associated with it cannot be reused by concurrent mounters. So a concurrent mounter in (2) could suddenly see EBUSY when trying to open a block device whose holder was a different superblock.

The new logic thus waits until the superblock and the devices are shutdown in ->kill_sb(). Removal of the superblock from the list of superblocks of the filesystem type is now moved to a later point when the devices are closed:

(1) Any concurrent mounter managing to grab an active reference on an existing superblock is made to wait until the superblock is either ready or until the superblock and all devices are shutdown in ->kill_sb(). If the superblock is ready the caller will simply reuse it.
(2) If the mounter comes after deactivate_locked_super() but before the superblock has been removed from the list of superblocks of the filesystem type the mounter is made to wait until the superblock and the devices are shut down in ->kill_sb() and the superblock is removed from the list of superblocks of the filesystem type. The mounter will allocate a new superblock and grab ownership of the block device (the bd_holder pointer of the block device will be set to the newly allocated superblock).
(3) This case is now collapsed into (2) as the superblock is left on the list of superblocks of the filesystem type until all devices are shutdown in ->kill_sb(). In other words, if the superblock isn’t on the list of superblocks of the filesystem type anymore then it has given up ownership of all associated block devices (the bd_holder pointer is NULL).

As this is a VFS level change it has no practical consequences for filesystems other than that all of them must use one of the provided kill_litter_super(), kill_anon_super(), or kill_block_super() helpers.

---

mandatory

Lock ordering has been changed so that s_umount ranks above open_mutex again. All places where s_umount was taken under open_mutex have been fixed up.

---

mandatory

export_operations ->encode_fh() no longer has a default implementation to encode FILEID_INO32_GEN* file handles. Filesystems that used the default implementation may use the generic helper generic_encode_ino32_fh() explicitly.

---

mandatory

If the ->rename() update of .. on a cross-directory move needs exclusion with directory modifications, do not lock the subdirectory in question in your ->rename() - it’s done by the caller now [that item should’ve been added in 28eceeda130f “fs: Lock moved directories”].

---

mandatory

On same-directory ->rename() the (tautological) update of .. is not protected by any locks; just don’t do it if the old parent is the same as the new one. We really can’t lock two subdirectories in same-directory rename - not without deadlocks.

---

mandatory

lock_rename() and lock_rename_child() may fail in the cross-directory case, if their arguments do not have a common ancestor. In that case ERR_PTR(-EXDEV) is returned, with no locks taken. In-tree users have been updated; out-of-tree ones would need to do so.
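The caller-side consequence can be sketched in user space. This is a model, not kernel code: ERR_PTR(), IS_ERR() and PTR_ERR() are re-implemented here in simplified form, and demo_lock_rename(), demo_rename() and struct demo_dir are hypothetical stand-ins that only mimic the "no common ancestor" failure.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified user-space copies of the kernel's error-pointer helpers. */
#define DEMO_MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Hypothetical stand-in for a directory; "root" identifies its tree. */
struct demo_dir {
	int root;
};

/*
 * Hypothetical stand-in for lock_rename(): fails with -EXDEV and takes no
 * locks when the two directories do not share a common ancestor. This model
 * always returns NULL on success.
 */
static void *demo_lock_rename(struct demo_dir *a, struct demo_dir *b,
			      int *locks_held)
{
	if (a->root != b->root)
		return ERR_PTR(-EXDEV);
	*locks_held = 2;
	return NULL;
}

/* Caller pattern: bail out on an error pointer, unlock only on success. */
static int demo_rename(struct demo_dir *a, struct demo_dir *b)
{
	int locks_held = 0;
	void *trap = demo_lock_rename(a, b, &locks_held);

	if (IS_ERR(trap))
		return (int)PTR_ERR(trap);	/* nothing to unlock */

	/* ... the rename itself would go here ... */
	locks_held = 0;				/* models the unlock step */
	return 0;
}
```

The point of the pattern is that the error path must not try to unlock anything, since the failing call took no locks.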

---

mandatory

The list of children anchored in parent dentry got turned into hlist now. Field names got changed (->d_children/->d_sib instead of ->d_subdirs/->d_child for anchor/entries resp.), so any affected places will be immediately caught by compiler.

---

mandatory

->d_delete() instances are now called for dentries with ->d_lock held and refcount equal to 0. They are not permitted to drop/regain ->d_lock. None of in-tree instances did anything of that sort. Make sure yours do not...

---

mandatory

->d_prune() instances are now called without ->d_lock held on the parent. ->d_lock on dentry itself is still held; if you need per-parent exclusions (none of the in-tree instances did), use your own spinlock.

->d_iput() and ->d_release() are called with victim dentry still in the list of parent’s children. It is still unhashed, marked killed, etc., just not removed from parent’s ->d_children yet.

Anyone iterating through the list of children needs to be aware of the half-killed dentries that might be seen there; taking ->d_lock on those will see them negative, unhashed and with negative refcount, which means that most of the in-kernel users would’ve done the right thing anyway without any adjustment.

---

recommended

Block device freezing and thawing have been moved to holder operations.

Before this change, get_active_super() would only be able to find the superblock of the main block device, i.e., the one stored in sb->s_bdev. Block device freezing now works for any block device owned by a given superblock, not just the main block device. The get_active_super() helper and bd_fsfreeze_sb pointer are gone.

---

mandatory

set_blocksize() now takes an opened struct file instead of struct block_device, and it must be opened exclusive.

---

mandatory

->d_revalidate() gets two extra arguments - inode of parent directory and name our dentry is expected to have. Both are stable (dir is pinned in non-RCU case and will stay around during the call in RCU case, and name is guaranteed to stay unchanging). Your instance doesn’t have to use either, but it often helps to avoid a lot of painful boilerplate. Note that while name->name is stable and NUL-terminated, it may (and often will) have name->name[name->len] equal to ‘/’ rather than ‘\0’ - in normal case it points into the pathname being looked up. NOTE: if you need something like full path from the root of filesystem, you are still on your own - this assists with simple cases, but it’s not magic.
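Because the byte after the component may be ‘/’, an instance that compares the passed name against something it has stored should bound the comparison by the length rather than rely on the terminator. A minimal user-space sketch of that idea follows; struct demo_qstr and demo_name_matches() are hypothetical stand-ins, not the kernel’s struct qstr or any VFS helper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for the kernel's struct qstr. */
struct demo_qstr {
	const char *name;	/* may point into the middle of a pathname */
	size_t len;		/* length of this component only */
};

/*
 * Compare the component described by @name with a NUL-terminated string the
 * filesystem has stored. Only name->len bytes of name->name are examined, so
 * it does not matter whether name->name[name->len] is '/' or '\0'.
 */
static bool demo_name_matches(const struct demo_qstr *name, const char *stored)
{
	return strlen(stored) == name->len &&
	       memcmp(name->name, stored, name->len) == 0;
}
```

With name pointing at "dir/file" and a length of 3, a plain strcmp() against "dir" would report a mismatch, while the length-bounded comparison matches.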

---

recommended

kern_path_locked() and user_path_locked() no longer return a negative dentry so this doesn’t need to be checked. If the name cannot be found, ERR_PTR(-ENOENT) is returned.

---

recommended

lookup_one_qstr_excl() is changed to return errors in more cases, so these conditions don’t require explicit checks:

if LOOKUP_CREATE is NOT given, then the dentry won’t be negative, ERR_PTR(-ENOENT) is returned instead
if LOOKUP_EXCL IS given, then the dentry won’t be positive, ERR_PTR(-EEXIST) is returned instead

LOOKUP_EXCL now means “target must not exist”. It can be combined with LOOKUP_CREATE or LOOKUP_RENAME_TARGET.
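The two rules can be read as a small decision function. The sketch below is a user-space model of the outcome only: the DEMO_LOOKUP_* bits are illustrative values (not the kernel’s LOOKUP_* constants) and demo_lookup_result() is a hypothetical name.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag bits only; not the kernel's actual LOOKUP_* values. */
#define DEMO_LOOKUP_CREATE 0x1u
#define DEMO_LOOKUP_EXCL   0x2u

/*
 * Returns 0 when a dentry would be handed back to the caller, or the
 * negative errno the caller would find wrapped in the ERR_PTR().
 */
static int demo_lookup_result(unsigned int flags, bool positive)
{
	if (!positive && !(flags & DEMO_LOOKUP_CREATE))
		return -ENOENT;	/* negative dentry, creation not requested */
	if (positive && (flags & DEMO_LOOKUP_EXCL))
		return -EEXIST;	/* target must not exist */
	return 0;
}
```

A caller that passes both flags therefore only ever gets back a negative dentry, and a caller that passes neither only ever gets back a positive one.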

---

mandatory

invalidate_inodes() is gone; use evict_inodes() instead.

---

mandatory

->mkdir() now returns a dentry. If the created inode is found to already be in cache and have a dentry (often IS_ROOT()), it will need to be spliced into the given name in place of the given dentry. That dentry now needs to be returned. If the original dentry is used, NULL should be returned. Any error should be returned with ERR_PTR().

In general, filesystems which use d_instantiate_new() to install the new inode can safely return NULL. Filesystems which may not have an I_NEW inode should use d_drop();d_splice_alias() and return the result of the latter.

If a positive dentry cannot be returned for some reason, in-kernel clients such as cachefiles, nfsd, smb/server may not perform ideally but will fail-safe.

---

mandatory

lookup_one(), lookup_one_unlocked(), lookup_one_positive_unlocked() now take a qstr instead of a name and len. These, not the “one_len” versions, should be used whenever accessing a filesystem from outside that filesystem, through a mount point - which will have a mnt_idmap.

---

mandatory

Functions try_lookup_one_len(), lookup_one_len(), lookup_one_len_unlocked() and lookup_positive_unlocked() have been renamed to try_lookup_noperm(), lookup_noperm(), lookup_noperm_unlocked(), lookup_noperm_positive_unlocked(). They now take a qstr instead of separate name and length. QSTR() can be used when strlen() is needed for the length.

These functions no longer do any permission checking - they previously checked that the caller has ‘X’ permission on the parent. They must ONLY be used internally by a filesystem on itself when it knows that permissions are irrelevant or in a context where permission checks have already been performed such as after vfs_path_parent_lookup().

---

mandatory

d_hash_and_lookup() is no longer exported or available outside the VFS. Use try_lookup_noperm() instead. This adds name validation and takes arguments in the opposite order but is otherwise identical.

Using try_lookup_noperm() will require linux/namei.h to be included.

---

mandatory

Calling conventions for ->d_automount() have changed; we should not grab an extra reference to new mount - it should be returned with refcount 1.

---

collect_mounts()/drop_collected_mounts()/iterate_mounts() are gone now. Replacement is collect_paths()/drop_collected_path(), with no special iterator needed. Instead of a cloned mount tree, the new interface returns an array of struct path, one for each mount collect_mounts() would’ve created. These struct path point to locations in the caller’s namespace that would be roots of the cloned mounts.

---

mandatory

If your filesystem sets the default dentry_operations, use set_default_d_op() rather than manually setting sb->s_d_op.

---

mandatory

d_set_d_op() is no longer exported (or public, for that matter); if your filesystem really needed that, make use of d_splice_alias_ops() to have them set. Better yet, think hard whether you need different ->d_op for different dentries - if not, just use set_default_d_op() at mount time and be done with that. Currently procfs is the only thing that really needs ->d_op varying between dentries.

---

highly recommended

The file operations mmap() callback is deprecated in favour of mmap_prepare(). This passes a pointer to a vm_area_desc to the callback rather than a VMA, as the VMA at this stage is not yet valid.

The vm_area_desc provides the minimum required information for a filesystem to initialise state upon memory mapping of a file-backed region, and output parameters for the file system to set this state.

In nearly all cases, this is all that is required for a filesystem. However, if a filesystem needs to perform an operation such as pre-population of page tables, then that action can be specified in the vm_area_desc->action field, which can be configured using the mmap_action_*() helpers.

---

mandatory

Several functions are renamed:

kern_path_locked -> start_removing_path
kern_path_create -> start_creating_path
user_path_create -> start_creating_user_path
user_path_locked_at -> start_removing_user_path_at
done_path_create -> end_creating_path

---

mandatory

Calling conventions for vfs_parse_fs_string() have changed; it does not take length anymore (value ? strlen(value) : 0 is used). If you want a different length, use

vfs_parse_fs_qstr(fc, key, &QSTR_LEN(value, len))

instead.
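The length rule quoted above can be written out as a tiny user-space helper, which makes the NULL case explicit. demo_default_value_len() is a hypothetical name used only for this sketch; it is not part of the fs_context API.

```c
#include <stddef.h>
#include <string.h>

/*
 * Length used when the caller no longer passes one: the full string up to
 * its NUL terminator, or 0 when there is no value at all.
 */
static size_t demo_default_value_len(const char *value)
{
	return value ? strlen(value) : 0;
}
```

If the bytes you want to pass are only a prefix of a longer buffer, this default would pick up the whole string, which is the situation where an explicit length is needed.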

---

mandatory

vfs_mkdir() now returns a dentry - the one returned by ->mkdir(). If that dentry is different from the dentry passed in, including if it is an IS_ERR() dentry pointer, the original dentry is dput().

When vfs_mkdir() returns an error, and so both dputs() the original dentry and doesn’t provide a replacement, it also unlocks the parent. Consequently the return value from vfs_mkdir() can be passed to end_creating() and the parent will be unlocked precisely when necessary.

---

mandatory

kill_litter_super() is gone; convert to DCACHE_PERSISTENT use (as all in-tree filesystems have done).
