Re: [PATCH 3/4] vfs: count unlinked inodes

From: Al Viro
Date: Sat Dec 17 2011 - 02:36:34 EST


On Mon, Nov 21, 2011 at 12:11:32PM +0100, Miklos Szeredi wrote:
> @@ -241,6 +242,11 @@ void __destroy_inode(struct inode *inode)
> BUG_ON(inode_has_buffers(inode));
> security_inode_free(inode);
> fsnotify_inode_delete(inode);
> + if (!inode->i_nlink) {
> + WARN_ON(atomic_long_read(&inode->i_sb->s_remove_count) == 0);
> + atomic_long_dec(&inode->i_sb->s_remove_count);
> + }

Umm... That relies on ->destroy_inode() doing nothing stupid; granted,
all work on actual file removal should've been done in ->evice_inode()
leaving only (RCU'd) freeing of in-core, but there are odd ones that
do strange things in ->destroy_inode() and I'm not sure that it's not
a Yet Another Remount Race(tm). OTOH, it's clearly not worse than what
we used to have; just something to keep in mind for future work.

Anyway, I'm mostly OK with that series; I still hate your per-superblock
list of vfsmounts, but at least on top of the vfsmount-guts series they
won't be a temptation for abuse - list goes through struct mount now,
so filesystems won't be able to do fun things like "iterate through all
places where I'm mounted" (and #include "../mounts.h" in any fs code
will be a shootable offense - at least that is easy to spot).

There is another thing I'm less than happy about - suppose you have a
corrupted fs and run into zero on-disk i_nlink. Sure, the inode will
get immediately evicted and __destroy_inode() will happen; however, for
the duration of that window you end up with bumped ->s_remove_count.
Transient EROFS is annoying, but tolerable - we only hit it if attempt
to remount r/o fails in ->remount_fs(). But this is something different -
it's a transient -EBUSY on attempt to remount r/o happening when nothing
actually is trying to do any kind of write access at all. As it is,
you have ->s_remove_count equal to the number of in-core inodes with
zero ->i_nlink that had not yet reached destroy_inode(). Hell knows...
Maybe we want two versions of set_nlink(); one doing what yours does,
another returning -EINVAL if asked to set i_nlink to 0. And assorted
foo_read_inode() would use the latter. Anyway, that's a separate work;
so's the analysis of what happens if directory entry points to on-disk
inode with zero i_nlink.

Applied, with rebase on top of vfsmount-guts. Will push the whole pile
into #for-next as soon as I finish sorting out conflicts in btrfs patches
versus btrfs tree.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/