Re: [RFC] freeing unliked file indefinitely delayed

From: Ben Myers
Date: Wed Jul 08 2015 - 11:51:09 EST


Hey Al,

On Wed, Jul 08, 2015 at 02:42:38AM +0100, Al Viro wrote:
> Normally opening a file, unlinking it and then closing will have
> the inode freed upon close() (provided that it's not otherwise busy and
> has no remaining links, of course). However, there's one case where that
> does *not* happen. Namely, if you open it by fhandle with cold dcache,
> then unlink() and close().
>
> In normal case you get d_delete() in unlink(2) notice that dentry
> is busy and unhash it; on the final dput() it will be forcibly evicted from
> dcache, triggering iput() and inode removal. In this case, though, we end
> up with *two* dentries - disconnected (created by open-by-fhandle) and
> regular one (used by unlink()). The latter will have its reference to inode
> dropped just fine, but the former will not - it's considered hashed (it
> is on the ->s_anon list), so it will stay around until the memory pressure
> will finally do it in. As the result, we have the final iput() delayed
> indefinitely. It's trivial to reproduce -
>
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
>
> void flush_dcache(void)
> {
> system("mount -o remount,rw /");
> }
>
> static char buf[20 * 1024 * 1024];
>
> main()
> {
> int fd;
> union {
> struct file_handle f;
> char buf[MAX_HANDLE_SZ];
> } x;
> int m;
>
> x.f.handle_bytes = sizeof(x);
> chdir("/root");
> mkdir("foo", 0700);
> fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
> close(fd);
> name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
> flush_dcache();
> fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
> unlink("foo/bar");
> write(fd, buf, sizeof(buf));
> system("df ."); /* 20Mb eaten */
> close(fd);
> system("df ."); /* should've freed those 20Mb */
> flush_dcache();
> system("df ."); /* should be the same as #2 */
> }
>
> will spit out something like
> Filesystem 1K-blocks Used Available Use% Mounted on
> /dev/root 322023 303843 1131 100% /
> Filesystem 1K-blocks Used Available Use% Mounted on
> /dev/root 322023 303843 1131 100% /
> Filesystem 1K-blocks Used Available Use% Mounted on
> /dev/root 322023 283282 21692 93% /
> - inode gets freed only when dentry is finally evicted (here we trigger
> than by remount; normally it would've happened in response to memory
> pressure hell knows when).
>
> IMO it's a bug. Between the close() and final flush_dcache() the file has
> no surviving links, is *not* busy, won't show up in fuser/lsof/whatnot
> output, and yet it's still not freed. I'm not saying that it's realistically
> exploitable (albeit with nfsd it might be), but it's a very unpleasant
> self-LART.
>
> FWIW, my prefered fix would be simply to have the final dput() treat
> disconnected dentries as "kill on sight"; checking for i_nlink of the
> inode, as Bruce suggested several years ago, will *not* work, simply
> because having another link to that file and unlinking it after close
> will reproduce the situation - disconnected dentry sticks around in
> dcache past its final dput() and past the last unlink() of our file.
> Theoretically it might cause an overhead for nfsd (no_subtree_check v3
> export might see more d_alloc()/d_free(); icache retention will still
> prevent constant rereading the inode from disk). _IF_ that proves to
> be noticable, we might need to come up with something more elaborate
> (e.g. have unlink() and rename() kick disconnected aliases out if the link
> count has reached 0), but it's more complex and needs careful ananlysis
> of correctness - we need to prove that there's no way to miss the link
> count reaching 0.

The bug rings a bell for me so I will stick my neck out instead of
lurking. Don't you need to sample that link count under the filesystems
internal lock in order to avoid an unlink/iget race? I suggest creating
a helper to prune disconnected dentries which a filesystem could call in
.unlink. That would avoid the risk of unintended side effects with the
d_alloc/d_free/icache approach and have provable link count correctness.

Thanks,
Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/