RE: [RFC PATCH] fs: file freeze support

From: Namjae Jeon
Date: Mon Jan 19 2015 - 08:07:12 EST


> On Thu, Jan 15, 2015 at 08:36:55PM +0900, Namjae Jeon wrote:
> > We introduce per-file freeze feature for unifying defrag ext4 and xfs
> > as first ingredient. We get the idea courtesy of Dave Chinner
> > (https://lkml.org/lkml/2014/7/14/759)
Hi Dave,
>
> /me pages the context back in
>
> > per-file freeze will be used to avoid that file is not modified while
> > userspace is doing the defrag.
> >
> > This patch tries to implement file write freeze functionality.
> > Introduce new inode state, I_WRITE_FREEZED.
>
> I_WRITE_FROZEN.
Okay.
>
> > When this state is set, any process which tries to modify the file's address
> > space, either by pagefault mmap writes or using write(2), will block until
> > the this state is cleared. I_WRITE_FREEZED is set by calling FS_IOC_FWFREEZE
> > ioctl and clear by FS_IOC_FWTHAW ioctl.
> >
> > File write freeze functionality, when used in conjunction with
> > inode's immutable flag can be used for creating truly stable file snapshots
> > wherein write freeze will prevent any modification to the file from already
> > open file descriptors and immutable flag will prevent any new modification
> > to the file. One of the intended uses for stable file snapshots would be in
> > the defragmentation applications which defrags single file.
>
> I don't quite understand why the full filesystem freeze is
> necessary? The thaw occurs immediately after I_WRITE_FREEZED is set,
We started by looking at fs freeze for file freeze implementation,
So got biased for using fs freeze or similar approach.
Thanks for suggesting a better way.

> which means there's nothing that prevent the file from being
> truncated or otherwise modified by fallocate, etc while it is
> frozen....
Right, So, After that, we had also thought of setting immutable
flag of inode. Immutable flag + I_WRITE_FROZEN => truly frozen file.

>
> AFAICT, fsync will bring the file down to a consistent state and
> we've already got freeze hooks for all inode modification
> operations. We also have IO barriers for truncate operations so that
> we can wait for all outstanding IO to complete, so I would have
> thought this covers all bases for an inode freeze. i.e.:
Right.

>
> i_mutex -> I_FROZEN -> fsync -> inode_dio_wait
>
> Should give us a clean inode where there are not ongoing operations
> by the time that inode_dio_wait() completes. All new modification
> operations need to check I_FROZEN in addition to the superblock
> freeze checks...
I checked the routines where checks for I_FROZEN would be required.
Most of them are Ok but do_unlinkat() confuses me a little.
vfs_unlink is called under parent inode's i_mutex, so we cannot sleep
keeping parent's i_mutex held.
i.e while freezing file, all file in directory are blocked by parent
i_mutex. Is it ok to release parnets->mutex before checking for I_FROZEN
or there is some idea?

>
> > For implementation purpose, initially we tried to keep percpu usage counters
> > inside struct inode just like there is struct sb_writers in super_block.
> > But considering that it will significantly bloat up struct inode when actually
> > the usage of file write freeze will be infrequent, we dropped this idea.
> > Instead we have tried to use already present filesystem freezing infrastructure.
> > Current approach makes it possible for implementing file write freeze without
> > bloating any of struct super_block/inode.
> > In FS_IOC_FWFREEZE, we wait for complete fs to be frozen, set I_WRITE_FREEZED to
> > inode's state and unfreeze the fs.
> >
> > TODO : prevent inode eviction when I_WRITE_FREEZED state is set.
>
> The app needs to retain a reference to the inode (i.e. open
> file descriptor) for the length of time that the inode is to be
> frozen. That will prevent reclaim of the inode.
>
> If the app fails to thaw the inode before it drops all references
> to it, then all bets are off - we do not want to give userspace a
> mechanism to pin arbitrarily large amounts of memory....
Okay.

>
> > TODO : xfstests testcase for file freeze.
> >
> > Signed-off-by: Namjae Jeon <namjae.jeon@xxxxxxxxxxx>
> > Signed-off-by: Ashish Sangwan <a.sangwan@xxxxxxxxxxx>
> > ---
> > fs/btrfs/inode.c | 1 +
> > fs/buffer.c | 1 +
> > fs/ext4/inode.c | 1 +
> > fs/f2fs/file.c | 1 +
> > fs/gfs2/file.c | 1 +
> > fs/inode.c | 19 ++++++++++++++++++
> > fs/ioctl.c | 30 +++++++++++++++++++++++++++++
> > fs/nilfs2/file.c | 1 +
> > fs/ocfs2/mmap.c | 1 +
> > fs/super.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/fs.h | 13 +++++++++++++
> > include/uapi/linux/fs.h | 2 ++
> > mm/filemap.c | 1 +
> > 13 files changed, 123 insertions(+)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index e687bb0..357c749 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -8262,6 +8262,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > u64 page_start;
> > u64 page_end;
> >
> > + inode_start_write(inode);
> > sb_start_pagefault(inode->i_sb);
>
> Wouldn't it be better to have:
>
> inode_start_pagefault(inode);
>
> and then have it call sb_start_pagefault() internally?
Right.

>
> > +void inode_start_write(struct inode *inode)
> > +{
> > + struct super_block *sb = inode->i_sb;
> > +
> > +retry:
> > + spin_lock(&inode->i_lock);
> > + if (inode->i_state & I_WRITE_FREEZED) {
> > + DEFINE_WAIT(wait);
> > +
> > + prepare_to_wait(&sb->s_writers.wait_unfrozen, &wait,
> > + TASK_UNINTERRUPTIBLE);
> > + spin_unlock(&inode->i_lock);
> > + schedule();
> > + finish_wait(&sb->s_writers.wait_unfrozen, &wait);
> > + goto retry;
> > + }
> > + spin_unlock(&inode->i_lock);
>
> Please use a while() loop, not goto.
Okay.

>
> spin_lock(&inode->i_lock);
> while(inode->i_state & I_WRITE_FROZEN) {
> DEFINE_WAIT(wait);
>
> prepare_to_wait(&sb->s_writers.wait_unfrozen, &wait,
> TASK_UNINTERRUPTIBLE);
> spin_unlock(&inode->i_lock);
> schedule();
> finish_wait(&sb->s_writers.wait_unfrozen, &wait);
> spin_lock(&inode->i_lock);
> }
> spin_unlock(&inode->i_lock);
>
> > +
> > +int file_write_freeze(struct inode *inode)
> > +{
> > + struct super_block *sb = inode->i_sb;
> > + int ret = 0;
> > +
> > + if (!S_ISREG(inode->i_mode))
> > + return -EINVAL;
> > +
> > + spin_lock(&inode->i_lock);
> > +
> > + if (inode->i_state & I_WRITE_FREEZED)
> > + ret = -EBUSY;
> > + spin_unlock(&inode->i_lock);
> > + if (ret)
> > + return ret;
> > +
> > + ret = freeze_super(sb);
> > + if (ret)
> > + return ret;
> > +
> > + spin_lock(&inode->i_lock);
> > + inode->i_state |= I_WRITE_FREEZED;
> > + smp_wmb();
> > + spin_unlock(&inode->i_lock);
> > +
> > + return thaw_super(sb);
> > +}
> > +EXPORT_SYMBOL(file_write_freeze);
> > +
> > +int file_write_unfreeze(struct inode *inode)
> > +{
> > + struct super_block *sb = inode->i_sb;
> > +
> > + if (!S_ISREG(inode->i_mode))
> > + return -EINVAL;
> > +
> > + spin_lock(&inode->i_lock);
> > +
> > + if (!(inode->i_state & I_WRITE_FREEZED)) {
> > + spin_unlock(&inode->i_lock);
> > + return -EINVAL;
> > + }
> > +
> > + inode->i_state &= ~I_WRITE_FREEZED;
> > + smp_wmb();
> > + wake_up(&sb->s_writers.wait_unfrozen);
> > + spin_unlock(&inode->i_lock);
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(file_write_unfreeze);
>
> These look terrible racy. Freeze can race with other filesystem
> freeze and thaw operations, so if thaw_super() fails you could have
> frozen the inode but you still get -EINVAL for the ioctl. Or you
> could have multiple applications doing file freeze/thaw, and so an
> unfreeze occurs while a freeze if waiting on a thaw, and so the app
> waiting on a freeze might return with an unfrozen inode, even though
> there were no errors....
True.

>
>
> > @@ -2328,6 +2340,7 @@ static inline bool file_start_write_trylock(struct file *file)
> > {
> > if (!S_ISREG(file_inode(file)->i_mode))
> > return true;
> > + inode_start_write(file_inode(file));
> > return __sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, false);
> > }
>
> This needs "trylock" semantics, not blocking semantics.
Okay.

Thanks for your review!!
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/