Re: [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems

From: Steven Whitehouse
Date: Tue Jan 04 2011 - 09:22:11 EST


Hi,

On Mon, 2011-01-03 at 11:58 -0500, Christoph Hellwig wrote:
> On Mon, Jan 03, 2011 at 03:03:29PM +0000, Steven Whitehouse wrote:
> >
> > - With "journaled data" files
> >   - Do a log flush conditional upon the inode's glock
> >   - The core code then writes back any dirty pages
>
> Any data writeback is done before calling ->fsync.
>
> > - With regular files/directories
> >   - If datasync is not set, we need to write back the metadata including
> >     timestamp updates, so that is done via ->write_inode. Note that an extra
> >     complication here is that we need to get the glock on the inode if we
> >     don't already have it in order to check and conditionally update the
> >     atime. The call to ->write_inode includes an implicit (conditional) log
> >     flush.
> >   - If datasync is set, we assume that only the data pages need to be
> >     written out. My understanding of datasync was that it was only supposed
> >     to write out data and never any of the metadata. The reason for the call
> >     to flush the log for "stuffed" files is that the data shares a disk
> >     block with the inode metadata, so we cannot avoid the log flush in this
> >     case, since we must unpin the block to write it back.
>
> What happens to indirect blocks, inode size updates, etc? In general
> the only correct form to use the datasync argument is along the lines
> of:
>
> 	if ((inode->i_state & I_DIRTY_DATASYNC) ||
> 	    ((inode->i_state & I_DIRTY_SYNC) && !datasync)) {
> 		/* write out the inode */
> 	} else {
> 		/*
> 		 * VFS inode not dirty, no need to write it out.
> 		 *
> 		 * If the filesystem supports asynchronous inode writes,
> 		 * we may have to wait for them here.
> 		 */
> 	}
>
> or rather mostly correct, as pointed out by Nick in this series, that's
> why the above gets replaced with an equivalent check that also
> participates in the writeback locking protocol in this series.
>
Yes, that looks much better than what we have at the moment.
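
To make the rule concrete, the check can be pulled out into a small
predicate. This is only a userspace sketch of the condition quoted above:
inode_needs_write() is a made-up helper name, and the flag values are
stand-ins for this example rather than something copied from the kernel
headers.

```c
#include <stdbool.h>

/*
 * Stand-in flag bits for this sketch. The names mirror the VFS inode
 * state flags discussed above; the values are illustrative assumptions.
 */
#define I_DIRTY_SYNC     (1UL << 0) /* inode dirty, e.g. timestamp updates */
#define I_DIRTY_DATASYNC (1UL << 1) /* metadata needed to read the data back is dirty */

/*
 * Hypothetical helper: returns true when fsync()/fdatasync() has to
 * write out the inode.
 *
 * - I_DIRTY_DATASYNC always forces a write, even for fdatasync().
 * - I_DIRTY_SYNC alone forces a write only for a full fsync().
 */
static bool inode_needs_write(unsigned long i_state, int datasync)
{
	if (i_state & I_DIRTY_DATASYNC)
		return true;
	if ((i_state & I_DIRTY_SYNC) && !datasync)
		return true;
	return false;
}
```

So a size or indirect block change (I_DIRTY_DATASYNC) is written even with
datasync set, while a timestamp-only change (I_DIRTY_SYNC) is skipped for
fdatasync() and written for fsync().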

> For gfs2 on current mainline an fsync respecting that would look like:
>
> static int gfs2_fsync(struct file *file, int datasync)
> {
> 	struct inode *inode = file->f_mapping->host;
> 	struct gfs2_inode *ip = GFS2_I(inode);
> 	int ret = 0;
>
> 	if (gfs2_is_jdata(ip)) {
> 		gfs2_log_flush(GFS2_SB(inode), ip->i_gl);
> 		return 0;
> 	}
>
> 	if ((inode->i_state & I_DIRTY_DATASYNC) ||
> 	    ((inode->i_state & I_DIRTY_SYNC) && !datasync))
> 		ret = sync_inode_metadata(inode, 1);
> 	else if (gfs2_is_stuffed(ip))
> 		gfs2_log_flush(GFS2_SB(inode), ip->i_gl);
>
> 	return ret;
> }
>
> Note that the asynchronous write_inode_now is replaced with a
> sync_inode_metadata, which doesn't incorrectly write data again, and
> makes sure we do a synchronous write.
>
> I'm still not quite sure how the gfs2_log_flush calls are supposed to work.
> What's the reason we don't need the ->write_inode call for journaled
> data mode? Also, is it guaranteed that we cannot have an asynchronous
> transaction that updates the inode in the log? In other words, why doesn't
> gfs2 need some sort of log flush even if the VFS inode is not dirty, unlike
> most other journaled filesystems?
>

I think that has just been missed due to the way in which the code has
developed. It appears to be needed to me, but originally all the
timestamp updates were handled internally by GFS2 and in a synchronous
manner, so that there was no need for ->write_inode() in that case. I
think that needs to be added now the vfs looks after atime updates
though, in order to be correct.

After the log flush there should also be a write on the metadata mapping,
as in inode_go_sync(), which is very similar (but not quite similar enough
to share the same code, I think).
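
As a rough illustration of the ordering I have in mind, here is a small
userspace model. It is a sketch under my own assumptions, not GFS2 code:
struct model_inode, struct trace, model_fsync() and the MODEL_* flag values
are all invented for the example, and it only records which steps would run
and in what order. Whether a log flush is also needed when the VFS inode is
clean is left open here; the model only covers the cases mentioned above.

```c
#include <stddef.h>

/* Stand-in dirty flags for the model (illustrative values only). */
#define MODEL_DIRTY_SYNC     (1UL << 0)
#define MODEL_DIRTY_DATASYNC (1UL << 1)

/* The steps whose order we want to look at. */
enum step {
	STEP_WRITE_INODE,        /* stands in for ->write_inode */
	STEP_LOG_FLUSH,          /* stands in for the log flush */
	STEP_WRITE_META_MAPPING, /* stands in for the metadata mapping write */
};

#define MAX_STEPS 4

struct trace {
	enum step steps[MAX_STEPS];
	size_t n;
};

/* Minimal stand-in for the inode state the decision depends on. */
struct model_inode {
	int jdata;              /* "journaled data" file */
	int stuffed;            /* data shares a block with the inode */
	unsigned long i_state;  /* MODEL_DIRTY_* bits */
};

static void record(struct trace *t, enum step s)
{
	if (t->n < MAX_STEPS)
		t->steps[t->n++] = s;
}

/*
 * Model of the proposed ordering:
 *   1. write the inode if the dirty flags require it (jdata included),
 *   2. flush the log if the inode was written, or the file is jdata
 *      or stuffed,
 *   3. follow the log flush with a write on the metadata mapping.
 */
static void model_fsync(const struct model_inode *ip, int datasync,
			struct trace *t)
{
	int dirty = (ip->i_state & MODEL_DIRTY_DATASYNC) ||
		    ((ip->i_state & MODEL_DIRTY_SYNC) && !datasync);

	if (dirty)
		record(t, STEP_WRITE_INODE);

	if (dirty || ip->jdata || ip->stuffed) {
		record(t, STEP_LOG_FLUSH);
		record(t, STEP_WRITE_META_MAPPING);
	}
}
```

With this model a jdata file with a pending atime update gets all three
steps on fsync(), whereas a clean stuffed file gets only the log flush
followed by the metadata mapping write.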

Steve.

