Re: [PATCH v8 1/5] fs: add infrastructure for multigrain timestamps

From: Kent Overstreet
Date: Fri Sep 22 2023 - 13:32:37 EST


On Fri, Sep 22, 2023 at 01:14:40PM -0400, Jeff Layton wrote:
> The VFS always uses coarse-grained timestamps when updating the ctime
> and mtime after a change. This has the benefit of allowing filesystems
> to optimize away a lot metadata updates, down to around 1 per jiffy,
> even when a file is under heavy writes.
>
> Unfortunately, this has always been an issue when we're exporting via
> NFS, which traditionally relied on timestamps to validate caches. A lot
> of changes can happen in a jiffy, and that can lead to cache-coherency
> issues between hosts.
>
> NFSv4 added a dedicated change attribute that must change value after
> any change to an inode. Some filesystems (btrfs, ext4 and tmpfs) utilize
> the i_version field for this, but the NFSv4 spec allows a server to
> generate this value from the inode's ctime.
>
> What we need is a way to only use fine-grained timestamps when they are
> being actively queried.
>
> POSIX generally mandates that when the the mtime changes, the ctime must
> also change. The kernel always stores normalized ctime values, so only
> the first 30 bits of the tv_nsec field are ever used.
>
> Use the 31st bit of the ctime tv_nsec field to indicate that something
> has queried the inode for the mtime or ctime. When this flag is set,
> on the next mtime or ctime update, the kernel will fetch a fine-grained
> timestamp instead of the usual coarse-grained one.
>
> Filesytems can opt into this behavior by setting the FS_MGTIME flag in
> the fstype. Filesystems that don't set this flag will continue to use
> coarse-grained timestamps.

Interesting...

So in bcachefs, for most inode fields the btree inode is the "master
copy"; we do inode updates via btree transactions, and then on
successful transaction commit we update the VFS inode to match.

(exceptions: i_size, i_blocks)

I'd been contemplating switching to that model for timestamp updates as
well, since that would allow us to get rid of our
super_operations.write_inode method - except we probably wouldn't want
to do that since it would likely make timestamp updates too expensive.

And now with your scheme of stashing extra state in timespec, I'm glad
we didn't.

Still, timestamp updates are a bit messier than I'd like, would be
lovely to figure out a way to clean that up - right now we have an
awkward mix of "sometimes timestamp updates happen in a btree
transaction first, other times just the VFS inode is updated and marked
dirty".

xfs doesn't have .write_inode, so it's probably time to study what it
does...