Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland

From: Jan Kara
Date: Tue Oct 18 2022 - 11:17:32 EST


On Tue 18-10-22 10:21:08, Jeff Layton wrote:
> On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote:
> > On Tue 18-10-22 06:35:14, Jeff Layton wrote:
> > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > > > > Trond is of the opinion that monotonicity is a hard requirement, and
> > > > > that we should not allow filesystems that can't provide that quality to
> > > > > report STATX_VERSION at all. His rationale is that one of the main uses
> > > > > for this is for backup applications, and for those a counter that could
> > > > > go backward is worse than useless.
> > > >
> > > > From the perspective of a backup program doing incremental backups,
> > > > an inode with a change counter that has a different value to the
> > > > current backup inventory means the file contains different
> > > > information than what the current backup inventory holds. Again,
> > > > snapshots, rollbacks, etc.
> > > >
> > > > Therefore, regardless of whether the change counter has gone
> > > > forwards or backwards, the backup program needs to back up this
> > > > current version of the file in this backup because it is different
> > > > to the inventory copy. Hence if the backup program fails to back it
> > > > up, it will not be creating an exact backup of the user's data at
> > > > the point in time the backup is run...
> > > >
> > > > Hence I don't see that MONOTONIC is a requirement for backup
> > > > programs - they really do have to be able to handle filesystems that
> > > > have modifications that move backwards in time as well as forwards...
> > >
> > > Rolling backward is not a problem in and of itself. The big issue is
> > > that after a crash, we can end up with a change attr seen before the
> > > crash that is now associated with a completely different inode state.
> > >
> > > The scenario is something like:
> > >
> > > - Change attr for an empty file starts at 1
> > >
> > > - Write "A" to file, change attr goes to 2
> > >
> > > - Read and statx happens (client sees "A" with change attr 2)
> > >
> > > - Crash (before last change is logged to disk)
> > >
> > > - Machine reboots, inode is empty, change attr back to 1
> > >
> > > - Write "B" to file, change attr goes to 2
> > >
> > > - Client stat's file, sees change attr 2 and assumes its cache is
> > > correct when it isn't (should be "B" not "A" now).
> > >
> > > The real danger comes not from the thing going backward, but the fact
> > > that it can march forward again after going backward, and then the
> > > client can see two different inode states associated with the same
> > > change attr value. Jumping all the change attributes forward by a
> > > significant amount after a crash should avoid this issue.
> >
> > As Dave pointed out, the problem with change attr having the same value for
> > a different inode state (after going backwards) holds not only for the
> > crashes but also for restore from backups, fs snapshots, device snapshots
> > etc. So relying on change attr only looks a bit fragile. It works for the
> > common case but the edge cases are awkward and there's no easy way to
> > detect you are in the edge case.
> >
>
> This is true. In fact in the snapshot case you can't even rely on doing
> anything at reboot since you won't necessarily need to reboot to make it
> roll backward.
>
> Whether that obviates the use of this value altogether, I'm not sure.
>
> > So I think any implementation caring about data integrity would have to
> > include something like ctime into the picture anyway. Or we could just
> > completely give up any idea of monotonicity and on each mount select random
> > prime P < 2^64 and instead of doing inc when advancing the change
> > attribute, we'd advance it by P. That makes collisions after restore /
> > crash fairly unlikely.
>
> Part of the goal (at least for NFS) is to avoid unnecessary cache
> invalidations.
>
> If we just increment it by a particular offset on every reboot, then
> every time the server reboots, the clients will invalidate all of their
> cached inodes, and proceed to hammer the server with READ calls just as
> it's having to populate its own caches from disk.

Note that I didn't propose incrementing by an offset on every reboot or
mount. I proposed that inode_maybe_inc_iversion() would not increment
i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P
<< I_VERSION_QUERIED_SHIFT), where P is a suitable number randomly selected
on filesystem mount.

This will not cause cache invalidation after a clean unmount + remount. It
will cause cache invalidation after a crash, snapshot rollback etc., but only
for inodes where i_version changed. If P is suitably selected (e.g. as a
prime), then the chances of collisions (even after a snapshot rollback) are
very low (on the order of 2^(-50) if my back-of-the-envelope calculations
are right).

So this should nicely deal with all the problems we've spotted so far. But
I may be missing something...

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR