Re: [GIT PULL v2] timestamp fixes

From: Amir Goldstein
Date: Sun Sep 24 2023 - 04:34:46 EST


On Thu, Sep 21, 2023 at 10:02 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Thu, 2023-09-21 at 11:24 -0700, Linus Torvalds wrote:
> > On Thu, 21 Sept 2023 at 04:21, Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > >
> > > git@xxxxxxxxxxxxxxxxxxx:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-rc3.vfs.ctime.revert
> >
> > So for some reason pr-tracker-bot doesn't seem to have reacted to this
> > pull request, but it's in my tree now.
> >
> > I *do* have one reaction to all of this: now that you have made
> > "i_ctime" be something that cannot be accessed directly (and renamed
> > it to "__i_ctime"), would you mind horribly going all the way, and do
> > the same for i_atime and i_mtime too?
> >
> > The reason I ask is that I *really* despise "struct timespec64" as a type.
> >
> > I despise it inherently, but I despise it even more when you really
> > use it as another type entirely, and are hiding bits in there.
> >
> > I despise it because "tv_sec" obviously needs to be 64-bit, but then
> > "tv_nsec" is this horrible abomination. It's defined as "long", which
> > is all kinds of crazy. It's bogus and historical.
> >
> > And it's wasteful.
> >
> > Now, in the case of i_ctime, you took advantage of that waste by using
> > one (of the potentially 2..34!) unused bits for that
> > "fine-granularity" flag.
> >
> > But even when you do that, there's up to 33 other bits just lying
> > around, wasting space in a very central data structure.
> >
> > So it would actually be much better to explode the 'struct timespec64'
> > into explicit 64-bit seconds field, and an explicit 32-bit field with
> > two bits reserved. And not even next to each other, because they don't
> > pack well in general.
> >
> > So instead of
> >
> > struct timespec64 i_atime;
> > struct timespec64 i_mtime;
> > struct timespec64 __i_ctime;
> >
> > where that last one needs accessors to access, just make them *all*
> > have helper accessors, and make it be
> >
> > u64 i_atime_sec;
> > u64 i_mtime_sec;
> > u64 i_ctime_sec;
> > u32 i_atime_nsec;
> > u32 i_mtime_nsec;
> > u32 i_ctime_nsec;
> >
> > and now that 'struct inode' should shrink by 12 bytes.
> >
>
> I like it.
>
> > Then do this:
> >
> > #define inode_time(x) \
> > (struct timespec64) { x##_sec, x##_nsec }
> >
> > and you can now create a timespec64 by just doing
> >
> > inode_time(inode->i_atime)
> >
> > or something like that (to help create those accessor functions).
> >
> > Now, I agree that 12 bytes in the disaster that is 'struct inode' is
> > likely a drop in the ocean. We have tons of big things in there (ie
> > several list_heads, a whole 'struct address_space' etc etc), so it's
> > only twelve bytes out of hundreds.
> >
> > But dammit, that 'timespec64' really is ugly, and now that you're
> > hiding bits in it and it's no longer *really* a 'timespec64', I feel
> > like it would be better to do it right, and not mis-use a type that
> > has other semantics, and has other problems.
> >
>
>
> We have many, many inodes though, and 12 bytes per adds up!
>
> I'm on board with the idea, but...that's likely to be as big a patch
> series as the ctime overhaul was. In fact, it'll touch a lot of the same
> code. I can take a stab at that in the near future though.
>
> Since we're on the subject...another thing that bothers me with all of
> the timestamp handling is that we don't currently try to mitigate "torn
> reads" across the two different words. It seems like you could fetch a
> tv_sec value and then get a tv_nsec value that represents an entirely
> different timestamp if there are stores between them.
>
> Should we be concerned about this? I suppose we could do something with
> a seqlock, though I'd really want to avoid locking on the write side.

As luck would have it, if my calculations are correct, on x86-64 and with
CONFIG_FS_POSIX_ACL=y, CONFIG_SECURITY=y (as they are on
distro kernels), __i_ctime is exactly on split cache lines and maybe even
worse (?), __i_ctime.tv_nsec and the QUERIED bit are on the same
cache line with i_lock :-/

If we reorder the inode timestamps with i_size, we improve the situation
for this specific and very common configuration. Maybe.

Thanks,
Amir.