Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support

From: John Garry
Date: Thu Feb 15 2024 - 04:54:34 EST

Yes, I was just looking at adding a mkfs option to format for atomic writes,
which would check physical information about the volume and whether it suits
rtextsize and then subsequently also set XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.

FWIW, atomic writes need to be implemented in XFS in a way that
isn't specific to the rtdev. There is no reason they cannot be
applied to the data device (via superblock max atomic write size
field or extent size hints and the align flag) so
please don't get hung up on rtextsize as the only thing that atomic
write alignment might apply to.


Yes, mkfs allows the user to override the hardware configsi it
probes, but it also warns when the override is doing something
sub-optimal (like aligning all AG headers to the same disk in a

IOWs, mkfs should be pulling this atomic write info from the
hardware and configuring the filesysetm around that information.
That's the target we should be aiming the kernel implementation at
and optimising for - a filesystem that is correctly configured
according to published hardware capability.


So, for example, if the atomic writes option is set and the rtextsize set by
the user is so much larger than what HW can support in terms of atomic
writes, then we should let the user know about this.

Well, this is part of the problem I mention above: you're focussing
entirely on the rtdev setup and not the general "atomic writes
require BMBT extent alignment constraints".

I'm really just saying what I was considering based on this series only.

So, maybe, yes, we might want to warn that the rtextsize is much
bigger than the atomic write size of that device, but now there's
something else we need to take into account: the rtdev could have a
different atomic write size comxpapred to the data device.

What now?

IOWs, focussing on the rtdev misses key considerations for making
the functionality generic, and we most definitely don't want to have
to rev the on disk format a second time to support atomic writes
for the data device. Hence we likely need two variables for atomic
write sizes in the superblock - one for the rtdev, and one for the
data device. And this then feeds through to Darrick's multi-rtdev
stuff - each rtdev will need to have an attribute that tracks this


What the patchset does is try to extend and infer things from
existing allocation alignment constraints, but then not apply those
constraints to critical extent state operations (pure BMBT
modifications) that atomic writes also need constrained to work
correctly and efficiently.

Right. Previously I also did mention that we could explicitly request the
atomic write size per-inode, but a drawback is that this would require an
on-disk format change.

We're already having to change the on-disk format for this (inode
flag, superblock feature bit), so we really should be trying to make
this generic and properly featured so that it can be used for
anything that requires physical alignment of file data extents, not
just atomic writes...



Another motivation for this flag is that we can explicitly enable some
software-based atomic write support for an inode when the backing device
does not have HW support.

That's orthogonal to the aligment issue. If the BMBT extents are
always aligned in a way that is compatible with atomic writes, we
don't need and aomtic writes flag to tell the filesystem it should
do an atomic write.

Any instruction to do an atomic write should be encoded in the userspace write operation. Or maybe the file open operation in future - I still get questions about O_ATOMIC.

That comes from userspace via the IOCB_ATOMIC

It is the IOCB_ATOMIC that triggers the software fallback when the
hardware doesn't support atomic writes, not an inode flag.

To me, any such fallback seems like something which we should be explicitly enabling.

All the
filesystem has to do is guarantee all extent manipulations are
correctly aligned and IOCB_ATOMIC can always be executed regardless
of whether it is hardware or software that does it.

In addition, in this series setting FS_XFLAG_ATOMICWRITES means
XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something
similar for other OSes, and for those other OSes it may also mean some other
special alignment feature enabled. We want a consistent user experience.

I don't care about other OSes - they don't implement the
FS_IOC_FSSETXATTR interfaces, so we just don't care about cross-OS
compatibility for the user API.

Other FSes need to support FS_IOC_FSSETXATTR for atomic writes. Or at least should support it....

Fundamentally, atomic writes are *not a property of the filesystem
on-disk format*. They require extent tracking constraints (i.e.
alignment), and that's the property of the filesystem on-disk format
that we need to manipulate here.

Users are not going to care if the setup ioctl for atomic writes
already know they have to align their IO properly for atomic writes,
so it's not like this is something they will be completely
unfamiliar with.

Indeed, FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/ fsx.fsx_extsize
= max_atomic_write_size as a user interface to set up the inode for
atomic writes is pretty concise and easy to use. It also isn't
specific to atomic writes, and so this fucntionality can be used for
anything that needs aligned extent manipulation for performant

This is where there seems to be a difference between how you would like atomic writes to be enabled and how Christoph would, judging by this:

I think that if the FS and the userspace util can between them figure out what to do, then that is ok. This is something like what I proposed previously:

xfs_io -c "atomic-writes 64K" mnt/file

where the userspace util would use for the FS_IOC_FSSETXATTR ioctl:


There is a question on the purpose of the FS_XFLAG_ATOMIC_WRITES flag, but, to me, it does seem useful for future feature support.

We could just have FS_XFLAG_ATOMIC_WRITES | FS_XFLAG_EXTSIZE w/ fsx.fsx_extsize, and the kernel auto-enables FS_XFLAG_ALIGN_EXTENTS, but the other way seems better

to behave in a particular way - forced alignment - not that atomic
writes are being used in the filesystem....

At this point, the filesystem can do all the extent modification
alignment stuff that atomic writes require without caring if the
block device supports atomic writes or even if the
application is using atomic writes.

This means we can test the BMBT functionality in fstests without
requiring hardware (or emulation) that supports atomic writes - all
we need to do is set the forced align flag, an extent size hint and
go run fsx on it...

The current idea was that the forcealign feature would be required for the
second phase for atomic write support - non-rtvol support. Darrick did send
that series out separately over the New Year's break.

This is the wrong approach: this needs to be integrated into the
same patchset so we can review the changes for atomic writes as a
whole, not as two separate, independent on-disk format changes. The
on-disk format change that atomic writes need is aligned BMBT extent
manipulations, regardless of whether we are targeting the rtdev or
the datadev, and trying to separate them is leading you down
entirely the wrong path.

ok, fine.

BTW, going back to the original discussion on the extent zeroing and your idea to do this in the iomap dio path, my impression is that we require changes like this:

- in iomap_dio_bio_iter(), we need to zero out the extent according to extsize (and not just FS block size)
- xfs_dio_write_end_io() -> xfs_iomap_write_unwritten() also needs to consider this extent being written, and not assume a FS block
- per-inode locking similar to what is done in xfs_file_dio_write_unaligned() - I need to check that further, though, as I am not yet sure on how we decide to use this exclusive lock or not.

Does this sound sane?