Re: [PATCH v2 00/16] block atomic writes

From: John Garry
Date: Wed Dec 13 2023 - 04:32:41 EST


On 12/12/2023 16:32, Christoph Hellwig wrote:
On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote:
Two new fields are added to struct statx - atomic_write_unit_min and
atomic_write_unit_max. For each atomic individual write, the total length
of a write must be a between atomic_write_unit_min and
atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
at a natural offset in the file wrt the write length.

SCSI sd.c and scsi_debug and NVMe kernel support is added.

Some open questions:
- How to make API extensible for when we have no HW support? In that case,
we would prob not have to follow rule of power-of-2 length et al.
As a possible solution, maybe we can say that atomic writes are
supported for the file via statx, but not set unit_min and max values,
and this means that writes need to be just FS block aligned there.
I don't think the power of two length is much of a problem to be
honest, and if we every want to lift it we can still do that easily
by adding a new flag or limit.

ok, but it would be nice to have some idea on what that flag or limit change would be.


What I'm a lot more worried about is how to tell the file system that
allocations are done right for these requirement. There is no way
a user can know that allocations in an existing file are properly
aligned, so atomic writes will just fail on existing files.

I suspect we need an on-disk flag that forces allocations to be
aligned to the atomic write limit, in some ways similar how the
XFS rt flag works. You'd need to set it on an empty file, and all
allocations after that are guaranteed to be properly aligned.

Hmmm... so how is this different to the XFS forcealign feature?

For XFS, I thought that your idea was to always CoW new extents for misaligned extents or writes which spanned multiple extents.


- For block layer, should atomic_write_unit_max be limited by
max_sectors_kb? Currently it is not.
Well. It must be limited to max_hw_sectors to actually work.

Sure, as max_hw_sectors may also be limited by DMA controller max mapping size.

max_sectors is a software limit below that, which with modern hardware
is actually pretty silly and a real performance issue with todays
workloads when people don't tweak it..

Right, so we should limit atomic write queue limits to max_hw_sectors. But people can still tweak max_sectors, and I am inclined to say that atomic_write_unit_max et al should be (dynamically) limited to max_sectors also.


- How to improve requirement that iovecs are PAGE-aligned.
There are 2x issues:
a. We impose this rule to not split BIOs due to virt boundary for
NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
16K/64K pages). Easy solution is to impose requirement that iovecs
are 4K-aligned.
b. We don't enforce this rule for virt boundary == 0, i.e. SCSI
.. we require any device that wants to support atomic writes to not
have that silly limit. For NVMe that would require SGL support
(and some driver changes I've been wanting to make for long where
we always use SGLs for transfers larger than a single PRP if supported)

If we could avoid dealing with a virt boundary, then that would be nice.

Are there any patches yet for the change to always use SGLs for transfers larger than a single PRP?

On a related topic, I am not sure about how - or if we even should - enforce iovec PAGE-alignment or length; rather, the user could just be advised that iovecs must be PAGE-aligned and min PAGE length to achieve atomic_write_unit_max.



- Since debugging torn-writes due to unwanted kernel BIO splitting/merging
would be horrible, should we add some kernel storage stack software
integrity checks?
Yes, I think we'll need asserts in the drivers. At least for NVMe I
will insist on them.

Please see 16/16 in this series.

For SCSI I think the device actually checks
because the atomic writes are a different command anyway, or am I
misunderstanding how SCSI works here?

Right, for WRITE ATOMIC (16) the target will reject a command when it does not respect the device atomic write limits.

Thanks,
John