On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote:
> Two new fields are added to struct statx - atomic_write_unit_min and
> atomic_write_unit_max. For each atomic individual write, the total length
> of a write must be between atomic_write_unit_min and
> atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
> at a natural offset in the file wrt the write length.
>
> SCSI sd.c and scsi_debug and NVMe kernel support is added.
>
> Some open questions:
> - How to make API extensible for when we have no HW support? In that case,
>   we would prob not have to follow rule of power-of-2 length et al.
>   As a possible solution, maybe we can say that atomic writes are
>   supported for the file via statx, but not set unit_min and max values,
>   and this means that writes need to be just FS block aligned there.

I don't think the power of two length is much of a problem to be
honest, and if we ever want to lift it we can still do that easily
by adding a new flag or limit.
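
For reference, the length and offset rules quoted above boil down to
roughly the following check. This is only a user-space sketch:
atomic_write_ok() is a hypothetical helper, not anything in the series,
and unit_min/unit_max stand in for the proposed statx fields.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch series): returns true when a
 * write of `len` bytes at file offset `pos` satisfies the quoted rules.
 * unit_min/unit_max stand in for atomic_write_unit_min/max from statx.
 */
static bool atomic_write_ok(uint64_t pos, uint64_t len,
                            uint32_t unit_min, uint32_t unit_max)
{
    /* The length must be a non-zero power of two. */
    if (len == 0 || (len & (len - 1)) != 0)
        return false;

    /* The length must lie within [unit_min, unit_max], inclusive. */
    if (len < unit_min || len > unit_max)
        return false;

    /* The offset must be naturally aligned to the write length. */
    return (pos & (len - 1)) == 0;
}
```

With unit_min = 4096 and unit_max = 65536, an 8192-byte write at offset
8192 passes, while the same write at offset 4096 fails the natural
alignment rule.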
What I'm a lot more worried about is how to tell the file system that
allocations are done right for these requirements. There is no way
a user can know that allocations in an existing file are properly
aligned, so atomic writes will just fail on existing files.
I suspect we need an on-disk flag that forces allocations to be
aligned to the atomic write limit, in some ways similar to how the
XFS rt flag works. You'd need to set it on an empty file, and all
allocations after that are guaranteed to be properly aligned.
> - For block layer, should atomic_write_unit_max be limited by
>   max_sectors_kb? Currently it is not.

Well. It must be limited to max_hw_sectors to actually work.
max_sectors is a software limit below that, which with modern hardware
is actually pretty silly and a real performance issue with today's
workloads when people don't tweak it.
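
To illustrate the max_hw_sectors point, the cap could look something
like the sketch below. The helper names are made up for this example and
are not the functions used in the series; it assumes 512-byte sectors
and that the resulting limit should stay a power of two.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9  /* 512-byte sectors, as the block layer counts them */

/* Largest power of two that is <= v; returns 0 when v is 0. */
static uint64_t rounddown_pow2(uint64_t v)
{
    uint64_t p = 1;

    if (v == 0)
        return 0;
    while (p <= v / 2)
        p *= 2;
    return p;
}

/*
 * Hypothetical helper: cap the device-reported atomic write maximum
 * (in bytes) by what max_hw_sectors allows, keeping a power of two.
 */
static uint64_t clamp_atomic_unit_max(uint64_t dev_unit_max_bytes,
                                      uint64_t max_hw_sectors)
{
    uint64_t hw_bytes = max_hw_sectors << SECTOR_SHIFT;
    uint64_t limit = dev_unit_max_bytes < hw_bytes ?
                     dev_unit_max_bytes : hw_bytes;

    return rounddown_pow2(limit);
}
```

Note that max_sectors does not appear here at all: only the hardware
limit bounds the result.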
> - How to improve requirement that iovecs are PAGE-aligned.
>   There are 2x issues:
>   a. We impose this rule to not split BIOs due to virt boundary for
>      NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
>      16K/64K pages). Easy solution is to impose requirement that iovecs
>      are 4K-aligned.
>   b. We don't enforce this rule for virt boundary == 0, i.e. SCSI

.. we require any device that wants to support atomic writes to not
have that silly limit. For NVMe that would require SGL support
(and some driver changes I've been wanting to make for long where
we always use SGLs for transfers larger than a single PRP if supported).
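
For comparison, the 4K-aligned iovec rule suggested in (a) would amount
to a check along these lines. Again just a user-space sketch:
iovecs_4k_aligned() and the VIRT_BOUNDARY constant are names invented
for this example, and it assumes both the base address and the length of
every segment must be multiples of 4K.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/uio.h>

/* Assumed NVMe virt boundary of 4K, independent of the CPU page size. */
#define VIRT_BOUNDARY 4096u

/*
 * Hypothetical helper: returns true when every iovec segment starts on
 * a 4K boundary and has a length that is a multiple of 4K, so that no
 * segment forces a split at the virt boundary.
 */
static bool iovecs_4k_aligned(const struct iovec *iov, int iovcnt)
{
    for (int i = 0; i < iovcnt; i++) {
        uintptr_t base = (uintptr_t)iov[i].iov_base;

        if (base & (VIRT_BOUNDARY - 1))
            return false;
        if (iov[i].iov_len & (VIRT_BOUNDARY - 1))
            return false;
    }
    return true;
}
```

A device without a virt boundary would not need this check at all.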
> - Since debugging torn-writes due to unwanted kernel BIO splitting/merging
>   would be horrible, should we add some kernel storage stack software
>   integrity checks?

Yes, I think we'll need asserts in the drivers. At least for NVMe I
will insist on them. For SCSI I think the device actually checks
because the atomic writes are a different command anyway, or am I
misunderstanding how SCSI works here?