Re: [PATCH v2 08/14] fs: xfs: iomap: Sub-extent zeroing

From: John Garry
Date: Thu Mar 07 2024 - 07:57:46 EST



if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
return xfs_alert_fsblock_zero(ip, imap);
@@ -134,6 +135,8 @@ xfs_bmbt_to_iomap(
iomap->validity_cookie = sequence_cookie;
iomap->folio_ops = &xfs_iomap_folio_ops;
+ if (extsz > 1)
+ iomap->extent_shift = ffs(extsz) - 1;

iomap->extent_size = mp->m_bsize;
if (xfs_inode_has_force_align(ip))
iomap->extent_size *= ip->i_extsize;

ok, fine



@@ -563,11 +566,19 @@ xfs_iomap_write_unwritten(
xfs_fsize_t i_size;
uint resblks;
int error;
+ xfs_extlen_t extsz = xfs_get_extsz(ip);
trace_xfs_unwritten_convert(ip, offset, count);
- offset_fsb = XFS_B_TO_FSBT(mp, offset);
- count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+ if (extsz > 1) {
+ xfs_extlen_t extsize_bytes = XFS_FSB_TO_B(mp, extsz);
+
+ offset_fsb = XFS_B_TO_FSBT(mp, round_down(offset, extsize_bytes));
+ count_fsb = XFS_B_TO_FSB(mp, round_up(offset + count, extsize_bytes));
+ } else {
+ offset_fsb = XFS_B_TO_FSBT(mp, offset);
+ count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+ }

I don't think this is correct. We should only be converting the
extent when the entire range has had data written to it. If we are
doing unaligned writes, we end up running 3 separate unwritten
conversion transactions - the leading zeroing, the data written and
the trailing zeroing.

Then I missed that in the code.

For sub-FS block conversion, I thought that this was doing the complete FS blocks conversion, including for the head and tail zeros. And now for sub-extent writes, we would be similarly doing the full extent conversion, including head and tail zeros.


This will end up converting the entire range to written when the
leading zeroing completes, exposing stale data until the data and
trailing zeroing completes.

That would not be good.


Concurrent reads (both DIO and buffered) can see this stale data
while the write is in progress, leading to a mechanism where a user
can issue sub-atomic write range IO and concurrent overlapping reads
to read arbitrary stale data from the disk just before it is
overwritten.

I suspect the only way to fix this for sub-force-aligned DIo writes
if for iomap_dio_bio_iter() to chain the zeroing and data bios so
the entire range gets a single completion run on it instead of three
separate sub-aligned extent IO completions. We only need to do this
in the zeroing case - this is already the DIo slow path, so
additional submission overhead is not an issue. It would, however,
reduce completion overhead and latency, as we only need to run a
single extent conversion instead of 3, so chaining the bios on
aligned writes may well be a net win...


ok, I'll check that idea.

Thoughts? Christoph probably needs to weigh in on this one...


ok

Cheers,
John