[PATCH 2/2] xfs: update dir3 leaf block metadata after swap

From: Jiachen Zhang
Date: Tue Nov 28 2023 - 04:40:10 EST


On 2023/11/28 16:39, Christoph Hellwig wrote:
On Tue, Nov 28, 2023 at 01:32:02PM +0800, Jiachen Zhang wrote:
From: Zhang Tianci <zhangtianci.1997@xxxxxxxxxxxxx>

xfs_da3_swap_lastblock() copy the last block content to the dead block,
but do not update the metadata in it. We need update some metadata
for some kinds of type block, such as dir3 leafn block records its
blkno, we shall update it to the dead block blkno. Otherwise,
before write the xfs_buf to disk, the verify_write() will fail in
blk_hdr->blkno != xfs_buf->b_bn, then xfs will be shutdown.

Do you have a reproducer for this? It would be very helpful to add it
to xfstests.

Hi Christoph,

Thanks for the review!

It's hard to reproduce the issue. Currently we can reproduce it with
some kernel code changes. We forcely reserve 0 t_blk_res for xfs_remove
on kernel version 4.19:

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f2d06e1e4906..c8f84b95a0ec 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2551,13 +2551,8 @@ xfs_remove(
* insert tries to happen, instead trimming the LAST
* block from the directory.
*/
- resblks = XFS_REMOVE_SPACE_RES(mp);
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_remove, resblks, 0, 0, &tp);
- if (error == -ENOSPC) {
- resblks = 0;
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_remove, 0, 0, 0,
- &tp);
- }
+ resblks = 0;
+ error = xfs_trans_alloc(mp, &M_RES(mp)->tr_remove, 0, 0, 0, &tp);
if (error) {
ASSERT(error != -ENOSPC);
goto std_return


After insmod the new modified xfs.ko, run the following scripts, and it
can reproduce the problem consistently on the final `umount mnt`:

fallocate -l 1G xfs.img
mkfs.xfs -f xfs.img
mkdir -p mnt
losetup /dev/loop0 xfs.img
mount -t xfs /dev/loop0 mnt
pushd mnt
mkdir dir3
prefix="a_"
for j in $(seq 0 13); do
for i in $(seq 0 2800); do
touch dir3/${prefix}_${i}_${j}
done
for i in $(seq 0 2500); do
rm -f dir3/${prefix}_${i}_${j}
if [ "$i" == "2094" ] && [ "$j" == "13" ]; then
echo "should reproduce now, so break here!"
break;
fi
done
done
popd
umount mnt


We are still trying to make a reproducer without any kernel changes. Do
you have any suggestions on this?




We will get this warning:

XFS (dm-0): Metadata corruption detected at xfs_dir3_leaf_verify+0xa8/0xe0 [xfs], xfs_dir3_leafn block 0x178
XFS (dm-0): Unmount and run xfs_repair
XFS (dm-0): First 128 bytes of corrupted metadata buffer:
00000000e80f1917: 00 80 00 0b 00 80 00 07 3d ff 00 00 00 00 00 00 ........=.......
000000009604c005: 00 00 00 00 00 00 01 a0 00 00 00 00 00 00 00 00 ................
000000006b6fb2bf: e4 44 e3 97 b5 64 44 41 8b 84 60 0e 50 43 d9 bf .D...dDA..`.PC..
00000000678978a2: 00 00 00 00 00 00 00 83 01 73 00 93 00 00 00 00 .........s......
00000000b28b247c: 99 29 1d 38 00 00 00 00 99 29 1d 40 00 00 00 00 .).8.....).@....
000000002b2a662c: 99 29 1d 48 00 00 00 00 99 49 11 00 00 00 00 00 .).H.....I......
00000000ea2ffbb8: 99 49 11 08 00 00 45 25 99 49 11 10 00 00 48 fe .I....E%.I....H.
0000000069e86440: 99 49 11 18 00 00 4c 6b 99 49 11 20 00 00 4d 97 .I....Lk.I. ..M.
XFS (dm-0): xfs_do_force_shutdown(0x8) called from line 1423 of file fs/xfs/xfs_buf.c. Return address = 00000000c0ff63c1
XFS (dm-0): Corruption of in-memory data detected. Shutting down filesystem
XFS (dm-0): Please umount the filesystem and rectify the problem(s)

>From the log above, we know xfs_buf->b_no is 0x178, but the block's hdr record
its blkno is 0x1a0.

Fixes: 24df33b45ecf ("xfs: add CRC checking to dir2 leaf blocks")
Signed-off-by: Zhang Tianci <zhangtianci.1997@xxxxxxxxxxxxx>
---
fs/xfs/libxfs/xfs_da_btree.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index e576560b46e9..35f70e4c6447 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2318,8 +2318,18 @@ xfs_da3_swap_lastblock(
* Copy the last block into the dead buffer and log it.
*/
memcpy(dead_buf->b_addr, last_buf->b_addr, args->geo->blksize);
- xfs_trans_log_buf(tp, dead_buf, 0, args->geo->blksize - 1);
dead_info = dead_buf->b_addr;
+ /*
+ * Update the moved block's blkno if it's a dir3 leaf block
+ */
+ if (dead_info->magic == cpu_to_be16(XFS_DIR3_LEAF1_MAGIC) ||
+ dead_info->magic == cpu_to_be16(XFS_DIR3_LEAFN_MAGIC) ||
+ dead_info->magic == cpu_to_be16(XFS_ATTR3_LEAF_MAGIC)) {
+ struct xfs_da3_blkinfo *dap = (struct xfs_da3_blkinfo *)dead_info;
+
+ dap->blkno = cpu_to_be64(dead_buf->b_bn);
+ }
+ xfs_trans_log_buf(tp, dead_buf, 0, args->geo->blksize - 1);

The fix here looks correct to me, but also a little ugly and ad-hoc.

At last we should be using container_of and not casts for getting from a
xfs_da_blkinfo to a xfs_da3_blkinfo (even if there is bad precedence
for the cast in existing code).


Thanks, we will optimize the code in the next version of the patchset.

But I think it would be useful to add a helper that stamps in the blkno
in for a caller that only has as xfs_da_blkinfo but no xfs_da3_blkinfo
and use in all the places that do it currently in an open coded fashion
e.g. xfs_da3_root_join, xfs_da3_root_split, xfs_attr3_leaf_to_node.

That should probably be done on top of the small backportable fix.


I think the idea to add helper is great, and we can do it after this
fixes patch is merged.


Thanks,
Jiachen