Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

From: ruansy.fnst@xxxxxxxxxxx
Date: Sun Oct 23 2022 - 23:22:02 EST


在 2022/10/24 6:00, Dave Chinner 写道:
> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
>> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
>>> In addition, I don't like your idea about the test change because it will
>>> make generic/470 become the special test for XFS. Do you know if we can fix
>>> the issue by changing the test in another way? blkdiscard -z can fix the
>>> issue because it does zero-fill rather than discard on the block device.
>>> However, blkdiscard -z will take a lot of time when the block device is
>>> large.
>>
>> Well we /could/ just do that too, but that will suck if you have 2TB of
>> pmem. ;)
>>
>> Maybe as an alternative path we could just create a very small
>> filesystem on the pmem and then blkdiscard -z it?
>>
>> That said -- does persistent memory actually have a future? Intel
>> scuttled the entire Optane product, cxl.mem sounds like expansion
>> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
>> asserts everywhere) and 6.1 (every time I run fstests now I see massive
>> data corruption).
>
> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
> on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
> don't think that has changed at all - I still see lots of kernel
> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
> argument" errors.

Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is
caused by the restrictions which prevent reflink work together with DAX:

a. fs/xfs/xfs_ioctl.c:1141
/* Don't allow us to set DAX mode for a reflinked file for now. */
if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
return -EINVAL;

b. fs/xfs/xfs_iops.c:1174
/* Only supported on non-reflinked files. */
if (xfs_is_reflink_inode(ip))
return false;

These restrictions were removed in "drop experimental warning" patch[1].
I think they should be separated from that patch.

[1]
https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@xxxxxxxxxxx/


Secondly, how the data corruption happened? Or which case failed? Could
you give me more info (such as mkfs options, xfstests configs)?

>
> If I turn off reflink, then instead of data corruption I get kernel
> warnings like this from fsx and fsstress workloads:
>
> [415478.558426] ------------[ cut here ]------------
> [415478.560548] WARNING: CPU: 12 PID: 1515260 at fs/dax.c:380 dax_insert_entry+0x2a5/0x320
> [415478.564028] Modules linked in:
> [415478.565488] CPU: 12 PID: 1515260 Comm: fsx Tainted: G W 6.1.0-rc1-dgc+ #1615
> [415478.569221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> [415478.572876] RIP: 0010:dax_insert_entry+0x2a5/0x320
> [415478.574980] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 da ee a7 00 eb a4 48 81 e6
> [415478.582740] RSP: 0000:ffffc90002867b70 EFLAGS: 00010002
> [415478.584730] RAX: ffffea000f0d0800 RBX: 0000000000000001 RCX: 0000000000000001
> [415478.587487] RDX: ffffea0000000000 RSI: 000000000000003a RDI: ffffea000f0d0840
> [415478.590122] RBP: 0000000000000011 R08: 0000000000000000 R09: 0000000000000000
> [415478.592380] R10: ffff888800dc9c18 R11: 0000000000000001 R12: ffffc90002867c58
> [415478.594865] R13: ffff888800dc9c18 R14: ffffc90002867e18 R15: 0000000000000000
> [415478.596983] FS: 00007fd719fa2b80(0000) GS:ffff88883ec00000(0000) knlGS:0000000000000000
> [415478.599364] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [415478.600905] CR2: 00007fd71a1ad640 CR3: 00000005cf241006 CR4: 0000000000060ee0
> [415478.602883] Call Trace:
> [415478.603598] <TASK>
> [415478.604229] dax_fault_iter+0x240/0x600
> [415478.605410] dax_iomap_pte_fault+0x19c/0x3d0
> [415478.606706] __xfs_filemap_fault+0x1dd/0x2b0
> [415478.607744] __do_fault+0x2e/0x1d0
> [415478.608587] __handle_mm_fault+0xcec/0x17b0
> [415478.609593] handle_mm_fault+0xd0/0x2a0
> [415478.610517] exc_page_fault+0x1d9/0x810
> [415478.611398] asm_exc_page_fault+0x22/0x30
> [415478.612311] RIP: 0033:0x7fd71a04b9ba
> [415478.613168] Code: 4d 29 c1 4c 29 c2 48 3b 15 db 95 11 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 <41> 0f 29 01 41 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9
> [415478.617083] RSP: 002b:00007ffcf277be18 EFLAGS: 00010206
> [415478.618213] RAX: 00007fd71a1a3fc5 RBX: 0000000000000fc5 RCX: 00007fd719f5a610
> [415478.619854] RDX: 000000000000964b RSI: 00007fd719f50fd5 RDI: 00007fd71a1a3fc5
> [415478.621286] RBP: 0000000000030fc5 R08: 000000000000000e R09: 00007fd71a1ad640
> [415478.622730] R10: 0000000000000001 R11: 00007fd71a1ad64e R12: 0000000000009699
> [415478.624164] R13: 000000000000a65e R14: 00007fd71a1a3000 R15: 0000000000000001
> [415478.625600] </TASK>
> [415478.626087] ---[ end trace 0000000000000000 ]---
>
> Even generic/247 is generating a warning like this from xfs_io,
> which is a mmap vs DIO racer. Given that DIO doesn't exist for
> fsdax, this test turns into just a normal write() vs mmap() racer.
>
> Given these are the same fsdax infrastructure failures that I
> reported for 6.0, it is also likely that ext4 is still throwing
> them. IOWs, whatever got broke in the 6.0 cycle wasn't fixed in the
> 6.1 cycle.

Still working on it...

>
>> Frankly at this point I'm tempted just to turn of fsdax support for XFS
>> for the 6.1 LTS because I don't have time to fix it.
>
> /me shrugs
>
> Backporting fixes (whenever they come along) is a problem for the
> LTS kernel maintainer to deal with, not the upstream maintainer.
>
> IMO, the issue right now is that the DAX maintainers seem to have
> little interest in ensuring that the FSDAX infrastructure actually
> works correctly. If anything, they seem to want to make things
> harder for block based filesystems to use pmem devices and hence
> FSDAX. e.g. the direction of the DAX core away from block interfaces
> that filesystems need for their userspace tools to manage the
> storage.
>
> At what point do we simply say "the experiment failed, FSDAX is
> dead" and remove it from XFS altogether?

I'll hurry up and try my best to solve these problems.


--
Thanks,
Ruan.

>
> Cheers,
>
> Dave.