Re: [PATCH v2 7/8] ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list

From: Theodore Ts'o
Date: Mon Nov 28 2022 - 22:07:04 EST


This commit (determined via bisection) seems to be causing a reliable
failure using the ext4/ext3 configuration when running generic/269:

% kvm-xfstests -c ext4/ext3 generic/269
...
BEGIN TEST ext3 (1 test): Ext4 4k block emulating ext3 Mon Nov 28 21:39:35 EST 2022
DEVICE: /dev/vdd
EXT_MKFS_OPTIONS: -O ^extents,^flex_bg,^uninit_bg,^64bit,^metadata_csum,^huge_file,^die
EXT_MOUNT_OPTIONS: -o block_validity,nodelalloc
FSTYP -- ext4
PLATFORM -- Linux/x86_64 kvm-xfstests 6.1.0-rc4-xfstests-00018-g1c85d4890f15 #8492
MKFS_OPTIONS -- -F -q -O ^extents,^flex_bg,^uninit_bg,^64bit,^metadata_csum,^huge_filc
MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity,nodelalloc /dev/vdc /vdc

generic/269 23s ... [21:39:35][ 3.085973] run fstests generic/269 at 2022-11-28 215
[ 14.931680] ------------[ cut here ]------------
[ 14.931902] kernel BUG at fs/ext4/mballoc.c:4025!
[ 14.932137] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 14.932366] CPU: 1 PID: 2677 Comm: fsstress Not tainted 6.1.0-rc4-xfstests-00018-g19
[ 14.932756] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-4
[ 14.933169] RIP: 0010:ext4_mb_pa_adjust_overlap.constprop.0+0x18e/0x1c0
[ 14.933457] Code: 66 54 8b 48 54 89 4c 24 04 e8 ae 92 9f 00 41 8b 46 40 85 c0 75 bc4
[ 14.934270] RSP: 0018:ffffc90003aeb868 EFLAGS: 00010283
[ 14.934499] RAX: 0000000000000000 RBX: 00000000000000fc RCX: 0000000000000000
[ 14.934830] RDX: 0000000000000001 RSI: ffffc90003aeb8d4 RDI: 0000000000000001
[ 14.935146] RBP: 0000000000000200 R08: 0000000000008000 R09: 0000000000000001
[ 14.935447] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000103
[ 14.935744] R13: 0000000000000101 R14: ffff8880073370e0 R15: ffff888007337118
[ 14.936043] FS: 00007f94eda0b740(0000) GS:ffff88807dd00000(0000) knlGS:000000000000
[ 14.936390] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.936634] CR2: 000055ba905a0448 CR3: 000000001092c005 CR4: 0000000000770ee0
[ 14.936932] PKRU: 55555554
[ 14.937048] Call Trace:
[ 14.937190] <TASK>
[ 14.937285] ext4_mb_normalize_request.constprop.0+0x1e9/0x440
[ 14.937534] ext4_mb_new_blocks+0x3a2/0x560
[ 14.937715] ext4_alloc_branch+0x21e/0x350
[ 14.937892] ext4_ind_map_blocks+0x322/0x750
[ 14.938076] ext4_map_blocks+0x380/0x6e0
[ 14.938260] _ext4_get_block+0xb2/0x120
[ 14.938426] ext4_block_write_begin+0x13c/0x3d0
[ 14.938624] ? _ext4_get_block+0x120/0x120
[ 14.938801] ext4_write_begin+0x1c1/0x570
[ 14.938973] generic_perform_write+0xcf/0x220
[ 14.939175] ext4_buffered_write_iter+0x84/0x140
[ 14.939377] do_iter_readv_writev+0xf0/0x150
[ 14.939562] do_iter_write+0x80/0x150
[ 14.939722] vfs_writev+0xed/0x1f0
[ 14.939871] do_writev+0x73/0x100
[ 14.940016] do_syscall_64+0x37/0x90
[ 14.940186] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 14.940403] RIP: 0033:0x7f94edb02da3
[ 14.940559] Code: 8b 15 f1 90 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f8
[ 14.941341] RSP: 002b:00007ffc5e82d0d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
[ 14.941659] RAX: ffffffffffffffda RBX: 0000000000000036 RCX: 00007f94edb02da3
[ 14.941961] RDX: 0000000000000356 RSI: 000055ba901c1240 RDI: 0000000000000003
[ 14.942290] RBP: 0000000000000003 R08: 000055ba901cf240 R09: 00007f94edbccbe0
[ 14.942596] R10: 0000000000000080 R11: 0000000000000246 R12: 000000000000062a
[ 14.942902] R13: 0000000000000356 R14: 000055ba901c1240 R15: 000000000000b529
[ 14.943219] </TASK>
[ 14.943326] ---[ end trace 0000000000000000 ]---

Looking at the stack trace it looks like we're hitting this BUG_ON:

	spin_lock(&tmp_pa->pa_lock);
	if (tmp_pa->pa_deleted == 0)
		BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
	spin_unlock(&tmp_pa->pa_lock);

... in the inline function ext4_mb_pa_assert_overlap(), called from
ext4_mb_pa_adjust_overlap().
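
For anyone reading along, the condition in that BUG_ON amounts to an
interval-overlap test between the normalized request [start, end) and an
existing, non-deleted PA [tmp_pa_start, tmp_pa_end). Here is a minimal
userspace sketch of just that predicate. The helper names and the block
numbers are hypothetical and chosen only for illustration; none of this
is code from mballoc.c, and the values are not taken from the oops above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not a kernel function.  Two half-open ranges
 * [start, end) and [pa_start, pa_end) are disjoint when one of them
 * ends at or before the point where the other begins.  This is the
 * expression that the quoted BUG_ON negates.
 */
bool ranges_disjoint(unsigned long start, unsigned long end,
		     unsigned long pa_start, unsigned long pa_end)
{
	return start >= pa_end || end <= pa_start;
}

/*
 * Hypothetical helper: true when the quoted BUG_ON would fire for a
 * PA whose pa_deleted field is 0, i.e. when the ranges overlap.
 */
bool would_trip_bug_on(unsigned long start, unsigned long end,
		       unsigned long pa_start, unsigned long pa_end)
{
	return !ranges_disjoint(start, end, pa_start, pa_end);
}

/* Illustrative values only. */
void overlap_examples(void)
{
	/* Request ends exactly where the PA begins: no overlap. */
	assert(!would_trip_bug_on(0, 8, 8, 16));
	/* Request extends one block into the PA: overlap. */
	assert(would_trip_bug_on(0, 9, 8, 16));
	/* Request lies entirely inside the PA: overlap. */
	assert(would_trip_bug_on(10, 12, 8, 16));
	/* Request begins exactly where the PA ends: no overlap. */
	assert(!would_trip_bug_on(16, 20, 8, 16));
}
```

In other words, the assertion expects that by the time it runs, the
request range has already been trimmed so that it no longer intersects
any live PA; the oops indicates that at least one live PA still
intersected it.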

The generic/269 test runs fsstress with an ENOSPC hitter as an
antagonist process. The ext3 configuration disables delayed
allocation, which means that fsstress is going to be allocating blocks
at write time (instead of dirty page writeback time).

Could you take a look? Thanks!

- Ted