Re: Crash with PREEMPT_RT on aarch64 machine

From: Pierre Gondois
Date: Wed Nov 30 2022 - 12:20:42 EST



On 11/28/22 16:58, Sebastian Andrzej Siewior wrote:
How about this?

- The fast path is easy…

- The slow path first sets the WAITER bits (mark_rt_mutex_waiters()) so
I made that one _acquire so that it is visible by the unlocker forcing
everyone into slow path.

- If the lock is acquired, then the owner is written via
rt_mutex_set_owner(). This happens under wait_lock so it is
serialized and so a WRITE_ONCE() is used to write the final owner. I
replaced it with a cmpxchg_acquire() to have the owner there.
Not sure if I shouldn't make this as you put it:
| e.g. by making use of dependency ordering where it already exists.
The other (locking) CPU needs to see the owner not only the WAITER
bit. I'm not sure if this could be replaced with smp_store_release().

- After the whole operation completes, fixup_rt_mutex_waiters() cleans
the WAITER bit and I kept the _acquire semantic here. Now looking at
it again, I don't think that needs to be done since that shouldn't be
set here.

- There is rtmutex_spin_on_owner() which (as the name implies) spins on
the owner as long as it active. It obtains it via READ_ONCE() and I'm
not sure if any memory barrier is needed. Worst case is that it will
spin while owner isn't set if it observers a stale value.

- The unlock path first clears the waiter bit if there are no waiters
recorded (via simple assignments under the wait_lock (every locker
will fail with the cmpxchg_acquire() and go for the wait_lock)) and
then finally drop it via rt_mutex_cmpxchg_release(,, NULL).
Should there be a wait, it will just store the WAITER bit with
smp_store_release() (setting the owner is NULL but the WAITER bit
forces everyone into the slow path).

- Added rt_mutex_set_owner_pi() which does simple assignment. This is
used from the futex code and here everything happens under a lock.

- I added a smp_load_acquire() to rt_mutex_base_is_locked() since I
*think* want to observe a real waiter and not something stale.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>


Hello,
Just to share some debug attempts, I tried Sebastian's patch and could not
reproduce the error. While trying to understand the solution, I could not
reproduce the error if I only took the changes made to
mark_rt_mutex_waiters(), or to rt_mutex_set_owner_pi(). I am not sure I
understand why this would be a rt-mutex issue.

Without Sebastian's patch, to try adding some synchronization around the
'i_wb_list', I did the following:

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 443f83382b9b..42ce1f7f8aef 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1271,10 +1271,10 @@ void sb_clear_inode_writeback(struct inode *inode)
struct super_block *sb = inode->i_sb;
unsigned long flags;
- if (!list_empty(&inode->i_wb_list)) {
+ if (!list_empty_careful(&inode->i_wb_list)) {
spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
- if (!list_empty(&inode->i_wb_list)) {
- list_del_init(&inode->i_wb_list);
+ if (!list_empty_careful(&inode->i_wb_list)) {
+ list_del_init_careful(&inode->i_wb_list);
trace_sb_clear_inode_writeback(inode);
}
spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
diff --git a/fs/inode.c b/fs/inode.c
index b608528efd3a..fbe6b4fe5831 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -621,7 +621,7 @@ void clear_inode(struct inode *inode)
BUG_ON(!list_empty(&inode->i_data.private_list));
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(inode->i_state & I_CLEAR);
- BUG_ON(!list_empty(&inode->i_wb_list));
+ BUG_ON(!list_empty_careful(&inode->i_wb_list));
/* don't need i_lock here, no concurrent mods to i_state */
inode->i_state = I_FREEING | I_CLEAR;
}

I never stepped on the:
BUG_ON(!list_empty_careful(&inode->i_wb_list))
statement again, but got the dump at [2]. I also regularly end-up
with the following endless logs when trying other things, when rebooting:

EXT4-fs (nvme0n1p3): sb orphan head is 2840597
sb_info orphan list:
inode nvme0n1p3:3958579 at 00000000b5934dff: mode 100664, nlink 1, next 0
inode nvme0n1p3:3958579 at 00000000b5934dff: mode 100664, nlink 1, next 0
...

Also, Jan said that the issue was reproducible on 'two different aarch64
machines', cf [1]. Would it be possible to know which one ?

Regards,
Pierre

[1]
https://lore.kernel.org/all/20221103115444.m2rjglbkubydidts@quack3/

[2]
EXT4-fs (nvme0n1p3): Inode 2834153 (0000000051c7b29b): orphan list check failed!
0000000051c7b29b: 0000f30a 00000004 00000000 00000000 ................
00000000a0792dde: 00000000 00000000 00000000 00000000 ................
0000000065a25e3d: 00000000 00000000 00000000 00000000 ................
00000000b2085d6d: 00000000 00000000 00000000 002b3fe8 .............?+.
0000000088f2c42f: 00000000 00000000 00000159 00000000 ........Y.......
00000000b9d22813: 00080000 00000040 80000000 00000000 ....@...........
0000000004f93ec7: 00000000 00000000 00000000 00000000 ................
00000000d8e00df6: 00000000 00000000 00000000 00000000 ................
00000000d29047af: ba813320 ffff07ff 0d822240 ffff4001 3......@"...@..
00000000e9708e5a: 0d822250 ffff4001 0d822250 ffff4001 P"...@..P"...@..
000000001f08ff8a: 0d822260 ffff4001 0d822260 ffff4001 `"...@..`"...@..
000000003231aba6: 00000000 00000000 00000000 00000000 ................
00000000df5b63ba: 00000000 00000000 00000000 00000000 ................
00000000367f58f3: 00000000 00000000 00000000 00000000 ................
000000008a1af872: 0d8222a0 ffff4001 0d8222a0 ffff4001 ."...@..."...@..
000000002a8dc95e: 00000000 00000000 00000000 00000000 ................
00000000e4c5bdb4: 00000000 00000000 00000000 00000000 ................
000000004f0a738d: 00000000 00000000 80000000 00000000 ................
0000000082a9e5b2: 00000000 00000000 00000000 00000000 ................
0000000064dce462: 00000000 00000000 00000000 00000000 ................
00000000fe106bb0: 000d8180 000040ab 000040ab 00000000 .....@...@......
00000000946bcd55: 00000000 00000000 00000000 00000000 ................
000000001d64e9fd: c1390a80 ffffdab5 87d6c800 ffff07ff ..9.............
000000002afa83ff: 0d822488 ffff4001 54d25060 ffff4002 .$...@..`P.T.@..
0000000075bfe8c6: 002b3ee9 00000000 00000000 00000000 .>+.............
00000000580feb22: 00000000 00000000 63877e76 00000000 ........v~.c....
00000000cca606aa: 010d8f0e 00000000 63877e76 00000000 ........v~.c....
00000000d446eaef: 010d8f0e 00000000 63877e76 00000000 ........v~.c....
0000000072243536: 010d8f0e 00000000 00000000 00000000 ................
000000007cbeccb9: 00000000 00000000 00000000 00000000 ................
0000000026d5ad72: 00000000 00000000 000c0000 00000000 ................
00000000e28ac20a: 00000000 00000000 00000060 00000000 ........`.......
0000000076ed32fb: 80000000 00000000 00000000 00000000 ................
00000000cd183175: 00000000 00000000 00000000 00000000 ................
00000000ecafc825: 00000000 00000000 00000000 00000000 ................
00000000408fde6f: 00000000 00000000 00000000 00000000 ................
000000004d7c3704: 00000000 00000000 0d822408 ffff4001 .........$...@..
000000007a24c141: 0d822408 ffff4001 0d822418 ffff4001 .$...@...$...@..
000000007ce51788: 0d822418 ffff4001 0d822428 ffff4001 .$...@..($...@..
00000000efd9c162: 0d822428 ffff4001 0d822438 ffff4001 ($...@..8$...@..
0000000013f0626e: 0d822438 ffff4001 00000000 00000000 8$...@..........
00000000e8fc5904: 00000000 00000000 00000002 00000000 ................
000000006533e04b: 00000000 00000000 00000000 00000000 ................
000000009a33c9d5: 00000000 00000000 c1390b40 ffffdab5 ........@.9.....
00000000b743e93e: 00000000 00000000 0d822300 ffff4001 .........#...@..
0000000041c5a701: 00000000 00000000 00000000 00000000 ................
000000009f872e56: 00000000 00000000 00000000 00000000 ................
00000000f6ca0703: 00000021 00000000 00000000 00000000 !...............
00000000e79eacb9: 80000000 00000000 00000000 00000000 ................
000000003afb3989: 00000000 00000000 00000000 00000000 ................
00000000164436d4: 00000000 00000000 00100cca 00000000 ................
00000000d63a3021: 00000000 00000000 00000000 00000000 ................
00000000e0e1ace3: 80000000 00000000 00000000 00000000 ................
0000000043ac9c19: 00000000 00000000 00000000 00000000 ................
00000000f3870564: 00000000 00000000 00000000 00000000 ................
0000000082c87bf9: 00000000 00000000 c1391310 ffffdab5 ..........9.....
00000000f5524c75: 00000010 00000000 00000000 00000000 ................
00000000dfda4192: 00000000 00000000 00000000 00000000 ................
00000000863650dd: 00000000 00000000 00000000 00000000 ................
00000000b709ac61: 0d822570 ffff4001 0d822570 ffff4001 p%...@..p%...@..
000000009f316d71: 00000000 00000000 0d822588 ffff4001 .........%...@..
00000000f15fb4ed: 0d822588 ffff4001 00000000 00000000 .%...@..........
000000000027077a: 6401ffec 00000000 00000000 00000000 ...d............
00000000c828fe47: 00000000 00000000 00000000 00000000 ................
00000000b5f575af: 00000000 00000000 00000000 00000000 ................
000000001466bf98: 00000000 00000000 00000000 00000000 ................
0000000097025855: 00000000 00000000 00000000 00000000 ................
000000002557bcf0: 63877e76 00000000 010d8f0e 00000000 v~.c............
00000000b128f3c5: 00000000 00000000 0d822608 ffff4001 .........&...@..
000000005b473b40: 0d822608 ffff4001 00000000 00000000 .&...@..........
000000000913f445: 00000000 00000000 00000000 00000000 ................
000000003023853f: 00000000 00000000 00000000 00000000 ................
0000000025fcdffb: 00000000 00000000 80000000 00000000 ................
00000000334c5dc4: 00000000 00000000 00000000 00000000 ................
00000000f1ada795: 00000000 00000000 00000000 00000000 ................
0000000030e27dd3: 00000000 00000000 0d822678 ffff4001 ........x&...@..
0000000033a6a483: 0d822678 ffff4001 00000000 00000000 x&...@..........
00000000eb614a98: 00000000 ffffffff 00000000 00000000 ................
00000000df74e25f: 00000000 00000000 00000020 00000000 ........ .......
00000000efcf717a: 00000000 00000000 00000000 00000000 ................
00000000f84ffeba: 00000000 00000000 00000000 00000000 ................
00000000651071c7: 00000000 00000000 0d8226d8 ffff4001 .........&...@..
00000000a748241c: 0d8226d8 ffff4001 ffffffe0 0000000f .&...@..........
000000004c1557b3: 0d8226f0 ffff4001 0d8226f0 ffff4001 .&...@...&...@..
00000000dccbb716: c08cf6b0 ffffdab5 00000000 00000000 ................
000000009d2fb057: 00000000 00000000 00000000 00000000 ................
00000000f6f284e6: 00000000 00000000 00000000 00000000 ................
000000003bb74f4c: 00414087 00414087 00000000 00000000 .@A..@A.........
00000000e0601ec8: 00000000 00000000 00000000 00000000 ................
000000006a017bfb: cef69528 00000000 (.......
CPU: 125 PID: 3898 Comm: dbench Not tainted 6.0.5-rt14-[...]
Hardware name: WIWYNN Mt.Jade Server System B81.03001.0005/Mt.Jade Motherboard, BIOS 1.08.20220218 (SCP: 1.08.20220218) 2022/02/18
Call trace:
[...]
ext4_destroy_inode+0xc8/0xd0
destroy_inode+0x48/0x80
evict+0x148/0x190
iput+0x184/0x250
do_unlinkat+0x1d8/0x290
__arm64_sys_unlinkat+0x48/0x90
invoke_syscall+0x78/0x100
el0_svc_common.constprop.0+0x54/0x194
do_el0_svc+0x38/0xd0
el0_svc+0x34/0x160
el0t_64_sync_handler+0xbc/0x13c
el0t_64_sync+0x1a0/0x1a4