Re: Crash with PREEMPT_RT on aarch64 machine

From: Pierre Gondois
Date: Wed Nov 09 2022 - 08:53:00 EST

Next message: Nick Alcock: "[PATCH v9 2/8] kbuild: add modules_thick.builtin"
Previous message: Nick Alcock: "[PATCH v9 3/8] kbuild: generate an address ranges map at vmlinux link time"
In reply to: Jan Kara: "Re: Crash with PREEMPT_RT on aarch64 machine"
Next in thread: Pierre Gondois: "Re: Crash with PREEMPT_RT on aarch64 machine"

On 11/9/22 12:01, Jan Kara wrote:

On Wed 09-11-22 09:55:07, Mark Rutland wrote:

On Tue, Nov 08, 2022 at 06:45:29PM +0100, Jan Kara wrote:

On Tue 08-11-22 10:53:40, Mark Rutland wrote:

On Mon, Nov 07, 2022 at 11:49:01AM -0500, Waiman Long wrote:

On 11/7/22 10:10, Sebastian Andrzej Siewior wrote:

+ locking, arm64

On 2022-11-07 14:56:36 [+0100], Jan Kara wrote:

spinlock_t and raw_spinlock_t differ slightly in terms of locking.
rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
while the actual lock is modified via cmpxchg.

So I've tried enabling CONFIG_DEBUG_RT_MUTEXES and indeed the corruption
stops happening as well. So do you suspect some bug in the CPU itself?

If it is only enabling CONFIG_DEBUG_RT_MUTEXES (and not whole lockdep)
then it looks very suspicious.
CONFIG_DEBUG_RT_MUTEXES enables a few additional checks but the main
part is that rt_mutex_cmpxchg_acquire() + rt_mutex_cmpxchg_release()
always fail (and so the slowpath under a raw_spinlock_t is done).

So if it is really the fast path (rt_mutex_cmpxchg_acquire()) then it
somehow smells like the CPU is misbehaving.
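
To make the fast path / slow path distinction above concrete, here is a minimal user-space sketch using C11 atomics. It is not the kernel code: struct sketch_rtlock, the sketch_* functions and the force_slowpath field are hypothetical names made up for illustration, and the slow path only tries once instead of queueing a waiter. The force_slowpath flag stands in for CONFIG_DEBUG_RT_MUTEXES making the cmpxchg helpers always fail.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rt_mutex_base; not the kernel layout. */
struct sketch_rtlock {
    _Atomic uintptr_t owner;  /* 0 = unlocked, otherwise an owner id */
    atomic_flag wait_lock;    /* plays the role of the raw_spinlock_t wait_lock */
    bool force_slowpath;      /* models CONFIG_DEBUG_RT_MUTEXES (assumption) */
};

enum sketch_path { SKETCH_FAST, SKETCH_SLOW, SKETCH_BUSY };

static void sketch_init(struct sketch_rtlock *l, bool force_slowpath)
{
    atomic_init(&l->owner, 0);
    atomic_flag_clear(&l->wait_lock);
    l->force_slowpath = force_slowpath;
}

/* Fast path: a single cmpxchg with acquire ordering, no internal lock. */
static bool sketch_fast_trylock(struct sketch_rtlock *l, uintptr_t me)
{
    uintptr_t expected = 0;

    if (l->force_slowpath)
        return false;  /* with the debug option the fast cmpxchg always "fails" */
    return atomic_compare_exchange_strong_explicit(
        &l->owner, &expected, me,
        memory_order_acquire, memory_order_relaxed);
}

/* Slow path: the owner word is only modified while wait_lock is held. */
static bool sketch_slow_trylock(struct sketch_rtlock *l, uintptr_t me)
{
    uintptr_t expected = 0;
    bool got;

    while (atomic_flag_test_and_set_explicit(&l->wait_lock,
                                             memory_order_acquire))
        ;  /* spin until the internal lock is free */
    got = atomic_compare_exchange_strong_explicit(
        &l->owner, &expected, me,
        memory_order_relaxed, memory_order_relaxed);
    atomic_flag_clear_explicit(&l->wait_lock, memory_order_release);
    return got;
}

/* Try the fast path first, then fall back to the slow path. */
static enum sketch_path sketch_trylock(struct sketch_rtlock *l, uintptr_t me)
{
    if (sketch_fast_trylock(l, me))
        return SKETCH_FAST;
    if (sketch_slow_trylock(l, me))
        return SKETCH_SLOW;
    return SKETCH_BUSY;
}

static void sketch_unlock(struct sketch_rtlock *l)
{
    atomic_store_explicit(&l->owner, 0, memory_order_release);
}
```

With force_slowpath set, every acquisition goes through wait_lock, which is why the debug option changes which memory accesses are actually exercised.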

Could someone from the locking/arm64 department check if the locking in
RT-mutex (rtlock_lock()) is correct?

rtmutex locking uses try_cmpxchg_acquire(, ptr, ptr) for the fastpath
(and try_cmpxchg_release(, ptr, ptr) for unlock).
Now looking at it again, I don't see much difference compared to what
queued_spin_trylock() does except the latter always operates on a 32-bit
value instead of a pointer.

Both the fast path of the queued spinlock and rt_spin_lock use
try_cmpxchg_acquire(); the only difference I saw is the size of the data to
be cmpxchg'ed. qspinlock uses a 32-bit integer whereas rt_spin_lock uses a
64-bit pointer. So I believe it is more about how arm64 does cmpxchg. I
believe there are two different ways of doing it, depending on whether LSE
atomics are available on the platform. So exactly what arm64 system is being
used here, and what hardware capabilities does it have?
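
As a rough illustration of the operand-size difference mentioned above, the two trylock shapes can be written side by side with C11 atomics. This is only a sketch: sketch_trylock32() and sketch_trylock_ptr() are hypothetical names, not kernel functions, and uintptr_t is assumed to be 64 bits wide as it is on aarch64 Linux.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* 32-bit lock word, the shape a qspinlock-style trylock operates on. */
static bool sketch_trylock32(_Atomic uint32_t *val)
{
    uint32_t expected = 0;  /* 0 = unlocked */

    return atomic_compare_exchange_strong_explicit(
        val, &expected, 1u,
        memory_order_acquire, memory_order_relaxed);
}

/* Pointer-sized owner word, the shape an rtmutex-style fast path operates on. */
static bool sketch_trylock_ptr(_Atomic uintptr_t *owner, uintptr_t me)
{
    uintptr_t expected = 0;  /* 0 = no owner */

    return atomic_compare_exchange_strong_explicit(
        owner, &expected, me,
        memory_order_acquire, memory_order_relaxed);
}
```

Both functions request the same acquire ordering; only the width of the compared and exchanged value differs.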

From the /proc/cpuinfo output earlier, this is a Neoverse N1 system, with the
LSE atomics. Assuming the kernel was built with support for atomics in-kernel
(which is selected by default), it'll be using the LSE version.

So I was able to reproduce the corruption both with LSE atomics enabled &
disabled in the kernel. It seems the problem takes considerably longer to
reproduce with LSE atomics enabled but it still does happen.

BTW, I've tried to reproduce the problem on another aarch64 machine with a
CPU from a different vendor:

processor : 0
BogoMIPS : 200.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
CPU implementer : 0x48
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd01
CPU revision : 0

And there the problem does not reproduce. So might it be a genuine bug in
the CPU implementation?

Perhaps, though I suspect it's more likely that we have an ordering bug in the
kernel code, and it shows up on CPUs with legitimate but more relaxed ordering.
We've had a couple of those show up on Apple M1, so it might be worth trying on
one of those.
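
For readers unfamiliar with this class of bug, here is a generic sketch of an ordering dependency, written with C11 atomics. It is not taken from the rtmutex code and is not a claim about where the problem in this thread lies; struct sketch_msg, sketch_publish() and sketch_consume() are hypothetical names used only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_msg {
    int payload;        /* plain data, written before publishing */
    atomic_bool ready;  /* flag that publishes the payload */
};

/* Writer: the release store orders the payload write before the flag. */
static void sketch_publish(struct sketch_msg *m, int value)
{
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Reader: the acquire load orders the flag read before the payload read. */
static bool sketch_consume(struct sketch_msg *m, int *out)
{
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;
    *out = m->payload;
    return true;
}
```

If the release/acquire pair were replaced by relaxed accesses, a CPU with weaker but still legitimate ordering could let the reader observe ready as true and still read a stale payload, while a more strongly ordered implementation might never show the problem.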

How easy is this to reproduce? What's necessary?

As Pierre writes, running the dbench benchmark on an XFS filesystem on an
Ampere Altra machine triggers this relatively easily (it takes about 10
minutes to trigger without atomics and about 30 minutes to trigger with the
atomics enabled).

Running the benchmark on XFS somehow seems to be important; we didn't see
the crash happen on ext4 (which may just mean it is less frequent on ext4
and didn't trigger in our initial testing, after which we started to
investigate crashes with XFS).

Honza

It was possible to reproduce on an Ampere eMAG. It takes less than a minute to
reproduce once dbench is launched, and it seems more likely to trigger with the
previous diff applied. It sometimes even triggers without launching dbench on
the Altra.

/proc/cpuinfo for eMAG:
processor : 0
BogoMIPS : 80.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x50
CPU architecture: 8
CPU variant : 0x3
CPU part : 0x000
CPU revision : 2
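
The two Features lines quoted in this thread differ in the "atomics" flag, which is how arm64 Linux reports LSE atomics in /proc/cpuinfo. A small sketch of checking for it is below; features_has() is a hypothetical helper, not an existing API, and it expects only the text after "Features :".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if 'flag' appears as a whole whitespace-separated word in
 * 'features'. Whole-word matching avoids "asimd" matching "asimdhp".
 */
static bool features_has(const char *features, const char *flag)
{
    static const char sep[] = " \t\n";
    size_t flag_len;
    const char *p = features;

    assert(features != NULL && flag != NULL);
    flag_len = strlen(flag);

    for (;;) {
        size_t len;

        p += strspn(p, sep);   /* skip separators */
        if (*p == '\0')
            return false;      /* end of line, flag not found */
        len = strcspn(p, sep); /* length of the current word */
        if (len == flag_len && strncmp(p, flag, len) == 0)
            return true;
        p += len;
    }
}
```

Applied to the lines above, the check is true for the implementer 0x48 machine and false for the eMAG, whose Features line has no "atomics" entry.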
