Re: [PATCH v4.4] x86/kconfig: Fall back to ticket spinlocks

From: Peter Zijlstra
Date: Thu Nov 01 2018 - 05:18:19 EST


On Wed, Oct 31, 2018 at 09:14:58AM +0100, Daniel Wagner wrote:
> From: Daniel Wagner <daniel.wagner@xxxxxxxxxxx>
>
> Sebastian writes:
>
> """
> We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> cores), an i5-6400 SKL (4 cores) and on an NXP LS2044A ARM Cortex-A72 (4
> cores).
>
> The problem can be triggered with a v4.9-RT kernel by starting
>
> cyclictest -S -p98 -m -i2000 -b 200
>
> and as "load"
>
> stress-ng --ptrace 4
>
> The reported maximum latency is usually less than 60us. If the problem
> triggers, values around 400us, 800us or even more are reported. The
> upper limit is the -i parameter.
>
> Reproduction with 4.9-RT is almost immediate on Core2Duo, ARM64 and SKL,
> but it took 7.5 hours to trigger on v4.14-RT on the Core2Duo.
>
> Instrumentation always shows the same picture:
>
> CPU0                                          CPU1
> => do_syscall_64                              => do_syscall_64
> => SyS_ptrace                                 => syscall_slow_exit_work
> => ptrace_check_attach                        => ptrace_do_notify / rt_read_unlock
> => wait_task_inactive                            rt_spin_lock_slowunlock()
>    -> while task_running()                       __rt_mutex_unlock_common()
>   /   check_task_state()                         mark_wakeup_next_waiter()
>  |     raw_spin_lock_irq(&p->pi_lock);           raw_spin_lock(&current->pi_lock);
>  |     .                                            .
>  |     raw_spin_unlock_irq(&p->pi_lock);            .
>   \    cpu_relax()                                   .
>   -                                                  .
>    *IRQ*                                         <lock acquired>
>
> In the error case we observe that the while() loop is repeated more than
> 5000 times, which indicates that CPU0 can re-acquire the pi_lock on every
> iteration. CPU1 on the other side makes no progress, waiting for the same
> lock with interrupts disabled.
>
> This continues until an IRQ hits CPU0. Once CPU0 starts processing the IRQ
> the other CPU is able to acquire pi_lock and the situation relaxes.
> """
>
> This matches the observation for v4.4-rt on a Core2Duo E6850:
>
> CPU 0:
>
> - no progress for a very long time in rt_mutex_dequeue_pi():
>
> stress-n-1931 0d..11 5060.891219: function: __try_to_take_rt_mutex
> stress-n-1931 0d..11 5060.891219: function: rt_mutex_dequeue
> stress-n-1931 0d..21 5060.891220: function: rt_mutex_enqueue_pi
> stress-n-1931 0....2 5060.891220: signal_generate: sig=17 errno=0 code=262148 comm=stress-ng-ptrac pid=1928 grp=1 res=1
> stress-n-1931 0d..21 5060.894114: function: rt_mutex_dequeue_pi
> stress-n-1931 0d.h11 5060.894115: local_timer_entry: vector=239
>
> CPU 1:
>
> - IRQ at 5060.894114 on CPU 1 followed by the IRQ on CPU 0
>
> stress-n-1928 1....0 5060.891215: sys_enter: NR 101 (18, 78b, 0, 0, 17, 788)
> stress-n-1928 1d..11 5060.891216: function: __try_to_take_rt_mutex
> stress-n-1928 1d..21 5060.891216: function: rt_mutex_enqueue_pi
> stress-n-1928 1d..21 5060.891217: function: rt_mutex_dequeue_pi
> stress-n-1928 1....1 5060.891217: function: rt_mutex_adjust_prio
> stress-n-1928 1d..11 5060.891218: function: __rt_mutex_adjust_prio
> stress-n-1928 1d.h10 5060.894114: local_timer_entry: vector=239
>
> Thomas writes:
>
> """
> This has nothing to do with RT. RT is merely exposing the
> problem in an observable way. The same issue happens with upstream; it is
> harder to trigger and harder to observe for obvious reasons.
>
> If you read through the discussions [see the links below] then you
> really see that there is an upstream issue with the x86 qspinlock
> implementation, and Peter has posted fixes which resolve it, both at
> the practical and the theoretical level.
> """
>
> Backporting all qspinlock-related patches is very likely to introduce
> regressions on v4.4. Therefore, the solution recommended by Peter and
> Thomas is to drop back to ticket spinlocks for v4.4.
>
> Link: https://lkml.kernel.org/r/20180921120226.6xjgr4oiho22ex75@xxxxxxxxxxxxx
> Link: https://lkml.kernel.org/r/20180926110117.405325143@xxxxxxxxxxxxx
> Cc: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Signed-off-by: Daniel Wagner <daniel.wagner@xxxxxxxxxxx>
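
For completeness, since the changelog only names the fallback: a ticket lock
is strictly FIFO, i.e. waiters are served in the order in which they draw
their tickets, and that ordering is what rules out the kind of starvation
described above. A rough illustration of the idea in C11 atomics (just the
concept, not the kernel's implementation):

	#include <stdatomic.h>

	struct ticket_lock {
		atomic_uint next;	/* next ticket to hand out */
		atomic_uint serving;	/* ticket currently allowed in */
	};
	/* both counters start at 0, e.g. struct ticket_lock lock = {0}; */

	static void ticket_lock(struct ticket_lock *l)
	{
		/* draw a ticket; the only contended RMW on the lock */
		unsigned int me = atomic_fetch_add(&l->next, 1);

		/* wait until our number is called; FIFO by construction */
		while (atomic_load(&l->serving) != me)
			;	/* cpu_relax()/pause would go here */
	}

	static void ticket_unlock(struct ticket_lock *l)
	{
		/* hand the lock to the oldest waiter */
		atomic_fetch_add(&l->serving, 1);
	}

The unlock explicitly passes ownership to the oldest waiter, so a CPU that
has just released the lock cannot immediately re-take it ahead of a CPU that
is already spinning.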

Acked-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>