Re: [PATCH] hardlockup: detect hard lockups using secondary (buddy) cpus

From: Doug Anderson
Date: Mon Apr 24 2023 - 11:48:52 EST


Hi,

On Mon, Apr 24, 2023 at 5:54 AM Daniel Thompson
<daniel.thompson@xxxxxxxxxx> wrote:
>
> On Fri, Apr 21, 2023 at 03:53:30PM -0700, Douglas Anderson wrote:
> > From: Colin Cross <ccross@xxxxxxxxxxx>
> >
> > Implement a hardlockup detector that can be enabled on SMP systems
> > that don't have an arch provided one or one implemented atop perf by
> > using interrupts on other cpus. Each cpu will use its softlockup
> > hrtimer to check that the next cpu is processing hrtimer interrupts by
> > verifying that a counter is increasing.
> >
> > NOTE: unlike the other hard lockup detectors, the buddy one can't
> > easily provide a backtrace on the CPU that locked up. It relies on
> > some other mechanism in the system to get information about the locked
> > up CPUs. This could be support for NMI backtraces like [1], it could
> > be a mechanism for printing the PC of locked CPUs like [2], or it
> > could be something else.
> >
> > This style of hardlockup detector originated in some downstream
> > Android trees and has been rebased on / carried in ChromeOS trees for
> > quite a long time for use on arm and arm64 boards. Historically on
> > these boards we've leveraged mechanism [2] to get information about
> > hung CPUs, but we could move to [1].
>
> On the Arm platforms is this code able to leverage the existing
> infrastructure to extract status from stuck CPUs:
> https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html

Yup! I wasn't explicit about this, but that's where you end up if you
follow the whole bug tracker item that was linked as [2].
Specifically, we used to have downstream patches in the ChromeOS tree
that just reached into the coresight range from an SoC-specific driver
and printed out CPU_DBGPCSR. When Brian was uprevving rk3399
Chromebooks he found that the equivalent functionality had made it
upstream in a generic way through the coresight framework. Brian
confirmed it was working on rk3399 and made all of the device tree
changes needed to get it hooked up, so it should work (at least on
that SoC).
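
For anyone skimming the thread, the buddy check from the patch
description (each CPU verifies that the next CPU's hrtimer interrupt
counter keeps increasing) can be sketched roughly as below. This is a
userspace simulation under my own assumptions, not the code from the
patch: every name here (timer_tick(), buddy_of(), buddy_is_stuck(),
simulate()) is hypothetical, and it ignores hotplug, thresholds and
anything else a real detector has to care about.

```c
#include <stdbool.h>
#include <string.h>

#define NR_CPUS 4

/* Hypothetical per-CPU state; names are illustrative, not from the patch. */
static unsigned long hrtimer_interrupts[NR_CPUS];       /* bumped on every timer tick */
static unsigned long hrtimer_interrupts_saved[NR_CPUS]; /* value seen at the last check */

/* Stands in for a CPU's own hrtimer callback firing. */
static void timer_tick(int cpu)
{
	hrtimer_interrupts[cpu]++;
}

/* The CPU that `cpu` is responsible for watching: simply the next one. */
static int buddy_of(int cpu)
{
	return (cpu + 1) % NR_CPUS;
}

/* True if the buddy's counter has not moved since the previous check. */
static bool buddy_is_stuck(int cpu)
{
	int buddy = buddy_of(cpu);
	unsigned long now = hrtimer_interrupts[buddy];
	bool stuck = (now == hrtimer_interrupts_saved[buddy]);

	hrtimer_interrupts_saved[buddy] = now;
	return stuck;
}

/*
 * Run `rounds` timer periods. From round `lock_round` onward, `locked_cpu`
 * stops taking timer interrupts (pass -1 for "no CPU locks up").
 * Returns the CPU reported as stuck by its watcher, or -1 if none.
 */
static int simulate(int rounds, int locked_cpu, int lock_round)
{
	memset(hrtimer_interrupts, 0, sizeof(hrtimer_interrupts));
	memset(hrtimer_interrupts_saved, 0, sizeof(hrtimer_interrupts_saved));

	for (int r = 0; r < rounds; r++) {
		for (int cpu = 0; cpu < NR_CPUS; cpu++) {
			bool locked = (cpu == locked_cpu && r >= lock_round);

			if (!locked)
				timer_tick(cpu);
		}
		for (int cpu = 0; cpu < NR_CPUS; cpu++) {
			bool locked = (cpu == locked_cpu && r >= lock_round);

			/* A locked-up CPU cannot run its own check. */
			if (!locked && buddy_is_stuck(cpu))
				return buddy_of(cpu);
		}
	}
	return -1;
}
```

In this sketch simulate(5, 2, 3) has CPU 1 notice that CPU 2's counter
stopped moving. Note that the watcher only learns *that* its buddy is
stuck, which is why something like the coresight CPU debug support is
still needed to find out *where*.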

[2] https://issuetracker.google.com/172213129