Re: [PATCH] hardlockup: detect hard lockups using secondary (buddy) cpus

From: Chen-Yu Tsai
Date: Tue Apr 25 2023 - 00:58:21 EST


On Mon, Apr 24, 2023 at 11:42 PM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> Hi,
>
> On Mon, Apr 24, 2023 at 5:54 AM Daniel Thompson
> <daniel.thompson@xxxxxxxxxx> wrote:
> >
> > On Fri, Apr 21, 2023 at 03:53:30PM -0700, Douglas Anderson wrote:
> > > From: Colin Cross <ccross@xxxxxxxxxxx>
> > >
> > > Implement a hardlockup detector that can be enabled on SMP systems
> > > that don't have an arch provided one or one implemented atop perf by
> > > using interrupts on other cpus. Each cpu will use its softlockup
> > > hrtimer to check that the next cpu is processing hrtimer interrupts by
> > > verifying that a counter is increasing.
> > >
> > > NOTE: unlike the other hard lockup detectors, the buddy one can't
> > > easily provide a backtrace on the CPU that locked up. It relies on
> > > some other mechanism in the system to get information about the locked
> > > up CPUs. This could be support for NMI backtraces like [1], it could
> > > be a mechanism for printing the PC of locked CPUs like [2], or it
> > > could be something else.
> > >
> > > This style of hardlockup detector originated in some downstream
> > > Android trees and has been rebased on / carried in ChromeOS trees for
> > > quite a long time for use on arm and arm64 boards. Historically on
> > > these boards we've leveraged mechanism [2] to get information about
> > > hung CPUs, but we could move to [1].
> >
> > On the Arm platforms is this code able to leverage the existing
> > infrastructure to extract status from stuck CPUs:
> > https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html
>
> Yup! I wasn't explicit about this, but that's where you end up if you
> follow the whole bug tracker item that was linked as [2].
> Specifically, we used to have downstream patches in the ChromeOS tree that
> just reached into the coresight range from a SoC specific driver and
> printed out the CPU_DBGPCSR. When Brian was uprevving rk3399
> Chromebooks he found that the equivalent functionality had made it
> upstream in a generic way through the coresight framework. Brian
> confirmed it was working on rk3399 and made all of the device tree
> changes needed to get it all hooked up, so it should work (at least
> for that SoC).
>
> [2] https://issuetracker.google.com/172213129

IIRC with the coresight CPU debug driver enabled and the proper DT nodes
added, the panic handler does dump out information from the hardware.
I don't think it's wired up for hung tasks though.

ChenYu
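
For readers following the thread, the buddy scheme in the quoted patch
description (each CPU uses its own hrtimer tick to check that the next
CPU's hrtimer interrupt counter is still increasing) can be sketched as
a small userspace toy model. This is not the kernel code from the patch:
the names BuddyDetector, timer_tick and simulate are hypothetical, the
round-robin tick order is an assumption made for the simulation, and a
real detector would also have to deal with CPU hotplug and with getting
a backtrace from the stuck CPU by some other mechanism.

```python
class BuddyDetector:
    """Toy model of a buddy hardlockup check (illustrative, not kernel code)."""

    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        # Per-CPU count of hrtimer interrupts handled so far.
        self.hrtimer_interrupts = [0] * num_cpus
        # Buddy counter value last seen by each watching CPU (None = no sample yet).
        self.saved = [None] * num_cpus

    def buddy_of(self, cpu):
        # Each CPU watches the next one, wrapping around at the end.
        return (cpu + 1) % self.num_cpus

    def timer_tick(self, cpu):
        """Handle one hrtimer tick on `cpu`.

        Returns the buddy's CPU number if its counter has not moved since
        this CPU's previous tick, otherwise None.
        """
        self.hrtimer_interrupts[cpu] += 1
        buddy = self.buddy_of(cpu)
        seen = self.hrtimer_interrupts[buddy]
        if self.saved[cpu] is not None and seen == self.saved[cpu]:
            return buddy  # counter did not increase: buddy looks locked up
        self.saved[cpu] = seen
        return None


def simulate(num_cpus, stuck_cpu, stuck_after_round, rounds):
    """Tick every live CPU once per round; return (round, watcher, stuck) reports."""
    det = BuddyDetector(num_cpus)
    reports = []
    for rnd in range(1, rounds + 1):
        for cpu in range(num_cpus):
            if cpu == stuck_cpu and rnd > stuck_after_round:
                continue  # a hard-locked CPU takes no timer interrupts
            stuck = det.timer_tick(cpu)
            if stuck is not None:
                reports.append((rnd, cpu, stuck))
    return reports
```

Calling simulate(4, 2, 2, 5) models four CPUs where CPU 2 stops taking
timer interrupts after the second round. Only CPU 1, its watcher, reports
it. Note that in this model nothing watches CPU 3 once CPU 2 is stuck,
since CPU 2 was the one checking it.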