Re: [PATCH v2] hardlockup: detect hard lockups using secondary (buddy) CPUs

From: Doug Anderson
Date: Mon May 01 2023 - 11:09:06 EST


Hi,

On Sat, Apr 29, 2023 at 2:22 PM Ian Rogers <irogers@xxxxxxxxxx> wrote:
>
> On Fri, Apr 28, 2023 at 4:41 PM Douglas Anderson <dianders@xxxxxxxxxxxx> wrote:
> >
> > From: Colin Cross <ccross@xxxxxxxxxxx>
> >
> > Implement a hardlockup detector that doesn't need any extra
> > arch-specific support code to detect lockups. Instead of using
> > something arch-specific we will use the buddy system, where each CPU
> > watches out for another one. Specifically, each CPU will use its
> > softlockup hrtimer to check that the next CPU is processing hrtimer
> > interrupts by verifying that a counter is increasing.
> >
> > NOTE: unlike the other hard lockup detectors, the buddy one can't
> > easily show what's happening on the CPU that locked up just by doing a
> > simple backtrace. It relies on some other mechanism in the system to
> > get information about the locked up CPUs. This could be support for
> > NMI backtraces like [1], it could be a mechanism for printing the PC
> > of locked CPUs at panic time like [2] / [3], or it could be something
> > else. Even though that means we still rely on arch-specific code, this
> > arch-specific code seems to often be implemented even on architectures
> > that don't have a hardlockup detector.
> >
> > This style of hardlockup detector originated in some downstream
> > Android trees and has been rebased on / carried in ChromeOS trees for
> > quite a long time for use on arm and arm64 boards. Historically on
> > these boards we've leveraged mechanism [2] / [3] to get information
> > about hung CPUs, but we could move to [1].
> >
> > Although the original motivation for the buddy system was for use on
> > systems without an arch-specific hardlockup detector, it can still be
> > useful to use even on systems that _do_ have an arch-specific
> > hardlockup detector. On x86, for instance, there is a 24-part patch
> > series [4] in progress switching the arch-specific hard lockup
> > detector from a scarce perf counter to a less-scarce hardware
> > resource. Potentially the buddy system could be a simpler alternative
> > to free up the perf counter but still get hard lockup detection.
> >
> > Overall, pros (+) and cons (-) of the buddy system compared to an
> > arch-specific hardlockup detector:
> > + Usable on systems that don't have an arch-specific hardlockup
> >   detector, like arm32 and arm64 (though it's being worked on for
> >   arm64 [5]).
> > + May free up scarce hardware resources.
> > + If a CPU totally goes out to lunch (can't process NMIs) the buddy
> >   system could still detect the problem (though it would be unlikely
> >   to be able to get a stack trace).
> > - If all CPUs are hard locked up at the same time the buddy system
> >   can't detect it.
> > - If we don't have SMP we can't use the buddy system.
> > - The buddy system needs an arch-specific mechanism (possibly NMI
> >   backtrace) to get info about the locked up CPU.
>
> Thanks for this list, it is really useful! Is it worth mentioning the
> behavior around idle? Could this approach potentially use more power?

Sure, I'll add some text in there. As far as I can tell from the
code, the buddy detector should, if anything, be better for idle/power
than some other detectors.

Specifically, note that the main "worker" of the buddy detector is
called from watchdog_timer_fn(). The timer function is the same one
that runs for other hard lockup detectors, but in those cases it
_only_ pets the watchdog of the running CPU. With the buddy detector
it pets the running CPU's watchdog and then checks on the buddy's
watchdog. There is no separate wakeup / interrupt that needs to run
periodically to look for hard lockups.
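
To illustrate the flow, here is a rough user-space sketch. It is not
the kernel code: struct cpu_state, buddy_of(), check_buddy(),
sim_timer_fn() and run_period() are made-up names, the "next CPU" is
simplified to (cpu + 1) % NR_CPUS with a fixed NR_CPUS of 4, and the
"stuck" flag exists only to simulate a CPU that stopped taking timer
interrupts. The point is just that the buddy check rides on the one
timer wakeup each CPU already takes.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_CPUS 4 /* assumption: small fixed CPU count for the sketch */

/* Hypothetical per-CPU state; field names are illustrative only. */
struct cpu_state {
	unsigned long timer_count;       /* bumped by this CPU's own timer */
	unsigned long timer_count_saved; /* value its watcher saw last time */
	bool stuck;                      /* simulation only: no more timer ticks */
};

static struct cpu_state cpus[NR_CPUS];
static unsigned long timer_wakeups; /* total timer callbacks that ran */

static void sim_init(void)
{
	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
		cpus[cpu].timer_count = 0;
		/* Sentinel so the very first check never matches. */
		cpus[cpu].timer_count_saved = ULONG_MAX;
		cpus[cpu].stuck = false;
	}
	timer_wakeups = 0;
}

/* Simplified buddy choice: each CPU watches the next one, wrapping. */
static unsigned int buddy_of(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	return (cpu + 1) % NR_CPUS;
}

/* Returns true if the buddy's counter has not moved since the last check. */
static bool check_buddy(unsigned int cpu)
{
	struct cpu_state *b = &cpus[buddy_of(cpu)];

	if (b->timer_count == b->timer_count_saved)
		return true;
	b->timer_count_saved = b->timer_count;
	return false;
}

/* Stand-in for the per-CPU timer callback: one wakeup, two jobs. */
static bool sim_timer_fn(unsigned int cpu)
{
	timer_wakeups++;
	cpus[cpu].timer_count++; /* pet this CPU's own watchdog */
	return check_buddy(cpu); /* then look at the buddy */
}

/*
 * Fire one timer period on every CPU that is still alive.
 * Returns the CPU reported as locked up, or -1 if none was.
 */
static int run_period(void)
{
	int reported = -1;

	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpus[cpu].stuck)
			continue;
		if (sim_timer_fn(cpu))
			reported = (int)buddy_of(cpu);
	}
	return reported;
}
```

Calling sim_init(), running a few periods, then setting
cpus[2].stuck = true makes CPU 1 report CPU 2 once CPU 2's counter has
stayed unchanged across two of CPU 1's checks. The number of timer
callbacks per live CPU per period stays at one throughout.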

I'm about to send a v3 to fix the cpu => CPU capitalization that I
missed in v2. I'll add text similar to the above to the commit message.

-Doug