Re: [PATCH v8 06/10] arm64: idle: Tag the arm64 idle functions as __cpuidle

From: Doug Anderson
Date: Wed May 10 2023 - 17:13:50 EST


Hi,

On Wed, May 10, 2023 at 9:43 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> On Wed, Apr 19, 2023 at 03:56:00PM -0700, Douglas Anderson wrote:
> > As per the (somewhat recent) comment before the definition of
> > `__cpuidle`, the tag is like `noinstr` but also marks a function so it
> > can be identified by cpu_in_idle(). Let'a add this.
> >
> > After doing this then when we dump stack traces of all processors
> > using nmi_cpu_backtrace() then instead of getting useless backtraces
> > we get things like:
> >
> > NMI backtrace for cpu N skipped: idling at cpu_do_idle+0x94/0x98
>
> As a heads-up, this is only going to work in the trivial case where a CPU is
> within the default cpu_do_idle(), and not for anything using PSCI
> cpu_suspend() (which I'd *really* hope is the common case).

Thanks for the review and the heads up! Right. It only catches shallow
idle. Specifically this was the stack trace when it was "caught" on a
sc7180-trogdor device:

cpu_do_idle+0x94/0x98
psci_enter_idle_state+0x58/0x70
cpuidle_enter_state+0xb8/0x260
cpuidle_enter+0x44/0x5c
do_idle+0x188/0x30c
...

I checked in kgdb and saw that "psci_enter_idle_state()" had "idx" as
0, which makes sense since __CPU_PM_CPU_IDLE_ENTER will directly call
cpu_do_idle() when idx is 0.

I agree that it doesn't catch every case. Certainly it's not too hard
to see a CPU here:

gic_cpu_sys_reg_init+0x1f8/0x314
gic_cpu_pm_notifier+0x40/0x78
raw_notifier_call_chain+0x5c/0x134
cpu_pm_notify+0x38/0x64
cpu_pm_exit+0x20/0x2c
psci_enter_idle_state+0x48/0x70
cpuidle_enter_state+0xb8/0x260
cpuidle_enter+0x44/0x5c
do_idle+0x188/0x30c
...

...and kgdb showed "idx" was 3.

That being said, catching some of the cases is better than catching
none of the cases. ;-)

In reality, I've seen cases where it catches most CPUs. For instance,
running soon after bootup on my sc7180-trogdor device:

echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT

...and then having the "buddy" hardlockup detector [1] catch the crash
caught all of the CPUs other than the one that was locked up and the
one that detected it. Specifically:

NMI backtrace for cpu 2 skipped: idling at cpu_do_idle+0x94/0x98
NMI backtrace for cpu 1 skipped: idling at cpu_do_idle+0x94/0x98
NMI backtrace for cpu 0 skipped: idling at cpu_do_idle+0x94/0x98
NMI backtrace for cpu 3 skipped: idling at cpu_do_idle+0x94/0x98
NMI backtrace for cpu 4 skipped: idling at cpu_do_idle+0x94/0x98
NMI backtrace for cpu 7 skipped: idling at cpu_do_idle+0x94/0x98

I haven't analyzed it, but I guess it's possible that when the buddy
hardlockup detector triggers that other CPUs are more likely to be in
a shallow idle? Certainly I seem to catch a lot more CPUs in a shallow
idle in buddy lockup reports...


[1] https://lore.kernel.org/r/20230504221349.1535669-1-dianders@xxxxxxxxxxxx


> That doesn't get inlined, and the invocation is shared with other SMCCC users,
> so we probably need more work there if culling idle backtraces is important.

At this point I'd say that we should just take the low hanging fruit
(this patch). If someone later wants to try to do better they can.


> I'm not averse to this change itself.

Any chance you'd be willing to give any tags to it? :-P Do you need me
to add any of the above caveats to the commit message?

I also certainly wouldn't object to this patch landing even if others
aren't ready. It has no dependencies on any other patches and just
makes the debug messages prettier in some cases.

-Doug