Re: [PATCH 1/2] arm64: smp: Fix pseudo NMI issues w/ broken Mediatek FW

From: Mark Rutland
Date: Tue Oct 03 2023 - 08:29:39 EST


On Mon, Oct 02, 2023 at 12:16:17PM -0700, Doug Anderson wrote:
> Hi,
>
> On Mon, Oct 2, 2023 at 10:24 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> >
> > On Mon, Oct 02, 2023 at 09:45:29AM -0700, Douglas Anderson wrote:
> > > Some mediatek devices have the property
> > > "mediatek,broken-save-restore-fw" in their GIC. This means that,
> > > although the hardware supports pseudo-NMI, the firmware has a bug
> > > that blocks enabling it. When we're in this state,
> > > system_uses_irq_prio_masking() will return true but we'll fail to
> > > actually enable the IRQ in the GIC.
> > >
> > > Let's make the code handle this. We'll detect that we failed to
> > > request an IPI as NMI and fallback to requesting it normally. Though
> > > we expect that either all of our requests will fail or all will
> > > succeed, it's just as cheap to keep a per-IPI bitmap and that keeps us
> > > robust.
> > >
> > > Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > > Reported-by: Chen-Yu Tsai <wenst@xxxxxxxxxxxx>
> > > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > > Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx>
> > > ---
> > >
> > > arch/arm64/kernel/smp.c | 19 ++++++++++++-------
> > > 1 file changed, 12 insertions(+), 7 deletions(-)
> >
> > I'm not too keen on falling back here when we have no idea why the request failed.
> >
> > I'd prefer if we could check the `supports_pseudo_nmis` static key directly to
> > account for the case of broken FW, e.g. as below.
> >
> > Mark.
> >
> > ---->8----
> > From 72fdec05c64a74f21871b44c7c760bbe07cac044 Mon Sep 17 00:00:00 2001
> > From: Mark Rutland <mark.rutland@xxxxxxx>
> > Date: Mon, 2 Oct 2023 18:00:36 +0100
> > Subject: [PATCH] arm64: smp: avoid NMI IPIs with broken MediaTek FW
> >
> > Some MediaTek devices have broken firmware which corrupts some GICR
> > registers behind the back of the OS, and pseudo-NMIs cannot be used on
> > these devices. For more details see commit:
> >
> > 44bd78dd2b8897f5 ("irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues")
> >
> > We did not take this problem into account in commit:
> >
> > 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> >
> > Since that commit arm64's SMP code will try to setup some IPIs as
> > pseudo-NMIs, even on systems with broken FW. The GICv3 code will
> > (rightly) reject attempts to request interrupts as pseudo-NMIs,
> > resulting in boot-time failures.
> >
> > Avoid the problem by taking the broken FW into account when deciding to
> > request IPIs as pseudo-NMIs. The GICv3 driver maintains a static_key
> > named "supports_pseudo_nmis" which is false on systems with broken FW,
> > and we can consult this within ipi_should_be_nmi().
> >
> > Fixes: 331a1b3a836c0f38 ("arm64: smp: Add arch support for backtrace using pseudo-NMI")
> > Reported-by: Chen-Yu Tsai <wenst@xxxxxxxxxxxx>
> > Closes: https://issuetracker.google.com/issues/197061987#comment68
> > Signed-off-by: Mark Rutland <mark.rutland@xxxxxxx>
> > Cc: Douglas Anderson <dianders@xxxxxxxxxxxx>
> > Cc: Marc Zyngier <maz@xxxxxxxxxx>
> > ---
> > arch/arm64/kernel/smp.c | 5 ++++-
> > drivers/irqchip/irq-gic-v3.c | 2 +-
> > 2 files changed, 5 insertions(+), 2 deletions(-)
>
> Sure, this is OK w/ me as long as folks don't mind accessing the
> global here, it's OK w/ me:
>
> Reviewed-by: Douglas Anderson <dianders@xxxxxxxxxxxx>
>
> It seems to work for me, thus:
>
> Tested-by: Douglas Anderson <dianders@xxxxxxxxxxxx>
>
>
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index 814d9aa93b21b..061c69160f90f 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -964,7 +964,10 @@ static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
> >
> > static bool ipi_should_be_nmi(enum ipi_msg_type ipi)
> > {
> > - if (!system_uses_irq_prio_masking())
> > + DECLARE_STATIC_KEY_FALSE(supports_pseudo_nmis);
> > +
> > + if (!system_uses_irq_prio_masking() ||
> > + !static_branch_likely(&supports_pseudo_nmis))
>
> One thought, actually, is whether we should actually change
> system_uses_irq_prio_masking() to return the correct value. What do
> you think?

I don't think we should add this to system_uses_irq_prio_masking(); that's used
by the low-level flags manipulation code that gets inlined all over the place,
and that code will work regarldess of whether we actually use NMI priorities.

If we want to avoid using PMR masking *at all* on these platforms, we'd need to
detect that within can_use_gic_priorities() or early_enable_pseudo_nmi().

Thanks,
Mark.