Re: Queries on ARM SDEI Linux kernel code

From: Neeraj Upadhyay
Date: Wed Oct 21 2020 - 13:32:11 EST

Next message: Yu Zhang: "Re: [PATCH v2 07/20] kvm: x86/mmu: Support zapping SPTEs in the TDP MMU"
Previous message: Chuck Lever: "Re: [PATCH] SUNRPC: fix copying of multiple pages in gss_read_proxy_verf()"
In reply to: James Morse: "Re: Queries on ARM SDEI Linux kernel code"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi James,

Sorry for late reply. Thanks for your comments!

On 10/16/2020 9:57 PM, James Morse wrote:

Hi Neeraj,

On 15/10/2020 07:07, Neeraj Upadhyay wrote:

1. Looks like interrupt bind interface (SDEI_1_0_FN_SDEI_INTERRUPT_BIND) is not available
for clients to use; can you please share information on
why it is not provided?

There is no compelling use-case for it, and its very complex to support as the driver can
no longer hide things like hibernate.

Last time I looked, it looked like the SDEI driver would need to ask the irqchip to
prevent modification while firmware re-configures the irq. I couldn't work out how this
would work if the irq is in-progress on another CPU.

Got it. I will think in this direction, on how to achieve this.

The reasons to use bound-interrupts can equally be supported with an event provided by
firmware.

Ok, I will explore in that direction.

While trying to dig information on this, I saw that [1] says:
Now the hotplug callbacks save nothing, and restore the OS-view of registered/enabled.
This makes bound-interrupts harder to work with.

Based on this comment, the changes from v4 [2], which I could understand is, cpu down path
does not save the current event enable status, and we rely on the enable status
`event->reenable', which is set, when register/unregister, enable/disable calls are made;
this enable status is used during cpu up path, to decide whether to reenable the interrupt.

Does this make, bound-interrupts harder to work with? how? Can you please explain? Or
above save/restore is not the reason and you meant something else?

If you bind a level-triggered interrupt, how does firmware know how to clear the interrupt
from whatever is generating it?

What happens if the OS can't do this either, as it needs to allocate memory, or take a
lock, which it can't do in nmi context?

Ok, makes sense.
The people that wrote the SDEI spec's answer to this was that the handler can disable the
event from inside the handler... and firmware will do, something, to stop the interrupt
screaming.

So now an event can become disabled anytime its registered, which makes it more
complicated to save/restore.

Also, does shared bound interrupts

Shared-interrupts as an NMI made me jump. But I think you mean a bound interrupt as a
shared event. i.e. and SPI not a PPI.

Sorry I should have worded properly; yes I meant SPI as shared event.

also have the same problem, as save/restore behavior
was only for private events?

See above, the problem is the event disabling itself.

This makes sense now.

Additionally those changes to unregister the private-event mean the code can't tell the
difference between cpuhp and hibernate... only hibernate additionally loses the state in
firmware.

Got it!

2. SDEI_EVENT_SIGNAL api is not provided? What is the reason for it? Its handling has the
same problems, which are there for bound interrupts?

Its not supported as no-one showed up with a use-case.
While firmware is expected to back it with a PPI, its doesn't have the same problems as
bound-interrupts as its not an interrupt the OS ever knows about.

Also, if it is provided, clients need to register event 0 ? Vendor events or other event
nums are not supported, as per spec.

Ideally the driver would register the event, and provide a call_on_cpu() helper to trigger
it. This should fit in with however the GIC's PMR based NMI does its PPI based
crash/stacktrace call so that the caller doesn't need to know if its back by IRQ, pNMI or
SDEI.

Ok; I will explore how PMR based NMIs work; I thought it was SGI based. But will recheck.

3. Can kernel panic() be triggered from sdei event handler?

Yes,

Is it a safe operation?

panic() wipes out the machine... did you expect it to keep running?

I wanted to check the case where panic triggers kexec/kdump path into capture kernel.

What does safe mean here?
I think I didn't put it correctly; I meant what possible scenarios can

happen in this case and you explained one below, thanks!

You should probably call nmi_panic() if there is the risk that the event occurred during
panic() on the same CPU, as it would otherwise just block.

The spec says, synchronous exceptions should not be triggered; I think panic
won't do it; but anything which triggers a WARN
or other sync exception in that path can cause undefined behavior. Can you share your
thoughts on this?

What do you mean by undefined behaviour?

I was thinking, if SDEI event preempts EL1, at the point, where EL1 has just entered an exception, and hasn't captured the registers like spsr_el1, elr_el1 and other registers, what will be the behavior?

SDEI was originally to report external abort to the OS in regions where the OS can't take
an exception because the exception-registers are live, just after and exception and just
before eret.

If you take another exception from the NMI handler, chances are you're going to go back
round the loop again, only this time firmware can't inject the SDEI event, so it has to
reboot.

Got it.

If you know it might cause an exception, you shouldn't do it in NMI context.

Ok, I understand now.

"The handler code should not enable asynchronous exceptions by clearing any of the
PSTATE.DAIF bits, and should not cause synchronous exceptions to the client Exception level."

What are you using this thing for?

Usecase is, a watchdog SPI interrupt, which we want to bound to a SDEI event. Below is the flow:

wdog expiry -> SDEI event -> HLOS panic -> trigger kexec/kdump

Thanks
Neeraj

Thanks,

James

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation

Next message: Yu Zhang: "Re: [PATCH v2 07/20] kvm: x86/mmu: Support zapping SPTEs in the TDP MMU"
Previous message: Chuck Lever: "Re: [PATCH] SUNRPC: fix copying of multiple pages in gss_read_proxy_verf()"
In reply to: James Morse: "Re: Queries on ARM SDEI Linux kernel code"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]