Re: [PATCH 1/2] tpm, tpm_tis: Handle interrupt storm

From: Jarkko Sakkinen
Date: Tue Jun 06 2023 - 12:36:05 EST


On Tue May 30, 2023 at 8:56 PM EEST, Jerry Snitselaar wrote:
> On Mon, May 29, 2023 at 09:46:08AM +0300, Péter Ujfalusi wrote:
> > Hi Lino,
> >
> > On 23/05/2023 23:46, Lino Sanfilippo wrote:
> > >> On the other hand any new functionality is objectively a maintanance
> > >> burden of some measure (applies to any functionality). So how do we know
> > >> that taking this change is less of a maintenance burden than just add
> > >> new table entries, as they come up?
> > >>
> > >
> > > Initially this set was created as a response to this 0-day bug report which you asked me
> > > to have a look at:
> > >
> > > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@xxxxxxxxxx/
> > >
> > > My hope was that it could also avoid some of (existing or future) DMI entries. But even if it does not
> > > (e.g. the problem Péter Ujfalusi reported with the UPX-i11 cannot be fixed by this patch set and thus
> > > needs the DMI quirk) we may at least avoid more bug reports due to interrupt storms once
> > > 6.4 is released.
> >
> > I'm surprised that there is a need for a storm detection in the first
> > place... Do we have something else on the same IRQ line on the affected
> > devices which might have a bug or no driver at all?
> > It is hard to believe that a TPM (Trusted Platform Module) is integrated
> > so poorly ;)
> >
> > But put that aside: I think the storm detection is good given that there
> > is no other way to know which machine have sloppy TPM integration.
> > There are machines where this happens, so it is a know integration
> > issue, right?
> >
> > My only 'nitpick' is with the printk level to be used.
> > The ERR level is not correct as we know the issue and we handle it, so
> > all is under control.
> > If we want to add these machines to the quirk list then WARN is a good
> > level to gain attention but I'm not sure if a user will know how to get
> > the machine in the quirk (where to file a bug).
> > If we only want the quirk to be used for machines like UPX-i11 which
> > simply just have broken (likely floating) IRQ line then the WARN is too
> > high level, INFO or even DBG would be appropriate as you are not going
> > to update the quirk, it is just handled under the hood (which is a great
> > thing, but on the other hand you will have the storm never the less and
> > that is not a nice thing).
> >
> > It is a matter on how this is going to be handled in a long term. Add
> > quirk for all the known machines with either stormy or plain broken IRQ
> > line or handle the stormy ones and quirk the broken ones only.
> >
> > >>> Detect an interrupt storm by counting the number of unhandled interrupts
> > >>> within a 10 ms time interval. In case that more than 1000 were unhandled
> > >>> deactivate interrupts, deregister the handler and fall back to polling.
> > >>
> > >> I know it can be sometimes hard to evaluate but can you try to explain
> > >> how you came up to the 10 ms sampling period and 1000 interrupt
> > >> threshold? I just don't like abritrary numbers.
> > >
> > > At least the 100 ms is not plucked out of thin air but its the same time period
> > > that the generic code in note_interrupt() uses - I assume for a good reason.
> > > Not only this number but the whole irq storm detection logic is taken from
> > > there:
> > >
> > >>
> > >>> This equals the implementation that handles interrupt storms in
> > >>> note_interrupt() by means of timestamps and counters in struct irq_desc.
> > >> The number of 1000 unhandled interrupts is still far below the 99900
> > used in
> > > note_interrupt() but IMHO enough to indicate that there is something seriously
> > > wrong with interrupt processing and it is probably saver to fall back to polling.
> >
> > Except that if the line got the spurious designation in core, the
> > interrupt line will be disabled while the TPM driver will think that it
> > is still using IRQ mode and will not switch to polling.
> >
> > A storm of 1000 is better than a storm of 99900 for sure but quirking
> > these would be the desired final solution. imho
>
> If that is the case, then output could probably be sent to the console
> detailing the dmi info needed to update the table.

+1

Good idea.

BR, Jarkko