Re: [PATCH 1/5] PCI/switchtec: Error out MRPC execution when no GAS access

From: Kelvin.Cao
Date: Fri Oct 01 2021 - 19:49:25 EST


On Fri, 2021-10-01 at 14:29 -0600, Logan Gunthorpe wrote:
>
> On 2021-10-01 2:18 p.m., Bjorn Helgaas wrote:
> > On Fri, Sep 24, 2021 at 11:08:38AM +0000, kelvin.cao@xxxxxxxxxxxxx
> > wrote:
> > > From: Kelvin Cao <kelvin.cao@xxxxxxxxxxxxx>
> > >
> > > After a firmware hard reset, MRPC command executions, which are based
> > > on the PCI BAR (which Microchip refers to as GAS) read/write, will hang
> > > indefinitely. This is because after a reset, the host will fail all GAS
> > > reads (get all 1s), in which case the driver won't get a valid MRPC
> > > status.
> >
> > Trying to write a merge commit log for this, but having a hard time
> > summarizing it. It sounds like it covers both Switchtec-specific
> > (firmware and MRPC commands) and generic PCIe behavior (MMIO read
> > failures).
> >
> > This has something to do with a firmware hard reset. What is that?
> > Is that like a firmware reboot? A device reset, e.g., FLR or
> > secondary bus reset, that causes a firmware reboot? A device reset
> > initiated by firmware?
A firmware hard reset can be triggered by issuing a reset command to the
firmware, which makes it reboot.
> > Anyway, apparently when that happens, MMIO reads to the switch fail
> > (timeout or error completion on PCIe) for a while. If a device reset
> > is involved, that much is standard PCIe behavior. And the driver sees
> > ~0 data from those failed reads. That's not part of the PCIe spec,
> > but is typical root complex behavior.
> >
> > But you said the MRPC commands hang indefinitely. Presumably MMIO
> > reads would start succeeding eventually when the device becomes ready,
> > so I don't know how that translates to "indefinitely."
>
> I suspect Kelvin can expand on this and fix the issue below. But in my
> experience, the MMIO will read ~0 forever after a firmware reset, until
> the system is rebooted. Presumably on systems that have good hot plug
> support they are supposed to recover. Though I've never seen that.

This is also my observation: all MMIO reads will fail (~0 returned)
until the system is rebooted or a PCI rescan is performed.

> The MMIO read that signals the MRPC status always returns ~0 and the
> userspace request will eventually time out.

The problem in this case is that, in DMA MRPC mode, the status (in host
memory) is always initialized to 'in progress', and it's up to the
firmware to update it to 'done' after it has executed the command.
After a firmware reset, the firmware can no longer be triggered to start
an MRPC command, so the status in host memory stays 'in progress' as far
as the driver can tell, which prevents the MRPC from timing out. I should
have included this in the message.
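
To make the stuck status concrete, here is a minimal userspace sketch (not
the actual patch). All names are hypothetical, the firmware is modeled as a
plain function call, and -ENODEV is only an assumed error code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical status values, named after the states described above. */
enum fake_mrpc_status { FAKE_MRPC_IN_PROGRESS, FAKE_MRPC_DONE };

struct fake_switch {
	bool firmware_alive;			/* false after a firmware reset */
	enum fake_mrpc_status dma_status;	/* lives in host memory */
};

/* Only live firmware ever moves the status to 'done'. */
static void fake_firmware_execute(struct fake_switch *sw)
{
	if (sw->firmware_alive)
		sw->dma_status = FAKE_MRPC_DONE;
}

/* Without a check, a dead firmware leaves the status 'in progress'. */
static int fake_mrpc_submit_unchecked(struct fake_switch *sw)
{
	sw->dma_status = FAKE_MRPC_IN_PROGRESS;
	fake_firmware_execute(sw);
	return 0;
}

/* With a check, the command errors out instead of being queued. */
static int fake_mrpc_submit_checked(struct fake_switch *sw)
{
	if (!sw->firmware_alive)
		return -ENODEV;	/* assumed error code, for illustration only */
	return fake_mrpc_submit_unchecked(sw);
}
```

The checked variant fails fast, so the caller never waits on a status word
that the firmware will not update.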
>
> > Weird to refer to a PCI BAR as "GAS". Maybe expanding the acronym
> > would help it make sense.
> GAS is the term used by the firmware developers and is in all their
> documentation. It stands for Global Address Space.
>
> > What does "host" refer to? I guess it's the switch (the
> > switchtec_dev), since you say it fails MMIO reads?
>
> Yes, a bit confusing. The firmware is dead or not setup right so MMIO
> reads are not succeeding and the root complex is returning ~0 to the
> driver on reads.
Ditto. Will update in v2.
>
> Logan