Re: [PATCH v2 0/9] PCI: rockchip: Fix RK3399 PCIe endpoint controller driver

From: Rick Wertenbroek
Date: Thu Apr 13 2023 - 10:35:25 EST


On Thu, Apr 13, 2023 at 3:49 PM Lorenzo Pieralisi <lpieralisi@xxxxxxxxxx> wrote:
>
> On Fri, Mar 17, 2023 at 07:09:04AM +0900, Damien Le Moal wrote:
> > On 3/17/23 01:34, Rick Wertenbroek wrote:
> > >>> By the way, enabling the interrupts to see the error notifications, I do see a
> > >>> lot of retry timeout and other recoverable errors. So the issues I am seeing
> > >>> could be due to my PCI cable setup that is not ideal (bad signal, ground loops,
> > >>> ... ?). Not sure. I do not have a PCI analyzer handy :)
> > >
> > > I have enabled the IRQs and messages thanks to your patches but I don't get
> > > messages from the IRQs (it seems no IRQs are fired). My PCIe link seems stable.
> > > The main issue I face is still that after a random amount of time, the BARs are
> > > reset to 0, I don't have a PCIe analyzer so I cannot chase config space TLPs
> > > (e.g., host writing the BAR values to the config header), but I don't think that
> > > the problem comes from a TLP issued from the host. (it might be).
> >
> > Hmmm... I am getting lots of IRQs, especially the ones signaling "replay timer
> > timed out" and "replay timer rolled over after 4 transmissions of the same TLP"
> > but also some "phy error detected on receive side"... Need to try to rework my
> > cable setup I guess.
> >
> > As for the BARs being reset to 0, I have not checked, but it may be why I see
> > things not working after some inactivity. Will check that. We may be seeing the
> > same regarding that.
> >
> > > I don't think it's a buffer overflow / out-of-bounds access by kernel
> > > code for two reasons
> > > 1) The values in the config space around the BARs is coherent and unchanged
> > > 2) The bars are reset to 0 and not a random value
> > >
> > > I suspect a hardware reset of those registers issued internally in the
> > > PCIe controller,
> > > I don't know why (it might be a link related event or power state
> > > related event).
> > >
> > > I have also experienced very slow behavior with the PCI endpoint test driver,
> > > e.g., pcitest -w 1024 -d would take tens of seconds to complete. It seems to
> > > come from LCRC errors, when I check the "LCRC Error count register"
> > > @0xFD90'0214 I can see it drastically increase between two calls of pcitest
> > > (when I mean drastically it means by 6607 (0x19CF) for example).
> > >
> > > The "ECC Correctable Error Count Register" @0xFD90'0218 reads 0 though.
> > >
> > > I have tried to shorten the cabling by removing one of the PCIe extenders, that
> > > didn't change the issues much.
> > >
> > > Any ideas as to why I see a large number of TLPs with LCRC errors in them ?
> > > Do you experience the same ? What are your values in 0xFD90'0214 when
> > > running e.g., pcitest -w 1024 -d (note: you can reset the counter by writing
> > > 0xFFFF to it in case it reaches the maximum value of 0xFFFF).
> >
> > I have not checked. But I will look at these counters to see what I have there.
>
> Hi,
>
> checking where are we with this thread and whether there is something to
> consider for v6.4, if testing succeeds.
>
> Thanks,
> Lorenzo

Hello,
Thank you for considering this.

There is a V3 of this patch series [1] that fixes the issues
encountered with the V2.
The debugging that followed this thread was discussed off-list with Damien Le Moal.
The V3 has been tested successfully by Damien Le Moal [2].

I will submit a V4 next week, since minor changes were suggested in the
threads for the V3 (mostly code style, macros, and comments).

I hope it can be considered for v6.4, thank you.

[1] https://lore.kernel.org/linux-pci/29a5ccc3-d2c8-b844-a333-28bc20657942@xxxxxxxxxxxx/T/#mc8f2589ff04862175cb0c906b38cb37a90db0e42
[2] https://lore.kernel.org/linux-pci/29a5ccc3-d2c8-b844-a333-28bc20657942@xxxxxxxxxxxx/


Notes on what was discovered off-list:

The BAR reset issues were due to power supply problems (the PCI cable was
connecting the host 3V3 supply to the SoC 3V3 supply) and are fixed with
proper cabling.
A few LCRC errors are normal with PCIe; their number depends on signal
integrity at the physical layer (cabling).
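
For anyone comparing LCRC counts between two pcitest runs, here is a rough
sketch (not part of the patch series) of the bookkeeping described earlier in
this thread. It takes two values already read from the "LCRC Error count
register" and reports the increase. The function name and the assumption that
the counter stops at 0xFFFF until it is cleared are mine; reading the register
itself is left out.

```python
# Hypothetical helper: compares two snapshots of the LCRC error counter.
# Assumption: the counter is 16 bits wide and stays at 0xFFFF once it
# reaches that value, until it is cleared by software.
LCRC_COUNTER_MAX = 0xFFFF


def lcrc_delta(before, after):
    """Return (delta, saturated) for two counter snapshots.

    delta     -- number of LCRC errors counted between the two reads
    saturated -- True if 'after' is at the maximum, in which case delta
                 is only a lower bound on the real number of errors
    """
    for value in (before, after):
        if not 0 <= value <= LCRC_COUNTER_MAX:
            raise ValueError("snapshot out of range for a 16-bit counter")
    if after < before:
        raise ValueError("counter decreased; was it cleared between reads?")
    return after - before, after == LCRC_COUNTER_MAX


if __name__ == "__main__":
    # Example using the increase mentioned in this thread (0x19CF).
    delta, saturated = lcrc_delta(0x0000, 0x19CF)
    print(delta, saturated)
```

If the second snapshot is at 0xFFFF, the counter should be cleared before the
next measurement, otherwise the reported increase is only a lower bound.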


Best regards,
Rick