Re: [PATCH v3] PCI: cadence: Fix Gen2 Link Retraining process

From: Siddharth Vadapalli
Date: Thu Jun 08 2023 - 00:02:32 EST


Hello Lorenzo,

Thank you for reviewing this patch.

On 07/06/23 15:53, Lorenzo Pieralisi wrote:
> On Wed, Jun 07, 2023 at 02:44:27PM +0530, Siddharth Vadapalli wrote:
>> The Link Retraining process is initiated to account for the Gen2 defect in
>> the Cadence PCIe controller in J721E SoC. The errata corresponding to this
>> is i2085, documented at:
>> https://www.ti.com/lit/er/sprz455c/sprz455c.pdf
>>
>> The existing workaround implemented for the errata waits for the Data Link
>> initialization to complete and assumes that the link retraining process
>> at the Physical Layer has completed. However, it is possible that the
>> Physical Layer training might be ongoing as indicated by the
>> PCI_EXP_LNKSTA_LT bit in the PCI_EXP_LNKSTA register.
>>
>> Fix the existing workaround, to ensure that the Physical Layer training
>> has also completed, in addition to the Data Link initialization.
>>
>> Fixes: 4740b969aaf5 ("PCI: cadence: Retrain Link to work around Gen2 training defect")
>> Signed-off-by: Siddharth Vadapalli <s-vadapalli@xxxxxx>
>> Reviewed-by: Vignesh Raghavendra <vigneshr@xxxxxx>
>> ---
>>
>> Hello,
>>
>> This patch is based on linux-next tagged next-20230606.
>>
>> v2:
>> https://lore.kernel.org/r/20230315070800.1615527-1-s-vadapalli@xxxxxx/
>> Changes since v2:
>> - Merge the cdns_pcie_host_training_complete() function with the
>> cdns_pcie_host_wait_for_link() function, as suggested by Bjorn
>> for the v2 patch.
>> - Add dev_err() to notify when Link Training fails, since this is a
>> fatal error and proceeding from this point will almost always crash
>> the kernel.
>>
>> v1:
>> https://lore.kernel.org/r/20230102075656.260333-1-s-vadapalli@xxxxxx/
>> Changes since v1:
>> - Collect Reviewed-by tag from Vignesh Raghavendra.
>> - Rebase on next-20230315.
>>
>> Regards,
>> Siddharth.
>>
>> .../controller/cadence/pcie-cadence-host.c | 20 +++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/pci/controller/cadence/pcie-cadence-host.c b/drivers/pci/controller/cadence/pcie-cadence-host.c
>> index 940c7dd701d6..70a5f581ff4f 100644
>> --- a/drivers/pci/controller/cadence/pcie-cadence-host.c
>> +++ b/drivers/pci/controller/cadence/pcie-cadence-host.c
>> @@ -12,6 +12,8 @@
>>
>> #include "pcie-cadence.h"
>>
>> +#define LINK_RETRAIN_TIMEOUT HZ
>> +
>> static u64 bar_max_size[] = {
>> [RP_BAR0] = _ULL(128 * SZ_2G),
>> [RP_BAR1] = SZ_2G,
>> @@ -80,8 +82,26 @@ static struct pci_ops cdns_pcie_host_ops = {
>> static int cdns_pcie_host_wait_for_link(struct cdns_pcie *pcie)
>> {
>> struct device *dev = pcie->dev;
>> + unsigned long end_jiffies;
>> + u16 link_status;
>> int retries;
>>
>> + /* Wait for link training to complete */
>> + end_jiffies = jiffies + LINK_RETRAIN_TIMEOUT;
>> + do {
>> + link_status = cdns_pcie_rp_readw(pcie, CDNS_PCIE_RP_CAP_OFFSET + PCI_EXP_LNKSTA);
>> + if (!(link_status & PCI_EXP_LNKSTA_LT))
>> + break;
>
> You can use a bool variable eg link_trained and use that below.

Sure, I will do that. I will set
link_trained = !(link_status & PCI_EXP_LNKSTA_LT) inside the do-while loop and
test it both to exit the loop and in the check below.
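
For illustration only, here is a minimal userspace sketch of that loop shape.
It is not the driver code: fake_read_lnksta() is a hypothetical stand-in for
cdns_pcie_rp_readw(), and the max_polls counter is an assumed substitute for
the jiffies-based deadline. The value of PCI_EXP_LNKSTA_LT matches the
definition in pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_EXP_LNKSTA_LT 0x0800 /* Link Training bit, as in pci_regs.h */

/*
 * Hypothetical stand-in for the Root Port register read: reports
 * "training in progress" for the first busy_polls reads, then clear.
 */
struct fake_rp {
	int busy_polls;
};

static uint16_t fake_read_lnksta(struct fake_rp *rp)
{
	if (rp->busy_polls > 0) {
		rp->busy_polls--;
		return PCI_EXP_LNKSTA_LT;
	}
	return 0;
}

/*
 * Poll Link Status until the LT bit clears or the poll budget runs out.
 * max_polls is an assumption standing in for the jiffies deadline.
 * Returns true if training completed, false on timeout.
 */
static bool wait_for_link_training(struct fake_rp *rp, int max_polls)
{
	bool link_trained = false;
	int polls = 0;

	do {
		uint16_t link_status = fake_read_lnksta(rp);

		link_trained = !(link_status & PCI_EXP_LNKSTA_LT);
		if (link_trained)
			break;
		/* The driver would sleep here, e.g. with usleep_range(). */
	} while (++polls < max_polls);

	return link_trained;
}
```

The caller then only needs to test the returned bool once, instead of
re-reading or re-masking link_status after the loop.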

>
>> + usleep_range(0, 1000);
>> + } while (time_before(jiffies, end_jiffies));
>> +
>> + if (!(link_status & PCI_EXP_LNKSTA_LT)) {
>> + dev_info(dev, "Link training complete\n");
>> + } else {
>> + dev_err(dev, "Fatal! Link training incomplete\n");
>> + return -ETIMEDOUT;
>> + }
>
> I don't necessarily see the reason why you are adding additional
> logging, more so given that this now does not affect just the
> workaround but all cadence controllers.
>
> Actually, is that something you have tested and considered ?

I agree that I could have performed the Link Training check only when the Gen2
Link Retraining Quirk is set for the RC. However, Link Training has to complete
irrespective of whether the Quirk exists, so I preferred to add the check
unconditionally. The race condition responsible for the crash is the following:
if the PCI subsystem calls cdns_pci_map_bus() to access the End Point's
registers (when an EP device is connected) before the Physical Layer link
training has completed, the access results in a crash. This is primarily
observed on RT Linux, where the call to cdns_pci_map_bus() occurs quite early,
before the Physical Layer link training is complete. Therefore, whether the
Physical Layer link training occurs only once in the default flow or a second
time due to the Gen2 Link Retraining Quirk, it appears to me that the crash
could occur in both cases if we don't wait for the training to complete.

Please let me know if this sounds acceptable. If not, I will check if the quirk
is set before proceeding to verify link training completion and implement this
in the v4 patch.
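
To make the two options concrete, here is a small userspace model, offered as
a sketch under stated assumptions rather than as driver code. struct fake_rc
and both helper functions are hypothetical; the quirk_retrain_flag field name
is assumed to mirror the flag the driver keeps for the Root Complex, and
training_done models "PCI_EXP_LNKSTA_LT cleared before the timeout".

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a Root Complex; not the driver's own structure. */
struct fake_rc {
	bool quirk_retrain_flag; /* Gen2 Link Retraining Quirk applies */
	bool training_done;      /* LT bit cleared before the timeout */
};

/* Option A (this patch): always require link training to have completed. */
static int wait_unconditional(const struct fake_rc *rc)
{
	return rc->training_done ? 0 : -ETIMEDOUT;
}

/* Option B (possible v4): check training only when the quirk is set. */
static int wait_quirk_only(const struct fake_rc *rc)
{
	if (!rc->quirk_retrain_flag)
		return 0;
	return rc->training_done ? 0 : -ETIMEDOUT;
}
```

In this model the two options differ only for a controller without the quirk
whose training has not finished: Option A returns -ETIMEDOUT, while Option B
returns 0 and lets the caller proceed to the link-up check.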

>
> Thanks,
> Lorenzo
>
>> +
>> /* Check if the link is up or not */
>> for (retries = 0; retries < LINK_WAIT_MAX_RETRIES; retries++) {
>> if (cdns_pcie_link_up(pcie)) {
>> --
>> 2.25.1
>>

--
Regards,
Siddharth.