Re: [PATCH v4 2/2] PCI: Enable NO_BUS_RESET quirk for Nvidia GPUs

From: Bjorn Helgaas
Date: Fri Apr 30 2021 - 13:01:57 EST


On Wed, Apr 28, 2021 at 07:49:07PM -0500, Shanker Donthineni wrote:
> On select platforms, some Nvidia GPU devices do not work with SBR.
> Triggering SBR would leave the device inoperable for the current
> system boot. It requires a system hard-reboot to get the GPU device
> back to normal operating condition post-SBR. For the affected
> devices, enable NO_BUS_RESET quirk to fix the issue.

Since 1/2 adds _RST support, should I infer that _RST works on these
Nvidia GPUs even though SBR does not? If so, how does _RST do the
reset?

Do you have a root cause for why SBR doesn't work? I'm not super
confident that we perform resets correctly in general, and if the
problem is an issue in Linux, it'd be nice to fix that.

> This issue will be fixed in the next generation of hardware.
>
> Signed-off-by: Shanker Donthineni <sdonthineni@xxxxxxxxxx>
> ---
> Changes since v1:
> - Split patch into 2, code for handling _RST and SBR specific quirk
> - The RST based reset is called as a first-class mechanism in the reset code path
>
> drivers/pci/quirks.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 8f47d139c381..e1216a8165df 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3910,6 +3910,18 @@ static int delay_250ms_after_flr(struct pci_dev *dev, int probe)
> return 0;
> }
>
> +/*
> + * Some Nvidia GPU devices do not work with bus reset, SBR needs to be
> + * prevented for those affected devices.
> + */
> +static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
> +{
> + if ((dev->device & 0xffc0) == 0x2340)
> + dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> +}
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> + quirk_nvidia_no_bus_reset);

Can you move this next to the existing quirk_no_bus_reset(), and maybe
even just call quirk_no_bus_reset(), e.g.,

if ((dev->device & 0xffc0) == 0x2340)
quirk_no_bus_reset(dev);

It doesn't look connected to this spot.

> static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
> { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
> reset_intel_82599_sfp_virtfn },
> --
> 2.17.1
>