Re: [REGRESSION][BISECTED] 5.15-rc1: Broken AHCI on NVIDIA ION (MCP79)

From: Rui Salvaterra
Date: Thu Oct 07 2021 - 10:50:33 EST


Hi again, Marc,

On Thu, 7 Oct 2021 at 15:42, Marc Zyngier <maz@xxxxxxxxxx> wrote:
>
> Right. Let's see if we can be less brutal and only quirk the AHCI
> device (patch below, completely untested). I'm a bit concerned that
> all the devices in this system seem to report 'Maskable-'...

True. However…

rui@vedder:~$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:        124          0          0          0  IO-APIC  2-edge        timer
  1:          0          0          0          0  IO-APIC  1-edge        i8042
  8:          0          0          0          1  IO-APIC  8-edge        rtc0
  9:          0          0          0          0  IO-APIC  9-fasteoi     acpi
 12:          0          1          0          0  IO-APIC  12-edge       i8042
 20:          0          0      12734     852750  IO-APIC  20-fasteoi    ehci_hcd:usb2, enp0s10
 21:         25          0          0          0  IO-APIC  21-fasteoi    ohci_hcd:usb4
 22:      25672        288          0          0  IO-APIC  22-fasteoi    ehci_hcd:usb1
 23:          0          0          0        709  IO-APIC  23-fasteoi    ohci_hcd:usb3, snd_hda_intel:card0
 29:          0          0      83164       1779  PCI-MSI  1572864-edge  nvkm
 30:       3595       5645          0          0  PCI-MSI  180224-edge   ahci[0000:00:0b.0]
NMI:          0          0          0          0  Non-maskable interrupts
LOC:     202323     194669     107282     197322  Local timer interrupts
SPU:          0          0          0          0  Spurious interrupts
PMI:          0          0          0          0  Performance monitoring interrupts
IWI:          0          0          0          0  IRQ work interrupts
RTR:          0          0          0          0  APIC ICR read retries
RES:        179        995        208        273  Rescheduling interrupts
CAL:       1149       1495        949       1211  Function call interrupts
TLB:        110         76         79         79  TLB shootdowns
TRM:          0          0          0          0  Thermal event interrupts
THR:          0          0          0          0  Threshold APIC interrupts
MCE:          0          0          0          0  Machine check exceptions
MCP:         20         20         20         20  Machine check polls
ERR:          1
MIS:          0
PIN:          0          0          0          0  Posted-interrupt notification event
NPI:          0          0          0          0  Nested posted-interrupt event
PIW:          0          0          0          0  Posted-interrupt wakeup event
rui@vedder:~$

… the only devices using MSIs are the AHCI controller and the GPU, so
I think any damage would be more contained (and obvious), in this
case.

>
> M.
>
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 0099a00af361..2f9ec7210991 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -479,6 +479,9 @@ msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity *affd)
> goto out;
>
> pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &control);
> + /* Lies, damned lies, and MSIs */

Best comment ever. :)

> + if (dev->dev_flags & PCI_DEV_FLAGS_HAS_MSI_MASKING)
> + control |= PCI_MSI_FLAGS_MASKBIT;
>
> entry->msi_attrib.is_msix = 0;
> entry->msi_attrib.is_64 = !!(control & PCI_MSI_FLAGS_64BIT);
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 4537d1ea14fd..dc7741431bf3 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5795,3 +5795,9 @@ static void apex_pci_fixup_class(struct pci_dev *pdev)
> }
> DECLARE_PCI_FIXUP_CLASS_HEADER(0x1ac1, 0x089a,
> PCI_CLASS_NOT_DEFINED, 8, apex_pci_fixup_class);
> +
> +static void nvidia_ion_ahci_fixup(struct pci_dev *pdev)
> +{
> + pdev->dev_flags |= PCI_DEV_FLAGS_HAS_MSI_MASKING;
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0ab8, nvidia_ion_ahci_fixup);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index cd8aa6fce204..152a4d74f87f 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -233,6 +233,8 @@ enum pci_dev_flags {
> PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
> /* Don't use Relaxed Ordering for TLPs directed at this device */
> PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
> + /* Device does honor MSI masking despite saying otherwise */
> + PCI_DEV_FLAGS_HAS_MSI_MASKING = (__force pci_dev_flags_t) (1 << 12),
> };
>
> enum pci_irq_reroute_variant {
>
>
> --

I'll take this one for a ride too and report back.

Thanks,
Rui