Re: QCA6174 pcie wifi: Add pci quirks

From: Pali Rohár
Date: Tue Jun 08 2021 - 14:43:59 EST


Hello! So should I add also 0x003e device id in next patch iteration?

On Saturday 05 June 2021 16:46:36 Ingmar Klein wrote:
> Hi Pali and Bjorn,
>
> finally found the time to test.
> Pali's v3 patch seems to work like a charm for my card with "0x003e" id
> as well.
> Just finished compiling a pve-kernel v5.11.21 with Pali's patch,
> slightly adjusted for my test card and the Ubuntu kernel source (no
> functional differences, just minor adjustments to make it fit the
> Proxmox pve-kernel).
>
> System works just fine, in contrast to without patch. Of course, no long
> term tests, yet. However, it is looking really good.
> Thanks guys!
>
> Best regards,
> Ingmar
>
>
> Am 28.05.2021 um 20:47 schrieb Ingmar Klein:
> > Hi Pali,
> > sorry for not checking that detail!
> > Of course no problem that you couldn't test that ID. Will be glad to
> > do so.
> >
> > I'll let you know how this turns out.
> >
> > Best regards,
> > Ingmar
> >
> >
> > Am 28.05.2021 um 20:21 schrieb Pali Rohár:
> > > Hello Ingmar!
> > >
> > > Now I see that in your patch you have Atheros card with id 0x003e:
> > > https://lore.kernel.org/linux-pci/08982e05-b6e8-5a8d-24ab-da1488ee50a8@xxxxxx/
> > >
> > >
> > > With my patch I have tested 5 different Atheros cards but none has id
> > > 0x003e:
> > > https://lore.kernel.org/linux-pci/20210505163357.16012-1-pali@xxxxxxxxxx/
> > >
> > >
> > > So my patch does not fix that issue for your 0x003e card. I just do not
> > > have such card for testing.
> > >
> > > Could you try to apply my patch and then add your id 0x003e into quirk
> > > list if it helps?
> > >
> > > On Friday 28 May 2021 20:08:52 Ingmar Klein wrote:
> > > > Thanks to both of you, Bjorn and Pali!
> > > > I had hoped that Pali would come with an appropriate fix. Good to know,
> > > > that this is taken care of.
> > > >
> > > > Will test ASAP, but I am confident, that it will work anyway.
> > > > Should it unexpectedly not fix my issues, I'll let you know.
> > > > Have a nice weekend!
> > > > Best regards,
> > > > Ingmar
> > > >
> > > >
> > > > Am 26.05.2021 um 00:12 schrieb Bjorn Helgaas:
> > > > > On Thu, Apr 15, 2021 at 09:53:38PM +0200, Pali Rohár wrote:
> > > > > > Hello!
> > > > > >
> > > > > > On Thursday 15 April 2021 13:01:19 Alex Williamson wrote:
> > > > > > > [cc +Pali]
> > > > > > >
> > > > > > > On Thu, 15 Apr 2021 20:02:23 +0200
> > > > > > > Ingmar Klein <ingmar_klein@xxxxxx> wrote:
> > > > > > >
> > > > > > > > First thanks to you both, Alex and Bjorn!
> > > > > > > > I am in no way an expert on this topic, so I have to fully rely
> > > > > > > > on your
> > > > > > > > feedback, concerning this issue.
> > > > > > > >
> > > > > > > > If you should have any other solution approach, in form of
> > > > > > > > patch-set, I
> > > > > > > > would be glad to test it out. Just let me know, what you think
> > > > > > > > might
> > > > > > > > make sense.
> > > > > > > > I will wait for your further feedback on the issue. In the
> > > > > > > > meantime I
> > > > > > > > have my current workaround via quirk entry.
> > > > > > > >
> > > > > > > > By the way, my layman's question:
> > > > > > > > Do you think, that the following topic might also apply for the
> > > > > > > > QCA6174?
> > > > > > > > https://www.spinics.net/lists/linux-pci/msg106395.html
> > > > > > I have been testing more ath cards and I'm going to send a new
> > > > > > version
> > > > > > of this patch with including more PCI ids.
> > > > > Dropping this patch in favor of Pali's new version.
> > > > >
> > > > > > > > Or in other words, should a similar approach be tried for the
> > > > > > > > QCA6174
> > > > > > > > and if yes, would it bring any benefit at all?
> > > > > > > > I hope you can excuse me, in case the questions should not make
> > > > > > > > too much
> > > > > > > > sense.
> > > > > > > If you run lspci -vvv on your device, what do LnkCap and LnkSta
> > > > > > > report
> > > > > > > under the express capability?  I wonder if your device even supports
> > > > > > > > Gen1 speeds, mine does not.
> > > > > > > I would not expect that patch to be relevant to you based on your
> > > > > > > report.  I understand it to resolve an issue during link
> > > > > > > retraining to a
> > > > > > > higher speed on boot, not during a bus reset.  Pali can correct
> > > > > > > if I'm
> > > > > > > wrong.  Thanks,
> > > > > > These two issues are are related. Both operations (PCIe Hot Reset and
> > > > > > PCIe Link Retraining) cause reset of ath chips. Seems that they cause
> > > > > > double reset. After reset these chips reads configuration from
> > > > > > internal
> > > > > > EEPROM/OTP and if another reset is triggered prior chip finishes
> > > > > > internal configuration read then it stops working. My testing showed
> > > > > > that ath10k chips completely disappear from the PCIe bus, some ath9k
> > > > > > chips works fine but starts reporting incorrect PCI ID (0xABCD)
> > > > > > and some
> > > > > > other ath9k chips reports correct PCI ID but does not work. I had
> > > > > > discussion with Adrian Chadd who knows probably everything about
> > > > > > ath9k
> > > > > > and confirmed me that this issue is there with ath9k and ath10k
> > > > > > chips.
> > > > > >
> > > > > > He wrote me that workaround to turn card back from this "broken"
> > > > > > state
> > > > > > is to do PCIe Cold Reset of the card, which means turning power
> > > > > > supply
> > > > > > off for particular PCIe slot. Such thing is not supported on many
> > > > > > low-end boards, so workaround cannot be applied.
> > > > > >
> > > > > > I was able to recover my testing cards from this "broken" state by
> > > > > > PCIe
> > > > > > Warm Reset (= reset via PERST# pin).
> > > > > >
> > > > > > I have tried many other reset methods (PCIe PM reset, Link Down, PCIe
> > > > > > Hot Reset with bigger internal, ...) but nothing worked. So seems
> > > > > > that
> > > > > > the only workaround is to do PCIe Cold Reset or PCIe Warm Reset.
> > > > > >
> > > > > > I will send V2 of my patch with details and explanation.
> > > > > >
> > > > > > As kernel does not have API for doing PCIe Warm Reset, I think is
> > > > > > another argument why kernel really needs it.
> > > > > >
> > > > > > I do not have any QCA6174 card for testing, but based on the fact I
> > > > > > reproduced this issue with more ath9k and ath10 cards and Adrian
> > > > > > confirmed that above reset issue is there, I think that it affects
> > > > > > all
> > > > > > AR9xxx and QCAxxxx cards handled by ath9k and ath10 drivers.
> > > > > >
> > > > > > I was told that AMI BIOS was patching their BIOSes found in
> > > > > > notebooks to
> > > > > > avoid triggering this issue on notebooks ath9k cards.
> > > > > >
> > > > > > > Alex
> > > > > > >
> > > > > > > > Am 15.04.2021 um 04:36 schrieb Alex Williamson:
> > > > > > > > > On Wed, 14 Apr 2021 16:03:50 -0500
> > > > > > > > > Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > > [+cc Alex]
> > > > > > > > > >
> > > > > > > > > > On Fri, Apr 09, 2021 at 11:26:33AM +0200, Ingmar Klein wrote:
> > > > > > > > > > > Edit: Retry, as I did not consider, that my mail-client would
> > > > > > > > > > > make this
> > > > > > > > > > > party html.
> > > > > > > > > > >
> > > > > > > > > > > Dear maintainers,
> > > > > > > > > > > I recently encountered an issue on my Proxmox server system,
> > > > > > > > > > > that
> > > > > > > > > > > includes a Qualcomm QCA6174 m.2 PCIe wifi module.
> > > > > > > > > > > https://deviwiki.com/wiki/AIRETOS_AFX-QCA6174-NX
> > > > > > > > > > >
> > > > > > > > > > > On system boot and subsequent virtual machine start (with
> > > > > > > > > > > passed-through
> > > > > > > > > > > QCA6174), the VM would just freeze/hang, at the point where
> > > > > > > > > > > the ath10k
> > > > > > > > > > > driver loads.
> > > > > > > > > > > Quick search in the proxmox related topics, brought me to the
> > > > > > > > > > > following
> > > > > > > > > > > discussion, which suggested a PCI quirk entry for the QCA6174
> > > > > > > > > > > in the kernel:
> > > > > > > > > > > https://forum.proxmox.com/threads/pcie-passthrough-freezes-proxmox.27513/
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I then went ahead, got the Proxmox kernel source (v5.4.106)
> > > > > > > > > > > and applied
> > > > > > > > > > > the attached patch.
> > > > > > > > > > > Effect was as hoped, that the VM hangs are now gone. System
> > > > > > > > > > > boots and
> > > > > > > > > > > runs as intended.
> > > > > > > > > > >
> > > > > > > > > > > Judging by the existing quirk entries for Atheros, I would
> > > > > > > > > > > think, that
> > > > > > > > > > > my proposed "fix" could be included in the vanilla kernel.
> > > > > > > > > > > As far as I saw, there is no entry yet, even in the latest
> > > > > > > > > > > kernel sources.
> > > > > > > > > > This would need a signed-off-by; see
> > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=v5.11#n361
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This is an old issue, and likely we'll end up just applying
> > > > > > > > > > this as
> > > > > > > > > > yet another quirk.  But looking at c3e59ee4e766 ("PCI: Mark
> > > > > > > > > > Atheros
> > > > > > > > > > AR93xx to avoid bus reset"), where it started, it seems to be
> > > > > > > > > > connected to 425c1b223dac ("PCI: Add Virtual Channel to
> > > > > > > > > > save/restore
> > > > > > > > > > support").
> > > > > > > > > >
> > > > > > > > > > I'd like to dig into that a bit more to see if there are any
> > > > > > > > > > clues.
> > > > > > > > > > AFAIK Linux itself still doesn't use VC at all, and
> > > > > > > > > > 425c1b223dac added
> > > > > > > > > > a fair bit of code.  I wonder if we're restoring something out of
> > > > > > > > > > order or making some simple mistake in the way to restore VC
> > > > > > > > > > config.
> > > > > > > > > I don't really have any faith in that bisect report in commit
> > > > > > > > > c3e59ee4e766.  To double check I dug out the card from that
> > > > > > > > > commit,
> > > > > > > > > installed an old Fedora release so I could build kernel v3.13,
> > > > > > > > > pre-dating 425c1b223dac and tested triggering a bus reset both via
> > > > > > > > > setpci and by masking PM reset so that sysfs can trigger the
> > > > > > > > > bus reset
> > > > > > > > > path with the kernel save/restore code.  Both result in the system
> > > > > > > > > hanging when the device is accessed either restoring from the
> > > > > > > > > kernel
> > > > > > > > > bus reset or reading from the device after the setpci reset. 
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Alex
> > > > > > > > >