Re: [PATCH 5.4 182/389] PCI/portdrv: Don't disable AER reporting in get_port_device_capability()

From: Stefan Roese
Date: Wed Aug 31 2022 - 01:57:23 EST


On 31.08.22 00:11, Bjorn Helgaas wrote:
[+cc Gregory, linux-wireless for iwlwifi issue]

On Tue, Aug 30, 2022 at 01:47:48PM -0700, Ben Greear wrote:
On 8/23/22 11:41 PM, Greg Kroah-Hartman wrote:
On Tue, Aug 23, 2022 at 07:20:14AM -0500, Bjorn Helgaas wrote:
On Tue, Aug 23, 2022, 6:35 AM Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
wrote:

From: Stefan Roese <sr@xxxxxxx>

[ Upstream commit 8795e182b02dc87e343c79e73af6b8b7f9c5e635 ]


There's an open regression related to this commit:

https://bugzilla.kernel.org/show_bug.cgi?id=216373

This is already in the following released stable kernels:
5.10.137 5.15.61 5.18.18 5.19.2

I'll go drop it from the 4.19 and 5.4 queues, but when this gets
resolved in Linus's tree, make sure there's a cc: stable on the fix so
that we know to backport it to the above branches as well. Or at the
least, a "Fixes:" tag.

This is still in 5.19.5. We saw some funny iwlwifi crashes in 5.19.3+
that we did not see in 5.19.0+. I just bisected the scary looking
AER errors to this patch, though I do not know for certain if it
causes the iwlwifi related crashes yet.

From reading the commit msg, this patch doesn't seem to be a great
candidate for stable. Does it fix some important problem?

I agree, I don't think this is a good candidate for stable. It has
already exposed latent amdgpu issues and we'll likely find more. It's
good to find and fix these things, but I'd rather do it in -rc than in
stable kernels.

I also agree. It was not my intention to have this patch added to
the stable branches. Frankly, I failed to intervene when I saw the
mails about its integration into stable a few weeks ago.

Still, I find it very interesting to see whether, and what, now pops up
with full AER enabled in such more complex (PCIe-wise) systems. I
expect more users to detect PCIe-related problems in their systems now.
This will definitely help fix some bugs, as already seen in the
AMD GPU thread. But again, this is not really stable material; it is
better suited to -next and -rc.

Thanks,
Stefan

It would be interesting to know whether similar crashes or AER reports
occur in v6.0-rc.

In case it helps, here is an example of what I see in dmesg. The
kernel crashes in iwlwifi had to do with rx messages from the
firmware, and some warnings led me to believe that PCI messages
were slow coming back and/or maybe duplicated. So maybe this AER
patch changes timing or otherwise screws up the PCI adapter boards
we use...

It shouldn't. This looks like a latent issue that happened before but
was ignored because we didn't have AER enabled at the switch that
detected the error.

[ 50.905809] iwlwifi 0000:04:00.0: AER: can't recover (no error_detected callback)
[ 50.905830] pcieport 0000:03:01.0: AER: device recovery failed
[ 50.905831] pcieport 0000:00:1c.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:01.0
[ 50.905845] pcieport 0000:03:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 50.915679] pcieport 0000:03:01.0: device [10b5:8619] error status/mask=00100000/00000000
[ 50.922735] pcieport 0000:03:01.0: [20] UnsupReq (First)
[ 50.928230] pcieport 0000:03:01.0: AER: TLP Header: 34000000 04001f10 00000000 88c888c8

This is an LTR message (Message Code 0x10), Requester ID 04:00.0. I
think the iwlwifi device at 04:00.0 sent the LTR message, and 03:01.0
(probably a Switch Downstream Port leading to bus 04) received it but
had LTR disabled. In that case, 03:01.0 would treat the LTR message
as an Unsupported Request.
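
For anyone who wants to reproduce the header decode above, here is a
minimal sketch. It assumes the standard PCIe TLP header layout (Fmt in
bits 31:29 and Type in bits 28:24 of the first DWORD; Requester ID, Tag
and Message Code in the second DWORD of a Message TLP). The function
name and the returned dict are made up for illustration and are not
kernel code.

```python
def decode_msg_tlp_header(dwords):
    """Decode the first two DWORDs of a PCIe Message TLP header.

    `dwords` is the list of 32-bit values as printed on the
    "AER: TLP Header:" line, in the same order.
    """
    dw0, dw1 = dwords[0], dwords[1]
    tlp_type = (dw0 >> 24) & 0x1f      # Type[4:0]
    req_id = (dw1 >> 16) & 0xffff      # bus[15:8], device[7:3], function[2:0]
    return {
        "fmt": (dw0 >> 29) & 0x7,      # 1 = 4DW header, no data
        "type": tlp_type,
        "is_msg": (tlp_type >> 3) == 0b10,  # Type 10rrr is a Message
        "routing": tlp_type & 0x7,     # rrr field; 4 = terminate at receiver
        "requester": "%02x:%02x.%x" % (req_id >> 8,
                                       (req_id >> 3) & 0x1f,
                                       req_id & 0x7),
        "tag": (dw1 >> 8) & 0xff,
        "msg_code": dw1 & 0xff,        # 0x10 = LTR
    }


# Values taken from the dmesg line quoted above.
hdr = decode_msg_tlp_header([0x34000000, 0x04001f10, 0x00000000, 0x88c888c8])
```

For the header quoted above this yields a Message TLP with routing 4
(terminate at receiver), Requester ID 04:00.0 and Message Code 0x10.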

The other errors below are the same but from different devices.

Does this happen during or after a suspend/resume? I assume no
hotplug involved. Can you collect the output of "sudo lspci -vv" so
we can see the LTR config for the entire path?
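
Once the "lspci -vv" output is saved to a file, the LTR enable bit per
device can be pulled out with something like the sketch below. It
assumes that each device block starts with the address at column 0 and
that the enable flag shows up as "LTR+" or "LTR-" on the "DevCtl2:"
line; the exact layout differs between pciutils versions, so treat the
patterns as assumptions. The function name is hypothetical.

```python
import re

# Device header line, e.g. "03:01.0 PCI bridge: ..." (optional domain prefix).
_DEV_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
# LTR flag as printed by lspci: "+" means set, "-" means clear.
_LTR_RE = re.compile(r"\bLTR([+-])")


def ltr_enabled_by_device(lspci_text):
    """Return {device address: bool} from the LTR flag on each DevCtl2 line."""
    result = {}
    current = None
    for line in lspci_text.splitlines():
        m = _DEV_RE.match(line)
        if m:
            current = m.group(1)
            continue
        # Only DevCtl2 carries the enable bit; DevCap2 reports support.
        if current and "DevCtl2:" in line:
            flag = _LTR_RE.search(line)
            if flag:
                result[current] = (flag.group(1) == "+")
    return result
```

A downstream device showing True while the port above it shows False
would match the situation described for 04:00.0 and 03:01.0.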

You can boot with "pci=noaer" to shut up the AER messages (that
shouldn't affect the parts of lspci output I'm interested in). Would
be interesting to know whether "pci=noaer" affects the iwlwifi
crashes, though.

[ 51.331638] ACPI: \: failed to evaluate _DSM bf0212f2-788f-c64d-a5b3-1f738e285ade (0x1001)
[ 51.345413] ACPI: \: failed to evaluate _DSM bf0212f2-788f-c64d-a5b3-1f738e285ade (0x1001)

These look like they're from iwlwifi:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireless/intel/iwlwifi/fw/acpi.c?id=v5.19#n13

No idea what this is about. Maybe unrelated, but the fact that Google
can't find anything with that UUID makes me think it might actually be
related. The UUID was only added to the message in v5.19-rc1 by
06eb8dc097b3 ("ACPI: utils: include UUID in _DSM evaluation warning"),
but that should be enough time to see some for a common device like
iwlwifi.

Too bad we print the GUID in a different byte order than GUID_INIT
takes, which makes it hard to search for, even in the Linux source.
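
To make the byte-order point concrete, here is a small sketch that
turns the printed form into GUID_INIT-style arguments. It assumes the
warning prints the 16 GUID bytes in memory order, and relies on
GUID_INIT storing its first three fields little-endian. The helper
name is made up for illustration.

```python
def printed_uuid_to_guid_init(printed):
    """Convert a UUID printed as raw bytes into GUID_INIT(a, b, c, d0..d7) text."""
    raw = bytes.fromhex(printed.replace("-", ""))
    if len(raw) != 16:
        raise ValueError("expected 16 bytes, got %d" % len(raw))
    # The first three fields are stored little-endian, so reverse them.
    a = int.from_bytes(raw[0:4], "little")
    b = int.from_bytes(raw[4:6], "little")
    c = int.from_bytes(raw[6:8], "little")
    # The last eight bytes are passed one at a time, in order.
    tail = ", ".join("0x%02X" % x for x in raw[8:])
    return "GUID_INIT(0x%08X, 0x%04X, 0x%04X, %s)" % (a, b, c, tail)


# UUID string taken from the ACPI warning quoted above.
guid_init = printed_uuid_to_guid_init("bf0212f2-788f-c64d-a5b3-1f738e285ade")
```

Under those assumptions the first argument becomes 0xF21202BF, which is
the string to grep for in the source rather than "bf0212f2".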

Bjorn

Best regards,
Stefan Roese

--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-51 Fax: (+49)-8142-66989-80 Email: sr@xxxxxxx