Re: Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

From: Mario Limonciello
Date: Tue Nov 07 2023 - 16:27:18 EST


+linux-pci / Bjorn
On 11/7/2023 06:17, Takashi Sakamoto wrote:
Hi Mario,

Thanks for the report.

I apologize for the inconvenience you and your reporters are facing;
however, I have to say that the problem appears to be specific to AMD
Ryzen machines.

Unfortunately I don't have this 1394 hardware myself. I was looking at a completely unrelated issue on Bugzilla, noticed this report come up in my search, and wanted to make sure it's on your radar as the author, since it has lingered a while.


I had already received a similar report[1] and have been investigating
it for the last few weeks, which gave me some insight. Please take
a look at my short report about it in the PR to Linus for 6.7-rc1:
https://lore.kernel.org/lkml/20231105144852.GA165906@workstation.local/

I can confirm that I have been able to reproduce the problem on an AMD Ryzen
machine. However, it's important to note that I have not observed the
problem on the following systems:

Any chance you (or anyone with the issue) has a serial output available?
I think it would be really good to look at the circumstances surrounding the reboot.


* Intel machines (Sandy Bridge and Skylake generations)
* AMD machines predating Ryzen (Sempron 145)
* Machines using different 1394 OHCI hardware from other vendors such as
TI
* VIA VT6307 connected directly to a PCI slot (i.e. without the PCIe/PCI
bridge in question)

Currently, I have not been able to obtain any useful debug output from
the Linux system or any hardware error reports when the system reboots.
It seems that the system reboots spontaneously. My assumption at this
point is that AMD Ryzen machines detect a specific hardware error,
triggered by a quirk in the combination of the Asmedia ASM1083/1085 and
VIA VT6306/6307/6308, which leads to a power reset.


Recent kernels have enabled PCI AER. Could that be factoring in perhaps?

I genuinely appreciate your assistance in debugging this elusive
hardware issue. If a workaround specific to this AMD Ryzen quirk is
required in the PCI driver for 1394 OHCI hardware, I'm willing to apply it.
However, I think it is preferable to figure out the reboot mechanism
first.

Does the BIOS on these machines enable a watchdog timer? If so, I'd suggest disabling that as a starting point.

How about compiling the driver as a module, blacklisting it with modprobe.blacklist= on the kernel command line, and loading it later? Can you trigger the fault/reboot this way? If so, it at least rules out some conditions that happen during a race at boot.

Looking more closely at the change, I would guess the fault is specifically in get_cycle_time(). I can see that the VIA devices do set
QUIRK_CYCLE_TIMER which will cause additional reads.
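
To make "additional reads" concrete, here is a rough user-space sketch of a read-until-consistent loop of the kind QUIRK_CYCLE_TIMER implies. It is only loosely modeled on the idea and is not the driver's code: read_cycle_timer_checked(), struct fake_timer and fake_read() are hypothetical names made up for illustration, and the fake reader stands in for the MMIO register read. The point is just that each call costs at least three register reads, and more whenever consecutive samples look inconsistent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an MMIO read of the cycle timer register. */
typedef uint32_t (*read_fn)(void *ctx);

/*
 * OHCI cycle timer layout: 7-bit seconds, 13-bit cycle count,
 * 12-bit cycle offset (3072 offsets per cycle, 8000 cycles per second).
 */
static uint32_t cycle_timer_ticks(uint32_t cycle_timer)
{
	uint32_t ticks = cycle_timer & 0xfff;

	ticks += 3072u * ((cycle_timer >> 12) & 0x1fff);
	ticks += (3072u * 8000u) * (cycle_timer >> 25);
	return ticks;
}

/*
 * Keep sampling until three consecutive reads advance monotonically and
 * at roughly the same rate, or until max_retries is exhausted.
 * *reads reports how many register reads were issued.
 */
static uint32_t read_cycle_timer_checked(read_fn rd, void *ctx,
					 int max_retries, int *reads)
{
	uint32_t c0, c1, c2;
	int32_t diff01, diff12;
	int i = 0;

	assert(rd != NULL && reads != NULL);

	c1 = rd(ctx);
	c2 = rd(ctx);
	*reads = 2;
	do {
		c0 = c1;
		c1 = c2;
		c2 = rd(ctx);
		(*reads)++;
		diff01 = (int32_t)(cycle_timer_ticks(c1) - cycle_timer_ticks(c0));
		diff12 = (int32_t)(cycle_timer_ticks(c2) - cycle_timer_ticks(c1));
		/* Divisions only run when both differences are positive. */
	} while ((diff01 <= 0 || diff12 <= 0 ||
		  diff01 / diff12 >= 2 || diff12 / diff01 >= 2) &&
		 i++ < max_retries);

	return c2;
}

/* Fake register for experiments: returns a value, then advances by step. */
struct fake_timer {
	uint32_t next;
	uint32_t step;
};

static uint32_t fake_read(void *ctx)
{
	struct fake_timer *t = ctx;
	uint32_t v = t->next;

	t->next += t->step;
	return v;
}
```

With a well-behaved fake timer the loop exits after its first pass. If reads across the bridge are what trigger the fault, a pattern like this would issue noticeably more of them than a single read would.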

Another guess worth checking is whether iommu=pt or amd_iommu=off helps.

If either of those helps, it could point to a problem with get_cycle_time() and the IOMMU. The older systems you mentioned as working probably didn't enable the IOMMU by default, but most AMD Ryzen systems do.


On Mon, Nov 06, 2023 at 02:14:39PM -0600, Mario Limonciello wrote:
Hi,

I recently came across a kernel bugzilla that bisected a boot problem [1]
introduced in kernel 6.5 to this change.

commit dcadfd7f7c74ef9ee415e072a19bdf6c085159eb (HEAD -> dcadfd7f7c7)
Author: Takashi Sakamoto <o-takashi@xxxxxxxxxxxxx>
Date: Tue May 30 08:12:40 2023 +0900

firewire: core: use union for callback of transaction completion

Removing the firewire card from the system fixes it for both reporters
(CC'ed)

As the author of this change, can you please take a look at it?

Thanks,

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217993


[1] https://bugzilla.suse.com/show_bug.cgi?id=1215436
[2] https://bugzilla.kernel.org/show_bug.cgi?id=217994

Thanks

Takashi Sakamoto