Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

From: Christian König
Date: Fri Dec 15 2023 - 07:37:51 EST


Am 15.12.23 um 12:45 schrieb Mikhail Gavrilov:
On Tue, Feb 28, 2023 at 5:43 PM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
The point is it doesn't need to talk to the amdgpu hardware. What it
does is that it talks to the good old VGA/VESA emulation and that just
happens to be still enabled by the BIOS/GRUB.

And that VGA/VESA emulation doesn't need any BAR or whatever to keep the
hw running in the state where it was initialized before the kernel
started. The kernel just grabs the addresses where it needs to write the
display data and keeps going with that.

But when a hw-specific driver wants to load, this emulation is the first
thing that gets disabled, because we need to load new firmware. And with the
BARs disabled it can't be re-enabled without rebooting the system.

My suggestion is that if
amdgpu fails to talk to the hardware, then let another suitable driver
do it. I attached a system log taken with "pci=nocrs" and
"modprobe.blacklist=amdgpu" to show that graphics work correctly in
this case.
Does the Linux module loading mechanism need to be refined to do this?
That's actually working as expected. The real problem is that the BIOS
on that system is so broken that we can't access the hw correctly.

What we could do is check the BARs very early on and refuse to
load when they are disabled. The problem with this approach is that there
are systems where it is normal that the BARs are disabled until the
driver loads and only get enabled during the hardware initialization process.
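
For anyone who wants to look at the BAR state from userspace, here is a
minimal sketch in Python. It assumes the usual sysfs layout, where
/sys/bus/pci/devices/<bdf>/resource holds one "start end flags" line in hex
per resource and the first six lines are the standard BARs. The helper names
are made up for illustration, and this only shows whether the kernel assigned
an address range, not whether the hardware actually decodes it.

```python
from pathlib import Path

# Memory resource flag bit, as defined in include/linux/ioport.h.
IORESOURCE_MEM = 0x00000200


def parse_resource_table(text):
    """Parse the contents of a sysfs 'resource' file.

    Returns a list of (index, start, end, flags) tuples, one per
    well-formed line; the index is the line number in the file.
    """
    entries = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue  # skip anything that is not "start end flags"
        start, end, flags = (int(field, 16) for field in fields)
        entries.append((index, start, end, flags))
    return entries


def assigned_memory_bars(text, bar_count=6):
    """Return the indices of standard BARs that have a non-empty memory range."""
    return [
        index
        for index, start, end, flags in parse_resource_table(text)
        if index < bar_count and flags & IORESOURCE_MEM and end > start
    ]


def read_resource(bdf):
    """Read the resource table of a device such as '0000:03:00.0' (hypothetical helper)."""
    return Path(f"/sys/bus/pci/devices/{bdf}/resource").read_text()
```

On a live system something like assigned_memory_bars(read_resource("0000:03:00.0"))
would list the BARs that got a memory range; an empty list would match the
failing case described in this thread.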

What you might want to look into is to find a quirk for the BIOS to
properly enable the nvme controller.

That's interesting. I noticed that amdgpu now works even with the
[pci=nocrs] parameter on 6.7.0-0.rc4 and newer kernels.
Does that mean the BARs became available?
I attached the kernel log and lspci output here. What has changed?

I have no idea :)

From the logs I can see that the AMDGPU now has the proper BARs assigned:

[    5.722015] pci 0000:03:00.0: [1002:73df] type 00 class 0x038000
[    5.722051] pci 0000:03:00.0: reg 0x10: [mem 0xf800000000-0xfbffffffff 64bit pref]
[    5.722081] pci 0000:03:00.0: reg 0x18: [mem 0xfc00000000-0xfc0fffffff 64bit pref]
[    5.722112] pci 0000:03:00.0: reg 0x24: [mem 0xfca00000-0xfcafffff]
[    5.722134] pci 0000:03:00.0: reg 0x30: [mem 0xfcb00000-0xfcb1ffff pref]
[    5.722368] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold
[    5.722484] pci 0000:03:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)

And with that the driver can work perfectly fine.
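
If you want to compare this against older logs, the BAR ranges can be pulled
out of the dmesg text with a small script. This is only a sketch: the regular
expression is written against the "reg 0x..: [mem ...]" lines quoted above and
may need adjusting for other kernel versions, and the function names are made
up for illustration.

```python
import re

# Matches lines such as:
#   pci 0000:03:00.0: reg 0x10: [mem 0xf800000000-0xfbffffffff 64bit pref]
BAR_LINE = re.compile(
    r"pci (?P<dev>\S+): reg (?P<reg>0x[0-9a-f]+): "
    r"\[(?P<kind>mem|io)\s+(?P<start>0x[0-9a-f]+)-(?P<end>0x[0-9a-f]+)"
    r"(?P<attrs>[^\]]*)\]"
)


def reg_name(offset):
    """Map a config space offset to a readable name (BAR0..BAR5 or ROM)."""
    if 0x10 <= offset <= 0x24 and offset % 4 == 0:
        return f"BAR{(offset - 0x10) // 4}"
    if offset == 0x30:
        return "ROM"
    return hex(offset)


def parse_bars(log_text):
    """Return one dict per BAR line found in the log text."""
    bars = []
    for match in BAR_LINE.finditer(log_text):
        start = int(match["start"], 16)
        end = int(match["end"], 16)
        bars.append(
            {
                "dev": match["dev"],
                "name": reg_name(int(match["reg"], 16)),
                "size": end - start + 1,  # the logged range is inclusive
                "attrs": match["attrs"].split(),
            }
        )
    return bars
```

For the lines above this gives 16 GiB for BAR0, 256 MiB for BAR2, 1 MiB for
BAR5 and 128 KiB for the expansion ROM. A log from a failing boot would be
expected to show missing or different ranges for the same device.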

Have you updated the BIOS or added/removed some other hardware? Maybe somebody added a quirk for your BIOS into the PCIe code or something like that.

Regards,
Christian.