RE: Crashes under Xen with Radeon graphics card

From: Deucher, Alexander
Date: Fri Dec 15 2023 - 11:04:56 EST


[Public]

> -----Original Message-----
> From: Juergen Gross <jgross@xxxxxxxx>
> Sent: Friday, December 15, 2023 6:57 AM
> To: lkml <linux-kernel@xxxxxxxxxxxxxxx>; xen-devel@xxxxxxxxxxxxxxxxxxxx;
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian
> <Christian.Koenig@xxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>
> Subject: Crashes under Xen with Radeon graphics card
>
> Hi,
>
> I recently stumbled over a test system which showed crashes probably
> resulting from memory being overwritten randomly.
>
> The problem is occurring only in Dom0 when running under Xen. It seems to
> be present since at least kernel 6.3 (I didn't go back further yet), and it seems
> NOT to be present in kernel 5.14.
>
> I tracked the problem down to the initialization of the graphics card (the
> problem might surface only later, but at least an early initialization failure made
> the problem go away).
>
> # lspci
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> Caicos XTX [Radeon HD 8490 / R5 235X OEM]
> 01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Caicos HDMI
> Audio [Radeon HD 6450 / 7450/8450/8490 OEM / R5 230/235/235X OEM]
>
> I had a working .config and one which did produce the crashes, so I narrowed
> the problem down to detect that the important difference was in the area of
> firmware loading (the working .config didn't have
> CONFIG_FW_LOADER_COMPRESS_XZ set, causing firmware loading for the
> card to fail). This was of course not the real problem, but it caused the card
> initialization to fail.
>
> I manually decompressed the firmware files one by one to see whether the
> problem would be in the decompressor or probably in the driver of the card.
>
> The last step without crash was:
>
> # dmesg | grep radeon
> [ 10.106405] [drm] radeon kernel modesetting enabled.
> [ 10.106455] radeon 0000:01:00.0: vgaarb: deactivate vga console
> [ 10.222944] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
> [ 10.252921] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
> [ 10.278255] [drm] radeon: 1024M of VRAM memory ready
> [ 10.295828] [drm] radeon: 1024M of GTT memory ready.
> [ 10.295867] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_pfp.bin succeeded
> [ 10.330846] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_me.bin succeeded
> [ 10.330858] radeon 0000:01:00.0: Direct firmware load for radeon/BTC_rlc.bin succeeded
> [ 10.330870] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_mc.bin failed with error -2
> [ 10.380979] ni_cp: Failed to load firmware "radeon/CAICOS_mc.bin"
> [ 10.381006] [drm:evergreen_init [radeon]] *ERROR* Failed to load firmware!
> [ 10.405765] radeon 0000:01:00.0: Fatal error during GPU init
> [ 10.432107] [drm] radeon: finishing device.
> [ 10.439179] [drm] radeon: ttm finalized
> [ 10.463203] radeon: probe of 0000:01:00.0 failed with error -2
>
> And with decompressing radeon/CAICOS_mc.bin I got:
>
> # dmesg | grep radeon
> [ 10.266491] [drm] radeon kernel modesetting enabled.
> [ 10.266552] radeon 0000:01:00.0: vgaarb: deactivate vga console
> [ 10.456047] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
> [ 10.470270] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
> [ 10.566946] [drm] radeon: 1024M of VRAM memory ready
> [ 10.576891] [drm] radeon: 1024M of GTT memory ready.
> [ 10.586971] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_pfp.bin succeeded
> [ 10.611886] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_me.bin succeeded
> [ 10.611909] radeon 0000:01:00.0: Direct firmware load for radeon/BTC_rlc.bin succeeded
> [ 10.611938] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_mc.bin succeeded
> [ 10.660599] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_smc.bin failed with error -2
> [ 10.660601] smc: error loading firmware "radeon/CAICOS_smc.bin"

You also need to make sure CAICOS_smc.bin is available.

> [ 10.661676] [drm] radeon: power management initialized
> [ 10.713666] radeon 0000:01:00.0: Direct firmware load for radeon/SUMO_uvd.bin failed with error -2
> [ 10.713668] radeon 0000:01:00.0: radeon_uvd: Can't load firmware "radeon/SUMO_uvd.bin"
> [ 10.713669] radeon 0000:01:00.0: failed UVD (-2) init.

And SUMO_uvd.bin.

> [ 10.714787] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
> [ 10.809213] radeon 0000:01:00.0: WB enabled
> [ 10.817528] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00
> [ 10.833755] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c
> [ 10.850330] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
> [ 10.862154] radeon 0000:01:00.0: radeon: using MSI.
> [ 10.871930] [drm] radeon: irq initialized.
> [ 11.062028] [drm] Initialized radeon 2.50.0 20080528 for 0000:01:00.0 on minor 0
> [ 11.119723] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
> [ 11.411370] fbcon: radeondrmfb (fb0) is primary device
> [ 11.507252] radeon 0000:01:00.0: [drm] fb0: radeondrmfb frame buffer device
> [ 11.674028] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
> [ 11.834317] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
> [ 28.313041] snd_hda_intel 0000:01:00.1: bound 0000:01:00.0 (ops radeon_audio_component_bind_ops [radeon])
> [ 44.371991] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
> [ 44.428068] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
>
> followed by a crash some seconds after the system was up.
>
> The crashes vary, but often the kernel accesses non-canonical addresses or
> tries to map illegal physical addresses. Sometimes the system is just hanging,
> either with softlockups or without any further signs of being alive.
>
> I can easily reproduce the problem, so any debug patches to narrow down the
> problem are welcome.

Some firmware required for proper operation is still missing (CAICOS_smc.bin and SUMO_uvd.bin in the log above). Please make those files available.
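
One rough way to list what is still missing is to scan the kernel log for failed direct firmware loads and then look for the files on disk. The following is only a minimal Python sketch under stated assumptions, not something taken from the driver: the helper names failed_firmware and find_on_disk are made up for illustration, /lib/firmware is assumed as the only search root (the loader also checks a few other directories), and the .xz/.zst suffixes only matter if the kernel was built with the corresponding CONFIG_FW_LOADER_COMPRESS_* options.

```python
import re
from pathlib import Path

# Matches the firmware loader's failure message, for example:
# "Direct firmware load for radeon/CAICOS_smc.bin failed with error -2"
FAILED_LOAD = re.compile(
    r"Direct firmware load for\s+(\S+)\s+failed with error\s+(-?\d+)"
)


def failed_firmware(dmesg_text):
    """Return firmware names whose direct load failed, in order of first appearance."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(dmesg_text.split())
    names = []
    for name, _errno in FAILED_LOAD.findall(flat):
        if name not in names:
            names.append(name)
    return names


def find_on_disk(name, root="/lib/firmware"):
    """Return the existing candidate files for a firmware name.

    Assumption: only a single root directory is checked, and the candidates
    are the plain file plus its .xz and .zst compressed variants.
    """
    base = Path(root) / name
    candidates = [
        base,
        base.with_name(base.name + ".xz"),
        base.with_name(base.name + ".zst"),
    ]
    return [p for p in candidates if p.is_file()]


if __name__ == "__main__":
    # Two messages taken from the log in this thread (quote markers removed).
    sample = (
        "[ 10.660599] radeon 0000:01:00.0: Direct firmware load for\n"
        "radeon/CAICOS_smc.bin failed with error -2\n"
        "[ 10.713666] radeon 0000:01:00.0: Direct firmware load for\n"
        "radeon/SUMO_uvd.bin\n"
        "failed with error -2\n"
    )
    for fw in failed_firmware(sample):
        print(fw, "->", find_on_disk(fw) or "no candidate found")
```

To use it on a live system, pass the text of the dmesg output to failed_firmware() instead of the sample string. Mail quote markers ("> ") have to be stripped first, otherwise a wrapped message will not match.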

Alex