[PATCH 0/2] Recover from failure to probe GPU

From: Mario Limonciello
Date: Thu Dec 22 2022 - 13:30:50 EST


One of the first things that KMS drivers do during initialization is
destroy the system firmware framebuffer by means of
`drm_aperture_remove_conflicting_pci_framebuffers`.

This means that if the GPU fails to probe for any reason, the user
will be stuck with, at best, a screen frozen at the last thing that
was shown before the KMS driver continued its probe.

The problem is most pronounced when new GPU support is introduced
because users will need to have a recent linux-firmware snapshot
on their system when they boot a kernel with matching support.

However, the problem is further exacerbated in the case of amdgpu because
it has migrated to "IP discovery", where amdgpu will attempt to load
on all AMD GPUs, even if the driver is missing support for IP blocks
contained in that GPU.

IP discovery requires some probing and isn't run until after the
framebuffer has been destroyed.

This means a situation can occur where a user purchases a new GPU not
yet supported by a distribution, and the installer "freezes" at boot
even though the distribution's kernel has no support for those IP
blocks.

The perfect example of this is Ubuntu 22.10 and the new dGPUs just
launched by AMD. The installation media ships with kernel 5.19 (which
has IP discovery), but the amdgpu support for those IP blocks landed in
kernel 6.0. The matching linux-firmware was released after 22.10's launch.
The screen will freeze without nomodeset. Even if a user manages to install
and then upgrades to kernel 6.0 afterwards, they'll still have the
problem of missing firmware, and the same experience.

This is quite jarring for users, particularly if they don't know
that they have to use "nomodeset" to install.

To help the situation, allow drivers to re-run the init process for the
firmware framebuffer during a failed probe. As this problem is most
pronounced with amdgpu, this is the only driver changed.

But if this makes sense more generally for other KMS drivers, the call
can be added to the cleanup routine for those too.
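
As a rough illustration of the intended flow, here is a small userspace
model. It is not kernel code: every name in it (`model_probe`,
`model_restore_firmware_fb`, and so on) is hypothetical and does not
correspond to the functions touched by this series. It only shows that,
without a recovery step, a failed probe leaves nothing driving the display,
while with one the firmware framebuffer owns it again.

```c
/* Userspace model only: none of these names are kernel APIs. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Who currently owns the display in this toy model. */
enum fb_owner { FB_NONE, FB_FIRMWARE, FB_NATIVE };

struct display {
	enum fb_owner owner;
};

/* Stand-in for the early step that evicts the firmware framebuffer. */
static void model_remove_firmware_fb(struct display *d)
{
	d->owner = FB_NONE;
}

/* Stand-in for the recovery step: bring the firmware framebuffer back. */
static void model_restore_firmware_fb(struct display *d)
{
	/* Only restore when nothing else owns the display. */
	assert(d->owner == FB_NONE);
	d->owner = FB_FIRMWARE;
}

/*
 * Model of a driver probe. firmware_present mimics whether the GPU
 * firmware files can be loaded; recover selects whether the failure
 * path restores the firmware framebuffer.
 * Returns 0 on success or -ENOENT on failure.
 */
static int model_probe(struct display *d, bool firmware_present, bool recover)
{
	model_remove_firmware_fb(d);

	if (!firmware_present) {
		if (recover)
			model_restore_firmware_fb(d);
		return -ENOENT;
	}

	d->owner = FB_NATIVE;
	return 0;
}
```

In this model, a failed probe with `recover` set to false leaves the owner
as `FB_NONE` (the frozen screen), and with it set to true leaves it as
`FB_FIRMWARE`.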

Here is a sample of what happens with missing GPU firmware and this
series:

[ 5.950056] amdgpu 0000:63:00.0: vgaarb: deactivate vga console
[ 5.950114] amdgpu 0000:63:00.0: enabling device (0006 -> 0007)
[ 5.950883] [drm] initializing kernel modesetting (YELLOW_CARP 0x1002:0x1681 0x17AA:0x22F1 0xD2).
[ 5.952954] [drm] register mmio base: 0xB0A00000
[ 5.952958] [drm] register mmio size: 524288
[ 5.954633] [drm] add ip block number 0 <nv_common>
[ 5.954636] [drm] add ip block number 1 <gmc_v10_0>
[ 5.954637] [drm] add ip block number 2 <navi10_ih>
[ 5.954638] [drm] add ip block number 3 <psp>
[ 5.954639] [drm] add ip block number 4 <smu>
[ 5.954641] [drm] add ip block number 5 <dm>
[ 5.954642] [drm] add ip block number 6 <gfx_v10_0>
[ 5.954643] [drm] add ip block number 7 <sdma_v5_2>
[ 5.954644] [drm] add ip block number 8 <vcn_v3_0>
[ 5.954645] [drm] add ip block number 9 <jpeg_v3_0>
[ 5.954663] amdgpu 0000:63:00.0: amdgpu: Fetched VBIOS from VFCT
[ 5.954666] amdgpu: ATOM BIOS: 113-REMBRANDT-X37
[ 5.954677] [drm] VCN(0) decode is enabled in VM mode
[ 5.954678] [drm] VCN(0) encode is enabled in VM mode
[ 5.954680] [drm] JPEG decode is enabled in VM mode
[ 5.954681] amdgpu 0000:63:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 5.954683] amdgpu 0000:63:00.0: amdgpu: PCIE atomic ops is not supported
[ 5.954724] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 5.954732] amdgpu 0000:63:00.0: amdgpu: VRAM: 512M 0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
[ 5.954735] amdgpu 0000:63:00.0: amdgpu: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
[ 5.954738] amdgpu 0000:63:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[ 5.954747] [drm] Detected VRAM RAM=512M, BAR=512M
[ 5.954750] [drm] RAM width 256bits LPDDR5
[ 5.954834] [drm] amdgpu: 512M of VRAM memory ready
[ 5.954838] [drm] amdgpu: 15680M of GTT memory ready.
[ 5.954873] [drm] GART: num cpu pages 262144, num gpu pages 262144
[ 5.955333] [drm] PCIE GART of 1024M enabled (table at 0x000000F41FC00000).
[ 5.955502] amdgpu 0000:63:00.0: Direct firmware load for amdgpu/yellow_carp_toc.bin failed with error -2
[ 5.955505] amdgpu 0000:63:00.0: amdgpu: fail to request/validate toc microcode
[ 5.955510] [drm:psp_sw_init [amdgpu]] *ERROR* Failed to load psp firmware!
[ 5.955725] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init of IP block <psp> failed -2
[ 5.955952] amdgpu 0000:63:00.0: amdgpu: amdgpu_device_ip_init failed
[ 5.955954] amdgpu 0000:63:00.0: amdgpu: Fatal error during GPU init
[ 5.955957] amdgpu 0000:63:00.0: amdgpu: amdgpu: finishing device.
[ 5.971162] efifb: probing for efifb
[ 5.971281] efifb: showing boot graphics
[ 5.974803] efifb: framebuffer at 0x910000000, using 20252k, total 20250k
[ 5.974805] efifb: mode is 2880x1800x32, linelength=11520, pages=1
[ 5.974807] efifb: scrolling: redraw
[ 5.974807] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[ 5.974974] Console: switching to colour frame buffer device 180x56
[ 5.978181] fb0: EFI VGA frame buffer device
[ 5.978199] amdgpu: probe of 0000:63:00.0 failed with error -2
[ 5.978285] [drm] amdgpu: ttm finalized

Now, if the user installs the firmware on the system, they can reload the
driver or re-attach it using sysfs, and it recovers gracefully.
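
For reference, re-attaching through sysfs amounts to writing the device's
PCI address to the driver's `bind` file. The sketch below is a minimal
userspace helper under that assumption; the function names `pci_bind_path`
and `pci_rebind` are made up for illustration, and the write needs root.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "/sys/bus/pci/drivers/<driver>/bind" into buf.
 * Returns 0 on success, -1 if the path does not fit.
 */
static int pci_bind_path(const char *driver, char *buf, size_t len)
{
	int n;

	assert(driver && buf);
	n = snprintf(buf, len, "/sys/bus/pci/drivers/%s/bind", driver);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Ask the PCI core to bind `driver` to the device at address `bdf`
 * (for example "0000:63:00.0"). Returns 0 on success, -1 on any error.
 */
static int pci_rebind(const char *driver, const char *bdf)
{
	char path[256];
	FILE *f;
	int ok;

	if (pci_bind_path(driver, path, sizeof(path)) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(bdf, f) >= 0;
	/* The buffered write reaches sysfs on close, so check fclose() too. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

With the device from the log above, a call would look like
`pci_rebind("amdgpu", "0000:63:00.0")`.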

[ 665.080480] [drm] Initialized amdgpu 3.49.0 20150101 for 0000:63:00.0 on minor 0
[ 665.090075] fbcon: amdgpudrmfb (fb0) is primary device
[ 665.090248] [drm] DSC precompute is not needed.

Mario Limonciello (2):
  firmware: sysfb: Allow re-creating system framebuffer after init
  drm/amd: Re-create firmware framebuffer on failure to probe

 drivers/firmware/efi/sysfb_efi.c        |  6 +++---
 drivers/firmware/sysfb.c                | 15 ++++++++++++++-
 drivers/firmware/sysfb_simplefb.c       |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  2 ++
 include/linux/sysfb.h                   |  5 +++++
 5 files changed, 26 insertions(+), 6 deletions(-)


base-commit: 830b3c68c1fb1e9176028d02ef86f3cf76aa2476
--
2.34.1