Re: [PATCH v2 00/11] Recover from failure to probe GPU

From: Lazar, Lijo
Date: Tue Jan 03 2023 - 05:11:21 EST




On 12/28/2022 10:00 PM, Mario Limonciello wrote:
One of the first things that KMS drivers do during initialization is
destroy the system firmware framebuffer by means of
`drm_aperture_remove_conflicting_pci_framebuffers`.

This means that if for any reason the GPU failed to probe, the user
will be stuck with at best a screen frozen at the last thing that
was shown before the KMS driver continued its probe.

The problem is most pronounced when new GPU support is introduced
because users will need to have a recent linux-firmware snapshot
on their system when they boot a kernel with matching support.

However, the problem is further exacerbated in the case of amdgpu because
it has migrated to "IP discovery" where amdgpu will attempt to load
on "ALL" AMD GPUs even if the driver is missing support for IP blocks
contained in that GPU.

IP discovery requires some probing and isn't run until after the
framebuffer has been destroyed.

This means a situation can occur where a user purchases a new GPU not
yet supported by a distribution and when booting the installer it will
"freeze" even though the distribution doesn't have the matching kernel support
for those IP blocks.

The perfect example of this is Ubuntu 22.10 and the new dGPUs just
launched by AMD. The installation media ships with kernel 5.19 (which
has IP discovery) but the amdgpu support for those IP blocks landed in
kernel 6.0. The matching linux-firmware was released after 22.10's launch.
The screen will freeze without nomodeset. Even if a user manages to install
and then upgrades to kernel 6.0 after install they'll still have the
problem of missing firmware, and the same experience.

This is quite jarring for users, particularly if they don't know
that they have to use "nomodeset" to install.

To help the situation, make the following changes to GPU discovery:
1) Delay releasing the firmware framebuffer until after IP discovery has
completed. This will help the situation of an older kernel that doesn't
yet support the IP blocks probing a new GPU.
2) Request loading all PSP, VCN, SDMA, MES and GC microcode into memory
during IP discovery. This will help the situation of a kernel new enough for
the IP discovery phase to otherwise pass but missing microcode from
linux-firmware.git.
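
The ordering in 1) and 2) can be sketched as a small userspace model. This is
only an illustration of the intended sequence, not the code in the series:
`model_gpu`, `model_probe`, `fw_installed` and the `PROBE_*` values are
hypothetical names, and the firmware lists stand in for whatever the real
driver would look up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model; none of these names are amdgpu symbols. */
enum probe_result {
	PROBE_OK = 0,
	PROBE_UNKNOWN_IP = -1,	/* IP discovery found unsupported blocks */
	PROBE_MISSING_FW = -2,	/* a required microcode file is absent */
};

struct model_gpu {
	bool ip_blocks_supported;		/* would IP discovery pass on this kernel? */
	const char *const *required_fw;		/* NULL-terminated microcode names */
	const char *const *installed_fw;	/* NULL-terminated names present on disk */
	bool firmware_fb_alive;			/* is the boot framebuffer still usable? */
};

/* Return true if the named file appears in the installed list. */
static bool fw_installed(const struct model_gpu *gpu, const char *name)
{
	for (const char *const *f = gpu->installed_fw; f && *f; f++)
		if (strcmp(*f, name) == 0)
			return true;
	return false;
}

static enum probe_result model_probe(struct model_gpu *gpu)
{
	assert(gpu != NULL);
	gpu->firmware_fb_alive = true;

	/* Change 1: IP discovery runs while the firmware framebuffer is intact. */
	if (!gpu->ip_blocks_supported)
		return PROBE_UNKNOWN_IP;

	/* Change 2: check all required microcode up front, still before release. */
	for (const char *const *f = gpu->required_fw; f && *f; f++)
		if (!fw_installed(gpu, *f))
			return PROBE_MISSING_FW;

	/* Only after both checks pass is the firmware framebuffer given up. */
	gpu->firmware_fb_alive = false;
	return PROBE_OK;
}
```

In this model both failure paths return with `firmware_fb_alive` still true,
which is the property the two changes aim for: a failed probe leaves the user
with a working firmware framebuffer instead of a frozen screen.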

Not all requested firmware will be loaded during IP discovery as some of it
will require larger driver architecture changes. For example SMU firmware
isn't loaded on certain products, but that's not known until later on when
the early_init phase of the SMU load occurs.

v1->v2:
* Take the suggestion from v1 thread to delay the framebuffer release until
IP discovery is done. This patch is CC'd to stable so that older stable
kernels with IP discovery won't try to probe unknown IP.
* Drop changes to drm aperture.
* Fetch SDMA, VCN, MES, GC and PSP microcode during IP discovery.


What is the gain here in just checking if firmware files are available? It can fail anywhere during sw_init and it's the same situation.

Restricting IP FWs to IP specific files looks better to me than centralizing and creating interdependencies.

Thanks,
Lijo

Mario Limonciello (11):
drm/amd: Delay removal of the firmware framebuffer
drm/amd: Add a legacy mapping to "amdgpu_ucode_ip_version_decode"
drm/amd: Convert SMUv11 microcode init to use
`amdgpu_ucode_ip_version_decode`
drm/amd: Convert SMU v13 to use `amdgpu_ucode_ip_version_decode`
drm/amd: Request SDMA microcode during IP discovery
drm/amd: Request VCN microcode during IP discovery
drm/amd: Request MES microcode during IP discovery
drm/amd: Request GFX9 microcode during IP discovery
drm/amd: Request GFX10 microcode during IP discovery
drm/amd: Request GFX11 microcode during IP discovery
drm/amd: Request PSP microcode during IP discovery

drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 +
drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 590 +++++++++++++++++-
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 6 -
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 -
drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 9 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 208 ++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 85 +--
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 180 +-----
drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 64 +-
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 143 +----
drivers/gpu/drm/amd/amdgpu/mes_v10_1.c | 28 -
drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 25 +-
drivers/gpu/drm/amd/amdgpu/psp_v10_0.c | 106 +---
drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 165 +----
drivers/gpu/drm/amd/amdgpu/psp_v12_0.c | 102 +--
drivers/gpu/drm/amd/amdgpu/psp_v13_0.c | 82 ---
drivers/gpu/drm/amd/amdgpu/psp_v13_0_4.c | 36 --
drivers/gpu/drm/amd/amdgpu/psp_v3_1.c | 36 --
drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 61 +-
drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 42 +-
drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 65 +-
drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 30 +-
.../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c | 35 +-
.../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c | 12 +-
25 files changed, 919 insertions(+), 1203 deletions(-)


base-commit: de9a71e391a92841582ca3008e7b127a0b8ccf41