Re: [PATCH v2 00/11] Recover from failure to probe GPU

From: Alex Deucher
Date: Tue Jan 03 2023 - 09:17:41 EST


On Tue, Jan 3, 2023 at 5:10 AM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote:
>
>
>
> On 12/28/2022 10:00 PM, Mario Limonciello wrote:
> > One of the first things that KMS drivers do during initialization is
> > destroy the system firmware framebuffer by means of
> > `drm_aperture_remove_conflicting_pci_framebuffers`
> >
> > This means that if for any reason the GPU failed to probe the user
> > will be stuck with at best a screen frozen at the last thing that
> > was shown before the KMS driver continued its probe.
> >
> > The problem is most pronounced when new GPU support is introduced
> > because users will need to have a recent linux-firmware snapshot
> > on their system when they boot a kernel with matching support.
> >
> > However the problem is further exacerbated in the case of amdgpu because
> > it has migrated to "IP discovery" where amdgpu will attempt to load
> > on "ALL" AMD GPUs even if the driver is missing support for IP blocks
> > contained in that GPU.
> >
> > IP discovery requires some probing and isn't run until after the
> > framebuffer has been destroyed.
> >
> > This means a situation can occur where a user purchases a new GPU not
> > yet supported by a distribution and when booting the installer it will
> > "freeze" even if the distribution doesn't have the matching kernel support
> > for those IP blocks.
> >
> > The perfect example of this is Ubuntu 22.10 and the new dGPUs just
> > launched by AMD. The installation media ships with kernel 5.19 (which
> > has IP discovery) but the amdgpu support for those IP blocks landed in
> > kernel 6.0. The matching linux-firmware was released after 22.10's launch.
> > The screen will freeze without nomodeset. Even if a user manages to install
> > and then upgrades to kernel 6.0 after install they'll still have the
> > problem of missing firmware, and the same experience.
> >
> > This is quite jarring for users, particularly if they don't know
> > that they have to use "nomodeset" to install.
> >
> > To help the situation make changes to GPU discovery:
> > 1) Delay releasing the firmware framebuffer until after IP discovery has
> > completed. This will help the situation of an older kernel that doesn't
> > yet support the IP blocks probing a new GPU.
> > 2) Request loading all PSP, VCN, SDMA, MES and GC microcode into memory
> > during IP discovery. This will help the situation of a kernel new enough
> > for the IP discovery phase to otherwise pass but missing microcode from
> > linux-firmware.git.
> >
> > Not all requested firmware will be loaded during IP discovery as some of it
> > will require larger driver architecture changes. For example SMU firmware
> > isn't loaded on certain products, but that's not known until later on when
> > the early_init phase of the SMU load occurs.
> >
> > v1->v2:
> > * Take the suggestion from v1 thread to delay the framebuffer release until
> > IP discovery is done. This patch is CC'd to stable so that older stable
> > kernels with IP discovery won't try to probe unknown IP.
> > * Drop changes to drm aperture.
> > * Fetch SDMA, VCN, MES, GC and PSP microcode during IP discovery.
> >
>
> What is the gain here in just checking if firmware files are available?
> It can fail anywhere during sw_init and it's the same situation.

Other failures are presumably a bug or hardware issue. The missing
firmware would be a common issue when chips are first launched.
Thinking about it a bit more, another option might be to move the
calls to request_firmware() into the IP specific early_init()
functions and then move the drm_aperture release after early_init().
That would keep the firmware handling in the IPs and should still
happen early enough that we haven't messed with the hardware yet.
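
To make the ordering concrete, here is a rough userspace sketch of that
idea. It is not amdgpu code: every mock_* name, the struct layouts and the
firmware file names are made up for illustration, and mock_request_firmware()
only stands in for the real request_firmware() call. The one thing it is
meant to show is that the firmware framebuffer is released only after every
IP block's early_init() has found its firmware.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the device: which firmware files exist and
 * whether the firmware framebuffer has been released yet. */
struct mock_device {
	const char *const *available_fw;
	size_t num_available_fw;
	bool firmware_fb_released;
};

/* Hypothetical IP block; fw_name is NULL if the block needs no firmware. */
struct mock_ip_block {
	const char *name;
	const char *fw_name;
};

/* Stand-in for request_firmware(): succeeds only if the file is present. */
static int mock_request_firmware(const struct mock_device *dev,
				 const char *fw_name)
{
	for (size_t i = 0; i < dev->num_available_fw; i++)
		if (strcmp(dev->available_fw[i], fw_name) == 0)
			return 0;
	return -ENOENT;
}

/* Per-IP early_init(): the firmware lookup stays with the IP block. */
static int mock_ip_early_init(const struct mock_device *dev,
			      const struct mock_ip_block *ip)
{
	if (!ip->fw_name)
		return 0;
	return mock_request_firmware(dev, ip->fw_name);
}

/* Run early_init() for every IP block first; release the firmware
 * framebuffer only if all of them succeeded. */
static int mock_device_init(struct mock_device *dev,
			    const struct mock_ip_block *blocks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int r = mock_ip_early_init(dev, &blocks[i]);

		if (r)
			return r; /* framebuffer is still intact here */
	}

	dev->firmware_fb_released = true;
	return 0;
}
```

With a block whose firmware is missing, mock_device_init() returns -ENOENT
and firmware_fb_released stays false, so the user would still have a working
console.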

Alex

>
> Restricting IP FWs to IP specific files looks better to me than
> centralizing and creating interdependencies.
>
> Thanks,
> Lijo
>
> > Mario Limonciello (11):
> > drm/amd: Delay removal of the firmware framebuffer
> > drm/amd: Add a legacy mapping to "amdgpu_ucode_ip_version_decode"
> > drm/amd: Convert SMUv11 microcode init to use
> > `amdgpu_ucode_ip_version_decode`
> > drm/amd: Convert SMU v13 to use `amdgpu_ucode_ip_version_decode`
> > drm/amd: Request SDMA microcode during IP discovery
> > drm/amd: Request VCN microcode during IP discovery
> > drm/amd: Request MES microcode during IP discovery
> > drm/amd: Request GFX9 microcode during IP discovery
> > drm/amd: Request GFX10 microcode during IP discovery
> > drm/amd: Request GFX11 microcode during IP discovery
> > drm/amd: Request PSP microcode during IP discovery
> >
> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 +
> > drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 590 +++++++++++++++++-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 6 -
> > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 -
> > drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 9 +-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h | 2 +-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 208 ++++++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 85 +--
> > drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 180 +-----
> > drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 64 +-
> > drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 143 +----
> > drivers/gpu/drm/amd/amdgpu/mes_v10_1.c | 28 -
> > drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 25 +-
> > drivers/gpu/drm/amd/amdgpu/psp_v10_0.c | 106 +---
> > drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 165 +----
> > drivers/gpu/drm/amd/amdgpu/psp_v12_0.c | 102 +--
> > drivers/gpu/drm/amd/amdgpu/psp_v13_0.c | 82 ---
> > drivers/gpu/drm/amd/amdgpu/psp_v13_0_4.c | 36 --
> > drivers/gpu/drm/amd/amdgpu/psp_v3_1.c | 36 --
> > drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 61 +-
> > drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 42 +-
> > drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 65 +-
> > drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 30 +-
> > .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c | 35 +-
> > .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c | 12 +-
> > 25 files changed, 919 insertions(+), 1203 deletions(-)
> >
> >
> > base-commit: de9a71e391a92841582ca3008e7b127a0b8ccf41