Re: [PATCH 0/2] Recover from failure to probe GPU

From: Mario Limonciello
Date: Fri Dec 23 2022 - 10:51:18 EST

Next message: Christoph Hellwig: "Re: [PATCH 1/2] Revert "remoteproc: qcom_q6v5_mss: map/unmap metadata region before/after use""
Previous message: Jonathan Cameron: "Re: [PATCH] iio:adc:twl6030: Enable measurement of VAC"
In reply to: Javier Martinez Canillas: "Re: [PATCH 0/2] Recover from failure to probe GPU"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 12/22/22 13:41, Javier Martinez Canillas wrote:

[adding Thomas Zimmermann to CC list]

Hello Mario,

Interesting case.

On 12/22/22 19:30, Mario Limonciello wrote:

One of the first things that KMS drivers do during initialization is
destroy the system firmware framebuffer by means of
`drm_aperture_remove_conflicting_pci_framebuffers`

The reason why that's done at the very beginning is that there are no
guarantees that the firmware-provided framebuffer would keep working
after the real display controller driver re-initializes the IP block.

This means that if for any reason the GPU failed to probe the user
will be stuck with at best a screen frozen at the last thing that
was shown before the KMS driver continued its probe.

The problem is most pronounced when new GPU support is introduced
because users will need to have a recent linux-firmware snapshot
on their system when they boot a kernel with matching support.

Right. That's a problem indeed, but as mentioned there's a gap between
when the firmware-provided framebuffer is removed and when the real
driver sets up its framebuffer.

However, the problem is further exacerbated in the case of amdgpu because
it has migrated to "IP discovery" where amdgpu will attempt to load
on "ALL" AMD GPUs even if the driver is missing support for IP blocks
contained in that GPU.

IP discovery requires some probing and isn't run until after the
framebuffer has been destroyed.

This means a situation can occur where a user purchases a new GPU not
yet supported by a distribution and when booting the installer it will
"freeze" even if the distribution doesn't have the matching kernel support
for those IP blocks.

The perfect example of this is Ubuntu 21.10 and the new dGPUs just
launched by AMD. The installation media ships with kernel 5.19 (which
has IP discovery) but the amdgpu support for those IP blocks landed in
kernel 6.0. The matching linux-firmware was released after 21.10's launch.
The screen will freeze without nomodeset. Even if a user manages to install
and then upgrades to kernel 6.0 after install they'll still have the
problem of missing firmware, and the same experience.

s/21.10/22.10/

This is quite jarring for users, particularly if they don't know
that they have to use "nomodeset" to install.

I'm not familiar with AMD GPUs, but could it be possible for this discovery
and firmware loading step to be done at the beginning, before the firmware FB
is removed? That way the FB removal will not happen unless that succeeds.

Possible? I think so, but maybe Alex can comment on this after the holidays as he's more familiar.

It would mean splitting and introducing an entirely new phase to driver initialization. The information about the discovery table comes from VRAM.

amdgpu_driver_load_kms -> amdgpu_device_init -> amdgpu_device_ip_early_init

Basically that specific code would have to be called earlier, and then there would need to be a separate set of code for all the IP blocks to *just* collect what firmware they need.
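
To make the split concrete, here is a minimal userspace sketch of such a pre-check phase. Everything in it is hypothetical: `struct ip_block_info`, `precheck_ip_blocks()` and `may_remove_firmware_fb()` are illustration names, not existing amdgpu code, and the real driver would have to read the discovery table from VRAM and query each IP block for its firmware rather than look at two booleans.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one IP block found by discovery. */
struct ip_block_info {
	const char *name;
	bool driver_supported;   /* the driver has code for this block */
	bool firmware_present;   /* the firmware it needs was found */
};

/*
 * Hypothetical early phase: only collect what is needed and check it,
 * without touching the hardware or the firmware framebuffer.
 * Returns 0 if every block can be brought up, -ENOENT otherwise.
 */
static int precheck_ip_blocks(const struct ip_block_info *blocks, size_t count)
{
	assert(blocks != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (!blocks[i].driver_supported || !blocks[i].firmware_present)
			return -ENOENT;
	}
	return 0;
}

/* The firmware framebuffer is only removed once the pre-check passes. */
static bool may_remove_firmware_fb(const struct ip_block_info *blocks,
				   size_t count)
{
	return precheck_ip_blocks(blocks, count) == 0;
}
```

With this ordering a missing firmware file or an unsupported IP block is detected while the firmware framebuffer is still intact, at the cost of the extra initialization phase described above.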

To help the situation, allow drivers to re-run the init process for the
firmware framebuffer during a failed probe. As this problem is most
pronounced with amdgpu, this is the only driver changed.

But if this makes sense more generally for other KMS drivers, the call
can be added to the cleanup routine for those too.
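
As a rough illustration of the proposed flow (not the code from the patches), the sketch below models a probe that first gives up the firmware framebuffer and then re-runs its init when the driver fails to come up. `struct display_state` and `probe_with_recovery()` are made-up names, and the result of driver initialization is passed in as a parameter so the example stays self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of who currently owns the display. */
struct display_state {
	bool firmware_fb_active;
	bool native_driver_active;
};

/*
 * init_result stands in for the outcome of the real driver init
 * (0 on success, negative errno on failure).
 */
static int probe_with_recovery(struct display_state *st, int init_result)
{
	assert(st != NULL);

	/* Step 1: the firmware framebuffer is removed up front. */
	st->firmware_fb_active = false;

	/* Step 2: driver init failed, so re-run the firmware fb init. */
	if (init_result < 0) {
		st->firmware_fb_active = true;
		return init_result;
	}

	/* Step 3: the native driver took over the display. */
	st->native_driver_active = true;
	return 0;
}
```

The sketch assumes the firmware framebuffer can always be brought back, which holds only if the failure happens before the hardware has been reprogrammed.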

The problem I see is that depending on how far the driver's probe function
went, it may not be possible to re-run the init process, since the firmware
provided framebuffer may already have been destroyed or the IP block may just
be in a half initialized state.

I'm not against this series if it solves the issue in practice for amdgpu,
but I don't think it is a general solution and would like to know Thomas'
opinion on this beforehand as well.

Running with this idea, I'm pretty sure that request_firmware returns -ENOENT in this case. So another proposal for when to trigger this flow would be to only do it on -ENOENT. We could then also change amdgpu_discovery.c to return -ENOENT when an IP block isn't supported instead of the current -EINVAL.

Or we could instead co-opt -ENOTSUPP and remap to it all the cases where we explicitly want the system framebuffer to re-initialize.
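
Either variant boils down to filtering on the probe's return code. A minimal sketch, assuming a hypothetical helper `probe_error_wants_sysfb()` (it does not exist in the kernel) and accepting both candidate codes only so that the two options can be compared side by side:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * ENOTSUPP is a kernel-internal error code (524) that userspace
 * <errno.h> does not define, so provide it for this sketch.
 */
#ifndef ENOTSUPP
#define ENOTSUPP 524
#endif

/*
 * Hypothetical helper: should a failed probe try to re-initialize
 * the system framebuffer? Only the codes discussed above qualify;
 * any other failure leaves the framebuffer alone.
 */
static bool probe_error_wants_sysfb(int err)
{
	assert(err <= 0);   /* kernel convention: 0 or negative errno */

	switch (err) {
	case -ENOENT:       /* missing firmware, first proposal */
	case -ENOTSUPP:     /* explicitly remapped cases, second proposal */
		return true;
	default:
		return false;
	}
}
```

An actual patch would presumably pick one of the two codes rather than both.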
