[BUG] gpu: drm: amd: noretry=0 causes failure in amdgpu_device_ip_resume on vega10

From: Carl Klemm
Date: Wed Nov 01 2023 - 08:51:13 EST


Hi,

When migrating from 5.15 to 6.5.9 I noticed that noretry=0 no longer
works on vega10 (Instinct MI25). The device fails to start:

[ 40.080411] amdgpu: fw load failed
[ 40.083816] amdgpu: smu firmware loading failed
[ 40.088350] amdgpu 0000:83:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).
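
As a side note on reading the last line: the value in parentheses is a
negated errno. Below is a minimal sketch, assuming the "failed (-NN)"
format shown above, that pulls the code out of a dmesg line and decodes
it. The helper name decode_failure is made up for illustration and is
not part of any kernel or amdgpu tooling.

```python
import errno
import os
import re

# Matches the trailing "failed (-NN)" pattern seen in the amdgpu log line above.
FAILED_RE = re.compile(r"failed \((-?\d+)\)")


def decode_failure(line):
    """Return (errno name, description) for a 'failed (-NN)' line, else None."""
    match = FAILED_RE.search(line)
    if match is None:
        return None
    # Kernel functions return negative errno values, so drop the sign.
    code = abs(int(match.group(1)))
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


line = ("[ 40.088350] amdgpu 0000:83:00.0: amdgpu: "
        "amdgpu_device_ip_resume failed (-22).")
print(decode_failure(line))
```

On Linux, errno 22 is EINVAL, so the resume path is reporting an invalid
argument rather than, for example, a timeout.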

I have also reproduced the same issue on 6.1.55.
It is also possible that the issue was caused by newer GPU firmware
rather than by a change in the kernel. The issue was reproduced with the
firmware from linux-firmware-20230804.

For the full dmesg, see: https://uvos.xyz/noretry.dmesg.log

The same system also contains 2 vega20 devices and 1 navi21 device, all
of which work fine with noretry=0. For more information on the system
this problem was encountered on, see this rocminfo dump:
https://uvos.xyz/rocminfo.log

Regards,
Carl