Re: [PATCH] drm/amdgpu: add mb for si

From: Lazar, Lijo
Date: Fri Nov 25 2022 - 01:14:26 EST



On 11/25/2022 7:43 AM, Quan, Evan wrote:
[AMD Official Use Only - General]



-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Thursday, November 24, 2022 6:49 PM
To: Quan, Evan <Evan.Quan@xxxxxxx>; 李真能 <lizhenneng@xxxxxxxxxx>;
Michel Dänzer <michel.daenzer@xxxxxxxxxxx>; Koenig, Christian
<Christian.Koenig@xxxxxxx>; Deucher, Alexander
<Alexander.Deucher@xxxxxxx>
Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Pan, Xinhui <Xinhui.Pan@xxxxxxx>;
linux-kernel@xxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu: add mb for si



On 11/24/2022 4:11 PM, Lazar, Lijo wrote:

On 11/24/2022 3:34 PM, Quan, Evan wrote:
[AMD Official Use Only - General]

Could the attached patch help?

Evan
-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of 李真能
Sent: Friday, November 18, 2022 5:25 PM
To: Michel Dänzer <michel.daenzer@xxxxxxxxxxx>; Koenig, Christian
<Christian.Koenig@xxxxxxx>; Deucher, Alexander
<Alexander.Deucher@xxxxxxx>
Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Pan, Xinhui <Xinhui.Pan@xxxxxxx>;
linux-kernel@xxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu: add mb for si


On 2022/11/18 17:18, Michel Dänzer wrote:
On 11/18/22 09:01, Christian König wrote:
On 18.11.22 at 08:48, Zhenneng Li wrote:
During reboot tests on an arm64 platform, boot may fail, so
add this mb() in smc.

The error messages are as follows:
[    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22
[    7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
The issue is happening in late_init() which eventually does

    ret = si_thermal_enable_alert(adev, false);

Just before this, si_thermal_start_thermal_controller is called in
hw_init and that enables thermal alert.

Maybe the issue is with enable/disable of thermal alerts in quick
succession. Adding a delay inside si_thermal_start_thermal_controller
might help.

On a second look, temperature range is already set as part of
si_thermal_start_thermal_controller in hw_init
https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780

There is no need to set it again here -

https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635

I think it is safe to remove the call from late_init altogether. Alex/Evan?
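
To make the proposal concrete, here is a minimal user-space sketch of the sequence described above. It is not the driver code: struct thermal_model and the model_* functions are hypothetical names, and the assumption that the late_init path re-enables the alert after reprogramming the range is mine, not something stated in this thread. The model only counts how often the range is programmed and how often the alert is toggled.

```c
#include <stdbool.h>

/* Hypothetical model of the thermal state; not taken from si_dpm.c. */
struct thermal_model {
    bool alert_enabled;  /* current state of the thermal alert */
    int range_writes;    /* how many times the range was programmed */
    int alert_toggles;   /* how many enable/disable transitions occurred */
};

/* Stand-in for enabling or disabling the alert; counts real transitions only. */
static void model_enable_alert(struct thermal_model *t, bool enable)
{
    if (t->alert_enabled != enable) {
        t->alert_enabled = enable;
        t->alert_toggles++;
    }
}

/* Stand-in for programming the temperature range. */
static void model_set_range(struct thermal_model *t)
{
    t->range_writes++;
}

/* hw_init path as described in the thread: set the range, enable the alert. */
static void model_hw_init(struct thermal_model *t)
{
    model_set_range(t);
    model_enable_alert(t, true);
}

/*
 * late_init path. With redo_range == true it disables the alert, programs
 * the range again and (assumption) re-enables the alert. With
 * redo_range == false it models the proposal to drop the call.
 */
static void model_late_init(struct thermal_model *t, bool redo_range)
{
    if (!redo_range)
        return;
    model_enable_alert(t, false);
    model_set_range(t);
    model_enable_alert(t, true);
}
```

In this model both variants end with the alert enabled and the range programmed; dropping the late_init call only removes one redundant range write and the disable/enable pair that follows hw_init.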

[Quan, Evan] Yes, that makes sense to me. But I'm not sure whether it is related to the issue here.
As I understand it, if the issue were caused by enabling the thermal alert twice, it would fail every time.
That cannot explain why adding a delay or an mb() call helps.

The side effect of the patch is just some random delay introduced for every SMC message.

The issue happens in late_init(). Between late_init() and dpm enablement, there are many smc messages sent which don't have this issue. So I think the issue is not with FW not running.

Thus the only case I see is enable/disable of thermal alert in random succession.

Thanks,

Lijo

BR
Evan
Thanks,
Lijo

Thanks,
Lijo

amdgpu_device_ip_late_init failed
[    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
Memory barriers are not supposed to be sprinkled around like this;
you need to give a detailed explanation of why this is necessary.
Regards,
Christian.

Signed-off-by: Zhenneng Li <lizhenneng@xxxxxxxxxx>
---
    drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
    1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
index 8f994ffa9cd1..c7656f22278d 100644
--- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
+++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
@@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev)
        u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
        u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
+       mb();
+
        if (!(rst & RST_REG) && !(clk & CK_DISABLE))
            return true;
In particular, it makes no sense in this specific place, since it
cannot directly
affect the values of rst & clk.
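
A small user-space sketch of that point, under stated assumptions: fake_rreg32() and the two fake_* variables are hypothetical stand-ins for the SMC register reads, the bit positions are illustrative only, and C11 atomic_thread_fence() stands in for the kernel's mb(). Once rst and clk have been read into locals, a fence placed after the reads cannot change what they hold, so both variants return the same result for the same register contents.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; not copied from the driver headers. */
#define RST_REG    (1u << 0)
#define CK_DISABLE (1u << 0)

/* Hypothetical stand-ins for the two SMC registers. */
static uint32_t fake_reset_cntl;
static uint32_t fake_clock_cntl;

/* Hypothetical register read; the volatile access forces a real load. */
static uint32_t fake_rreg32(const uint32_t *reg)
{
    return *(const volatile uint32_t *)reg;
}

/* Shape of the check before the patch. */
static bool smc_running_plain(void)
{
    uint32_t rst = fake_rreg32(&fake_reset_cntl);
    uint32_t clk = fake_rreg32(&fake_clock_cntl);

    return !(rst & RST_REG) && !(clk & CK_DISABLE);
}

/* Shape of the check after the patch: a full fence after both reads. */
static bool smc_running_fenced(void)
{
    uint32_t rst = fake_rreg32(&fake_reset_cntl);
    uint32_t clk = fake_rreg32(&fake_clock_cntl);

    /* Stand-in for mb(); rst and clk are already fixed at this point. */
    atomic_thread_fence(memory_order_seq_cst);

    return !(rst & RST_REG) && !(clk & CK_DISABLE);
}
```

The only observable difference between the two functions is the time the fence takes, which fits the observation elsewhere in this thread that a delay has the same effect.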

I think so too.

But when I run the reboot test on nine desktop machines, one or two of
them may report this error after hundreds or thousands of reboots. At
first I used msleep() instead of mb(); both methods work, but I don't
know what the root cause is.

I used this method on another vendor's Oland card, and this error message
was reported again.

What could be the root cause?

Test environment:

graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87

driver: amdgpu

os: Ubuntu 20.04

platform: arm64

kernel: 5.4.18