Re: [PATCH] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()

From: Marek Szyprowski
Date: Mon Nov 27 2023 - 06:24:59 EST


On 24.11.2023 13:45, Marek Szyprowski wrote:
> On 22.11.2023 10:29, Krzysztof Kozlowski wrote:
>> On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote:
>>>>>> Hey Krzysztof,
>>>>>>
>>>>>> This is interesting. It might be about the cores that are missing
>>>>>> from the partial core_mask raising interrupts, but an external
>>>>>> abort on non-linefetch is strange to see here.
>>>>> I've seen such external aborts in the past, and the fault type has
>>>>> often been misleading. It's unlikely to have anything to do with a
>>>> Yeah, often accessing device with power or clocks gated.
>>>>
>>> Except my commit does *not* gate SoC power, nor SoC clocks 🙂
>> It could be that something (like clocks or power supplies) was missing
>> on this board/SoC, which was not critical till your patch came.
>>
>>> What the "Really power off ..." commit does is to ask the GPU to
>>> internally power off the shaders, tilers and L2, that's why I say
>>> that it is strange to see that kind of abort.
>>>
>>> The GPU_INT_CLEAR, GPU_INT_STAT, GPU_FAULT_STATUS and
>>> GPU_FAULT_ADDRESS_{HI/LO} registers should still be accessible even
>>> with shaders, tilers and cache OFF.
>>>
>>> Anyway, yes, synchronizing IRQs before calling the poweroff sequence
>>> would also work, but that'd add up quite a bit of latency on the
>>> runtime_suspend() call, so in this case I'd be more for avoiding to
>>> execute any register r/w in the handler by either checking if the GPU
>>> is supposed to be OFF, or clearing interrupts, which may not work if
>>> those are generated after the execution of the poweroff function.
>>> Or we could simply disable the irq after power_off, but that'd be
>>> hacky (as well).
>>>
>>>
>>> Let's see if asking to poweroff *everything* works:
>> Worked.
>
> Yes, I also got into this issue some time ago, but I didn't report it
> because I also had some power supply related problems on my test farm
> and everything was a bit unstable. I wasn't 100% sure that the
> $subject patch is responsible for the observed issues. Now, after
> fixing power supply, I confirm that the issue was revealed by the
> $subject patch and above mentioned change fixes the problem. Feel free
> to add:
>
> Tested-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>


I must revoke my Tested-by tag for the above fix alone. Although it
fixed the boot and system stability issues, it looks like something is
still missing: opening the panfrost DRI device causes a system crash:

root@target:~# ./modetest -C
trying to open device 'i915'...failed
trying to open device 'amdgpu'...failed
trying to open device 'radeon'...failed
trying to open device 'nouveau'...failed
trying to open device 'vmwgfx'...failed
trying to open device 'omapdrm'...failed
trying to open device 'exynos'...done
root@target:~#

8<--- cut here ---
Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c6803c
[f0c6803c] *pgd=42d87811, *pte=11800653, *ppte=11800453
Internal error: : 1008 [#1] PREEMPT SMP ARM
Modules linked in: exynos_gsc s5p_mfc s5p_jpeg v4l2_mem2mem
videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 videobuf2_common
videodev mc s5p_cec
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc2-next-20231127-00055-ge14abcb527d6 #7649
Hardware name: Samsung Exynos (Flattened Device Tree)
PC is at panfrost_gpu_irq_handler+0x18/0xfc
LR is at __handle_irq_event_percpu+0xcc/0x31c
...
Process swapper/0 (pid: 0, stack limit = 0x0e2875ff)
Stack: (0xc1301e48 to 0xc1302000)
...
 panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
 __handle_irq_event_percpu from handle_irq_event+0x38/0x80
 handle_irq_event from handle_fasteoi_irq+0x9c/0x250
 handle_fasteoi_irq from generic_handle_domain_irq+0x24/0x34
 generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
 gic_handle_irq from generic_handle_arch_irq+0x34/0x44
 generic_handle_arch_irq from __irq_svc+0x8c/0xd0
Exception stack(0xc1301f10 to 0xc1301f58)
...
 __irq_svc from default_idle_call+0x20/0x2c4
 default_idle_call from do_idle+0x244/0x2b4
 do_idle from cpu_startup_entry+0x28/0x2c
 cpu_startup_entry from rest_init+0xec/0x190
 rest_init from arch_post_acpi_subsys_init+0x0/0x8
Code: e591300c e593402c f57ff04f e591300c (e593903c)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Fatal exception in interrupt
CPU2: stopping


It looks like the panfrost interrupts must somehow be synchronized with
turning the power off, as has already been discussed. Let me know if you
want me to test any patch.
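
To illustrate the "check whether the GPU is supposed to be OFF" option
from the quoted discussion, here is a minimal userspace sketch. It is
not panfrost code and not a proposed patch: struct fake_gpu, the
fake_* functions and the bus_faults counter are all hypothetical names
that only model the idea. A real driver would also have to deal with a
handler that is already running when the power-off starts (for example
with the kernel's synchronize_irq()), which this model leaves out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of the GPU state; not the panfrost structs. */
struct fake_gpu {
	atomic_bool powered;     /* set on resume, cleared on power off */
	uint32_t int_stat;       /* stands in for GPU_INT_STAT */
	unsigned int bus_faults; /* register reads attempted while powered off */
};

enum fake_irqreturn { FAKE_IRQ_NONE, FAKE_IRQ_HANDLED };

/* Models an MMIO read: a read while powered off is the "external abort". */
static uint32_t fake_read_int_stat(struct fake_gpu *gpu)
{
	if (!atomic_load(&gpu->powered)) {
		gpu->bus_faults++;
		return 0;
	}
	return gpu->int_stat;
}

/* The handler bails out before any register access if the GPU is off. */
static enum fake_irqreturn fake_irq_handler(struct fake_gpu *gpu)
{
	uint32_t state;

	if (!atomic_load(&gpu->powered))
		return FAKE_IRQ_NONE;

	state = fake_read_int_stat(gpu);
	if (!state)
		return FAKE_IRQ_NONE;

	gpu->int_stat = 0; /* models a write to GPU_INT_CLEAR */
	return FAKE_IRQ_HANDLED;
}

/* Clear the flag first, so later interrupts never reach the registers. */
static void fake_power_off(struct fake_gpu *gpu)
{
	atomic_store(&gpu->powered, false);
}

/* Small usage example: one interrupt while on, one late interrupt while off. */
static void fake_demo(void)
{
	struct fake_gpu gpu = { .powered = true, .int_stat = 1 };

	assert(fake_irq_handler(&gpu) == FAKE_IRQ_HANDLED);

	fake_power_off(&gpu);
	gpu.int_stat = 1; /* a late interrupt raised after power off */
	assert(fake_irq_handler(&gpu) == FAKE_IRQ_NONE);
	assert(gpu.bus_faults == 0); /* no register was touched while off */
}
```

In this model the late interrupt is simply reported as not handled. On
real hardware the interrupt source would still need to be masked or
cleared at some point, otherwise a level-triggered line could keep
firing.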


Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland