Re: [PATCH] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()

From: Krzysztof Kozlowski
Date: Tue Nov 21 2023 - 12:08:55 EST


On 21/11/2023 17:55, Boris Brezillon wrote:
> On Tue, 21 Nov 2023 17:11:42 +0100
> AngeloGioacchino Del Regno <angelogioacchino.delregno@xxxxxxxxxxxxx>
> wrote:
>
>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto:
>>> On 08/11/2023 14:20, Steven Price wrote:
>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:
>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
>>>>> this means that in order to request poweroff of cores, we are supposed
>>>>> to write a bitmask of cores that should be powered off!
>>>>> This means that the panfrost_gpu_power_off() function has always been
>>>>> doing nothing.
>>>>>
>>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff
>>>>> to the relevant PWROFF_LO registers and then check that the transition
>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
>>>>> registers.
>>>>>
>>>>> While at it, in order to avoid code duplication, move the core mask
>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
>>>>> function, used in both poweron and poweroff.
>>>>>
>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
>>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@xxxxxxxxxxxxx>
>>>
>>>
>>> Hi,
>>>
>>> This commit was added to next recently but it causes "external abort on
>>> non-linefetch" during boot of my Odroid HC1 board.
>>>
>>> At least bisect points to it.
>>>
>>> If fixed, please add:
>>>
>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@xxxxxxxxxx>
>>>
>>> [ 4.861683] 8<--- cut here ---
>>> [ 4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c
>>> [ 4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453
>>> ...
>>> [ 5.164010] panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
>>> [ 5.171276] __handle_irq_event_percpu from handle_irq_event+0x38/0x80
>>> [ 5.177765] handle_irq_event from handle_fasteoi_irq+0x9c/0x250
>>> [ 5.183743] handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38
>>> [ 5.190417] generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
>>> [ 5.196741] gic_handle_irq from generic_handle_arch_irq+0x34/0x44
>>> [ 5.202893] generic_handle_arch_irq from __irq_svc+0x8c/0xd0
>>>
>>> Full log:
>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0
>>>
>>
>> Hey Krzysztof,
>>
>> This is interesting. It might be about the cores that are missing from the partial
>> core_mask raising interrupts, but an external abort on non-linefetch is strange to
>> see here.
>
> I've seen such external aborts in the past, and the fault type has
> often been misleading. It's unlikely to have anything to do with a

Yeah, often accessing device with power or clocks gated.

> non-linefetch access, but it might be caused by a register access after
> the clock or power domain driving the register bank has been disabled.
> The following diff might help validate this theory. If that works, we
> probably want to make sure we synchronize IRQs before disabling in the
> suspend path.
>
> --->8---
> diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h
> index 55ec807550b3..98df66e5cc9b 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_regs.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_regs.h
> @@ -34,8 +34,6 @@
> (GPU_IRQ_FAULT |\
> GPU_IRQ_MULTIPLE_FAULT |\
> GPU_IRQ_RESET_COMPLETED |\
> - GPU_IRQ_POWER_CHANGED |\
> - GPU_IRQ_POWER_CHANGED_ALL |\

This helped, at least for this issue (next-20231121). Much later in
user-space boot I have lockups:
watchdog: BUG: soft lockup - CPU#4 stuck for 26s! [kworker/4:1:61]

[ 56.329224] smp_call_function_single from
__sync_rcu_exp_select_node_cpus+0x29c/0x78c
[ 56.337111] __sync_rcu_exp_select_node_cpus from
sync_rcu_exp_select_cpus+0x334/0x878
[ 56.344995] sync_rcu_exp_select_cpus from wait_rcu_exp_gp+0xc/0x18
[ 56.351231] wait_rcu_exp_gp from process_one_work+0x20c/0x620
[ 56.357038] process_one_work from worker_thread+0x1d0/0x488
[ 56.362668] worker_thread from kthread+0x104/0x138
[ 56.367521] kthread from ret_from_fork+0x14/0x28

But anyway the external abort does not appear.

Best regards,
Krzysztof