Re: [PATCH] ARM: mach-qcom: fix support for ipq806x

From: Christian Marangi
Date: Wed Jan 17 2024 - 08:17:18 EST


On Wed, Oct 26, 2022 at 10:19:21AM +0200, Linus Walleij wrote:
> On Tue, Oct 25, 2022 at 1:47 AM Christian Marangi <ansuelsmth@xxxxxxxxx> wrote:
>
> > bad news... yesterday I tested this binding and it's problematic. It
> > does work and the router boots correctly...
>
> That's actually partly good news :D

Hi,
sorry for the necroposting, but I got some time and wanted to fix and
bisect this for good, since IPQ806x is finally in better shape and is
actually modern enough.

>
> > problem is that SMEM is
> > broken with such configuration... I assume with this binding, by the
> > system view ram starts from 0x42000000 instead of 0x40000000 and this
> > causes SMEM to fail probe with the error "SBL didn't init SMEM".
>
> We need to fix this.
>

Totally, but I think the problem runs deeper...

> > This is the location of SMEM entry in ram
> >
> > smem: smem@41000000 {
> >         compatible = "qcom,smem";
> >         reg = <0x41000000 0x200000>;
> >         no-map;
> >
> >         hwlocks = <&sfpb_mutex 3>;
> > };
> (...)
> > Wonder if you have other ideas about this.
>
> So the problem is that the resource is outside of the system RAM?
>
> I don't understand why that triggers it since this is per definition not
> system RAM, it is SMEM after all. And it is no different in essence
> from any memory mapped IO or other things that are outside of
> the system RAM.
>
> The SMEM node is special since it is created without children thanks
> to the hack in drivers/of/platform.c.
>
> Then the driver in drivers/soc/qcom/smem.c
> contains things like this:
>
> rmem = of_reserved_mem_lookup(pdev->dev.of_node);
> if (rmem) {
>         smem->regions[0].aux_base = rmem->base;
>         smem->regions[0].size = rmem->size;
> } else {
>         /*
>          * Fall back to the memory-region reference, if we're not a
>          * reserved-memory node.
>          */
>         ret = qcom_smem_resolve_mem(smem, "memory-region",
>                                     &smem->regions[0]);
>         if (ret)
>                 return ret;
> }
>
> However it is treated as memory-mapped IO later:
>
> for (i = 1; i < num_regions; i++) {
>         smem->regions[i].virt_base = devm_ioremap_wc(&pdev->dev,
>                                                      smem->regions[i].aux_base,
>                                                      smem->regions[i].size);
>         if (!smem->regions[i].virt_base) {
>                 dev_err(&pdev->dev, "failed to remap %pa\n",
>                         &smem->regions[i].aux_base);
>                 return -ENOMEM;
>         }
> }
>
> As a first hack I would check:
>
> 1. Is it the of_reserved_mem_lookup() or qcom_smem_resolve_mem() stuff
> in drivers/soc/qcom/smem.c that is failing?
>
> If yes then:
>
> 2. Add a fallback path just using of_iomap(node) for aux_base and size
> with some comment like /* smem is outside of the main memory map */
> and see if that works.
>

I think we got confused and didn't read the code correctly. The actual
error is "SMEM is not initialized by SBL", which is triggered by...

header = smem->regions[0].virt_base;
if (le32_to_cpu(header->initialized) != 1 ||
    le32_to_cpu(header->reserved)) {
        dev_err(&pdev->dev, "SMEM is not initialized by SBL\n");
        return -EINVAL;
}

I verified that aux_base and size hold the expected values, 0x41000000
and 0x200000, and from what I can see the region gets iomapped
correctly.

The problem is that initialized and reserved contain garbage (not random
data, though: it's the same data every time).
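
One way to look at this outside the kernel is to take a raw dump of the
first bytes at 0x41000000 and run the same condition on it offline. The
following is only a minimal userspace sketch of that check, not the
driver code. The field offsets are an assumption, derived from the
layout of struct smem_header in drivers/soc/qcom/smem.c (four 16-byte
proc_comm entries followed by 32 version words), and should be verified
against the tree in use. The names read_le32() and smem_header_ok() are
made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * ASSUMPTION: byte offsets of the struct smem_header fields, computed as
 * 4 * 16 bytes (proc_comm) + 32 * 4 bytes (version) = 0xC0 for
 * "initialized", followed by free_offset, available and reserved.
 * Check these against drivers/soc/qcom/smem.c before relying on them.
 */
#define SMEM_HDR_INITIALIZED_OFF 0xC0
#define SMEM_HDR_RESERVED_OFF    0xCC
#define SMEM_HDR_MIN_LEN         0xD0

/* Read a little-endian 32-bit word, independent of host endianness. */
static uint32_t read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}

/*
 * Same condition as the probe-time check quoted above:
 * "initialized" must be exactly 1 and "reserved" must be 0.
 * Returns 1 if the dump passes, 0 otherwise (including short dumps).
 */
static int smem_header_ok(const uint8_t *buf, size_t len)
{
	if (len < SMEM_HDR_MIN_LEN)
		return 0;

	return read_le32(buf + SMEM_HDR_INITIALIZED_OFF) == 1 &&
	       read_le32(buf + SMEM_HDR_RESERVED_OFF) == 0;
}
```

Running smem_header_ok() on a dump taken before the kernel is loaded and
on one taken at probe time would show whether the two fields were
already wrong or were overwritten later.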

My theory is that the loader is somehow still writing data there, but
I'm a bit lost on how to verify that. (The fact that those values are
always the same for a given compiled image makes me think it's actually
just loaded data.)

I also tested with CONFIG_ARM_ATAG_DTB_COMPAT disabled, but I get the
same result.

This is the memory node I'm using:

memory@0 {
        reg = <0x42000000 0x1e000000>;
        device_type = "memory";
};

And in chosen I have:

chosen {
        bootargs = "earlycon";
        linux,usable-memory-range = <0x42000000 0x10000000>;
};

(The size is different just for the sake of it, but that should not
cause problems, right?)
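
To make the addresses concrete, here is a small sketch of the range
arithmetic, using only the reg values from the smem, memory and chosen
nodes in this thread. The struct and helper names are hypothetical. With
these values the memory node covers 0x42000000-0x5fffffff, the usable
range 0x42000000-0x51ffffff sits inside it, and SMEM at
0x41000000-0x411fffff overlaps neither.

```c
#include <stdbool.h>
#include <stdint.h>

/* A physical address range given as <base size>, as in the DT nodes. */
struct mem_range {
	uint64_t base;
	uint64_t size;
};

/* Values copied from the nodes in this thread. */
static const struct mem_range memory_node  = { 0x42000000ULL, 0x1e000000ULL };
static const struct mem_range usable_range = { 0x42000000ULL, 0x10000000ULL };
static const struct mem_range smem_region  = { 0x41000000ULL, 0x00200000ULL };

/* First address past the end of the range (exclusive end). */
static uint64_t range_end(struct mem_range r)
{
	return r.base + r.size;
}

/* True if inner lies completely within outer. */
static bool range_contains(struct mem_range outer, struct mem_range inner)
{
	return inner.base >= outer.base &&
	       range_end(inner) <= range_end(outer);
}

/* True if the two ranges share at least one address. */
static bool ranges_overlap(struct mem_range a, struct mem_range b)
{
	return a.base < range_end(b) && b.base < range_end(a);
}
```

So, going by the device tree alone, neither the memory node nor the
usable range hands the SMEM region to the kernel as system RAM.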

Maybe there is a way to make SMEM reclaim that RAM space and reinit
it? (It's a workaround, though.)

Also, with the current setup the kernel panics with the trace below. I
assume this is caused by SMEM malfunctioning: the panic happens right
after RPM init, while the RPM regulators are being initialized. Looking
at the affected code, maybe it's failing at the "Free unused pages"
stage?

[ 1.912392] 8<--- cut here ---
[ 1.912431] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[ 1.914356] [00000000] *pgd=00000000
[ 1.922676] Internal error: Oops: 80000007 [#1] SMP ARM
[ 1.926158] Modules linked in:
[ 1.931103] CPU: 1 PID: 84 Comm: modprobe Not tainted 6.1.65 #0
[ 1.934229] Hardware name: Generic DT based system
[ 1.940045] PC is at 0x0
[ 1.944902] LR is at release_pages+0x114/0x36c
[ 1.947595] pc : [<00000000>] lr : [<c04298dc>] psr: 40000013
[ 1.951851] sp : c27abe18 ip : c13cd5c1 fp : c27abe38
[ 1.958012] r10: 0000009c r9 : c4018268 r8 : 00000005
[ 1.963220] r7 : c243f400 r6 : c243f400 r5 : 00000098 r4 : df992b54
[ 1.968431] r3 : 00000000 r2 : 00000000 r1 : 60000013 r0 : df992b54
[ 1.975029] Flags: nZcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 1.981543] Control: 10c5787d Table: 4367806a DAC: 00000051
[ 1.988744] Register r0 information: non-slab/vmalloc memory
[ 1.994472] Register r1 information: non-paged memory
[ 2.000200] Register r2 information: NULL pointer
[ 2.005148] Register r3 information: NULL pointer
[ 2.009834] Register r4 information: non-slab/vmalloc memory
[ 2.014525] Register r5 information: non-paged memory
[ 2.020252] Register r6 information: slab kmalloc-1k start c243f400 pointer offset 0 size 1024
[ 2.025206] Register r7 information: slab kmalloc-1k start c243f400 pointer offset 0 size 1024
[ 2.033714] Register r8 information: non-paged memory
[ 2.042301] Register r9 information: non-slab/vmalloc memory
[ 2.047424] Register r10 information: non-paged memory
[ 2.053152] Register r11 information: non-slab/vmalloc memory
[ 2.058100] Register r12 information: non-paged memory
[ 2.063915] Process modprobe (pid: 84, stack limit = 0x(ptrval))
[ 2.068953] Stack: (0xc27abe18 to 0xc27ac000)
[ 2.075115] be00: 00000000 00000000
[ 2.079378] be20: c147514c ffefffcf 00000000 00000000 0000009c 60000013 dfa12928 dfa12b44
[ 2.087537] be40: c27abf24 0000009c c4018000 c401800c c27abf0c c27abf24 00000000 000000f8
[ 2.095697] be60: 00000000 c045b248 ffffffff c27abf0c c35d1400 00000000 c35d1438 c045b4f8
[ 2.103858] be80: c27abf0c 00002000 00000000 c044fb14 00000000 c0b6c2bc c35d1400 ffffffff
[ 2.112016] bea0: ffffffff c35a4c0c 00000000 ffffffff 00000000 00001c01 00000000 c3591510
[ 2.120176] bec0: 00000000 c35d1400 ffffffff c3591510 00000000 c35d1400 00000000 c0458f30
[ 2.128336] bee0: 00000000 c08f35c8 c36ebf00 c35d1400 00010000 00013fff c35a4c0c 00000000
[ 2.136496] bf00: ffffffff 00000000 00000101 c35d1400 ffffffff ffffffff c2420501 00000001
[ 2.144656] bf20: c4018000 c4018000 00000000 00000008 dfde733c dfde7360 dfde7384 dfde73a8
[ 2.152815] bf40: dfa12a44 dfa12948 dfa129d8 dfa12ad4 c35d1400 00000000 c35d1438 00000698
[ 2.160976] bf60: c27abf78 c0318a34 c35d1400 c2731000 c35d1438 c0320604 0000ff00 c258ea00
[ 2.169136] bf80: c2731000 c2456f40 c03002c4 c2456f40 00000000 c0320e0c 000000f8 c0320e6c
[ 2.177294] bfa0: ffffffff c0300060 ffffffff bed38eb4 ffffffff bed38dcc 00000000 ffffffff
[ 2.185455] bfc0: ffffffff bed38eb4 00010f60 000000f8 6474e552 00000020 00000000 00000000
[ 2.193614] bfe0: 6ffffff9 bed38e78 b6f91f1c b6fa4a44 60000010 ffffffff 00000000 00000000
[ 2.201777] release_pages from tlb_batch_pages_flush+0x3c/0x70
[ 2.209927] tlb_batch_pages_flush from tlb_finish_mmu+0x4c/0x130
[ 2.215656] tlb_finish_mmu from exit_mmap+0xec/0x1e0
[ 2.221903] exit_mmap from mmput+0x40/0x120
[ 2.226939] mmput from do_exit+0x238/0x890
[ 2.231279] do_exit from do_group_exit+0x34/0x84
[ 2.235184] do_group_exit from __wake_up_parent+0x0/0x18
[ 2.240053] Code: bad PC value
[ 2.245556] ---[ end trace 0000000000000000 ]---
[ 2.248448] Kernel panic - not syncing: Fatal exception
[ 2.253158] CPU0: stopping
[ 2.253169] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D 6.1.65 #0
[ 2.253180] Hardware name: Generic DT based system
[ 2.253189] unwind_backtrace from show_stack+0x10/0x14
[ 2.253216] show_stack from dump_stack_lvl+0x40/0x4c
[ 2.253249] dump_stack_lvl from do_handle_IPI+0xf0/0x124
[ 2.253276] do_handle_IPI from ipi_handler+0x18/0x20
[ 2.253293] ipi_handler from handle_percpu_devid_irq+0x78/0x134
[ 2.253313] handle_percpu_devid_irq from generic_handle_domain_irq+0x28/0x38
[ 2.253338] generic_handle_domain_irq from gic_handle_irq+0x74/0x88
[ 2.253361] gic_handle_irq from generic_handle_arch_irq+0x34/0x44
[ 2.253391] generic_handle_arch_irq from call_with_stack+0x18/0x20
[ 2.253419] call_with_stack from __irq_svc+0x80/0x98
[ 2.253438] Exception stack(0xc1401f00 to 0xc1401f48)
[ 2.253451] 1f00: 00000005 00000000 00000a61 c03128a0 c1408640 00000000 c1404f68 c1404fa4
[ 2.253461] 1f20: 00000000 c13c9c38 00000000 00000000 c14c1f00 c1401f50 c0307148 c030714c
[ 2.253467] 1f40: 60000013 ffffffff
[ 2.253474] __irq_svc from arch_cpu_idle+0x38/0x3c
[ 2.253500] arch_cpu_idle from default_idle_call+0x24/0x34
[ 2.253526] default_idle_call from do_idle+0x1ec/0x240
[ 2.253545] do_idle from cpu_startup_entry+0x28/0x2c
[ 2.253559] cpu_startup_entry from kernel_init+0x0/0x12c
[ 2.376160] Rebooting in 1 seconds..

--
Ansuel