Re: [PATCH] Revert "arm64: dts: qcom: sa8540p-ride: enable pcie2a node"

From: Eric Chanudet
Date: Mon Jun 12 2023 - 16:06:03 EST


On Fri, Jun 02, 2023 at 03:33:21PM -0400, Lucas Karpinski wrote:
> This reverts commit 2eb4cdcd5aba2db83f2111de1242721eeb659f71.
>
> The patch introduced a sporadic error where the Qdrive3 will fail to
> boot occasionally due to an rcu preempt stall.
> Qualcomm has disabled pcie2a downstream:
> https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/rh-patch/-/commit/447f2135909683d1385af36f95fae5e1d63a7e2f
>
> rcu: INFO: rcu_preempt self-detected stall on CPU
> rcu: 0-....: (1 GPs behind) idle=77fc/1/0x4000000000000004 softirq=841/841 fqs=2476
> rcu: (t=5253 jiffies g=-175 q=2552 ncpus=8)
> Call trace:
> __do_softirq
> ____do_softirq
> call_on_irq_stack
> do_softirq_own_stack
> __irq_exit_rcu
> irq_exit_rcu
>
> The issue occurs normally once every 3-4 boot cycles.
> There is likely a race condition caused when setting up the two pcie
> domains concurrently (pcie2a and pcie3a).
>
> The issue is not present when only pcie2a is enabled or when only pcie3a
> is enabled.
> A workaround was found that allowed the Qdrive3 to boot with both pcie2a
> and pcie3a enabled.
> Set the .probe_type to PROBE_FORCE_SYNCHRONOUS and add an msleep() to
> the probing function.
> This is not a solution, so this patch disables pcie2a. As it seems
> Red Hat are the only ones working on the board, we're fine with
> disabling the node until a root cause is found. If anyone has further
> suggestions for debugging, let me know.
>
> Signed-off-by: Lucas Karpinski <lkarpins@xxxxxxxxxx>
> ---
> During debugging:
> - Added additional time for clock/regulator stabilization.
> - Reduced the bandwidth across pcie2a and pcie3a.
> - Replaced the interconnect setup from another driver.
> - The 32-bit/64-bit/config-io space for both pcie2a and pcie3a look to be mapped correctly.
> - Verified interconnects were started successfully.

I was looking at another issue downstream that triggers a soft lockup on
CPU0, and it turns out this could be the same problem, only with less
noticeable symptoms (a stall once every 3-4 boot cycles, as you mention).

Using next-20230609, if I add a return kprobe on dw_handle_msi_irq:

    echo 'r:dwmsi_probe dw_handle_msi_irq $retval' > /sys/kernel/debug/tracing/kprobe_events
    echo 1 > /sys/kernel/debug/tracing/events/kprobes/dwmsi_probe/enable
    cat /sys/kernel/debug/tracing/trace_pipe
    <idle>-0 [000] d.h1. 690.417268: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417272: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417276: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417281: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417284: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417288: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    [...]

dw_handle_msi_irq fires constantly and never returns IRQ_HANDLED
(arg1=0x0 is IRQ_NONE). This happens consistently with either pcie2a or
pcie3a alone, after I disable the other one. I presume having both
enabled might be enough to overwhelm the system and trigger the stall?
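
For anyone wanting to tally a longer capture, here is a rough userspace
sketch that pulls the return value out of each kretprobe line. It assumes
the arg1= field carries the irqreturn value as shown in the trace above
(0 is IRQ_NONE, 1 is IRQ_HANDLED). The helper names trace_line_retval()
and count_irq_none() are made up for illustration and are not part of
any kernel or tracing API.

```c
#include <stdlib.h>
#include <string.h>

/* Mirrors the kernel's irqreturn values: IRQ_NONE is 0, IRQ_HANDLED is 1. */
enum { RET_IRQ_NONE = 0, RET_IRQ_HANDLED = 1 };

/*
 * Hypothetical helper: extract the "arg1=" value from one kretprobe
 * trace line. Returns -1 if the line has no parsable arg1 field.
 */
static long trace_line_retval(const char *line)
{
	const char *p = strstr(line, "arg1=");
	char *end;
	unsigned long v;

	if (!p)
		return -1;
	p += strlen("arg1=");
	v = strtoul(p, &end, 0);	/* base 0 accepts the 0x prefix */
	if (end == p)
		return -1;
	return (long)v;
}

/* Hypothetical helper: count the lines where the handler returned IRQ_NONE. */
static unsigned int count_irq_none(const char *const *lines, unsigned int n)
{
	unsigned int i, none = 0;

	for (i = 0; i < n; i++)
		if (trace_line_retval(lines[i]) == RET_IRQ_NONE)
			none++;
	return none;
}
```

If count_irq_none() equals the number of captured lines, every invocation
in that capture went unhandled.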

Looking at the handler, the status is always 0 after:
	status = dw_pcie_readl_dbi(pci, PCIE_MSI_INTR0_STATUS +
				   (i * MSI_REG_CTRL_BLOCK_SIZE));

Unfortunately I do not know why that is yet.
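
To spell out why a zero status matters, here is a simplified userspace
model of the loop in the handler, as I understand it. This is not the
driver code: struct model_msi, MODEL_MAX_CTRLS and the model_* names are
stand-ins invented for this sketch, and the status[] array takes the
place of the dw_pcie_readl_dbi() reads of PCIE_MSI_INTR0_STATUS. The
point is only that if every per-controller status reads 0, nothing is
dispatched and the function falls through with IRQ_NONE.

```c
#include <stdint.h>

#define MODEL_MAX_CTRLS 8	/* assumption: upper bound for this model only */

enum model_irqreturn { MODEL_IRQ_NONE = 0, MODEL_IRQ_HANDLED = 1 };

/*
 * Userspace stand-in for the per-controller status registers.
 * status[i] plays the role of the value read back for controller i.
 */
struct model_msi {
	uint32_t status[MODEL_MAX_CTRLS];
	unsigned int num_ctrls;
	unsigned int dispatched;	/* vectors that would be dispatched */
};

static enum model_irqreturn model_handle_msi_irq(struct model_msi *m)
{
	enum model_irqreturn ret = MODEL_IRQ_NONE;
	unsigned int i, bit;

	for (i = 0; i < m->num_ctrls; i++) {
		uint32_t status = m->status[i];

		if (!status)
			continue;	/* nothing pending on this controller */

		ret = MODEL_IRQ_HANDLED;
		/* One dispatch per pending bit in this controller's status. */
		for (bit = 0; bit < 32; bit++)
			if (status & (UINT32_C(1) << bit))
				m->dispatched++;
	}

	return ret;
}
```

With all status words at 0 the model returns MODEL_IRQ_NONE and
dispatches nothing, which matches the arg1=0x0 seen in the trace. So the
interrupt line is asserted but no MSI is ever reported as pending.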

>
> arch/arm64/boot/dts/qcom/sa8540p-ride.dts | 44 -----------------------
> 1 file changed, 44 deletions(-)
>
> diff --git a/arch/arm64/boot/dts/qcom/sa8540p-ride.dts b/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
> index 24fa449d48a6..d492723ccf7c 100644
> --- a/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
> +++ b/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
> @@ -186,27 +186,6 @@ &i2c18 {
> status = "okay";
> };
>
> -&pcie2a {
> - ranges = <0x01000000 0x0 0x3c200000 0x0 0x3c200000 0x0 0x100000>,
> - <0x02000000 0x0 0x3c300000 0x0 0x3c300000 0x0 0x1d00000>,
> - <0x03000000 0x5 0x00000000 0x5 0x00000000 0x1 0x00000000>;
> -
> - perst-gpios = <&tlmm 143 GPIO_ACTIVE_LOW>;
> - wake-gpios = <&tlmm 145 GPIO_ACTIVE_HIGH>;
> -
> - pinctrl-names = "default";
> - pinctrl-0 = <&pcie2a_default>;
> -
> - status = "okay";
> -};
> -
> -&pcie2a_phy {
> - vdda-phy-supply = <&vreg_l11a>;
> - vdda-pll-supply = <&vreg_l3a>;
> -
> - status = "okay";
> -};
> -
> &pcie3a {
> ranges = <0x01000000 0x0 0x40200000 0x0 0x40200000 0x0 0x100000>,
> <0x02000000 0x0 0x40300000 0x0 0x40300000 0x0 0x20000000>,
> @@ -356,29 +335,6 @@ i2c18_default: i2c18-default-state {
> bias-pull-up;
> };
>
> - pcie2a_default: pcie2a-default-state {
> - perst-pins {
> - pins = "gpio143";
> - function = "gpio";
> - drive-strength = <2>;
> - bias-pull-down;
> - };
> -
> - clkreq-pins {
> - pins = "gpio142";
> - function = "pcie2a_clkreq";
> - drive-strength = <2>;
> - bias-pull-up;
> - };
> -
> - wake-pins {
> - pins = "gpio145";
> - function = "gpio";
> - drive-strength = <2>;
> - bias-pull-up;
> - };
> - };
> -
> pcie3a_default: pcie3a-default-state {
> perst-pins {
> pins = "gpio151";
> --
> 2.40.1
>

--
Eric Chanudet