Re: [PATCH] regulator: qcom-rpmh: Revert "regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"

From: Doug Anderson
Date: Thu Jun 01 2023 - 10:05:40 EST


Hi,

On Wed, May 31, 2023 at 11:11 PM Amit Pundir <amit.pundir@xxxxxxxxxx> wrote:
>
> On Wed, 17 May 2023 at 02:54, Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@xxxxxxxxxx> wrote:
> > >
> > > On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@xxxxxxxxxx> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 13/05/2023 18:08, Amit Pundir wrote:
> > > > > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@xxxxxxxxxxxx> wrote:
> > > > > > >>
> > > > > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
> > > > > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
> > > > > > >> prompted the us to switch to synchronous probe showed that the root
> > > > > > >> cause was a missing "rootwait" in the kernel command line
> > > > > > >> arguments. Let's reinstate asynchronous probe.
> > > > > > >
> > > > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
> > > > > > > running AOSP (Android Open Source Project) with v6.4-rc1
> > > > > > > https://bugs.linaro.org/show_bug.cgi?id=5975.
> > > > > > > Can we please go back to synchronous probe.
> > > > > > >
> > > > > > > AOSP do not make use of rootwait, IIRC, but it is added by the
> > > > > > > bootloader anyway. And the device fails to boot AOSP regardless of
> > > > > > > "rootwait" bootarg being present or not.
> > > > > >
> > > > > > Could you try applying this diff to enable some log spam and let me know
> > > > > > what you get? I'm keen to try and figure this one out. My mail client
> > > > > > might crunch this a bit so I have pasted it here too
> > > > > > https://p.calebs.dev/ab74b7@raw
> > > > >
> > > > > These prints add just enough delay for the UFS probe to succeed that I
> > > > > can't reproduce the failure anymore.
> > > >
> > > > I'd prefer doing at least a little debugging before jumping to a
> > > > revert. From looking at your dmesg [1], it looks as if the async probe
> > > > is allowing RPMH to probe at the same time as "qcom-vadc-common".
> > > > That's something that talks on the SPMI bus and is (potentially)
> > > > talking to the same PMICs that RPMH-regulator is, right? I'm by no
> > > > means an expert on how Qualcomm's PMICs work, but it seems plausible
> > > > that the "qcom-vadc-common" is somehow causing problems and screwing
> > > > up RPMH. Does that seem plausible to you?
> > > >
> > > > If so, one interesting way to track it down would be to move around
> > > > delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that
> > > > _won't_ fix the problem. Now put a ~500ms sleep at the beginning of
> > > > vadc_probe(). Maybe that will fix the problem? If so, you can move the
> > > > delay around to narrow down the conflict. My wild guess would be that
> > > > vadc_reset() could be throwing things for a loop?
> > > >
> > > > If the above doesn't work, maybe we could add more tracing / printouts
> > > > to see what is probing at the same time as RPMH?
> > >
> > > Tried out a few changes today but none of them worked or were
> > > effective enough to debug this crash further, other than setting
> > > fw_devlink=permissive.
> > >
> > > Adding more tracing / prints (BOOTTIME_TRACING and
> > > FUNCTION_GRAPH_TRACER) didn't work and didn't help in reproducing the
> > > crash either. They added just enough delay to boot the device
> > > successfully everytime.
> > >
> > > I tried to reason with the kernel modules which gets loaded before and
> > > after the qcom-rpmh-regulator (QCOM_REBOOT_MODE, QCOM_PON, IIO/VADC,
> > > SPMI_PMIC* etc) as suggested, but I run into the same crash even if I
> > > disable those driver modules. So I don't think that the other driver
> > > modules which gets loaded at around the same time as
> > > qcom-rpmh-regulator by default have any impact on this failure.
> >
> > Ugh, Heisenbugs are no fun to debug. :( It sorta sounds as if pretty
> > much anything you can do to change the timing fixes you. That does
> > make reverting the async probe of the regulator less appealing. If, as
> > you said, it's not just some other driver loading at the same time
> > that's interfering then the revert "fixes" you in the same way that a
> > "msleep" would fix you. That doesn't seem like enough of a
> > justification for the revert to me.
> >
> > It still feels to me like _something_ is happening at the same time as
> > the RPMH regulator driver is loading, though, I'm just not sure how to
> > suggest debugging it. I guess other thoughts:
> >
> > * When RPMH complains, is it always with the same regulator (lvs1), or
> > sometimes different ones? Any clue there?
>
> It is always either lvs1 or lvs2.

If you reorder the nodes in the device tree, I think it'll change the
probe order. Does that affect anything? I'm wondering if there's some
sort of delayed reaction from a previous regulator.


> > * How much can you control module loading order? If rpmh-regulator
> > module loads first, does it "fix" things? If it does, maybe you could
> > bisect to find the place where problems start cropping up. Does that
> > give any clues?
>
> Loading qcom-rpmh-regulator first does make it difficult to reproduce
> the crash. I tried a few combinations to narrow down the issue further
> and in one case, I managed to reproduce the crash (1 in 20+ reboots)
> while loading the qcom-rpmh-regulator (and the dependent cmd-db,
> qcom_rpmh) module alone
> https://bugs.linaro.org/attachment.cgi?id=1140.
>
> Regards,
> Amit Pundir

OK, now that's even weirder. If loading the module alone reproduces
the problem, I can't imagine why this would be different without
"async" probe. In other words, If it reproduces 5% of the time when
loading the module alone with async probe, I'd expect it to reproduce
5% of the time when loading the module alone _without_ async probe
too.

Note: make sure you're careful to collect more than one reproduction
before asserting odds. If it reproduced once in 20 reboots, it might
be 5%, it might be 20%, or it might be .01%. Ideally you could script
this and let it run for a few hours to get a good reproduction rate?

-Doug