Re: [PATCH] regulator: qcom-rpmh: Revert "regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"

From: Amit Pundir
Date: Fri Jun 02 2023 - 03:31:37 EST


On Thu, 1 Jun 2023 at 19:35, Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> Hi,
>
> On Wed, May 31, 2023 at 11:11 PM Amit Pundir <amit.pundir@xxxxxxxxxx> wrote:
> >
> > On Wed, 17 May 2023 at 02:54, Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@xxxxxxxxxx> wrote:
> > > >
> > > > On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 13/05/2023 18:08, Amit Pundir wrote:
> > > > > > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@xxxxxxxxxxxx> wrote:
> > > > > > > >>
> > > > > > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
> > > > > > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
> > > > > > > >> prompted the us to switch to synchronous probe showed that the root
> > > > > > > >> cause was a missing "rootwait" in the kernel command line
> > > > > > > >> arguments. Let's reinstate asynchronous probe.
> > > > > > > >
> > > > > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
> > > > > > > > running AOSP (Android Open Source Project) with v6.4-rc1
> > > > > > > > https://bugs.linaro.org/show_bug.cgi?id=5975.
> > > > > > > > Can we please go back to synchronous probe.
> > > > > > > >
> > > > > > > > AOSP do not make use of rootwait, IIRC, but it is added by the
> > > > > > > > bootloader anyway. And the device fails to boot AOSP regardless of
> > > > > > > > "rootwait" bootarg being present or not.
> > > > > > >
> > > > > > > Could you try applying this diff to enable some log spam and let me know
> > > > > > > what you get? I'm keen to try and figure this one out. My mail client
> > > > > > > might crunch this a bit so I have pasted it here too
> > > > > > > https://p.calebs.dev/ab74b7@raw
> > > > > >
> > > > > > These prints add just enough delay for the UFS probe to succeed that I
> > > > > > can't reproduce the failure anymore.
> > > > >
> > > > > I'd prefer doing at least a little debugging before jumping to a
> > > > > revert. From looking at your dmesg [1], it looks as if the async probe
> > > > > is allowing RPMH to probe at the same time as "qcom-vadc-common".
> > > > > That's something that talks on the SPMI bus and is (potentially)
> > > > > talking to the same PMICs that RPMH-regulator is, right? I'm by no
> > > > > means an expert on how Qualcomm's PMICs work, but it seems plausible
> > > > > that the "qcom-vadc-common" is somehow causing problems and screwing
> > > > > up RPMH. Does that seem plausible to you?
> > > > >
> > > > > If so, one interesting way to track it down would be to move around
> > > > > delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that
> > > > > _won't_ fix the problem. Now put a ~500ms sleep at the beginning of
> > > > > vadc_probe(). Maybe that will fix the problem? If so, you can move the
> > > > > delay around to narrow down the conflict. My wild guess would be that
> > > > > vadc_reset() could be throwing things for a loop?
> > > > >
> > > > > If the above doesn't work, maybe we could add more tracing / printouts
> > > > > to see what is probing at the same time as RPMH?
> > > >
> > > > Tried out a few changes today but none of them worked or were
> > > > effective enough to debug this crash further, other than setting
> > > > fw_devlink=permissive.
> > > >
> > > > Adding more tracing / prints (BOOTTIME_TRACING and
> > > > FUNCTION_GRAPH_TRACER) didn't work and didn't help in reproducing the
> > > > crash either. They added just enough delay to boot the device
> > > > successfully everytime.
> > > >
> > > > I tried to reason with the kernel modules which gets loaded before and
> > > > after the qcom-rpmh-regulator (QCOM_REBOOT_MODE, QCOM_PON, IIO/VADC,
> > > > SPMI_PMIC* etc) as suggested, but I run into the same crash even if I
> > > > disable those driver modules. So I don't think that the other driver
> > > > modules which gets loaded at around the same time as
> > > > qcom-rpmh-regulator by default have any impact on this failure.
> > >
> > > Ugh, Heisenbugs are no fun to debug. :( It sorta sounds as if pretty
> > > much anything you can do to change the timing fixes you. That does
> > > make reverting the async probe of the regulator less appealing. If, as
> > > you said, it's not just some other driver loading at the same time
> > > that's interfering then the revert "fixes" you in the same way that a
> > > "msleep" would fix you. That doesn't seem like enough of a
> > > justification for the revert to me.
> > >
> > > It still feels to me like _something_ is happening at the same time as
> > > the RPMH regulator driver is loading, though, I'm just not sure how to
> > > suggest debugging it. I guess other thoughts:
> > >
> > > * When RPMH complains, is it always with the same regulator (lvs1), or
> > > sometimes different ones? Any clue there?
> >
> > It is always either lvs1 or lvs2.
>
> If you reorder the nodes in the device tree, I think it'll change the
> probe order. Does that affect anything? I'm wondering if there's some
> sort of delayed reaction from a previous regulator.

Hi, Bumping lvs1 and lvs2 regulators up to the top of the list in the
DTS https://bugs.linaro.org/show_bug.cgi?id=5975#c4 does seem to work.
I can't reproduce the crash in 125 reboots so far, while I'm still
testing with only qcom-rpmh-regulator kernel module. I'll do some more
testing with full system running and send this re-ordering fix I can't
reproduce the crash further.

>
>
> > > * How much can you control module loading order? If rpmh-regulator
> > > module loads first, does it "fix" things? If it does, maybe you could
> > > bisect to find the place where problems start cropping up. Does that
> > > give any clues?
> >
> > Loading qcom-rpmh-regulator first does make it difficult to reproduce
> > the crash. I tried a few combinations to narrow down the issue further
> > and in one case, I managed to reproduce the crash (1 in 20+ reboots)
> > while loading the qcom-rpmh-regulator (and the dependent cmd-db,
> > qcom_rpmh) module alone
> > https://bugs.linaro.org/attachment.cgi?id=1140.
> >
> > Regards,
> > Amit Pundir
>
> OK, now that's even weirder. If loading the module alone reproduces
> the problem, I can't imagine why this would be different without
> "async" probe. In other words, If it reproduces 5% of the time when
> loading the module alone with async probe, I'd expect it to reproduce
> 5% of the time when loading the module alone _without_ async probe
> too.

I don't know the internal workings of forcing an asynchronous or
synchronous probe, but loading this module alone crashed 60 times in
350 reboots with this async patch, while it didn't crash at all in
about 250 reboots if I do PROBE_FORCE_SYNCHRONOUS.

Regards,
Amit Pundir

>
> Note: make sure you're careful to collect more than one reproduction
> before asserting odds. If it reproduced once in 20 reboots, it might
> be 5%, it might be 20%, or it might be .01%. Ideally you could script
> this and let it run for a few hours to get a good reproduction rate?
>
> -Doug