Re: [PATCH] arm64: dts: qcom: sdm845-db845c: Move LVS regulator nodes up

From: Linux regression tracking (Thorsten Leemhuis)
Date: Thu Jun 22 2023 - 03:47:23 EST


Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

As Linus will likely release 6.4 on this or the following Sunday a quick
status inquiry so I can brief him appropriately: is there any hope the
regression this patch tried to fix will be resolved any time soon?
Doesn't look like it from below message and this thread, but maybe I
missed something.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

On 20.06.23 17:59, Bjorn Andersson wrote:
> On Wed, Jun 14, 2023 at 12:44:15PM -0700, Doug Anderson wrote:
>> Hi,
>>
>> On Wed, Jun 14, 2023 at 11:47 AM Krzysztof Kozlowski
>> <krzysztof.kozlowski@xxxxxxxxxx> wrote:
>>>
>>> On 14/06/2023 20:18, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>> On 02.06.23 18:12, Amit Pundir wrote:
>>>>> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
>>>>> list to workaround a boot regression uncovered by the upstream
>>>>> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
>>>>> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
>>>>>
>>>>> Without this fix DB845c fail to boot at times because one of the
>>>>> lvs1 or lvs2 regulators fail to turn ON in time.
>>>>
>>>> /me waves friendly
>>>>
>>>> FWIW, as it's not obvious: this...
>>>>
>>>>> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@xxxxxxxxxxxxxx/
>>>>
>>>> ...is a report about a regression. One that we could still solve before
>>>> 6.4 is out. One I'll likely will point Linus to, unless a fix comes into
>>>> sight.
>>>>
>>>> When I noticed the reluctant replies to this patch I earlier today asked
>>>> in the thread with the report what the plan forward was:
>>>> https://lore.kernel.org/all/CAD%3DFV%3DV-h4EUKHCM9UivsFHRsJPY5sAiwXV3a1hUX9DUMkkxdg@xxxxxxxxxxxxxx/
>>>>
>>>> Dough there replied:
>>>>
>>>> ```
>>>> Of the two proposals made (the revert vs. the reordering of the dts),
>>>> the reordering of the dts seems better. It only affects the one buggy
>>>> board (rather than preventing us to move to async probe for everyone)
>>>> and it also has a chance of actually fixing something (changing the
>>>> order that regulators probe in rpmh-regulator might legitimately work
>>>> around the problem). That being said, just like the revert the dts
>>>> reordering is still just papering over the problem and is fragile /
>>>> not guaranteed to work forever.
>>>> ```
>>>>
>>>> Papering over obviously is not good, but has anyone a better idea to fix
>>>> this? Or is "not fixing" for some reason an viable option here?
>>>>
>>>
>>> I understand there is a regression, although kernel is not mainline
>>> (hash df7443a96851 is unknown) and the only solutions were papering the
>>> problem. Reverting commit is a temporary workaround. Moving nodes in DTS
>>> is not acceptable because it hides actual problem and only solves this
>>> one particular observed problem, while actual issue is still there. It
>>> would be nice to be able to reproduce it on real mainline with normal
>>> operating system (not AOSP) - with ramdiks/without/whatever. So far no
>>> one did it, right?
>>
>> The worry I have about the revert here is that it will never be able
>> to be undone and that doesn't seem great long term. I'm all for a
>> temporary revert to fix a problem while the root cause is understood,
>> but in this case I have a hard time believing that we'll make more
>> progress towards a root cause once the revert lands. All the
>> investigation we've done so far seems to indicate that the revert only
>> fixes the problem by luck...
>>
>> I completely agree that moving the nodes in the DTS is a hack and just
>> hides the problem. However, it also at least limits the workaround to
>> the one board showing the problem and doesn't mean we're stuck with
>> synchronous probe for rpmh-regulator for all eternity because nobody
>> can understand this timing issue on db845c.
>>
>
> I agree that we shouldn't hide this by reverting the regulator change.
>
>
> And as has been stated a few times already, the symptom indicates that
> we have a misconfigured system.
>
> Before accepting a patch just shuffling the bricks, I'd like to see some
> more analysis of what happens wrt the rpmh right before the timeout.
> Perhaps the landing team can assist here?
>
> Regards,
> Bjorn
>
>