Re: [PATCH 0/4] Broadcom STB PM PSCI extensions

From: Florian Fainelli
Date: Mon Feb 14 2022 - 13:13:06 EST


On 2/7/22 8:27 AM, Sudeep Holla wrote:
> On Thu, Feb 03, 2022 at 11:33:26AM -0800, Florian Fainelli wrote:
>>
>>
>> On 2/3/2022 10:52 AM, Sudeep Holla wrote:
>>> Correction: it is known as "freeze" rather than "idle" in terms of values
>>> as per /sys/power/state. Sorry for referring it as "idle" and creating any
>>> confusion.
>>>
>>> On Thu, Feb 03, 2022 at 09:36:28AM -0800, Florian Fainelli wrote:
>>>>
>>>>
>>>> On 2/3/2022 3:14 AM, Sudeep Holla wrote:
>>>>> On Fri, Jan 21, 2022 at 07:54:17PM -0800, Florian Fainelli wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> This patch series contains the Broadcom STB PSCI extensions which adds
>>>>>> some additional functions on top of the existing standard PSCI interface
>>>>>> which is the reason for having the driver implement a custom
>>>>>> suspend_ops.
>>>>>>
>>>>>> These platforms have traditionally supported a mode that is akin to
>>>>>> ACPI's S2 with the CPU in WFI and all of the chip being clock gated
>>>>>> which is entered with "echo standby > /sys/power/state". Additionally, a
>>>>>> true suspend to DRAM as defined in ACPI by S3 is implemented with "echo
>>>>>> mem > /sys/power/state".
>>>>>
>>>>> How different is the above "standby" state compared to the standard "idle"
>>>>> (a.k.a suspend-to-idle which is different from system-to-ram/S3) ?
>>>>
>>>> There are a few differences:
>>>>
>>>> - s2idle does not power gate the secondary CPUs
>>>>
>>>
>>> Not sure what you mean by that ? S2I takes CPUs to deepest idle state.
>>> If you want shallower states, one possible option is to disable deeper
>>> states from the userspace.
>>
>> What I mean is that we do not get to call PSCI CPU_OFF here so the CPUs are
>> idle, but not power gated. Those CPUs do not have any other idle state other
>> than WFI because the HW designers sort of forgot or rather did not know that
>> wiring up the ARM GIC power controller back to the power gating logic of the
>> CPU was a good idea.
>>
>
> Nice 😄
>
>>>
>>>> - s2idle requires the use of in-band interrupts for wake-up
>>>>
>>>
>>> I am not sure if that is true. S2I behaves very similar to S2R except it
>>> has low wake latency as all secondaries CPUs are not hotplugged out.
>>
>> OK, the fact that secondary CPUs are not hot-plugged could be remedied by
>> doing this ahead of entering s2idle by user-space so this is not a valid
>> argument from me anymore.
>>
>
> Fair enough.
>
>>>
>>>> The reasons for implementing "standby" are largely two fold:
>>>>
>>>> - we need to achieve decent power savings (typically below 0.5W for the
>>>> whole system while allowing Wake-on-WLAN, GPIO, RTC, infrared, etc.)
>>>>
>>>
>>> I fail to understand how that is a problem from S2I. It is probably worth
>>> checking if there are any unnecessary IRQF_NO_SUSPEND users. Check section
>>> IRQF_NO_SUSPEND and enable_irq_wake() in [1]. I don't see any issues
>>> otherwise in terms of unnecessary/spurious wakeup by in-band (to be precise
>>> no-wake up) interrupts.
>>
>> I don't think your hyperlink referenced by [1] was provided, but my quick
>> testing with:
>>
>
> Yikes, I meant to refer Documentation/power/suspend-and-interrupts.rst
>
>> echo s2idle > /sys/power/mem_sleep
>> echo mem > /sys/power/state
>>
>> appears to work to some extent when I use peripherals that can generate
>> in-band interrupts.
>>
>> It looks like we have s2idle_ops that allows a platform to override some of
>> the operations before/after entering s2idle, however the actual s2idle idle
>> loop is still within the kernel, so we will not call into the ARM Trusted
>> Firmware and engage the power management state machine.
>>
>
> Correct.
>
>> This means that there will not be any of the clock gating that only the
>> hardware state machine is capable of performing, the DRAM controller as a
>> result will not enter self refresh power down, and in addition the side band
>> wake-up interrupts will not be active because the interrupt controller
>> that aggregates them only outputs to the ARM GIC when the state machine has
>> been engaged.
>>
>
> One possible solution IIUC the issue is to add this as additional CPU Idle
> state disabled most of the time. Enable them from user-space just prior to
> calling freeze/s2idle, so that PSCI CPU_SUSPEND is called with right param
> to indicate this is deepest idle state(in your case just WFI) + DRAM self
> refresh/retention mode. Also TF-A can take care to enable the side band
> interrupts before entering the state.

Not knowing how to enable a disabled idle state from user-space, and
ensure that it does not race with cpuidle somehow choosing to enter that
state, I have all sorts of concerns about such interactions but can see
how this could be made to work. In fact, I am wondering if we would not
be better off working around our broken HW and always advertising that state,
and just letting cpuidle pick that "deep" idle state, resulting in powering
down secondary core(s). In TF-A we would have to ensure that we save all
of the SPIs affine to that particular core, and probably re-configure
the PPIs and SGIs to be made secure such that TF-A can "trap" them and
wake-up the core that was just powered off.
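
For reference, the cpuidle sysfs interface exposes a per-state "disable"
attribute, so the user-space half of the suggestion might look roughly
like the sketch below. The helper name, the CPU_ROOT variable and the
state index are hypothetical and only there for illustration; the right
index has to be found by reading each stateN/name file. This does
nothing about the race with cpuidle selecting the state as soon as it is
enabled.

```shell
#!/bin/sh
# Sketch only: toggle one cpuidle state on every CPU through sysfs.
# CPU_ROOT and the helper name are assumptions made for illustration.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# set_idle_state_disabled <state-index> <0|1>
#   0 = cpuidle may select the state, 1 = the state is disabled
set_idle_state_disabled() {
    idx="$1"
    val="$2"
    for f in "$CPU_ROOT"/cpu[0-9]*/cpuidle/state"$idx"/disable; do
        # Skip when the glob did not match or the state does not exist.
        [ -e "$f" ] || continue
        echo "$val" > "$f"
    done
}
```

Assuming the deep state happened to be state1, user-space would call
"set_idle_state_disabled 1 0" just before "echo freeze > /sys/power/state"
and "set_idle_state_disabled 1 1" after resuming.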

It sounds like for your suggested approach plus requesting to enter
s2idle, we need to start trapping WFI at ELx into EL3 such that TF-A has
a chance of observing that all CPU cores are powered down and/or idle
and then can engage our power management state machine hardware to clock
gate the system and mux in the out of band interrupts. Alternatively,
would we have to either modify the boot CPU's default idle state, or
advertise an additional "deeper" idle state that involves calling into
TF-A with PSCI CPU_SUSPEND, too?

What I really like about our approach, other than that it has been proven
to work over the past 10 years, is that it fits well with the Linux system
suspend path via suspend_ops, with each layer taking care of defining
its points of no return etc. We know how to debug it, and it is not
opportunistic, unlike cpuidle, which makes it easier to control. From a
non-technical point of view, it is also the devil people are used to,
and no matter how we shape it, there will be resistance to change.

>
> Do you see any issue with this approach ? I am trying to find ways to avoid
> deviating from standard PSCI.

Re-reading the PSCI specification, and SYSTEM_SUSPEND specifically, it
does sound like ACPI S2 was taken into account, but that led to
SYSTEM_SUSPEND not allowing one to differentiate between these two states on
the premise that "In practice, operating systems use only one suspend to
RAM state, so this is not seen as a limitation". It would have been nice
to leave provision for defining both.

I am sympathetic to avoiding divergence of both interpretation and
implementation of the PSCI specification and its reference
implementation in Linux. So I can see a few paths forward:

- have me try what you suggested, which will take me weeks because I
have a TODO list as long as my arm (and my arms are really long),
assuming that this even works for our use cases

- work with you to amend the PSCI specification such that we can
differentiate between an entry into S2 or S3 by defining which state we
want to enter, but I assume we will get lots of resistance here?
--
Florian