Re: [PATCH 1/4] cpufreq: qcom-nvmem: Enable virtual power domain devices

From: Ulf Hansson
Date: Fri Sep 29 2023 - 09:14:52 EST


On Wed, 13 Sept 2023 at 14:26, Stephan Gerhold <stephan@xxxxxxxxxxx> wrote:
>
> On Wed, Sep 13, 2023 at 12:56:16PM +0200, Ulf Hansson wrote:
> > On Tue, 12 Sept 2023 at 11:40, Stephan Gerhold
> > <stephan.gerhold@xxxxxxxxxxxxxxx> wrote:
> > >
> > > The genpd core ignores performance state votes from devices that are
> > > runtime suspended as of commit 5937c3ce2122 ("PM: domains: Drop/restore
> > > performance state votes for devices at runtime PM").
> >
> > I think you are referring to the wrong commit above. Please have a
> > look at commit 3c5a272202c2 ("PM: domains: Improve runtime PM
> > performance state handling"), instead.
> >
> > I also suggest rephrasing the above into saying that the performance
> > state vote for a device is cached rather than carried out, if
> > pm_runtime_suspended() returns true for it.
> >
> > Another relevant information in the commit message would be to add
> > that during device-attach (genpd_dev_pm_attach_by_id()), calls
> > pm_runtime_enable() the device.
> >
>
> Thanks, I will try to clarify this a bit! I was actually looking at that
> commit originally but decided to reference the commit that "started the
> change", since the this commit is marked as fix of the one I referenced.
> But I think you're right, it would be more clear to reference "PM:
> domains: Improve runtime PM performance state handling" directly.
>
> > > However, at the
> > > moment nothing ever enables the virtual devices created in
> > > qcom-cpufreq-nvmem for the cpufreq power domain scaling, so they are
> > > permanently runtime-suspended.
> > >
> > > Fix this by enabling the devices after attaching them and use
> > > dev_pm_syscore_device() to ensure the power domain also stays on when
> > > going to suspend. Since it supplies the CPU we can never turn it off
> > > from Linux. There are other mechanisms to turn it off when needed,
> > > usually in the RPM firmware or the cpuidle path.
> > >
> > > Without this fix performance states votes are silently ignored, and the
> > > CPU/CPR voltage is never adjusted. This has been broken since 5.14 but
> > > for some reason no one noticed this on QCS404 so far.
> > >
> > > Cc: stable@xxxxxxxxxxxxxxx
> > > Fixes: 1cb8339ca225 ("cpufreq: qcom: Add support for qcs404 on nvmem driver")
> > > Signed-off-by: Stephan Gerhold <stephan.gerhold@xxxxxxxxxxxxxxx>
> > > ---
> > > drivers/cpufreq/qcom-cpufreq-nvmem.c | 21 ++++++++++++++++++++-
> > > 1 file changed, 20 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/cpufreq/qcom-cpufreq-nvmem.c b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > index 84d7033e5efe..17d6ab14c909 100644
> > > --- a/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > +++ b/drivers/cpufreq/qcom-cpufreq-nvmem.c
> > > @@ -25,6 +25,7 @@
> > > #include <linux/platform_device.h>
> > > #include <linux/pm_domain.h>
> > > #include <linux/pm_opp.h>
> > > +#include <linux/pm_runtime.h>
> > > #include <linux/slab.h>
> > > #include <linux/soc/qcom/smem.h>
> > >
> > > @@ -280,6 +281,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > > }
> > >
> > > for_each_possible_cpu(cpu) {
> > > + struct device **virt_devs = NULL;
> > > struct dev_pm_opp_config config = {
> > > .supported_hw = NULL,
> > > };
> > > @@ -300,7 +302,7 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > >
> > > if (drv->data->genpd_names) {
> > > config.genpd_names = drv->data->genpd_names;
> > > - config.virt_devs = NULL;
> > > + config.virt_devs = &virt_devs;
> > > }
> > >
> > > if (config.supported_hw || config.genpd_names) {
> > > @@ -311,6 +313,23 @@ static int qcom_cpufreq_probe(struct platform_device *pdev)
> > > goto free_opp;
> > > }
> > > }
> > > +
> > > + if (virt_devs) {
> > > + const char * const *name = config.genpd_names;
> > > + int i;
> > > +
> > > + for (i = 0; *name; i++, name++) {
> > > + ret = pm_runtime_resume_and_get(virt_devs[i]);
> > > + if (ret) {
> > > + dev_err(cpu_dev, "failed to resume %s: %d\n",
> > > + *name, ret);
> > > + goto free_opp;
> > > + }
> >
> > Shouldn't we restore the usage count at ->remove() too?
> >
> > > +
> > > + /* Keep CPU power domain always-on */
> > > + dev_pm_syscore_device(virt_devs[i], true);
> >
> > Is this really correct? cpufreq is suspended/resumed by the PM core
> > during system wide suspend/resume. See dpm_suspend|resume(). Isn't
> > that sufficient?
> >
> > Moreover, it looks like the cpr genpd provider supports genpd's
> > ->power_on|off() callbacks. Is there something wrong with this, that I
> > am missing?
> >
>
> I think this question is a quite fundamental one. To explain this
> properly I will need to delve a bit into the implementation details of
> the two different GENPD providers that are applicable here:
>
> Fundamentally, we are describing the main power supply for the CPU here.
> Consider a simple regulator with adjustable voltage. From the Linux
> point of view this regulator should be marked as "regulator-always-on".
> If we would turn off this regulator, the CPU would be immediately dead
> without proper shutdown done by firmware or hardware.
>
> Representing the regulator as power domain does not change much, except
> that we now have abstract "performance states" instead of actual voltages.
> However, for power domains there is currently no generic mechanism like
> "regulator-always-on" in the DT, only drivers can specify
> GENPD_FLAG_ALWAYS_ON.

We have relied on genpd providers to act on their compatible strings
to make the correct configuration. If that isn't sufficient, I don't
see why we couldn't add a new DT property corresponding to
GENPD_FLAG_ALWAYS_ON.

>
> The special situation on MSM8909 is that there are two possible setups
> for the CPU power supply depending on the PMIC that is used (see
> "[PATCH 4/4] cpufreq: qcom-nvmem: Add MSM8909"): CPR or RPMPD. Both are
> GENPD providers so in theory we can just have either
>
> cpu@0 { power-domains = <&cpr>; }; // or
> cpu@0 { power-domains = <&rpmpd MSM8909_VDDCX_AO>; };
>
> in the DT, without handling this specifically on the cpufreq side.

Looks like it would be nice to get a patch for the MSM8909 DTS too, as
part of the series, to get a better picture of how this is going to be
used. Would that be possible for you to provide?

>
> The two GENPD providers behave quite differently though:
>
> - CPR: CPR is not really a power domain itself. It's more like a monitor
> on a power supply line coming from some other regulator. CPR provides
> suggestions how to adjust the voltage for best power/stability.
>
> The GENPD .power_off() disables the CPR state machine and forwards
> this to the regulator with regulator_disable(). On QCS404 the
> regulator is marked regulator-always-on, so it will never be disabled
> from Linux. The SAW/SPM hardware component on Qualcomm SoCs will
> usually disable the regulator during deep cpuidle states.

Parts of this sound a bit odd to me. The CPR/CPUfreq shouldn't really
need to vote for the CPU's power-rail(s) from a powered-on/off (CPU
idle states) point of view, but only from a performance (voltage
level) point of view.

If the enable/disable voting on the regulator really has an impact on
some platforms, it sounds like it could prevent deeper CPU idle states
too. That's probably not what we want, right?

I also had a look at the existing CPR genpd provider's probe
function/path (cpr_probe()) - and it turns out there is no call to
regulator_enable(). Whatever that means to us.

>
> - RPMPD: This is the generic driver for all the SoC power domains
> managed by the RPM firmware. It's not CPU-specific. However, as
> special feature each power domain is exposed twice in Linux, e.g.
> "MSM8909_VDDCX" and "MSM8909_VDDCX_AO". The _AO ("active-only")
> variant tells the RPM firmware that the performance/enable vote only
> applies when the CPU is active (not in deep cpuidle state).
>
> The GENPD .power_off() drops all performance state votes and also
> releases the "enable" vote for the power domain.
>
> Now, imagine what happens during system wide suspend/resume:
>
> - CPR: The CPR state machine gets disabled. The voltage stays as-is.
> - With "regulator-always-on": The CPU keeps running until WFI.
> - Without: I would expect the CPU is dead immediately(?)

As I indicated above, I am starting to feel that this is a bit upside
down. CPR/CPUfreq should vote on voltages to scale performance, but
not for cpu idle states.

Perhaps what is missing is a synchronization point or a notification,
to inform the CPR driver that its state machine (registers) needs to
be saved/restored, when the power-rails get turned on/off. In fact, we
have a couple mechanisms at hand to support this.

>
> - RPMPD: The performance/enable vote is dropped. The power domain might
> go to minimal voltage or even turn off completely. However, the CPU
> actually needs to keep running at the same frequency until WFI!
> Worst case, the CPU is dead immediately when the power domain votes
> get dropped.

Since RPMPD is managing the voting for both performance and low power
states for different kinds of devices, this certainly gets a bit more
complicated.

On the other hand, the CPUfreq driver should really only vote for
performance states for the CPUs and not for low power states. The
latter is a job for cpuidle and other consumers of the RPMPD to
manage, I think.

So, while thinking of this, I just realized that it may not always be
a good idea for genpd to cache a performance state request, for an
attached device and for which pm_runtime_suspended() returns true for
it. As this is the default behaviour in genpd, I am thinking that we
need a way to make that behaviour configurable for an attached device.
What do you think about that?

>
> In case of RPMPD, the votes must remain even during system wide suspend.
> The special _AO variant of the power domain tells the firmware to
> release the votes once the CPU has been shut down cleanly. It will also
> restore them once the CPU wakes up (long before the resume handlers run).
>
> My conclusion was that in both cases we want to keep the "power domain"
> enabled, since the CPU must keep running for a short while even after
> the system suspend handlers have been called.

It looks to me that the system wide suspend/resume case isn't really
much different from the runtime suspend/resume case.

It's not CPUfreq's role (by calling pm_runtime_resume_and_get(), for
example) to prevent the RPMPD from entering a low power state.

>
> Does this help with understanding the problem? It's a bit complicated. :D

Yes, I think so, thanks!

Although, I am afraid my response made this even more complicated. :-)

>
> Thanks!
> Stephan
>
> PS: This is essentially just another manifestation of a discussion we
> had a few times already over the years about where to enable power
> domains used by cpufreq, e.g. [1, 2, 3, 4]. Apparently I already
> mentioned back in 2021 already that QCS404 is broken [5] (I forgot
> about that :')).

Right, I recall these discussions now, thanks for the pointers.

Let's try to get to the bottom of this and figure out a proper solution.

>
> [1]: https://lore.kernel.org/linux-pm/YLi5N06Qs+gYHgYg@xxxxxxxxxxx/
> [2]: https://lore.kernel.org/linux-pm/20200826093328.88268-1-stephan@xxxxxxxxxxx/
> [3]: https://lore.kernel.org/linux-pm/20200730080146.25185-1-stephan@xxxxxxxxxxx/
> [4]: https://lore.kernel.org/linux-arm-msm/20200426123140.GA190483@xxxxxxxxxxx/
> [5]: https://lore.kernel.org/linux-pm/YLoTl7MfMfq2g10h@xxxxxxxxxxx/

Kind regards
Uffe