Re: [PATCH v2 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting

From: Rafael J. Wysocki
Date: Fri Oct 04 2019 - 05:17:55 EST


On Fri, Oct 4, 2019 at 10:52 AM Giovanni Gherdovich <ggherdovich@xxxxxxx> wrote:
>
> On Fri, 2019-10-04 at 10:29 +0200, Rafael J. Wysocki wrote:
> > On Fri, Oct 4, 2019 at 10:24 AM Giovanni Gherdovich <ggherdovich@xxxxxxx> wrote:
> > >
> > > On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote:
> > > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote:
> > > > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich
> > > > > wrote:
> > > > > > From: Srinivas Pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>
> > > > > >
> > > > > > intel_pstate has two operating modes: active and passive. In "active"
> > > > > > mode, the in-built scaling governor is used and in "passive" mode, the
> > > > > > driver can be used with any governor like "schedutil". In "active" mode
> > > > > > the utilization values from schedutil is not used and there is a
> > > > > > requirement from high performance computing use cases, not to readas
> > > > > > well any APERF/MPERF MSRs.
> > > > >
> > > > > Well, this isn't quite convincing.
> > > > >
> > > > > In particular, I don't see why the "don't read APERF/MPERF MSRs" argument
> > > > > applies *only* to intel_pstate in the "active" mode. What about
> > > > > intel_pstate in the "passive" mode combined with the "performance"
> > > > > governor? Or any other governor different from "schedutil" for that
> > > > > matter?
> > > > >
> > > > > And what about acpi_cpufreq combined with any governor different from
> > > > > "schedutil"?
> > > > >
> > > > > Scale invariance is not really needed in all of those cases right now
> > > > > AFAICS, or is it?
> > > >
> > > > Correct. This is just part of the patch to disable in active mode
> > > > (particularly in HWP and performance mode).
> > > >
> > > > But this patch is 2 years old. The folks who wanted this, disable
> > > > intel-pstate and use userspace governor with acpi-cpufreq. So may be
> > > > better to address those cases too.
> > >
> > > I disagree with "scale invariance is needed only by the schedutil governor";
> > > the two other users are the CPU's estimated utilization in the wakeup path,
> > > via cpu_util_without(), as well as the load-balance path, via cpu_util() which
> > > is used by update_sg_lb_stats().
> >
> > OK, so there are reasons to run the scale invariance code which are
> > not related to the cpufreq governor in use.
> >
> > I wonder then why those reasons are not relevant for intel_pstate in
> > the "active" mode.
> >
> > > Also remember that scale invariance is applied to both PELT signals util_avg
> > > and load_avg; schedutil uses the former but not the latter.
> > >
> > > I understand Srinivas patch to disable MSR accesses during the tick as a
> > > band-aid solution to address a specific use case he cares about, but I don't
> > > think that extending this approach to any non-schedutil governor is a good
> > > idea -- you'd be killing load balancing in the process.
> >
> > But that is also the case for intel_pstate in the "active" mode, isn't it?
>
> Sure it is.
>
> Now, what's the performance impact of loosing scale-invariance in PELT signals?

That needs to be measured.

> And what's the performance impact of accessing two MSRs at the scheduler tick
> on each CPU?

That would be the MSR access latency times two and I don't remember
the exact numbers from the top of my head. It would also depend on
how much time it takes to run the tick without those two MSR accesses,
on average.

The question I have, however, is whether or not it really is necessary
to update arch_cpu_freq on every tick. Maybe it would be sufficient
to do that every 10 ms, say (in case the tick is more frequent than
that), or similar?

> I am sporting Srinivas' patch because he expressed the concern that the losses
> don't justify the gains for a specific class of users (supercomputing),
> although I don't fully like the idea (and arguably that should be measured).

My point is that this patch doesn't even cover the entire case in
question, because the HPC people may very well be using cpufreq
driver/governor configurations different from intel_pstate in the
"active" mode. Moreover, given the lack of data, it is even hard to
say what the potential impact is, if any.

I guess it should be a fairly straightforward exercise to compare the
results of the various benchmarks using intel_pstate in the "passive"
mode and the "performance" governor with and without patch [1/2] from
this series. If you see any perf regressions after applying the patch,
that's what the others will probably see as well.