Re: [PATCH v6 2/2] cpuidle: teo: Introduce util-awareness

From: Qais Yousef
Date: Tue Jul 18 2023 - 08:45:30 EST


On 07/18/23 11:23, Lukasz Luba wrote:
>
>
> On 7/17/23 19:21, Qais Yousef wrote:
> > +CC Vincent and Peter
> >
> > On 07/17/23 14:47, Lukasz Luba wrote:
> > > Hi Qais,
> > >
> > > The rule is 'one size doesn't fit all', please see below.
> > >
> > > On 7/11/23 18:58, Qais Yousef wrote:
> > > > Hi Kajetan
> > > >
> > > > On 01/05/23 14:51, Kajetan Puchalski wrote:
> > > >
> > > > [...]
> > > >
> > > > > @@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
> > > > > struct cpuidle_device *dev)
> > > > > {
> > > > > struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> > > > > + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
> > > > > int i;
> > > > > memset(cpu_data, 0, sizeof(*cpu_data));
> > > > > + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
> > > >
> > > > Given that utilization is invariant, why do we set the threshold based on
> > > > cpu capacity?
> > >
> > >
> > > To treat CPUs differently, not with the same policy.
> > >
> > >
> > > >
> > > > I'm not sure if this is a problem, but on little cores this threshold
> > > > would be too low. Given that util is invariant - I wondered if we need
> > > > a single threshold for all types of CPUs instead. Have you tried
> > > > something like that?
> > >
> > > A single threshold for all CPUs might be biased towards some CPUs. Let's
> > > pick the value 15 - which was tested to work really well in benchmarks
> > > for the big CPUs. On the other hand, when you set that value for little
> > > CPUs with max_capacity = 124, then you have a 15/124 ~= 12% threshold.
> > > That means you would prefer to enter a deeper idle state ~9x more readily
> > > (at max freq). What if the Little's freq is set to e.g. < ~20% fmax,
> > > which corresponds to capacity < ~25? Let's try to simulate such a scenario.
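
As a sanity check, here is a minimal sketch of that arithmetic on my side,
assuming UTIL_THRESHOLD_SHIFT == 6 (which I believe is what the patch uses,
i.e. threshold = capacity / 64); the capacity values are just illustrative:

#include <stdio.h>

#define UTIL_THRESHOLD_SHIFT	6	/* assumed: threshold = capacity / 64 */

int main(void)
{
	unsigned long caps[] = { 1024, 124 };	/* illustrative big/little capacities */

	for (int i = 0; i < 2; i++) {
		unsigned long cap = caps[i];
		unsigned long thr = cap >> UTIL_THRESHOLD_SHIFT;

		/* per-CPU threshold and its share of that CPU's own capacity */
		printf("capacity %4lu -> util_threshold %2lu (%.1f%% of capacity)\n",
		       cap, thr, 100.0 * thr / cap);
	}

	/*
	 * A single absolute threshold of 15 instead would allow deep idle up
	 * to 15/1024 ~= 1.5% of capacity on the big, but up to 15/124 ~= 12%
	 * on the little - roughly the 8-9x difference discussed above.
	 */
	printf("single threshold 15: big %.1f%%, little %.1f%%\n",
	       100.0 * 15 / 1024, 100.0 * 15 / 124);

	return 0;
}
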
> >
> > Hmm, what I'm struggling with is that PELT is invariant. So the time it takes
> > to rise and decay to a threshold of 15 should be the same for all CPUs, no?
>
> Yes, but that's not an issue. The idea is to set the threshold to a level
> which corresponds to a long enough sleep time (in wall-clock). Then you
> don't waste energy turning the core on/off too often.
>
> >
> > >
> > > In a situation where we have utilization 14 on a Little CPU, the CPU
> > > capacity (effectively frequency) vote based on utilization would be
> > > 1.2 * 14 = ~17, so let's pick the OPP corresponding to capacity 17.
> > > Under such conditions the little CPU would run the 14-util periodic task
> > > for 14/17 = ~82% of wall-clock time. That's a lot, and not suited to
> > > entering a deeper idle state on that CPU, is it?
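
A quick sketch of that busy-fraction arithmetic (purely illustrative; the 1.2
margin and the capacity values are taken from the example above, not from the
kernel):

#include <stdio.h>

/*
 * Fraction of wall-clock time a periodic task of the given (invariant)
 * utilization keeps the CPU busy when running at an OPP of capacity opp_cap.
 */
static double busy_fraction(double util, double opp_cap)
{
	return util / opp_cap;
}

int main(void)
{
	double util = 14.0;			/* the 14-util task from the example */
	double opp_cap = 1.2 * util;		/* ~17: OPP picked with a 1.2 margin */
	double caps[] = { opp_cap, 25.0, 124.0 };	/* chosen OPP, ~20% fmax, fmax */

	for (int i = 0; i < 3; i++)
		printf("OPP capacity %5.1f -> busy ~%2.0f%% of wall-clock time\n",
		       caps[i], 100.0 * busy_fraction(util, caps[i]));

	return 0;
}

So at the chosen OPP the CPU is busy ~83% of the time, and even at ~20% fmax
(capacity ~25) it is still busy more than half the time.
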
> >
> > Yes, the runtime is stretched. But we counter this at the utilization level
> > by making PELT invariant. I thought any CPU in the system would now take the
> > same amount of time to ramp up and decay to the same util level. No?
>
> The stretched runtime is what we want to derive from the rq util
> information, mostly after a task migrates to some next CPU.
>
> >
> > But maybe what you're saying is that we don't need to take the invariance into
> > account?
>
> Invariance is OK, since it is the common ground when we migrate those tasks
> across cores, and we can still infer the CPU sleep time based on util.
>
> >
> > My concern (not backed by a real problem yet) is that the threshold is near
> > 0, and since PELT is invariant, the time to gain a few points is constant
> > irrespective of CPU/capacity/freq; this means the little CPUs have to be
> > absolutely idle with almost no activity at all, IIUC.
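
To put rough numbers on that: a back-of-the-envelope on my side, using the
continuous approximation of PELT (32 ms half-life, 1024 ceiling) rather than
the exact per-period sums, for the thresholds discussed in this thread:

#include <math.h>
#include <stdio.h>

/*
 * Continuous approximation of PELT with a 32 ms half-life:
 *   while running:  util(t) = 1024 * (1 - 2^(-t/32))
 *   while idle:     util(t) = util(0) * 2^(-t/32)
 */
static double ms_running_to_reach(double util)	/* climb from 0 to util */
{
	return -32.0 * log2(1.0 - util / 1024.0);
}

static double ms_idle_to_decay_to(double util)	/* fall from the 1024 ceiling to util */
{
	return 32.0 * log2(1024.0 / util);
}

int main(void)
{
	double thresholds[] = { 2.0, 15.0 };	/* little vs big thresholds discussed here */

	for (int i = 0; i < 2; i++)
		printf("threshold %2.0f: ~%.2f ms of runtime to cross it, "
		       "~%.0f ms of idling (from the ceiling) to drop below it\n",
		       thresholds[i], ms_running_to_reach(thresholds[i]),
		       ms_idle_to_decay_to(thresholds[i]));

	return 0;
}

So well under a millisecond of invariant runtime is enough to cross either
threshold, which is why I think the littles effectively have to be almost
completely quiescent.
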
>
> As I said, that is good for those Little CPUs in terms of better latency
> for the tasks they serve, and it also doesn't harm the energy. The
> deeper idle state for those tiny-silicon-area cores (built from low-power
> cells) doesn't bring much on average for real workloads.
> This is the trade-off: you don't want to sacrifice the latency factor;
> you would rather pay a bit more energy by not entering deep idle
> on the Little cores, but get better latency there.

I can follow this argument much better than the time stretch one for sure.

I sort of agree, but I'm still cautious that this is too aggressive. I guess
time will tell how suitable this is, or we'll need to evolve it :-)

FWIW, these patches are now in GKI, so we should get complaints if it is
a real problem.

> For the big cores, which occupy a bigger silicon area and are made from
> High-Performance (HP) cells (with a low V-threshold) that leak more, the
> trade-off is set differently. That's why a similarly small-util task on a
> big core might face larger latency due to the need to wake the CPU up
> from a deeper idle state.
>
> >
> > >
> > > Apart from that, the little CPUs are tiny in terms of silicon area
> > > and are less leaky in WFI than big cores. Therefore, they don't need
> > > aggressive entry into deeper idle states. At the same time, they
> > > are often used for serving interrupts, where latency is an important
> > > factor.
> >
> > On Pixel 6 this threshold translates to a util_threshold of 2, which looks
> > too low to me. Can't TEO do a good job before we reach such an extremely
> > low level of activity?
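
(For the record, with UTIL_THRESHOLD_SHIFT == 6 a util_threshold of 2 implies
arch_scale_cpu_capacity() somewhere in the 128-191 range for these CPUs, so
the cutoff sits at roughly 1-1.6% of the little's capacity - just my own
arithmetic, assuming that shift value.)
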
>
> Kajetan introduced a new trace event, 'cpu_idle_miss', which helped us
> decide how to set the trade-off for the different thresholds. It turned
> out that we would rather mispredict and miss an opportunity to enter
> deeper idle; the opposite was more harmful. You can enable that trace
> event on your platform and analyze your use cases.

Okay, thanks for the info.


Cheers

--
Qais Yousef