Re: Stopping the tick on a fully loaded system

From: Peter Zijlstra
Date: Tue Jul 25 2023 - 18:54:29 EST


On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen

> >                     100% load           50% load            25% load
> >                     (top: ~2% idle)     (top: ~49% idle)    (top: ~74% idle;
> >                                                              33 CPUs are completely idle)
> >                     ---------------     ----------------    ----------------------------
> > Idle Total           1658703     100%    3150522     100%    2377035     100%
> > x >= 4ms                2504    0.15%          2    0.00%         53    0.00%
> > 4ms > x >= 2ms           390    0.02%          0    0.00%       4563    0.19%
> > 2ms > x >= 1ms            62    0.00%          1    0.00%         54    0.00%
> > 1ms > x >= 500us          67    0.00%          6    0.00%          2    0.00%
> > 500us > x >= 250us        93    0.01%         39    0.00%         11    0.00%
> > 250us > x >= 100us       280    0.02%       1145    0.04%        633    0.03%
> > 100us > x >= 50us        942    0.06%      30722    0.98%      13347    0.56%
> > 50us > x >= 25us       26728    1.61%     310932    9.87%     106083    4.46%
> > 25us > x >= 10us      825920   49.79%    2320683   73.66%    1722505   72.46%
> > 10us > x > 5us        795197   47.94%     442991   14.06%     506008   21.29%
> > 5us > x                 6520    0.39%      43994    1.40%      23645    0.99%
> >
> >
> > 99% of the tick stops only have an idle period shorter than 50us (50us is
> > 1.25% of a tick length).
>
> Well, this just means that the governor predicts overly long idle
> durations quite often under this workload.
>
> The governor's decision on whether or not to stop the tick is based on
> its idle duration prediction. If it overshoots, that's how it goes.

This is abysmal; IIRC TEO tracks a density function in C state buckets
and if it finds it's more likely to be shorter than 'predicted' by the
timer it should pick something shallower.

Given we have this density function, picking something that's <1% likely
is insane. In fact, it seems to suggest the whole pick-alternative thing
is utterly broken.
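
A quick sanity check of that "<1%" figure, using only the counts from the
table quoted above. This is a rough sketch, not anything the governor does.
Assumptions: the tick is 4ms (HZ=250, which is what "50us is 1.25% of a tick
length" implies), and the "x >= 4ms" row is the set of idle periods that
lasted at least one full tick. The function name is made up for illustration.

```python
# Counts copied from the quoted table: (idle periods >= 4ms, "Idle Total").
# Assumption: 4ms tick (HZ=250), implied by "50us is 1.25% of a tick length".
counts = {
    "100% load": (2504, 1658703),
    "50% load": (2, 3150522),
    "25% load": (53, 2377035),
}


def long_idle_share(at_least_tick, total):
    """Percentage of counted idle periods that lasted at least one tick."""
    return 100.0 * at_least_tick / total


shares = {load: long_idle_share(*c) for load, c in counts.items()}
for load, pct in shares.items():
    print(f"{load}: {pct:.2f}% of idle periods >= 4ms")
```

In all three load cases the share stays well below 1%.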

> > This is also the reason for my opinion, that the return of
> > tick_nohz_next_event() is completely irrelevant in a (fully) loaded case:
>
> It is an upper bound and in a fully loaded case it may be way off.

But given we have our density function, we should be able to do much
better.


Oooh,... I think I see the problem. Our bins are strictly the available
C-states, but if you run this on a Zen3 that has ACPI-idle, then you end
up with something that only has 3 C-states, like:

$ for i in state*/residency ; do echo -n "${i}: "; cat $i; done
state0/residency: 0
state1/residency: 2
state2/residency: 36

Which means we only have buckets (in ns): (0,0], (0,2000], (2000,36000] or
somesuch. All of them very much smaller than TICK_NSEC.

That means we don't track nearly enough data to reliably tell anything
about disabling the tick or not. We should have at least one bucket
beyond TICK_NSEC for this.
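
To make that concrete, here is a toy model of the binning; a sketch only,
not the TEO code. Assumptions: HZ=250 (4ms tick), each bin starts at a
state's residency, and the last bin is open-ended (the bucket edges above
are only approximate). bin_index() is a made-up helper.

```python
from bisect import bisect_right

NSEC_PER_SEC = 1_000_000_000
HZ = 250                          # assumption: 4ms tick
TICK_NSEC = NSEC_PER_SEC // HZ

# Residencies in us, as printed by the sysfs loop above.
residency_us = [0, 2, 36]
bin_lower_ns = [r * 1000 for r in residency_us]


def bin_index(idle_ns):
    """Index of the deepest state whose residency is <= idle_ns."""
    return bisect_right(bin_lower_ns, idle_ns) - 1


# A 50us idle period and a 10ms idle period are indistinguishable here:
# both are simply "at least 36us".
short_bin = bin_index(50_000)
long_bin = bin_index(10_000_000)
print(short_bin, long_bin, max(bin_lower_ns) < TICK_NSEC)
```

Under this model every bin boundary sits below TICK_NSEC, so the recorded
history cannot separate "slightly longer than 36us" from "longer than a
tick".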

Hmm.. it is getting very late, but how about I get the cpuidle framework
to pad the drv states with a few 'disabled' C states so that we have at
least enough data to cross the TICK_NSEC boundary and say something
usable about things?
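
Roughly what I have in mind, as a toy model rather than a patch. Everything
here is hypothetical: pad_bins(), the growth factor of 4, and the HZ=250
tick are assumptions for illustration. The padded entries stand in for
'disabled' states that are never selected and only exist so that the
statistics get extra bins.

```python
from bisect import bisect_right

TICK_NSEC = 4_000_000             # assumption: HZ=250


def pad_bins(bin_lower_ns, tick_ns, factor=4):
    """Append placeholder lower bounds until the last one reaches tick_ns.

    The growth factor is a hypothetical choice, not a kernel parameter.
    """
    if factor <= 1:
        raise ValueError("factor must be > 1")
    padded = list(bin_lower_ns)
    bound = max(padded[-1], 1000) if padded else 1000
    while not padded or padded[-1] < tick_ns:
        bound = min(bound * factor, tick_ns)
        padded.append(bound)
    return padded


def bin_index(bins, idle_ns):
    """Index of the last bin whose lower bound is <= idle_ns."""
    return bisect_right(bins, idle_ns) - 1


# Lower bounds in ns for the three ACPI-idle states shown above.
padded = pad_bins([0, 2000, 36000], TICK_NSEC)
print(padded)
print(bin_index(padded, 50_000), bin_index(padded, 10_000_000))
```

With the padding in place, an idle period longer than a tick lands in its
own bin, so the history has something to say about whether stopping the
tick was worthwhile.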

Because as things stand, it's very likely we determine @stop_tick purely
based on what tick_nohz_get_sleep_length() tells us, not on what we've
learnt from recent history.


(FWIW intel_idle seems to not have an entry for Tigerlake !?! -- my poor
laptop, it feels neglected)