Re: Stopping the tick on a fully loaded system

From: Rafael J. Wysocki
Date: Wed Jul 26 2023 - 11:12:46 EST


On Wed, Jul 26, 2023 at 12:29 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> > On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
>
> > >                     100% load          50% load           25% load
> > >                     (top: ~2% idle)    (top: ~49% idle)   (top: ~74% idle;
> > >                                                            33 CPUs are completely idle)
> > >                     ---------------    ----------------   ----------------------------
> > > Idle Total           1658703    100%    3150522    100%    2377035    100%
> > > x >= 4ms                2504   0.15%          2   0.00%         53   0.00%
> > > 4ms > x >= 2ms           390   0.02%          0   0.00%       4563   0.19%
> > > 2ms > x >= 1ms            62   0.00%          1   0.00%         54   0.00%
> > > 1ms > x >= 500us          67   0.00%          6   0.00%          2   0.00%
> > > 500us > x >= 250us        93   0.01%         39   0.00%         11   0.00%
> > > 250us > x >= 100us       280   0.02%       1145   0.04%        633   0.03%
> > > 100us > x >= 50us        942   0.06%      30722   0.98%      13347   0.56%
> > > 50us > x >= 25us       26728   1.61%     310932   9.87%     106083   4.46%
> > > 25us > x >= 10us      825920  49.79%    2320683  73.66%    1722505  72.46%
> > > 10us > x > 5us        795197  47.94%     442991  14.06%     506008  21.29%
> > > 5us > x                 6520   0.39%      43994   1.40%      23645   0.99%
> > >
> > >
> > > 99% of the tick stops are followed by an idle period shorter than 50us
> > > (50us is 1.25% of a tick length).
> >
> > Well, this just means that the governor predicts overly long idle
> > durations quite often under this workload.
> >
> > The governor's decision on whether or not to stop the tick is based on
> > its idle duration prediction. If it overshoots, that's how it goes.
>
> This is abysmal; IIRC TEO tracks a density function in C state buckets
> and if it finds it's more likely to be shorter than 'predicted' by the
> timer it should pick something shallower.
>
> Given we have this density function, picking something that's <1% likely
> is insane. In fact, it seems to suggest the whole pick-alternative thing
> is utterly broken.
>
> > > This is also the reason for my opinion, that the return of
> > > tick_nohz_next_event() is completely irrelevant in a (fully) loaded case:
> >
> > It is an upper bound and in a fully loaded case it may be way off.
>
> But given we have our density function, we should be able to do much
> better.
>
>
> Oooh,... I think I see the problem. Our bins are strictly the available
> C-states, but if you run this on a Zen3 that has ACPI-idle, then you end
> up with something that only has 3 C states, like:
>
> $ for i in state*/residency ; do echo -n "${i}: "; cat $i; done
> state0/residency: 0
> state1/residency: 2
> state2/residency: 36
>
> Which means we only have buckets: (0,0], (0,2000], (2000,36000] or somesuch. All
> of them very much smaller than TICK_NSEC.
>
> That means we don't track nearly enough data to reliably tell anything
> about disabling the tick or not. We should have at least one bucket
> beyond TICK_NSEC for this.

Quite likely.
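
To make the bucket problem concrete, here is a rough sketch. It is not the
actual teo code. It assumes one bucket per idle state, each starting at that
state's residency, and HZ=250, which is what the "50us is 1.25% of a tick"
remark above implies. The function names are made up for illustration.

```python
# Sketch only: models the bucket edges described above, not the real governor.
# Assumptions: HZ=250 (4 ms tick); residencies are in microseconds, as in sysfs.

NSEC_PER_USEC = 1_000
HZ = 250                          # assumed from the 4 ms tick length in the thread
TICK_NSEC = 1_000_000_000 // HZ   # 4_000_000 ns


def bucket_bounds_ns(residencies_us):
    """Lower edge (in ns) of each bucket, one bucket per idle state."""
    return [r * NSEC_PER_USEC for r in residencies_us]


def bucket_index(bounds_ns, idle_ns):
    """Index of the deepest bucket whose lower edge does not exceed idle_ns."""
    idx = 0
    for i, lower in enumerate(bounds_ns):
        if idle_ns >= lower:
            idx = i
    return idx


def has_bucket_beyond_tick(bounds_ns, tick_ns=TICK_NSEC):
    """True if at least one bucket starts at or beyond the tick period."""
    return any(lower >= tick_ns for lower in bounds_ns)


# The three ACPI-idle states from the Zen3 example: 0, 2 and 36 us.
zen3 = bucket_bounds_ns([0, 2, 36])

print(zen3)
print(bucket_index(zen3, 50_000), bucket_index(zen3, 10 * TICK_NSEC))
print(has_bucket_beyond_tick(zen3))
```

With these edges a 50us idle period and a 40ms one are counted in the same
(last) bucket, so the recorded history cannot tell them apart. Appending one
more edge at TICK_NSEC would be enough for this model to separate them.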

> Hmm.. it is getting very late, but how about I get the cpuidle framework
> to pad the drv states with a few 'disabled' C states so that we have at
> least enough data to cross the TICK_NSEC boundary and say something
> usable about things.
>
> Because as things stand, it's very likely we determine @stop_tick purely
> based on what tick_nohz_get_sleep_length() tells us, not on what we've
> learnt from recent history.
>
>
> (FWIW intel_idle seems to not have an entry for Tigerlake !?! -- my poor
> laptop, it feels neglected)

It should then use ACPI _CST idle states.