Re: Stopping the tick on a fully loaded system

From: Rafael J. Wysocki
Date: Wed Jul 26 2023 - 11:54:03 EST


On Wed, Jul 26, 2023 at 5:10 PM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
>
> On Wed, Jul 26, 2023 at 12:29 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
> >
> > > >                     100% load         50% load          25% load
> > > >                     (top: ~2% idle)   (top: ~49% idle)  (top: ~74% idle;
> > > >                                                         33 CPUs are completely idle)
> > > >                     ---------------   ----------------  ----------------------------
> > > > Idle Total          1658703    100%   3150522    100%   2377035    100%
> > > > x >= 4ms               2504   0.15%         2   0.00%        53   0.00%
> > > > 4ms > x >= 2ms          390   0.02%         0   0.00%      4563   0.19%
> > > > 2ms > x >= 1ms           62   0.00%         1   0.00%        54   0.00%
> > > > 1ms > x >= 500us         67   0.00%         6   0.00%         2   0.00%
> > > > 500us > x >= 250us       93   0.01%        39   0.00%        11   0.00%
> > > > 250us > x >= 100us      280   0.02%      1145   0.04%       633   0.03%
> > > > 100us > x >= 50us       942   0.06%     30722   0.98%     13347   0.56%
> > > > 50us > x >= 25us      26728   1.61%    310932   9.87%    106083   4.46%
> > > > 25us > x >= 10us     825920  49.79%   2320683  73.66%   1722505  72.46%
> > > > 10us > x > 5us       795197  47.94%    442991  14.06%    506008  21.29%
> > > > 5us > x                 6520   0.39%     43994   1.40%     23645   0.99%
> > > >
> > > >
> > > > In 99% of the tick stops, the idle period is shorter than 50us (50us is
> > > > 1.25% of a tick length).
> > >
> > > Well, this just means that the governor predicts overly long idle
> > > durations quite often under this workload.
> > >
> > > The governor's decision on whether or not to stop the tick is based on
> > > its idle duration prediction. If it overshoots, that's how it goes.
> >
> > This is abysmal; IIRC TEO tracks a density function in C state buckets
> > and if it finds it's more likely to be shorter than 'predicted' by the
> > timer it should pick something shallower.
> >
> > Given we have this density function, picking something that's <1% likely
> > is insane. In fact, it seems to suggest the whole pick-alternative thing
> > is utterly broken.
> >
> > > > This is also the reason for my opinion, that the return of
> > > > tick_nohz_next_event() is completely irrelevant in a (fully) loaded case:
> > >
> > > It is an upper bound and in a fully loaded case it may be way off.
> >
> > But given we have our density function, we should be able to do much
> > better.
> >
> >
> > Oooh,... I think I see the problem. Our bins are strictly the available
> > C-states, but if you run this on a Zen3 that has ACPI-idle, then you end
> > up with something that only has 3 C states, like:
> >
> > $ for i in state*/residency ; do echo -n "${i}: "; cat $i; done
> > state0/residency: 0
> > state1/residency: 2
> > state2/residency: 36
> >
> > Which means we only have buckets: (0,0], (0,2000], (2000,36000] or somesuch. All
> > of them very much smaller than TICK_NSEC.
> >
> > That means we don't track nearly enough data to reliably tell anything
> > about disabling the tick or not. We should have at least one bucket
> > beyond TICK_NSEC for this.
>
> Quite likely.

So the reasoning here was that those additional bins would not be
necessary for idle state selection. However, if the target residency
values of all the idle states are relatively short, the problem of
whether or not to stop the tick is largely separate from the idle state
selection problem. It should therefore be addressed separately, which
currently it is not. Admittedly, this is a mistake.
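
To illustrate the point, here is a minimal userspace sketch, not governor
code. It assumes a 4 ms tick (HZ=250, consistent with the "50us is 1.25%
of a tick" figure above) and that each bin starts at a state's target
residency, using the 0/2/36 us values quoted above. The names bin_index(),
struct idle_hist, hist_update() and should_stop_tick(), the extra bin and
the "more than half" threshold are all hypothetical. With the state bins
alone, a 40 us idle period and a 5 ms one land in the same bin; with one
extra bin starting at TICK_NSEC they can be told apart, which is what a
tick-stopping decision would need.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: HZ=250, i.e. a 4 ms tick. */
#define TICK_NSEC 4000000ULL

/* Bin lower bounds in ns, from the quoted residency values (0, 2, 36 us). */
static const uint64_t state_bins[] = { 0, 2000, 36000 };

/* Hypothetical: the same bins plus one extra bin starting at the tick length. */
static const uint64_t tick_bins[] = { 0, 2000, 36000, TICK_NSEC };

#define NR_BINS(b) (sizeof(b) / sizeof((b)[0]))

/* Index of the last bin whose lower bound is <= idle_ns. */
static size_t bin_index(const uint64_t *bounds, size_t nr, uint64_t idle_ns)
{
	size_t i, idx = 0;

	for (i = 1; i < nr; i++)
		if (idle_ns >= bounds[i])
			idx = i;
	return idx;
}

/* Hypothetical history: how many past idle periods fell into each bin. */
struct idle_hist {
	unsigned int hits[NR_BINS(tick_bins)];
};

/* Record one observed idle period in the tick-aware bins. */
static void hist_update(struct idle_hist *h, uint64_t idle_ns)
{
	h->hits[bin_index(tick_bins, NR_BINS(tick_bins), idle_ns)]++;
}

/*
 * Sketch policy (an assumption, not what any governor does): stop the tick
 * only if more than half of the recorded idle periods outlasted a tick.
 */
static int should_stop_tick(const struct idle_hist *h)
{
	unsigned int total = 0, beyond = h->hits[NR_BINS(tick_bins) - 1];
	size_t i;

	for (i = 0; i < NR_BINS(tick_bins); i++)
		total += h->hits[i];
	return total && 2 * beyond > total;
}
```

Feeding this the 100% load distribution above (nearly all idle periods
below 50 us) would leave the last bin almost empty, so should_stop_tick()
would return 0 regardless of what the next timer event suggests.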