Re: [RFC][PATCH 1/3] cpuidle: Inject tick boundary state

From: Rafael J. Wysocki
Date: Mon Jul 31 2023 - 13:27:45 EST


On Mon, Jul 31, 2023 at 6:55 PM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
>
> On Mon, Jul 31, 2023 at 1:39 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Jul 31, 2023 at 12:35:20PM +0200, Rafael J. Wysocki wrote:
> >
> > > > So I agree with 1.
> > > >
> > > > I do not agree with 2. Disabling the tick is costly, doubly so with the
> > > > timer-pull thing, but even today. Simply disabling it because we picked
> > > > the deepest idle state, irrespective of the expected duration is wrong
> > > > as it will incur this significant cost.
> > > >
> > > > With 3 there is the question of how we get the expected sleep duration;
> > > > this is especially important with timer-pull, where we have this
> > > > chicken-and-egg thing.
> > > >
> > > > Notably: tick_nohz_get_sleep_length() wants to know if the tick gets
> > > > disabled
> > >
> > > Well, it shouldn't. Or at least it didn't before.
> >
> > Correct, this is new in the timer-pull thing.
> >
> > > It is expected to produce two values, one with the tick stopped (this
> > > is the return value of the function) and the other with the tick
> > > ticking (this is the one written under the address passed as the arg).
> > > This cannot depend on whether or not the tick will be stopped. Both
> > > are good to know.
> > >
> > > Now, I understand that getting these two values may be costly, so
> > > there is an incentive to avoid calling it, but then the governor needs
> > > to figure this out from its crystal ball and so care needs to be taken
> > > to limit the possible damage in case the crystal ball is not right.
> >
> > If we can get the governor to decide the tick state up-front we can
> > avoid a lot of the expensive parts.
>
> I claim that in the vast majority of cases this is the same as
> deciding the state.
>
> The case when it is not is when the target residency of the deepest
> idle state is below the tick period length and the governor is about
> to select that state. According to the data I've seen so far this is
> a tiny fraction of all the cases.
>
> > > > and cpuilde wants to use tick_nohz_get_sleep_length() to
> > > > determine if to disable the tick. This cycle needs to be broken for
> > > > timer-pull.
> > > >
> > > > Hence my proposal to introduce the extra tick state, that allows fixing
> > > > both 2 and 3.
> > >
> > > I'm not sure about 3 TBH.
> > >
> > > Say there are 2 idle states, one shallow (say its target residency is
> > > 10 us) and one deep (say its target residency is T = 2 * TICK_NSEC).
> >
> > This is the easy case and that actually 'works' today.
>
> But this is my case 3 which you said you didn't agree with. I don't
> see why it needs to be fixed.
>
> > The interesting case is where your deepest state has a target residency that
> > is below the tick (because for HZ=100, we have a 10ms tick and pretty
> > much all idle states are below that).
>
> What about HZ=1000, though?
>
> > In that case you cannot tell the difference between I'm good to use this
> > state and I'm good to disable the tick and still use this state.
>
> No, you don't, but is it really worth the fuss?
>
> The state is high-latency anyway and tick_nohz_get_sleep_length()
> needs to be called anyway in that case in order to determine if a
> shallower state wouldn't be better. At this point the statistics have
> already told the governor otherwise and a misprediction would be a
> double loss.
>
> So yes, you can gain performance by avoiding to call
> tick_nohz_get_sleep_length(), but then you can also lose it by
> selecting a state that is too deep (and that can be determined exactly
> with respect to timers).

BTW, note that teo records timers as "hits", because it has an idea
about when the next timer should occur and that's because it calls
tick_nohz_get_sleep_length().

If it doesn't call that function, it will not be able to tell the
difference between a non-tick timer and any other wakeup, so the
non-tick timer wakeups will then be recorded as "intercepts" which
will skew it towards shallow states over time. That's one of the
reasons why I would prefer to only avoid calling
tick_nohz_get_sleep_length() when the candidate idle state is really
shallow.