Re: [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management

From: Rafael J. Wysocki
Date: Mon Jul 31 2023 - 13:21:24 EST

Next message: Raghavendra Rao Ananta: "Re: [PATCH v7 01/12] KVM: Rename kvm_arch_flush_remote_tlb() to kvm_arch_flush_remote_tlbs()"
Previous message: Jesse Brandeburg: "Re: [net-next PATCH V2 1/2] octeontx2-af: Code restructure to handle TC outer VLAN offload"
In reply to: Peter Zijlstra: "Re: [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management"
Next in thread: Peter Zijlstra: "[RFC][PATCH 3/3] cpuidle,teo: Improve state selection"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Jul 31, 2023 at 2:02 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jul 31, 2023 at 12:17:27PM +0200, Rafael J. Wysocki wrote:
>
> > Something really simple like:
> >
> > 1. Check sched_cpu_util() (which is done by teo anyway).
> > 2. If that is around 90% of the maximum CPU capacity, select the first
> > non-polling idle state and be done (don't stop the tick as my other
> > replay earlier today).
>
> So I really don't like using cpu_util() here, yes, 90% is a high number,
> but it doesn't say *anything* about the idle duration. Remember, this is
> a 32ms window, so 90% of that is 28.8ms.

It doesn't have to say anything about the idle duration as long as it
says something about the governor's "accuracy".

If it is observed experimentally that the governor is generally likely
to mispredict a deeper state if CPu utilization is about a certain
threshold, then it makes sense to use this information to counter
that. That's how it is used today.

And yes, you are right about the most immediate idle duration, but
overall the rule that if the CPU utilization is high, then selecting
deep idle states is not a good idea in general does seem to hold.

> (not entirely accurate, since it's an exponential average, but that
> doesn't change the overal argument, only some of the particulars)
>
> That is, 90% util, at best, says there is no idle longer than 3.2 ms.
> But that is still vastly longer than pretty much all residencies. Heck,
> that is still 3 ticks worth of HZ=1000 ticks. So 90% util should not
> preclude disabling the tick (at HZ=1000).
>
> Now, typically this won't be the case, and at 90% you'll have lots of
> small idles adding up to 3.2ms total idle. But the point is, you can't
> tell the difference. And as such util is a horrible measure to use for
> cpuidle.

No it is not IMO, because idle is all about the combined outcome of
multiple cycles.

> > > If we track the tick+ bucket -- as
> > > we must in order to say anything useful about it, then we can decide the
> > > tick state before (as I do here) calling sleep_length().
> > >
> > > The timer-pull rework from Anna-Maria unfortunately makes the
> > > tick_nohz_get_sleep_length() thing excessively expensive and it really
> > > doesn't make sense to call it when we retain the tick.
> > >
> > > It's all a bit of a chicken-egg situation, cpuidle wants to know when
> > > the next timer is, but telling when that is, wants to know if the tick
> > > stays. We need to break that somehow -- I propose by not calling it when
> > > we know we'll keep the tick.
> >
> > By selecting a state whose target residency will not be met, we lose
> > on both energy and performance, so doing this really should be
> > avoided, unless the state is really shallow in which case there may be
> > no time for making this consideration.
>
> I'm not sure how that relates to what I propose above. By adding the
> tick+ bucket we have more historical information as related to the tick
> boundary, how does that make us select states we won't match residency
> for?

As stated in my last reply, the only case in which it makes a
difference is when the deepest idle state's target residency is below
the tick and I'm not really sure if that difference is demonstrably
significant.

So I would do the following to start with:

1. Rearrange the teo code so that it considers all of the bins every
time without calling tick_nohz_get_sleep_length().

2. The sched_cpu_util() check will still be applied to the resulting
candidate state as it is now.

3. If it finds that the candidate state is shallow enough (for
instance, it is a polling state or the first non-polling one), it will
return this state without calling tick_nohz_get_sleep_length() and
stopping the tick.

4. Otherwise it will call tick_nohz_get_sleep_length() to see what
about timers and refine the selection (towards the shallower states)
if need be.

5. If the candidate state is not the deepest one and its target
residency is below the tick, it will be returned and the tick will not
be stopped.

6. Otherwise, the candidate state will be returned and the tick will be stopped.

If this still doesn't get us where we want to be, the extra bin can be
added (as long as it makes a measurable difference).

Next message: Raghavendra Rao Ananta: "Re: [PATCH v7 01/12] KVM: Rename kvm_arch_flush_remote_tlb() to kvm_arch_flush_remote_tlbs()"
Previous message: Jesse Brandeburg: "Re: [net-next PATCH V2 1/2] octeontx2-af: Code restructure to handle TC outer VLAN offload"
In reply to: Peter Zijlstra: "Re: [RFC][PATCH 2/3] cpuidle,teo: Improve NOHZ management"
Next in thread: Peter Zijlstra: "[RFC][PATCH 3/3] cpuidle,teo: Improve state selection"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]