Re: [RFC][PATCH 0/3] cpuidle,teo: Improve TEO tick decisions

From: Rafael J. Wysocki
Date: Mon Jul 31 2023 - 13:51:29 EST


On Mon, Jul 31, 2023 at 1:47 PM Anna-Maria Behnsen
<anna-maria@xxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On Fri, 28 Jul 2023, Peter Zijlstra wrote:
>
> > Hi,
> >
> > Wanted to send this yesterday, but my home server died and took everything
> > down :/
> >
> > These patches are lightly tested but appear to behave as expected.
> >
> >
>
> As I was asked to see if the patches of Raphael improve the behavior, I
> rerun the tests with Raphaels v2 as well as with Peters RFC patchset. Here
> are the results (compared to upstream):

Thanks for the numbers!

> upstream raphael v2 peter RFC
>
> Idle Total 2533 100.00% 1183 100.00% 5563 100.00%
> x >= 4ms 1458 57.56% 1151 97.30% 3385 60.85%
> 4ms> x >= 2ms 91 3.59% 12 1.01% 133 2.39%
> 2ms > x >= 1ms 56 2.21% 3 0.25% 80 1.44%
> 1ms > x >= 500us 64 2.53% 1 0.08% 98 1.76%
> 500us > x >= 250us 73 2.88% 0 0.00% 113 2.03%
> 250us > x >=100us 76 3.00% 2 0.17% 106 1.91%
> 100us > x >= 50us 33 1.30% 4 0.34% 75 1.35%
> 50us > x >= 25us 39 1.54% 4 0.34% 152 2.73%
> 25us > x >= 10us 199 7.86% 4 0.34% 404 7.26%
> 10us > x > 5us 156 6.16% 0 0.00% 477 8.57%
> 5us > x 288 11.37% 2 0.17% 540 9.71%
>
> tick_nohz_next_event()
> count 8839790 6142079 36623
>
> Raphaels Approach still does the tick_nohz_get_sleep_length() execution
> unconditional. It executes ~5000 times more tick_nohz_next_event() as the
> tick is stopped. This relation improved massively in Peters approach
> (factor is ~7).

So I'm drawing slightly different conclusions from the above.

First, because there are different numbers of idle cycles in each
case, I think it only makes sense to look at the percentages.

So in both the "upstream" and "Peter RFC" cases there is a significant
percentage of times when the tick was stopped and the measure idle
duration was below 25 us (25.39% and 25.54%, respectively), whereas in
the "Rafael v2" case that is only 0.51% of times (not even 1%). This
means a huge improvement to me, because all of these cases mean that
the governor incorrectly told the idle loop to stop the tick (it must
have selected a shallow state and so it should have not stopped the
tick in those cases). To me, this clearly means fixing a real bug in
the governor.

Now, I said in the changelog of my v1 that the goal was not to reduce
the number of tick_nohz_get_sleep_length() invocations, so I'm not
sure why it is regarded as a comparison criteria.