Re: [PATCH] cpuidle: New timer events oriented governor for tickless systems

From: Rafael J. Wysocki
Date: Tue Dec 18 2018 - 10:31:29 EST


Hi,

On Tue, Dec 18, 2018 at 12:50 PM Quentin Perret <quentin.perret@xxxxxxx> wrote:
>
> Hi Rafael,
>
> I skimmed through the previous versions, but apologies if some of these
> things have been discussed already :-)
>
> On Tuesday 18 Dec 2018 at 12:09:05 (+0100), Rafael J. Wysocki wrote:
> <...>
> > + * - Find an idle state on the basis of the sleep length and state statistics
> > + * collected over time:
> > + *
> > + * o Find the deepest idle state whose target residency is less than or euqal
>
> s/euqal/equal
>
>
> <...>
> > +/**
> > + * struct teo_idle_state - Idle state data used by the TEO cpuidle governor.
> > + * @early_hits: "Early" CPU wakeups "matching" this state.
> > + * @hits: "On time" CPU wakeups "matching" this state.
> > + * @misses: CPU wakeups "missing" this state.
> > + *
> > + * A CPU wakeup is "matched" by a given idle state if the idle duration measured
> > + * after the wakeup is between the target residency of that state and the target
> > + * residnecy of the next one (or if this is the deepest available idle state, it
>
> s/residnecy/residency
>
> > + * "matches" a CPU wakeup when the measured idle duration is at least equal to
> > + * its target residency).
> > + *
> > + * Also, from the TEO governor prespective, a CPU wakeup from idle is "early" if
>
> s/prespective/perspective

Thanks for the typo fixes (ongoing spellchecker issues here).

> > + * it occurs significantly earlier than the closest expected timer event (that
> > + * is, early enough to match an idle state shallower than the one matching the
> > + * time till the closest timer event). Otherwise, the wakeup is "on time", or
> > + * it is a "hit".
> > + *
> > + * A "miss" occurs when the given state doesn't match the wakeup, but it matches
> > + * the time till the closest timer event used for idle state selection.
> > + */
>
>
> <...>
> > +static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> > +{
> > + struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> > + unsigned int sleep_length_us = ktime_to_us(cpu_data->sleep_length_ns);
> > + int i, idx_hit = -1, idx_timer = -1;
> > + unsigned int measured_us;
> > +
> > + if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
> > + /*
> > + * One of the safety nets has triggered or this was a timer
> > + * wakeup (or equivalent).
> > + */
> > + measured_us = sleep_length_us;
> > + } else {
> > + unsigned int lat = drv->states[cpu_data->last_state].exit_latency;
> > +
> > + measured_us = ktime_to_us(cpu_data->time_span_ns);
> > + /*
> > + * The delay between the wakeup and the first instruction
> > + * executed by the CPU is not likely to be worst-case every
> > + * time, so take 1/2 of the exit latency as a very rough
> > + * approximation of the average of it.
> > + */
> > + if (measured_us >= lat)
> > + measured_us -= lat / 2;
> > + else
> > + measured_us /= 2;
>
> I'm not sure to understand this; why not removing lat/2 in all cases and
> cap at 0 ?

Well, 0 is special on platforms where state 0 is a "polling" one, so
it's better to avoid it. So this is the reason, but that's a matter
of handling the corner case this way or another.

> > + }
>
>
> <...>
> > + /*
> > + * Save idle duration values corresponding to non-timer wakeups for
> > + * pattern detection.
> > + *
> > + * If the total time span between idle state selection and the "reflect"
> > + * callback is greater than or equal to the sleep length determined at
> > + * the idle state selection time, the wakeup is likely to be due to a
> > + * timer event.
>
> Should this paragraph go above the first 'if' of this function ?

Of course I can move it.

> It checks the same thing and it took me some time to understand what you
> were doing while the explanation I needed was down here all along :-)

Fair enough.

> > + */
> > + if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns)
> > + measured_us = UINT_MAX;
> > +
> > + cpu_data->intervals[cpu_data->interval_idx++] = measured_us;
> > + if (cpu_data->interval_idx > INTERVALS)
> > + cpu_data->interval_idx = 0;
> > +}
>
>
> <...>
> > + /*
> > + * Count and sum the most recent idle duration values less than
> > + * the target residency of the state selected so far, find the
> > + * max.
> > + */
> > + for (i = 0; i < INTERVALS; i++) {
> > + unsigned int val = cpu_data->intervals[i];
> > +
> > + if (val >= drv->states[idx].target_residency)
> > + continue;
> > +
> > + count++;
> > + sum += val;
> > + }
> > +
> > + /*
> > + * Give up unless the majority of the most recent idle duration
> > + * values are in the interesting range.
> > + */
> > + if (count > INTERVALS / 2) {
>
> I understand this probably works well enough in practice, but how 'robust'
> is supposed to be this filtering here ? If you have, say, 2 tasks with
> prime periods, then the hyper-period (which is when the pattern repeats)
> can get high really quickly, and you'll never see the full extent of
> that pattern with this mechanism.

That's correct, but this is just about whether or not there is a
reason to select an idle state shallower than the one selected
already. How the pattern looks like exactly doesn't matter too much
here AFAICS.

> Yes, it's hard to do much better without introducing tons of complexity
> in here, but how did you define INTERVALS=8 and so ?

I blatantly stole this number from the menu governor. :-)

> Is this something you'd like to make configurable at some point ?

Maybe?

> I guess this could always be done later if need be.

Precisely.

> > + unsigned int avg_us = div64_u64(sum, count);
> > +
> > + /*
> > + * Avoid spending too much time in an idle state that
> > + * would be too shallow.
> > + */
> > + if (!(tick_nohz_tick_stopped() && avg_us < TICK_USEC)) {
>
> IIUC, this is saying we shouldn't trust the predicted sleep time if it
> is shorter than the tick and the tick is stopped. What is the use case
> you have in mind for this ?
>
> In other words, in which scenario do you expect to see a lot of high
> frequency non-timer-related wake-ups and have the tick disabled ?

It is mostly theoretical, but the rationale here is that the cost of
using a shallow idle state with the tick stopped is potentially very
high (the CPU may stay in that idle state for a very long time then),
so it is better to assume misprediction in that case to avoid that
cost.

> > + idx = teo_find_shallower_state(drv, dev, idx, avg_us);
> > + duration_us = avg_us;
> > + }
> > + }
> > + }

Cheers,
Rafael