Re: [RFC PATCHC 3/3] sched/fair: use the idle state info to choose the idlest cpu

From: Nicolas Pitre
Date: Fri Apr 04 2014 - 12:57:08 EST


On Fri, 4 Apr 2014, Rafael J. Wysocki wrote:

> On Tuesday, April 01, 2014 11:05:49 PM Nicolas Pitre wrote:
> > On Fri, 28 Mar 2014, Daniel Lezcano wrote:
> >
> > > As we know in which idle state the cpu is, we can investigate the following:
> > >
> > > 1. when did the cpu entered the idle state ? the longer the cpu is idle, the
> > > deeper it is idle
> > > 2. what exit latency is ? the greater the exit latency is, the deeper it is
> > >
> > > With both information, when all cpus are idle, we can choose the idlest cpu.
> > >
> > > When one cpu is not idle, the old check against weighted load applies.
> > >
> > > Signed-off-by: Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>
> >
> > There seems to be some problems with the implementation.
> >
> > > @@ -4336,20 +4337,53 @@ static int
> > > find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> > > {
> > > unsigned long load, min_load = ULONG_MAX;
> > > - int idlest = -1;
> > > + unsigned int min_exit_latency = UINT_MAX;
> > > + u64 idle_stamp, min_idle_stamp = ULONG_MAX;
> >
> > I don't think you really meant to assign an u64 variable with ULONG_MAX.
> > You probably want ULLONG_MAX here. And probably not in fact (more
> > later).
> >
> > > +
> > > + struct rq *rq;
> > > + struct cpuidle_power *power;
> > > +
> > > + int cpu_idle = -1;
> > > + int cpu_busy = -1;
> > > int i;
> > >
> > > /* Traverse only the allowed CPUs */
> > > for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> > > - load = weighted_cpuload(i);
> > >
> > > - if (load < min_load || (load == min_load && i == this_cpu)) {
> > > - min_load = load;
> > > - idlest = i;
> > > + if (idle_cpu(i)) {
> > > +
> > > + rq = cpu_rq(i);
> > > + power = rq->power;
> > > + idle_stamp = rq->idle_stamp;
> > > +
> > > + /* The cpu is idle since a shorter time */
> > > + if (idle_stamp < min_idle_stamp) {
> > > + min_idle_stamp = idle_stamp;
> > > + cpu_idle = i;
> > > + continue;
> >
> > Don't you want the highest time stamp in order to select the most
> > recently idled CPU? Favoring the CPU which has been idle the longest
> > makes little sense.
>
> It may make sense if the hardware can auto-promote CPUs to deeper C-states.

If so the promotion will happen over time, no? What I'm saying here is
that those CPUs which have been idle longer should not be favored when
it is time to select a CPU for a task to run. More recently idled CPUs
are more likely to be in a shallower C-state.

> Something like that happens with package C-states that are only entered when
> all cores have entered a particular core C-state already. In that case the
> probability of the core being in a deeper state grows with time.

Exactly what I'm saying.

Also here it is worth remembering that the scheduling domains should
represent those packages that share common C-states at a higher level.
The scheduler can then be told not to balance across domains if it
doesn't need to in order to favor the conditions for those package
C-states to be used. That's what the task packing patch series is
about, independently of this one.

> That said I would just drop this heuristics for the time being. If auto-promotion
> is disregarded, it doesn't really matter how much time the given CPU has been idle
> except for one case: When the target residency of its idle state hasn't been
> reached yet, waking up the CPU may be a mistake (depending on how deep the state
> actually is, but for the majority of drivers in the tree we don't have any measure
> of that).

There is one reason for considering the time a CPU has been idle,
assuming equivalent C-state, and that is cache snooping. The longer a
CPU is idle, the more likely its cache content will have been claimed
and migrated by other CPUs. Of course that doesn't make much difference
for deeper C-states where the cache isn't preserved, but it is probably
simpler and cheaper to apply this heuristic in all cases.

> > > + }
> > > +
> > > + /* The cpu is idle but the exit_latency is shorter */
> > > + if (power && power->exit_latency < min_exit_latency) {
> > > + min_exit_latency = power->exit_latency;
> > > + cpu_idle = i;
> > > + continue;
> > > + }
> >
> > I think this is wrong. This gives priority to CPUs which have been idle
> > for a (longer... although this should have been) shorter period of time
> > over those with a shallower idle state. I think this should rather be:
> >
> > if (power && power->exit_latency < min_exit_latency) {
> > min_exit_latency = power->exit_latency;
> > latest_idle_stamp = idle_stamp;
> > cpu_idle = i;
> > } else if ((!power || power->exit_latency == min_exit_latency) &&
> > idle_stamp > latest_idle_stamp) {
> > latest_idle_stamp = idle_stamp;
> > cpu_idle = i;
> > }
> >
> > So the CPU with the shallowest idle state is selected in priority, and
> > if many CPUs are in the same state then the time stamp is used to
> > select the most recent one.
>
> Again, if auto-promotion is disregarded, it doesn't really matter which of them
> is woken up.

If it doesn't matter then it doesn't hurt. But in some cases it
matters.

> > Whenever a shallower idle state is found then the latest_idle_stamp is reset for
> > that state even if it is further in the past.
> >
> > > + } else {
> > > +
> > > + load = weighted_cpuload(i);
> > > +
> > > + if (load < min_load ||
> > > + (load == min_load && i == this_cpu)) {
> > > + min_load = load;
> > > + cpu_busy = i;
> > > + continue;
> > > + }
> > > }
> >
> > I think this is wrong to do an if-else based on idle_cpu() here. What
> > if a CPU is heavily loaded, but for some reason it happens to be idle at
> > this very moment? With your patch it could be selected as an idle CPU
> > while it would be discarded as being too busy otherwise.
>
> But see below ->
>
> > It is important to determine both cpu_busy and cpu_idle for all CPUs.
> >
> > And cpu_busy is a bad name for this. Something like least_loaded would
> > be more self explanatory. Same thing for cpu_idle which could be
> > clearer if named shalloest_idle.
>
> shallowest_idle?

Something that means the CPU with the shallowest C-state. Using
"cpu_idle" for this variable doesn't cut it.

> > > - return idlest;
> > > + /* Busy cpus are considered less idle than idle cpus ;) */
> > > + return cpu_busy != -1 ? cpu_busy : cpu_idle;
> >
> > And finally it is a policy decision whether or not we want to return
> > least_loaded over shallowest_idle e.g do we pack tasks on non idle CPUs
> > first or not. That in itself needs more investigation. To keep the
> > existing policy unchanged for now the above condition should have its
> > variables swapped.
>
> Which means that once we've find the first idle CPU, it is not useful to
> continue computing least_loaded, because we will return the idle one anyway,
> right?

Good point. Currently, that should be the case.

Eventually we'll want to put new tasks on lightly loaded CPUs instead of
waking up a fully idle CPU in order to favor deeper C-states. But that
requires a patch series of its own just to determine how loaded a CPU is
and how much work it can still accommodate before being oversubscribed,
etc.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/