Re: [PATCH v2] sched: balance_cpu to consider other cpus in itsgroup as target of (pinned) task

From: Peter Zijlstra
Date: Wed Jun 13 2012 - 08:31:48 EST


On Wed, 2012-06-06 at 18:36 +0530, Prashanth Nageshappa wrote:
> From: Srivatsa Vaddagiri <vatsa@xxxxxxxxxxxxxxxxxx>
>
> Current load balance scheme lets one cpu in a sched_group (balance_cpu)
> look at other peer sched_groups for imbalance and pull tasks to
> itself from a busy cpu. Tasks thus pulled to balance_cpu will later get
> picked up by cpus that are in the same sched_group as that of balance_cpu.

Not strictly true.. but close enough I think.

> This scheme fails to pull tasks that are not allowed to run on
> balance_cpu (but are allowed to run on other cpus in its sched_group).
>
> This can affect fairness and in some worst case scenarios cause
> starvation, as illustrated below. Consider a two core (2 threads/core)
> system running tasks as below:
>
> Core Core
>
> C0 - F0 C2 - F1
> C1 - T1 C3 - idle
>
> F0 & F1 are SCHED_FIFO cpu hogs pinned to C0 & C2 respectively, while T1 is
> a SCHED_OTHER task pinned to C1. Another SCHED_OTHER task T2 (which can
> run on cpus 1,2) now wakes up and lands on its prev_cpu of C2, which is
> now running SCHED_FIFO cpu hog. To prevent starvation, T2 needs to move to C1.
> However between C0 & C1, C0 is chosen to balance its core with peer cores and
> thus fails to pull T2 towards its core (C0 not being in T2's affinity mask). T2 was
> found to starve eternally in this case.
>
> Although the problem is illustrated in presence of rt tasks, this is a
> general problem that can manifest in presence of non-rt tasks as well.

If you can come up with a relatively simple example without rt tasks
that would be preferred. Also I think the above description could be
improved.

The graph only confuses me and I'm not sure the description is maximally
concise either.

> Some solutions that were considered to solve this problem were:
>
> - Have the right sibling cpus to do load balance ignoring balance_cpu
>
> - Modify move_tasks to move a pinned tasks to a sibling cpu in the
> same sched_group as env->dst_cpu. This will involve some runqueue
> lock juggling (a third runqueue locks needs to be taken when we
> already have two locks held). Moreover we may be just fine to ignore
> that particular task and meet load balance goals by moving other
> tasks.
>
> - Hint that move_tasks should be called with a different env->dst_cpu
>
> This patch implements the 3rd of the above approach, which seemed least
> invasive. Essentially can_migrate_task() records if any task(s) were not moved
> as the destination cpu was not in the cpus_allowed mask of the target task(s)
> and the new destination cpu that task can be moved to. We reissue a call
> to move_tasks with that new destination cpu, provided we failed to meet
> load balance goal by moving other tasks from env->src_cpu.

Still no word on why doing a 3-way pull is ok.. (I think it is, I just
want you to convince me)..

> Signed-off-by: Srivatsa Vaddagiri <vatsa@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Prashanth Nageshappa <prashanth@xxxxxxxxxxxxxxxxxx>
>
> ----
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 939fd63..21a59fc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3098,6 +3098,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
>
> #define LBF_ALL_PINNED 0x01
> #define LBF_NEED_BREAK 0x02
> +#define LBF_NEW_DST_CPU 0x04

I still don't really like the new_dst_cpu name, I think it is because it
describes a solution detail, not a problem diagnosis.

> @@ -3198,7 +3201,26 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> * 3) are cache-hot on their current CPU.
> */
> if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
> - schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
> + int new_dst_cpu;
> +
> + if (!env->dst_grpmask) {
> + schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
> + return 0;
> + }
> +
> + /*
> + * remember if this task can be moved to any other cpus in our
> + * sched_group so that we can retry load balance and move
> + * that task to a new_dst_cpu if required.
> + */
> + new_dst_cpu = cpumask_first_and(env->dst_grpmask,
> + tsk_cpus_allowed(p));
> + if (new_dst_cpu >= nr_cpu_ids) {
> + schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
> + } else {
> + env->flags |= LBF_NEW_DST_CPU;
> + env->new_dst_cpu = new_dst_cpu;
> + }
> return 0;
> }

This wants a comment describing why we're doing this.. this also wants a
comment describing why we do what we do about multiple new_dst.. I think
we should do the cheap thing and not compute a new dst if we already
have one.

For bonus points you'll also add a comment for the all pinned muck,
although that's somewhat more obvious.

> env->flags &= ~LBF_ALL_PINNED;
> @@ -4440,7 +4462,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> struct sched_domain *sd, enum cpu_idle_type idle,
> int *balance)
> {
> - int ld_moved, active_balance = 0;
> + int ld_moved, cur_ld_moved, active_balance = 0;

better than the old_ld_moved, but still not really obvious.. maybe we
can cure this with a comment...

> struct sched_group *group;
> struct rq *busiest;
> unsigned long flags;
> @@ -4450,6 +4472,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> .sd = sd,
> .dst_cpu = this_cpu,
> .dst_rq = this_rq,
> + .dst_grpmask = sched_group_cpus(sd->groups),
> .idle = idle,
> .loop_break = sched_nr_migrate_break,
> .find_busiest_queue = find_busiest_queue,
> @@ -4502,7 +4525,8 @@ more_balance:
> double_rq_lock(this_rq, busiest);
> if (!env.loop)
> update_h_load(env.src_cpu);
> - ld_moved += move_tasks(&env);
> + cur_ld_moved = move_tasks(&env);
> + ld_moved += cur_ld_moved;
> double_rq_unlock(this_rq, busiest);
> local_irq_restore(flags);
>
> @@ -4514,8 +4538,23 @@ more_balance:
> /*
> * some other cpu did the load balance for us.

so add something here describing the cur_ld_moved vs ld_moved thing..

> */

> + if (cur_ld_moved && env.dst_cpu != smp_processor_id())
> + resched_cpu(env.dst_cpu);
> +
> + if ((env.flags & LBF_NEW_DST_CPU) && (env.imbalance > 0)) {

Is this different from the last version?, ISTR only doing the loop again
when we didn't move anything.

> + /*
> + * we could not balance completely as some tasks
> + * were not allowed to move to the dst_cpu, so try
> + * again with new_dst_cpu.
> + */
> + this_rq = cpu_rq(env.new_dst_cpu);
> + env.dst_rq = this_rq;
> + env.dst_cpu = env.new_dst_cpu;
> + env.flags &= ~LBF_NEW_DST_CPU;
> + env.loop = 0;
> + env.loop_break = sched_nr_migrate_break;
> + goto more_balance;

What guarantees there's an end to this iteration? Also we explicitly use
more_balance over redo so that we're sure to iterate the same src cpu
because otherwise our new dst_cpu is meaningless? Sounds like something
that would be good to mention in .. wait for it .. a comment?

> + }
>
> /* All tasks on this runqueue were pinned by CPU affinity */
> if (unlikely(env.flags & LBF_ALL_PINNED)) {
> @@ -4716,6 +4755,7 @@ static int active_load_balance_cpu_stop(void *data)
> .sd = sd,
> .dst_cpu = target_cpu,
> .dst_rq = target_rq,
> + .dst_grpmask = NULL,

This is superfluous, c99 initialization zeros all unmentioned members.

> .src_cpu = busiest_rq->cpu,
> .src_rq = busiest_rq,
> .idle = CPU_IDLE,
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/