Re: [PATCH] sched/fair: fix 1 task per CPU

From: Valentin Schneider
Date: Sat Sep 08 2018 - 16:17:10 EST


Hi Vincent,

On 07/09/18 08:40, Vincent Guittot wrote:
> When CPUs have different capacity because of RT/DL tasks or
> micro-architecture or max frequency differences, there are situation where
> the imbalance is not correctly set to migrate waiting task on the idle CPU.
>

This is essentially always for down-migrations (high-capacity src CPU to
lower-capacity dst CPU), right?

For instance, I've often seen this issue on the HiKey960 (4x4 big.LITTLE):
if you spawn 8 CPU hogs you can end up with 5 on the bigs and 3 on the
LITTLEs, but since avg_load scales with the inverse of group_capacity, that
split is seen as balanced. I don't recall seeing this problem for
low-capacity -> high-capacity migrations, especially since the misfit logic
makes that a non-problem.
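
To put some (purely illustrative) numbers on that - say ~1024 of capacity
per big, ~462 per LITTLE, and ~1024 of load per always-running nice-0
task - update_sg_lb_stats() computes:

	avg_load = (group_load * SCHED_CAPACITY_SCALE) / group_capacity;

	bigs:    (5 * 1024) * 1024 / (4 * 1024)  = 1280
	LITTLEs: (3 * 1024) * 1024 / (4 * 462)  ~= 1702

so with those (made up) numbers the LITTLE group can actually look *more*
loaded than the big one, and the 5+3 split goes unnoticed.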

Also, a nit on the commit header: although we've called it "1 task per CPU"
in the past, it might be more explicit to call it "1 long-running task per
CPU", if only for the sake of clarity for people not directly involved in
our previous discussions.

> The UC uses the force_balance case :
> if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> busiest->group_no_capacity)
> goto force_balance;
>
> But calculate_imbalance fails to set the right amount of load to migrate
> a task because of the special condition:
> busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load)
>
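
For the record, that's this bit of calculate_imbalance():

	if (busiest->avg_load <= sds->avg_load ||
	    local->avg_load >= sds->avg_load) {
		env->imbalance = 0;
		return fix_small_imbalance(env, sds);
	}

so when it triggers we reach fix_small_imbalance() with an imbalance of 0,
which is where your new condition comes into play.
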
> Add in fix_small_imbalance, this special case that triggered the force
> balance in order to make sure that the amount of load to migrate will be
> enough.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> ---

So we happen to have come up with a (relatively) similar patch for Android
(see [1]), which also stemmed from the "1 always-running task per CPU"
issue. It only applies to SD_ASYM_CPUCAPACITY sched domains though, and
targets down-migrations (local->group_capacity < busiest->group_capacity).

I'd argue that we should make it so this kind of imbalance gets detected way
before we hit fix_small_imbalance(), but until we get there I think we do want
an extra condition like this.
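
The gist of it is something along these lines (paraphrasing from memory,
see [1] for the actual hunk):

	/*
	 * Sketch only, not the exact Android change: on asymmetric
	 * capacity domains, when the local group is the lower-capacity
	 * one and the busiest group is overloaded, pull at least one
	 * task rather than bailing out with a 0 imbalance.
	 */
	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
	    local->group_capacity < busiest->group_capacity &&
	    busiest->sum_nr_running > busiest->group_weight) {
		env->imbalance = busiest->load_per_task;
		return;
	}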

> kernel/sched/fair.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 309c93f..57b4d83 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8048,6 +8048,20 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> local = &sds->local_stat;
> busiest = &sds->busiest_stat;
>
> + /*
> + * There is available capacity in local group and busiest group is
> + * overloaded but calculate_imbalance can't compute the amount of load
> + * to migrate because they became meaningless because asymetric

What do you mean by 'they'?

Also, s/because/due to/ on the second "because".

> + * capacity between group. In such case, we only want to migrate at
> + * least one tasks of the busiest group and rely of the average load
> + * per task to ensure the migration.
> + */
> + if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> + busiest->group_no_capacity) {
> + env->imbalance = busiest->load_per_task;
> + return;
> + }
> +

The condition is an exact copy of the one in that goto force_balance
path you mention (in find_busiest_group()). Is that just a coincidence? If
not, you might want to mention it in the comment.
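
FWIW, if it's not a coincidence, one way to keep the two sites from
drifting apart would be a small helper, e.g. (untested sketch, name is a
strawman):

	/*
	 * Hypothetical helper for the check currently open-coded in both
	 * find_busiest_group() and fix_small_imbalance().
	 */
	static inline bool want_force_balance(struct lb_env *env,
					      struct sg_lb_stats *local,
					      struct sg_lb_stats *busiest)
	{
		return env->idle != CPU_NOT_IDLE &&
		       group_has_capacity(env, local) &&
		       busiest->group_no_capacity;
	}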

> if (!local->sum_nr_running)
> local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
> else if (busiest->load_per_task > local->load_per_task)
>

[1]: https://android.googlesource.com/kernel/common/+/9fae72e924961f3b32816193fcb56d19c1f644c2%5E%21/#F0