Re: [PATCH v2 4/8] sched/fair: rework load_balance

From: Valentin Schneider
Date: Mon Aug 05 2019 - 13:07:08 EST


Hi Vincent,

Here's another batch of comments; I still need to go through some more of it.

On 01/08/2019 15:40, Vincent Guittot wrote:
> The load_balance algorithm contains some heuristics which have becomes

s/becomes/become/

> meaningless since the rework of metrics and the introduction of PELT.
^^^^^^^^^^
Which metrics? I suppose you mean the s*_lb_stats structs?

>
> Furthermore, it's sometimes difficult to fix wrong scheduling decisions
> because everything is based on load whereas some imbalances are not
> related to the load.

Hmm, well, they don't start as wrong decisions, right? It's just that
certain imbalance scenarios can't be solved by looking only at load.

What about something along these lines?

"""
Furthermore, load is an ill-suited metric for solving certain task
placement imbalance scenarios. For instance, in the presence of idle CPUs,
we should simply try to get at least one task per CPU, whereas the current
load-based algorithm can actually leave idle CPUs alone simply because the
load is somewhat balanced.
"""

> The current algorithm ends up to create virtual and

s/to create/creating/

> meaningless value like the avg_load_per_task or tweaks the state of a
> group to make it overloaded whereas it's not, in order to try to migrate
> tasks.
>
> load_balance should better qualify the imbalance of the group and define
> cleary what has to be moved to fix this imbalance.

s/define cleary/clearly define/

>
> The type of sched_group has been extended to better reflect the type of
> imbalance. We now have :
> group_has_spare
> group_fully_busy
> group_misfit_task
> group_asym_capacity
> group_imbalanced
> group_overloaded
>
> Based on the type of sched_group, load_balance now sets what it wants to
> move in order to fix the imnbalance. It can be some load as before but

s/imnbalance/imbalance/

> also some utilization, a number of task or a type of task:
> migrate_task
> migrate_util
> migrate_load
> migrate_misfit
>
> This new load_balance algorithm fixes several pending wrong tasks
> placement:
> - the 1 task per CPU case with asymetrics system

s/asymetrics/asymmetric/

> - the case of cfs task preempted by other class
> - the case of tasks not evenly spread on groups with spare capacity
>
> The load balance decisions have been gathered in 3 functions:
> - update_sd_pick_busiest() select the busiest sched_group.
> - find_busiest_group() checks if there is an imabalance between local and

s/imabalance/imbalance/

> busiest group.
> - calculate_imbalance() decides what have to be moved.

That's nothing new, is it? I think what you mean here is that the
decisions have been consolidated in those functions, rather than being
spread all over the place.

>
> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> ---
> kernel/sched/fair.c | 581 ++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 379 insertions(+), 202 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d7f4a7e..a8681c3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7136,13 +7136,28 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
>
> enum fbq_type { regular, remote, all };
>
> +/*
> + * group_type describes the group of CPUs at the moment of the load balance.
> + * The enum is ordered by pulling priority, with the group with lowest priority
> + * first so the groupe_type can be simply compared when selecting the busiest
> + * group. see update_sd_pick_busiest().
> + */
> enum group_type {
> - group_other = 0,
> + group_has_spare = 0,
> + group_fully_busy,
> group_misfit_task,
> + group_asym_capacity,

That one got me confused - it's about asym packing, not asym capacity, and
the name should reflect that. I'd vote for group_asym_packing to stay in
line with what Quentin did for the sd shortcut pointers in

011b27bb5d31 ("sched/topology: Add lowest CPU asymmetry sched_domain level pointer")
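
i.e. something like this (just a sketch with the rename, keeping your
ordering otherwise):

  enum group_type {
          group_has_spare = 0,
          group_fully_busy,
          group_misfit_task,
          group_asym_packing,
          group_imbalanced,
          group_overloaded,
  };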

> group_imbalanced,
> group_overloaded,
> };
>
> +enum migration_type {
> + migrate_task = 0,
> + migrate_util,
> + migrate_load,
> + migrate_misfit,

Nitpicking here: AFAICT the ordering of this enum doesn't really matter;
could we place migrate_misfit next to migrate_task? As you call out in the
header, we can migrate a number of tasks or a type of task, so these should
be somewhat together.

If we're afraid that we'll add other types of tasks later on and that this
won't result in a neat append-to-the-end, we could reverse the order like
so:

migrate_load
migrate_util
migrate_task
migrate_misfit
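
i.e. (sketch of the reversed ordering):

  enum migration_type {
          migrate_load = 0,
          migrate_util,
          migrate_task,
          migrate_misfit
  };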

[...]
> @@ -7745,10 +7793,10 @@ struct sg_lb_stats {
> struct sd_lb_stats {
> struct sched_group *busiest; /* Busiest group in this sd */
> struct sched_group *local; /* Local group in this sd */
> - unsigned long total_running;

Could be worth calling out in the log that this gets snipped out. Or it
could go into its own small cleanup patch, since it's just an unused field.

[...]
> @@ -8147,11 +8223,67 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> if (sgs->group_type < busiest->group_type)
> return false;
>
> - if (sgs->avg_load <= busiest->avg_load)
> + /*
> + * The candidate and the current busiest group are the same type of
> + * group. Let check which one is the busiest according to the type.
> + */
> +
> + switch (sgs->group_type) {
> + case group_overloaded:
> + /* Select the overloaded group with highest avg_load. */
> + if (sgs->avg_load <= busiest->avg_load)
> + return false;
> + break;
> +
> + case group_imbalanced:
> + /* Select the 1st imbalanced group as we don't have
> + * any way to choose one more than another
> + */
> return false;
> + break;

You already have an unconditional return above.
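
i.e. the break after an unconditional return is dead code, so this could
just be (sketch):

  case group_imbalanced:
          /*
           * Select the 1st imbalanced group as we don't have
           * any way to choose one more than another.
           */
          return false;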

>
> - if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> - goto asym_packing;
> + case group_asym_capacity:
> + /* Prefer to move from lowest priority CPU's work */
> + if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
> + return false;
^
Stray whitespace

[...]
> @@ -8438,17 +8581,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> local = &sds.local_stat;
> busiest = &sds.busiest_stat;
>
> - /* ASYM feature bypasses nice load balance check */
> - if (busiest->group_asym_capacity)
> - goto force_balance;
> -
> /* There is no busy sibling group to pull tasks from */
> if (!sds.busiest || busiest->sum_h_nr_running == 0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
That can go, since you now filter this in update_sd_pick_busiest()
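
i.e. AFAICT that check could be reduced to something like:

  /* There is no busy sibling group to pull tasks from */
  if (!sds.busiest)
          goto out_balanced;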

[...]
> @@ -8459,59 +8602,71 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> goto force_balance;
>
> /*
> - * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
> - * capacities from resulting in underutilization due to avg_load.
> - */
> - if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> - busiest->group_no_capacity)
> - goto force_balance;
> -
> - /* Misfit tasks should be dealt with regardless of the avg load */
> - if (busiest->group_type == group_misfit_task)
> - goto force_balance;
> -
> - /*
> * If the local group is busier than the selected busiest group
> * don't try and pull any tasks.
> */
> - if (local->avg_load >= busiest->avg_load)
> + if (local->group_type > busiest->group_type)
> goto out_balanced;
>
> /*
> - * Don't pull any tasks if this group is already above the domain
> - * average load.
> + * When groups are overloaded, use the avg_load to ensure fairness
> + * between tasks.
> */
> - if (local->avg_load >= sds.avg_load)
> - goto out_balanced;
> + if (local->group_type == group_overloaded) {
> + /*
> + * If the local group is more loaded than the selected
> + * busiest group don't try and pull any tasks.
> + */
> + if (local->avg_load >= busiest->avg_load)
> + goto out_balanced;
> +
> + /* XXX broken for overlapping NUMA groups */
> + sds.avg_load = (sds.total_load * SCHED_CAPACITY_SCALE) /
> + sds.total_capacity;
>
> - if (env->idle == CPU_IDLE) {
> /*
> - * This CPU is idle. If the busiest group is not overloaded
> - * and there is no imbalance between this and busiest group
> - * wrt idle CPUs, it is balanced. The imbalance becomes
> - * significant if the diff is greater than 1 otherwise we
> - * might end up to just move the imbalance on another group
> + * Don't pull any tasks if this group is already above the
> + * domain average load.
> */
> - if ((busiest->group_type != group_overloaded) &&
> - (local->idle_cpus <= (busiest->idle_cpus + 1)))
> + if (local->avg_load >= sds.avg_load)
> goto out_balanced;
> - } else {
> +
> /*
> - * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
> - * imbalance_pct to be conservative.
> + * If the busiest group is more loaded, use imbalance_pct to be
> + * conservative.
> */
> if (100 * busiest->avg_load <=
> env->sd->imbalance_pct * local->avg_load)
> goto out_balanced;
> +
> }
>
> + /* Try to move all excess tasks to child's sibling domain */
> + if (sds.prefer_sibling && local->group_type == group_has_spare &&
> + busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
> + goto force_balance;
> +
> + if (busiest->group_type != group_overloaded &&
> + (env->idle == CPU_NOT_IDLE ||
> + local->idle_cpus <= (busiest->idle_cpus + 1)))
> + /*
> + * If the busiest group is not overloaded
> + * and there is no imbalance between this and busiest group
> + * wrt idle CPUs, it is balanced. The imbalance
> + * becomes significant if the diff is greater than 1 otherwise
> + * we might end up to just move the imbalance on another
> + * group.
> + */
> + goto out_balanced;
> +
> force_balance:
> /* Looks like there is an imbalance. Compute it */
> - env->src_grp_type = busiest->group_type;
> calculate_imbalance(env, &sds);
> +

Stray newline?

> return env->imbalance ? sds.busiest : NULL;
>
> out_balanced:
> +

Ditto?

> env->imbalance = 0;
> return NULL;
> }
[...]