Re: [PATCH v3 04/10] sched/fair: rework load_balance

From: Vincent Guittot
Date: Tue Oct 01 2019 - 04:14:54 EST


On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>
> Hi Vincent,
>
> On 19/09/2019 09:33, Vincent Guittot wrote:
>
> these are just some comments & questions based on a code study. Haven't
> run any tests with it yet.
>
> [...]
>
> > The type of sched_group has been extended to better reflect the type of
> > imbalance. We now have :
> > group_has_spare
> > group_fully_busy
> > group_misfit_task
> > group_asym_capacity
>
> s/group_asym_capacity/group_asym_packing

Yes, I forgot to change the commit message and the comments

>
> > group_imbalanced
> > group_overloaded
> >
> > Based on the type fo sched_group, load_balance now sets what it wants to
>
> s/fo/for

s/fo/of/

>
> > move in order to fix the imbalance. It can be some load as before but also
> > some utilization, a number of task or a type of task:
> > migrate_task
> > migrate_util
> > migrate_load
> > migrate_misfit
> >
> > This new load_balance algorithm fixes several pending wrong tasks
> > placement:
> > - the 1 task per CPU case with asymmetrics system
>
> s/asymmetrics/asymmetric

yes

>
> This stands for ASYM_CPUCAPACITY and ASYM_PACKING, right?

yes

>
> [...]
>
> > #define LBF_ALL_PINNED 0x01
> > @@ -7115,7 +7130,7 @@ struct lb_env {
> > unsigned int loop_max;
> >
> > enum fbq_type fbq_type;
> > - enum group_type src_grp_type;
> > + enum migration_type balance_type;
>
> Minor thing:
>
> Why not
> enum migration_type migration_type;
> or
> enum balance_type balance_type;
>
> We do the same for other enums like fbq_type or group_type.

yes, I can align

>
> > struct list_head tasks;
> > };
> >
>
> The detach_tasks() comment still mentions only runnable load.

ok

>
> > @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
> > {
> > struct list_head *tasks = &env->src_rq->cfs_tasks;
> > struct task_struct *p;
> > - unsigned long load;
> > + unsigned long util, load;
>
> Minor: Order by length or reduce scope to while loop ?

I don't get your point here

>
> > int detached = 0;
> >
> > lockdep_assert_held(&env->src_rq->lock);
> > @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)
> > if (!can_migrate_task(p, env))
> > goto next;
> >
> > - load = task_h_load(p);
> > + switch (env->balance_type) {
> > + case migrate_load:
> > + load = task_h_load(p);
> >
> > - if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> > - goto next;
> > + if (sched_feat(LB_MIN) &&
> > + load < 16 && !env->sd->nr_balance_failed)
> > + goto next;
> >
> > - if ((load / 2) > env->imbalance)
> > - goto next;
> > + if ((load / 2) > env->imbalance)
> > + goto next;
> > +
> > + env->imbalance -= load;
> > + break;
> > +
> > + case migrate_util:
> > + util = task_util_est(p);
> > +
> > + if (util > env->imbalance)
> > + goto next;
> > +
> > + env->imbalance -= util;
> > + break;
> > +
> > + case migrate_task:
> > + /* Migrate task */
>
> Minor: IMHO, this comment is not necessary.

yes

>
> > + env->imbalance--;
> > + break;
> > +
> > + case migrate_misfit:
> > + load = task_h_load(p);
> > +
> > + /*
> > + * utilization of misfit task might decrease a bit
>
> This patch still uses load. IMHO this comment becomes true only with 08/10.

Even with 8/10 it's not correct, and it has been removed there.
I'm going to remove it here as well.

>
> [...]
>
> > @@ -7707,13 +7755,11 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> > *sds = (struct sd_lb_stats){
> > .busiest = NULL,
> > .local = NULL,
> > - .total_running = 0UL,
> > .total_load = 0UL,
> > .total_capacity = 0UL,
> > .busiest_stat = {
> > - .avg_load = 0UL,
>
> There is a sentence in the comment above explaining why avg_load has to
> be cleared. And IMHO local group isn't cleared anymore but set/initialized.

Yes, I have to update it

>
> > - .sum_h_nr_running = 0,
> > - .group_type = group_other,
> > + .idle_cpus = UINT_MAX,
> > + .group_type = group_has_spare,
> > },
> > };
> > }
>
> [...]
>
> > @@ -8042,14 +8104,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > }
> > }
> >
> > - /* Adjust by relative CPU capacity of the group */
> > + /* Check if dst cpu is idle and preferred to this group */
>
> s/preferred to/preferred by ? or the preferred CPU of this group ?

dst_cpu doesn't belong to this group. We compare the asym_prefer_cpu of
this group with dst_cpu, which belongs to another group.
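
For reference, a simplified sketch of that check in update_sg_lb_stats()
(paraphrased; the exact condition in the patch may differ slightly):

	/* Check if dst cpu is idle and preferred to this group */
	if (env->sd->flags & SD_ASYM_PACKING &&
	    env->idle != CPU_NOT_IDLE &&
	    sgs->sum_h_nr_running &&
	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
		sgs->group_asym_packing = 1;
	}

i.e. sched_asym_prefer() compares dst_cpu, which sits in another group,
against the asym_prefer_cpu of the group whose stats are being updated.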

>
> [...]
>
> > @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > */
> > static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> > {
> > - unsigned long max_pull, load_above_capacity = ~0UL;
> > struct sg_lb_stats *local, *busiest;
> >
> > local = &sds->local_stat;
> > busiest = &sds->busiest_stat;
> >
> > - if (busiest->group_asym_packing) {
> > + if (busiest->group_type == group_misfit_task) {
> > + /* Set imbalance to allow misfit task to be balanced. */
> > + env->balance_type = migrate_misfit;
> > + env->imbalance = busiest->group_misfit_task_load;
> > + return;
> > + }
> > +
> > + if (busiest->group_type == group_asym_packing) {
> > + /*
> > + * In case of asym capacity, we will try to migrate all load to
>
> Does asym capacity stands for asym packing or asym cpu capacity?

busiest->group_type == group_asym_packing

will fix it

>
> > + * the preferred CPU.
> > + */
> > + env->balance_type = migrate_load;
> > env->imbalance = busiest->group_load;
> > return;
> > }
> >
> > + if (busiest->group_type == group_imbalanced) {
> > + /*
> > + * In the group_imb case we cannot rely on group-wide averages
> > + * to ensure CPU-load equilibrium, try to move any task to fix
> > + * the imbalance. The next load balance will take care of
> > + * balancing back the system.
>
> balancing back ?

In the group_imbalanced case, we don't try to balance the system but only
try to get rid of the pinned-tasks problem. The system will still be
unbalanced after the migration, and the next load balance will take care
of balancing the system.

>
> > + */
> > + env->balance_type = migrate_task;
> > + env->imbalance = 1;
> > + return;
> > + }
> > +
> > /*
> > - * Avg load of busiest sg can be less and avg load of local sg can
> > - * be greater than avg load across all sgs of sd because avg load
> > - * factors in sg capacity and sgs with smaller group_type are
> > - * skipped when updating the busiest sg:
> > + * Try to use spare capacity of local group without overloading it or
> > + * emptying busiest
> > */
> > - if (busiest->group_type != group_misfit_task &&
> > - (busiest->avg_load <= sds->avg_load ||
> > - local->avg_load >= sds->avg_load)) {
> > - env->imbalance = 0;
> > + if (local->group_type == group_has_spare) {
> > + if (busiest->group_type > group_fully_busy) {
>
> So this could be 'busiest->group_type == group_overloaded' here to match
> the comment below? Since you handle group_misfit_task,
> group_asym_packing, group_imbalanced above and return.

This is just to be more robust in case some new states are added later

>
> > + /*
> > + * If busiest is overloaded, try to fill spare
> > + * capacity. This might end up creating spare capacity
> > + * in busiest or busiest still being overloaded but
> > + * there is no simple way to directly compute the
> > + * amount of load to migrate in order to balance the
> > + * system.
> > + */
> > + env->balance_type = migrate_util;
> > + env->imbalance = max(local->group_capacity, local->group_util) -
> > + local->group_util;
> > + return;
> > + }
> > +
> > + if (busiest->group_weight == 1 || sds->prefer_sibling) {
> > + /*
> > + * When prefer sibling, evenly spread running tasks on
> > + * groups.
> > + */
> > + env->balance_type = migrate_task;
> > + env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> > + return;
> > + }
> > +
> > + /*
> > + * If there is no overload, we just want to even the number of
> > + * idle cpus.
> > + */
> > + env->balance_type = migrate_task;
> > + env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
>
> Why do we need a max_t(long, 0, ...) here and not for the 'if
> (busiest->group_weight == 1 || sds->prefer_sibling)' case?

For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

either we have sds->prefer_sibling && busiest->sum_nr_running >
local->sum_nr_running + 1,

or busiest->group_weight == 1 and env->idle != CPU_NOT_IDLE, so the local
CPU is idle or newly idle.
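
To illustrate with made-up numbers: in the prefer_sibling case we only get
here with e.g. busiest->sum_h_nr_running = 4 and local->sum_h_nr_running = 1,
so (4 - 1) >> 1 = 1 and the subtraction cannot go negative. In the idle-cpus
case there is no such guarantee (e.g. local->idle_cpus = 1 and
busiest->idle_cpus = 3), hence the clamp to 0 with max_t().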

>
> > return;
> > }
> >
> > /*
> > - * If there aren't any idle CPUs, avoid creating some.
> > + * Local is fully busy but have to take more load to relieve the
>
> s/have/has
>
> > + * busiest group
> > */
>
> I thought that 'local->group_type == group_imbalanced' is allowed as
> well? So 'if (local->group_type < group_overloaded)' (further down)
> could include that?

yes.
The imbalanced state is not very useful for the local group; it only
reflects that the group is not overloaded, so it is either fully busy or
has spare capacity.
In this case we assume the worst: fully_busy.

>
> > - if (busiest->group_type == group_overloaded &&
> > - local->group_type == group_overloaded) {
> > - load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> > - if (load_above_capacity > busiest->group_capacity) {
> > - load_above_capacity -= busiest->group_capacity;
> > - load_above_capacity *= scale_load_down(NICE_0_LOAD);
> > - load_above_capacity /= busiest->group_capacity;
> > - } else
> > - load_above_capacity = ~0UL;
> > + if (local->group_type < group_overloaded) {
> > + /*
> > + * Local will become overloaded so the avg_load metrics are
> > + * finally needed.
> > + */
>
> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
> It would be nice to be able to match all allowed fields of dm to code sections.

The decision_matrix describes how we decide between balanced and unbalanced.
In the dm[overload, overload] case, we use the avg_load to decide whether
the groups are balanced or not.
In the dm[fully_busy, overload] case, the groups are unbalanced because
fully_busy < overload and we force the balance. Then calculate_imbalance()
uses the avg_load to decide how much will be moved.

An entry of "force" means that we force the balance and compute the
imbalance later; avg_load may be used to calculate the imbalance.
An entry of "avg_load" means that we compare the avg_load to decide whether
we need to balance load between the groups.
An entry of "nr_idle" means that we compare the number of idle CPUs to
decide whether we need to balance. In fact this is no longer true with
patch 7, because we also take into account h_nr_running when the group
weight is 1.
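
As a rough sketch (simplified, the exact code in find_busiest_group() may
differ), the dm[overload, overload] = avg_load case ends up as something
like:

	if (local->group_type == group_overloaded) {
		/* local is already more loaded than busiest: balanced */
		if (local->avg_load >= busiest->avg_load)
			goto out_balanced;

		/* busiest is not sufficiently above local: balanced */
		if (100 * busiest->avg_load <=
		    env->sd->imbalance_pct * local->avg_load)
			goto out_balanced;
	}

	/* otherwise force the balance and let calculate_imbalance() run */
	goto force_balance;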

>
> [...]
>
> > /******* find_busiest_group() helpers end here *********************/
> >
> > +/*
> > + * Decision matrix according to the local and busiest group state
>
> Minor s/state/type ?

ok

>
> > + *
> > + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
> > + * has_spare nr_idle balanced N/A N/A balanced balanced
> > + * fully_busy nr_idle nr_idle N/A N/A balanced balanced
> > + * misfit_task force N/A N/A N/A force force
> > + * asym_capacity force force N/A N/A force force
>
> s/asym_capacity/asym_packing

yes

>
> > + * imbalanced force force N/A N/A force force
> > + * overloaded force force N/A N/A force avg_load
> > + *
> > + * N/A : Not Applicable because already filtered while updating
> > + * statistics.
> > + * balanced : The system is balanced for these 2 groups.
> > + * force : Calculate the imbalance as load migration is probably needed.
> > + * avg_load : Only if imbalance is significant enough.
> > + * nr_idle : dst_cpu is not busy and the number of idle cpus is quite
> > + * different in groups.
> > + */
> > +
> > /**
> > * find_busiest_group - Returns the busiest group within the sched_domain
> > * if there is an imbalance.
> > @@ -8380,17 +8524,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > local = &sds.local_stat;
> > busiest = &sds.busiest_stat;
> >
> > - /* ASYM feature bypasses nice load balance check */
> > - if (busiest->group_asym_packing)
> > - goto force_balance;
> > -
> > /* There is no busy sibling group to pull tasks from */
> > - if (!sds.busiest || busiest->sum_h_nr_running == 0)
> > + if (!sds.busiest)
> > goto out_balanced;
> >
> > - /* XXX broken for overlapping NUMA groups */
> > - sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
> > - / sds.total_capacity;
> > + /* Misfit tasks should be dealt with regardless of the avg load */
> > + if (busiest->group_type == group_misfit_task)
> > + goto force_balance;
> > +
> > + /* ASYM feature bypasses nice load balance check */
>
> Minor: s/ASYM feature/ASYM_PACKING ... to distinguish clearly from
> ASYM_CPUCAPACITY.

yes

>
> > + if (busiest->group_type == group_asym_packing)
> > + goto force_balance;
>
> [...]
>