Re: [PATCH v3 04/10] sched/fair: rework load_balance

From: Vincent Guittot
Date: Wed Oct 02 2019 - 04:23:32 EST


On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>
> On 01/10/2019 10:14, Vincent Guittot wrote:
> > On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> >>
> >> Hi Vincent,
> >>
> >> On 19/09/2019 09:33, Vincent Guittot wrote:
>
[...]

>
> >>> +	if (busiest->group_weight == 1 || sds->prefer_sibling) {
> >>> +		/*
> >>> +		 * When prefer sibling, evenly spread running tasks on
> >>> +		 * groups.
> >>> +		 */
> >>> +		env->balance_type = migrate_task;
> >>> +		env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> >>> +		return;
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * If there is no overload, we just want to even the number of
> >>> +	 * idle cpus.
> >>> +	 */
> >>> +	env->balance_type = migrate_task;
> >>> +	env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
> >>
> >> Why do we need a max_t(long, 0, ...) here and not for the 'if
> >> (busiest->group_weight == 1 || sds->prefer_sibling)' case?
> >
> > For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> >
> > either we have sds->prefer_sibling && busiest->sum_nr_running >
> > local->sum_nr_running + 1
>
> I see, this corresponds to
>
> 	/* Try to move all excess tasks to child's sibling domain */
> 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> 	    busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
> 		goto force_balance;
>
> in find_busiest_group, I assume.

Yes. But it seems that I missed a case:

prefer_sibling is set
busiest->sum_h_nr_running <= local->sum_h_nr_running + 1, so we skip
the goto force_balance above.
But env->idle != CPU_NOT_IDLE and local->idle_cpus >
(busiest->idle_cpus + 1), so we also skip the goto out_balanced and
finally call calculate_imbalance()
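
To make that path easier to follow, here is a rough standalone sketch of the two checks. It is not the kernel code: struct group_stats, the overloaded flag, enum fbg_path and classify_path() are made up for illustration, and local is assumed to be group_has_spare.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-group statistics. */
struct group_stats {
	unsigned int sum_h_nr_running;
	unsigned int idle_cpus;
	bool overloaded;	/* stands in for group_type == group_overloaded */
};

/* Hypothetical labels for the three outcomes discussed in this thread. */
enum fbg_path {
	PATH_FORCE_BALANCE,
	PATH_OUT_BALANCED,
	PATH_CALC_IMBALANCE,
};

/*
 * Models only the two checks quoted in this thread, in the same order.
 * Assumption: local->group_type == group_has_spare.
 */
static enum fbg_path classify_path(const struct group_stats *busiest,
				   const struct group_stats *local,
				   bool prefer_sibling, bool cpu_not_idle)
{
	/* prefer_sibling with more than one excess task: force the balance */
	if (prefer_sibling &&
	    busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
		return PATH_FORCE_BALANCE;

	/* Not overloaded and idle CPUs already even enough: balanced */
	if (!busiest->overloaded &&
	    (cpu_not_idle || local->idle_cpus <= busiest->idle_cpus + 1))
		return PATH_OUT_BALANCED;

	/* The missed case lands here with prefer_sibling still set */
	return PATH_CALC_IMBALANCE;
}
```

With busiest running 2 tasks with 0 idle CPUs and local running 3 tasks with 2 idle CPUs, both checks are skipped and we reach the imbalance calculation even though busiest has fewer running tasks than local.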

In calculate_imbalance() with prefer_sibling set, imbalance =
(busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
and in the case above the difference is at most 1 and can be negative.

So we probably want something similar to max_t(long, 0,
(busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1)
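
Something along these lines, as an untested standalone sketch. sibling_imbalance() is a hypothetical helper, not a proposal for the final code. It assumes the counters are unsigned, so it widens them to signed long before subtracting; otherwise the subtraction would wrap to a large positive value instead of going negative.

```c
/*
 * Hypothetical helper: half of the running-task gap between busiest and
 * local, clamped so that it never goes below zero.
 */
static long sibling_imbalance(unsigned int busiest_nr, unsigned int local_nr)
{
	/* Widen to signed first so that busiest_nr < local_nr goes negative */
	long diff = (long)busiest_nr - (long)local_nr;

	if (diff <= 0)
		return 0;

	return diff >> 1;
}
```

With 5 tasks on busiest and 3 on local this asks for 1 task to be moved; with 2 on busiest and 3 on local it asks for none.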

>
> Haven't been able to recreate this yet on my arm64 platform since there
> is no prefer_sibling and in case local and busiest have
> group_type=group_has_spare they bail out in
>
> 	if (busiest->group_type != group_overloaded &&
> 	    (env->idle == CPU_NOT_IDLE ||
> 	     local->idle_cpus <= (busiest->idle_cpus + 1)))
> 		goto out_balanced;
>
>
> [...]
>
> >>> -	if (busiest->group_type == group_overloaded &&
> >>> -	    local->group_type == group_overloaded) {
> >>> -		load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> >>> -		if (load_above_capacity > busiest->group_capacity) {
> >>> -			load_above_capacity -= busiest->group_capacity;
> >>> -			load_above_capacity *= scale_load_down(NICE_0_LOAD);
> >>> -			load_above_capacity /= busiest->group_capacity;
> >>> -		} else
> >>> -			load_above_capacity = ~0UL;
> >>> +	if (local->group_type < group_overloaded) {
> >>> +		/*
> >>> +		 * Local will become overloaded so the avg_load metrics are
> >>> +		 * finally needed.
> >>> +		 */
> >>
> >> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
> >> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
> >> It would be nice to be able to match all allowed fields of dm to code sections.
> >
> > decision_matrix describes how it decides between balanced or unbalanced.
> > In case of dm[overload, overload], we use the avg_load to decide if it
> > is balanced or not
>
> OK, that's why you calculate sgs->avg_load in update_sg_lb_stats() only
> for 'sgs->group_type == group_overloaded'.
>
> > In case of dm[fully_busy, overload], the groups are unbalanced because
> > fully_busy < overload and we force the balance. Then
> > calculate_imbalance() uses the avg_load to decide how much will be
> > moved
>
> And in this case 'local->group_type < group_overloaded' in
> calculate_imbalance(), 'local->avg_load' and 'sds->avg_load' have to be
> calculated before using them in env->imbalance = min(...).
>
> OK, got it now.
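
For reference, a rough standalone sketch of how the avg_load metrics could feed env->imbalance = min(...). The thread does not spell out the terms of the min(), so the formula below is an assumption, as are struct group_load and both helpers. The domain average is also taken over just the two groups here, whereas sds->avg_load covers the whole domain.

```c
#define SCHED_CAPACITY_SCALE 1024UL

/* Hypothetical, simplified view of one group's load and capacity. */
struct group_load {
	unsigned long group_load;
	unsigned long group_capacity;
};

/* Load normalised by capacity, scaled by SCHED_CAPACITY_SCALE. */
static unsigned long group_avg_load(const struct group_load *g)
{
	return g->group_load * SCHED_CAPACITY_SCALE / g->group_capacity;
}

/*
 * Assumed shape of the calculation: move the smaller of what busiest has
 * above the average and what local can take before reaching the average.
 */
static unsigned long load_imbalance(const struct group_load *busiest,
				    const struct group_load *local)
{
	unsigned long busiest_avg = group_avg_load(busiest);
	unsigned long local_avg = group_avg_load(local);
	unsigned long domain_avg =
		(busiest->group_load + local->group_load) * SCHED_CAPACITY_SCALE /
		(busiest->group_capacity + local->group_capacity);
	unsigned long pull, push;

	/* Nothing to move if busiest is not above, or local not below, average */
	if (busiest_avg <= domain_avg || local_avg >= domain_avg)
		return 0;

	pull = (busiest_avg - domain_avg) * busiest->group_capacity;
	push = (domain_avg - local_avg) * local->group_capacity;

	return (pull < push ? pull : push) / SCHED_CAPACITY_SCALE;
}
```

The point of the sketch is only that both local's avg_load and the domain-wide avg_load have to exist before the min() can be evaluated, which is why they are computed late for a local group that is not yet overloaded.
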
>
> > dm[overload, overload]=force means that we force the balance and we
> > will compute the imbalance later. avg_load may be used to calculate
> > the imbalance
> > dm[overload, overload]=avg_load means that we compare the avg_load to
> > decide whether we need to balance load between groups
> > dm[overload, overload]=nr_idle means that we compare the number of
> > idle cpus to decide whether we need to balance. In fact this is no
> > longer true with patch 7 because we also take into account the number of
> > nr_h_running when weight = 1
>
> This becomes clearer now ... slowly.
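
If it helps, the three kinds of entries can be pictured with the toy sketch below. It is illustrative only: enum dm_decision, struct dm_input and needs_balance() are hypothetical names, the avg_load comparison is reduced to a plain greater-than, and the nr_idle test simply mirrors the out_balanced condition quoted earlier in this thread.

```c
#include <stdbool.h>

/* Hypothetical names for the three kinds of decision matrix entries. */
enum dm_decision {
	DM_FORCE,	/* always balance, compute the imbalance afterwards */
	DM_AVG_LOAD,	/* compare avg_load between the groups */
	DM_NR_IDLE,	/* compare the number of idle cpus */
};

/* Hypothetical inputs needed to evaluate an entry. */
struct dm_input {
	unsigned long busiest_avg_load;
	unsigned long local_avg_load;
	unsigned int busiest_idle_cpus;
	unsigned int local_idle_cpus;
};

/* Returns true when the entry says the groups are unbalanced. */
static bool needs_balance(enum dm_decision d, const struct dm_input *in)
{
	switch (d) {
	case DM_FORCE:
		return true;
	case DM_AVG_LOAD:
		/* Simplification: the real check may apply a threshold */
		return in->busiest_avg_load > in->local_avg_load;
	case DM_NR_IDLE:
		/* Inverse of the out_balanced condition on idle cpus */
		return in->local_idle_cpus > in->busiest_idle_cpus + 1;
	}
	return false;
}
```

So the matrix only answers "balanced or not"; how much to move is a separate step in calculate_imbalance().
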
>
> [...]