Re: [PATCH] sched/fair: Enable group_asym_packing in find_idlest_group

From: Tim Chen
Date: Tue Jan 09 2024 - 19:58:49 EST


On Thu, 2024-01-04 at 21:20 +0530, Shrikanth Hegde wrote:
> On 10/18/23 9:20 PM, Srikar Dronamraju wrote:
>
> Hi Srikar,
>
> > Current scheduler code doesn't handle SD_ASYM_PACKING in the
> > find_idlest_cpu path. On a few architectures, such as PowerPC, the
> > cache is at the core level, so moving threads across cores may result
> > in cache misses.
> >
> > While asym_packing can be enabled above the SMT level, enabling asym
> > packing across cores could result in poorer performance due to cache
> > misses. However, if the initial task placement via find_idlest_cpu
> > takes asym_packing into consideration, the scheduler can avoid
> > asym_packing migrations. This results in fewer migrations, better
> > packing and better overall performance.
> >
>
> This would handle the asym packing case when finding the idlest CPU for a newly
> woken-up task, thereby reducing the number of migrations if the task is placed
> correctly in the first place. I think that's helpful.
>
> Currently, Intel clusters and PowerVM shared LPARs are the two topologies where
> ASYM_PACKING is enabled at a domain higher than SMT. Is that correct, or is there
> any other topology?
>
> +tim
>
> > Signed-off-by: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
> > ---
> > kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++++---
> > 1 file changed, 30 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index cb225921bbca..7164f79a3d13 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9931,11 +9931,13 @@ static int idle_cpu_without(int cpu, struct task_struct *p)
> > * @group: sched_group whose statistics are to be updated.
> > * @sgs: variable to hold the statistics for this group.
> > * @p: The task for which we look for the idlest group/CPU.
> > + * @this_cpu: current cpu
> > */
> > static inline void update_sg_wakeup_stats(struct sched_domain *sd,
> > struct sched_group *group,
> > struct sg_lb_stats *sgs,
> > - struct task_struct *p)
> > + struct task_struct *p,
> > + int this_cpu)
> > {
> > int i, nr_running;
> >
> > @@ -9972,6 +9974,11 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
> >
> > }
> >
> > + if (sd->flags & SD_ASYM_PACKING && sgs->sum_h_nr_running &&
> > + sched_asym_prefer(group->asym_prefer_cpu, this_cpu)) {
> > + sgs->group_asym_packing = 1;

I disagree with the above criteria for doing asym_packing.

I think asym packing only makes sense if you have an idle CPU available
in the group that is preferred over this_cpu, and the group has fewer
tasks than CPUs. Using group->asym_prefer_cpu is inappropriate, as that
most preferred CPU may be busy. You should be migrating the task from
this_cpu to that highest priority idle CPU identified.

If the group is fully busy or overloaded, we should stick with the original
logic of picking the most lightly loaded group and not use asym_packing. 

You may want to note down the highest priority idle CPU in the group
(the most preferred one, if there is more than one idle CPU in the group),
so that you can compare between two groups that both have idle CPUs.
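
Something along these lines, perhaps (an untested sketch; asym_idle_cpu is
a made-up field on sg_lb_stats initialized to -1, not existing code):

	/* Track the most preferred idle CPU seen in this group. */
	for_each_cpu(i, sched_group_span(group)) {
		if (!idle_cpu_without(i, p))
			continue;
		if (sgs->asym_idle_cpu < 0 ||
		    sched_asym_prefer(i, sgs->asym_idle_cpu))
			sgs->asym_idle_cpu = i;
	}

	/*
	 * Only flag asym_packing when that idle CPU is preferred over
	 * this_cpu and the group still has fewer tasks than CPUs.
	 */
	if (sd->flags & SD_ASYM_PACKING && sgs->asym_idle_cpu >= 0 &&
	    sgs->sum_h_nr_running < sgs->group_weight &&
	    sched_asym_prefer(sgs->asym_idle_cpu, this_cpu))
		sgs->group_asym_packing = 1;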

Tim

> > + }
> > +
>
>
> I think there is a corner case here that could be taken care of. Please correct
> me if I am wrong.
>
> Assume there are four sched groups, sg1, sg2, sg3 and sg4, and asym packing is
> enabled at sd. sg1 and sg3 have one task each, and a new task is being created,
> so find_idlest_cpu is called for this new task.
>
> Because of the sgs->sum_h_nr_running check, sg1 and sg3 will have group_asym_packing,
> while sg2 and sg4 will have group_has_spare. update_pick_idlest() will choose the
> lowest group type, i.e. group_has_spare, so the tie would be between sg2 and sg4.
> Because of asym packing (at least true for the powerpc shared LPAR case) sg4 will
> have lower utilization compared to sg2, and hence sg4 will be returned as the
> idlest group. On the next load balance, sg2 will pull the task from sg4 due to
> asym packing.
>
> Couldn't this additional migration be avoided if we omit the sum_h_nr_running check?
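>
> To be concrete, omitting the check would leave something like (illustrative):
>
> 	if (sd->flags & SD_ASYM_PACKING &&
> 	    sched_asym_prefer(group->asym_prefer_cpu, this_cpu))
> 		sgs->group_asym_packing = 1;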
>
>
> > sgs->group_capacity = group->sgc->capacity;
> >
> > sgs->group_weight = group->group_weight;
> > @@ -10012,8 +10019,17 @@ static bool update_pick_idlest(struct sched_group *idlest,
> > return false;
> > break;
> >
> > - case group_imbalanced:
> > case group_asym_packing:
> > + if (sched_asym_prefer(group->asym_prefer_cpu, idlest->asym_prefer_cpu)) {
> > + int busy_cpus = idlest_sgs->group_weight - idlest_sgs->idle_cpus;
> > +
> > + busy_cpus -= (sgs->group_weight - sgs->idle_cpus);
> > + if (busy_cpus >= 0)
> > + return true;
>
>
> Wouldn't using idle_cpus be simpler? Something like:
>
> if (sgs->idle_cpus - idlest_sgs->idle_cpus > 0)
> 	return true;
>
> > + }
> > + return false;
> > +
> > + case group_imbalanced:
> > case group_smt_balance:
> > /* Those types are not used in the slow wakeup path */
> > return false;
> > @@ -10080,7 +10096,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> > sgs = &tmp_sgs;
> > }
> >
> > - update_sg_wakeup_stats(sd, group, sgs, p);
> > + update_sg_wakeup_stats(sd, group, sgs, p, this_cpu);
> >
> > if (!local_group && update_pick_idlest(idlest, &idlest_sgs, group, sgs)) {
> > idlest = group;
> > @@ -10112,6 +10128,17 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> > if (local_sgs.group_type > idlest_sgs.group_type)
> > return idlest;
> >
> > + if (idlest_sgs.group_type == group_asym_packing) {
> > + if (sched_asym_prefer(idlest->asym_prefer_cpu, local->asym_prefer_cpu)) {
> > + int busy_cpus = local_sgs.group_weight - local_sgs.idle_cpus;
> > +
> > + busy_cpus -= (idlest_sgs.group_weight - idlest_sgs.idle_cpus);
> > + if (busy_cpus >= 0)
> > + return idlest;
> > + }
> > + return NULL;
> > + }
>
> Same comment about using idle_cpus.
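>
> That is, something like (illustrative):
>
> 	if (idlest_sgs.idle_cpus - local_sgs.idle_cpus > 0)
> 		return idlest;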
>
> > +
> > switch (local_sgs.group_type) {
> > case group_overloaded:
> > case group_fully_busy:
>