Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs

From: Gautham R. Shenoy
Date: Fri Dec 17 2021 - 14:55:48 EST


On Fri, Dec 10, 2021 at 09:33:07AM +0000, Mel Gorman wrote:
[..snip..]

> @@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> return idlest;
> #endif
> /*
> - * Otherwise, keep the task on this node to stay close
> - * its wakeup source and improve locality. If there is
> - * a real need of migration, periodic load balance will
> - * take care of it.
> + * Otherwise, keep the task on this node to stay local
> + * to its wakeup source if the number of running tasks
> + * are below the allowed imbalance. If there is a real
> + * need of migration, periodic load balance will take
> + * care of it.
> */
> - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> + if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
> return NULL;

Since we want to check whether the imb_numa_nr threshold would be
crossed if we let the new task stay local, this should be

if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr)
return NULL;
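
To see the off-by-one concretely (using sd->imb_numa_nr = 2 from the
NPS=4 case below, purely as an illustration): with 2 tasks already
running in the local group, the current check

	if (local_sgs.sum_nr_running <= sd->imb_numa_nr)	/* 2 <= 2 : true */
		return NULL;

keeps a third task local, so the group ends up with 3 running tasks
and exceeds the allowed imbalance of 2. Accounting for the incoming
task,

	if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr)	/* 2 + 1 <= 2 : false */
		return NULL;

lets find_idlest_group() fall through and pick a remote group instead.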



Without this change, on a Zen3 configured with Nodes Per Socket
(NPS)=4, the lower NUMA domain with sd->imb_numa_nr = 2 has 4 groups
(each group corresponds to a NODE sched-domain). When we run stream
with 8 threads, we see 3 of them initially placed in the local group
and the remaining 5 distributed across the other 3 groups. None of
these 3 tasks ever gets migrated to any of the other 3 groups, because
each of those groups already has at least one task.

Eg:

PID 157811 : timestamp 108921.267293 : first placed in NODE 1
PID 157812 : timestamp 108921.269877 : first placed in NODE 1
PID 157813 : timestamp 108921.269921 : first placed in NODE 1
PID 157814 : timestamp 108921.270007 : first placed in NODE 2
PID 157815 : timestamp 108921.270065 : first placed in NODE 3
PID 157816 : timestamp 108921.270118 : first placed in NODE 0
PID 157817 : timestamp 108921.270168 : first placed in NODE 2
PID 157818 : timestamp 108921.270216 : first placed in NODE 3

With the fix mentioned above, we see the 8 threads uniformly
distributed across the 4 groups.

PID 7500 : timestamp 436.156429 : first placed in NODE 1
PID 7501 : timestamp 436.159058 : first placed in NODE 1
PID 7502 : timestamp 436.159106 : first placed in NODE 2
PID 7503 : timestamp 436.159173 : first placed in NODE 3
PID 7504 : timestamp 436.159219 : first placed in NODE 0
PID 7505 : timestamp 436.159263 : first placed in NODE 2
PID 7506 : timestamp 436.159305 : first placed in NODE 3
PID 7507 : timestamp 436.159348 : first placed in NODE 0
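
This matches the expectation at this domain: 8 threads spread across
4 groups works out to 8/4 = 2 tasks per group, which is exactly the
sd->imb_numa_nr = 2 allowed here.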


--
Thanks and Regards
gautham.