Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

From: Pierre Gondois
Date: Wed Mar 13 2024 - 11:16:43 EST


Hello Ze,

I am running stress-ng with the following command:
stress-ng -c 1 -l 10 &
and migrating the process with:
taskset -pc [cpus] [pid]

The thread seems to be migrated via:
sched_setaffinity()
\-__sched_setaffinity()
  \-__set_cpus_allowed_ptr()
    \-__set_cpus_allowed_ptr_locked()
      \- [1]

[1]
/*
 * Picking a ~random cpu helps in cases where we are changing affinity
 * for groups of tasks (ie. cpuset), so that load balancing is not
 * immediately required to distribute the tasks within their new mask.
 */
dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);

So it seems the destination CPU is chosen here, by picking a ~random CPU
from the new affinity mask.
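As far as I can tell, the helper remembers the CPU it returned last and
hands out the next set bit in the AND of the two masks, wrapping around,
so repeated calls spread tasks over the mask. A minimal userspace model
of that behavior (hypothetical names and fixed 32-bit masks for
illustration, not the kernel implementation) could look like:

/*
 * Userspace model (not the kernel code): remember the last pick and
 * return the next set bit in (valid & new_mask), wrapping around.
 */
#include <stdio.h>
#include <stdint.h>

#define NR_CPUS 32

static int prev_pick = -1;      /* a per-CPU variable in the kernel */

static int any_and_distribute(uint32_t valid, uint32_t new_mask)
{
        uint32_t and_mask = valid & new_mask;

        if (!and_mask)
                return NR_CPUS; /* no valid CPU, like nr_cpu_ids */

        /* Scan starting just after the previous pick, wrapping around. */
        for (int i = 1; i <= NR_CPUS; i++) {
                int cpu = (prev_pick + i) % NR_CPUS;

                if (and_mask & (1u << cpu)) {
                        prev_pick = cpu;
                        return cpu;
                }
        }
        return NR_CPUS;
}

int main(void)
{
        uint32_t valid = 0xffffffffu;                   /* all CPUs valid */
        uint32_t new_mask = (1u << 23) | (1u << 24);    /* taskset -pc 23,24 */

        /* Successive calls alternate between CPU 23 and CPU 24. */
        for (int i = 0; i < 4; i++)
                printf("dest_cpu = %d\n", any_and_distribute(valid, new_mask));
        return 0;
}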

Checking cpus_ptr in select_idle_sibling() might be useful in other cases,
but I don't think this experiment demonstrates that. Maybe another small
tweak could be made at [1] instead?
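Something like the fragment below, for instance (an untested, purely
hypothetical sketch: it would prefer an idle CPU from the new mask and
only fall back to the distributed pick otherwise; I have not checked
whether available_idle_cpu()/sched_idle_cpu() are safe to call in this
context):

        int cpu;

        /*
         * Hypothetical tweak at [1]: pick an idle CPU from the new mask
         * first, and only distribute ~randomly when none is idle.
         */
        dest_cpu = nr_cpu_ids;
        for_each_cpu_and(cpu, cpu_valid_mask, ctx->new_mask) {
                if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
                        dest_cpu = cpu;
                        break;
                }
        }
        if (dest_cpu >= nr_cpu_ids)
                dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);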

Regards,
Pierre

On 3/13/24 09:58, Ze Gao wrote:
We observed that select_idle_sibling() is likely to return the *target* cpu
early, which is often the previous cpu the task was running on, even when
that cpu is not within the newly set affinity mask. After that, we can only
rely on select_fallback_rq() to choose a cpu for us at its will (mostly the
first valid one, for now).

However, the cpu chosen by select_fallback_rq() is quite likely not a good
candidate; sometimes the task has to wait for the load balancer to kick in
and move it to a better cpu, which no doubt adds one or more unnecessary
migrations. For example, this is what I get when I move task 3964 to cpus
23-24, where cpu 23 already has a cpu-bound task pinned:

swapper 0 [013] 959.791829: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=13 dest_cpu=23
kworker/24:2-mm 1014 [024] 959.806148: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=23 dest_cpu=24

The thing is, we can actually do better if we perform the checks early and
take more advantage of *target* in select_idle_sibling(). That is, we
continue the idle cpu selection if *target* fails the
cpumask_test_cpu(target, p->cpus_ptr) check. By doing so, we are likely to
pick a good candidate, especially when the newly allowed cpu set shares
some cpu resources with *target*.

And with this change, we clearly see the improvement when I move task 3964
to cpus 25-26, where cpu 25 already has a cpu-bound task pinned:

swapper 0 [027] 4249.204658: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=27 dest_cpu=26

Note that we do the same check for *prev* in select_idle_sibling() as well.

Signed-off-by: Ze Gao <zegao@xxxxxxxxxxx>
---
kernel/sched/fair.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..9ef6e74c6b2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7511,16 +7511,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
         */
        lockdep_assert_irqs_disabled();

-       if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
-           asym_fits_cpu(task_util, util_min, util_max, target))
+       if (cpumask_test_cpu(target, p->cpus_ptr) &&
+           (available_idle_cpu(target) || sched_idle_cpu(target)) &&
+           asym_fits_cpu(task_util, util_min, util_max, target))
                return target;

        /*
         * If the previous CPU is cache affine and idle, don't be stupid:
         */
-       if (prev != target && cpus_share_cache(prev, target) &&
-           (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
-           asym_fits_cpu(task_util, util_min, util_max, prev)) {
+       if (cpumask_test_cpu(prev, p->cpus_ptr) &&
+           prev != target &&
+           cpus_share_cache(prev, target) &&
+           (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
+           asym_fits_cpu(task_util, util_min, util_max, prev)) {
                if (!static_branch_unlikely(&sched_cluster_active) ||
                    cpus_share_resources(prev, target))