Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

From: Pierre Gondois
Date: Thu Mar 14 2024 - 06:32:54 EST




On 3/14/24 04:49, Ze Gao wrote:
On Thu, Mar 14, 2024 at 10:59 AM Ze Gao <zegao2021@xxxxxxxxx> wrote:

On Wed, Mar 13, 2024 at 11:16 PM Pierre Gondois <pierre.gondois@xxxxxxx> wrote:

Hello Ze,

I am running stress-ng with the following command:
stress-ng -c 1 -l 10 &
and migrating the process with:
taskset -pc [cpus] [pid]
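For reference, the same affinity change can also be triggered programmatically;
a minimal userspace sketch (the cpu numbers are only an example) which goes
through the same sched_setaffinity() path:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>

  int main(int argc, char **argv)
  {
          pid_t pid = atoi(argv[1]);  /* the stress-ng worker, for instance */
          cpu_set_t set;

          CPU_ZERO(&set);
          CPU_SET(23, &set);          /* new affinity: cpus 23-24 */
          CPU_SET(24, &set);

          /* the same syscall taskset -pc 23,24 <pid> ends up issuing */
          if (sched_setaffinity(pid, sizeof(set), &set))
                  perror("sched_setaffinity");
          return 0;
  }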

The thread seems to be migrated via:
sched_setaffinity()
 \-__sched_setaffinity()
   \-__set_cpus_allowed_ptr()
     \-__set_cpus_allowed_ptr_locked()
       \- [1]



[1]
/*
 * Picking a ~random cpu helps in cases where we are changing affinity
 * for groups of tasks (ie. cpuset), so that load balancing is not
 * immediately required to distribute the tasks within their new mask.
 */
dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);

So it seems the destination CPU is chosen here, by picking a ~random CPU
among the new affinity mask.
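Just to illustrate the distribution part, here is a toy userspace analogue
of that behaviour (not the kernel implementation): remember the last pick
and start scanning just after it, so consecutive calls spread the picked
CPUs across the allowed mask instead of always returning its first set bit.

  #include <stdio.h>

  static int pick_distributed(unsigned long valid, unsigned long allowed)
  {
          static int prev = -1;
          unsigned long mask = valid & allowed;
          int nbits = 8 * sizeof(mask);

          for (int i = 1; i <= nbits; i++) {
                  int cpu = (prev + i) % nbits;

                  if (mask & (1UL << cpu))
                          return prev = cpu;
          }
          return -1;
  }

  int main(void)
  {
          /* allowed = cpus 23-24: successive picks alternate 23, 24, 23, ... */
          for (int i = 0; i < 4; i++)
                  printf("%d\n", pick_distributed(~0UL, 3UL << 23));
          return 0;
  }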

IIUC, this is for running/queued/waking tasks instead of blocked tasks.

Am I missing something obvious here?

Not at all, I just didn't test the patch on a blocked task, as you said.


Checking the cpus_ptr in select_idle_sibling() might be useful in other cases,
but I think the experiment doesn't show that. Maybe another small tweak could

The experiment is meant to illustrate that the status quo does not do well:
it has to rely on select_fallback_rq() to choose a cpu for the woken task,
which turns out to be a bad choice since that cpu is already monopolized by
a cpu-bound task. That is why a second migration happens with the help of
the load balancer.

Btw, perf alone does not show obvious results here; you need some other
observability tools to make sure the migration is not initiated by
__set_cpus_allowed_ptr_locked() (i.e., for running tasks). I achieved this
by directly adding some tracepoints to both select_fallback_rq() and
select_idle_sibling().
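For instance (illustrative only, not part of the patch), an ad-hoc
trace_printk() near the top of select_fallback_rq() is enough to attribute
a migration to the fallback path, with the output showing up in
/sys/kernel/tracing/trace:

	/* in select_fallback_rq(int cpu, struct task_struct *p) */
	trace_printk("fallback: comm=%s pid=%d prev_cpu=%d\n",
		     p->comm, p->pid, cpu);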

FWIW, I could effectively reproduce the issue described in the commit message, so:
Tested-by: Pierre Gondois <pierre.gondois@xxxxxxx>

Regards,
Pierre


Regards,
-- Ze

Actually, we can reuse the same reasons for doing so as in

commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")

be done at [1] instead?

As for blocked tasks, check out what is commented on set_task_cpu() and
select_task_rq(): since we never call set_task_cpu() on blocked tasks, we
have no way to change p->wake_cpu to the dest_cpu randomly chosen here, so
when the task is woken up, it still has to go through the select_task_rq()
process using the outdated p->wake_cpu.
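(For reference, and modulo kernel version details, the wakeup path in
try_to_wake_up() does roughly

	cpu = select_task_rq(p, p->wake_cpu, ...);

so a stale p->wake_cpu is exactly what the woken task starts from.)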


Thanks,
-- Ze

Regards,
Pierre

On 3/13/24 09:58, Ze Gao wrote:
We observed that select_idle_sibling() is likely to return the *target* cpu
early, which is likely to be the previous cpu the task was running on, even
when that cpu is actually not within the newly set affinity list. From there
we can only rely on select_fallback_rq() to choose a cpu for us at its will
(mostly the first valid one for now).

However, the cpu chosen by select_fallback_rq() is highly likely not a good
enough candidate; sometimes the task has to rely on the load balancer to
kick in and place it on a better cpu, which no doubt adds one or more
unnecessary migrations. For example, this is what I get when I move task
3964 to cpu 23-24, where cpu 23 already has a cpu-bound task pinned:

swapper 0 [013] 959.791829: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=13 dest_cpu=23
kworker/24:2-mm 1014 [024] 959.806148: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=23 dest_cpu=24

The thing is we can actually do better if we do the checks early and take
more advantage of the *target* in select_idle_sibling(). That is, we
continue the idle cpu selection if *target* fails the test of
cpumask_test_cpu(*target*, p->cpus_ptr). By doing so, we are likely to pick
a good candidate, especially when the newly allowed cpu set shares some cpu
resources with *target*.

And with this change, we clearly see the improvement when I move task 3964
to cpu 25-26, where cpu 25 already has a cpu-bound task pinned:

swapper 0 [027] 4249.204658: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=27 dest_cpu=26

Note we do the same check for *prev* in select_idle_sibling() as well.

Signed-off-by: Ze Gao <zegao@xxxxxxxxxxx>
---
kernel/sched/fair.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..9ef6e74c6b2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7511,16 +7511,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 	lockdep_assert_irqs_disabled();
 
-	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
-	    asym_fits_cpu(task_util, util_min, util_max, target))
+	if (cpumask_test_cpu(target, p->cpus_ptr) &&
+	    (available_idle_cpu(target) || sched_idle_cpu(target)) &&
+	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_cache(prev, target) &&
-	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
-	    asym_fits_cpu(task_util, util_min, util_max, prev)) {
+	if (cpumask_test_cpu(prev, p->cpus_ptr) &&
+	    prev != target &&
+	    cpus_share_cache(prev, target) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
+	    asym_fits_cpu(task_util, util_min, util_max, prev)) {
 
 		if (!static_branch_unlikely(&sched_cluster_active) ||
 		    cpus_share_resources(prev, target))