Re: [PATCH] sched/fair: Do not scan non-movable tasks several times

From: Konstantin Khorenko
Date: Mon Dec 25 2023 - 10:18:05 EST


On 18.12.2023 17:48, Vincent Guittot wrote:
On Thu, 14 Dec 2023 at 15:42, Konstantin Khorenko
<khorenko@xxxxxxxxxxxxx> wrote:

If busiest rq is small, nr_running < SCHED_NR_MIGRATE_BREAK and all
tasks are not movable, detach_tasks() should not iterate more than tasks
available in the busiest rq.

Previously the (env->loop > env->loop_max) condition prevented us from

It's usually better to give the commit directly when we know it :
Before commit : b0defa7ae03e ("sched/fair: Make sure to try to detach
at least one movable task"),
the (env->loop > env->loop_max) condition prevented us from ...

You are definitely right, added.

scanning non-movable tasks more than rq size times, but after we start
checking the LBF_ALL_PINNED flag, the "all tasks are not movable" case
is under threat.


Fixes: b0defa7ae03e ("sched/fair: Make sure to try to detach at least
one movable task")

Added this too, thank you.

Signed-off-by: Konstantin Khorenko <khorenko@xxxxxxxxxxxxx>
---
kernel/sched/fair.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..faa2a765e899 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11219,7 +11219,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.dst_rq = this_rq,
.dst_grpmask = group_balance_mask(sd->groups),
.idle = idle,
- .loop_break = SCHED_NR_MIGRATE_BREAK,
.cpus = cpus,
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
@@ -11265,6 +11264,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
* correctly treated as an imbalance.
*/
env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
+ /*
+ * If busiest rq is small, nr_running < SCHED_NR_MIGRATE_BREAK
+ * and all tasks are not movable, detach_tasks() should not
+ * iterate more than tasks available in rq.
+ */
+ env.loop_break = min(SCHED_NR_MIGRATE_BREAK, busiest->nr_running);

Should it be after more_balance: ?
In case we do "more_balance:" on a new_dst_cpu and it ends up that
finally there is no more movable task as we released the lock of
busiest rq in the meantime ?

Well, both yes and no here.
If we simply move the env.loop_break assignment after "more_balance" label,
we will break the handling LBF_NEED_BREAK case.

But you are right, the new_dst_cpu case should also not just reset loop_break to SCHED_NR_MIGRATE_BREAK, but do the same min() calculation.

So i've added one more label before the env.loop_break = min(...) and used it in
the new_dst_cpu case.

Also you can remove one more superfluous init of loop_break:

Yep, dropped it, thank you.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11361,7 +11361,6 @@ static int load_balance(int this_cpu, struct
rq *this_rq,
*/
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
- env.loop_break = SCHED_NR_MIGRATE_BREAK;
goto redo;
}
goto out_all_pinned;


more_balance:
rq_lock_irqsave(busiest, &rf);
--
2.39.3