Re: [PATCH] sched/rt: Use cpu_active_mask to prevent rto_push_irq_work's dead loop

From: Qais Yousef
Date: Sun Nov 20 2022 - 18:35:57 EST


On 11/17/22 17:00, Steven Rostedt wrote:
> On Mon, 14 Nov 2022 20:04:53 +0800
> Xuewen Yan <xuewen.yan@xxxxxxxxxx> wrote:
>
> > +++ b/kernel/sched/rt.c
> > @@ -2219,6 +2219,7 @@ static int rto_next_cpu(struct root_domain *rd)
> >  {
> >  	int next;
> >  	int cpu;
> > +	struct cpumask tmp_cpumask;
>
> If you have a machine with thousands of CPUs, this will likely kill the
> stack.
>
> >
> >  	/*
> >  	 * When starting the IPI RT pushing, the rto_cpu is set to -1,
> > @@ -2238,6 +2239,11 @@ static int rto_next_cpu(struct root_domain *rd)
> >  		/* When rto_cpu is -1 this acts like cpumask_first() */
> >  		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
> >
> > +		cpumask_and(&tmp_cpumask, rd->rto_mask, cpu_active_mask);
> > +		if (rd->rto_cpu == -1 && cpumask_weight(&tmp_cpumask) == 1 &&
> > +		    cpumask_test_cpu(smp_processor_id(), &tmp_cpumask))
> > +			break;
> > +
>
> Kill the above.
>
> >  		rd->rto_cpu = cpu;
> >
> >  		if (cpu < nr_cpu_ids) {
>
> Why not just add here:
>
> 			if (!cpumask_test_cpu(cpu, cpu_active_mask))
> 				continue;
> 			return cpu;
> 		}
>
> ?

Or this?


diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ed2a47e4ddae..5073e06e34c3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2236,7 +2236,7 @@ static int rto_next_cpu(struct root_domain *rd)
 	for (;;) {
 
 		/* When rto_cpu is -1 this acts like cpumask_first() */
-		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
+		cpu = cpumask_next_and(rd->rto_cpu, rd->rto_mask, cpu_active_mask);
 
 		rd->rto_cpu = cpu;

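(That should be equivalent to Steve's test-and-continue suggestion above:
cpumask_next_and() walks rd->rto_mask but only returns CPUs that are also set
in cpu_active_mask. A silly user-space toy just to illustrate the semantics --
toy_next_and() and the hardcoded masks are made up for the example:)

#include <stdio.h>

#define NR_CPUS	8

/* Walk the bits of mask_a after 'prev', skipping anything not in mask_b. */
static int toy_next_and(int prev, unsigned int mask_a, unsigned int mask_b)
{
	for (int cpu = prev + 1; cpu < NR_CPUS; cpu++) {
		if (mask_a & mask_b & (1u << cpu))
			return cpu;
	}

	return NR_CPUS;	/* like returning nr_cpu_ids when nothing is left */
}

int main(void)
{
	unsigned int rto_mask    = 0x3;	/* cpu0 and cpu1 marked rto */
	unsigned int active_mask = 0x1;	/* cpu1 went away */
	int cpu;

	/* Starting from -1 acts like "first", as in rto_next_cpu() */
	cpu = toy_next_and(-1, rto_mask, active_mask);
	printf("first push target: %d\n", cpu);		/* 0 */
	printf("next push target:  %d\n",
	       toy_next_and(cpu, rto_mask, active_mask));	/* 8, i.e. none */

	return 0;
}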

I think we might be circumventing the root cause though. It looks like there
are only 2 CPUs online in the system, one of them (cpu1) is going offline while
the other (cpu0) is overloaded, and we end up ping-ponging the tasks?

If that's the case, it seems to me the rto state needs to be cleared for cpu0
so that it stops attempting to push tasks. I'd assume that usually happens, but
there's a race that prevents it from realizing this.
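
(By "clearing the rto state" I mean roughly what rt_clear_overload() does --
paraphrased from memory below, so the exact upstream details may differ: take
the CPU out of rd->rto_count and rd->rto_mask so rto_next_cpu() stops picking
it as a push target.)

static inline void rt_clear_overload(struct rq *rq)
{
	if (!rq->online)
		return;

	/* Drop the CPU from the rto accounting */
	atomic_dec(&rq->rd->rto_count);
	cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask);
}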

Maybe something like this would be better? Assuming I understood the problem
correctly of course.


diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ed2a47e4ddae..d80072cc2596 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -334,6 +334,10 @@ static inline void rt_set_overload(struct rq *rq)
 	if (!rq->online)
 		return;
 
+	/* Overload is meaningless if we shrank to become a UP system */
+	if (cpumask_weight(cpu_online_mask) == 1)
+		return;
+
 	cpumask_set_cpu(rq->cpu, rq->rd->rto_mask);
 	/*
 	 * Make sure the mask is visible before we set

This could be working around the problem too, not sure. But IMHO the bug is
that if we shrink to a UP system, the whole push-pull mechanism should retire
temporarily. I assume this usually does happen, but there's a race somewhere.
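
Just to illustrate the direction I mean -- completely untested, with the
hotplug placement and locking hand-waved away, and rt_retire_overload_on_up()
being a made-up name rather than an actual proposal:

/*
 * Sketch only: once the online mask has shrunk to a single CPU, retire the
 * push-pull machinery by dropping every remaining CPU from the rto
 * accounting.  Real code would need to sort out where this runs in the
 * hotplug path and how it orders against rt_set_overload().
 */
static void rt_retire_overload_on_up(struct root_domain *rd)
{
	int cpu;

	if (cpumask_weight(cpu_online_mask) > 1)
		return;

	for_each_cpu(cpu, rd->rto_mask) {
		cpumask_clear_cpu(cpu, rd->rto_mask);
		atomic_dec(&rd->rto_count);
	}
}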


Thanks

--
Qais Yousef