Re: [PATCH] sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask

From: Sebastian Andrzej Siewior
Date: Wed Sep 20 2023 - 09:38:16 EST


On 2023-09-11 12:54:50 [+0200], Valentin Schneider wrote:
> Ok, back to this :)
>
> On 15/08/23 16:21, Sebastian Andrzej Siewior wrote:
> > What I still observe is:
> > - CPU0 is idle. CPU0 gets a task assigned from CPU1. That task receives
> > a wakeup. CPU0 returns from idle and schedules the task.
> > pull_rt_task() on CPU1, and sometimes on other CPUs, observes this, too.
> > CPU1 sends irq_work to CPU0 while at the time rto_next_cpu() sees that
> > has_pushable_tasks() returns 0. That bit was cleared earlier (as per
> > tracing).
> >
> > - CPU0 is idle. CPU0 gets a task assigned from CPU1. The task on CPU0 is
> > woken up without an IPI (yay). But then pull_rt_task() decides to
> > send irq_work and has_pushable_tasks() said that it has tasks left
> > so….
> > Now: rto_push_irq_work_func() runs once on CPU0, does nothing,
> > rto_next_cpu() returns CPU0 again and enqueues itself again on CPU0.
> > Usually after the second or third round the scheduler on CPU0 makes
> > enough progress to remove the task / clear the CPU from the mask.
> >
>
> If CPU0 is selected for the push IPI, then we should have
>
> rd->rto_cpu == CPU0
>
> So per the
>
> cpumask_next(rd->rto_cpu, rd->rto_mask);
>
> in rto_next_cpu(), it shouldn't be able to re-select itself.
>
> Do you have a simple enough reproducer I could use to poke at this?

Not really a reproducer. What I had earlier was a high priority RT task
(ntpsec at prio 99) with cyclictest below it (prio 90), running on
PREEMPT_RT, which adds a few more tasks (due to threaded interrupts).
Then I added trace-printks to observe. Initially I had latency spikes
due to ntpsec, but also a bunch of IRQ-work IPIs, which I decided to look at.

> > I understand that there is a race and the CPU is cleared from rto_mask
> > shortly after checking. Therefore I would suggest to look at
> > has_pushable_tasks() before returning a CPU in rto_next_cpu() as I did
> > just to avoid the interruption which does nothing.
> >
> > For the second case the irq_work seems to make no progress. I don't see
> > any trace_events in hardirq, the mask is cleared outside hardirq (idle
> > code). The NEED_RESCHED bit is set for current therefore it doesn't make
> > sense to send irq_work to reschedule if the current already has this on
> > its agenda.
> >
> > So what about something like:
> >
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index 00e0e50741153..d963408855e25 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2247,8 +2247,23 @@ static int rto_next_cpu(struct root_domain *rd)
> >
> > rd->rto_cpu = cpu;
> >
> > - if (cpu < nr_cpu_ids)
> > + if (cpu < nr_cpu_ids) {
> > + struct task_struct *t;
> > +
> > + if (!has_pushable_tasks(cpu_rq(cpu)))
> > + continue;
> > +
>
> IIUC that's just to plug the race between the CPU emptying its
> pushable_tasks list and it removing itself from the rto_mask - that looks
> fine to me.
>
> > + rcu_read_lock();
> > + t = rcu_dereference(rq->curr);
> > + /* if (test_preempt_need_resched_cpu(cpu_rq(cpu))) */
> > + if (test_tsk_need_resched(t)) {
>
> We need to make sure this doesn't cause us to lose IPIs we actually need.
>
> We do have a call to put_prev_task_balance() through entering __schedule()
> if the previous task is RT/DL, and balance_rt() can issue a push
> IPI, but AFAICT only if the previous task was the last DL task. So I don't
> think we can do this.

I observed that the CPU / the task on that CPU already had the need-resched
bit set, so a task switch is already in progress. Therefore it looks like
any further IPIs are needless: the IRQ-work IPI just "leaves early" via
resched_curr() and doesn't do anything useful. It contributes nothing and
only stalls the CPU, keeping it from making progress and performing the
actual context switch.

> > + rcu_read_unlock();
> > + continue;
> > + }
> > + rcu_read_unlock();
> > +
> > return cpu;
> > + }
> >
> > rd->rto_cpu = -1;
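
To make the intent of the two checks in the hunk above concrete, here is a
minimal userspace sketch of the selection loop. It is only a model, not
kernel code: struct model_rq, struct model_rd, model_mask_next() and
model_rto_next_cpu() are hypothetical names, the three booleans stand in for
the rto_mask bit, has_pushable_tasks() and test_tsk_need_resched(rq->curr),
and the rto_loop/rto_loop_next rescan logic, RCU and locking are left out.

```c
#include <stdbool.h>

#define NR_CPUS 4	/* assumption: small fixed CPU count for the model */

/* Hypothetical stand-in for the per-CPU state the patch looks at. */
struct model_rq {
	bool in_rto_mask;	/* CPU's bit is set in rd->rto_mask */
	bool has_pushable;	/* models has_pushable_tasks(rq) */
	bool curr_need_resched;	/* models test_tsk_need_resched(rq->curr) */
};

/* Hypothetical stand-in for struct root_domain. */
struct model_rd {
	int rto_cpu;		/* last CPU picked, -1 when no scan is active */
	struct model_rq rq[NR_CPUS];
};

/* Like cpumask_next(): first set bit strictly after prev, else NR_CPUS. */
static int model_mask_next(const struct model_rd *rd, int prev)
{
	for (int cpu = prev + 1; cpu < NR_CPUS; cpu++)
		if (rd->rq[cpu].in_rto_mask)
			return cpu;
	return NR_CPUS;
}

/* Single pass over the mask, skipping CPUs where an IPI would do nothing. */
static int model_rto_next_cpu(struct model_rd *rd)
{
	for (;;) {
		int cpu = model_mask_next(rd, rd->rto_cpu);

		rd->rto_cpu = cpu;
		if (cpu >= NR_CPUS)
			break;

		/* Mask bit still set, but the pushable list is already empty. */
		if (!rd->rq[cpu].has_pushable)
			continue;

		/* Current task is about to reschedule anyway. */
		if (rd->rq[cpu].curr_need_resched)
			continue;

		return cpu;
	}

	rd->rto_cpu = -1;
	return -1;
}
```

In this model a CPU that is still in the mask but has nothing to push, or
whose current task already has need-resched set, is never returned, so no
irq_work would be queued on it. Whether skipping the second case can drop an
IPI that is actually needed is exactly the open question above; the sketch
does not answer that.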

Sebastian