Re: EEVDF and NUMA balancing

From: Julia Lawall
Date: Fri Dec 29 2023 - 10:18:40 EST




On Thu, 28 Dec 2023, Julia Lawall wrote:

> > > > > > > I'm surprised that you have mainly CPU_NEWLY_IDLE. Do you know the reason?
> > > > > >
> > > > > > No. They come from do_idle calling the scheduler. I will look into why
> > > > > > this happens so often.
> > > > >
> > > > > Hmm, the CPU was idle and received a need_resched, which triggered
> > > > > the scheduler, but there was nothing to schedule, so it went back to
> > > > > idle after running a newly_idle load_balance.
> > > >
> > > > I spent quite some time thinking the same until I saw the following code
> > > > in do_idle:
> > > >
> > > > preempt_set_need_resched();
> > > >
> > > > So I have the impression that do_idle sets need resched itself.
> > >
> > > But of course that code is only executed if need_resched is true. But I
> >
> > Yes, that is your root cause. Something, most probably in interrupt
> > context, wakes up your CPU and expects to wake up a thread.
> >
> > > don't know who would be setting need_resched on each clock tick.
> >
> > That can be a timer, an interrupt, an IPI, RCU, ...
> > A trace should give you some hints.
>
> I have the impression that it is the goal of calling nohz_csd_func on
> each clock tick that causes need_resched to be set. If the idle process
> is polling, call_function_single_prep_ipi just sets need_resched to get
> the idle process to stop polling. But there is no actual task for the
> idle process to schedule. The need_resched then prevents the idle
> process from stealing, due to the CPU_NEWLY_IDLE flag, contradicting
> the whole purpose of calling nohz_csd_func in the first place.

Looking in more detail, do_idle contains the following after exiting the
polling loop:

flush_smp_call_function_queue();
schedule_idle();
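
For context, the tail of do_idle() in kernel/sched/idle.c looks roughly
as follows (a paraphrase; the details vary between kernel versions):

/* Sketch of the end of do_idle(), kernel/sched/idle.c (paraphrased). */
static void do_idle(void)
{
        ...
        while (!need_resched()) {
                /* architecture idle / polling loop */
                ...
        }

        /*
         * We fell out of the loop above, so TIF_NEED_RESCHED must be
         * set; propagate it into PREEMPT_NEED_RESCHED.  For polling
         * idle loops there was no IPI to fold the state for us.
         */
        preempt_set_need_resched();
        flush_smp_call_function_queue();
        schedule_idle();
}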

flush_smp_call_function_queue() does end up calling nohz_csd_func, but
this has no effect, because nohz_csd_func first checks that need_resched()
is false, whereas it is currently true, having been set to cause exiting
the polling loop. Removing that test causes:

raise_softirq_irqoff(SCHED_SOFTIRQ);

but that causes the load balancing code to be executed from a ksoftirqd
task; since the CPU is then running ksoftirqd rather than being idle,
there is no longer any load imbalance to detect.
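
For reference, the test in question is the need_resched() check in
nohz_csd_func() in kernel/sched/fair.c, which looks roughly like this
(paraphrased):

/* Sketch of nohz_csd_func(), kernel/sched/fair.c (paraphrased). */
static void nohz_csd_func(void *info)
{
        struct rq *rq = info;
        int cpu = cpu_of(rq);
        unsigned int flags;

        /* Release the rq::nohz_csd. */
        flags = atomic_fetch_andnot(NOHZ_KICK_MASK | NOHZ_NEWILB_KICK,
                                    nohz_flags(cpu));
        WARN_ON(!(flags & NOHZ_KICK_MASK));

        rq->idle_balance = idle_cpu(cpu);
        /*
         * The check discussed above: when need_resched() is already set,
         * for example to break the polling idle loop, the softirq is not
         * raised here.
         */
        if (rq->idle_balance && !need_resched()) {
                rq->nohz_idle_balance = flags;
                raise_softirq_irqoff(SCHED_SOFTIRQ);
        }
}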

So the only chance to detect an imbalance does seem to be to have the
load balance call executed by the idle task, via schedule_idle(), as is
done currently. But that leads to the core being considered newly idle.
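
For reference, the newly idle path is roughly the following (paraphrased
from kernel/sched/fair.c): schedule_idle() ends up in newidle_balance(),
which passes the CPU_NEWLY_IDLE flag to load_balance():

/* Sketch of newidle_balance(), kernel/sched/fair.c (paraphrased);
 * called when the CPU is about to become idle. */
static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
{
        int this_cpu = this_rq->cpu;
        int pulled_task = 0;
        ...
        for_each_domain(this_cpu, sd) {
                ...
                /*
                 * CPU_NEWLY_IDLE makes load_balance() apply the
                 * newly-idle heuristics rather than the idle ones.
                 */
                pulled_task = load_balance(this_cpu, this_rq,
                                           sd, CPU_NEWLY_IDLE,
                                           &continue_balancing);
                ...
        }
        ...
        return pulled_task;
}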

julia