Re: EEVDF and NUMA balancing

From: Julia Lawall
Date: Fri Jan 19 2024 - 06:34:08 EST




On Fri, 19 Jan 2024, Vincent Guittot wrote:

> On Thu, 18 Jan 2024 at 23:13, Julia Lawall <julia.lawall@xxxxxxxx> wrote:
> >
> >
> >
> > On Thu, 18 Jan 2024, Vincent Guittot wrote:
> >
> > > Hi Julia,
> > >
> > > Sorry for the delay. I have been involved on other perf regression
> > >
> > > On Fri, 5 Jan 2024 at 18:27, Julia Lawall <julia.lawall@xxxxxxxx> wrote:
> > > >
> > > >
> > > >
> > > > On Fri, 5 Jan 2024, Julia Lawall wrote:
> > > >
> > > > >
> > > > >
> > > > > On Fri, 5 Jan 2024, Vincent Guittot wrote:
> > > > >
> > > > > > On Fri, 5 Jan 2024 at 15:51, Julia Lawall <julia.lawall@xxxxxxxx> wrote:
> > > > > > >
> > > > > > > > Your system is calling the polling mode and not the default
> > > > > > > > cpuidle_idle_call() ? This could explain why I don't see such problem
> > > > > > > > on my system which doesn't have polling
> > > > > > > >
> > > > > > > > Are you forcing the use of polling mode ?
> > > > > > > > If yes, could you check that this problem disappears without forcing
> > > > > > > > polling mode ?
> > > > > > >
> > > > > > > I expanded the code in do_idle to:
> > > > > > >
> > > > > > > if (cpu_idle_force_poll) { c1++;
> > > > > > > tick_nohz_idle_restart_tick();
> > > > > > > cpu_idle_poll();
> > > > > > > } else if (tick_check_broadcast_expired()) { c2++;
> > > > > > > tick_nohz_idle_restart_tick();
> > > > > > > cpu_idle_poll();
> > > > > > > } else { c3++;
> > > > > > > cpuidle_idle_call();
> > > > > > > }
> > > > > > >
> > > > > > > Later, I have:
> > > > > > >
> > > > > > > trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
> > > > > > > flush_smp_call_function_queue();
> > > > > > > schedule_idle();
> > > > > > >
> > > > > > > force poll, c1 and c2 are always 0, and c3 is always some non-zero value.
> > > > > > > Sometimes small (often 1), and sometimes large (304 or 305).
> > > > > > >
> > > > > > > So I don't think it's calling cpu_idle_poll().
> > > > > >
> > > > > > I agree that something else
> > > > > >
> > > > > > >
> > > > > > > x86 has TIF_POLLING_NRFLAG defined to be a non zero value, which I think
> > > > > > > is sufficient to cause the issue.
> > > > > >
> > > > > > Could you trace trace_sched_wake_idle_without_ipi() ans csd traces as well ?
> > > > > > I don't understand what set need_resched() in your case; having in
> > > > > > mind that I don't see the problem on my Arm systems and IIRC Peter
> > > > > > said that he didn't face the problem on his x86 system.
> > > > >
> > > > > TIF_POLLING_NRFLAG doesn't seem to be defined on Arm.
> > > > >
> > > > > Peter said that he didn't see the problem, but perhaps that was just
> > > > > random. It requires a NUMA move to occur. I make 20 runs to be sure to
> > > > > see the problem at least once. But another machine might behave
> > > > > differently.
> > > > >
> > > > > I believe the call chain is:
> > > > >
> > > > > scheduler_tick
> > > > > trigger_load_balance
> > > > > nohz_balancer_kick
> > > > > kick_ilb
> > > > > smp_call_function_single_async
> > > > > generic_exec_single
> > > > > __smp_call_single_queue
> > > > > send_call_function_single_ipi
> > > > > call_function_single_prep_ipi
> > > > > set_nr_if_polling <====== sets need_resched
> > > > >
> > > > > I'll make a trace to reverify that.
> > > >
> > > > This is what I see at a tick, which corresponds to the call chain shown
> > > > above:
> > > >
> > > > bt.B.x-4184 [046] 466.410605: bputs: scheduler_tick: calling trigger_load_balance
> > > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling nohz_balancer_kick
> > > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling kick_ilb
> > > > bt.B.x-4184 [046] 466.410607: bprint: trigger_load_balance: calling smp_call_function_single_async 22
> > > > bt.B.x-4184 [046] 466.410607: bputs: smp_call_function_single_async: calling generic_exec_single
> > > > bt.B.x-4184 [046] 466.410607: bputs: generic_exec_single: calling __smp_call_single_queue
> > > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling send_call_function_single_ipi
> > > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling call_function_single_prep_ipi
> > > > bt.B.x-4184 [046] 466.410608: bputs: call_function_single_prep_ipi: calling set_nr_if_polling
> > > > bt.B.x-4184 [046] 466.410609: sched_wake_idle_without_ipi: cpu=22
> > >
> > > I don't know if you have made progress on this in the meantime.
> > >
> > > Regarding the trace above, do you know if anything happens on CPU22
> > > just before the scheduler tried to kick the ILB on it ?
> > >
> > > Have you found why TIF_POLLING_NRFLAG seems to be always set when the
> > > kick_ilb happens ? It should be cleared once entering the idle state.
> >
> > I haven't figured out everything, but the attached graph shows
> > that TIF_POLLING_NRFLAG is not always set. Sometimes it is and sometimes
> > it isn't.
> >
> > In the graph, on core 57, the blue box and the green x are before and
> > after the call to cpuidle_idle_call(), resplectively. One can't see it in
> > this graph, but the green x comes before the blue box. So almost all of
> > the time, it is in cpuidle_idle_call(), only in the tiny gap between the x
> > and the box is it back in do_idle with TIF_POLLING_NRFLAG set.
> >
> > Afterwards, there is a diamond for the polling case and a triangle for the
> > non polling case. These also occur on clock ticks, and may be
> > microscopically closer to (polling) or further from (not polling) the
> > green x and blue box.
>
> Your problem really looks like weird timing.
>
> It would be good to know which idle states are selected ? or even
> better if it's possible, disable all but one idle state and see if one
> idle state in particular trigger your problem
>
> idle state can be disable here :
> echo 1 > /sys/devices/system/cpu/cpu*/cpuidle/state*/disable
>
> One possible sequence:
> tick is not stopped on the idle cpu
> tick fires on busy and idle cpus
> idle cpu wakes up and the wake up time varies depending of wakeup
> latency of the entered c-state
> busy cpu executes call_function_single_prep_ipi() and idle cpu could
> be already woken or not depending of the time to wake up
>
> >
> > I haven't yet studied what happens afterwards in the non polling case.
>
> Side point, according to your trace above, you can 2 consecutives real
> idle load balance so the patch that I proposed, should be able to
> trigger active migration because the nr_balance_failed will be != 0
> the 2nd idle load balance. Are I missing something ?

Thanks for the suggestions. I will check both issues.

julia