Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

From: Zhang, Rui
Date: Mon Aug 14 2023 - 04:31:36 EST


On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
> Hi Rui,
>
> On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
> > Problem statement
> > -----------------
> > When using a cgroup isolated partition to isolate cpus including cpu0,
> > it is observed that cpu0 is woken up frequently but does nothing. This
> > is not good for power efficiency.
> >
> > <idle>-0     [000]   616.491602: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
> > <idle>-0     [000]   616.491608: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
> > <idle>-0     [000]   616.491616: rcu_utilization:      Start context switch
> > <idle>-0     [000]   616.491618: rcu_utilization:      End context switch
> > <idle>-0     [000]   616.491637: tick_stop:            success=1 dependency=NONE
> > <idle>-0     [000]   616.491637: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
> > <idle>-0     [000]   616.491638: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000
> >
> > The above pattern repeats every one or multiple ticks, resulting in a
> > total of 2000+ wakeups on cpu0 in 60 seconds when running a workload on
> > the cpus that are not in the isolated partition.
> >
> > Root cause
> > ----------
> > In NOHZ mode, an active cpu either sends an IPI or touches the idle
> > cpu's polling flag to wake it up, so that the idle cpu can pull tasks
> > from the busy cpu. The logic for selecting the target cpu is to use the
> > first idle cpu that is present in both nohz.idle_cpus_mask and
> > housekeeping_cpumask.
> >
> > In the above scenario, when cpu0 is in the cgroup isolated partition,
> > its sched domain is detached, but it is still available in both of the
> > above cpumasks. As a result, cpu0
>
> I saw in nohz_balance_enter_idle(), if a cpu is isolated, it will not
> set itself in nohz.idle_cpus_mask and thus should not be chosen as
> ilb_cpu. I wonder what's stopping this from working?

One thing I forgot to mention is that the problem is gone if we offline
and re-online those cpus. In that case, the isolated cpus are removed
from the nohz.idle_cpus_mask in sched_cpu_deactivate() and are never
added back.

At runtime, the cpus can be removed from the nohz.idle_cpus_mask only in
the case below:

trigger_load_balance()
	if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
		return;
	nohz_balancer_kick(rq);
		nohz_balance_exit_idle()

My understanding is that if a cpu is in nohz.idle_cpus_mask when it is
isolated, there is no chance to remove it from that mask later, so the
check in nohz_balance_enter_idle() does not help.
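
To make the ordering argument concrete, here is a toy userspace model. It
is not kernel code: every model_* name, the 4-cpu bitmask and the
null_domain[] array are made up for illustration, and the housekeeping
mask and idle_cpu() checks are left out. It only mirrors the three checks
discussed in this thread, under the assumption that cpu0 is already in the
idle mask when it gets isolated.

```c
/* Toy userspace model of the stale idle-mask bit. NOT kernel code;
 * all names below are illustrative assumptions. */
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

static unsigned int idle_cpus_mask;   /* stands in for nohz.idle_cpus_mask */
static bool null_domain[NR_CPUS];     /* stands in for on_null_domain(rq) */

/* Models the nohz_balance_enter_idle() check: an isolated cpu
 * does not set its own bit. */
static void model_enter_idle(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (null_domain[cpu])
		return;
	idle_cpus_mask |= 1u << cpu;
}

/* Models trigger_load_balance(): the early return means the only
 * runtime path that clears the bit is never reached. */
static void model_tick(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (null_domain[cpu])
		return;
	idle_cpus_mask &= ~(1u << cpu);   /* nohz_balance_exit_idle() */
}

/* Models find_new_ilb(); skip_null selects the behaviour of the patch. */
static int model_find_ilb(int self, bool skip_null)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(idle_cpus_mask & (1u << cpu)) || cpu == self)
			continue;
		if (skip_null && null_domain[cpu])
			continue;
		return cpu;
	}
	return -1;
}

/* Replays the problematic order: cpu0 goes idle first, is isolated later. */
static void model_isolate_idle_cpu0(void)
{
	idle_cpus_mask = 0;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		null_domain[cpu] = false;

	model_enter_idle(0);
	model_enter_idle(2);
	null_domain[0] = true;   /* cpu0 moved into the isolated partition */
	model_tick(0);           /* early return, bit 0 stays set */
}
```

After model_isolate_idle_cpu0(), bit 0 is still set in the model's mask, so
model_find_ilb(3, false) keeps picking cpu0, while model_find_ilb(3, true)
skips it and picks cpu2.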

thanks,
rui


>
> > 1. is always selected when kicking idle load balance
> > 2. is woken up from the idle loop
> > 3. calls __schedule() but cannot find any task to pull because it is
> >    not in any sched_domain, thus it does nothing and reenters idle.
> >
> > Solution
> > --------
> > Fix the problem by skipping cpus with no sched domain attached during
> > NOHZ idle balance.
> >
> > Signed-off-by: Zhang Rui <rui.zhang@xxxxxxxxx>
> > ---
> >  kernel/sched/fair.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b3e25be58e2b..ea3185a46962 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
> >                 if (ilb == smp_processor_id())
> >                         continue;
> >  
> > +               if (unlikely(on_null_domain(cpu_rq(ilb))))
> > +                       continue;
> > +
> >                 if (idle_cpu(ilb))
> >                         return ilb;
> >         }
> > --
> > 2.34.1
> >