Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

From: Zhang, Rui
Date: Fri Aug 11 2023 - 04:49:34 EST


Hi, Yu,

On Wed, 2023-08-09 at 15:00 +0800, Chen Yu wrote:
> On 2023-08-04 at 17:08:58 +0800, Zhang Rui wrote:
> > Problem statement
> > -----------------
> > When using a cgroup isolated partition to isolate cpus including cpu0,
> > it is observed that cpu0 is woken up frequently but does nothing. This
> > is not good for power efficiency.
> >
> > <idle>-0     [000]   616.491602: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
> > <idle>-0     [000]   616.491608: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
> > <idle>-0     [000]   616.491616: rcu_utilization:      Start context switch
> > <idle>-0     [000]   616.491618: rcu_utilization:      End context switch
> > <idle>-0     [000]   616.491637: tick_stop:            success=1 dependency=NONE
> > <idle>-0     [000]   616.491637: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
> > <idle>-0     [000]   616.491638: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000
> >
> > The above pattern repeats every one or more ticks, resulting in a total
> > of 2000+ wakeups on cpu0 in 60 seconds when running a workload on the
> > cpus that are not in the isolated partition.
> >
> > Root cause
> > ----------
> > In NOHZ mode, an active cpu either sends an IPI or touches the idle
> > cpu's polling flag to wake it up, so that the idle cpu can pull tasks
> > from the busy cpu. The logic for selecting the target cpu is to use the
> > first idle cpu that is present in both nohz.idle_cpus_mask and
> > housekeeping_cpumask.
> >
> > In the above scenario, when cpu0 is in the cgroup isolated partition,
> > its sched domain is detached, but it is still present in both of the
> > above cpumasks. As a result, cpu0
> > 1. is always selected when kicking idle load balance
> > 2. is woken up from the idle loop
> > 3. calls __schedule() but cannot find any task to pull because it is
> >    not in any sched_domain, thus it does nothing and reenters idle.
> >
> > Solution
> > --------
> > Fix the problem by skipping cpus with no sched domain attached during
> > NOHZ idle balance.
> >
> > Signed-off-by: Zhang Rui <rui.zhang@xxxxxxxxx>
> > ---
> >  kernel/sched/fair.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b3e25be58e2b..ea3185a46962 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
> >                 if (ilb == smp_processor_id())
> >                         continue;
> >  
> > +               if (unlikely(on_null_domain(cpu_rq(ilb))))
> > +                       continue;
> > +
> >                 if (idle_cpu(ilb))
> >                         return ilb;
> >         }
>
> Is it possible to pass a valid cpumask to kick_ilb() via
> nohz_balancer_kick() and let find_new_ilb() scan in that mask? That way
> we could shrink the scan range and also cut down on the null domain
> checks in the loop. CPUs in different cpusets are in different root
> domains, so the busy CPU (in cpuset0) will not ask nohz idle CPU0 (in
> isolated cpuset1) to launch idle load balance.
>
> struct root_domain *rd = rq->rd;
> ...
> kick_ilb(flags, rd->span)
>         

Yeah. This also sounds like a reasonable approach. I can make a patch
to confirm it works as expected.
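
To make the comparison concrete, here is a rough userspace model of the
three selection policies (current code, my patch, your suggestion). It
is only a sketch, not kernel code: struct cpu_model, init_demo() and the
find_ilb_*() helpers are made-up names, has_domain stands in for
!on_null_domain(cpu_rq(cpu)), and rd stands in for membership in
rq->rd->span. It also assumes the isolated cpu0 ends up in a different
root domain from the busy cpu, as you describe.

```c
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical per-cpu state for this model; not a kernel structure. */
struct cpu_model {
	bool idle;         /* cpu is set in nohz.idle_cpus_mask */
	bool housekeeping; /* cpu is set in housekeeping_cpumask */
	bool has_domain;   /* false models on_null_domain() being true */
	int rd;            /* id of the root domain the cpu belongs to */
};

/* cpu0 is isolated (no sched domain, own root domain); cpu1/cpu2 are busy. */
static void init_demo(struct cpu_model *cpus)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		cpus[cpu].idle = (cpu != 1 && cpu != 2);
		cpus[cpu].housekeeping = true;
		cpus[cpu].has_domain = (cpu != 0);
		cpus[cpu].rd = (cpu == 0) ? 1 : 0;
	}
}

/* Current behaviour: first idle housekeeping cpu other than the caller. */
static int find_ilb_orig(const struct cpu_model *cpus, int self)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == self)
			continue;
		if (cpus[cpu].idle && cpus[cpu].housekeeping)
			return cpu;
	}
	return -1;
}

/* This patch: additionally skip cpus with no sched domain attached. */
static int find_ilb_skip_null(const struct cpu_model *cpus, int self)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == self || !cpus[cpu].has_domain)
			continue;
		if (cpus[cpu].idle && cpus[cpu].housekeeping)
			return cpu;
	}
	return -1;
}

/* Suggested alternative: only scan cpus in the caller's root domain. */
static int find_ilb_rd_span(const struct cpu_model *cpus, int self)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == self || cpus[cpu].rd != cpus[self].rd)
			continue;
		if (cpus[cpu].idle && cpus[cpu].housekeeping)
			return cpu;
	}
	return -1;
}
```

With the topology from init_demo() and cpu1 as the kicking cpu,
find_ilb_orig() picks cpu0 every time, while both find_ilb_skip_null()
and find_ilb_rd_span() pick cpu3. In this model the two fixes select
the same cpu; the mask-based one just never looks at cpu0 at all.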

thanks,
rui