Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

From: Zhang, Rui
Date: Wed Sep 20 2023 - 03:24:47 EST


Hi, Pierre,

Sorry for the late response. I'm still ramping up on the related code.

On Thu, 2023-09-14 at 16:53 +0200, Pierre Gondois wrote:
>
>
> On 9/14/23 11:23, Zhang, Rui wrote:
> > Hi, Pierre,
> >
> > >
> > > Yes right indeed,
> > > This happens when putting a CPU offline (as you mentioned
> > > earlier, putting a CPU offline clears the CPU in the
> > > idle_cpus_mask).
> > >
> > > The load balancing related variables
> >
> > including?
>
> I meant the nohz idle variables in the load-balancing code, so I was
> referring to:
> (struct sched_domain_shared).nr_busy_cpus
> (struct sched_domain).nohz_idle
> nohz.idle_cpus_mask
> nohz.nr_cpus
> (struct rq).nohz_tick_stopped
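
For context, most of these live in the static "nohz" bookkeeping
structure in kernel/sched/fair.c. Paraphrasing from memory, so the
exact fields and comments may differ:

static struct {
	cpumask_var_t idle_cpus_mask;	/* CPUs in NOHZ idle mode */
	atomic_t nr_cpus;		/* number of CPUs in idle_cpus_mask */
	int has_blocked;		/* idle CPUs may carry blocked load */
	unsigned long next_balance;	/* next idle balance, in jiffies */
} nohz ____cacheline_aligned;

while nohz_tick_stopped is per-rq, nohz_idle is per sched_domain, and
nr_busy_cpus sits in the sched_domain_shared of the LLC domain.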

IMO, the problem is that an isolated CPU is
1. not an idle CPU (its bit in nohz.idle_cpus_mask should be cleared)
2. not a busy CPU (sds->nr_busy_cpus should be decremented)

But the current code does not have a third state to describe this, so
we need to either
1. add extra logic, like the on_null_domain() checks,
or
2. rely on the current logic, but update all the related variables
correctly, as you proposed.

But in any case, we should stick with one direction.

If we follow the first direction, the original patch should be used,
which IMO is simple and straightforward.
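
(For reference, if I read kernel/sched/fair.c correctly,
on_null_domain() is just a NULL check on the attached domain:

static inline int on_null_domain(struct rq *rq)
{
	/* An isolated CPU has no sched domain attached. */
	return unlikely(!rcu_dereference_sched(rq->sd));
}

so adding one more such check in the NOHZ idle balance path follows
the convention already used in e.g. trigger_load_balance().)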
If we follow the latter, we'd better audit and remove the current
on_null_domain() usages at the same time. TBH, I'm not confident
enough to make such a change myself. But if you want to propose
something, I'd be glad to test it.
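
Just to make sure we are talking about the same thing, something like
the below (completely untested) is what I would expect direction 2 to
look like, with the two calls inverted as you mention below so that
per_cpu(sd_llc, cpu) is still valid when cleaning up:

	rcu_read_lock();
	for_each_cpu(i, cpu_map) {
		/* Clean nohz state while sd_llc is still reachable. */
		nohz_clean_sd_state(i);
		cpu_attach_domain(NULL, &def_root_domain, i);
	}
	rcu_read_unlock();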

thanks,
rui

>
> >
> > >   are unused if a CPU has a NULL
> > > sd as it cannot pull any task. Ideally we should clear them once,
> > > when attaching a NULL sd to the CPU.
> >
> > This sounds good to me. But TBH, I don't have enough confidence to
> > do so because I'm not crystal clear about how these variables are
> > used.
> >
> > Some questions about the code below.
> > >
> > > The following snipped should do that and solve the issue you
> > > mentioned:
> > > --- snip ---
> > > --- a/include/linux/sched/nohz.h
> > > +++ b/include/linux/sched/nohz.h
> > > @@ -9,8 +9,10 @@
> > >    #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
> > >    extern void nohz_balance_enter_idle(int cpu);
> > >    extern int get_nohz_timer_target(void);
> > > +extern void nohz_clean_sd_state(int cpu);
> > >    #else
> > >    static inline void nohz_balance_enter_idle(int cpu) { }
> > > +static inline void nohz_clean_sd_state(int cpu) { }
> > >    #endif
> > >   
> > >    #ifdef CONFIG_NO_HZ_COMMON
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index b3e25be58e2b..6fcabe5d08f5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq)
> > >    {
> > >           SCHED_WARN_ON(rq != this_rq());
> > >   
> > > +       if (on_null_domain(rq))
> > > +               return;
> > > +
> > >           if (likely(!rq->nohz_tick_stopped))
> > >                   return;
> > >
> > if we force clearing rq->nohz_tick_stopped when detaching the
> > domain, why bother adding the first check?
>
> Yes you're right. I added this check for safety, but this is not
> mandatory.
>
> >
> > >   
> > > @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu)
> > >           rcu_read_unlock();
> > >    }
> > >   
> > > +void nohz_clean_sd_state(int cpu)
> > > +{
> > > +       struct rq *rq = cpu_rq(cpu);
> > > +
> > > +       rq->nohz_tick_stopped = 0;
> > > +       if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
> > > +               cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> > > +               atomic_dec(&nohz.nr_cpus);
> > > +       }
> > > +       set_cpu_sd_state_idle(cpu);
> > > +}
> > > +
> >
> > detach_destroy_domains
> >         cpu_attach_domain
> >                 update_top_cache_domain
> >
> > as we clear per_cpu(sd_llc, cpu) for the isolated CPU in
> > cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op
> > here, no?
>
> Yes, you're right, the cpu_attach_domain() and nohz_clean_sd_state()
> calls have to be inverted to avoid what you just described.
>
> It also seems that the current kernel doesn't decrease nr_busy_cpus
> when putting CPUs in an isolated partition. Indeed, if a CPU is
> counted in nr_busy_cpus, putting the CPU in an isolated partition
> doesn't trigger any call to set_cpu_sd_state_idle().
> So it might be an additional argument.
>
> Thanks for reading the patch,
> Regards,
> Pierre
>
> >
> > thanks,
> > rui
> > >    /*
> > >     * This routine will record that the CPU is going idle with
> > >     * tick stopped. This info will be used in performing idle
> > >     * load balancing in the future.
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index d3a3b2646ec4..d31137b5f0ce 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const struct cpumask *cpu_map)
> > >                 static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
> > >   
> > >           rcu_read_lock();
> > > -       for_each_cpu(i, cpu_map)
> > > +       for_each_cpu(i, cpu_map) {
> > >                   cpu_attach_domain(NULL, &def_root_domain, i);
> > > +               nohz_clean_sd_state(i);
> > > +       }
> > >           rcu_read_unlock();
> > >    }
> > >
> > > --- snip ---
> > >
> > > Regards,
> > > Pierre
> > >
> > > >
> > > > >
> > > > > > +       }
> > > > > > +
> > > > > >            /*
> > > > > >             * The tick is still stopped but load could have
> > > > > >             * been added in the meantime. We set the
> > > > > >             * nohz.has_blocked flag to trig a check of the
> > > > > > @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
> > > > > >            if (rq->nohz_tick_stopped)
> > > > > >                    goto out;
> > > > > > -       /* If we're a completely isolated CPU, we don't play: */
> > > > > > -       if (on_null_domain(rq))
> > > > > > -               return;
> > > > > > -
> > > > > >            rq->nohz_tick_stopped = 1;
> > > > > >            cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> > > > > >
> > > > > > Otherwise I could reproduce the issue, and the patch
> > > > > > solved it, so:
> > > > > > Tested-by: Pierre Gondois <pierre.gondois@xxxxxxx>
> > > >
> > > > Thanks for testing, really appreciated!
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Also, your patch doesn't aim to solve that, but I think
> > > > > > there is an issue when updating cpuset.cpus when an
> > > > > > isolated partition was already created:
> > > > > >
> > > > > > // Create an isolated partition containing CPU0
> > > > > > # mkdir cgroup
> > > > > > # mount -t cgroup2 none cgroup/
> > > > > > # mkdir cgroup/Testing
> > > > > > # echo "+cpuset" > cgroup/cgroup.subtree_control
> > > > > > # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
> > > > > > # echo 0 > cgroup/Testing/cpuset.cpus
> > > > > > # echo isolated > cgroup/Testing/cpuset.cpus.partition
> > > > > >
> > > > > > // CPU0's sched domain is detached:
> > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > domain0  domain1
> > > > > >
> > > > > > // Change the isolated partition to be CPU1
> > > > > > # echo 1 > cgroup/Testing/cpuset.cpus
> > > > > >
> > > > > > // CPU[0-1] sched domains are not updated:
> > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > domain0  domain1
> > > > > >
> > > > Interesting. Let me check and get back to you later on this. :)
> > > >
> > > > thanks,
> > > > rui
> >