Re: [External] Re: [PATCH v4 6/7] sched/fair: skip busy cores in SIS search

From: Chen Yu
Date: Thu Jun 23 2022 - 23:30:49 EST


On Wed, Jun 22, 2022 at 11:52:19AM +0800, Abel Wu wrote:
> Hi Chen, thanks for your comments!
>
> On 6/22/22 2:14 AM, Chen Yu Wrote:
> > > ...
> > Reuse the data from load balance to select the unoccupied candidate
> > is applicable IMO, which is also aligned with SIS_UTIL path. And I have
> > a question regarding the update frequency. In v3 patch, the update is
> > based on periodic tick, which is triggered at most every 1ms(CONFIG_HZ_1000).
> > Then the periodic SMT load balance is launched every smt_weight ms, usually 2ms.
> > I expect the 2ms is of the same scale unit as 1ms, and since task tick based
> > update is acceptable, would excluding the CPU_NEWLY_IDLE balance from this patch
> > reduce the overhead meanwhile not bring too much inaccuracy?
>
> I doubt periodic balancing entry is enough. The ms-scale could still
> be too large for wakeup path. Say one cpu becomes newly idle just after
> an update, then it keeps untouchable until next update which is nearly
> 2ms (even worse in SMT4/8 case) wasting time-slices to do nothing. So
> newly-idle update is important to keep the filter fresh. And the false
> positive correction is there to avoid excessive updates due to newly
> idle, by allowing the false positive cpus to stay in the filter for a
> little longer.

>
> > > @@ -8757,7 +8794,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > > * No need to call idle_cpu() if nr_running is not 0
> > > */
> > > if (!nr_running && idle_cpu(i)) {
> > > + /*
> > > + * Prefer the last idle cpu by overwriting
> > > + * preious one. The first idle cpu in this
> > > + * domain (if any) can trigger balancing
> > > + * and fed with tasks, so we'd better choose
> > > + * a candidate in an opposite way.
> > > + */
> > > + sds->idle_cpu = i;
> > Does it mean, only 1 idle CPU in the smt domain would be set to the
> > idle cpu mask at one time? For SMT4/8 we might lose track of the
> > idle siblings.
>
> Yes. The intention of one-at-a-time propagation is
> 1) help spreading out load to different cores
> 2) reduce some update overhead
> In this way, if the filter contains several cpus of a core, ideally
> we can sure about that at least one of them is actually unoccupied.
> For SMT4/8 we still have newly idle balance to make things right.
>
> > > sgs->idle_cpus++;
> > > +
> > > /* Idle cpu can't have misfit task */
> > > continue;
> > > }
> > > @@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> > > static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> > > {
> > > - if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
> > > - set_idle_cpus(env->dst_cpu, true);
> > > + struct sched_domain_shared *sd_smt_shared = env->sd->shared;
> > > + enum sd_state new = sds->sd_state;
> > > + int this = env->dst_cpu;
> > > +
> > > + /*
> > > + * Parallel updating can hardly contribute accuracy to
> > > + * the filter, besides it can be one of the burdens on
> > > + * cache traffic.
> > > + */
> > > + if (cmpxchg(&sd_smt_shared->updating, 0, 1))
> > > + return;
> > > +
> > > + /*
> > > + * There is at least one unoccupied cpu available, so
> > > + * propagate it to the filter to avoid false negative
> > > + * issue which could result in lost tracking of some
> > > + * idle cpus thus throughupt downgraded.
> > > + */
> > > + if (new != sd_is_busy) {
> > > + if (!test_idle_cpus(this))
> > > + set_idle_cpus(this, true);
> > > + } else {
> > > + /*
> > > + * Nothing changes so nothing to update or
> > > + * propagate.
> > > + */
> > > + if (sd_smt_shared->state == sd_is_busy)
> > > + goto out;
> > > + }
> > > +
> > > + sd_update_icpus(this, sds->idle_cpu);
> > I wonder if we could further enhance it to facilitate idle CPU scan.
> > For example, can we propagate the idle CPUs in smt domain, to its parent
> > domain in a hierarchic sequence, and finally to the LLC domain. If there is
>
> In fact, it was my first try to cache the unoccupied cpus in SMT
> shared domain, but the overhead of cpumask ops seems like a major
> stumbling block.
>
> > a cluster domain between SMT and LLC domain, the cluster domain idle CPU filter
> > could benefit from this mechanism.
> > https://lore.kernel.org/lkml/20220609120622.47724-3-yangyicong@xxxxxxxxxxxxx/
>
> Putting SIS into a hierarchical pattern is good for cache locality.
> But I don't think multi-level filter is appropriate since it could
> bring too much cache traffic in SIS,
Could you please elaborate a little more about the cache traffic? I thought we
don't save the unoccupied cpus in SMT shared domain, but to store it in middle
layer shared domain, say, cluster->idle_cpus, this would reduce cache write
contention compared to writing to llc->idle_cpus directly, because a smaller
set of CPUs share the idle_cpus filter. Similarly, SIS can only scan the cluster->idle_cpus
first, without having to query the llc->idle_cpus. This looks like splitting
a big lock into fine grain small lock.
> and it could be expected to be
> a disaster for netperf/tbench or the workloads suffering frequent
> context switches.
>
So this overhead comes from the NEWLY_IDLE case?

thanks,
Chenyu
> >
> > Furthermore, even if there is no cluster domain, would a 'virtual' middle
> > sched domain would help reduce the contention?
> > Core0(CPU0,CPU1),Core1(CPU2,CPU3) Core2(CPU4,CPU5) Core3(CPU6,CPU7)
> > We can create cpumask1, which is composed of Core0 and Core1, and cpumask2
> > which is composed of Core2 and Core3. The SIS would first scan in cpumask1,
> > if idle cpu is not found, scan cpumask2. In this way, the CPUs in Core0 and
> > Core1 only updates cpumask1, without competing with Core2 and Core3 on cpumask2.
> >
> Yes, this is the best case, but the worst case is something that
> we probably can't afford.
>
> Thanks & BR,
> Abel