Re: [PATCH] sched/fair: Fix wake_affine() for !NUMA_BALANCING

From: Chris Wilson
Date: Thu Aug 24 2017 - 18:30:29 EST


We've stumbled across the same regression on Broxton/Apollolake (J3455).

Quoting Peter Zijlstra (2017-08-01 22:43:07)
> On Tue, Aug 01, 2017 at 03:26:51PM -0400, Josef Bacik wrote:
> > > @@ -7574,6 +7607,18 @@ static inline void update_sd_lb_stats(st
> > > env->dst_rq->rd->overload = overload;
> > > }
> > >
> > > + /*
> > > + * Since these are sums over groups they can contain some CPUs
> > > + * multiple times for the NUMA domains.
> > > + *
> > > + * Currently only wake_affine_llc() and find_busiest_group()
> > > + * uses these numbers, only the last is affected by this problem.
> > > + *
> > > + * XXX fix that.
> > > + */
> > > + WRITE_ONCE(shared->nr_running, sds->total_running);
> > > + WRITE_ONCE(shared->load, sds->total_load);
> > > + WRITE_ONCE(shared->capacity, sds->total_capacity);
> >
> > This panic's on boot for me because shared is NULL. Same happens in
> > select_task_rq_fair when it tries to do the READ_ONCE. Here is my .config in
> > case it's something strange with my config. Thanks,
>
> Nah, its just me being an idiot and booting the wrong kernel. Unless I
> messed up again, this one boots.
>
> There is state during boot and hotplug where there are no domains, and
> thus also no shared. Simply ignoring things when that happens should be
> good enough I think.

This is still not as effective as the previous code in spreading across
siblings.

> +/*
> + * Can a task be moved from prev_cpu to this_cpu without causing a load
> + * imbalance that would trigger the load balancer?
> + *
> + * Since we're running on 'stale' values, we might in fact create an imbalance
> + * but recomputing these values is expensive, as that'd mean iteration 2 cache
> + * domains worth of CPUs.
> + */
> +static bool
> +wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
> + int this_cpu, int prev_cpu, int sync)
> +{
> + struct llc_stats prev_stats, this_stats;
> + s64 this_eff_load, prev_eff_load;
> + unsigned long task_load;
> +
> + get_llc_stats(&prev_stats, prev_cpu);
> + get_llc_stats(&this_stats, this_cpu);
> +
> + /*
> + * If sync wakeup then subtract the (maximum possible)
> + * effect of the currently running task from the load
> + * of the current LLC.
> + */
> + if (sync) {
> + unsigned long current_load = task_h_load(current);
> +
> + /* in this case load hits 0 and this LLC is considered 'idle' */
> + if (current_load > this_stats.load)
> + return true;
> +
> + this_stats.load -= current_load;
> + }
> +
> + /*
> + * The has_capacity stuff is not SMT aware, but by trying to balance
> + * the nr_running on both ends we try and fill the domain at equal
> + * rates, thereby first consuming cores before siblings.
> + */
> +
> + /* if the old cache has capacity, stay there */
> + if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
> + return false;
> +
> + /* if this cache has capacity, come here */
> + if (this_stats.has_capacity && this_stats.nr_running < prev_stats.nr_running+1)

I think you mean,

if (this_stats.has_capacity && this_stats.nr_running + 1 < prev_stats.nr_running)

and with that our particular graphics benchmark behaves similarly to as
before (the regression appears fixed). But I'll let Eero double check.

> + return true;
> +
> +
> + /*
> + * Check to see if we can move the load without causing too much
> + * imbalance.
> + */
> + task_load = task_h_load(p);
> +
> + this_eff_load = 100;
> + this_eff_load *= prev_stats.capacity;
> +
> + prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
> + prev_eff_load *= this_stats.capacity;
> +
> + this_eff_load *= this_stats.load + task_load;
> + prev_eff_load *= prev_stats.load - task_load;
> +
> + return this_eff_load <= prev_eff_load;
> +}