Re: [PATCH v3 03/10] sched/topology: Provide cfs_overload_cpus bitmap

From: Steven Sistare
Date: Mon Nov 19 2018 - 12:34:22 EST


On 11/12/2018 11:42 AM, Valentin Schneider wrote:
> Hi Steve,
>
> On 09/11/2018 12:50, Steve Sistare wrote:
>> From: Steve Sistare <steve.sistare@xxxxxxxxxx>
>>
>> Define and initialize a sparse bitmap of overloaded CPUs, per
>> last-level-cache scheduling domain, for use by the CFS scheduling class.
>> Save a pointer to cfs_overload_cpus in the rq for efficient access.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@xxxxxxxxxx>
>> ---
>> include/linux/sched/topology.h | 1 +
>> kernel/sched/sched.h | 2 ++
>> kernel/sched/topology.c | 21 +++++++++++++++++++--
>> 3 files changed, 22 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 6b99761..b173a77 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -72,6 +72,7 @@ struct sched_domain_shared {
>> atomic_t ref;
>> atomic_t nr_busy_cpus;
>> int has_idle_cores;
>> + struct sparsemask *cfs_overload_cpus;
>
> Thinking about misfit stealing, we can't use the sd_llc_shared's because
> on big.LITTLE misfit migrations happen across LLC domains.
>
> I was thinking of adding a misfit sparsemask to the root_domain, but
> then I thought we could do the same thing for cfs_overload_cpus.
>
> By doing so we'd have a single source of information for overloaded CPUs,
> and we could filter that down during idle balance - you mentioned earlier
> wanting to try stealing at each SD level. This would also let you get
> rid of [PATCH 02].
>
> The main part of try_steal() could then be written down as something like
> this:
>
> ----->8-----
>
> for_each_domain(this_cpu, sd) {
>         span = sched_domain_span(sd);
>
>         for_each_sparse_wrap(src_cpu, overload_cpus) {
>                 if (cpumask_test_cpu(src_cpu, span) &&
>                     steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
>                         stolen = 1;
>                         goto out;
>                 }
>         }
> }
>
> ------8<-----
>
> We could limit the stealing to stop at the highest SD_SHARE_PKG_RESOURCES
> domain for now so there would be no behavioural change - but we'd
> factorize the #ifdef SCHED_SMT bit. Furthermore, the door would be open
> to further stealing.
>
> What do you think?

That would not be efficient for a multi-level search, because at each domain
level we would re-iterate over overloaded candidates that do not belong to
that level. To extend stealing across LLCs, I would like to keep the per-LLC
sparsemask, but add to each SD a list of sparsemask pointers. The list nodes
would be private to the SD, but the sparsemask structs would be shared. Each
list would include the masks that overlap the SD's members. The list would be
a singleton at the core and LLC levels (the LLC level being the same as the
socket level for most processors), and would have multiple elements at the
NUMA level.
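
A rough userspace sketch of the list-of-masks idea, not code from this series.
All names below (llc_mask, mask_node, domain, find_overloaded_cpu) are
hypothetical, a plain 64-bit word stands in for the sparsemask, and skipping
masks already searched at a lower level is an assumption added for the
illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 64	/* assumption: one 64-bit word covers all CPUs */

/* Stand-in for the shared per-LLC sparsemask of overloaded CPUs. */
struct llc_mask {
	uint64_t span;		/* CPUs that belong to this LLC */
	uint64_t overloaded;	/* subset of span that is overloaded */
};

/* List node private to one domain; the mask it points to is shared. */
struct mask_node {
	struct llc_mask *mask;
	struct mask_node *next;
};

/* Hypothetical per-CPU domain, linked from the lowest level upwards. */
struct domain {
	struct mask_node *masks;	/* masks overlapping this domain */
	struct domain *parent;
};

/* Return the lowest-numbered CPU set in bits, or -1 if none is set. */
static int first_cpu(uint64_t bits)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (bits & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/*
 * Walk the domains bottom-up and return the first overloaded CPU other
 * than this_cpu, or -1. Only the masks listed for each level are scanned,
 * and a mask already scanned at a lower level is skipped.
 */
static int find_overloaded_cpu(const struct domain *sd, int this_cpu)
{
	uint64_t searched = 0;	/* union of LLC spans already scanned */

	for (; sd; sd = sd->parent) {
		for (const struct mask_node *n = sd->masks; n; n = n->next) {
			const struct llc_mask *m = n->mask;
			uint64_t candidates;

			if (m->span & searched)
				continue;
			searched |= m->span;

			candidates = m->overloaded & ~(UINT64_C(1) << this_cpu);
			if (candidates)
				return first_cpu(candidates);
		}
	}
	return -1;
}
```

With two LLCs under one NUMA node, the LLC-level domain would list one mask
and the NUMA-level domain would list both, so the remote LLC is consulted
only after the local one yields no candidate.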

- Steve