Re: [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

From: Chen Yu
Date: Thu Sep 28 2023 - 03:52:46 EST


Hi Mathieu,

On 2023-09-27 at 12:11:33 -0400, Mathieu Desnoyers wrote:
> On 9/26/23 06:11, Chen Yu wrote:
> > Problem statement:
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases, the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> >
> > Proposal:
> > Inspired by Mathieu's idea[1], introduce the SIS_CACHE. It considers
> > the sleep time of the task for better task placement. Based on the
> > task's short sleeping history, keep p's previous CPU idle for a short
> > while. Later when p is woken up, it can choose its previous CPU in
> > select_idle_sibling(). When p's previous CPU is reserved, another wakee
> > is not allowed to choose this CPU in select_idle_cpu(). The reservation
> > period is set to the task's average short sleep time, AKA, se->sis_rsv_avg.
> >
> > This does not break the work conservation of the scheduler, because
> > wakee will still try its best to find an idle CPU. The difference is that
> > different idle CPUs might have different priorities.
> >
> > Prateek pointed out that, with SIS_CACHE enabled, if all idle CPUs are
> > cache-hot, select_idle_cpu() might have to choose a non-idle target CPU,
> > which brings task stacking. Mitigate this by returning the first cache-hot
> > idle CPU if no cache-cold CPU is found.
>
> I've tried your patches on my reference hackbench workload:
>
> ./hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
>
> Unfortunately they don't appear to help for that specific load.
>
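
As a side note, the selection order described in the proposal above could be
sketched in plain user-space C as below. This is only an illustration under my
own assumptions (pick_idle_cpu() and the cache_hot_until reservation deadline
are hypothetical names, not the actual patch): an idle CPU whose reservation
has expired is preferred, and the first idle-but-reserved CPU is remembered as
a fallback, so a wakee never has to stack on a busy CPU just because every
idle CPU happens to be cache-hot.

#include <stdio.h>
#include <stdbool.h>

struct cpu_state {
	bool idle;				/* is the CPU idle right now? */
	unsigned long long cache_hot_until;	/* hypothetical reservation deadline */
};

/* Prefer an idle, cache-cold CPU; fall back to the first idle cache-hot one. */
static int pick_idle_cpu(const struct cpu_state *cpus, int nr,
			 unsigned long long now)
{
	int cpu, first_hot = -1;

	for (cpu = 0; cpu < nr; cpu++) {
		if (!cpus[cpu].idle)
			continue;
		if (now < cpus[cpu].cache_hot_until) {
			/* reserved for a recently slept task: keep as fallback */
			if (first_hot < 0)
				first_hot = cpu;
			continue;
		}
		return cpu;		/* idle and not reserved */
	}

	return first_hot;		/* -1 means no idle CPU at all */
}

int main(void)
{
	struct cpu_state cpus[] = {
		{ .idle = false, .cache_hot_until = 0   },	/* busy */
		{ .idle = true,  .cache_hot_until = 200 },	/* idle but reserved */
		{ .idle = true,  .cache_hot_until = 50  },	/* idle, reservation expired */
	};

	/* at time 100, CPU 1 is still reserved, so CPU 2 is picked */
	printf("picked CPU %d\n", pick_idle_cpu(cpus, 3, 100));
	return 0;
}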

I just ran the same test on a 224 CPU system and it seems that
there is not much difference with/without SIS_CACHE. To figure out
the reason, I used bpftrace to track how often hackbench is woken
up on its previous CPU:

kretfunc:select_task_rq_fair
{
	$p = (struct task_struct *)args->p;

	if ($p->comm == "sender") {
		if ($p->thread_info.cpu != retval) {
			@wakeup_migrate_sender = count();
		} else {
			@wakeup_prev_sender = count();
		}
	}

	if ($p->comm == "receiver") {
		if ($p->thread_info.cpu != retval) {
			@wakeup_migrate_receiver = count();
		} else {
			@wakeup_prev_receiver = count();
		}
	}
}

/* assumed companion probe: dump and reset the counters every 10 seconds */
interval:s:10
{
	time("%H:%M:%S Wakeup statistics:\n");
	print(@wakeup_migrate_sender);
	print(@wakeup_prev_sender);
	print(@wakeup_migrate_receiver);
	print(@wakeup_prev_receiver);
	clear(@wakeup_migrate_sender);
	clear(@wakeup_prev_sender);
	clear(@wakeup_migrate_receiver);
	clear(@wakeup_prev_receiver);
}

and print the data every 10 seconds:

NO_SIS_CACHE:
23:50:24 Wakeup statistics:
@wakeup_migrate_sender: 9043961
@wakeup_prev_sender: 20073128
@wakeup_migrate_receiver: 12071462
@wakeup_prev_receiver: 19587895

sender: migration/previous = 45.06%
receiver: migration/previous = 61.612%
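(migration/previous here is @wakeup_migrate_* divided by @wakeup_prev_*,
e.g. for the sender: 9043961 / 20073128 ~= 45.06%)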


SIS_CACHE:
23:49:21 Wakeup statistics:
@wakeup_migrate_sender: 6716902
@wakeup_prev_sender: 22727513
@wakeup_migrate_receiver: 11547623
@wakeup_prev_receiver: 24615810

sender: migration/previous = 29.55%
receiver: migration/previous = 46.91%

Both the sender and the receiver in hackbench see a higher chance of
being woken up on their previous CPU, although the increase is not as
large as in the netperf case. Why is there still almost no score
difference? I checked the bottleneck via perf topdown:

perf stat -M TopdownL1 -- sleep 10
perf stat -M tma_frontend_bound_group -- sleep 10
perf stat -M tma_fetch_latency_group -- sleep 10

NO_SIS_CACHE:
15.2 % tma_backend_bound
14.9 % tma_bad_speculation
43.9 % tma_frontend_bound
    30.3 % tma_fetch_latency
         9.7 % tma_ms_switches
    14.0 % tma_fetch_bandwidth
26.1 % tma_retiring


SIS_CACHE:
14.5 % tma_backend_bound
15.3 % tma_bad_speculation
44.5 % tma_frontend_bound
    31.5 % tma_fetch_latency
        10.6 % tma_ms_switches
    13.0 % tma_fetch_bandwidth
25.8 % tma_retiring

The ratios barely change with/without SIS_CACHE enabled.
This is because SIS_CACHE mainly helps tasks with a large cache
footprint (backend bound, like netperf), but hackbench in pipe mode
appears to be frontend bound, and its bottleneck is the complexity of
the instructions being executed (tma_ms_switches: the microcode
sequencer (MS) decodes complex instructions, so an increase in the MS
switch counter usually means the workload is executing complex
instructions). In other words, the pipe_read/write code path could be
the bottleneck. Your original rate limiting of task migration might be
more aggressive in reducing tma_backend_bound, and that might bring a
score benefit, as in the netperf case. Let me apply your original
patch to confirm whether this is the case.

thanks,
Chenyu