RE: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup

From: Song Bao Hua (Barry Song)
Date: Sun Dec 13 2020 - 18:30:21 EST

> -----Original Message-----
> From: Li, Aubrey [mailto:aubrey.li@xxxxxxxxxxxxxxx]
> Sent: Saturday, December 12, 2020 4:25 AM
> To: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>; Peter Zijlstra <peterz@xxxxxxxxxxxxx>;
> Juri Lelli <juri.lelli@xxxxxxxxxx>; Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>;
> Valentin Schneider <valentin.schneider@xxxxxxx>; Qais Yousef
> <qais.yousef@xxxxxxx>; Dietmar Eggemann <dietmar.eggemann@xxxxxxx>; Steven
> Rostedt <rostedt@xxxxxxxxxxx>; Ben Segall <bsegall@xxxxxxxxxx>; Tim Chen
> <tim.c.chen@xxxxxxxxxxxxxxx>; linux-kernel <linux-kernel@xxxxxxxxxxxxxxx>;
> Mel Gorman <mgorman@xxxxxxx>; Jiang Biao <benbjiang@xxxxxxxxx>
> Subject: Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for
> task wakeup
>
> On 2020/12/11 23:22, Vincent Guittot wrote:
> > On Fri, 11 Dec 2020 at 16:19, Li, Aubrey <aubrey.li@xxxxxxxxxxxxxxx> wrote:
> >>
> >> On 2020/12/11 23:07, Vincent Guittot wrote:
> >>> On Thu, 10 Dec 2020 at 02:44, Aubrey Li <aubrey.li@xxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> Add idle cpumask to track idle cpus in sched domain. Every time
> >>>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
> >>>> target. And if the CPU is not idle, the CPU is cleared from the idle
> >>>> cpumask during the scheduler tick to ratelimit idle cpumask updates.
> >>>>
> >>>> When a task wakes up and selects an idle cpu, scanning the idle cpumask
> >>>> has lower cost than scanning all the cpus in the last level cache
> >>>> domain, especially when the system is heavily loaded.
> >>>>
> >>>> Benchmarks including hackbench, schbench, uperf, sysbench mysql and
> >>>> kbuild have been tested on a x86 4 socket system with 24 cores per
> >>>> socket and 2 hyperthreads per core, total 192 CPUs, no regression
> >>>> found.
> >>>>
> >>>> v7->v8:
> >>>> - refine update_idle_cpumask, no functionality change
> >>>> - fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y
> >>>>
> >>>> v6->v7:
> >>>> - place the whole idle cpumask mechanism under CONFIG_SMP
> >>>>
> >>>> v5->v6:
> >>>> - decouple idle cpumask update from stop_tick signal, set idle CPU
> >>>> in idle cpumask every time the CPU enters idle
> >>>>
> >>>> v4->v5:
> >>>> - add update_idle_cpumask for s2idle case
> >>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
> >>>> idle_cpumask() everywhere
> >>>>
> >>>> v3->v4:
> >>>> - change setting idle cpumask from every idle entry to tickless idle
> >>>> if cpu driver is available
> >>>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode
> >>>>
> >>>> v2->v3:
> >>>> - change setting idle cpumask to every idle entry, otherwise schbench
> >>>> has a regression of 99th percentile latency
> >>>> - change clearing idle cpumask to nohz_balancer_kick(), so updating
> >>>> idle cpumask is ratelimited in the idle exiting path
> >>>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target
> >>>>
> >>>> v1->v2:
> >>>> - idle cpumask is updated in the nohz routines, by initializing idle
> >>>> cpumask with sched_domain_span(sd), nohz=off case remains the original
> >>>> behavior
> >>>>
> >>>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >>>> Cc: Mel Gorman <mgorman@xxxxxxx>
> >>>> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> >>>> Cc: Qais Yousef <qais.yousef@xxxxxxx>
> >>>> Cc: Valentin Schneider <valentin.schneider@xxxxxxx>
> >>>> Cc: Jiang Biao <benbjiang@xxxxxxxxx>
> >>>> Cc: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> >>>> Signed-off-by: Aubrey Li <aubrey.li@xxxxxxxxxxxxxxx>
> >>>
> >>> This version looks good to me. I don't see regressions of v5 anymore
> >>> and see some improvements on heavy cases
> >>
> >> v5 or v8?
> >
> > the v8 looks good to me and I don't see the regressions that I have
> > seen with the v5 anymore
> >
> Sounds great, thanks, :)


Hi Aubrey,

The patch looks great. But I didn't find any hackbench improvement on
Kunpeng 920, whose LLC span covers 24 cores; each LLC span is also one
NUMA node. The topology is:
# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 128669 MB
node 0 free: 126995 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 128997 MB
node 1 free: 127539 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 129021 MB
node 2 free: 127106 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 127993 MB
node 3 free: 126739 MB
node distances:
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10

Benchmark command:
numactl -N 0-1 hackbench -p -T -l 20000 -g $1

For each g, I ran the benchmark 10 times and took the average time;
g was varied from 1 to 10.

g          1       2       3       4       5       6       7       8       9       10
w/o patch  1.4733  1.5992  1.9353  2.1563  2.8448  3.3305  3.9616  4.4870  5.0786  5.6983
w/  patch  1.4709  1.6152  1.9474  2.1512  2.8298  3.2998  3.9472  4.4803  5.0462  5.6505

Is it because the number of cores in each LLC span is small in my test?
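
To make sure I understand where the saving is supposed to come from, below is
a simplified sketch of the mechanism as I read it (not the actual patch code;
sds_idle_cpus() is only a placeholder name for whatever helper returns the
per-LLC idle cpumask):

/*
 * Simplified sketch, not the patch itself.  sds_idle_cpus() is a
 * placeholder for the helper that returns the per-LLC idle cpumask.
 */

/* Idle entry: advertise this CPU as a wakeup target. */
static inline void mark_cpu_idle(int cpu, struct sched_domain_shared *sds)
{
        cpumask_set_cpu(cpu, sds_idle_cpus(sds));
}

/*
 * scheduler_tick() on a busy CPU: remove it again.  Clearing from the
 * tick ratelimits updates on the idle exit path.
 */
static inline void mark_cpu_busy(int cpu, struct sched_domain_shared *sds)
{
        cpumask_clear_cpu(cpu, sds_idle_cpus(sds));
}

/* Wakeup: scan only the advertised-idle CPUs instead of the whole LLC. */
static int select_idle_cpu_sketch(struct task_struct *p,
                                  struct sched_domain *sd, int target)
{
        struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
        int cpu;

        cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

        for_each_cpu_wrap(cpu, cpus, target) {
                if (available_idle_cpu(cpu))
                        return cpu;
        }

        return -1;
}

If that reading is right, the benefit should scale with how many busy CPUs
the scan gets to skip. With only 24 CPUs per LLC here, the plain scan is
already short, so the saving may simply be lost in the noise.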

Thanks
Barry