Re: [RFC PATCH v4] sched: Fix performance regression introduced by mm_cid

From: Aaron Lu
Date: Wed Apr 12 2023 - 07:43:16 EST


On Wed, Apr 12, 2023 at 11:10:43AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 11, 2023 at 09:12:21PM +0800, Aaron Lu wrote:
>
> > Forget about this "v4 is better than v2 and v3" part, my later test
> > showed the contention can also rise to around 18% for v4.
>
> So while I can reproduce the initial regression on a HSW-EX system
> (4*18*2) and get lovely things like:
>
> 34.47%--schedule_hrtimeout_range_clock
> schedule
> |
> --34.42%--__schedule
> |
> |--31.86%--_raw_spin_lock
> | |
> | --31.65%--native_queued_spin_lock_slowpath
> |
> --0.72%--dequeue_task_fair
> |
> --0.60%--dequeue_entity
>
> On a --threads=144 run; it is completely gone when I use v4:
>
> 6.92%--__schedule
> |
> |--2.16%--dequeue_task_fair
> | |
> | --1.69%--dequeue_entity
> | |
> | |--0.61%--update_load_avg
> | |
> | --0.54%--update_curr
> |
> |--1.30%--pick_next_task_fair
> | |
> | --0.54%--set_next_entity
> |
> |--0.77%--psi_task_switch
> |
> --0.69%--switch_mm_irqs_off
>
>
> :-(

Hmm... I also tested on a 2 sockets/64 cores/128 CPUs Icelake machine:
the contention there is about 20%-48% with vanilla v6.3-rc6, and after
applying v4 it is gone.

But it's still there on a 2 sockets/112 cores/224 CPUs Sapphire Rapids
(SPR) machine with v4 (and v2, v3)...:

18.38% 1.24% [kernel.vmlinux] [k] __schedule
|
|--17.14%--__schedule
| |
| |--10.63%--mm_cid_get
| | |
| | --10.22%--_raw_spin_lock
| | |
| | --10.07%--native_queued_spin_lock_slowpath
| |
| |--3.43%--dequeue_task
| | |
| | --3.39%--dequeue_task_fair
| | |
| | |--2.60%--dequeue_entity
| | | |
| | | |--1.22%--update_cfs_group
| | | |
| | | --1.05%--update_load_avg
| | |
| | --0.63%--update_cfs_group
| |
| |--0.68%--switch_mm_irqs_off
| |
| |--0.60%--finish_task_switch.isra.0
| |
| --0.50%--psi_task_switch
|
--0.53%--0x55a8385c088b

It's much better than the initial 70% contention on SPR, of course.
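
FWIW, what all those native_queued_spin_lock_slowpath hits are queueing
on is the per-mm lock that serializes cid allocation from the mm's
cidmask on context switch. The pattern is roughly the below (a
simplified sketch from memory, not the exact code -- v4 puts a fast
path in front of this, but the profile shows the fallback still funnels
every CPU into the same per-mm lock):

static inline int mm_cid_get(struct mm_struct *mm)
{
	int cid;

	/* One lock shared by every CPU switching into this mm. */
	raw_spin_lock(&mm->cid_lock);
	cid = cpumask_first_zero(mm_cidmask(mm));
	if (cid >= nr_cpu_ids)
		cid = -1;
	else
		__cpumask_set_cpu(cid, mm_cidmask(mm));
	raw_spin_unlock(&mm->cid_lock);
	return cid;
}

With 224 CPUs all switching tasks of the same mm, that lock's cacheline
bounces across both sockets, which is presumably where the slowpath
time above comes from.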

BTW, I found hackbench can also show this problem on both Icelake and
SPR (with --threads everything runs as threads of a single process, so
presumably they all contend on that one mm's lock).

With v4, on SPR:
~/src/rt-tests-2.4/hackbench --pipe --threads -l 500000
Profile was captured 20s after starting hackbench.
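
The call graphs here were collected with the usual perf recipe, along
the lines of:

  # let hackbench warm up for 20s, then record system-wide call graphs
  sleep 20; perf record -a -g -- sleep 5
  perf report --stdio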

40.89% 7.71% [kernel.vmlinux] [k] __schedule
|
|--33.19%--__schedule
| |
| |--22.25%--mm_cid_get
| | |
| | --18.78%--_raw_spin_lock
| | |
| | --18.46%--native_queued_spin_lock_slowpath
| |
| |--7.46%--finish_task_switch.isra.0
| | |
| | --0.52%--asm_sysvec_call_function_single
| | sysvec_call_function_single
| |
| |--0.95%--dequeue_task
| | |
| | --0.93%--dequeue_task_fair
| | |
| | --0.76%--dequeue_entity
| |
| --0.75%--debug_smp_processor_id
|


With v4, on Icelake:
~/src/rt-tests-2.4/hackbench --pipe --threads -l 500000
Profile was captured 20s after starting hackbench.

25.83% 4.11% [kernel.kallsyms] [k] __schedule
|
|--21.72%--__schedule
| |
| |--11.68%--mm_cid_get
| | |
| | --9.36%--_raw_spin_lock
| | |
| | --9.09%--native_queued_spin_lock_slowpath
| |
| |--3.80%--finish_task_switch.isra.0
| | |
| | --0.70%--asm_sysvec_call_function_single
| | |
| | --0.69%--sysvec_call_function_single
| |
| |--2.58%--dequeue_task
| | |
| | --2.53%--dequeue_task_fair

I *guess* hackbench might also show some contention on that HSW-EX
system with v4.