Re: [PATCH v11 3/3] sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup

From: Chen Yu
Date: Fri Oct 20 2023 - 03:37:05 EST

Next message: Jonathan Cameron: "Re: [PATCH v2 3/3] i2c: muxes: i2c-mux-gpmux: Use of_get_i2c_adapter_by_node()"
Previous message: Qi Zheng: "Re: [PATCH v3 2/2] mm: memory_hotplug: drop memoryless node from fallback lists"
In reply to: Vincent Guittot: "Re: [PATCH v11 3/3] sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup"
Next in thread: Yicong Yang: "[PATCH v11 2/3] sched/fair: Scan cluster before scanning LLC in wake-up path"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2023-10-19 at 11:33:23 +0800, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
>
> Chen Yu reports a hackbench regression of cluster wakeup when
> hackbench threads equal to the CPU number [1]. Analysis shows
> it's because we wake up more on the target CPU even if the
> prev_cpu is a good wakeup candidate and leads to the decrease
> of the CPU utilization.
>
> Generally if the task's prev_cpu is idle we'll wake up the task
> on it without scanning. On cluster machines we'll try to wake up
> the task in the same cluster of the target for better cache
> affinity, so if the prev_cpu is idle but not sharing the same
> cluster with the target we'll still try to find an idle CPU within
> the cluster. This will improve the performance at low loads on
> cluster machines. But in the issue above, if the prev_cpu is idle
> but not in the cluster with the target CPU, we'll try to scan an
> idle one in the cluster. But since the system is busy, we're
> likely to fail the scanning and use target instead, even if
> the prev_cpu is idle. Then leads to the regression.
>
> This patch solves this in 2 steps:
> o record the prev_cpu/recent_used_cpu if they're good wakeup
> candidates but not sharing the cluster with the target.
> o on scanning failure use the prev_cpu/recent_used_cpu if
> they're recorded as idle
>
> [1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/
>
> Reported-by: Chen Yu <yu.c.chen@xxxxxxxxx>
> Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/
> Signed-off-by: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
>

Tested on 24 CPUs Jacobsville machine, 4 CPUs in one cluster sharing L2 Cache.
The baseline is sched/core on top of
commit a36e5741bdc5 ("sched: Fix stop_one_cpu_nowait() vs hotplug"),
and compared with the whole patch set applied. I did not see any regression but
improvement on hackbench, please feel free to add:

Tested-and-reviewed-by: Chen Yu <yu.c.chen@xxxxxxxxx>

hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 0.26) +6.02 ( 1.53)
process-pipe 2-groups 1.00 ( 1.03) +1.97 ( 0.70)
process-pipe 4-groups 1.00 ( 0.26) +1.80 ( 3.27)
process-sockets 1-groups 1.00 ( 0.29) +1.63 ( 0.86)
process-sockets 2-groups 1.00 ( 1.17) +2.59 ( 0.35)
process-sockets 4-groups 1.00 ( 1.07) +3.86 ( 0.51)
threads-pipe 1-groups 1.00 ( 0.79) +8.17 ( 0.48)
threads-pipe 2-groups 1.00 ( 0.65) +6.34 ( 0.23)
threads-pipe 4-groups 1.00 ( 0.38) +4.61 ( 1.04)
threads-sockets 1-groups 1.00 ( 0.73) +0.80 ( 0.35)
threads-sockets 2-groups 1.00 ( 1.09) +2.81 ( 1.18)
threads-sockets 4-groups 1.00 ( 0.67) +2.30 ( 0.20)

netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 6-threads 1.00 ( 0.48) +3.97 ( 0.50)
TCP_RR 12-threads 1.00 ( 0.11) +3.83 ( 0.15)
TCP_RR 18-threads 1.00 ( 0.18) +7.53 ( 0.18)
TCP_RR 24-threads 1.00 ( 0.34) +2.40 ( 0.77)
TCP_RR 30-threads 1.00 ( 10.39) +2.22 ( 11.51)
TCP_RR 36-threads 1.00 ( 10.87) +2.06 ( 16.71)
TCP_RR 42-threads 1.00 ( 14.04) +2.10 ( 12.86)
TCP_RR 48-threads 1.00 ( 5.89) +2.15 ( 5.54)
UDP_RR 6-threads 1.00 ( 0.20) +2.99 ( 0.55)
UDP_RR 12-threads 1.00 ( 0.18) +3.65 ( 0.27)
UDP_RR 18-threads 1.00 ( 0.34) +6.62 ( 0.23)
UDP_RR 24-threads 1.00 ( 0.60) -1.73 ( 12.54)
UDP_RR 30-threads 1.00 ( 9.70) -0.62 ( 14.34)
UDP_RR 36-threads 1.00 ( 11.80) -0.05 ( 12.27)
UDP_RR 42-threads 1.00 ( 15.35) -0.05 ( 12.26)
UDP_RR 48-threads 1.00 ( 5.58) -0.12 ( 5.73)

tbench
======
case load baseline(std%) compare%( std%)
loopback 6-threads 1.00 ( 0.29) +2.51 ( 0.24)
loopback 12-threads 1.00 ( 0.08) +2.90 ( 0.47)
loopback 18-threads 1.00 ( 0.06) +6.85 ( 0.07)
loopback 24-threads 1.00 ( 0.20) +1.85 ( 0.14)
loopback 30-threads 1.00 ( 0.15) +1.37 ( 0.07)
loopback 36-threads 1.00 ( 0.12) +1.34 ( 0.07)
loopback 42-threads 1.00 ( 0.09) +0.91 ( 0.04)
loopback 48-threads 1.00 ( 0.11) +0.88 ( 0.05)

schbench
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 2.67) -1.89 ( 0.00)
normal 2-mthreads 1.00 ( 0.00) +0.00 ( 0.00)
normal 4-mthreads 1.00 ( 8.08) +12.86 ( 2.32)

thanks,
Chenyu

Next message: Jonathan Cameron: "Re: [PATCH v2 3/3] i2c: muxes: i2c-mux-gpmux: Use of_get_i2c_adapter_by_node()"
Previous message: Qi Zheng: "Re: [PATCH v3 2/2] mm: memory_hotplug: drop memoryless node from fallback lists"
In reply to: Vincent Guittot: "Re: [PATCH v11 3/3] sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup"
Next in thread: Yicong Yang: "[PATCH v11 2/3] sched/fair: Scan cluster before scanning LLC in wake-up path"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]