Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

From: Madadi Vineeth Reddy
Date: Tue Oct 17 2023 - 05:52:57 EST


Hi Chen Yu,

On 26/09/23 10:40, Chen Yu wrote:
> RFC -> v1:
> - drop RFC
> - Only record the short sleeping time for each task, to better honor the
> burst sleeping tasks. (Mathieu Desnoyers)
> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> (Mathieu Desnoyers, Aaron Lu)
> - Introduce a new helper function cache_hot_cpu() that considers
> rq->cache_hot_timeout. (Aaron Lu)
> - Add analysis of why inhibiting task migration could bring better throughput
> for some benchmarks. (Gautham R. Shenoy)
> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> (K Prateek Nayak)
>
> Thanks for your comments and review!
>
> ----------------------------------------------------------------------

Regarding making the scan for finding an idle cpu longer vs cache benefits,
I ran some benchmarks.

Tested the patch on a Power system with 12 cores, 96 CPUs in total.
The system has two NUMA nodes.

Below are some of the benchmark results:

schbench 99.0th percentile latency (lower is better)
========
case     load          baseline[pct imp](std%)    SIS_CACHE[pct imp](std%)
normal   1-mthreads     1.00 [   0.00]( 3.66)      1.00 [   0.00]( 1.71)
normal   2-mthreads     1.00 [   0.00]( 4.55)      1.02 [  -2.00]( 3.00)
normal   4-mthreads     1.00 [   0.00]( 4.77)      0.96 [  +4.00]( 4.27)
normal   6-mthreads     1.00 [   0.00](60.37)      2.66 [-166.00](23.67)


The schbench results show that the extra iterations spent searching for an idle CPU
in the select_idle_cpu() code path have little impact on wakeup latency; interestingly,
the numbers are slightly better for SIS_CACHE in the 4-mthreads case. I think we can
ignore the 6-mthreads case due to huge run-to-run variation.

producer_consumer avg time/access (lower is better)
========
loads per consumer iteration   baseline[pct imp](std%)    SIS_CACHE[pct imp](std%)
5                               1.00 [   0.00]( 0.00)      0.87 [ +13.00]( 1.92)
20                              1.00 [   0.00]( 0.00)      0.92 [  +8.00]( 0.00)
50                              1.00 [   0.00]( 0.00)      1.00 [   0.00]( 0.00)
100                             1.00 [   0.00]( 0.00)      1.00 [   0.00]( 0.00)

The patch's main goal of improving cache locality is reflected here: SIS_CACHE improves
this workload, mainly when the number of loads per consumer iteration is low.

hackbench normalized time in seconds (lower is better)
========
case              load        baseline[pct imp](std%)    SIS_CACHE[pct imp](std%)
process-pipe      1-groups     1.00 [   0.00]( 1.50)      1.02 [  -2.00]( 3.36)
process-pipe      2-groups     1.00 [   0.00]( 4.76)      0.99 [  +1.00]( 5.68)
process-sockets   1-groups     1.00 [   0.00]( 2.56)      1.00 [   0.00]( 0.86)
process-sockets   2-groups     1.00 [   0.00]( 0.50)      0.99 [  +1.00]( 0.96)
threads-pipe      1-groups     1.00 [   0.00]( 3.87)      0.71 [ +29.00]( 3.56)
threads-pipe      2-groups     1.00 [   0.00]( 1.60)      0.97 [  +3.00]( 3.44)
threads-sockets   1-groups     1.00 [   0.00]( 7.65)      0.99 [  +1.00]( 1.05)
threads-sockets   2-groups     1.00 [   0.00]( 3.12)      1.03 [  -3.00]( 1.70)

The hackbench results are similar on both kernels, except for a 29% improvement
in the threads-pipe case with 1 group.

Daytrader throughput (higher is better)
========

As per Ingo's suggestion, I ran a real-life workload, daytrader.

baseline:
===================================================================================
Instance 1
Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
================ =============== =============== ===============
10124.5 2 0 3970

SIS_CACHE:
===================================================================================
Instance 1
Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
================ =============== =============== ===============
10319.5 2 0 5771

In the above run, daytrader throughput was about 2% better with SIS_CACHE
(10319.5 vs. 10124.5, i.e. a ~1.9% gain).

Thanks and Regards
Madadi Vineeth Reddy