Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

From: Chen Yu
Date: Tue Oct 17 2023 - 07:10:23 EST


Hi Madadi,

On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
>
> On 26/09/23 10:40, Chen Yu wrote:
> > RFC -> v1:
> > - drop RFC
> > - Only record the short sleeping time for each task, to better honor the
> > burst sleeping tasks. (Mathieu Desnoyers)
> > - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> > (Mathieu Desnoyers, Aaron Lu)
> > - Introduce a new helper function cache_hot_cpu() that considers
> > rq->cache_hot_timeout. (Aaron Lu)
> > - Add analysis of why inhibiting task migration could bring better throughput
> > for some benchmarks. (Gautham R. Shenoy)
> > - Choose the first cache-hot CPU if all idle CPUs are cache-hot in
> > select_idle_cpu(), to avoid possible task stacking on the waker's CPU.
> > (K Prateek Nayak)
> >
> > Thanks for your comments and review!
> >
> > ----------------------------------------------------------------------
>
> Regarding the trade-off between a longer scan for an idle CPU and the cache
> benefits, I ran some benchmarks.
>

Thanks very much for your interest and your time on the patch.
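
In case it helps readers of the thread, the mechanism summarized in the quoted
changelog can be sketched as the simplified user-space model below. This is
only an illustration of the intent: the names cache_hot_timeout and
cache_hot_cpu() come from the changelog, while the structs and the
pick_idle_cpu() helper are simplifications of mine, not the actual patch code.

/*
 * Simplified user-space model of the SIS_CACHE idea -- not the actual
 * kernel code. A short-sleeping task reserves its previous CPU for
 * roughly its recorded sleep duration; the idle-CPU scan then prefers
 * idle CPUs that are not cache-hot.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct task {
        uint64_t avg_short_sleep_ns;    /* recorded only for short sleeps */
};

struct runqueue {
        int cpu;
        /* moves forward monotonically; the CPU is cache-hot until then */
        uint64_t cache_hot_timeout_ns;
};

/* On dequeue: a short-sleeping task extends its CPU's cache-hot window. */
static void task_sleeps(struct runqueue *rq, struct task *p, uint64_t now_ns)
{
        uint64_t until = now_ns + p->avg_short_sleep_ns;

        if (until > rq->cache_hot_timeout_ns)   /* keep it monotonic */
                rq->cache_hot_timeout_ns = until;
}

/* true if @rq is still reserved for a recently slept short-sleeping task */
static bool cache_hot_cpu(struct runqueue *rq, uint64_t now_ns)
{
        return now_ns < rq->cache_hot_timeout_ns;
}

/*
 * Model of the select_idle_cpu() change: skip idle-but-cache-hot CPUs;
 * if every idle CPU is cache-hot, fall back to the first one found, so
 * the wakee is not stacked on the waker's CPU.
 */
static int pick_idle_cpu(struct runqueue *rqs, int nr, uint64_t now_ns)
{
        int first_hot = -1;

        for (int i = 0; i < nr; i++) {
                if (!cache_hot_cpu(&rqs[i], now_ns))
                        return rqs[i].cpu;
                if (first_hot < 0)
                        first_hot = rqs[i].cpu;
        }
        return first_hot;
}

int main(void)
{
        struct runqueue rqs[2] = { { .cpu = 0 }, { .cpu = 1 } };
        struct task p = { .avg_short_sleep_ns = 1000 };

        task_sleeps(&rqs[0], &p, 0);    /* CPU0 stays hot until t=1000 */
        printf("picked CPU%d at t=500\n", pick_idle_cpu(rqs, 2, 500));
        printf("picked CPU%d at t=2000\n", pick_idle_cpu(rqs, 2, 2000));
        return 0;
}

The trade-off your benchmarks probe is visible in pick_idle_cpu(): honoring
the cache-hot reservation can lengthen the scan before an idle CPU is chosen.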

> Tested the patch on a Power system with 12 cores, 96 CPUs in total.
> The system has two NUMA nodes.
>
> Below are some of the benchmark results
>
> schbench 99.0th latency (lower is better)
> ========
> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> normal 1-mthreads 1.00 [ 0.00]( 3.66) 1.00 [ 0.00]( 1.71)
> normal 2-mthreads 1.00 [ 0.00]( 4.55) 1.02 [ -2.00]( 3.00)
> normal 4-mthreads 1.00 [ 0.00]( 4.77) 0.96 [ +4.00]( 4.27)
> normal 6-mthreads 1.00 [ 0.00]( 60.37) 2.66 [ -166.00]( 23.67)
>
>
> schbench results show that there is not much impact on wakeup latency from the extra iterations
> spent searching for an idle CPU in the select_idle_cpu() code path. Interestingly, the numbers are
> slightly better for SIS_CACHE in the 4-mthreads case.

The 4% improvement is within the std%, so I suppose we did not see much difference in the 4-mthreads case.

> I think we can ignore the last case due to huge run-to-run variations.

Although the run-to-run variation is large, the decrease seems to fall within that range.
Prateek has also reported that when the system is overloaded there could be some regression
from schbench:
https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@xxxxxxx/
Could you also post the raw data printed by schbench? Using the latest schbench might also
report the latency in more detail.

> producer_consumer avg time/access (lower is better)
> ========
> loads per consumer iteration baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> 5 1.00 [ 0.00]( 0.00) 0.87 [ +13.0]( 1.92)
> 20 1.00 [ 0.00]( 0.00) 0.92 [ +8.00]( 0.00)
> 50 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
> 100 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
>
> The patch's main goal of improving cache locality is reflected in this workload: it is the only
> one where SIS_CACHE shows an improvement, mainly when the number of loads per consumer iteration is low.
>
> hackbench normalized time in seconds (lower is better)
> ========
> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> process-pipe 1-groups 1.00 [ 0.00]( 1.50) 1.02 [ -2.00]( 3.36)
> process-pipe 2-groups 1.00 [ 0.00]( 4.76) 0.99 [ +1.00]( 5.68)
> process-sockets 1-groups 1.00 [ 0.00]( 2.56) 1.00 [ 0.00]( 0.86)
> process-sockets 2-groups 1.00 [ 0.00]( 0.50) 0.99 [ +1.00]( 0.96)
> threads-pipe 1-groups 1.00 [ 0.00]( 3.87) 0.71 [ +29.0]( 3.56)
> threads-pipe 2-groups 1.00 [ 0.00]( 1.60) 0.97 [ +3.00]( 3.44)
> threads-sockets 1-groups 1.00 [ 0.00]( 7.65) 0.99 [ +1.00]( 1.05)
> threads-sockets 2-groups 1.00 [ 0.00]( 3.12) 1.03 [ -3.00]( 1.70)
>
> hackbench results are similar on both kernels, except for the 29% improvement in the
> threads-pipe case with 1 group.
>
> Daytrader throughput (higher is better)
> ========
>
> As per Ingo's suggestion, I ran a real-life workload, daytrader.
>
> baseline:
> ===================================================================================
> Instance 1
> Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
> ================ =============== =============== ===============
> 10124.5 2 0 3970
>
> SIS_CACHE:
> ===================================================================================
> Instance 1
> Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
> ================ =============== =============== ===============
> 10319.5 2 0 5771
>
> In the above run, daytrader performance was 2% better with SIS_CACHE.
>

Thanks for bringing this good news: a real-life workload benefits from this change.
I'll tune the patch a little to address the schbench regression. I should also mention
that I'm working with Mathieu on his proposal to make it easier for the wakee to choose
its previous CPU (similar to SIS_CACHE, but a little simpler), and we'll check how to
make more platforms benefit from this change:
https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@xxxxxxxxxxxx/
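
The direction of that proposal can be modeled roughly as below. This is my
loose paraphrase for illustration only -- the "almost idle" threshold and the
helper name here are assumptions on my part, and Mathieu's actual patch
differs in detail:

/*
 * Rough user-space model: bias wakeup placement towards the wakee's
 * previous CPU when that CPU's runqueue is almost idle, instead of
 * tracking per-task cache-hot timeouts as SIS_CACHE does.
 */
#include <stdio.h>

struct runqueue {
        int cpu;
        int nr_running;
};

static int select_wakeup_cpu(struct runqueue *prev_rq, int scanned_cpu)
{
        /* "almost idle": at most one task on the previous CPU */
        if (prev_rq->nr_running <= 1)
                return prev_rq->cpu;    /* keep the wakee's cache warm */
        return scanned_cpu;             /* otherwise use the normal scan result */
}

int main(void)
{
        struct runqueue prev = { .cpu = 3, .nr_running = 0 };

        printf("wakee placed on CPU%d\n", select_wakeup_cpu(&prev, 7));
        prev.nr_running = 4;
        printf("wakee placed on CPU%d\n", select_wakeup_cpu(&prev, 7));
        return 0;
}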

thanks,
Chenyu