Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

From: Chen Yu
Date: Thu Oct 19 2023 - 06:57:46 EST

In reply to: Madadi Vineeth Reddy: "Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup"

On 2023-10-19 at 01:02:16 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> On 17/10/23 16:39, Chen Yu wrote:
> > Hi Madadi,
> >
> > On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> >> Hi Chen Yu,
> >>
> >> On 26/09/23 10:40, Chen Yu wrote:
> >>> RFC -> v1:
> >>> - drop RFC
> >>> - Only record the short sleeping time for each task, to better honor the
> >>>   burst sleeping tasks. (Mathieu Desnoyers)
> >>> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> >>>   (Mathieu Desnoyers, Aaron Lu)
> >>> - Introduce a new helper function cache_hot_cpu() that considers
> >>>   rq->cache_hot_timeout. (Aaron Lu)
> >>> - Add analysis of why inhibiting task migration could bring better throughput
> >>>   for some benchmarks. (Gautham R. Shenoy)
> >>> - Choose the first cache-hot CPU if all idle CPUs are cache-hot in
> >>>   select_idle_cpu(), to avoid possible task stacking on the waker's CPU.
> >>>   (K Prateek Nayak)
> >>>
> >>> Thanks for your comments and review!
> >>>
> >>> ----------------------------------------------------------------------
> >>
> >> Regarding making the scan for finding an idle cpu longer vs cache benefits,
> >> I ran some benchmarks.
> >>
> >
> > Thanks very much for your interest and your time on the patch.
> >
> >> Tested the patch on a Power system with 12 cores, 96 CPUs in total.
> >> The system has two NUMA nodes.
> >>
> >> Below are some of the benchmark results
> >>
> >> schbench 99.0th latency (lower is better)
> >> ========
> >> case     load          baseline [pct imp] (std%)     SIS_CACHE [pct imp] (std%)
> >> normal   1-mthreads    1.00 [   0.00] ( 3.66)        1.00 [   0.00] ( 1.71)
> >> normal   2-mthreads    1.00 [   0.00] ( 4.55)        1.02 [  -2.00] ( 3.00)
> >> normal   4-mthreads    1.00 [   0.00] ( 4.77)        0.96 [  +4.00] ( 4.27)
> >> normal   6-mthreads    1.00 [   0.00] (60.37)        2.66 [-166.00] (23.67)
> >>
> >>
> >> The schbench results show little impact on wakeup latency from the extra
> >> iterations spent searching for an idle CPU in the select_idle_cpu() code path.
> >> Interestingly, the numbers are slightly better with SIS_CACHE in the
> >> 4-mthreads case.
> >
> > The 4% improvement is within std%, so I suppose there is no real difference in the 4-mthreads case.
> >
> >> I think we can ignore the last case due to the huge run-to-run variation.
> >
> > Although the run-to-run variation is large, it seems that the decrease is within that range.
> > Prateek has also reported that when the system is overloaded there could be some regression
> > from schbench:
> > https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@xxxxxxx/
> > Could you also post the raw data printed by schbench? Also, the latest schbench might
> > report the latency in more detail.
> >
>
> raw data by schbench(old) with 6-mthreads
> ======================
>
> Baseline (5 runs)
> ========
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 981
> 99.5000th: 4424
> 99.9000th: 9200
> min=0, max=29497
>
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 29
> 90.0000th: 35
> 95.0000th: 38
> *99.0000th: 495
> 99.5000th: 3924
> 99.9000th: 9872
> min=0, max=29997
>
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 30
> 90.0000th: 36
> 95.0000th: 39
> *99.0000th: 1326
> 99.5000th: 4744
> 99.9000th: 10000
> min=0, max=23394
>
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 55
> 99.5000th: 3292
> 99.9000th: 9104
> min=0, max=25196
>
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 711
> 99.5000th: 4600
> 99.9000th: 9424
> min=0, max=19997
>
> SIS_CACHE (5 runs)
> =========
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 30
> 90.0000th: 35
> 95.0000th: 38
> *99.0000th: 1894
> 99.5000th: 5464
> 99.9000th: 10000
> min=0, max=19157
>
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 2396
> 99.5000th: 6664
> 99.9000th: 10000
> min=0, max=24029
>
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 2132
> 99.5000th: 6296
> 99.9000th: 10000
> min=0, max=25313
>
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 1090
> 99.5000th: 6232
> 99.9000th: 9744
> min=0, max=27264
>
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 38
> *99.0000th: 1786
> 99.5000th: 5240
> 99.9000th: 9968
> min=0, max=24754
>
> As noted, the above data has large run-to-run variation, and in general the
> 99th percentile latency is higher with SIS_CACHE.
>
>
> schbench(new) with 6-mthreads
> =============
>
> Baseline
> ========
> Wakeup Latencies percentiles (usec) runtime 30 (s) (209403 total samples)
> 50.0th: 8 (43672 samples)
> 90.0th: 13 (83908 samples)
> * 99.0th: 20 (18323 samples)
> 99.9th: 775 (1785 samples)
> min=1, max=8400
> Request Latencies percentiles (usec) runtime 30 (s) (209543 total samples)
> 50.0th: 13648 (59873 samples)
> 90.0th: 14000 (82767 samples)
> * 99.0th: 14320 (16342 samples)
> 99.9th: 18720 (1670 samples)
> min=5130, max=38334
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
> 20.0th: 6968 (8 samples)
> * 50.0th: 6984 (23 samples)
> 90.0th: 6984 (0 samples)
> min=6835, max=6991
> average rps: 6984.77
>
>
> SIS_CACHE
> =========
> Wakeup Latencies percentiles (usec) runtime 30 (s) (209295 total samples)
> 50.0th: 9 (49267 samples)
> 90.0th: 14 (86522 samples)
> * 99.0th: 21 (14091 samples)
> 99.9th: 1146 (1722 samples)
> min=1, max=10427
> Request Latencies percentiles (usec) runtime 30 (s) (209432 total samples)
> 50.0th: 13616 (62838 samples)
> 90.0th: 14000 (85301 samples)
> * 99.0th: 14352 (16149 samples)
> 99.9th: 21408 (1660 samples)
> min=5070, max=41866
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
> 20.0th: 6968 (7 samples)
> * 50.0th: 6984 (21 samples)
> 90.0th: 6984 (0 samples)
> min=6672, max=6996
> average rps: 6981.07
>
> With the new schbench, I didn't observe run-to-run variation, and there was no
> regression with SIS_CACHE at the 99th percentile.
>

Thanks for the test, Madadi. In my opinion we can stick with the new schbench
from now on. I'll double-check on my test machine.
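
For reference, here is a rough sketch of how the old-schbench summary row can be
cross-checked against the raw 99.0th values quoted above. It assumes that std%
is the population standard deviation as a percentage of the mean, and that
"pct imp" for a lower-is-better metric is (baseline - new) / baseline. Both are
my assumptions about how the table was produced, and the helper names are made
up for illustration. Under those assumptions the baseline runs give a std% of
about 60.37, which agrees with the table.

```python
from statistics import mean, pstdev

# 99.0th percentile latencies (usec) from the five old-schbench runs quoted above.
baseline = [981, 495, 1326, 55, 711]
sis_cache = [1894, 2396, 2132, 1090, 1786]


def summarize(samples):
    """Return (mean, std%) with std% = population std dev as a percentage of the mean.

    Using the population std dev is an assumption about how the table was built.
    """
    m = mean(samples)
    return m, 100.0 * pstdev(samples) / m


def pct_improvement(base_mean, new_mean):
    """Lower-is-better metric: a positive result means the new kernel is faster."""
    return 100.0 * (base_mean - new_mean) / base_mean


def within_noise(pct_imp, base_std_pct, new_std_pct):
    """Hypothetical rule of thumb: a change no larger than the bigger std% is noise."""
    return abs(pct_imp) <= max(base_std_pct, new_std_pct)


base_mean, base_std_pct = summarize(baseline)
sis_mean, sis_std_pct = summarize(sis_cache)
imp = pct_improvement(base_mean, sis_mean)

print(f"baseline : mean={base_mean:.1f} usec, std%={base_std_pct:.2f}")
print(f"SIS_CACHE: mean={sis_mean:.1f} usec, std%={sis_std_pct:.2f}")
print(f"normalized SIS_CACHE={sis_mean / base_mean:.2f}, pct imp={imp:+.2f}")

# 4-mthreads row taken directly from the summary table: +4.00 vs std% 4.77 / 4.27.
print("4-mthreads within noise:", within_noise(4.00, 4.77, 4.27))
```

The last line applies the noise check only to the 4-mthreads row, where the
improvement is smaller than either std%.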

thanks,
Chenyu
