Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
From: K Prateek Nayak
Date: Tue Nov 15 2022 - 06:29:07 EST
Hello Abel,
Thank you for taking a look at the report.
On 11/15/2022 2:01 PM, Abel Wu wrote:
> Hi Prateek, thanks very much for your detailed testing!
>
> On 11/14/22 1:45 PM, K Prateek Nayak wrote:
>> Hello Abel,
>>
>> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
>> (2 x 64C/128T)
>>
>> tl;dr
>>
>> o I do not notice any regressions with the standard benchmarks.
>> o schbench sees a nice improvement to the tail latency when the number
>> of workers is equal to the number of cores in the system in NPS1 and
>> NPS2 mode. (Marked with "^")
>> o A few data points show improvements in tbench in NPS1 and NPS2 mode.
>> (Marked with "^")
>>
>> I'm still in the process of running larger workloads. If there is any
>> specific workload you would like me to run on the test system, please
>> do let me know. Below is the detailed report:
>
> Nothing particular comes to mind, and I think testing larger workloads
> is great. Thanks!
>
>>
>> Following are the results from running standard benchmarks on a
>> dual socket Zen3 (2 x 64C/128T) machine configured in different
>> NPS modes.
>>
>> NPS Modes are used to logically divide a single socket into
>> multiple NUMA regions.
>> Following is the NUMA configuration for each NPS mode on the system:
>>
>> NPS1: Each socket is a NUMA node.
>> Total 2 NUMA nodes in the dual socket machine.
>>
>> Node 0: 0-63, 128-191
>> Node 1: 64-127, 192-255
>>
>> NPS2: Each socket is further logically divided into 2 NUMA regions.
>> Total 4 NUMA nodes exist over 2 sockets.
>> Node 0: 0-31, 128-159
>> Node 1: 32-63, 160-191
>> Node 2: 64-95, 192-223
>> Node 3: 96-127, 224-255
>>
>> NPS4: Each socket is logically divided into 4 NUMA regions.
>> Total 8 NUMA nodes exist over 2 sockets.
>> Node 0: 0-15, 128-143
>> Node 1: 16-31, 144-159
>> Node 2: 32-47, 160-175
>> Node 3: 48-63, 176-191
>> Node 4: 64-79, 192-207
>> Node 5: 80-95, 208-223
>> Node 6: 96-111, 224-239
>> Node 7: 112-127, 240-255
>>
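For reference, the per-node CPU lists above can be cross-checked against
sysfs. A minimal sketch in Python, assuming the standard Linux
/sys/devices/system/node layout:

  # Print each NUMA node's CPU list, e.g. "node0 0-31,128-159" in NPS2 mode
  import glob

  for path in sorted(glob.glob("/sys/devices/system/node/node*/cpulist")):
      node = path.split("/")[-2]
      with open(path) as f:
          print(node, f.read().strip())
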
>> Benchmark Results:
>>
>> Kernel versions:
>> - tip: 5.19.0 tip sched/core
>> - sis_core: 5.19.0 tip sched/core + this series
>>
>> When we started testing, the tip was at:
>> commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
>>
>> ~~~~~~~~~~~~~
>> ~ hackbench ~
>> ~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> Test: tip sis_core
>> 1-groups: 4.06 (0.00 pct) 4.26 (-4.92 pct) *
>> 1-groups: 4.14 (0.00 pct) 4.09 (1.20 pct) [Verification Run]
>> 2-groups: 4.76 (0.00 pct) 4.71 (1.05 pct)
>> 4-groups: 5.22 (0.00 pct) 5.11 (2.10 pct)
>> 8-groups: 5.35 (0.00 pct) 5.31 (0.74 pct)
>> 16-groups: 7.21 (0.00 pct) 6.80 (5.68 pct)
>>
>> o NPS2
>>
>> Test: tip sis_core
>> 1-groups: 4.09 (0.00 pct) 4.08 (0.24 pct)
>> 2-groups: 4.70 (0.00 pct) 4.69 (0.21 pct)
>> 4-groups: 5.05 (0.00 pct) 4.92 (2.57 pct)
>> 8-groups: 5.35 (0.00 pct) 5.26 (1.68 pct)
>> 16-groups: 6.37 (0.00 pct) 6.34 (0.47 pct)
>>
>> o NPS4
>>
>> Test: tip sis_core
>> 1-groups: 4.07 (0.00 pct) 3.99 (1.96 pct)
>> 2-groups: 4.65 (0.00 pct) 4.59 (1.29 pct)
>> 4-groups: 5.13 (0.00 pct) 5.00 (2.53 pct)
>> 8-groups: 5.47 (0.00 pct) 5.43 (0.73 pct)
>> 16-groups: 6.82 (0.00 pct) 6.56 (3.81 pct)
>
> Although each cpu will get 2.5 tasks in the 16-groups case, which can
> be considered overloaded, I tested on an AMD EPYC 7Y83 machine and
> the total cpu usage was ~82% (with some older kernel version),
> so there is still lots of idle time.
>
> I guess cutting off at 16-groups is because it's loaded enough
> compared to real workloads, so testing more groups might just
> be a waste of time?
The machine has 16 LLCs, and I had previously seen some run-to-run
variance with larger group counts, so I capped the reported results at
16-groups. I'll run hackbench with larger group counts (32, 64, 128,
256) and get back to you with those results, along with results for a
couple of long-running workloads.
>
> Thanks & Best Regards,
> Abel
>
> [..snip..]
>
--
Thanks and Regards,
Prateek