Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
From: K Prateek Nayak
Date: Tue Nov 15 2022 - 06:29:07 EST
Hello Abel,
Thank you for taking a look at the report.
On 11/15/2022 2:01 PM, Abel Wu wrote:
> Hi Prateek, thanks very much for your detailed testing!
>
> On 11/14/22 1:45 PM, K Prateek Nayak wrote:
>> Hello Abel,
>>
>> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
>> (2 x 64C/128T)
>>
>> tl;dr
>>
>> o I do not notice any regressions with the standard benchmarks.
>> o schbench sees a nice improvement to the tail latency when the number
>> of workers is equal to the number of cores in the system in NPS1 and
>> NPS2 mode. (Marked with "^")
>> o A few data points show improvements in tbench in NPS1 and NPS2 mode.
>> (Marked with "^")
>>
>> I'm still in the process of running larger workloads. If there is any
>> specific workload you would like me to run on the test system, please
>> do let me know. Below is the detailed report:
>
> Nothing particular comes to mind, and I think testing larger workloads
> is great. Thanks!
>
>>
>> Following are the results from running standard benchmarks on a
>> dual socket Zen3 (2 x 64C/128T) machine configured in different
>> NPS modes.
>>
>> NPS Modes are used to logically divide a single socket into
>> multiple NUMA regions.
>> Following is the NUMA configuration for each NPS mode on the system:
>>
>> NPS1: Each socket is a NUMA node.
>> Total 2 NUMA nodes in the dual socket machine.
>>
>> Node 0: 0-63, 128-191
>> Node 1: 64-127, 192-255
>>
>> NPS2: Each socket is further logically divided into 2 NUMA regions.
>> Total 4 NUMA nodes exist over 2 sockets.
>> Node 0: 0-31, 128-159
>> Node 1: 32-63, 160-191
>> Node 2: 64-95, 192-223
>> Node 3: 96-127, 224-255
>>
>> NPS4: Each socket is logically divided into 4 NUMA regions.
>> Total 8 NUMA nodes exist over 2 sockets.
>> Node 0: 0-15, 128-143
>> Node 1: 16-31, 144-159
>> Node 2: 32-47, 160-175
>> Node 3: 48-63, 176-191
>> Node 4: 64-79, 192-207
>> Node 5: 80-95, 208-223
>> Node 6: 96-111, 224-239
>> Node 7: 112-127, 240-255
>>
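For reference, the per-node CPU lists above can be cross-checked against
sysfs. A minimal sketch in Python, assuming the standard Linux
/sys/devices/system/node layout:

  # Print each NUMA node's CPU list, e.g. "node0 0-31,128-159" in NPS2 mode
  import glob

  for path in sorted(glob.glob("/sys/devices/system/node/node*/cpulist")):
      node = path.split("/")[-2]
      with open(path) as f:
          print(node, f.read().strip())
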
>> Benchmark Results:
>>
>> Kernel versions:
>> - tip: 5.19.0 tip sched/core
>> - sis_core: 5.19.0 tip sched/core + this series
>>
>> When we started testing, the tip was at:
>> commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
>>
>> ~~~~~~~~~~~~~
>> ~ hackbench ~
>> ~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> Test: tip sis_core
>> 1-groups: 4.06 (0.00 pct) 4.26 (-4.92 pct) *
>> 1-groups: 4.14 (0.00 pct) 4.09 (1.20 pct) [Verification Run]
>> 2-groups: 4.76 (0.00 pct) 4.71 (1.05 pct)
>> 4-groups: 5.22 (0.00 pct) 5.11 (2.10 pct)
>> 8-groups: 5.35 (0.00 pct) 5.31 (0.74 pct)
>> 16-groups: 7.21 (0.00 pct) 6.80 (5.68 pct)
>>
>> o NPS2
>>
>> Test: tip sis_core
>> 1-groups: 4.09 (0.00 pct) 4.08 (0.24 pct)
>> 2-groups: 4.70 (0.00 pct) 4.69 (0.21 pct)
>> 4-groups: 5.05 (0.00 pct) 4.92 (2.57 pct)
>> 8-groups: 5.35 (0.00 pct) 5.26 (1.68 pct)
>> 16-groups: 6.37 (0.00 pct) 6.34 (0.47 pct)
>>
>> o NPS4
>>
>> Test: tip sis_core
>> 1-groups: 4.07 (0.00 pct) 3.99 (1.96 pct)
>> 2-groups: 4.65 (0.00 pct) 4.59 (1.29 pct)
>> 4-groups: 5.13 (0.00 pct) 5.00 (2.53 pct)
>> 8-groups: 5.47 (0.00 pct) 5.43 (0.73 pct)
>> 16-groups: 6.82 (0.00 pct) 6.56 (3.81 pct)
>
> Although each cpu will get 2.5 tasks in the 16-groups case, which can
> be considered overloaded, I tested on an AMD EPYC 7Y83 machine and
> the total cpu usage was ~82% (with some older kernel version),
> so there is still lots of idle time.
>
> I guess cutting off at 16-groups is because it's loaded enough
> compared to real workloads, so testing more groups might just
> be a waste of time?
The machine has 16 LLCs, and I had previously seen some run-to-run
variance with larger group counts, so I capped the reported results at
16-groups. I'll run hackbench with larger group counts (32, 64, 128,
256) and get back to you with those results, along with results for a
couple of long-running workloads.
>
> Thanks & Best Regards,
> Abel
>
> [..snip..]
>
--
Thanks and Regards,
Prateek