Re: [RFC PATCH 1/1] sched: Extend cpu idle state for 1ms

From: Shrikanth Hegde
Date: Wed Jul 26 2023 - 13:41:00 EST




On 7/26/23 7:37 PM, Mathieu Desnoyers wrote:
> On 7/26/23 04:04, Shrikanth Hegde wrote:
>>
>>
>> On 7/26/23 1:00 AM, Mathieu Desnoyers wrote:
>>> Allow select_task_rq to consider a cpu as idle for 1ms after that cpu
>>> has exited the idle loop.
>>>
>>> This speeds up the following hackbench workload on a 192-core AMD EPYC
>>> 9654 96-Core Processor system (across 2 sockets):
>>>
>>> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
>>>
>>> from 49s to 34s. (30% speedup)
>>>
>>> My working hypothesis for why this helps is: queuing more than a single
>>> task on the runqueue of a cpu which just exited idle, rather than
>>> spreading work over other idle cpus, helps power efficiency on systems
>>> with a large number of cores.
>>>
>>> This was developed as part of the investigation into a weird regression
>>> reported by AMD where adding a raw spinlock in the scheduler context
>>> switch accelerated hackbench.

Do you have SMT here? What is the system utilization when you are running
this workload?

>>>
>>> It turned out that replacing this raw spinlock with a loop of 10000x
>>> cpu_relax() within do_idle() had similar benefits.
>>>
>>> This patch achieves a similar effect without the busy-waiting by
>>> introducing a runqueue state sampling sched_clock() when exiting
>>> idle, which allows select_task_rq to consider "as idle" a cpu which has
>>> recently exited idle.
>>>
>>> This patch should be considered "food for thought", and I would be glad
>>> to hear feedback on whether it causes regressions on _other_ workloads,
>>> and whether it helps with the hackbench workload on large Intel systems
>>> as well.
>>>
>>> Link:
>>> https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@xxxxxxx
>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
>>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>>> Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
>>> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
>>> Cc: Ben Segall <bsegall@xxxxxxxxxx>
>>> Cc: Mel Gorman <mgorman@xxxxxxx>
>>> Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
>>> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>>> Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
>>> Cc: Swapnil Sapkal <Swapnil.Sapkal@xxxxxxx>
>>> Cc: Aaron Lu <aaron.lu@xxxxxxxxx>
>>> Cc: x86@xxxxxxxxxx
>>> ---
>>>   kernel/sched/core.c  | 4 ++++
>>>   kernel/sched/sched.h | 3 +++
>>>   2 files changed, 7 insertions(+)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index a68d1276bab0..d40e3a0a5ced 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -6769,6 +6769,7 @@ void __sched schedule_idle(void)
>>>        * TASK_RUNNING state.
>>>        */
>>>       WARN_ON_ONCE(current->__state);
>>> +    WRITE_ONCE(this_rq()->idle_end_time, sched_clock());
>>>       do {
>>>           __schedule(SM_NONE);
>>>       } while (need_resched());
>>> @@ -7300,6 +7301,9 @@ int idle_cpu(int cpu)
>>>   {
>>>       struct rq *rq = cpu_rq(cpu);
>>> +    if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS)
>>
>>
>> Wouldn't this hurt latency badly? Especially on a loaded system with a
>> workload that does a lot of wakeups.
>
> Good point !
>
> Can you try your benchmark replacing the if () statement above by:
>
> +       if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS &&
> +           READ_ONCE(rq->nr_running) <= 4)
> +               return 1;
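
Applied, the check in idle_cpu() ends up looking roughly like this on top of
6.5-rc3 (untested sketch of the combination, just for clarity; the cap of 4
is the value you proposed):

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/*
	 * Treat a cpu which exited idle less than IDLE_CPU_DELAY_NS ago
	 * as still idle, but only while its runqueue stays shallow, to
	 * bound the wakeup latency added by stacking tasks on it.
	 */
	if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS &&
	    READ_ONCE(rq->nr_running) <= 4)
		return 1;

	if (rq->curr != rq->idle)
		return 0;

	if (rq->nr_running)
		return 0;

#ifdef CONFIG_SMP
	if (rq->ttwu_pending)
		return 0;
#endif

	return 1;
}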


Tried with this change. I think it does help in reducing latency compared to
the earlier version, especially up to the 95th percentile.


              6.5-rc3   6.5-rc3+RFC_Patch   6.5-rc3+RFC_Patch+nr<4
4 Groups
50.0th:         18.00               18.50                    18.50
75.0th:         21.50               26.00                    23.50
90.0th:         56.00              940.50                   501.00
95.0th:        678.00             1896.00                  1392.00
99.0th:       2484.00             3756.00                  3708.00
99.5th:       3224.00             4616.00                  5088.00
99.9th:       4960.00             6824.00                  8068.00
8 Groups
50.0th:         23.50               25.50                    23.00
75.0th:         30.50              421.50                    30.50
90.0th:        443.50             1722.00                   741.00
95.0th:       1410.00             2736.00                  1670.00
99.0th:       3942.00             5496.00                  4032.00
99.5th:       5232.00             7016.00                  5064.00
99.9th:       7996.00             8896.00                  8012.00
16 Groups
50.0th:         33.50               41.50                    32.50
75.0th:         49.00              752.00                    47.00
90.0th:       1067.50             2332.00                   994.50
95.0th:       2093.00             3468.00                  2117.00
99.0th:       5048.00             6728.00                  5568.00
99.5th:       6760.00             7624.00                  6960.00
99.9th:       8592.00             9504.00                 11104.00
32 Groups
50.0th:         60.00               79.00                    53.00
75.0th:        456.50             1712.00                   209.50
90.0th:       2788.00             3996.00                  2752.00
95.0th:       4544.00             5768.00                  5024.00
99.0th:       8444.00             9104.00                 10352.00
99.5th:       9168.00             9808.00                 12720.00
99.9th:      11984.00            12448.00                 17624.00

>
> It speeds up the hackbench test-case even more here. It's now 30s, and it
> should improve tail latency.
>
> Thanks,
>
> Mathieu
>
>
>>
>> ran schbench on a 50% loaded system with stress-ng (there could be a
>> better benchmark to measure latency). I see that latency takes a hit,
>> especially tail latencies. Full log below with different schbench groups.
>>
>>              6.5-rc3        6.5-rc3+this patch
>>
>> Groups: 1
>> 50.0th:                 14.0              13.0
>> 75.0th:                 16.0              16.0
>> 90.0th:                 19.5              20.0
>> 95.0th:                 53.0              226.0
>> 99.0th:                 1969.0            2165.0
>> 99.5th:                 2912.0            2648.0
>> 99.9th:                 4680.0            4142.0
>>
>> Groups: 2
>> 50.0th:                 15.5              15.5
>> 75.0th:                 18.0              19.5
>> 90.0th:                 25.5              497.0
>> 95.0th:                 323.0             1384.0
>> 99.0th:                 2055.0            3144.0
>> 99.5th:                 2972.0            4014.0
>> 99.9th:                 6026.0            6560.0
>>
>> Groups: 4
>> 50.0th:                 18.0              18.5
>> 75.0th:                 21.5              26.0
>> 90.0th:                 56.0              940.5
>> 95.0th:                 678.0             1896.0
>> 99.0th:                 2484.0            3756.0
>> 99.5th:                 3224.0            4616.0
>> 99.9th:                 4960.0            6824.0
>>
>> Groups: 8
>> 50.0th:                 23.5              25.5
>> 75.0th:                 30.5              421.5
>> 90.0th:                 443.5             1722.0
>> 95.0th:                 1410.0            2736.0
>> 99.0th:                 3942.0            5496.0
>> 99.5th:                 5232.0            7016.0
>> 99.9th:                 7996.0            8896.0
>>
>> Groups: 16
>> 50.0th:                 33.5              41.5
>> 75.0th:                 49.0              752.0
>> 90.0th:                 1067.5            2332.0
>> 95.0th:                 2093.0            3468.0
>> 99.0th:                 5048.0            6728.0
>> 99.5th:                 6760.0            7624.0
>> 99.9th:                 8592.0            9504.0
>>
>> Groups: 32
>> 50.0th:                 60.0              79.0
>> 75.0th:                 456.5             1712.0
>> 90.0th:                 2788.0            3996.0
>> 95.0th:                 4544.0            5768.0
>> 99.0th:                 8444.0            9104.0
>> 99.5th:                 9168.0            9808.0
>> 99.9th:                 11984.0           12448.0
>>
>>
>>> +        return 1;
>>> +
>>>       if (rq->curr != rq->idle)
>>>           return 0;
>>>   diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 81ac605b9cd5..8932e198a33a 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -97,6 +97,8 @@
>>>   # define SCHED_WARN_ON(x)      ({ (void)(x), 0; })
>>>   #endif
>>>   +#define IDLE_CPU_DELAY_NS    1000000        /* 1ms */
>>> +
>>>   struct rq;
>>>   struct cpuidle_state;
>>>   @@ -1010,6 +1012,7 @@ struct rq {
>>>         struct task_struct __rcu    *curr;
>>>       struct task_struct    *idle;
>>> +    u64            idle_end_time;

There is already a clock_idle field in the rq. Could that be used for the same
purpose?
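
Something like this is what I had in mind (untested sketch; one caveat is that
clock_idle is snapshotted when the cpu *enters* idle, in
_update_idle_rq_clock_pelt(), and it is an rq_clock() value rather than a raw
sched_clock(), so the window would be measured from idle entry, not idle
exit):

	/*
	 * Sketch: reuse rq->clock_idle rather than adding a new field.
	 * clock_idle is written on idle entry (see
	 * _update_idle_rq_clock_pelt()), and on 32-bit it has to be read
	 * with u64_u32_load().
	 */
	if (sched_clock_cpu(cpu) < u64_u32_load(rq->clock_idle) + IDLE_CPU_DELAY_NS &&
	    READ_ONCE(rq->nr_running) <= 4)
		return 1;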

>>>       struct task_struct    *stop;
>>>       unsigned long        next_balance;
>>>       struct mm_struct    *prev_mm;
>