Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

From: Tim Chen
Date: Fri Jun 11 2021 - 16:00:40 EST

On 5/12/21 6:59 AM, Qais Yousef wrote:
> On 05/11/21 10:25, Tim Chen wrote:
>>> update_next_balance() is only used in newidle_balance() so we could
>>> modify it to have:
>>>
>>> next = max(jiffies+1, next = sd->last_balance + interval)
>>
>> Is the extra assignment "next = sd->last_balance + interval" needed?
>> This seems more straight forward:
>>
>> next = max(jiffies+1, sd->last_balance + interval)
>
> I haven't been following the whole conversation closely, but it's always
> interesting when manipulating time in non time_*() functions.
>
> Is this max() safe against wrapping?
>

Vincent,

Sorry I haven't gotten back sooner. I was finally able to get some test
time on the test system. The fix corrects the next balance time going
backwards, but the frequency of balancing remains the same, so we don't
see a performance improvement.

I incorporated Qais' suggestion to take care of the wrap-around time
(see patch #1) in the patches below. This patch by itself prevented
next_balance from going backward. However, most of the time the next
balance occurs immediately in the next jiffy after newidle_balance
occurred, so we still do the same amount of load balancing as the
vanilla kernel on the OLTP workload I was looking at. I didn't see a
performance improvement with just patch #1 and patch #2.
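For reference, the wrap-safety concern Qais raised can be illustrated
with a small userspace sketch. This is not the kernel's code; it just
mimics the signed-difference trick behind the kernel's time_after()
macro, which a plain unsigned max() lacks:

```c
#include <assert.h>

/* Illustrative stand-ins, not the kernel's code: in the kernel,
 * jiffies is an unsigned long and time_after(a, b) is roughly
 * ((long)((b) - (a)) < 0), which stays correct across counter
 * wraparound, whereas a plain unsigned max() does not. */
typedef unsigned long jiffy_t;

static int jiffy_after(jiffy_t a, jiffy_t b)
{
	return (long)(b - a) < 0;	/* true if a is logically later than b */
}

static jiffy_t jiffy_max(jiffy_t a, jiffy_t b)
{
	return jiffy_after(a, b) ? a : b;
}
```

With a pre-wrap value like ULONG_MAX - 6 and a post-wrap value of 5,
jiffy_max() picks the logically later post-wrap value, while a plain
max() would pick the huge stale pre-wrap one.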

The current logic is that when a CPU becomes idle, the next balance
occurs very shortly afterward (usually in the next jiffy), because
get_sd_balance_interval() returns a next balance point in the next
jiffy if the CPU is idle. In reality, however, I saw that most CPUs
are 95% busy on average for my workload, and a task will wake up on
an idle CPU shortly. So frequent idle balancing toward briefly idle
CPUs is counterproductive: it simply increases overhead and does not
improve performance.
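A minimal userspace sketch of that behavior (the interval values and
the helper name are illustrative, not the actual
get_sd_balance_interval() code):

```c
#include <assert.h>

/* Illustrative sketch, not the kernel's code: the idle balance
 * interval is much shorter than the busy one, so for an idle CPU
 * last_balance + interval is usually already in the past and the
 * next balance gets clamped to the very next jiffy. */
static unsigned long next_balance_for(unsigned long now,
				      unsigned long last_balance,
				      int cpu_is_idle)
{
	/* illustrative interval lengths, in jiffies */
	unsigned long interval = cpu_is_idle ? 1UL : 64UL;
	unsigned long next = last_balance + interval;

	/* never schedule a balance in the past
	 * (wraparound ignored here for brevity) */
	if (next <= now)
		next = now + 1;
	return next;
}
```

An idle CPU that last balanced at jiffy 990 and is checked at jiffy
1000 gets its next balance at jiffy 1001, i.e. immediately, while a
busy CPU with the same history would wait until jiffy 1054.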

I tried a patch (patch 3) on top of the other patches. It improved
performance by 5%, which is quite significant for my OLTP workload.
The patch considers a CPU busy, when determining the next balance
interval, if its average utilization is more than 80%. This tweak may
not be ideal for the case where a CPU becomes idle after a CPU-intensive
task has dominated it for a long time and will block for a while.
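The idea in patch 3 can be sketched like this (the 80% threshold is
from the patch; the helper names are illustrative, and
SCHED_CAPACITY_SCALE is the kernel's 1024 full-capacity unit):

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL	/* kernel's full-capacity unit */

/* Illustrative sketch, not the actual patch: treat a CPU whose
 * average utilization exceeds 80% of capacity as busy when picking
 * the balance interval, so a briefly idle but mostly busy CPU gets
 * the longer busy interval instead of an immediate rebalance. */
static int cpu_mostly_busy(unsigned long avg_util)
{
	return avg_util * 5 > SCHED_CAPACITY_SCALE * 4;	/* util > 80% */
}

static unsigned long pick_balance_interval(unsigned long busy_interval,
					   unsigned long idle_interval,
					   unsigned long avg_util)
{
	return cpu_mostly_busy(avg_util) ? busy_interval : idle_interval;
}
```

A CPU averaging util 900/1024 (~88%) gets the busy interval even if it
just went idle; one at 500/1024 (~49%) keeps the short idle interval.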

Hopefully we can find a way to make a good judgement on whether we
have a mostly busy CPU that has just become idle, with a task likely
to wake up on it soon. In that case, we should push out the next
balance time. Such logic is absent today in the idle load balance
path, and this frequent load balancing hurts performance when cgroups
are enabled, because computing update_blocked_averages() before load
balancing becomes expensive. For my OLTP workload, we lose 9% of
performance when cgroups are turned on.

Tim


----