[PATCH 0/2] sched/fair: Limit access to overutilized

From: Shrikanth Hegde
Date: Fri Feb 23 2024 - 10:08:24 EST


When running an ISV workload on a large system (240 cores, SMT8), the
perf profile showed newidle_balance and enqueue_task_fair consuming more
cycles than expected. Perf annotate showed that most of the time was
spent accessing the overutilized field of the root domain.

Aboorva was able to reproduce a similar perf profile by making some
changes to stress-ng --wait. Both newidle_balance and enqueue_task_fair
consume close to 5-7% of cycles. Perf annotate shows that most of the
cycles are spent accessing rd and the rd->overutilized field.

perf profile:
7.18% swapper [kernel.vmlinux] [k] enqueue_task_fair
6.78% s [kernel.vmlinux] [k] newidle_balance

perf annotate of enqueue_task_fair:
1.66 : c000000000223ba4: beq c000000000223c50 <enqueue_task_fair+0x238>
: 6789 update_overutilized_status():
: 6675 if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
95.42 : c000000000223ba8: ld r8,2752(r28)
0.08 : c000000000223bac: lwz r9,540(r8)
Debugging it further, in enqueue_task_fair:
ld r8,2752(r28) <-- loads rd
lwz r9,540(r8) <-- loads rd->overutilized
Frequent writes to rd from other CPUs cause cacheline contention, and
hence these loads could take more time.

Perf annotate of newidle_balance:
: 12333 sd = rcu_dereference_check_sched_domain(this_rq->sd);
41.54 : c000000000228070: ld r30,2760(r31)
: 12335 if (!READ_ONCE(this_rq->rd->overload) ||
0.07 : c000000000228074: lwz r9,536(r9)
Similarly, in newidle_balance,
ld r9,2752(r31) <-- loads rd
lwz r9,536(r9) <-- loads rd->overload
Though overutilized is not used in this function, the writes to
overutilized could cause the load of overload to take more time, since
both overload and overutilized are part of the same cacheline.
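As a rough check of the same-cacheline claim, the sketch below takes the
two displacements from the annotate output above (536 for overload, 540
for overutilized) and computes which cacheline each falls into. The
helper names are made up for illustration, and the sketch assumes rd
itself starts on a cacheline boundary. Line sizes of 64 and 128 bytes
are assumptions for typical x86 and powerpc parts.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: index of the cacheline a byte offset falls into,
 * assuming the containing structure is cacheline aligned. */
static inline size_t cacheline_index(size_t offset, size_t line_size)
{
	return offset / line_size;
}

/* Hypothetical helper: true when two byte offsets share a cacheline. */
static inline bool same_cacheline(size_t a, size_t b, size_t line_size)
{
	return cacheline_index(a, line_size) == cacheline_index(b, line_size);
}
```

With these offsets, same_cacheline(536, 540, 64) and
same_cacheline(536, 540, 128) both hold, so a store to one field
invalidates the line that readers of the other field need.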

overutilized was added for EAS (Energy Aware Scheduling) to choose
between EAS-aware load balancing and regular load balancing. Hence it
should only be updated if EAS is active.

As checked, on x86 and powerpc both overload and overutilized share the
same cacheline in rd. Updating overutilized is not required on non-EAS
platforms. Hence this patch can help reduce cacheline contention on such
archs.

Patch 1/2 is the main patch. It helps reduce the issue described above:
with it applied, neither function shows up in the profile. A comparison
with and without the patch is in the changelog. The patch also solved the
stated problem in the ISV workload and improved its throughput. The Fixes
tag 2802bf3cd936 may be removed if it causes issues with a clean backport
all the way. I didn't know what would be the right thing to do here.

Patch 2/2 is only code refactoring: it uses a helper function instead of
accessing the field directly, which makes it clear that the field is
accessed only with EAS. It depends on 1/2 being applied first.

Thanks to Aboorva Devarajan and Nysal Jan K A for helping in
recreating and debugging this issue and verifying the patch.

Shrikanth Hegde (2):
sched/fair: Add EAS checks before updating overutilized
sched/fair: Use helper function to access rd->overutilized

kernel/sched/fair.c | 50 +++++++++++++++++++++++++++++++++------------
1 file changed, 37 insertions(+), 13 deletions(-)

--
2.39.3