[RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use

From: Tejun Heo
Date: Mon Apr 24 2017 - 16:14:01 EST


Hello,

We've noticed scheduling latency spikes when cgroups are in use, even
when the machine is mostly idle, the scheduling frequency is moderate
and there is only a single level of cgroup nesting. More details are
in the patch descriptions, but here's a schbench run from the root
cgroup.

# ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
50.0000th: 26
75.0000th: 62
90.0000th: 74
95.0000th: 86
*99.0000th: 887
99.5000th: 3692
99.9000th: 10832
min=0, max=13374

And here's one from inside a first-level cgroup.

# ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
50.0000th: 31
75.0000th: 65
90.0000th: 71
95.0000th: 91
*99.0000th: 7288
99.5000th: 10352
99.9000th: 12496
min=0, max=13023

The p99 latency spike was tracked down to runnable_load_avg not being
propagated through nested cfs_rqs, which left load_balance() operating
on out-of-sync load information. It ended up picking the wrong CPU as
the load-balance target often enough to significantly impact p99
latency.
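
For reference, load_balance() judges a CPU's load from its root
cfs_rq's runnable_load_avg (via weighted_cpuload()), so task load that
never makes it out of a nested cfs_rq is invisible to it. Here's a
userspace toy, with made-up names and none of the real PELT decay or
weight scaling, of how that stale view leads to the wrong pick:

/*
 * Userspace toy, not kernel code: struct and function names are made
 * up and there's no PELT decay or weight scaling.  It only shows that
 * when task load sitting in a nested cfs_rq never reaches the root's
 * runnable_load_avg, the balancer's comparison goes wrong.
 */
#include <stdio.h>

struct cpu {
	long reported_load;	/* root cfs_rq's runnable_load_avg */
	long real_load;		/* incl. tasks in nested cfs_rqs */
};

static int pick_busiest(const struct cpu *cpus, int nr)
{
	int i, busiest = 0;

	for (i = 1; i < nr; i++)
		if (cpus[i].reported_load > cpus[busiest].reported_load)
			busiest = i;
	return busiest;
}

int main(void)
{
	struct cpu cpus[2] = {
		/* CPU0: two tasks in a nested cfs_rq, load not propagated */
		{ .reported_load = 0,    .real_load = 2048 },
		/* CPU1: one task queued directly on the root cfs_rq */
		{ .reported_load = 1024, .real_load = 1024 },
	};
	int b = pick_busiest(cpus, 2);

	/* prints CPU1 although CPU0 is actually twice as loaded */
	printf("busiest by reported load: CPU%d (real load %ld vs %ld)\n",
	       b, cpus[b].real_load, cpus[!b].real_load);
	return 0;
}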

This patchset fixes the issue by always propagating runnable_load_avg
so that, regardless of nesting, every cfs_rq's runnable_load_avg is
the sum of the scaled loads of all tasks queued below it.
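
To make that invariant concrete, here's the same kind of toy sketch
(again with made-up names, and ignoring PELT decay and the scaling by
group weight that the real code does) of a runnable load change being
propagated from a nested cfs_rq all the way up to the root:

/*
 * Userspace toy, not kernel code: propagate every runnable load
 * change up the cfs_rq hierarchy so that each level's
 * runnable_load_avg stays the sum of the task loads queued below it.
 */
#include <stdio.h>

struct cfs_rq {
	long runnable_load_avg;
	struct cfs_rq *parent;		/* NULL at the per-CPU root */
};

static void enqueue(struct cfs_rq *rq, long load, int propagate)
{
	rq->runnable_load_avg += load;
	if (propagate)
		for (rq = rq->parent; rq; rq = rq->parent)
			rq->runnable_load_avg += load;
}

int main(void)
{
	struct cfs_rq root  = { 0, NULL  };	/* what the balancer reads */
	struct cfs_rq group = { 0, &root };	/* first-level cgroup */

	enqueue(&group, 1024, 0);	/* without propagation: root goes stale */
	printf("unpropagated: root=%ld group=%ld\n",
	       root.runnable_load_avg, group.runnable_load_avg);

	root.runnable_load_avg = group.runnable_load_avg = 0;

	enqueue(&group, 1024, 1);	/* with propagation: root sees the task */
	printf("propagated:   root=%ld group=%ld\n",
	       root.runnable_load_avg, group.runnable_load_avg);
	return 0;
}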

As a side effect, this changes the load_avg behavior of sched_entities
which are associated with cfs_rqs (i.e. group entities). It doesn't
seem wrong to me and I can't think of a better / cleaner way, but if
there is one, please let me know.

This patchset is on top of v4.11-rc8 and contains the following two
patches.

sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
sched/fair: Always propagate runnable_load_avg

diffstat follows.

kernel/sched/fair.c | 46 +++++++++++++++++++---------------------------
1 file changed, 19 insertions(+), 27 deletions(-)

Thanks.

--
tejun