Re: [RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison

From: chris hyser
Date: Tue Sep 15 2020 - 17:52:34 EST


On 8/28/20 3:51 PM, Julien Desfossez wrote:
From: Aaron Lu <aaron.lwe@xxxxxxxxx>

This patch provides a vruntime based way to compare two cfs tasks' priorities, whether they are on the same cpu or on different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common cfs_rq that both sched_entities are on and then do the comparison there.
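(To make the discussion concrete, here is a minimal standalone sketch of that same-cpu case. The struct layout and the helper name prefer_a_same_cpu are simplified/hypothetical, not the patch code itself: walk the deeper entity up the task-group hierarchy until both entities sit on the same cfs_rq, then compare vruntimes there.)

#include <stdbool.h>
#include <stdint.h>

struct cfs_rq;				/* stand-in for the kernel's cfs_rq */

struct sched_entity {
	struct cfs_rq *cfs_rq;		/* runqueue this entity is queued on */
	struct sched_entity *parent;	/* NULL for root level entities */
	int depth;			/* depth in the task group hierarchy */
	uint64_t vruntime;		/* relative to cfs_rq->min_vruntime */
};

/* true if a should be preferred over b when both run on the same cpu */
static bool prefer_a_same_cpu(struct sched_entity *sea, struct sched_entity *seb)
{
	/* walk the deeper entity up until both share a cfs_rq */
	while (sea->cfs_rq != seb->cfs_rq) {
		int da = sea->depth, db = seb->depth;

		if (da >= db)
			sea = sea->parent;
		if (db >= da)
			seb = seb->parent;
	}
	/* same cfs_rq, so the vruntimes are directly comparable */
	return (int64_t)(sea->vruntime - seb->vruntime) < 0;
}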

When the two tasks are on different threads of the same core, each thread will choose its next task the usual way, and then the root level sched entities that the two tasks belong to are used to decide which task runs next core wide.

An illustration for the cross CPU case:

       cpu0                cpu1
     /  |  \             /  |  \
   se1 se2 se3         se4 se5 se6
       /  \                   /  \
    se21  se22             se61  se62
           (A)                    /
                                se621
                                 (B)

Assume cpu0 and cpu1 are SMT siblings, cpu0 has decided task A should run next, and cpu1 has decided task B should run next. To compare the priority of task A and task B, we compare the priority of se2 and se6: whichever has the smaller vruntime wins.
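(Continuing the sketch above, with the same simplified structs and hypothetical helper names, not the patch code: in the cross-cpu case only the root level entities are compared, i.e. se2 vs se6 for tasks A and B in the diagram.)

/* walk up to the root level sched entity (se2 for A, se6 for B above) */
static struct sched_entity *root_se(struct sched_entity *se)
{
	while (se->parent)
		se = se->parent;
	return se;
}

/* true if a should be preferred over b when they run on sibling threads */
static bool prefer_a_cross_cpu(struct sched_entity *sea, struct sched_entity *seb)
{
	sea = root_se(sea);
	seb = root_se(seb);
	/*
	 * Only meaningful if both root cfs_rqs share the same min_vruntime
	 * base, which is what the core wide normalization described next
	 * provides.
	 */
	return (int64_t)(sea->vruntime - seb->vruntime) < 0;
}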

To make this work, the root level sched entities' vruntimes of the two threads must be directly comparable. So one hyperthread's root cfs_rq min_vruntime is chosen as the core wide one, and all root level sched entities' vruntimes are normalized against it.

Sub cfs_rqs and their sched entities are not interesting for the cross cpu priority comparison, as they only participate in the usual cpu-local scheduling decisions, so there is no need to normalize their vruntimes.
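(A last sketch of the normalization idea, again simplified and with made-up names, not the patch code: one sibling's root cfs_rq min_vruntime serves as the core wide reference, and the other sibling's root level vruntimes are shifted by the difference so that the root level comparison above stays valid. Reuses the model structs from the first sketch.)

struct cfs_rq {
	uint64_t min_vruntime;
	struct sched_entity *curr;	/* only the running entity is shown */
};

/* shift a root cfs_rq (and its root level entities) onto the core wide base */
static void normalize_root_vruntime(struct cfs_rq *cfs_rq, uint64_t core_wide_min)
{
	int64_t delta = (int64_t)(core_wide_min - cfs_rq->min_vruntime);

	if (!delta)
		return;

	cfs_rq->min_vruntime += delta;
	if (cfs_rq->curr)
		cfs_rq->curr->vruntime += delta;
	/* a real implementation would shift every queued root level entity */
}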

Signed-off-by: Aaron Lu <ziqian.lzq@xxxxxxxxxx>
---
 kernel/sched/core.c  |  23 +++----
 kernel/sched/fair.c  | 142 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   3 +
 3 files changed, 150 insertions(+), 18 deletions(-)


While investigating reported 'uperf' performance regressions between core sched v5 and core sched v6/v7, this patch appears to be the first indicator of roughly a 40% perf loss in moving from v5 to v6 (and the accounting introduced there is carried forward into this v7 patch). Unfortunately, it is not the easiest thing to trace back: the patchsets are not directly comparable in this case, and moving to v7 the base kernel revision changed from 5.6 to 5.9.

The regressions were duplicated with the following setup: on a 24 core VM, create a cgroup and, in it, fire off the uperf server and a client running 2 minutes worth of 100 threads doing short TCP reads and writes. Do this with the cgroup both core sched tagged and untagged (technically tearing everything down and rebuilding it in between). Short, and easy to do dozens of runs for statistical averaging.

Whatever the above version of this test might map to in real life, it presumably exacerbates the competition between softirq threads and core sched tagged threads that was observed in the reports.

Here are the uperf results for the various patchsets. Note that disabling SMT is better for these tests, which presumably reflects the overall overhead of core scheduling, and that overhead went from bad to really bad. The primary focus of this email is to start to understand what happened within core sched itself.

patchset           smt=on/cs=off    smt=off       smt=on/cs=on
--------------------------------------------------------------
v5-v5.6.y        :   1.78Gb/s       1.57Gb/s        1.07Gb/s
pre-v6-v5.6.y    :   1.75Gb/s       1.55Gb/s      822.16Mb/s
v6-5.7           :   1.87Gb/s       1.56Gb/s       561.6Mb/s
v6-5.7-hotplug   :   1.75Gb/s       1.58Gb/s      438.21Mb/s
v7               :   1.80Gb/s       1.61Gb/s      440.44Mb/s

If you start by contrasting v5 and v6 on the same base 5.6 kernel, to try to rule out kernel-to-kernel version differences, bisecting v6 pointed to the v6 version of this patch:

"[RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison"

although all that really seems to say is that the new means of vruntime accounting (still present in v7) caused performance in the patchset to drop, which is plausible: different numbers, different scheduler behavior. A rough attempt to verify this by backporting parts of the new accounting onto the v5 patchset showed that the initial switch from the old to the new accounting dropped perf to about 791Mb/s, and the rest of the changes (reflected in the v6 numbers, though not backported) only bring the v6 patchset back to 822.16Mb/s. That is not 100% proof, but it seems very suspicious.

This change in vruntime accounting seems to account for about 40% of the total v5-to-v7 perf loss, though clearly lots of other changes have occurred in between. I am certainly not saying there is a bug here; it just seems time to bring in the original authors and start a general discussion.

-chrish