EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

From: Tobias Huschle
Date: Thu Nov 16 2023 - 14:03:02 EST


Hi,

when testing the EEVDF scheduler we stumbled upon a performance regression in a uperf scenario and would like to
kindly ask for feedback on whether we are going into the right direction with our analysis so far.

The base scenario are two KVM guests running on an s390 LPAR. One guest hosts the uperf server, one the uperf client.
With EEVDF we observe a regression of ~50% for a strburst test.
For a more detailed description of the setup see the section TEST SUMMARY at the bottom.

Bisecting led us to the following commit which appears to introduce the regression:
86bfbb7ce4f6 sched/fair: Add lag based placement

We then compared the last good commit we identified with a recent level of the devel branch.
The issue still persists on 6.7 rc1 although there is some improvement (down from 62% regression to 49%)

All analysis described further are based on a 6.6 rc7 kernel.

We sampled perf data to get an idea on what is going wrong and ended up seeing an dramatic increase in the maximum
wait times from 3ms up to 366ms. See section WAIT DELAYS below for more details.

We then collected tracing data to get a better insight into what is going on.
The trace excerpt in section TRACE EXCERPT shows one example (of multiple per test run) of the problematic scenario where
a kworker(pid=6525) has to wait for 39,718 ms.

Short summary:
The mentioned kworker has been scheduled to CPU 14 before the tracing was enabled.
A vhost process is migrated onto CPU 14.
The vruntimes of kworker and vhost differ significantly (86642125805 vs 4242563284 -> factor 20)
The vhost process wants to wake up the kworker, therefore the kworker is placed onto the runqueue again and set to runnable.
The vhost process continues to execute, waking up other vhost processes on other CPUs.

So far this behavior is not different to what we see on pre-EEVDF kernels.

On timestamp 576.162767, the vhost process triggers the last wake up of another vhost on another CPU.
Until timestamp 576.171155, we see no other activity. Now, the vhost process ends its time slice.
Then, vhost gets re-assigned new time slices 4 times and gets then migrated off to CPU 15.
This does not occur with older kernels.
The kworker has to wait for the migration to happen in order to be able to execute again.
This is due to the fact, that the vruntime of the kworker is significantly larger than the one of vhost.


We observed the large difference in vruntime between kworker and vhost in the same magnitude on
a kernel built based on the parent of the commit mentioned above.
With EEVDF, the kworker is doomed to wait until the vhost either catches up on vruntime (which would take 86 seconds)
or the vhost is migrated off of the CPU.

We found some options which sound plausible but we are not sure if they are valid or not:

1. The wake up path has a dependency on the vruntime metrics that now delays the execution of the kworker.
2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime) which updates the way cfs_rq->min_vruntime and
cfs_rq->avg_runtime are set might have introduced an issue which is uncovered with the commit mentioned above.
3. An assumption in the vhost code which causes vhost to rely on being scheduled off in time to allow the kworker to proceed.

We also stumbled upon the following mailing thread: https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
That conversation, and the patches derived from it lead to the assumption that the wake up path might be adjustable in a way
that this case in particular can be addressed.
At the same time, the vast difference in vruntimes is concerning since, at least for some time frame, both processes are on the runqueue.

We would be glad to hear some feedback on which paths to pursue and which might just be a dead end in the first place.


#################### TRACE EXCERPT ####################
The sched_place trace event was added to the end of the place_entity function and outputs:
sev -> sched_entity vruntime
sed -> sched_entity deadline
sel -> sched_entity vlag
avg -> cfs_rq avg_vruntime
min -> cfs_rq min_vruntime
cpu -> cpu of cfs_rq
nr -> cfs_rq nr_running
---
CPU 3/KVM-2950 [014] d.... 576.161432: sched_migrate_task: comm=vhost-2920 pid=2941 prio=120 orig_cpu=15 dest_cpu=14
--> migrates task from cpu 15 to 14
CPU 3/KVM-2950 [014] d.... 576.161433: sched_place: comm=vhost-2920 pid=2941 sev=4242563284 sed=4245563284 sel=0 avg=4242563284 min=4242563284 cpu=14 nr=0
--> places vhost 2920 on CPU 14 with vruntime 4242563284
CPU 3/KVM-2950 [014] d.... 576.161433: sched_place: comm= pid=0 sev=16329848593 sed=16334604010 sel=0 avg=16329848593 min=16329848593 cpu=14 nr=0
CPU 3/KVM-2950 [014] d.... 576.161433: sched_place: comm= pid=0 sev=42560661157 sed=42627443765 sel=0 avg=42560661157 min=42560661157 cpu=14 nr=0
CPU 3/KVM-2950 [014] d.... 576.161434: sched_place: comm= pid=0 sev=53846627372 sed=54125900099 sel=0 avg=53846627372 min=53846627372 cpu=14 nr=0
CPU 3/KVM-2950 [014] d.... 576.161434: sched_place: comm= pid=0 sev=86640641980 sed=87255041979 sel=0 avg=86640641980 min=86640641980 cpu=14 nr=0
CPU 3/KVM-2950 [014] dN... 576.161434: sched_stat_wait: comm=vhost-2920 pid=2941 delay=9958 [ns]
CPU 3/KVM-2950 [014] d.... 576.161435: sched_switch: prev_comm=CPU 3/KVM prev_pid=2950 prev_prio=120 prev_state=S ==> next_comm=vhost-2920 next_pid=2941 next_prio=120
vhost-2920-2941 [014] D.... 576.161439: sched_waking: comm=vhost-2286 pid=2309 prio=120 target_cpu=008
vhost-2920-2941 [014] d.... 576.161446: sched_waking: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
vhost-2920-2941 [014] d.... 576.161447: sched_place: comm=kworker/14:0 pid=6525 sev=86642125805 sed=86645125805 sel=0 avg=86642125805 min=86642125805 cpu=14 nr=1
--> places kworker 6525 on cpu 14 with vruntime 86642125805
--> which is far larger than vhost vruntime of 4242563284
vhost-2920-2941 [014] d.... 576.161447: sched_stat_blocked: comm=kworker/14:0 pid=6525 delay=10143757 [ns]
vhost-2920-2941 [014] dN... 576.161447: sched_wakeup: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
vhost-2920-2941 [014] dN... 576.161448: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=13884 [ns] vruntime=4242577168 [ns]
--> vhost 2920 finishes after 13884 ns of runtime
vhost-2920-2941 [014] dN... 576.161448: sched_stat_wait: comm=kworker/14:0 pid=6525 delay=0 [ns]
vhost-2920-2941 [014] d.... 576.161448: sched_switch: prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==> next_comm=kworker/14:0 next_pid=6525 next_prio=120
--> switch to kworker
kworker/14:0-6525 [014] d.... 576.161449: sched_waking: comm=CPU 2/KVM pid=2949 prio=120 target_cpu=007
kworker/14:0-6525 [014] d.... 576.161450: sched_stat_runtime: comm=kworker/14:0 pid=6525 runtime=3714 [ns] vruntime=86642129519 [ns]
--> kworker finshes after 3714 ns of runtime
kworker/14:0-6525 [014] d.... 576.161450: sched_stat_wait: comm=vhost-2920 pid=2941 delay=3714 [ns]
kworker/14:0-6525 [014] d.... 576.161451: sched_switch: prev_comm=kworker/14:0 prev_pid=6525 prev_prio=120 prev_state=I ==> next_comm=vhost-2920 next_pid=2941 next_prio=120
--> switch back to vhost
vhost-2920-2941 [014] d.... 576.161478: sched_waking: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
vhost-2920-2941 [014] d.... 576.161478: sched_place: comm=kworker/14:0 pid=6525 sev=86642191859 sed=86645191859 sel=-1150 avg=86642188144 min=86642188144 cpu=14 nr=1
--> kworker placed again on cpu 14 with vruntime 86642191859, the problem occurs only if lag <= 0, having lag=0 does not always hit the problem though
vhost-2920-2941 [014] d.... 576.161478: sched_stat_blocked: comm=kworker/14:0 pid=6525 delay=27943 [ns]
vhost-2920-2941 [014] d.... 576.161479: sched_wakeup: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
vhost-2920-2941 [014] D.... 576.161511: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
vhost-2920-2941 [014] D.... 576.161512: sched_waking: comm=vhost-2286 pid=2309 prio=120 target_cpu=008
vhost-2920-2941 [014] D.... 576.161516: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
vhost-2920-2941 [014] D.... 576.161773: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
vhost-2920-2941 [014] D.... 576.161775: sched_waking: comm=vhost-2286 pid=2309 prio=120 target_cpu=008
vhost-2920-2941 [014] D.... 576.162103: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
vhost-2920-2941 [014] D.... 576.162105: sched_waking: comm=vhost-2286 pid=2307 prio=120 target_cpu=021
vhost-2920-2941 [014] D.... 576.162326: sched_waking: comm=vhost-2286 pid=2305 prio=120 target_cpu=004
vhost-2920-2941 [014] D.... 576.162437: sched_waking: comm=vhost-2286 pid=2308 prio=120 target_cpu=006
vhost-2920-2941 [014] D.... 576.162767: sched_waking: comm=vhost-2286 pid=2305 prio=120 target_cpu=004
vhost-2920-2941 [014] d.h.. 576.171155: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=9704465 [ns] vruntime=4252281633 [ns]
vhost-2920-2941 [014] d.h.. 576.181155: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=10000377 [ns] vruntime=4262282010 [ns]
vhost-2920-2941 [014] d.h.. 576.191154: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=9999514 [ns] vruntime=4272281524 [ns]
vhost-2920-2941 [014] d.h.. 576.201155: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=10000246 [ns] vruntime=4282281770 [ns]
--> vhost gets rescheduled multiple times because its vruntime is significantly smaller than the vruntime of the kworker
vhost-2920-2941 [014] dNh.. 576.201176: sched_wakeup: comm=migration/14 pid=85 prio=0 target_cpu=014
vhost-2920-2941 [014] dN... 576.201191: sched_stat_runtime: comm=vhost-2920 pid=2941 runtime=25190 [ns] vruntime=4282306960 [ns]
vhost-2920-2941 [014] d.... 576.201192: sched_switch: prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==> next_comm=migration/14 next_pid=85 next_prio=0
migration/14-85 [014] d..1. 576.201194: sched_migrate_task: comm=vhost-2920 pid=2941 prio=120 orig_cpu=14 dest_cpu=15
--> vhost gets migrated off of cpu 14
migration/14-85 [014] d..1. 576.201194: sched_place: comm=vhost-2920 pid=2941 sev=3198666923 sed=3201666923 sel=0 avg=3198666923 min=3198666923 cpu=15 nr=0
migration/14-85 [014] d..1. 576.201195: sched_place: comm= pid=0 sev=12775683594 sed=12779398224 sel=0 avg=12775683594 min=12775683594 cpu=15 nr=0
migration/14-85 [014] d..1. 576.201195: sched_place: comm= pid=0 sev=33655559178 sed=33661025369 sel=0 avg=33655559178 min=33655559178 cpu=15 nr=0
migration/14-85 [014] d..1. 576.201195: sched_place: comm= pid=0 sev=42240572785 sed=42244083642 sel=0 avg=42240572785 min=42240572785 cpu=15 nr=0
migration/14-85 [014] d..1. 576.201196: sched_place: comm= pid=0 sev=70190876523 sed=70194789898 sel=-13068763 avg=70190876523 min=70190876523 cpu=15 nr=0
migration/14-85 [014] d.... 576.201198: sched_stat_wait: comm=kworker/14:0 pid=6525 delay=39718472 [ns]
migration/14-85 [014] d.... 576.201198: sched_switch: prev_comm=migration/14 prev_pid=85 prev_prio=0 prev_state=S ==> next_comm=kworker/14:0 next_pid=6525 next_prio=120
--> only now, kworker is eligible to run again, after a delay of 39718472 ns
kworker/14:0-6525 [014] d.... 576.201200: sched_waking: comm=CPU 0/KVM pid=2947 prio=120 target_cpu=012
kworker/14:0-6525 [014] d.... 576.201290: sched_stat_runtime: comm=kworker/14:0 pid=6525 runtime=92941 [ns] vruntime=86642284800 [ns]

#################### WAIT DELAYS - PERF LATENCY ####################
last good commit --> perf sched latency -s max
-------------------------------------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Max delay start | Max delay end |
-------------------------------------------------------------------------------------------------------------------------------------------
CPU 2/KVM:(2) | 5399.650 ms | 108698 | avg: 0.003 ms | max: 3.077 ms | max start: 544.090322 s | max end: 544.093399 s
CPU 7/KVM:(2) | 5111.132 ms | 69632 | avg: 0.003 ms | max: 2.980 ms | max start: 544.690994 s | max end: 544.693974 s
kworker/22:3-ev:723 | 342.944 ms | 63417 | avg: 0.005 ms | max: 1.880 ms | max start: 545.235430 s | max end: 545.237310 s
CPU 0/KVM:(2) | 8171.431 ms | 433099 | avg: 0.003 ms | max: 1.004 ms | max start: 547.970344 s | max end: 547.971348 s
CPU 1/KVM:(2) | 5486.260 ms | 258702 | avg: 0.003 ms | max: 1.002 ms | max start: 548.782514 s | max end: 548.783516 s
CPU 5/KVM:(2) | 4766.143 ms | 65727 | avg: 0.003 ms | max: 0.997 ms | max start: 545.313610 s | max end: 545.314607 s
vhost-2268:(6) | 13206.503 ms | 315030 | avg: 0.003 ms | max: 0.989 ms | max start: 550.887761 s | max end: 550.888749 s
vhost-2892:(6) | 14467.268 ms | 214005 | avg: 0.003 ms | max: 0.981 ms | max start: 545.213819 s | max end: 545.214800 s
CPU 3/KVM:(2) | 5538.908 ms | 85105 | avg: 0.003 ms | max: 0.883 ms | max start: 547.138139 s | max end: 547.139023 s
CPU 6/KVM:(2) | 5289.827 ms | 72301 | avg: 0.003 ms | max: 0.836 ms | max start: 551.094590 s | max end: 551.095425 s

6.6 rc7 --> perf sched latency -s max
-------------------------------------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Max delay start | Max delay end |
-------------------------------------------------------------------------------------------------------------------------------------------
kworker/19:2-ev:1071 | 69.482 ms | 12700 | avg: 0.050 ms | max: 366.314 ms | max start: 54705.674294 s | max end: 54706.040607 s
kworker/13:1-ev:184 | 78.048 ms | 14645 | avg: 0.067 ms | max: 287.738 ms | max start: 54710.312863 s | max end: 54710.600602 s
kworker/12:1-ev:46148 | 138.488 ms | 26660 | avg: 0.021 ms | max: 147.414 ms | max start: 54706.133161 s | max end: 54706.280576 s
kworker/16:2-ev:33076 | 149.175 ms | 29491 | avg: 0.026 ms | max: 139.752 ms | max start: 54708.410845 s | max end: 54708.550597 s
CPU 3/KVM:(2) | 1934.714 ms | 41896 | avg: 0.007 ms | max: 92.126 ms | max start: 54713.158498 s | max end: 54713.250624 s
kworker/7:2-eve:17001 | 68.164 ms | 11820 | avg: 0.045 ms | max: 69.717 ms | max start: 54707.100903 s | max end: 54707.170619 s
kworker/17:1-ev:46510 | 68.804 ms | 13328 | avg: 0.037 ms | max: 67.894 ms | max start: 54711.022711 s | max end: 54711.090605 s
kworker/21:1-ev:45782 | 68.906 ms | 13215 | avg: 0.021 ms | max: 59.473 ms | max start: 54709.351135 s | max end: 54709.410608 s
ksoftirqd/17:101 | 0.041 ms | 2 | avg: 25.028 ms | max: 50.047 ms | max start: 54711.040578 s | max end: 54711.090625 s

#################### TEST SUMMARY ####################
Setup description:
- single KVM host with 2 identical guests
- guests are connected virtually via Open vSwitch
- guests run uperf streaming read workload with 50 parallel connections
- one guests acts as uperf client, the other one as uperf server

Regression:
kernel-6.5.0-rc2: 78 Gb/s (before 86bfbb7ce4f6 sched/fair: Add lag based placement)
kernel-6.5.0-rc2: 29 Gb/s (with 86bfbb7ce4f6 sched/fair: Add lag based placement)
kernel-6.7.0-rc1: 41 Gb/s

KVM host:
- 12 dedicated IFLs, SMT-2 (24 Linux CPUs)
- 64 GiB memory
- FEDORA 38
- kernel commandline: transparent_hugepage=never audit_enable=0 audit=0 audit_debug=0 selinux=0

KVM guests:
- 8 vCPUs
- 8 GiB memory
- RHEL 9.2
- kernel: 5.14.0-162.6.1.el9_1.s390x
- kernel commandline: transparent_hugepage=never audit_enable=0 audit=0 audit_debug=0 selinux=0

Open vSwitch:
- Open vSwitch with 2 ports, each with mtu=32768 and qlen=15000
- Open vSwitch ports attached to guests via virtio-net
- each guest has 4 vhost-queues

Domain xml snippet for Open vSwitch port:
<interface type="bridge" dev="OVS">
<source bridge="vswitch0"/>
<mac address="02:bb:97:28:02:02"/>
<virtualport type="openvswitch"/>
<model type="virtio"/>
<target dev="vport1"/>
<driver name="vhost" queues="4"/>
<address type="ccw" cssid="0xfe" ssid="0x0" devno="0x0002"/>
</interface>

Benchmark: uperf
- workload: str-readx30k, 50 active parallel connections
- uperf server permanently sends data in 30720-byte chunks
- uperf client receives and acknowledges this data
- Server: uperf -s
- Client: uperf -a -i 30 -m uperf.xml

uperf.xml:
<?xml version="1.0"?>
<profile name="strburst">
<group nprocs="50">
<transaction iterations="1">
<flowop type="connect" options="remotehost=10.161.28.3 protocol=tcp "/>
</transaction>
<transaction duration="300">
<flowop type="read" options="count=640 size=30k"/>
</transaction>
<transaction iterations="1">
<flowop type="disconnect" />
</transaction>
</group>
</profile>