Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

From: Bagas Sanjaya
Date: Sun Nov 19 2023 - 08:29:50 EST


On Thu, Nov 16, 2023 at 07:58:18PM +0100, Tobias Huschle wrote:
> Hi,
>
> when testing the EEVDF scheduler we stumbled upon a performance regression
> in a uperf scenario and would like to
> kindly ask for feedback on whether we are going into the right direction
> with our analysis so far.
>
> The base scenario are two KVM guests running on an s390 LPAR. One guest
> hosts the uperf server, one the uperf client.
> With EEVDF we observe a regression of ~50% for a strburst test.
> For a more detailed description of the setup see the section TEST SUMMARY at
> the bottom.
>
> Bisecting led us to the following commit which appears to introduce the
> regression:
> 86bfbb7ce4f6 sched/fair: Add lag based placement
>
> We then compared the last good commit we identified with a recent level of
> the devel branch.
> The issue still persists on 6.7 rc1 although there is some improvement (down
> from 62% regression to 49%)
>
> All analysis described further are based on a 6.6 rc7 kernel.
>
> We sampled perf data to get an idea on what is going wrong and ended up
> seeing an dramatic increase in the maximum
> wait times from 3ms up to 366ms. See section WAIT DELAYS below for more
> details.
>
> We then collected tracing data to get a better insight into what is going
> on.
> The trace excerpt in section TRACE EXCERPT shows one example (of multiple
> per test run) of the problematic scenario where
> a kworker(pid=6525) has to wait for 39,718 ms.
>
> Short summary:
> The mentioned kworker has been scheduled to CPU 14 before the tracing was
> enabled.
> A vhost process is migrated onto CPU 14.
> The vruntimes of kworker and vhost differ significantly (86642125805 vs
> 4242563284 -> factor 20)
> The vhost process wants to wake up the kworker, therefore the kworker is
> placed onto the runqueue again and set to runnable.
> The vhost process continues to execute, waking up other vhost processes on
> other CPUs.
>
> So far this behavior is not different to what we see on pre-EEVDF kernels.
>
> On timestamp 576.162767, the vhost process triggers the last wake up of
> another vhost on another CPU.
> Until timestamp 576.171155, we see no other activity. Now, the vhost process
> ends its time slice.
> Then, vhost gets re-assigned new time slices 4 times and gets then migrated
> off to CPU 15.
> This does not occur with older kernels.
> The kworker has to wait for the migration to happen in order to be able to
> execute again.
> This is due to the fact, that the vruntime of the kworker is significantly
> larger than the one of vhost.
>
>
> We observed the large difference in vruntime between kworker and vhost in
> the same magnitude on
> a kernel built based on the parent of the commit mentioned above.
> With EEVDF, the kworker is doomed to wait until the vhost either catches up
> on vruntime (which would take 86 seconds)
> or the vhost is migrated off of the CPU.
>
> We found some options which sound plausible but we are not sure if they are
> valid or not:
>
> 1. The wake up path has a dependency on the vruntime metrics that now delays
> the execution of the kworker.
> 2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime)
> which updates the way cfs_rq->min_vruntime and
> cfs_rq->avg_runtime are set might have introduced an issue which is
> uncovered with the commit mentioned above.
> 3. An assumption in the vhost code which causes vhost to rely on being
> scheduled off in time to allow the kworker to proceed.
>
> We also stumbled upon the following mailing thread:
> https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
> That conversation, and the patches derived from it lead to the assumption
> that the wake up path might be adjustable in a way
> that this case in particular can be addressed.
> At the same time, the vast difference in vruntimes is concerning since, at
> least for some time frame, both processes are on the runqueue.
>
> We would be glad to hear some feedback on which paths to pursue and which
> might just be a dead end in the first place.
>
>
> #################### TRACE EXCERPT ####################
> The sched_place trace event was added to the end of the place_entity
> function and outputs:
> sev -> sched_entity vruntime
> sed -> sched_entity deadline
> sel -> sched_entity vlag
> avg -> cfs_rq avg_vruntime
> min -> cfs_rq min_vruntime
> cpu -> cpu of cfs_rq
> nr -> cfs_rq nr_running
> ---
> CPU 3/KVM-2950 [014] d.... 576.161432: sched_migrate_task:
> comm=vhost-2920 pid=2941 prio=120 orig_cpu=15 dest_cpu=14
> --> migrates task from cpu 15 to 14
> CPU 3/KVM-2950 [014] d.... 576.161433: sched_place: comm=vhost-2920
> pid=2941 sev=4242563284 sed=4245563284 sel=0 avg=4242563284 min=4242563284
> cpu=14 nr=0
> --> places vhost 2920 on CPU 14 with vruntime 4242563284
> CPU 3/KVM-2950 [014] d.... 576.161433: sched_place: comm= pid=0
> sev=16329848593 sed=16334604010 sel=0 avg=16329848593 min=16329848593 cpu=14
> nr=0
> CPU 3/KVM-2950 [014] d.... 576.161433: sched_place: comm= pid=0
> sev=42560661157 sed=42627443765 sel=0 avg=42560661157 min=42560661157 cpu=14
> nr=0
> CPU 3/KVM-2950 [014] d.... 576.161434: sched_place: comm= pid=0
> sev=53846627372 sed=54125900099 sel=0 avg=53846627372 min=53846627372 cpu=14
> nr=0
> CPU 3/KVM-2950 [014] d.... 576.161434: sched_place: comm= pid=0
> sev=86640641980 sed=87255041979 sel=0 avg=86640641980 min=86640641980 cpu=14
> nr=0
> CPU 3/KVM-2950 [014] dN... 576.161434: sched_stat_wait:
> comm=vhost-2920 pid=2941 delay=9958 [ns]
> CPU 3/KVM-2950 [014] d.... 576.161435: sched_switch: prev_comm=CPU
> 3/KVM prev_pid=2950 prev_prio=120 prev_state=S ==> next_comm=vhost-2920
> next_pid=2941 next_prio=120
> vhost-2920-2941 [014] D.... 576.161439: sched_waking:
> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
> vhost-2920-2941 [014] d.... 576.161446: sched_waking:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
> vhost-2920-2941 [014] d.... 576.161447: sched_place:
> comm=kworker/14:0 pid=6525 sev=86642125805 sed=86645125805 sel=0
> avg=86642125805 min=86642125805 cpu=14 nr=1
> --> places kworker 6525 on cpu 14 with vruntime 86642125805
> --> which is far larger than vhost vruntime of 4242563284
> vhost-2920-2941 [014] d.... 576.161447: sched_stat_blocked:
> comm=kworker/14:0 pid=6525 delay=10143757 [ns]
> vhost-2920-2941 [014] dN... 576.161447: sched_wakeup:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
> vhost-2920-2941 [014] dN... 576.161448: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=13884 [ns] vruntime=4242577168 [ns]
> --> vhost 2920 finishes after 13884 ns of runtime
> vhost-2920-2941 [014] dN... 576.161448: sched_stat_wait:
> comm=kworker/14:0 pid=6525 delay=0 [ns]
> vhost-2920-2941 [014] d.... 576.161448: sched_switch:
> prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==>
> next_comm=kworker/14:0 next_pid=6525 next_prio=120
> --> switch to kworker
> kworker/14:0-6525 [014] d.... 576.161449: sched_waking: comm=CPU 2/KVM
> pid=2949 prio=120 target_cpu=007
> kworker/14:0-6525 [014] d.... 576.161450: sched_stat_runtime:
> comm=kworker/14:0 pid=6525 runtime=3714 [ns] vruntime=86642129519 [ns]
> --> kworker finshes after 3714 ns of runtime
> kworker/14:0-6525 [014] d.... 576.161450: sched_stat_wait:
> comm=vhost-2920 pid=2941 delay=3714 [ns]
> kworker/14:0-6525 [014] d.... 576.161451: sched_switch:
> prev_comm=kworker/14:0 prev_pid=6525 prev_prio=120 prev_state=I ==>
> next_comm=vhost-2920 next_pid=2941 next_prio=120
> --> switch back to vhost
> vhost-2920-2941 [014] d.... 576.161478: sched_waking:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
> vhost-2920-2941 [014] d.... 576.161478: sched_place:
> comm=kworker/14:0 pid=6525 sev=86642191859 sed=86645191859 sel=-1150
> avg=86642188144 min=86642188144 cpu=14 nr=1
> --> kworker placed again on cpu 14 with vruntime 86642191859, the problem
> occurs only if lag <= 0, having lag=0 does not always hit the problem though
> vhost-2920-2941 [014] d.... 576.161478: sched_stat_blocked:
> comm=kworker/14:0 pid=6525 delay=27943 [ns]
> vhost-2920-2941 [014] d.... 576.161479: sched_wakeup:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
> vhost-2920-2941 [014] D.... 576.161511: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
> vhost-2920-2941 [014] D.... 576.161512: sched_waking:
> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
> vhost-2920-2941 [014] D.... 576.161516: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
> vhost-2920-2941 [014] D.... 576.161773: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
> vhost-2920-2941 [014] D.... 576.161775: sched_waking:
> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
> vhost-2920-2941 [014] D.... 576.162103: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
> vhost-2920-2941 [014] D.... 576.162105: sched_waking:
> comm=vhost-2286 pid=2307 prio=120 target_cpu=021
> vhost-2920-2941 [014] D.... 576.162326: sched_waking:
> comm=vhost-2286 pid=2305 prio=120 target_cpu=004
> vhost-2920-2941 [014] D.... 576.162437: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
> vhost-2920-2941 [014] D.... 576.162767: sched_waking:
> comm=vhost-2286 pid=2305 prio=120 target_cpu=004
> vhost-2920-2941 [014] d.h.. 576.171155: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=9704465 [ns] vruntime=4252281633 [ns]
> vhost-2920-2941 [014] d.h.. 576.181155: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=10000377 [ns] vruntime=4262282010 [ns]
> vhost-2920-2941 [014] d.h.. 576.191154: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=9999514 [ns] vruntime=4272281524 [ns]
> vhost-2920-2941 [014] d.h.. 576.201155: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=10000246 [ns] vruntime=4282281770 [ns]
> --> vhost gets rescheduled multiple times because its vruntime is
> significantly smaller than the vruntime of the kworker
> vhost-2920-2941 [014] dNh.. 576.201176: sched_wakeup:
> comm=migration/14 pid=85 prio=0 target_cpu=014
> vhost-2920-2941 [014] dN... 576.201191: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=25190 [ns] vruntime=4282306960 [ns]
> vhost-2920-2941 [014] d.... 576.201192: sched_switch:
> prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==>
> next_comm=migration/14 next_pid=85 next_prio=0
> migration/14-85 [014] d..1. 576.201194: sched_migrate_task:
> comm=vhost-2920 pid=2941 prio=120 orig_cpu=14 dest_cpu=15
> --> vhost gets migrated off of cpu 14
> migration/14-85 [014] d..1. 576.201194: sched_place: comm=vhost-2920
> pid=2941 sev=3198666923 sed=3201666923 sel=0 avg=3198666923 min=3198666923
> cpu=15 nr=0
> migration/14-85 [014] d..1. 576.201195: sched_place: comm= pid=0
> sev=12775683594 sed=12779398224 sel=0 avg=12775683594 min=12775683594 cpu=15
> nr=0
> migration/14-85 [014] d..1. 576.201195: sched_place: comm= pid=0
> sev=33655559178 sed=33661025369 sel=0 avg=33655559178 min=33655559178 cpu=15
> nr=0
> migration/14-85 [014] d..1. 576.201195: sched_place: comm= pid=0
> sev=42240572785 sed=42244083642 sel=0 avg=42240572785 min=42240572785 cpu=15
> nr=0
> migration/14-85 [014] d..1. 576.201196: sched_place: comm= pid=0
> sev=70190876523 sed=70194789898 sel=-13068763 avg=70190876523
> min=70190876523 cpu=15 nr=0
> migration/14-85 [014] d.... 576.201198: sched_stat_wait:
> comm=kworker/14:0 pid=6525 delay=39718472 [ns]
> migration/14-85 [014] d.... 576.201198: sched_switch:
> prev_comm=migration/14 prev_pid=85 prev_prio=0 prev_state=S ==>
> next_comm=kworker/14:0 next_pid=6525 next_prio=120
> --> only now, kworker is eligible to run again, after a delay of 39718472
> ns
> kworker/14:0-6525 [014] d.... 576.201200: sched_waking: comm=CPU 0/KVM
> pid=2947 prio=120 target_cpu=012
> kworker/14:0-6525 [014] d.... 576.201290: sched_stat_runtime:
> comm=kworker/14:0 pid=6525 runtime=92941 [ns] vruntime=86642284800 [ns]
>
> #################### WAIT DELAYS - PERF LATENCY ####################
> last good commit --> perf sched latency -s max
> -------------------------------------------------------------------------------------------------------------------------------------------
> Task | Runtime ms | Switches | Avg delay ms | Max
> delay ms | Max delay start | Max delay end |
> -------------------------------------------------------------------------------------------------------------------------------------------
> CPU 2/KVM:(2) | 5399.650 ms | 108698 | avg: 0.003 ms | max:
> 3.077 ms | max start: 544.090322 s | max end: 544.093399 s
> CPU 7/KVM:(2) | 5111.132 ms | 69632 | avg: 0.003 ms | max:
> 2.980 ms | max start: 544.690994 s | max end: 544.693974 s
> kworker/22:3-ev:723 | 342.944 ms | 63417 | avg: 0.005 ms | max:
> 1.880 ms | max start: 545.235430 s | max end: 545.237310 s
> CPU 0/KVM:(2) | 8171.431 ms | 433099 | avg: 0.003 ms | max:
> 1.004 ms | max start: 547.970344 s | max end: 547.971348 s
> CPU 1/KVM:(2) | 5486.260 ms | 258702 | avg: 0.003 ms | max:
> 1.002 ms | max start: 548.782514 s | max end: 548.783516 s
> CPU 5/KVM:(2) | 4766.143 ms | 65727 | avg: 0.003 ms | max:
> 0.997 ms | max start: 545.313610 s | max end: 545.314607 s
> vhost-2268:(6) | 13206.503 ms | 315030 | avg: 0.003 ms | max:
> 0.989 ms | max start: 550.887761 s | max end: 550.888749 s
> vhost-2892:(6) | 14467.268 ms | 214005 | avg: 0.003 ms | max:
> 0.981 ms | max start: 545.213819 s | max end: 545.214800 s
> CPU 3/KVM:(2) | 5538.908 ms | 85105 | avg: 0.003 ms | max:
> 0.883 ms | max start: 547.138139 s | max end: 547.139023 s
> CPU 6/KVM:(2) | 5289.827 ms | 72301 | avg: 0.003 ms | max:
> 0.836 ms | max start: 551.094590 s | max end: 551.095425 s
>
> 6.6 rc7 --> perf sched latency -s max
> -------------------------------------------------------------------------------------------------------------------------------------------
> Task | Runtime ms | Switches | Avg delay ms | Max
> delay ms | Max delay start | Max delay end |
> -------------------------------------------------------------------------------------------------------------------------------------------
> kworker/19:2-ev:1071 | 69.482 ms | 12700 | avg: 0.050 ms | max:
> 366.314 ms | max start: 54705.674294 s | max end: 54706.040607 s
> kworker/13:1-ev:184 | 78.048 ms | 14645 | avg: 0.067 ms | max:
> 287.738 ms | max start: 54710.312863 s | max end: 54710.600602 s
> kworker/12:1-ev:46148 | 138.488 ms | 26660 | avg: 0.021 ms | max:
> 147.414 ms | max start: 54706.133161 s | max end: 54706.280576 s
> kworker/16:2-ev:33076 | 149.175 ms | 29491 | avg: 0.026 ms | max:
> 139.752 ms | max start: 54708.410845 s | max end: 54708.550597 s
> CPU 3/KVM:(2) | 1934.714 ms | 41896 | avg: 0.007 ms | max:
> 92.126 ms | max start: 54713.158498 s | max end: 54713.250624 s
> kworker/7:2-eve:17001 | 68.164 ms | 11820 | avg: 0.045 ms | max:
> 69.717 ms | max start: 54707.100903 s | max end: 54707.170619 s
> kworker/17:1-ev:46510 | 68.804 ms | 13328 | avg: 0.037 ms | max:
> 67.894 ms | max start: 54711.022711 s | max end: 54711.090605 s
> kworker/21:1-ev:45782 | 68.906 ms | 13215 | avg: 0.021 ms | max:
> 59.473 ms | max start: 54709.351135 s | max end: 54709.410608 s
> ksoftirqd/17:101 | 0.041 ms | 2 | avg: 25.028 ms | max:
> 50.047 ms | max start: 54711.040578 s | max end: 54711.090625 s
>
> #################### TEST SUMMARY ####################
> Setup description:
> - single KVM host with 2 identical guests
> - guests are connected virtually via Open vSwitch
> - guests run uperf streaming read workload with 50 parallel connections
> - one guests acts as uperf client, the other one as uperf server
>
> Regression:
> kernel-6.5.0-rc2: 78 Gb/s (before 86bfbb7ce4f6 sched/fair: Add lag based
> placement)
> kernel-6.5.0-rc2: 29 Gb/s (with 86bfbb7ce4f6 sched/fair: Add lag based
> placement)
> kernel-6.7.0-rc1: 41 Gb/s
>
> KVM host:
> - 12 dedicated IFLs, SMT-2 (24 Linux CPUs)
> - 64 GiB memory
> - FEDORA 38
> - kernel commandline: transparent_hugepage=never audit_enable=0 audit=0
> audit_debug=0 selinux=0
>
> KVM guests:
> - 8 vCPUs
> - 8 GiB memory
> - RHEL 9.2
> - kernel: 5.14.0-162.6.1.el9_1.s390x
> - kernel commandline: transparent_hugepage=never audit_enable=0 audit=0
> audit_debug=0 selinux=0
>
> Open vSwitch:
> - Open vSwitch with 2 ports, each with mtu=32768 and qlen=15000
> - Open vSwitch ports attached to guests via virtio-net
> - each guest has 4 vhost-queues
>
> Domain xml snippet for Open vSwitch port:
> <interface type="bridge" dev="OVS">
> <source bridge="vswitch0"/>
> <mac address="02:bb:97:28:02:02"/>
> <virtualport type="openvswitch"/>
> <model type="virtio"/>
> <target dev="vport1"/>
> <driver name="vhost" queues="4"/>
> <address type="ccw" cssid="0xfe" ssid="0x0" devno="0x0002"/>
> </interface>
>
> Benchmark: uperf
> - workload: str-readx30k, 50 active parallel connections
> - uperf server permanently sends data in 30720-byte chunks
> - uperf client receives and acknowledges this data
> - Server: uperf -s
> - Client: uperf -a -i 30 -m uperf.xml
>
> uperf.xml:
> <?xml version="1.0"?>
> <profile name="strburst">
> <group nprocs="50">
> <transaction iterations="1">
> <flowop type="connect" options="remotehost=10.161.28.3 protocol=tcp
> "/>
> </transaction>
> <transaction duration="300">
> <flowop type="read" options="count=640 size=30k"/>
> </transaction>
> <transaction iterations="1">
> <flowop type="disconnect" />
> </transaction>
> </group>
> </profile>

Thanks for the regression report. I'm adding it to regzbot:

#regzbot ^introduced: 86bfbb7ce4f67a

--
An old man doll... just what I always wanted! - Clara

Attachment: signature.asc
Description: PGP signature