Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

From: Abel Wu
Date: Sat Nov 18 2023 - 00:14:44 EST


On 11/17/23 5:23 PM, Peter Zijlstra wrote:

Your email is pretty badly mangled by wrapping, please try and
reconfigure your MUA, esp. the trace and debug output is unreadable.

On Thu, Nov 16, 2023 at 07:58:18PM +0100, Tobias Huschle wrote:

The base scenario is two KVM guests running on an s390 LPAR. One guest
hosts the uperf server, the other the uperf client.
With EEVDF we observe a regression of ~50% for a strburst test.
For a more detailed description of the setup, see the section TEST SUMMARY at
the bottom.

Well, that's not good :/

Short summary:
The mentioned kworker had been scheduled to CPU 14 before the tracing was
enabled.
A vhost process is migrated onto CPU 14.
The vruntimes of kworker and vhost differ significantly (86642125805 vs
4242563284 -> a factor of ~20).

So bear with me, I know absolutely nothing about virt stuff. I suspect
there's cgroups involved because shiny or something.

kworkers are typically not in cgroups and are part of the root cgroup,
but what's a vhost and where does it live?

Also, what are their weights / nice values?

The vhost process wants to wake up the kworker; therefore, the kworker is
placed onto the runqueue again and set to runnable.
The vhost process continues to execute, waking up other vhost processes on
other CPUs.

So far this behavior is no different from what we see on pre-EEVDF kernels.

At timestamp 576.162767, the vhost process triggers the last wake-up of
another vhost on another CPU.
Until timestamp 576.171155, we see no other activity. Then the vhost
process's time slice ends.
vhost is re-assigned new time slices 4 times and is then migrated
off to CPU 15.

So why does this vhost stay on the CPU if it doesn't have anything to
do? (I've not tried to make sense of the trace, that's just too
painful).

This does not occur with older kernels.
The kworker has to wait for that migration to happen in order to be able to
execute again.
This is due to the fact that the vruntime of the kworker is significantly
larger than that of vhost.

That's weird. Can you add a trace_printk() to update_entity_lag() and
have it print out the lag, limit and vlag (post clamping) values? And
also in place_entity() for the reverse process, lag pre and post scaling
or something.
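
(For reference, a minimal sketch of the instrumentation being asked for,
against the v6.6-era shape of update_entity_lag(); the trace_printk() is
the addition, not mainline code, and a similar one would go into
place_entity() around the lag scaling:

static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	s64 lag, limit;

	SCHED_WARN_ON(!se->on_rq);
	lag = avg_vruntime(cfs_rq) - se->vruntime;

	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
	se->vlag = clamp(lag, -limit, limit);

	/* debug: lag before clamping, the clamp limit, the stored vlag */
	trace_printk("lag=%lld limit=%lld vlag=%lld\n", lag, limit, se->vlag);
}
)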

After confirming both tasks are indeed in the same cgroup of course,
because if they're not, vruntime will be meaningless to compare and we
should look elsewhere.

Also, what HZ and what preemption mode are you running? If the kworker is
somehow vastly over-shooting its slice -- keeps running way past the
avg_vruntime -- then it will build up a giant lag and you get what you
describe: next time it wakes up it gets placed far to the right (exactly
where it was when it 'finally' went to sleep, relatively speaking).
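
(To make 'placed far to the right' concrete: placement ends with
se->vruntime = vruntime - lag, where vruntime starts at avg_vruntime(cfs_rq).
With made-up numbers, V = 1000000 and a stored vlag of -500000:

	se->vruntime = V - vlag = 1000000 - (-500000) = 1500000

so the task resumes exactly as far right of the average as it was when it
went to sleep, and won't be picked until the rest catch up.)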

We found some options which sound plausible, but we are not sure whether
they are valid:

1. The wake-up path has a dependency on the vruntime metrics that now delays
the execution of the kworker.
2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime),
which updates the way cfs_rq->min_vruntime and
cfs_rq->avg_vruntime are set, might have introduced an issue which is only
uncovered by the commit mentioned above.

Suppose you have a few tasks (of equal weight) on your virtual timeline
like so:

        ---------+---+---+---+---+------
                 ^       ^
                 |       `avg_vruntime
                 `-min_vruntime

Then the above would be more or less the relative placements of these
values. avg_vruntime is the weighted average of the various vruntimes
and is therefore always in the 'middle' of the tasks, and not somewhere
out there.

min_vruntime is a monotonically increasing 'minimum' that's left-ish on
the tree (there are a few cases where a new task can be placed left of
min_vruntime and it's no longer actually the minimum, but whatever).

These values should be relatively close to one another, depending
of course on the spread of the tasks. So I don't think this is causing
trouble.
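
(A simplified sketch of what af4cf40470c2 computes; the real avg_vruntime()
also does sign-aware rounding, elided here. Entity keys are tracked relative
to min_vruntime so the sums stay small:

u64 avg_vruntime(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	s64 avg = cfs_rq->avg_vruntime;	/* \Sum w_i * (v_i - min_vruntime) */
	long load = cfs_rq->avg_load;	/* \Sum w_i */

	if (curr && curr->on_rq) {
		unsigned long weight = scale_load_down(curr->load.weight);

		avg += entity_key(cfs_rq, curr) * weight;
		load += weight;
	}

	if (load)
		avg = div_s64(avg, load);

	return cfs_rq->min_vruntime + avg;
}

so avg_vruntime = min_vruntime + \Sum w_i*(v_i - min_vruntime) / \Sum w_i.)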

Anyway, the big difference with lag based placement is this: previously,
tasks (that do not migrate) retained their old vruntime, and on placement
they got pulled forward to at least min_vruntime. So a task that wildly
overshot, but then didn't run for a significant time, could still be
overtaken and then, when placed again, be 'okay'.

Now OTOH, with lag-based placement, we strictly preserve their relative
offset vs avg_vruntime. So if they were *far* to the right when they went
to sleep, they will again be there on placement.
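
(In code terms, paraphrasing place_entity() before and after 86bfbb7ce4f6,
with the sleeper credit and lag scaling details elided:

	/* pre-EEVDF: a sleeper is pulled up to (roughly) min_vruntime */
	se->vruntime = max_vruntime(se->vruntime, cfs_rq->min_vruntime);

	/* lag based: the offset against the average is preserved */
	se->vruntime = avg_vruntime(cfs_rq) - se->vlag;
)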

Hi Peter, I'm a little confused here. As we adopt placement strategy #1
when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
Given that the weight doesn't change, we have:

	vl' = vl

But in fact it is scaled on placement:

	vl' = vl * W/(W + w)

Is this intended? And to illustrate my understanding of strategy #1:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07f555857698..a24ef8b297ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5131,7 +5131,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 *
 	 * EEVDF: placement strategy #1 / #2
 	 */
-	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
+	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running && se->vlag) {
 		struct sched_entity *curr = cfs_rq->curr;
 		unsigned long load;
 
@@ -5150,7 +5150,10 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		 * To avoid the 'w_i' term all over the place, we only track
 		 * the virtual lag:
 		 *
-		 *   vl_i = V - v_i <=> v_i = V - vl_i
+		 *   vl_i = V' - v_i <=> v_i = V' - vl_i
+		 *
+		 * Where V' is the new weighted average after placing this
+		 * entity, and v_i is its newly assigned vruntime.
 		 *
 		 * And we take V to be the weighted average of all v:
 		 *
@@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		 * vl_i is given by:
 		 *
 		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
-		 *      = (W*V + w_i*(V - vl_i)) / (W + w_i)
-		 *      = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
-		 *      = (V*(W + w_i) - w_i*l) / (W + w_i)
-		 *      = V - w_i*vl_i / (W + w_i)
-		 *
-		 * And the actual lag after adding an entity with vl_i is:
-		 *
-		 *   vl'_i = V' - v_i
-		 *         = V - w_i*vl_i / (W + w_i) - (V - vl_i)
-		 *         = vl_i - w_i*vl_i / (W + w_i)
-		 *
-		 * Which is strictly less than vl_i. So in order to preserve lag
-		 * we should inflate the lag before placement such that the
-		 * effective lag after placement comes out right.
-		 *
-		 * As such, invert the above relation for vl'_i to get the vl_i
-		 * we need to use such that the lag after placement is the lag
-		 * we computed before dequeue.
+		 *      = (W*V + w_i*(V' - vl_i)) / (W + w_i)
+		 *      = V - w_i*vl_i / W
 		 *
-		 *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
-		 *         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
-		 *
-		 *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
-		 *                   = W*vl_i
-		 *
-		 *   vl_i = (W + w_i)*vl'_i / W
 		 */
 		load = cfs_rq->avg_load;
 		if (curr && curr->on_rq)
 			load += scale_load_down(curr->load.weight);
-
-		lag *= load + scale_load_down(se->load.weight);
 		if (WARN_ON_ONCE(!load))
 			load = 1;
-		lag = div_s64(lag, load);
+
+		vruntime -= div_s64(lag * scale_load_down(se->load.weight), load);
 	}
 
 	se->vruntime = vruntime - lag;
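
If I did the algebra right, this assigns v_i = V - w_i*vl_i/W - vl_i, and
the post-placement lag then comes out unscaled, which is what strategy #1
asks for:

	V'    = (W*V + w_i*v_i) / (W + w_i)
	      = (W*V + w_i*(V - w_i*vl_i/W - vl_i)) / (W + w_i)
	      = V - w_i*vl_i / W

	vl'_i = V' - v_i
	      = (V - w_i*vl_i/W) - (V - w_i*vl_i/W - vl_i)
	      = vl_i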