[PATCH 0/4] sched/core: fix cfs_prio_less

From: Cruz Zhao
Date: Wed Nov 15 2023 - 06:33:58 EST


Updating the vruntime snapshot leads to unfair scheduling, especially
when tasks are enqueued and dequeued frequently.

Consider the following case:
- Tasks A1 and A2 share a cookie, and task B has a different cookie.
- A1 is a short task that wakes up frequently but runs only briefly
  each time.
- A2 and B are long-running tasks.
- A1 and B run on ht0, and A2 runs on ht1.

  ht0            ht1             fi_before  fi  update
  switch to A1   switch to A2    0          0   1
  A1 sleeps
  switch to B    A2 force idle   0          1   1
  A1 wakes up
  switch to A1   switch to A2    1          0   1
  A1 sleeps
  switch to B    A2 force idle   0          1   1

In this case, cfs_rq->min_vruntime_fi is updated on every schedule, and
the priorities of B and A2 are pulled to the same level no matter how
long A2 and B have run before, which is unfair. In an extreme case, we
observed the latency of a task grow to several minutes because of this,
when it should have been 100ms.
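
The effect can be illustrated with a small userspace model. This is only
a sketch, not the kernel code: the toy_* types and functions are
hypothetical stand-ins, each queue holds a single task so min_vruntime
simply follows that task, and the comparison mimics normalizing each
entity's vruntime against its queue's min_vruntime_fi snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a per-hyperthread cfs_rq. */
struct toy_cfs_rq {
	uint64_t min_vruntime;    /* live minimum vruntime of the queue */
	uint64_t min_vruntime_fi; /* snapshot used for cross-CPU compares */
};

/* Hypothetical, simplified stand-in for a sched entity. */
struct toy_se {
	uint64_t vruntime;
	struct toy_cfs_rq *cfs_rq;
};

/* Refresh the snapshot, as happens in the "update" column above. */
void toy_snapshot(struct toy_cfs_rq *cfs_rq)
{
	cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
}

/* Let the only task on its queue run; min_vruntime follows it. */
void toy_run(struct toy_se *se, uint64_t delta)
{
	se->vruntime += delta;
	se->cfs_rq->min_vruntime = se->vruntime;
}

/* vruntime relative to the queue's snapshot. */
int64_t toy_normalized(const struct toy_se *se)
{
	assert(se && se->cfs_rq);
	return (int64_t)(se->vruntime - se->cfs_rq->min_vruntime_fi);
}

/* Returns true if a has lower priority than b. */
bool toy_prio_less(const struct toy_se *a, const struct toy_se *b)
{
	return toy_normalized(a) - toy_normalized(b) > 0;
}
```

In this model, if B runs for 5000 units on one queue and A2 for 100 on
the other while both snapshots stay at 0, toy_prio_less(B, A2) is true.
Once toy_snapshot() is called on both queues, both normalized values
become 0 and neither task compares lower than the other, so the history
of how long each task has run is lost.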

To fix this problem, one possible approach is to maintain another
vruntime relative to the core, called core_vruntime, and to compare the
priority of ses using core_vruntime directly instead of the vruntime
snapshot. To achieve this, we introduce cfs_rq->core, similar to
rq->core, and record core_min_vruntime in cfs_rq->core.
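
As a rough illustration of the idea, the sketch below compares entities
by a core-relative vruntime with no snapshot involved. It is a userspace
model under stated assumptions, not the code in this series: the
sketch_* names are hypothetical, the rule that a newly attached entity
starts from core_min_vruntime is an assumption made for the example, and
the maintenance of core_min_vruntime itself is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a cfs_rq with a pointer to the core-wide queue. */
struct sketch_cfs_rq {
	struct sketch_cfs_rq *core;  /* shared by all siblings of a core */
	uint64_t core_min_vruntime;  /* meaningful only on the ->core queue */
};

/* Hypothetical model of a sched entity carrying a core-relative vruntime. */
struct sketch_se {
	struct sketch_cfs_rq *cfs_rq;
	uint64_t core_vruntime;
};

/* Assumption: a newly attached entity starts at the core-wide minimum. */
void sketch_attach(struct sketch_se *se, struct sketch_cfs_rq *cfs_rq)
{
	assert(cfs_rq && cfs_rq->core);
	se->cfs_rq = cfs_rq;
	se->core_vruntime = cfs_rq->core->core_min_vruntime;
}

/* Charge runtime to the entity on the shared, core-wide scale. */
void sketch_account(struct sketch_se *se, uint64_t delta)
{
	se->core_vruntime += delta;
}

/* Returns true if a has lower priority than b; no snapshot is needed. */
bool sketch_prio_less(const struct sketch_se *a, const struct sketch_se *b)
{
	return (int64_t)(a->core_vruntime - b->core_vruntime) > 0;
}
```

Because both entities are measured on the same core-wide scale, an
entity charged 5000 units keeps comparing lower than one charged 100
units regardless of how often the siblings reschedule.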

Cruz Zhao (4):
  sched/core: Introduce core_id
  sched: Introduce cfs_rq->core
  sched: introduce core_vruntime and core_min_vruntime
  fix vruntime snapshot

 include/linux/sched.h |  3 ++
 kernel/sched/core.c   | 37 +++++++---------
 kernel/sched/fair.c   | 98 ++++++++++++++++++++++++++-----------------
 kernel/sched/sched.h  |  5 ++-
 4 files changed, 81 insertions(+), 62 deletions(-)

--
2.39.3