Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

From: Tobias Huschle
Date: Mon Jan 08 2024 - 08:13:57 EST


On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
>
> Peter, would appreciate feedback on this. When is cond_resched()
> insufficient to give up the CPU? Should Documentation/kernel-hacking/hacking.rst
> be updated to require schedule() instead?
>

Happy new year everybody!

I'd like to bring this thread back to life. To reiterate:

- The introduction of the EEVDF scheduler revealed a performance
regression of ~50% in a uperf testcase.
- Tracing the scheduler showed that it makes decisions which are
in line with its design.
- The traces also showed that a vhost instance might run
excessively long on its CPU in some circumstances. These long
runs cause the performance regression, as they delay a kworker
which drives the actual network processing by 100+ms.
- Before EEVDF, the vhost would always be scheduled off its CPU
in favor of the kworker, as the kworker was being woken up and
the former scheduler was giving more priority to the woken up
task. With EEVDF, the kworker, as a long running process, is
able to accumulate negative lag, which causes EEVDF to not
prefer it on its wake up, leaving the vhost running.
- If the kworker is not scheduled when being woken up, the vhost
continues looping until it is migrated off the CPU.
- The vhost offers to be scheduled off the CPU by calling
cond_resched(), but the need_resched flag is not set, so
cond_resched() does nothing.
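
To make the chain of events above easier to follow, here is a small
user-space toy model in C. It is a sketch, not kernel code: all
toy_* names are made up for illustration, all tasks are assumed to
have equal weight, and the wakeup check is reduced to "lag >= 0"
(the real scheduler also takes deadlines into account).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name here is hypothetical, not a kernel symbol. */
struct toy_task {
	long vruntime;		/* virtual runtime consumed so far */
};

struct toy_rq {
	bool need_resched;	/* stands in for the need_resched flag */
	int switches;		/* how often the running task gave up the CPU */
};

/* Average vruntime of all runnable tasks, assuming equal weights. */
static long toy_avg_vruntime(const struct toy_task *tasks, size_t n)
{
	long sum = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += tasks[i].vruntime;
	return sum / (long)n;
}

/* lag = average - own vruntime; negative lag means "got more than its share". */
static long toy_lag(const struct toy_task *t, long avg)
{
	return avg - t->vruntime;
}

/* Simplified wakeup: only ask for a reschedule if the woken task is eligible. */
static void toy_wakeup(struct toy_rq *rq, const struct toy_task *woken, long avg)
{
	if (toy_lag(woken, avg) >= 0)
		rq->need_resched = true;
}

/* Models cond_resched(): yields only if the flag is set, returns 1 if it did. */
static int toy_cond_resched(struct toy_rq *rq)
{
	if (!rq->need_resched)
		return 0;	/* nothing requested a reschedule: no-op */
	rq->need_resched = false;
	rq->switches++;
	return 1;
}
```

If the kworker is woken with a vruntime above the average (negative
lag), toy_wakeup() leaves the flag untouched and every following
toy_cond_resched() call in the vhost loop returns 0.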

To solve this, I see the following options
(the list might be neither complete nor correct):
- Along with the wakeup of the kworker, need_resched needs to
be set, such that cond_resched() triggers a reschedule.
- The vhost calls schedule() instead of cond_resched() to give up
the CPU. This would of course be a significantly stricter
approach and might limit the performance of vhost in other cases.
- Prevent the kworker from accumulating negative lag, as it is
mostly not runnable and, if it runs, only runs for a very short
time frame. This might clash with the overall concept of EEVDF.
- On cond_resched(), check whether the consumed runtime of the
caller outweighs the negative lag of another process (e.g. the
kworker) and, if so, schedule the other process. This introduces
overhead to cond_resched().
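
For illustration, the check described in the last option could look
roughly like the following user-space sketch. This is only one
possible reading of the idea: toy_should_yield() and its parameters
are hypothetical, and runtime and lag are assumed to be expressed in
the same unit.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not a kernel function.
 *
 * caller_runtime: runtime the caller has consumed since it was last
 *                 scheduled in (>= 0).
 * waiter_lag:     lag of the waiting task; negative means the waiter
 *                 has received more than its fair share so far.
 *
 * Returns true if the caller should give up the CPU to the waiter.
 */
static bool toy_should_yield(long caller_runtime, long waiter_lag)
{
	assert(caller_runtime >= 0);

	/* A waiter without negative lag is eligible right away. */
	if (waiter_lag >= 0)
		return true;

	/* Yield once the consumed runtime makes up for the negative lag. */
	return caller_runtime + waiter_lag >= 0;
}
```

With such a check, a waiter with a lag of -100 would get the CPU
once the caller has consumed at least 100 units of runtime, at the
cost of evaluating the comparison on each cond_resched() call.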

I would be curious about feedback on these ideas and am
interested in alternative approaches.