Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

From: Michael S. Tsirkin
Date: Fri Mar 15 2024 - 06:33:16 EST


On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:
> On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
> >
> > Thanks a lot! To clarify, it is not that I am opposed to changing vhost.
> > I would, however, like some documentation to exist saying that if you
> > do abc then call API xyz. Then I hope we can feel a bit safer that
> > future scheduler changes will not break vhost (though as usual, nothing
> > is for sure). Right now we are going by the documentation and that says
> > cond_resched so we do that.
> >
> > --
> > MST
> >
>
> Here I'd like to add that we have two different problems:
>
> 1. cond_resched not working as expected
> This appears to me to be a bug in the scheduler where it lets the cgroup,
> which the vhost is running in, loop endlessly. In EEVDF terms, the cgroup
> is allowed to surpass its own deadline without consequences. One of my RFCs
> mentioned above addresses this issue (not happy yet with the implementation).
> This issue only appears in that specific scenario, so it's not a general
> issue, rather a corner case.
> But, this fix will still allow the vhost to reach its deadline, which is
> one full time slice. This brings down the max delays from 300+ms to whatever
> the timeslice is. This is not enough to fix the regression.
>
> 2. vhost relying on kworker being scheduled on wake up
> This is the bigger issue for the regression. There are rare cases, where
> the vhost runs only for a very short amount of time before it wakes up
> the kworker. Simultaneously, the kworker takes longer than usual to
> complete its work and takes longer than the vhost did before. We
> are talking 4-digit to low 5-digit nanosecond values.
> With those two being the only tasks on the CPU, the scheduler now assumes
> that the kworker wants to unfairly consume more than the vhost and denies
> it being scheduled on wakeup.
> In the regular cases, the kworker is faster than the vhost, so the
> scheduler assumes that the kworker needs help, which benefits the
> scenario we are looking at.
> In the bad case, this unfortunately means that cond_resched cannot work
> as well as before, for this particular case!
> So, let's assume that problem 1 from above is fixed. It will take one
> full time slice to get the need_resched flag set by the scheduler
> because vhost surpasses its deadline. Before that, the scheduler cannot know
> that the kworker should actually run. The kworker itself is unable
> to communicate that by itself since it's not getting scheduled and there
> is no external entity that could intervene.
> Hence my argument that cond_resched still works as expected. The
> crucial part is that the wake up behavior has changed which is why I'm
> a bit reluctant to propose a documentation change on cond_resched.
> I could see proposing a doc change saying that cond_resched should not be
> used if a task heavily relies on a woken up task being scheduled.
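
The wakeup behavior described in problem 2 can be illustrated with a toy
model. This is only a sketch under simplifying assumptions, not the kernel's
implementation: two equal-weight tasks share one CPU, lag is taken as
"fair share minus runtime actually received", and a task is treated as
eligible on wakeup only if its lag is not negative. The function names and
the runtimes (5,000 ns and 20,000 ns) are hypothetical, picked to fall in
the range quoted above.

```python
def lag_after_window(own_runtime_ns, other_runtime_ns):
    """Toy lag for one of two equal-weight tasks sharing a CPU.

    Lag = service the task was entitled to minus service it received.
    Assumption: equal weights, so the fair share is half the total runtime.
    """
    total = own_runtime_ns + other_runtime_ns
    fair_share = total / 2
    return fair_share - own_runtime_ns


def eligible_on_wakeup(lag_ns):
    """Simplified eligibility rule: a task with negative lag has already
    consumed more than its share and is not picked on wakeup."""
    return lag_ns >= 0


# Regular case: the kworker finishes faster than the vhost ran.
regular_lag = lag_after_window(own_runtime_ns=5_000, other_runtime_ns=20_000)

# Bad case: the vhost ran only briefly, the kworker took longer.
bad_lag = lag_after_window(own_runtime_ns=20_000, other_runtime_ns=5_000)

print("regular:", regular_lag, eligible_on_wakeup(regular_lag))
print("bad:    ", bad_lag, eligible_on_wakeup(bad_lag))
```

In this model the kworker carries positive lag in the regular case and is
favored on wakeup, while in the bad case its lag is negative and it has to
wait until the vhost exhausts its slice.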

Could you remind me pls, what is the kworker doing specifically that
vhost is relying on?

--
MST