Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

From: Joel Fernandes
Date: Mon Jun 12 2023 - 16:36:26 EST


Hello Daniel!

On Mon, Jun 12, 2023 at 1:21 PM Daniel Bristot de Oliveira
<bristot@xxxxxxxxxx> wrote:
[...]
> > On Thu, Jun 8, 2023 at 11:58 AM Daniel Bristot de Oliveira
> > <bristot@xxxxxxxxxx> wrote:
> >>
> >> From: Juri Lelli <juri.lelli@xxxxxxxxxx>
> >>
> >> Starting deadline server for lower priority classes right away when
> >> first task is enqueued might break guarantees, as tasks belonging to
> >> intermediate priority classes could be uselessly preempted. E.g., a well
> >> behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
> >> there are still CPU cycles available for NORMAL tasks to run, as they'll
> >> be running inside the fair deadline server for some period of time.
> >>
> >> To prevent this issue, implement a starvation monitor mechanism that
> >> starts the deadline server only if a (fair in this case) task hasn't
> >> been scheduled for some interval of time after it has been enqueued.
> >> Use pick/put functions to manage starvation monitor status.
> >
> > Me and Vineeth were discussing that another way of resolving this
> > issue is to use a DL-server for RT as well, and then using a smaller
> > deadline for RT. That way the RT is more likely to be selected due to
> > its earlier deadline/period.
>
> It would not be that different from what we have now.
>
> One of the problems of throttling nowadays is that it accounts for a large window
> of time, and any "imprecision" can cause the mechanism not to work as expected.
>
> For example, we work on a fully-isolated CPU scenario, where some very sporadic
> workload can be placed on the isolated CPU because of per-cpu kernel activities,
> e.g., kworkers... We need to let them run, but for a minimal amount of time, for
> instance, 20 us, to bound the interference.
>
> The current mechanism does not give this precision because the IRQ accounting
> does not account for runtime for the rt throttling (which makes sense).

I lost you here. "Runtime for the rt throttling" does not make much
sense to me as a statement.

> So the
> RT throttling has the 20 us stolen from IRQs and keeps running. The same will
> happen if we swap the current mechanism with a DL server for the RT.

I read this about 10 times to learn that *maybe* you mean that IRQs
stole time from the "Well behaved running time" of the RT class. I am
not seeing how that is related to creation of a DL-server for the RT
class. Maybe describe your point a bit more clearly?

>
> Also, thinking about short deadlines to fake a fixed priority is... not starting
> well. A fixed-priority higher instance is not a property of a deadline-based
> scheduler, and Linux has a fixed-priority hierarchy: STOP -> DL -> RT -> CFS...
> It is simple, and that is good.
>
> That is why it is better to boost CFS instead of throttling RT. By boosting
> CFS, you do not need a server for RT, and we account for anything on top of CFS
> for free (IRQ/DL/FIFO...).

I did mention in my last email that it is not ideal. I just brought it
up as an option. It might reduce the problem being seen and is better
than not having it.

> > Another approach could be to implement the 0-laxity scheduling as a
> > general SCHED_DEADLINE feature, perhaps through a flag. And allow DL
> > tasks to opt-in to 0-laxity scheduling unless there are idle cycles.
> > And then opt-in the feature for the CFS deadline server task.
>
> A 0-laxity scheduler is not as simple as it sounds, as the priority also depends
> on the "C" (runtime, generally WCET), which is hard to find and embeds
> pessimism. Also, having such a feature would make other mechanisms harder, as
> well as debugging things. For example, proxy-execution or a more precise
> schedulability test...

I think you did not read my email properly. I was saying to make
0-laxity default-off and opt-in for certain DL tasks. That may
work perfectly well for a system like ChromeOS, where we will likely
use the DL server as the sole deadline task and opt it in to
0-laxity. Then we don't need watchdog hacks at all, and it all works
cleanly within the DL class itself. There are the drawbacks of
pessimism, locking, etc. (I already knew about those, as they are the
obvious drawbacks of 0-laxity), but I am not immediately seeing how
this CFS-watchdog with 0-laxity is any different from the DL-server
itself having such a property. If you have a concrete point on why
that won't work, and could explain why a watchdog is better, that
would be great.
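
To make the opt-in idea concrete, here is a rough sketch of what I
mean. It is not from any posted patch: the struct, the zero_laxity
flag and both helpers are hypothetical names made up for illustration,
and the real sched_dl_entity looks different. Laxity is taken as the
time left until the deadline minus the remaining runtime.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified deadline entity (not struct sched_dl_entity). */
struct dl_sketch {
	uint64_t deadline_ns;     /* absolute deadline */
	uint64_t runtime_left_ns; /* remaining budget for this period */
	bool zero_laxity;         /* assumed opt-in flag, default off */
};

/* Laxity: time left until the deadline minus the remaining runtime. */
static int64_t dl_laxity_ns(const struct dl_sketch *e, uint64_t now_ns)
{
	return (int64_t)e->deadline_ns - (int64_t)now_ns -
	       (int64_t)e->runtime_left_ns;
}

/*
 * Entities that did not opt in are eligible right away (plain EDF).
 * An opted-in entity is deferred until its laxity reaches zero,
 * unless the CPU would otherwise go idle.
 */
static bool dl_eligible(const struct dl_sketch *e, uint64_t now_ns,
			bool cpu_would_idle)
{
	if (!e->zero_laxity || cpu_would_idle)
		return true;
	return dl_laxity_ns(e, now_ns) <= 0;
}
```

With a deadline at 1000 ns and 200 ns of budget left, an opted-in
entity stays deferred at t=700 (laxity 100) and becomes eligible at
t=800 (laxity 0). Regular DL tasks never see the deferral.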

> In a paper, the scheduler alone is the solution. In real life, the solution
> for problems like locking is as fundamental as the scheduler. We need to keep
> things simple to expand on these other topics as well.
>
> So, I do not think we need all the drawbacks of a mixed solution to just fix
> the throttling problem, and EDF is more capable and explored for the
> general case.

Again, I was saying that making it opt-in, and just enabling that
property for the DL server, seems like a reasonable approach.

> With this patch's idea (and expansions), we can fix the throttling problem
> without breaking other behaviors like scheduling order...

I don't mind the watchdog patch as such, of course. I presented its
mechanics at OSSNA and I know how it works, but I feel the DL server
opting in to 0-laxity would be cleaner, while keeping such behavior
default-off for regular DL uses. That's my opinion -- but what else am
I missing? Either way, no harm in discussing alternate approaches as
well even if we are settling for the watchdog.
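
For comparison, the watchdog behavior as described in the changelog
quoted at the top (start the server only if a fair task has not been
scheduled for some interval after enqueue, with pick/put managing the
monitor state) can be sketched roughly as below. This is only my
simplified reading, not the patch code: every name here is
hypothetical, and the timeout value is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-runqueue monitor state; not taken from the patch. */
struct starvation_monitor {
	bool armed;           /* a fair task is queued and waiting */
	uint64_t armed_at_ns; /* when the wait started */
	uint64_t timeout_ns;  /* assumed starvation interval */
	bool server_running;  /* deadline server already started */
};

/* A fair task was enqueued: start watching if not already doing so. */
static void mon_enqueue(struct starvation_monitor *m, uint64_t now_ns)
{
	if (!m->armed && !m->server_running) {
		m->armed = true;
		m->armed_at_ns = now_ns;
	}
}

/* A fair task was picked to run: it is not starving, so disarm. */
static void mon_pick(struct starvation_monitor *m)
{
	m->armed = false;
}

/* The fair task was put back while still runnable: watch again. */
static void mon_put(struct starvation_monitor *m, uint64_t now_ns)
{
	mon_enqueue(m, now_ns);
}

/* Periodic check: start the server only after the interval expired. */
static bool mon_check(struct starvation_monitor *m, uint64_t now_ns)
{
	if (m->armed && now_ns - m->armed_at_ns >= m->timeout_ns) {
		m->armed = false;
		m->server_running = true;
	}
	return m->server_running;
}
```

The point of the comparison: here the deferral lives in fair-class
hooks outside the DL class, whereas with an opt-in 0-laxity property
the same deferral would be decided inside the DL class.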

> > Lastly, if the goal is to remove RT throttling code eventually, are
> > you also planning to remove RT group scheduling as well? Are there
> > users of RT group scheduling that might be impacted? On the other
> > hand, RT throttling / group scheduling code can be left as it is
> > (perhaps documenting it as deprecated) and the server stuff can be
> > implemented via a CONFIG option.
>
> I think that the idea is to have the DL servers eventually replace the group
> schedule. But I also believe that it is better to start by solving the
> throttling and then moving to other constructions on top of the mechanism.

Hmm. For throttling at the root level, yes, but I am not seeing how
you can replace the group scheduling code for existing users of RT
cgroups with this. The throttling in the RT group scheduling code is
not only about "not starving CFS"; it is more about letting RT groups
run with a certain bandwidth. So you cannot really
delete it if there are real users of that code -- you'll have to
migrate those users away first (to an alternate implementation like
DL). If there are no users of RT group scheduling, that's lovely
though. We don't use it in ChromeOS fwiw.

- Joel