Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

From: Joel Fernandes
Date: Wed Jun 14 2023 - 14:25:20 EST


Before I reply, I just want to mention: I am OK with this patch 6/6 as
such, and the ideas I mentioned earlier are merely alternatives for
patch 6/6, offered for discussion's sake. The devil is in the details,
as Daniel and Juri pointed out.

With that said...

On Wed, Jun 14, 2023 at 9:45 AM Daniel Bristot de Oliveira
<bristot@xxxxxxxxxx> wrote:
>
> On 6/13/23 17:32, Joel Fernandes wrote:
> > On Tue, Jun 13, 2023 at 9:43 AM Daniel Bristot de Oliveira
> > <bristot@xxxxxxxxxx> wrote:
> > [...]
> >>> On Mon, Jun 12, 2023 at 1:21 PM Daniel Bristot de Oliveira
> >>> <bristot@xxxxxxxxxx> wrote:
> >>> [...]
> >>>>> On Thu, Jun 8, 2023 at 11:58 AM Daniel Bristot de Oliveira
> >>>>> <bristot@xxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> From: Juri Lelli <juri.lelli@xxxxxxxxxx>
> >>>>>>
> >>>>>> Starting the deadline server for lower priority classes right away when
> >>>>>> the first task is enqueued might break guarantees, as tasks belonging to
> >>>>>> intermediate priority classes could be uselessly preempted. E.g., a
> >>>>>> well-behaved (non-hog) FIFO task can be preempted by NORMAL tasks even if
> >>>>>> there are still CPU cycles available for NORMAL tasks to run, as they'll
> >>>>>> be running inside the fair deadline server for some period of time.
> >>>>>>
> >>>>>> To prevent this issue, implement a starvation monitor mechanism that
> >>>>>> starts the deadline server only if a (fair in this case) task hasn't
> >>>>>> been scheduled for some interval of time after it has been enqueued.
> >>>>>> Use pick/put functions to manage starvation monitor status.
> >>>>>
> >>>>> Vineeth and I were discussing that another way of resolving this
> >>>>> issue is to use a DL-server for RT as well, and then use a smaller
> >>>>> deadline for RT. That way RT is more likely to be selected due to
> >>>>> its earlier deadline/period.
> >>>>
> >>>> It would not be that different from what we have now.
> >>>>
> >>>> One of the problems of throttling nowadays is that it accounts for a large window
> >>>> of time, and any "imprecision" can cause the mechanism not to work as expected.
> >>>>
> >>>> For example, we work on a fully-isolated CPU scenario, where some very sporadic
> >>>> workload can be placed on the isolated CPU because of per-cpu kernel activities,
> >>>> e.g., kworkers... We need to let them run, but for a minimal amount of time, for
> >>>> instance, 20 us, to bound the interference.
> >>>>
> >>>> The current mechanism does not give this precision because the IRQ accounting
> >>>> does not account for runtime for the rt throttling (which makes sense).
> >>>
> >>> I lost you here, "Runtime for the rt throttling" does not make much
> >>> sense to me as a statement.
> >>
> >> Both update_curr_rt() and update_curr_dl() use rq_clock_task() as the clock. update_rq_clock_task()
> >> subtracts the irq_delta from the task's clock (inside #ifdef CONFIG_IRQ_TIME_ACCOUNTING), and this
> >> clock is used for throttling.
> >
> > That was a much better description. You're basically saying that since
> > the running time of the RT class is not accounted for in the clock, it
> > affects the throttling and unthrottling times. I recall running into a
> > similar issue on Android, where the RT time was showing up as CFS load.
> >
> > For RT throttling though, in our testing the time scales are large
> > enough (for our use case) that such time stealing wasn't an issue. I am
> > going for something practical that works; it does not have to be
> > perfect, since these problems have been around for several years now
> > and leaving RT throttling broken is probably not a good thing.
>
> By postponing the enqueue/replenishment of the DL server here, we are fixing the
> problem in a practical way that works without breaking existing useful properties &
> use-cases.

Ok, that sounds good to me.
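
Also, going back to the IRQ-time accounting point above, and only to
check my own understanding: below is a tiny user-space toy model (not
kernel code, made-up numbers) of the precision problem you describe.
Because the task clock that the throttling accounting sees has IRQ time
subtracted from it under CONFIG_IRQ_TIME_ACCOUNTING, a small budget
such as 20 us does not bound the wall-clock interference precisely:

#include <stdio.h>

int main(void)
{
        long budget_us   = 20;  /* intended interference bound */
        long wall_us     = 0;   /* real elapsed time */
        long task_clk_us = 0;   /* accounted time, IRQ time removed */

        /* run in 5us slices, with 2us of IRQ time stolen per slice */
        while (task_clk_us < budget_us) {
                wall_us     += 5;
                task_clk_us += 5 - 2;   /* what the throttling "sees" */
        }

        printf("budget=%ldus accounted=%ldus wall=%ldus\n",
               budget_us, task_clk_us, wall_us);
        return 0;
}

which prints budget=20us accounted=21us wall=35us, i.e. the actual
wall-clock interference exceeds the intended bound.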

> [...]
> >>> not seeing how that is related to creation of a DL-server for the RT
> >>> class. Maybe describe your point a bit more clearly?
> >>
> >> This patch is targeting a better way to avoid SCHED_OTHER starvation.
> >> Having a DL server for the RT class does not help with that. We need to boost
> >> SCHED_OTHER.
> >
> > Oh, actually the problem of boosting SCHED_OTHER is a bit orthogonal
> > to what I said. I was not saying not to boost SCHED_OTHER; I was
> > talking more about this particular patch and using a DL-based RT
> > server to mitigate that issue. The boosting is already handled in
> > previous patches with the DL server.
>
> The boosting of the previous patch breaks FIFO priority. This patch fixes that
> single point without touching or breaking SCHED_DEADLINE/EDF properties. With
> these things in place we do not mitigate, we fix the problem.

Sure, that's fine with me.

[..]
> > could you just push it out to run in the next period if it has the
> > flag.
>
> The improvement on top of this patch is to postpone the enqueue/replenishment to the 0-laxity
> time. By doing so, the task receives a new period (and thus deadline) a period ahead.

Yes, I understand. I was hoping for us to do that from within the DL
class itself as a DL task property, but perhaps that's my wishful
thinking...
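
Just to double-check my mental model of the deferral with a concrete
example (made-up numbers, so correct me if I got it wrong): say the
fair server has runtime 10ms and period 100ms, and a CFS task is
enqueued at t=0 while a FIFO task is hogging the CPU. Instead of
enqueueing the server at t=0 with deadline t=100ms (which could preempt
the well-behaved FIFO task right away), the replenishment is deferred
to the 0-laxity point t=90ms, where the server gets a fresh period,
i.e. a deadline of t=190ms. So the FIFO task is not preempted at t=0,
and the starving CFS task is still guaranteed the server's 10ms of
runtime by t=190ms under EDF.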

> >
> > Here is the definition of 0-laxity as I understand it. Please correct
> > me as I have not done a PhD in these things like you ;-)
> >
> > Laxity, also known as slack time,
>
> laxity = absolute deadline - activation time - runtime.

Correct.
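
(To make that concrete with made-up numbers: a task activated at t=0
with an absolute deadline at t=10ms and 3ms of runtime has laxity
10 - 0 - 3 = 7ms, so its 0-laxity point is t=7ms.)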

> > is the amount of time that a task
> > can be delayed without causing a missed deadline.
>
> If you look at the task alone! E.g., if it does not face preemption!
>
> > A 0-laxity task is
> > one that has no more time to spare and must be executed immediately
>
> and not be preempted!
>
> > to
> > avoid missing its deadline.
>
> Again, you are looking at the task alone.

Sure, that's why I mentioned running it before 0-laxity itself, to
account for that: not just preemption but also other issues such as
I/O, locking, sleeping, etc. I don't have a formula right now and need
to think more about it, but that was the idea at a high level. Again,
it is all rainbows and ponies and the devil is in the details, so just
consider it an idea/suggestion and not something we must urgently do.
I may find time later to go over the papers, such as those related to
laxity-based scheduling.

> > And here's where I need your input: if we take all admitted DL tasks
> > and run them at their respective 0-laxity times or slightly earlier,
> > then in theory, they should all meet their deadlines correctly?
> For all tasksets, no!
>
> There might be a taskset here or there that creates such a timeline under EDF,
> but it is not always true that a task under EDF will wait until the 0-laxity
> time to run. EDF is work-conserving.
>
> For a work-conserving scheduler to build such a timeline, it needs to
> have no idle time. Then, let's take the classical single-core assumptions
> (these servers are per-cpu).
>
> - Assuming single-core/partitioned scheduler.
> - Assuming periodic tasks with deadline = period
> - Assuming a task set with the sum of each task utilization = 1
> - Assuming all tasks are dispatched at the same time (critical instant)
> - Assuming all tasks will run for their entire runtime, without blocking
>
> (so... the case in which EDF is optimal... fully loaded...)
>
> Even so, you will not have EDF always creating such a timeline, because the
> task with the earliest deadline will run first; still, deadlines will be met.
>
> For example:
>
> t1 = 5/10
> t2 = 5/10
>
> Whichever task you pick first will run its 5 units of time before the "0-laxity time".
>
> If there is a scheduler that always builds a timeline like you want, it will not
> schedule the taskset I mentioned... thus... it will schedule less than EDF.

Yes, I think you are kind of saying the same thing that I mentioned,
which is to run it before the 0-laxity time (perhaps depending on the
runtime of the existing admitted tasks).
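
(To spell out your example and make sure I follow: with t1 = t2 = 5/10
both released at t=0, EDF can run, say, t1 over [0, 5) and t2 over
[5, 10). t1 then starts 5 units before its 0-laxity point at t=5,
while t2 happens to start exactly at its 0-laxity point, and both meet
their deadlines at t=10. A scheduler that insisted on starting every
task only at its 0-laxity point would have to start both at t=5, which
cannot work on a single CPU, hence it schedules less than EDF does.)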

> > Perhaps you mean the algorithm needs to push the new period/deadline
> > to a later time, at the 0-laxity point.
>
> This is the idea behind this patch ^^^^ This is the difference between running
> and replenishing I mentioned in previous emails.
>
> > That's also fine with me. But some
> > variation of the idea is possible AFAICS (again could be missing
> > something that mathematically makes this impossible).
>
> You are looking at a fragment of the information... "0-laxity time," with
> a single task in mind - not in the context of a scheduler.

I probably should not have used the word 0-laxity, because what I
mentioned was running *before* 0-laxity arrives, precisely to account
for delays. But hey, it is just an idea, and sorry if it sounded like
noise. It goes back to the premise I mentioned, which is that a DL
task does not need to run immediately and preempt RT; it can run later
and still meet its deadline. When to run it is a different question,
but given a crystal ball, a DL task can still meet its deadline by
running at a later time.

> >> In the cover, I mentioned improving this patch, so maybe watchdog is not
> >> the appropriate name. 0-laxity server is not a good name either because
> >> it might lead people to think that the server will RUN at 0-laxity
> >> while it will actually be replenished at 0-laxity. Maybe "deferred
> >> server" is a better name.
> >
> > Yeah maybe a deferred server could be a better name.
> >
> >>>> In a paper, the scheduler alone is the solution. In real life, the solution
> >>>> for problems like locking is as fundamental as the scheduler. We need to keep
> >>>> things simple to expand on these other topics as well.
> >>>>
> >>>> So, I do not think we need to take on all the drawbacks of a mixed solution
> >>>> just to fix the throttling problem, and EDF is more capable and better
> >>>> explored for the general case.
> >>>
> >>> Again, I was saying that making it opt-in seems like a reasonable approach,
> >>> just enabling such a property for the DL server.
> >>
> >> Can we have a "deferred DL server"? Is that your question?
> >>
> >> If so, I think so. But we have other points to look at first. DL servers are
> >> per-cpu, so they break global scheduling. We need to think about an
> >> interface... and there are other points that we need to understand before
> >> trying some other, more optimized cases.
> >
> > You mean an interface for the DL server params? Those can be similar
> > to other sched knobs. And then boot with a CONFIG option and boost CFS
> > things by default if RT is active. Would that work or did you mean
> > something else by interface?

About the interface, perhaps you are referring to using this mechanism
to replace the stalld daemon? That's what I remember from our
conversations at OSPM. In general, I think it is a great idea to
automatically detect "important" starving CFS tasks and boost them
(whether they are starving because of RT or for some other reason).

> >> Generalizing it requires time, but it will happen... and you know that Juri and I
> >> care about ChromeOS's use case, as I have been discussing this with you all and
> >> even participating in Google/Chrome-focused meetings about sched...
> >> at 6 pm our time ;-).
> >
> > I totally appreciate that, please don't get offended, we go a long way
> > back as friends ;-)
>
> I did not get offended, and nothing changes in our friendship :-). I am just
> clarifying for you things we already know - even before this rebase... We are
> aware of Chrome's needs, as well as general RT Linux needs.
>
> The basic idea behind this patch works for all cases and unlocks this
> situation. The code will be improved in the next version.

Thanks for the discussions! I am looking forward to helping in any way
I can on the series; I am going to be testing it for ChromeOS.

- Joel