Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server

From: Joel Fernandes
Date: Mon Nov 06 2023 - 16:37:49 EST


On Mon, Nov 6, 2023 at 4:32 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Nov 6, 2023 at 2:32 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> >
> > Hi Daniel,
> >
> > On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira
> > <bristot@xxxxxxxxxx> wrote:
> > >
> > > Among the motivations for the DL servers is the real-time throttling
> > > mechanism. This mechanism works by throttling the rt_rq after
> > > running for a long period without leaving space for fair tasks.
> > >
> > > The base dl server avoids this problem by boosting fair tasks instead
> > > of throttling the rt_rq. The point is that it boosts without waiting
> > > for potential starvation, causing some non-intuitive cases.
> > >
> > > For example, an IRQ dispatches two tasks on an idle system, a fair
> > > and an RT. The DL server will be activated, running the fair task
> > > before the RT one. This problem can be avoided by deferring the
> > > dl server activation.
> > >
> > > By setting the zerolax option, the dl_server will dispatch an
> > > SCHED_DEADLINE reservation with replenished runtime, but throttled.
> > >
> > > The dl_timer will be set for (period - runtime) ns from start time.
> > > Thus boosting the fair rq on its 0-laxity time with respect to
> > > rt_rq.
> > >
> > > If the fair scheduler has the opportunity to run while waiting
> > > for zerolax time, the dl server runtime will be consumed. If
> > > the runtime is completely consumed before the zerolax time, the
> > > server will be replenished while still in a throttled state. Then,
> > > the dl_timer will be reset to the new zerolax time
> > >
> > > If the fair server reaches the zerolax time without consuming
> > > its runtime, the server will be boosted, following CBS rules
> > > (thus without breaking SCHED_DEADLINE).
> > >
> > > Signed-off-by: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
> > > ---
> > > include/linux/sched.h | 2 +
> > > kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
> > > kernel/sched/fair.c | 3 ++
> > > 3 files changed, 103 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index 5ac1f252e136..56e53e6fd5a0 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -660,6 +660,8 @@ struct sched_dl_entity {
> > > unsigned int dl_non_contending : 1;
> > > unsigned int dl_overrun : 1;
> > > unsigned int dl_server : 1;
> > > + unsigned int dl_zerolax : 1;
> > > + unsigned int dl_zerolax_armed : 1;
> > >
> > > /*
> > > * Bandwidth enforcement timer. Each -deadline task has its
> > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > > index 1d7b96ca9011..69ee1fbd60e4 100644
> > > --- a/kernel/sched/deadline.c
> > > +++ b/kernel/sched/deadline.c
> > > @@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
> > > /* for non-boosted task, pi_of(dl_se) == dl_se */
> > > dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> > > dl_se->runtime = pi_of(dl_se)->dl_runtime;
> > > +
> > > + /*
> > > + * If it is a zerolax reservation, throttle it.
> > > + */
> > > + if (dl_se->dl_zerolax) {
> > > + dl_se->dl_throttled = 1;
> > > + dl_se->dl_zerolax_armed = 1;
> > > + }
> > > }
> > >
> > > /*
> > > @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
> > > * could happen are, typically, a entity voluntarily trying to overcome its
> > > * runtime, or it just underestimated it during sched_setattr().
> > > */
> > > +static int start_dl_timer(struct sched_dl_entity *dl_se);
> > > static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> > > {
> > > struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > > @@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> > > dl_se->dl_yielded = 0;
> > > if (dl_se->dl_throttled)
> > > dl_se->dl_throttled = 0;
> > > +
> > > + /*
> > > + * If this is the replenishment of a zerolax reservation,
> > > + * clear the flag and return.
> > > + */
> > > + if (dl_se->dl_zerolax_armed) {
> > > + dl_se->dl_zerolax_armed = 0;
> > > + return;
> > > + }
> > > +
> > > + /*
> > > + * A this point, if the zerolax server is not armed, and the deadline
> > > + * is in the future, throttle the server and arm the zerolax timer.
> > > + */
> > > + if (dl_se->dl_zerolax &&
> > > + dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
> > > + if (!is_dl_boosted(dl_se)) {
> > > + dl_se->dl_zerolax_armed = 1;
> > > + dl_se->dl_throttled = 1;
> > > + start_dl_timer(dl_se);
> > > + }
> > > + }
> > > }
> > >
> > > /*
> > > @@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
> > > }
> > >
> > > replenish_dl_new_period(dl_se, rq);
> > > + } else if (dl_server(dl_se) && dl_se->dl_zerolax) {
> > > + /*
> > > + * The server can still use its previous deadline, so throttle
> > > + * and arm the zero-laxity timer.
> > > + */
> > > + dl_se->dl_zerolax_armed = 1;
> > > + dl_se->dl_throttled = 1;
> > > }
> > > }
> > >
> > > @@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
> > > * We want the timer to fire at the deadline, but considering
> > > * that it is actually coming from rq->clock and not from
> > > * hrtimer's time base reading.
> > > + *
> > > + * The zerolax reservation will have its timer set to the
> > > + * deadline - runtime. At that point, the CBS rule will decide
> > > + * if the current deadline can be used, or if a replenishment
> > > + * is required to avoid add too much pressure on the system
> > > + * (current u > U).
> > > */
> > > - act = ns_to_ktime(dl_next_period(dl_se));
> > > + if (dl_se->dl_zerolax_armed) {
> > > + WARN_ON_ONCE(!dl_se->dl_throttled);
> > > + act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
> >
> > Just a question, here if dl_se->deadline - dl_se->runtime is large,
> > then does that mean that server activation will be much more into the
> > future? So say I want to give CFS 30%, then it will take 70% of the
> > period before CFS preempts RT thus "starving" CFS for this duration. I
> > think that's Ok for smaller periods and runtimes, though.
> >
> > I think it does reserve the amount of required CFS bandwidth so it is
> > probably OK, though it is perhaps letting RT run more initially (say
> > if CFS tasks are not CPU bound and occasionally wake up, they will
> > always be hit by the 70% latency AFAICS which may be large for large
> > periods and small runtimes).
> >
>
> One more consideration I guess is, because the server is throttled
> till 0-laxity time, it is possible that if CFS sleeps even a bit
> (after the DL-server is unthrottled), then it will be pushed out to a
> full current deadline + period due to CBS. In such a situation, if
> CFS-server is the only DL task running, it might starve RT for a bit
> more time.
>
> Example, say CFS runtime is 0.3s and period is 1s. At 0.7s, 0-laxity
> timer fires. CFS runs for 0.29s, then sleeps for 0.005s and wakes up
> at 0.295s (its remaining runtime is 0.01s at this point which is < the
> "time till deadline" of 0.005s). Now the runtime of the CFS-server
> will be replenished to the full 3s (due to CBS) and the deadline
> pushed out. The end result is the total runtime that the CFS-server
> actually gets is 0.0595s (though yes it did sleep for 5ms in between,
> still that's tiny -- say if it briefly blocked on a kernel mutex).

Blah, I got lost in decimal points. Here's the example again:

Say CFS-server runtime is 0.3s and period is 1s.

At 0.7s, the 0-laxity timer fires. CFS runs for 0.29s, then sleeps for
0.005s and wakes up 0.295s after the timer fired, i.e. at 0.995s. Its
remaining runtime is 0.01s at this point, which is > the "time till
deadline" of 0.005s.

Now the runtime of the CFS-server will be replenished to the full 0.3s
(due to CBS) and the deadline pushed out.

The end result is that the CFS-server can get 0.59s of runtime within
a 0.595s window (yes, it did sleep for 5ms in between, but that's
tiny -- say if it briefly blocked on a kernel mutex). That's almost
double the allocated runtime.
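
To make the arithmetic easy to check, here is a rough userspace sketch of
the scenario. It is not kernel code: struct srv and all of the helpers are
hypothetical names, times are in milliseconds, the relative deadline is
assumed equal to the period, and the wakeup check is the textbook CBS rule
(remaining runtime / time to deadline > runtime / period). It also ignores
that the patch re-throttles a zerolax server after a replenishment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the fair server; all times in milliseconds. */
struct srv {
	uint64_t dl_runtime;	/* configured runtime per period */
	uint64_t dl_period;	/* configured period (== relative deadline) */
	uint64_t runtime;	/* remaining runtime */
	uint64_t deadline;	/* current absolute deadline */
};

/* Zero-laxity instant: the latest start that still fits the runtime. */
static uint64_t zerolax_time(const struct srv *s)
{
	return s->deadline - s->runtime;
}

/*
 * CBS wakeup test, cross-multiplied to stay in integers:
 *   runtime / (deadline - now) > dl_runtime / dl_period
 */
static bool cbs_overflow(const struct srv *s, uint64_t now)
{
	if (now >= s->deadline)
		return true;
	return s->runtime * s->dl_period >
	       (s->deadline - now) * s->dl_runtime;
}

/* On wakeup, start a new period only if the old deadline cannot be kept. */
static void cbs_wakeup(struct srv *s, uint64_t now)
{
	if (cbs_overflow(s, now)) {
		s->deadline = now + s->dl_period;
		s->runtime = s->dl_runtime;
	}
}

/* Consume 'ran' ms of runtime and return the new current time. */
static uint64_t run_for(struct srv *s, uint64_t now, uint64_t ran)
{
	assert(ran <= s->runtime);
	s->runtime -= ran;
	return now + ran;
}

/* Replay the example: returns the CPU time available to the server. */
static uint64_t replay_example(void)
{
	struct srv s = { .dl_runtime = 300, .dl_period = 1000,
			 .runtime = 300, .deadline = 1000 };
	uint64_t now = zerolax_time(&s);	/* timer fires at 700 */
	uint64_t got = 290;

	now = run_for(&s, now, 290);		/* 990, 10 left */
	now += 5;				/* sleeps until 995 */
	cbs_wakeup(&s, now);			/* 10 * 1000 > 5 * 300 */
	return got + s.runtime;			/* 290 + refilled 300 */
}
```

With these numbers the wakeup check fails the old deadline, so the server
has 290ms before the sleep plus up to 300ms after it. Had only 1ms of
runtime been left at 0.995s, the same check would have kept the old
deadline.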

This is just theoretical and I have yet to see if it is actually an
issue in practice.

Thanks.