Re: [PATCH RESEND] sched/nohz: Add HRTICK_BW for using cfs bandwidth with nohz_full

From: Phil Auld
Date: Thu May 18 2023 - 18:02:16 EST


On Thu, May 18, 2023 at 02:29:04PM -0700 Benjamin Segall wrote:
> Phil Auld <pauld@xxxxxxxxxx> writes:
>
> > CFS bandwidth limits and NOHZ full don't play well together. Tasks
> > can easily run well past their quotas before a remote tick does
> > accounting. This leads to long, multi-period stalls before such
> > tasks can run again. Use the hrtick mechanism to set a sched
> > tick to fire at remaining_runtime in the future if we are on
> > a nohz full cpu, if the task has quota and if we are likely to
> > disable the tick (nr_running == 1). This allows for bandwidth
> > accounting before tasks go too far over quota.
> >
> > A number of container workloads use a dynamic number of real
> > nohz tasks but also have other work that is limited which ends
> > up running on the "spare" nohz cpus. This is an artifact of
> > having to specify nohz_full cpus at boot. Adding this hrtick
> > resolves the issue of long stalls on these tasks.
> >
> > Add the sched_feat HRTICK_BW off by default to allow users to
> > enable this only when needed.
> >
> > Signed-off-by: Phil Auld <pauld@xxxxxxxxxx>
> > Suggested-by: Juri Lelli <jlelli@xxxxxxxxxx>
> > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
> > Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> > Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
> > Cc: Ben Segall <bsegall@xxxxxxxxxx>
> > ---
> >
> > Resend with LKML address instead of rh list...
> >
> > kernel/sched/core.c     |  2 +-
> > kernel/sched/fair.c     | 20 ++++++++++++++++++++
> > kernel/sched/features.h |  1 +
> > kernel/sched/sched.h    | 12 ++++++++++++
> > 4 files changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a68d1276bab0..76425c377245 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6562,7 +6562,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> >
> >  	schedule_debug(prev, !!sched_mode);
> >
> > -	if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
> > +	if (sched_feat(HRTICK) || sched_feat(HRTICK_DL) || sched_feat(HRTICK_BW))
> >  		hrtick_clear(rq);
> >
> >  	local_irq_disable();
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 373ff5f55884..0dd1f6a874bc 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5309,6 +5309,22 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >  	return ret;
> >  }
> >
> > +#if defined(CONFIG_SCHED_HRTICK) && defined(CONFIG_NO_HZ_FULL)
> > +static void start_hrtick_cfs_bw(struct rq *rq, struct cfs_rq *cfs_rq)
> > +{
> > +	if (!tick_nohz_full_cpu(rq->cpu) || !cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
> > +		return;
> > +
> > +	/* runtime_remaining should never be negative here but just in case */
> > +	if (rq->nr_running == 1 && cfs_rq->runtime_remaining > 0)
> > +		hrtick_start(rq, cfs_rq->runtime_remaining);
> > +}
> > +#else
> > +static void start_hrtick_cfs_bw(struct rq *rq, struct cfs_rq *cfs_rq)
> > +{
> > +}
> > +#endif
> > +
> >  static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> >  {
> >  	/* dock delta_exec before expiring quota (as it could span periods) */
> > @@ -5481,6 +5497,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
> >  	 */
> >  	cfs_rq->throttled = 1;
> >  	cfs_rq->throttled_clock = rq_clock(rq);
> > +
> >  	return true;
> >  }
> >
> > @@ -8096,6 +8113,9 @@ done: __maybe_unused;
> >  	if (hrtick_enabled_fair(rq))
> >  		hrtick_start_fair(rq, p);
> >
> > +	if (hrtick_enabled_bw(rq))
> > +		start_hrtick_cfs_bw(rq, task_cfs_rq(p));
> > +
> >  	update_misfit_status(p, rq);
>
> Implementation-wise this winds up with a tick of
> sysctl_sched_cfs_bandwidth_slice, which I suppose the admin could _also_
> configure depending on how much NOHZ benefit vs cfsb issues they want.
>

If the task is using all of its bandwidth, then yes. Really though, it seems
to me that by specifying a cfs bandwidth limit they are saying the nohz part
is not that important.
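
(For anyone following along: the slice in question is the
kernel.sched_cfs_bandwidth_slice_us sysctl, IIRC 5000us by default. Roughly,
from memory rather than verbatim fair.c:

/*
 * Runtime is handed to each cfs_rq a slice at a time, so right after a
 * refill runtime_remaining is at most one slice -- which is what
 * start_hrtick_cfs_bw() ends up programming the hrtick with.
 */
static inline u64 sched_cfs_bandwidth_slice(void)
{
	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
}

So raising that sysctl buys fewer hrticks at the cost of coarser quota
distribution, which I think is the trade-off you're describing.)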

Short of doing the remote tick way more often, you can't really have both.
The default setup favors nohz and lets the tasks run until a remote tick
gets them, and by then they are some ~5-10 periods in the hole (the remote
tick only comes around roughly once a second, against a default 100ms period).

I'm open to better solutions.

> It's not great that this implementation winds up going all the way
> through schedule() for each 5ms-default tick, though.
>

Yeah. We initially had the start_hrtick_cfs_bw() call in
__account_cfs_rq_runtime() but that was too late: it doesn't catch the case
where we first put the task on the cpu. Now that you mention it, though, I
think we need it there as well, so that we get a new timer if the task gets
more bandwidth and can stay on the cpu.
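Roughly something like this (untested, and the function body is from memory
rather than verbatim; hrtick_enabled_bw() and start_hrtick_cfs_bw() are the
helpers from this patch):

static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
	struct rq *rq = rq_of(cfs_rq);

	/* dock delta_exec before expiring quota (as it could span periods) */
	cfs_rq->runtime_remaining -= delta_exec;

	if (likely(cfs_rq->runtime_remaining > 0))
		return;

	if (cfs_rq->throttled)
		return;

	if (!assign_cfs_rq_runtime(cfs_rq)) {
		/* couldn't extend runtime; resched so the hierarchy throttles */
		if (likely(cfs_rq->curr))
			resched_curr(rq);
	} else if (hrtick_enabled_bw(rq)) {
		/* got a fresh slice; re-arm the tick at the new remaining */
		start_hrtick_cfs_bw(rq, cfs_rq);
	}
}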

Not stopping the tick looks better and better...

Thanks for taking a look.


Cheers,
Phil

--