Re: [PATCH v5 7/7] sched/fair: Fair server interface

From: Peter Zijlstra
Date: Tue Nov 07 2023 - 03:16:25 EST


On Mon, Nov 06, 2023 at 05:29:49PM +0100, Daniel Bristot de Oliveira wrote:
> On 11/6/23 16:40, Peter Zijlstra wrote:
> > On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> >> Add an interface for fair server setup on debugfs.
> >>
> >> Each rq has three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> >>
> >> - fair_server_runtime: set runtime in ns
> >> - fair_server_period: set period in ns
> >> - fair_server_defer: on/off for the defer mechanism
> >>
> >
> > This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
> > total available bandwidth control, right?
>
> right, but thinking aloud... given that the per-cpu files are already allocating the
> bandwidth on the dl_rq, the spare time for fair scheduler is granted.
>
> Still, we can have them there as a safeguard against overloading the deadline
> scheduler... (thinking aloud 2) as long as global is a thing... as we get away
> from it, that global limitation will make less sense - still better to have a form
> of limitation so people are aware of bandwidth until then.

Yeah, so having a limit on the deadline thing seems prudent as a way to
model system overhead. I mean 100% sounds nice, but then all the models
also assume no interrupts, no scheduler or migration overhead etc.. So
setting a slightly lower max seems far more realistic to me.

That said, the period/bandwidth thing is now slightly odd, as we really
only care about the utilization. But whatever. One thing at a time.
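
To make the "only care about the utilization" point concrete, here is a
minimal sketch. It is not kernel code; the function names and the
reservation values are illustrative assumptions. The cap assumes
sched_rt_runtime_us=950000 and sched_rt_period_us=1000000, which may
differ on a given system.

```python
from fractions import Fraction

def utilization(runtime_ns, period_ns):
    """Fraction of a CPU reserved by a (runtime, period) pair."""
    if period_ns <= 0 or not 0 <= runtime_ns <= period_ns:
        raise ValueError("need 0 <= runtime <= period and period > 0")
    return Fraction(runtime_ns, period_ns)

def fits_under_cap(reservations, cap):
    """True if the summed utilization stays within the global cap."""
    return sum(utilization(r, p) for r, p in reservations) <= cap

# Assumed global limit: sched_rt_runtime_us / sched_rt_period_us.
cap = Fraction(950_000, 1_000_000)

# Two different (runtime, period) pairs, same utilization (illustrative values).
a = utilization(50_000_000, 1_000_000_000)  # 50 ms every 1 s
b = utilization(5_000_000, 100_000_000)     # 5 ms every 100 ms

# A reservation set that exactly fills the cap, and one that exceeds it.
ok = fits_under_cap([(50_000_000, 1_000_000_000),
                     (900_000_000, 1_000_000_000)], cap)
too_much = fits_under_cap([(50_000_000, 1_000_000_000),
                           (900_000_000, 1_000_000_000),
                           (10_000_000, 1_000_000_000)], cap)
```

Both pairs reduce to the same ratio, which is all an admission check
against the cap looks at; the period only changes how the budget is
spread over time.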

> > But then shouldn't we also rip out the throttle thingy right quick?
> >
>
> I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
> for now, because GROUP_SCHED depends on it, no?

Yes. Until we can delete all that code we'll have to keep some of that.

> With the next step on moving the dl server as the base for the
> hierarchical scheduling... That will rip out the
> CONFIG_RT_GROUP_SCHED... with a thing with a per-cpu interface.
>
> Does it make sense?

I'm still not sure how to deal with affinities and deadline servers for
RT.

There's a bunch of issues and I think we've only got some of them solved.

The semi-partitioned thing (someone was working on that, I think you
know the guy), solves DL 'entities' having affinities.

But the problem of FIFO is that they don't have inherent bandwidth. This
in turn means that any server for FIFO needs to be minimally concurrent,
otherwise you hand out bandwidth to lower priority tasks that the higher
priority task might want etc.. (Andersson's group has papers here).

Specifically, imagine a server with U=1.5 and 3 tasks: a high prio task
that requires .8, a medium prio task that requires .6, and a low prio task
that soaks up whatever it can get its little grubby paws on.

Then with minimal concurrency this works out nicely, high gets .8, mid
gets .6 and low gets the remaining .1.

If OTOH you don't limit concurrency and let them all run concurrently,
you can end up with the situation where they each get .5. Which is
obviously fail.
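
The two outcomes above can be reproduced with a toy model. This is a
sketch under stated assumptions, not how any scheduler is implemented:
the low prio task is modelled as wanting one full CPU, and "all run
concurrently" is modelled as every runnable task draining the server
budget at the same rate. All names are hypothetical.

```python
from fractions import Fraction

FULL_CPU = Fraction(1)  # assumption: one task can use at most one CPU

def priority_order(budget, demands):
    """Minimal concurrency: hand out budget strictly by priority."""
    shares = []
    for demand in demands:  # highest priority first
        got = min(demand, budget)
        shares.append(got)
        budget -= got
    return shares

def all_concurrent(budget, demands):
    """No concurrency limit: active tasks drain the budget at equal rate."""
    shares = [Fraction(0)] * len(demands)
    active = [i for i, d in enumerate(demands) if d > 0]
    while active and budget > 0:
        # Advance until the budget is gone or some task is satisfied.
        step = min(budget / len(active),
                   min(demands[i] - shares[i] for i in active))
        for i in active:
            shares[i] += step
        budget -= step * len(active)
        active = [i for i in active if shares[i] < demands[i]]
    return shares

U = Fraction(3, 2)                                    # server with U=1.5
demands = [Fraction(4, 5), Fraction(3, 5), FULL_CPU]  # high, mid, greedy low

by_prio = priority_order(U, demands)   # 4/5, 3/5, 1/10
unlimited = all_concurrent(U, demands) # 1/2, 1/2, 1/2
```

In the second case the high and mid prio tasks both fall short of what
they need while the low prio task takes a third of the server.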

Add affinities here though and you're up a creek: how do you distribute
utilization between the slices, what slices, etc.. You say give them a
per-cpu cgroup interface, and have them configure it themselves, but
that's a god-awful thing to ask userspace to do.

Ideally, I'd delete all of FIFO, it's such a horrid trainwreck, a total
and abysmal failure of a model -- thank you POSIX :-(