Re: [PATCH v5 7/7] sched/fair: Fair server interface

From: Daniel Bristot de Oliveira
Date: Tue Nov 07 2023 - 09:07:05 EST


On 11/7/23 09:16, Peter Zijlstra wrote:
> On Mon, Nov 06, 2023 at 05:29:49PM +0100, Daniel Bristot de Oliveira wrote:
>> On 11/6/23 16:40, Peter Zijlstra wrote:
>>> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>>>> Add an interface for fair server setup on debugfs.
>>>>
>>>> Each rq has three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>>
>>>> - fair_server_runtime: set runtime in ns
>>>> - fair_server_period: set period in ns
>>>> - fair_server_defer: on/off for the defer mechanism
>>>>
>>>
>>> This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
>>> total available bandwidth control, right?
>>
>> right, but thinking aloud... given that the per-cpu files already allocate the
>> bandwidth on the dl_rq, the spare time for the fair scheduler is guaranteed.
>>
>> Still, we can keep them there as a safeguard against overloading the deadline
>> scheduler... (thinking aloud 2) as long as the global limit is a thing... as we get
>> away from it, that global limitation will make less sense - still better to have a
>> form of limitation so people are aware of the available bandwidth.
>
> Yeah, so having a limit on the deadline thing seems prudent as a way to
> model system overhead. I mean 100% sounds nice, but then all the models
> also assume no interrupts, no scheduler or migration overhead etc.. So
> setting a slightly lower max seems far more realistic to me.
>
> That said, the period/bandwidth thing is now slightly odd, as we really
> only care about the utilization. But whatever. One thing at a time.

Yep, that is why I am mentioning the generalization as a second phase; it is
a harder problem... But getting the rt throttling out of the default path is
already a good step.
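
To make the per-rq files quoted above concrete, here is a minimal userspace
sketch (not part of the patch; I am assuming the per-CPU directory is named
cpu0 for CPU 0 and that the files take plain decimal values; the 50ms/1s
numbers are just an example, not the defaults):

	/* toy example: set the fair server knobs for CPU 0 */
	#include <stdio.h>
	#include <stdlib.h>

	static void write_knob(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			exit(1);
		}
		fprintf(f, "%s\n", val);
		fclose(f);
	}

	int main(void)
	{
		const char *base = "/sys/kernel/debug/sched/rq/cpu0";
		char path[256];

		/* runtime and period are in nanoseconds */
		snprintf(path, sizeof(path), "%s/fair_server_period", base);
		write_knob(path, "1000000000");		/* 1s period */

		snprintf(path, sizeof(path), "%s/fair_server_runtime", base);
		write_knob(path, "50000000");		/* 50ms runtime -> 5% */

		snprintf(path, sizeof(path), "%s/fair_server_defer", base);
		write_knob(path, "1");			/* enable defer mode */

		return 0;
	}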

>
>>> But then shouldn't we also rip out the throttle thingy right quick?
>>>
>>
>> I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
>> for now, because GROUP_SCHED depends on it, no?
>
> Yes. Until we can delete all that code we'll have to keep some of that.
>
>> The next step is moving to the dl server as the base for hierarchical
>> scheduling... That will rip out CONFIG_RT_GROUP_SCHED... replacing it
>> with something that has a per-cpu interface.
>>
>> Does it make sense?
>
> I'm still not sure how to deal with affinities and deadline servers for
> RT.
>
> There's a bunch of issues and I think we've only got some of them solved.
>
> The semi-partitioned thing (someone was working on that, I think you
> know the guy) solves DL 'entities' having affinities.

Yep, then having arbitrary affinities is another step towards more flexible models...

> But the problem with FIFO tasks is that they don't have inherent bandwidth. This
> in turn means that any server for FIFO needs to be minimally concurrent,
> otherwise you hand out bandwidth to lower priority tasks that the higher
> priority task might want etc.. (Andersson's group has papers here).
>
> Specifically, imagine a server with U=1.5 and 3 tasks: a high prio task
> that requires .8, a medium prio task that requires .6, and a low prio task
> that soaks up whatever it can get its little grubby paws on.
>
> Then with minimal concurrency this works out nicely, high gets .8, mid
> gets .6 and low gets the remaining .1.
>
> If OTOH you don't limit concurrency and let them all run concurrently,
> you can end up with the situation where they each get .5. Which is
> obviously fail.
>
> Add affinities here though and you're up a creek, how do you distribute
> utilization between the slices, what slices, etc.. You say give them a
> per-cpu cgroup interface, and have them configure it themselves, but
> that's a god-awful thing to ask userspace to do.

and yep again... It is definitely a harder topic... but it gets simpler as we do
those other moves...
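
Just to spell out the arithmetic in your example, a toy sketch (mine, purely
illustrative) of the two outcomes for the U=1.5 server with the .8/.6/greedy
tasks:

	#include <stdio.h>

	int main(void)
	{
		double server_u = 1.5;
		double want[3] = { 0.8, 0.6, 10.0 };	/* high, mid, low (greedy) */
		double got[3];
		double left = server_u;
		int i;

		/* minimal concurrency: hand out budget in strict priority order */
		for (i = 0; i < 3; i++) {
			got[i] = want[i] < left ? want[i] : left;
			left -= got[i];
		}
		printf("priority order: high=%.1f mid=%.1f low=%.1f\n",
		       got[0], got[1], got[2]);		/* 0.8 0.6 0.1 */

		/*
		 * unconstrained concurrency, worst case you describe: three
		 * CPUs each run one task and the 1.5 budget is split evenly,
		 * so the high prio task loses .3 of what it needs.
		 */
		printf("fully concurrent: each=%.1f\n", server_u / 3);

		return 0;
	}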

> Ideally, I'd delete all of FIFO, it's such a horrid trainwreck, a total
> and abysmal failure of a model -- thank you POSIX :-(

-- Daniel