Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

From: luca abeni
Date: Wed May 10 2023 - 03:08:03 EST


Hi,

after thinking more about this (and re-reading the code, and your patch
:), I try to comment again:

On Tue, 9 May 2023 23:53:28 -0400
Vineeth Remanan Pillai <vineeth@xxxxxxxxxxxxxxx> wrote:

> On Tue, May 9, 2023 at 4:54 PM luca abeni
> <luca.abeni@xxxxxxxxxxxxxxx> wrote:
>
> > > Yes, this is the approximation I was mentioning... Instead of
> > > using a division, I approximated it with a different equation
> > > using a sum.
> >
> > Sorry, ignore this comment (and the following); I misinterpreted the
> > code (and my old notes).
> >
> > I do not understand why the "max{}" doe not work well, I need to
> > double think about it.
> >
> I was thinking more about this and was doing some more digging into
> this. I was also wrong about min{}. Giving it some more thought, I
> think (U/Umax) is indeed the only equation we need and it will take
> care of caping the reclaiming at Umax.

Uhm... I am not sure about this: for 1 single task on 1 CPU, yes,
u/Umax does the right thing. But when there are multiple tasks on the
CPU, I think it can cause issues (because every task ends up trying to
execute for Umax).

The "max{}" comes from the original multi-processor GRUB algorithm:
https://arxiv.org/pdf/1512.01984.pdf (see Equation (13) - in that
equation, the part we call u_extra is included in Uinact[p])

the "1 - u_inact - u_extra" part is needed to make sure that the
real-time guarantees are not broken by the reclaiming mechanism... But
it can end up with a task trying to consume too much time on a single
CPU, hence the "u/Umax" term in the "max{}" is needed to make sure that
a task will not consume more than Umax of a CPU.

Now, if we have one single task on a CPU u/Umax will always be larger
than the other term... But when we have multiple tasks the other term
is needed too.


> The reason why it was not
> working is because of the loss of precision when we did the inverse.

I agree


> I tried replacing (delta * running_bw * bw_ratio) by
> div64_u64(delta * running_bw, Umax) and it worked as expected and
> reclaimed only till Umax with only SCHED_FLAG_RECLAIM tasks. As an
> example a task with reservation (1, 100) and RT capacity 95%, and
> delta = 4ms, we get scaled_delta as
> delta * running_bw * bw_ratio ~= .040000 (roughly)
> div64_u64(delta * running_bw, Umax) ~= .04210526 (roughly)
>
> This caused the inverse logic to consume ~99% bw, while the other
> one consumed ~95% as expected.
>
> I still could not figure out why min{} works. As you mentioned in
> the previous thread, its the loss of precision thats the culprit and
> I think we only need U/Umax if we have enough precision. This along
> with accounting for both type of tasks will be the solution.

I agree with this.


(BTW, when considering multiple tasks on multiple CPUs, another
potential problem is given by u_extra... Now that I remember all the
details, u_extra is not "Umax - this_bw" - this is true when we consider
only one CPU, but is is "Umax - sum(u_i)/m" (where "sum(u_i)" is the
sum of the bandwidths of all the SCHED_DEADLINE tasks in the root
domain, and "m" is the number of CPUs in the root domain)... So, the
reclaimable CPU time is distributed uniformly on all the CPUs and this
could create some issues. But let's see what happens after the div64
fix and the SCHED_FLAG_RECLAIM fix)


Thanks,
Luca

>
> I will look deeper into any performance issues with using div64_u64
> over multiplication and shall let you know soon.
>
> Thanks,
> Vineeth