Re: [PATCH 1/2] sched/deadline: accurate reclaim bandwidth for GRUB

From: Vineeth Remanan Pillai
Date: Wed May 10 2023 - 11:50:23 EST


Hi Luca,

On Wed, May 10, 2023 at 3:07 AM luca abeni <luca.abeni@xxxxxxxxxxxxxxx> wrote:
>
> > I was thinking more about this and was doing some more digging into
> > this. I was also wrong about min{}. Giving it some more thought, I
> > think (U/Umax) is indeed the only equation we need and it will take
> > care of capping the reclaiming at Umax.
>
> Uhm... I am not sure about this: for 1 single task on 1 CPU, yes,
> u/Umax does the right thing. But when there are multiple tasks on the
> CPU, I think it can cause issues (because every task ends up trying to
> execute for Umax).
>
Ah ok, I understand now. I was going with the single-CPU equation
U/Umax, where this would not have been an issue since U (running_bw)
takes care of the adjustment and does not allow a single task to use
the whole of Umax. I read the SMP GRUB paper you shared and see that
it uses u_i/Umax, which might cause issues as you mentioned.
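
In other words, what I had in mind was a scaling of Uact/Umax with
Uact = running_bw: on a single CPU that self-limits the total
consumption, because the active tasks together can use at most
sum(u_i) / (Uact/Umax) = Umax. The paper's per-task u_i/Umax does not
have that property.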

> The "max{}" comes from the original multi-processor GRUB algorithm:
> https://arxiv.org/pdf/1512.01984.pdf (see Equation (13) - in that
> equation, the part we call u_extra is included in Uinact[p])
>
Thanks for sharing. I read the paper and got an overall idea and
background of the existing implementation.
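
(Writing it down mostly to keep my own notes straight: the rule
documented in grub_reclaim() today is

    dq = -max{ u/Umax, (1 - Uinact - Uextra) } dt

so, if I read it right, Equation (13) is the same rule with the part
we call u_extra folded into the paper's Uinact[p], as you said.)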

> the "1 - u_inact - u_extra" part is needed to make sure that the
> real-time guarantees are not broken by the reclaiming mechanism... But
> it can end up with a task trying to consume too much time on a single
> CPU, hence the "u/Umax" term in the "max{}" is needed to make sure that
> a task will not consume more than Umax of a CPU.
>
> Now, if we have one single task on a CPU u/Umax will always be larger
> than the other term... But when we have multiple tasks the other term
> is needed too.
>
Understood, thanks for explaining.
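
A quick numeric check helped it sink in (single CPU, numbers picked
arbitrarily): take Umax = 0.95 and two tasks with u1 = u2 = 0.3. If
each task were scaled only by u_i/Umax ~= 0.32, each of them could run
for u_i / (u_i/Umax) = Umax, i.e. ~95% of the CPU, so together they
would try to consume ~190%. With the max{} and the corrected scaling
we have been discussing, when both tasks are active Uinact = 0 and
Uextra = Umax - (u1 + u2) = 0.35, so the factor becomes
max{u_i, Umax - Uinact - Uextra}/Umax = 0.6/0.95 ~= 0.63 and each task
can consume at most 0.3/0.63 ~= 0.475 of the CPU; the two together
land right at Umax.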

> (BTW, when considering multiple tasks on multiple CPUs, another
> potential problem is given by u_extra... Now that I remember all the
> details, u_extra is not "Umax - this_bw" - this is true when we consider
> only one CPU, but it is "Umax - sum(u_i)/m" (where "sum(u_i)" is the
> sum of the bandwidths of all the SCHED_DEADLINE tasks in the root
> domain, and "m" is the number of CPUs in the root domain)... So, the
> reclaimable CPU time is distributed uniformly on all the CPUs and this
> could create some issues. But let's see what happens after the div64
> fix and the SCHED_FLAG_RECLAIM fix)
>
This makes sense. This also means that we wouldn't be able to replace
"Uextra + Uinact" with "Umax - running_bw", which explains the
problems I was seeing with SMP testing. So I shall revert to
"Uextra + Uinact" in v2. And I think the potential issue with Uextra
would be avoided by the check for Uextra + Uinact > Umax, which makes
sure that we don't reclaim more than Umax on a single CPU.
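
Roughly, the clamp I have in mind looks like the following userspace
sketch (plain doubles rather than the kernel's fixed-point fields, and
all names here are mine, just for illustration); in the sketch I
compare "Uinact + Uextra" against "Umax - u" so that the result is
exactly the max{} in the reclaim equation:

#include <stdio.h>

/*
 * Toy model of the reclaiming factor:
 *
 *     dq = -(max{ u, Umax - Uinact - Uextra } / Umax) dt
 *
 * u       - bandwidth of the task that is reclaiming
 * u_inact - inactive utilization of the rq (this_bw - running_bw)
 * u_extra - extra bandwidth of the rq (Umax - this_bw on a single CPU)
 * u_max   - maximum reclaimable utilization (e.g. 0.95)
 */
static double grub_scale(double u, double u_inact, double u_extra,
			 double u_max)
{
	double u_act;

	/*
	 * If Uinact + Uextra exceeds Umax - u, the subtraction below
	 * would drop below u (or go negative), so fall back to the
	 * task's own bandwidth: the factor never goes below u/Umax and
	 * the task never reclaims more than Umax of one CPU.
	 */
	if (u_inact + u_extra > u_max - u)
		u_act = u;
	else
		u_act = u_max - u_inact - u_extra;

	return u_act / u_max;
}

int main(void)
{
	/* u = 0.3 task alone on a CPU with Umax = 0.95 ... */
	printf("alone:    %.3f\n", grub_scale(0.3, 0.0, 0.65, 0.95));
	/* ... and the same task with a second 0.3 task also active. */
	printf("2 active: %.3f\n", grub_scale(0.3, 0.0, 0.35, 0.95));
	return 0;
}

When the task is alone, Uextra is Umax - u and the factor degenerates
to u/Umax (~0.316), so a single task can still reclaim up to the full
Umax; with the second task active it becomes ~0.632, matching the
numbers above.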

I have tested the patch with SMP using the stressor mentioned in the
commit message and running cyclicdeadline in parallel. The results
are similar to upstream, and GRUB is able to reclaim up to Umax now.

I shall send the v2 soon after a bit more testing. Thanks a lot for
all the valuable inputs and the detailed explanation :-)

Thanks,
Vineeth