Re: [PATCH 1/7] sched/uclamp: Fix relationship between uclamp and migration margin

From: Xuewen Yan
Date: Sun Jul 31 2022 - 22:46:29 EST


Hi Qais

On Thu, Jul 28, 2022 at 12:25 AM Qais Yousef <qais.yousef@xxxxxxx> wrote:
>
> Hi Xuewen
>
> On 07/25/22 19:59, Xuewen Yan wrote:
> > Hi Qais
> >
> > On Thu, Jul 21, 2022 at 6:24 PM Qais Yousef <qais.yousef@xxxxxxx> wrote:
> > >
> > > Hi Xuewen
> > >
> > > On 07/20/22 15:17, Xuewen Yan wrote:
> > > > Hi Qais,
> > > >
> > > > On Thu, Jun 30, 2022 at 3:47 AM Qais Yousef <qais.yousef@xxxxxxx> wrote:
> > > > >
> > > > > fits_capacity() verifies that a util is within 20% margin of the
> > > > > capacity of a CPU, which is an attempt to speed up upmigration.
> > > > >
> > > > > But when uclamp is used, this 20% margin is problematic because for
> > > > > example if a task is boosted to 1024, then it will not fit on any CPU
> > > > > according to fits_capacity() logic.
> > > > >
> > > > > Or if a task is boosted to capacity_orig_of(medium_cpu). The task will
> > > > > end up on big instead on the desired medium CPU.
> > > >
> > > > I think it is reasonable. Since the user sets uclamp_min to be greater
> > > > than 0, the user prefers that the process has better performance cpu.
> > > > If ignore the margin here, the uclamp_min is meaningless.
> > >
> > > Why is it meaningless?
> > >
> > > uclamp is a performance hint, not a bandwidth hint.
> > >
> > > That is, if the task's util_avg, which represents its bandwidth, is being
> > > impacted then it should move up.
> > >
> > > But if the task is getting the bandwidth it needs, which is again represented
> > > by its util_avg, then uclamp_min just ensure it is running at the right
> > > performance level. Performance level is orthogonal to bandwidth.
> > >
> > > As long as the medium CPU will run at max performance point, it is fine.
> >
> > This involves the meaning of uclamp, if it guarantees performance
> > rather than bandwidth, then it is fine:-)
>
> +1
>
> I do have a patch to add kernel doc to better explain what uclamp is. Hopefully
> I'll send this out soon. I've been sleeping on it for a long while but too many
> things to do, too little time :-)
Ah, Could this patch loop me in the future? I want to learn more from
you, Thanks!

>
> >
> > >
> > > >
> > > > >
> > > > > Similar corner cases exist for uclamp and usage of capacity_of().
> > > > > Slightest irq pressure on biggest CPU for example will make a 1024
> > > > > boosted task look like it can't fit.
> > > >
> > > > I think it can't fit is reasonable. The uclamp_min is limit the
> > > > util_avg, if the task can fit the cpu with capacity is 1024, which
> > > > uclamp_min is 1024, How to deal with the task which util is 1024?
> > > > Maybe your idea is that the biggest cpu can fit any task even if it's
> > > > util is 1024?
> > >
> > > util_fits_cpu() compares util_avg with capacity_of(). So if
> > >
> > > util_avg >= 0.8 * 1024
> > >
> > > then it will not fit the cpu. Regardless of what is the uclamp_min value. Only
> > > exception is if you use uclamp_max, then by design this should force it to fit
> > > even if util_avg is bigger.
> >
> > Okay, This also involves the meaning of uclamp. It represents performance. :-)
>
> +1
>
> >
> > >
> > > >
> > > > >
> > > > > What we really want is for uclamp comparisons to ignore the migration
> > > > > margin and capacity pressure, yet retain them for when checking the
> > > > > _actual_ util signal.
> > > > >
> > > > > For example, task p:
> > > > >
> > > > > p->util_avg = 300
> > > > > p->uclamp[UCLAMP_MIN] = 1024
> > > > >
> > > > > Will fit a big CPU. But
> > > > >
> > > > > p->util_avg = 900
> > > > > p->uclamp[UCLAMP_MIN] = 1024
> > > > >
> > > > > will not, this should trigger overutilized state because the big CPU is
> > > > > now *actually* being saturated.
> > > >
> > > > Now the code would catch the uclamp before judging the fits_capacity.
> > > > The two task both can not fit the cpu, why the task(300) can fit the
> > > > cpu?
> > >
> > > Because
> > >
> > > p->util_avg < 0.8 * capacity_of(big_cpu)
> > > AND
> > > p->uclamp_min <= capacity_orig_of(big_cpu)
> > >
> > > Why it shouldn't fit?
> > >
> > > Please keep in mind that uclamp is a performance hint and not a bandwidth hint.
> > > It requests for the task to run at a performance level, if we can satisfy that
> > > request, but it doesn't say that the task is actually occupies that bandwidth.
> > >
> > > By design, we want to allow multiple small tasks to be packed on a big core.
> > > For example if we have
> > >
> > > p0->util_avg = 300
> > > p0->uclamp_min = 1024
> > >
> > > p1->util_avg = 300
> > > p1->uclamp_min = 1024
> > >
> > > Then by design we would like to enable both of these tasks to run on big cores.
> > >
> > > Their combined bandwidth is 600, which is well below the available bandwidth.
> > > And uclamp_min = 1024 just means these task must run at highest frequency on
> > > the biggest cpu.
> > >
> > > feec() will actually take care of deciding whether to pack or spread within
> > > the big cpu 'cluster'. util_fits_cpu() role is merely to indicate whether this
> > > cpu is a viable option or not.
> > >
> > > Taking any pressure into account will mean any hint to 1024 will almost always
> > > fail because in the common case there's always some form of pressure on a CPU.
> > > So even if capacity_of() is 1023, this will make p0 and p1 to trigger
> > > overutilized state. Which is plain wrong. The tasks are actually small, and the
> > > fact that uclamp_min is 1024 is a simple request to *attempt* to run it at max
> > > performance point, which is the biggest core and highest frequency. None of
> > > these has any correlation to rt/irq pressures.
> >
> > Okay, Thanks for the the very detailed explanation, I will re-review
> > this patch from a different angle:-)
> > Cheers~
>
> Glad that was readable! :-)
>
> >
> > >
> > > >
> > > > >
> > > > > Similar reasoning applies to capping tasks with UCLAMP_MAX. For example:
> > > > >
> > > > > p->util_avg = 1024
> > > > > p->uclamp[UCLAMP_MAX] = capacity_orig_of(medium_cpu)
> > > > >
> > > > > Should fit the task on medium cpus without triggering overutilized
> > > > > state.
> > > >
> > > > I fully agree with this! But there is a problem, How to do when there
> > > > is RT pressure or irq pressure?
> > > > Maybe it is better to compare the uclamp_max with the capacity_of(cpu)
> > > > instead of the capacity_origin?
> > >
> > > No. This IS the problem I am trying to fix with this series. UCLAMP_MAX limits
> > > the performance level the task can obtain.
> > >
> > > The fact that there's RT or irq pressure doesn't prevent this task from being
> > > capped to that performance level.
> > >
> > > Beside this will break the ability to use uclamp as a weak affinity.
> > >
> > > Setting uclamp_max to capacity_orig_of(little_cpu), as one would do for
> > > background tasks for instance, will enable EAS to consider the little cores as
> > > a viable candidate and select it if it is the most energy efficient CPU.
> > > Which is an intended design use case.
> > >
> > > If we start failing to do this randomly because of spurious RT and irq
> > > pressure, the benefit of the hint will be significantly reduced.
> > > And then it *will* become meaningless.
> >
> > I agree with you, but I'm still a bit concerned that such a setup will
> > cause performance issues.
> > As you say, may one want the background tasks running on the little
> > cpus, he can use cpuset to control them completely.
>
> We are actually hoping that we can enable using uclamp_max as weak affinity
> instead of the aggressive cpusets. But there's still a bit more work to do
> before we can get there.
>
> > When there are many processes in the system, if such processes always
> > fit small cores, do we need to consider more when load balancing?
>
> Oh, you're worried about packing these tasks on small cores?
>
> We've looked at that, and this should be hard to happen.
>
> EAS will always distribute tasks on max_spare_capacity cpu in the performance
> domain. Only exception I'm aware of is if a lot of tasks wake up at the same
> time. Then there's a chance (race) they all see the same max_spare capacity
> before any of these tasks gets enqueue to adjust the rq->util_avg.
>
> Packing can't happen outside of EAS AFAICT. The default behavior of the
> scheduler is to distribute tasks on idle cpus or based on load.
>
> If we're in overutilized, then select_idle_capacity() should consider the idle
> cpus only. And in load balance in general should distribute tasks based on
> idle/load.

Yes, you're right, I'm thinking a little bit less...Thanks!

>
> Keep in mind from EAS PoV, util_fits_cpu() just says this is a viable
> candidate. The actual selection has to satisfy other conditions in feec(). One
> of them is that this candidate is max_spare_capacity - which effectively
> distributes within a performance domain.
>
> I'd expect us to start spilling to medium cores because they'd become more
> energy efficient than the little cores at some point when they're all
> overloaded.
>
> Maybe you had a different scenario in mind. If yes, can you explain it more
> details please?
>
> >
> > >
> > > >
> > > > >
> > > > > Inlined comments expand more on desired behavior in more scenarios.
> > > > >
> > > > > Introduce new util_fits_cpu() function which encapsulates the new logic.
> > > > > The new function is not used anywhere yet, but will be used to update
> > > > > various users of fits_capacity() in later patches.
> > > > >
> > > > > Fixes: af24bde8df202 ("sched/uclamp: Add uclamp support to energy_compute()")
> > > > > Signed-off-by: Qais Yousef <qais.yousef@xxxxxxx>
> > > > > ---
> > > > > kernel/sched/fair.c | 114 ++++++++++++++++++++++++++++++++++++++++++++
> > > > > 1 file changed, 114 insertions(+)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index f80ae86bb404..5eecae32a0f6 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -4203,6 +4203,120 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> > > > > trace_sched_util_est_se_tp(&p->se);
> > > > > }
> > > > >
> > > > > +static inline int util_fits_cpu(unsigned long util,
> > > > > + unsigned long uclamp_min,
> > > > > + unsigned long uclamp_max,
> > > > > + int cpu)
> > > > > +{
> > > >
> > > > May the function name is not proper when the uclamp is unused.
> > >
> > > Are you suggesting to rename it? What name do you have in mind?
> > > I think this is a suitable name, but open for suggestions :-)
> >
> > Okay:-)
> >
> > >
> > > >
> > > > > + unsigned long capacity = capacity_of(cpu);
> > > > > + unsigned long capacity_orig;
> > > > > + bool fits, max_capacity;
> > > > > + bool uclamp_max_fits;
> > > > > +
> > > > > + /*
> > > > > + * Check if the real util fits without any uclamp boost/cap applied.
> > > > > + */
> > > > > + fits = fits_capacity(util, capacity);
> > > > > +
> > > > > + if (!uclamp_is_used())
> > > > > + return fits;
> > > > > +
> > > > > + /*
> > > > > + * We must use capacity_orig_of() for comparing against uclamp_min and
> > > > > + * uclamp_max. We only care about capacity pressure (by using
> > > > > + * capacity_of()) for comparing against the real util.
> > > > > + *
> > > > > + * If a task is boosted to 1024 for example, we don't want a tiny
> > > > > + * pressure to skew the check whether it fits a CPU or not.
> > > > > + *
> > > > > + * Similarly if a task is capped to capacity_orig_of(little_cpu), it
> > > > > + * should fit a little cpu even if there's some pressure.
> > > > > + *
> > > > > + * Known limitation is when thermal pressure is severe to the point
> > > > > + * where we have capacity inversion. We don't cater for that as the
> > > > > + * system performance will already be impacted severely.
> > > > > + */
> > > > > + capacity_orig = capacity_orig_of(cpu);
> > > > > +
> > > > > + /*
> > > > > + * We want to force a task to fit a cpu as implied by uclamp_max.
> > > > > + * But we do have some corner cases to cater for..
> > > > > + *
> > > > > + *
> > > > > + * C=z
> > > > > + * | ___
> > > > > + * | C=y | |
> > > > > + * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
> > > > > + * | C=x | | | |
> > > > > + * | ___ | | | |
> > > > > + * | | | | | | | (util somewhere in this region)
> > > > > + * | | | | | | |
> > > > > + * | | | | | | |
> > > > > + * +----------------------------------------
> > > > > + * cpu0 cpu1 cpu2
> > > > > + *
> > > > > + * In the above example if a task is capped to a specific performance
> > > > > + * point, y, then when:
> > > > > + *
> > > > > + * * util = 80% of x then it does not fit on cpu0 and should migrate
> > > > > + * to cpu1
> > > > > + * * util = 80% of y then it is forced to fit on cpu1 to honour
> > > > > + * uclamp_max request.
> > > > > + *
> > > > > + * which is what we're enforcing here. A task always fits if
> > > > > + * uclamp_max <= capacity_orig. But when uclamp_max > capacity_orig,
> > > > > + * the normal upmigration rules should withhold still.
> > > > > + *
> > > > > + * Only exception is when we are on max capacity, then we need to be
> > > > > + * careful not to block overutilized state. This is so because:
> > > > > + *
> > > > > + * 1. There's no concept of capping at max_capacity! We can't go
> > > > > + * beyond this performance level anyway.
> > > > > + * 2. The system is being saturated when we're operating near
> > > > > + * max_capacity, it doesn't make sense to block overutilized.
> > > > > + */
> > > > > + max_capacity = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> > > > > + uclamp_max_fits = !max_capacity && (uclamp_max <= capacity_orig);
> > > > > + fits = fits || uclamp_max_fits;
> > > >
> > > > As I said above, Using the capacity_orig may ignore the rt/irq pressure.
> > > > If we have two or more middle cpus, we can select the cpu whose rt/irq
> > > > pressure is smaller.
> > > > If using the capacity_orig, the first MID cpu is always the candidate.
> > >
> > > I hope my explanation above addressed that too. rt/irq has no impact on the
> > > task's ability to achieve the required performance level from uclamp hint PoV.
> > > We still use util_avg to compare with rt/irq pressure as usual. so if rt/irq
> > > pose any issue to the task's ability to obtain the required bandwidth that will
> > > be taken into account. But if util_avg is happy with that level of rt/irq
> > > pressure, then uclamp only cares about being able to achieve the performance
> > > level on that cpu, which doesn't care about rt/irq pressure.
> > >
> > > >
> > > > > +
> > > > > + /*
> > > > > + *
> > > > > + * C=z
> > > > > + * | ___ (region a, capped, util >= uclamp_max)
> > > > > + * | C=y | |
> > > > > + * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
> > > > > + * | C=x | | | |
> > > > > + * | ___ | | | | (region b, uclamp_min <= util <= uclamp_max)
> > > > > + * |_ _ _|_ _|_ _ _ _| _ | _ _ _| _ | _ _ _ _ _ uclamp_min
> > > > > + * | | | | | | |
> > > > > + * | | | | | | | (region c, boosted, util < uclamp_min)
> > > > > + * +----------------------------------------
> > > > > + * cpu0 cpu1 cpu2
> > > > > + *
> > > > > + * a) If util > uclamp_max, then we're capped, we don't care about
> > > > > + * actual fitness value here. We only care if uclamp_max fits
> > > > > + * capacity without taking margin/pressure into account.
> > > > > + * See comment above.
> > > > > + *
> > > > > + * b) If uclamp_min <= util <= uclamp_max, then the normal
> > > > > + * fits_capacity() rules apply. Except we need to ensure that we
> > > > > + * enforce we remain within uclamp_max, see comment above.
> > > > > + *
> > > > > + * c) If util < uclamp_min, then we are boosted. Same as (b) but we
> > > > > + * need to take into account the boosted value fits the CPU without
> > > > > + * taking margin/pressure into account.
> > > > > + *
> > > > > + * Cases (a) and (b) are handled in the 'fits' variable already. We
> > > > > + * just need to consider an extra check for case (c) after ensuring we
> > > > > + * handle the case uclamp_min > uclamp_max.
> > > > > + */
> > > > > + uclamp_min = min(uclamp_min, uclamp_max);
> > > > > + if (util < uclamp_min)
> > > > > + fits = fits && (uclamp_min <= capacity_orig);
> > > >
> > > > As said above, I think the uclamp_min should consider the margin.
> > >
> > > Addressed above ;-)
> >
> > Okay, I would revisit the patch:-)
>
> Thanks!!
>
>
> Cheers
>
> --
> Qais Yousef