Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0

From: Qais Yousef
Date: Fri Jun 30 2023 - 07:51:49 EST


On 06/07/23 15:52, Hongyan Xia wrote:
> Hi Qais,
>
> On 2023-02-11 17:50, Qais Yousef wrote:
> > [...]
> > >
> > > So EAS keeps packing on the cheaper PD/clamped OPP.
> >
> > Which is the desired behavior for uclamp_max?
> >
> > The only issue I see is that we want to distribute within a pd. Which is
> > something I was going to work on and send later - but can lump it in this
> > series if it helps.
>
> I more or less share the same concern as Dietmar, which is packing things
> on the same small CPU when every CPU has a spare cpu_cap of 0.
>
> I wonder if this could be useful: alongside cfs_rq->avg.util_avg, we
> have a cfs_rq->avg.util_avg_uclamp_max. It keeps track of util_avg, but
> each task on the rq is capped at its uclamp_max value, so even if there are
> two always-running tasks with uclamp_max values of 100 and no idle time,
> the cfs_rq only sees cpu_util() of 200 and still has a remaining capacity of
> 1024 - 200, not 0. This also helps balance the load when rqs have no idle
> time. If two CPUs both have no idle time, but one is running a single
> task clamped at 100 and the other is running two such tasks, the first sees
> a remaining capacity of 1024 - 100 while the second sees 1024 - 200, so we
> still prefer the first one.

If I understood correctly, you're suggesting we do accounting of the sum of
uclamp_max (i.e. each task's util contribution capped at its uclamp_max) for
all the enqueued tasks?
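
Something along these lines? Only a sketch to check my understanding - the
field (util_avg_uclamp_max) and the helper name below are made up:

	static void util_avg_uclamp_max_enqueue(struct cfs_rq *cfs_rq,
						struct task_struct *p)
	{
		/*
		 * Cap each task's contribution by its effective uclamp_max
		 * so the rq-level sum saturates at the clamp rather than
		 * at 1024.
		 */
		unsigned long capped = min(task_util(p),
					   uclamp_eff_value(p, UCLAMP_MAX));

		/* Made-up field mirroring cfs_rq->avg.util_avg. */
		cfs_rq->avg.util_avg_uclamp_max += capped;
	}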

I think we discussed this in the past. I can't remember the details now, but
adding additional accounting seemed undesirable.

And I had an issue with treating uclamp_max as a bandwidth hint rather than
a performance requirement hint. Limiting a task to 200 means it can't run
faster than this, but it doesn't mean it is not allowed to consume more
bandwidth than 200. For example, an always-running task clamped at 200 is
still entitled to 100% of the CPU time, just at a capped performance level.
Nice values and the cfs bandwidth controller should be used for limiting
bandwidth.

> And I wonder if this could also help with calculating energy when there's no
> idle time under uclamp_max. Instead of seeing a util_avg of 1024, we actually
> see a lower value. This is also what cpu_util_next() does in Android's sum
> aggregation, but I'm thinking of maintaining it right beside util_avg so
> that we don't have to sum up everything every time.

To be honest, I haven't thought about how to improve the EM calculations; I
see this as a secondary problem compared to the other issue we need to fix
first.

It seems load_avg can grow unboundedly. Can you look at using this signal to
distribute within a cluster, and as a hint that we might be better off
spilling to other CPUs if they're already running at a perf level <=
uclamp_max?
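
Roughly what I have in mind (hand-wavy and untested; the helper name is made
up) - when every candidate in the PD has 0 spare capacity but still fits
uclamp_max, use load_avg, which keeps growing under saturation, as the tie
breaker:

	/* Prefer the less loaded of two saturated, uclamp_max-capped rqs. */
	static bool uclamp_prefer_rq(struct rq *a, struct rq *b)
	{
		return a->cfs.avg.load_avg < b->cfs.avg.load_avg;
	}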


Thanks

--
Qais Yousef