Re: [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy()

From: Quentin Perret
Date: Thu Feb 25 2021 - 03:53:45 EST


On Thursday 25 Feb 2021 at 08:36:11 (+0000), vincent.donnefort@xxxxxxx wrote:
> From: Vincent Donnefort <vincent.donnefort@xxxxxxx>
>
> find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
> energy delta as follows:
>
>     feec(task)
>       for_each_pd
>         base_energy = compute_energy(task, -1, pd)
>           -> for_each_cpu(pd)
>              -> cpu_util_next(cpu, task, -1)
>
>         energy_delta = compute_energy(task, dst_cpu, pd)
>           -> for_each_cpu(pd)
>              -> cpu_util_next(cpu, task, dst_cpu)
>         energy_delta -= base_energy
>
> Then it picks the best CPU as being the one that minimizes energy_delta.
>
> cpu_util_next() estimates the CPU utilization that would happen if the
> task was placed on dst_cpu as follows:
>
>     max(cpu_util + task_util, cpu_util_est + _task_util_est)
>
> The task contribution to the energy delta can then be either:
>
>   (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
>       and _task_util_est > cpu_util.
>   (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
>
> (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
> otherwise must be small enough so that feec() takes the CPU as a
> potential target for the task placement)
>
> This is problematic for feec(), as cpu_util_next() might give an unfair
> advantage to a CPU which is mostly busy (2) compared to one which is
> mostly idle (1). Because _task_util_est is always greater than task_util
> in feec() (the task is waking up), the task's contribution to the energy
> might look smaller on certain CPUs (2), which breaks the energy
> comparison.
>
> This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
> their cpu_util < _task_util_est (1) while others will maintain cpu_util >
> _task_util_est (2).
>
> Fix this problem by always using max(task_util, _task_util_est) as a task
> contribution to the energy (ENERGY_UTIL). The new estimated CPU
> utilization for the energy would then be:
>
>     max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)
>
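
For anyone following along, the bias can be reproduced with a few lines of
standalone C. This is only a sketch of the two formulas quoted above, not
the kernel code: struct cpu_sample and the helper names are hypothetical,
and the real cpu_util_next() also handles clamping and the dst_cpu logic,
which is left out here.

```c
/* Hypothetical, simplified stand-in for the per-CPU signals named above. */
struct cpu_sample {
	unsigned long cpu_util;
	unsigned long cpu_util_est;
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* CPU utilization without the task (assumed base for the energy delta). */
static unsigned long base_util(struct cpu_sample c)
{
	return max_ul(c.cpu_util, c.cpu_util_est);
}

/* Current estimate: max(cpu_util + task_util, cpu_util_est + _task_util_est) */
static unsigned long old_estimate(struct cpu_sample c, unsigned long task_util,
				  unsigned long task_util_est)
{
	return max_ul(c.cpu_util + task_util, c.cpu_util_est + task_util_est);
}

/* Proposed: max(cpu_util, cpu_util_est) + max(task_util, _task_util_est) */
static unsigned long new_estimate(struct cpu_sample c, unsigned long task_util,
				  unsigned long task_util_est)
{
	return base_util(c) + max_ul(task_util, task_util_est);
}

/* Task contribution seen by the energy comparison, current formula. */
static unsigned long old_task_contrib(struct cpu_sample c,
				      unsigned long task_util,
				      unsigned long task_util_est)
{
	return old_estimate(c, task_util, task_util_est) - base_util(c);
}

/* Task contribution seen by the energy comparison, proposed formula. */
static unsigned long new_task_contrib(struct cpu_sample c,
				      unsigned long task_util,
				      unsigned long task_util_est)
{
	return new_estimate(c, task_util, task_util_est) - base_util(c);
}
```

With made-up numbers task_util = 100 and _task_util_est = 300, an idle CPU
(cpu_util = 0, cpu_util_est = 0) is charged 300 by the current formula,
while a busy CPU (cpu_util = 400, cpu_util_est = 50) is charged only 100.
With the proposed formula both are charged 300, so the comparison is fair.
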
> compute_energy() still needs to know which OPP would be selected if the
> task were migrated into the perf_domain (FREQUENCY_UTIL). Hence,
> cpu_util_next() is still used to estimate the maximum util within the pd.
>
> Signed-off-by: Vincent Donnefort <vincent.donnefort@xxxxxxx>

Reviewed-by: Quentin Perret <qperret@xxxxxxxxxx>

Thanks,
Quentin