Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.

From: Imran Khan
Date: Fri Dec 15 2023 - 05:00:45 EST


Hello Vincent,
Thanks a lot for having a look and getting back.

On 15/12/2023 7:11 pm, Vincent Guittot wrote:
> On Fri, 15 Dec 2023 at 06:27, Imran Khan <imran.f.khan@xxxxxxxxxx> wrote:
>>
>> It has been found that sometimes a task_group has some residual
>> load_avg even though the load average at each of its owned queues,
>> i.e. task_group.cfs_rq[cpu].avg.load_avg and
>> task_group.cfs_rq[cpu].tg_load_avg_contrib, has been 0 for a long time.
>> In this scenario, if another task starts running in this task_group,
>> it does not get its proper share of CPU time, since the pre-existing
>> load average of the task group inversely impacts the new task's CPU
>> share on each CPU.
>>
>> This change looks for the condition when a task_group has no running
>> tasks and sets the task_group's load average to 0 in such cases, so
>> that tasks that run in the future under this task_group get CPU time
>> in accordance with the current load.
>>
>> Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
>> ---
>>
>
> [...]
>
>>
>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
>> host side.
>
> Could it be the root cause of your problem ?
>
> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
> then unplugged has not been correctly removed from tg->load_avg. If
> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
> tg->load_avg should be 0 too.
>
Agreed, and this was my understanding as well. The issue only happens
with a large number of CPUs. For example, if I go from 4 to 8 CPUs and
back to 4, the issue does not happen, or if it does, the residual
load_avg is very small.
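
To make the arithmetic concrete, here is a rough user-space model of the
accounting, not kernel code. All names (model_*, MODEL_*) and the load
values (10 per CPU, 1024 for the new task) are made up for illustration,
and the share formula is only a simplified approximation of what
calc_group_shares() does, without the MIN_SHARES lower clamp. It assumes
the contributions of the unplugged CPUs are never folded out of
tg->load_avg.

```c
#define MODEL_NR_CPUS   124 /* CPUs online after scale up */
#define MODEL_KEPT_CPUS 4   /* CPUs left after scale down */

struct cfs_rq_model {
	long load_avg;            /* stands in for cfs_rq->avg.load_avg */
	long tg_load_avg_contrib; /* last value folded into tg->load_avg */
};

struct tg_model {
	long load_avg; /* sum of the per-CPU contributions */
	long shares;   /* group weight, e.g. 1024 */
	struct cfs_rq_model cfs_rq[MODEL_NR_CPUS];
};

/* Fold the change in one CPU's load into the group-wide sum. */
static void model_update_tg_load_avg(struct tg_model *tg, int cpu)
{
	struct cfs_rq_model *rq = &tg->cfs_rq[cpu];
	long delta = rq->load_avg - rq->tg_load_avg_contrib;

	tg->load_avg += delta;
	rq->tg_load_avg_contrib = rq->load_avg;
}

/* Simplified share: shares * local load / group-wide load. */
static long model_group_shares(const struct tg_model *tg, int cpu, long load)
{
	long tg_weight = tg->load_avg - tg->cfs_rq[cpu].tg_load_avg_contrib + load;
	long shares = tg->shares * load;

	if (tg_weight > 0)
		shares /= tg_weight;
	if (shares > tg->shares)
		shares = tg->shares; /* upper clamp only in this model */
	return shares;
}

/*
 * Scale up to 124 CPUs, let the load decay to 0, scale down to 4 CPUs,
 * then start one task with load 1024 on CPU 0 and return its group share.
 * If remove_contrib is 0, the unplugged CPUs never get their final update,
 * so their old contribution stays in tg->load_avg.
 */
long model_share_after_hotplug(int remove_contrib, long *residual)
{
	struct tg_model tg = { .shares = 1024 };
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		tg.cfs_rq[cpu].load_avg = 10;
		model_update_tg_load_avg(&tg, cpu);
	}

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		tg.cfs_rq[cpu].load_avg = 0;
		if (cpu < MODEL_KEPT_CPUS || remove_contrib)
			model_update_tg_load_avg(&tg, cpu);
	}

	if (residual)
		*residual = tg.load_avg; /* stale load left in the group */

	return model_group_shares(&tg, 0, 1024);
}
```

With these numbers, model_share_after_hotplug(0, &residual) leaves a
residual of 1200 and gives the new task a share of 471 out of 1024,
while model_share_after_hotplug(1, &residual) leaves 0 and gives the
full 1024. With only 4 extra CPUs the residual in this model would be
40, which would match the 4 -> 8 -> 4 case being barely noticeable.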

> Could you track that the cfs_rq->tg_load_avg_contrib is correctly
> removed from tg->load_avg when you unplug the CPUs ? I can easily
> imagine that the rate limit can skip some update of tg->load_avg
> while offlining the cpu
>

I will try to trace it, but just so you know, this issue also happens on
other kernel versions that don't have the rate limit feature. I started
with v4.14.x, but have tested and found it on v5.4.x and v5.15.x as well.

Thanks,
Imran