Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

From: Vincent Guittot
Date: Wed May 03 2017 - 03:26:18 EST


On 2 May 2017 at 22:56, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello, Vincent.
>
> On Tue, May 02, 2017 at 08:56:52AM +0200, Vincent Guittot wrote:
>> On 28 April 2017 at 18:14, Tejun Heo <tj@xxxxxxxxxx> wrote:
>> > I'll follow up in the other subthread but there really is fundamental
>> > difference in how we calculate runnable_avg w/ and w/o cgroups.
>> > Indepndent of whether we can improve the load balancer further, it is
>> > an existing bug.
>>
>> I'd like to weight that a bit.
>> The runnable_load_avg works correctly as it is because it reflects
>> correctly the load of runnable entity at root domain
>> If you start to propagate the runnable_load_avg on the load_avg of the
>> group entity, the load will become unstable.
>> runnable_load_avg has been added to fix load_balance being unable to
>> select the right busiest rq. So the goal is to use more and more
>> load_avg not the opposite
>
> I have a hard time understanding what you're trying to say here.
>
> Without cgroup, the load balancer uses the sum of load_avgs of the
> running tasks on the queue. As shown by the debug trace, the load
> balancer repeatedly ends up picking the wrong CPU when cgroup is
> involved because it ends up including the load_avgs of nested blocked
> tasks into runnable_load_avg of root - we lose the distinction between
> running and blocked load_avgs when we pass through a nested cfs_rq.
>
> We can further improve the load balancer all we want, for example,
> right now, we would end up picking a CPU with one task which has a
> really high weight over another CPU with two normal weight tasks even,
> which isn't ideal; however, there is something obviously broken in the
> existing mechanism and we want to fix that first independent of
> further improvements, and it won't be a good idea to paper over an
> existing problem with a different mechanism either.

That's probably where I disagree.
IMO, runnable_load_avg is already a fix to compensate the fact when
PELT has been rewritten, the load balance has not been update
accordingly and was unable to make the right choice
between 2 groups: one group having only 1 task but a higher load and
another group with several tasks but a lower load (either because of
the blocked load or because a runnable task with far higher load than
others). runnable_load_avg has been put back to quickly fix the 1st
case but it doesn't fix the other. There were also some issues of
propagating load update with task migration but this has been fixed
now.
So we have 2 use case with the exact same behavior a group with a
higher load (runnable_load_avg or load_avg) but only 1 task and
another one with several tasks but a lower load. At now
runnable_load_avg can only fix one with the impact of breaking
load_avg behavior whereas one fix in the load_balance can fix both
withour breaking load_avg behavior.
We should better work on removing runnable_load_avg completely than
adding more stuff on it

>
>> >> I always let time between 2 consecutive run and the 10 consecutive
>> >> runs take around 2min to execute
>> >>
>> >> Then I have run several time these 10 consecutive test and results stay the same
>> >
>> > Can you please try the patch at the end of this message? I'm
>> > wondering whether you're seeing the errors accumulating from the wrong
>> > min().
>>
>> I still have the regression with the patch below.
>> The regression comes from the use runnable_load_avg to propagate. As
>> load_avg becomes null at some point, it break the computation of share
>> and the load_avg stay very low
>
> That's surprising given that what the patch does is bringing the
> cgroup behavior closer to !cgroup behavior. It'd be great to be able

But it's also break load_avg which is used to calculate per cfs_rq
share and this change generates the regression on my board

> to reproduce the problem and trace it. It looks like the board is
> pretty standardized. Would the following be equivalent to the one you
> have?
>
> http://a.co/f3dD1lm

Yes it's this board i have the previous version but it should not
change the behavior
>
> If so, I can just buy it, get your test image and repro it here and
> trace why the regression is happening with the setup. We might be
> hitting something else.

I have pushed my branch which includes v4.11-rc8 + your patches + some
debug trace here:
https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=test-tejun-patches

and uploaded a trace_cmd's trace that shows the regression
http://people.linaro.org/~vincent.guittot/trace-tejun-patches

As an example, we can easily see arounf time stamp 94.826469 sec that
cpu5 is idle while CPU2 and CPU3 share their time between several
task.
This doesn't happens without your patch

>
> Thanks.
>
> --
> tejun