Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

From: Vincent Guittot
Date: Wed Oct 19 2016 - 13:43:38 EST


On 19 October 2016 at 15:30, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
> On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>> > > On 18 October 2016 at 11:07, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> > > > So aside from funny BIOSes, this should also show up when creating
>> > > > cgroups when you have offlined a few CPUs, which is far more common I'd
>> > > > think.
>> > >
>> > > The problem is also that the load of the tg->se[cpu] that represents
>> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
>> > >   alloc_fair_sched_group
>> > >     for_each_possible_cpu(i) {
>> > >       init_entity_runnable_average(se);
>> > >         sa->load_avg = scale_load_down(se->load.weight);
>> > >
>> > > Initializing sa->load_avg to 1024 for a newly created task makes
>> > > sense, as we don't yet know what its real load will be, but I'm not
>> > > sure that we have to do the same for an se that represents a task
>> > > group. This load should be initialized to 0, and it will increase
>> > > when tasks are moved/attached into the task group.
>> >
>> > Yes, I think that makes sense, not sure how horrible that is with the
>>
>> That should not be that bad because this initial value is only useful for
>> the few dozens of ms that follow the creation of the task group
>
> IMHO, it doesn't make much sense to initialize empty containers, which
> group sched_entities really are, to 1024. The load is meant to represent
> what is in the container, and at creation it is empty, so in my opinion
> initializing it to zero makes sense.
>
>> > current state of things, but after your propagate patch, that
>> > reinstates the interactivity hack that should work for sure.
>
> It actually works on mainline/tip as well.
>
> As I see it, the fundamental problem is keeping group entities up to
> date. Because the load_weight, and hence se->avg.load_avg, of each per-cpu
> group sched_entity depends on the group cfs_rq->tg_load_avg_contrib of
> all cpus (tg->load_avg), including those that might be empty and
> therefore not enqueued, we must ensure that they are updated some other
> way, most naturally as part of update_blocked_averages().
>
> To guarantee that, it basically boils down to making sure:
> Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
> leaf_cfs_rq_list.
>
> We can do that in different ways: 1) Add all cfs_rqs to the
> leaf_cfs_rq_list at task group creation, or 2) initialize group
> sched_entity contributions to zero and make sure that they are added to
> leaf_cfs_rq_list as soon as a sched_entity (task or group) is enqueued
> on it.
>
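The invariant quoted above can be written down as a small check. What follows is only a userspace sketch, not kernel code: struct toy_cfs_rq and toy_count_stale_cfs_rqs() are hypothetical names made up for illustration, and the on_list field merely stands in for membership in the leaf_cfs_rq_list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-cpu group cfs_rq. */
struct toy_cfs_rq {
	unsigned long tg_load_avg_contrib; /* contribution to tg->load_avg */
	bool on_list;                      /* is it on the leaf_cfs_rq_list? */
};

/*
 * Count the cfs_rqs that break the invariant: a non-zero contribution
 * while not being on the list, so nothing would ever decay it.
 */
static size_t toy_count_stale_cfs_rqs(const struct toy_cfs_rq *rqs, size_t n)
{
	size_t stale = 0;

	for (size_t i = 0; i < n; i++) {
		if (rqs[i].tg_load_avg_contrib != 0 && !rqs[i].on_list)
			stale++;
	}
	return stale;
}
```

In this toy model, a freshly created group whose per-cpu contributions all start at 1024 while nothing is enqueued violates the invariant on every cpu. Starting the contributions at zero (option 2) satisfies it trivially until something is enqueued.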
> Vincent's patch below gives us the second option.
>
>> kernel/sched/fair.c | 9 ++++++++-
>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>  	 * will definitely be update (after enqueue).
>>  	 */
>>  	sa->period_contrib = 1023;
>> -	sa->load_avg = scale_load_down(se->load.weight);
>> +	/*
>> +	 * Tasks are initialized with full load to be seen as heavy tasks until
>> +	 * they get a chance to stabilize to their real load level.
>> +	 * Group entities are initialized with null load to reflect the fact that
>> +	 * nothing has been attached yet to the task group.
>> +	 */
>> +	if (entity_is_task(se))
>> +		sa->load_avg = scale_load_down(se->load.weight);
>>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>  	/*
>>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway
>
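For readers who want to see the effect of the hunk above in isolation, here is a minimal userspace model of it. It is a sketch under assumptions, not the kernel function: the toy_* types and the is_task/weight fields are hypothetical stand-ins for entity_is_task(se) and scale_load_down(se->load.weight), and the group case is set to 0 explicitly, whereas the patch simply leaves the field untouched.

```c
#include <stdbool.h>

#define TOY_LOAD_AVG_MAX 47742ULL /* same value as the kernel's LOAD_AVG_MAX */

/* Hypothetical, stripped-down versions of sched_avg / sched_entity. */
struct toy_sched_avg {
	unsigned int period_contrib;
	unsigned long load_avg;
	unsigned long long load_sum;
};

struct toy_sched_entity {
	bool is_task;         /* stand-in for entity_is_task(se) */
	unsigned long weight; /* stand-in for scale_load_down(se->load.weight) */
	struct toy_sched_avg avg;
};

static void toy_init_entity_runnable_average(struct toy_sched_entity *se)
{
	struct toy_sched_avg *sa = &se->avg;

	sa->period_contrib = 1023;
	/* Tasks start "heavy"; group entities start empty. */
	sa->load_avg = se->is_task ? se->weight : 0;
	sa->load_sum = sa->load_avg * TOY_LOAD_AVG_MAX;
}
```

With weight = 1024, a task entity ends up with load_avg = 1024, while a group entity ends up with load_avg = 0 and load_sum = 0, which is what "nothing has been attached yet" is meant to express.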
> I would suggest adding a comment somewhere stating that we need to keep
> group cfs_rqs up to date:
>
> -----
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index abb3763dff69..2b820d489be0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
>  		if (throttled_hierarchy(cfs_rq))
>  			continue;
>
> +		/*
> +		 * Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
> +		 * _must_ be on the leaf_cfs_rq_list to ensure that group shares
> +		 * are updated correctly.
> +		 */

As discussed on IRC, the point is that adding the leaf cfs_rq to the
leaf_cfs_rq_list does not by itself ensure that it will be updated
correctly for unplugged CPUs.

>  		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
>  			update_tg_load_avg(cfs_rq, 0);
>  	}
> -----
>
> I did a couple of simple tests on tip/sched/core to test whether
> Vincent's fix works even without reflecting group load/util in the group
> hierarchy:
>
> Juno (2xA57+4xA53)
>
> tip:
>   grouped hog(1) alone:       2841
>   non-grouped hogs(6) alone: 40830
>   grouped hog(1):              218
>   non-grouped hogs(6):       40580
>
> tip+vg:
>   grouped hog(1) alone:       2849
>   non-grouped hogs(6) alone: 40831
>   grouped hog(1):             2363
>   non-grouped hogs(6):       38418
>
> See script below for details, but we basically see that the grouped task
> is not getting its 'fair' share on tip, while it does with Vincent's
> patch.
>
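To put numbers on "fair share", one simple way to read the figures above is to compare each contended result with its "alone" baseline. The helper below is only an illustrative sketch; retained_pct() is a hypothetical name, not something from the test script.

```c
/*
 * Percentage of the "alone" throughput that is retained when the other
 * workload runs at the same time.
 */
static double retained_pct(unsigned long contended, unsigned long alone)
{
	return 100.0 * (double)contended / (double)alone;
}
```

Applied to the results above, retained_pct(218, 2841) is about 7.7 for the grouped hog on tip, against retained_pct(2363, 2849), about 82.9, on tip+vg. The non-grouped hogs go from about 99.4 (40580 of 40830) to about 94.1 (38418 of 40831).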
> To summarize, I think Vincent's patch makes sense and works :-) More
> testing is needed of course to see if there are other problems.
>
> -----
>
> # Create 100 task groups:
> for i in `seq 1 100`;
> do
> cgcreate -g cpu:/root/test$i
> done
>
> NCPUS=$(grep -c ^processor /proc/cpuinfo)
>
> # Run single cpu hog inside task group on first cpu _alone_:
> cgexec -g cpu:/root/test100 taskset 0x01 sysbench --test=cpu \
> --num-threads=1 --max-time=5 --max-requests=1000000 run | \
> awk '{if ($4=="events:") {print "grouped hog(1) alone: " $5}}'
>
> # Run cpu hogs outside task group _alone_:
> sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
> --max-requests=1000000 run | awk '{if ($4=="events:") \
> {print "non-grouped hogs('$NCPUS') alone: " $5}}'
>
> # Run cpu hogs outside task group:
> sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
> --max-requests=1000000 run | awk '{if ($4=="events:") \
> {print "non-grouped hogs('$NCPUS'): " $5}}' &
>
> # Run single cpu hog inside task group on first cpu:
> cgexec -g cpu:/root/test100 taskset 0x01 sysbench \
> --test=cpu --num-threads=1 --max-time=5 \
> --max-requests=1000000 run | awk '{if ($4=="events:") \
> {print "grouped hog(1): " $5}}'
>
> wait
>
> # Delete task groups:
> for i in `seq 1 100`;
> do
> cgdelete -g cpu:/root/test$i
> done