Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

From: Vincent Guittot
Date: Fri Sep 23 2016 - 10:30:53 EST


Hi Matt,

On 23 September 2016 at 13:58, Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx> wrote:
> Since commit 7dc603c9028e ("sched/fair: Fix PELT integrity for new
> tasks") ::last_update_time will be set to a non-zero value in
> post_init_entity_util_avg(), which leads to p->se.avg.load_avg being
> decayed on enqueue before the task has even had a chance to run.
>
> For a NICE_0 task the sequence of events leading up to this with
> example load average changes might be,
>
>   sched_fork()
>     init_entity_runnable_average()
>       p->se.avg.load_avg = scale_load_down(se->load.weight); // 1024
>
>   wake_up_new_task()
>     post_init_entity_util_avg()
>       attach_entity_load_avg()
>         p->se.last_update_time = cfs_rq->avg.last_update_time;
>
>     activate_task()
>       enqueue_task()
>         ...
>           enqueue_entity_load_avg()
>             migrated = !sa->last_update_time // false
>             if (!migrated)
>               __update_load_avg()
>                 p->se.avg.load_avg = 1002

Does this mean that the perf drop you mention below is caused by the
load being decayed to 1002 instead of staying at 1024?

The 1002 mainly comes from period_contrib being set to 1023 in
init_entity_runnable_average(), so any delay longer than 1us between
attach_entity_load_avg() and enqueue_entity_load_avg() crosses a period
boundary and triggers the decay of the load from 1024 to 1002.
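
As a rough sketch of that arithmetic (not the kernel code itself): the
constants below are what I recall from the 4.7-era kernel/sched/fair.c,
and the helper names periods_crossed() and decay_load_avg() are made up
for illustration. The real decay_load() uses a lookup table rather than
the loop shown here, so treat the multi-period case as an approximation.

```c
#include <stdint.h>

/* Assumed values, as recalled from the 4.7-era kernel/sched/fair.c. */
#define PELT_PERIOD_US  1024u          /* one PELT period, in ~1us units   */
#define LOAD_AVG_MAX    47742u         /* maximum value load_sum can reach */
#define PELT_Y_INV_1    0xfa83b2daULL  /* y * 2^32, where y^32 = 0.5       */

/*
 * Hypothetical helper: number of period boundaries crossed when delta_us
 * more microseconds are accumulated on top of period_contrib.
 */
static unsigned int periods_crossed(unsigned int period_contrib,
				    unsigned int delta_us)
{
	return (period_contrib + delta_us) / PELT_PERIOD_US;
}

/*
 * Hypothetical helper: decay a load_avg by 'periods' periods, going
 * through load_sum the way the kernel does. Simplification: one
 * fixed-point multiply per period instead of the kernel's table lookup.
 */
static uint64_t decay_load_avg(uint64_t load_avg, unsigned int periods)
{
	uint64_t load_sum = load_avg * LOAD_AVG_MAX;

	while (periods--)
		load_sum = (load_sum * PELT_Y_INV_1) >> 32;

	return load_sum / LOAD_AVG_MAX;
}
```

With period_contrib = 1023, periods_crossed(1023, 1) is 1, and
decay_load_avg(1024, 1) gives 1002, which matches the value in the trace
above.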

>
> This causes a performance regression for fork intensive workloads like
> hackbench. When balancing on fork we can end up picking the same CPU
> to enqueue on over and over. This leads to huge congestion when trying
> to simultaneously wake up tasks that are all on the same runqueue, and
> causes lots of migrations on wake up.
>
> The behaviour since commit 7dc603c9028e essentially defeats the
> scheduler's attempt to balance on fork(). Before, ::runnable_load_avg
> likely had a non-zero value when the hackbench tasks were dequeued
> (the fork()'d tasks immediately block reading on pipe/socket) but now
> the load balancer sees the CPU as having no runnable load.

But this patch doesn't change the behavior of runnable_load_avg, does
it? It only affects the initial value of p->se.avg.load_avg when the
task is enqueued.

>
> Arguably the real problem is that balancing on fork doesn't look at
> the blocked contribution of tasks, only the runnable load and it's
> possible for the two metrics to be wildly different on a relatively
> idle system.

Fair enough.

>
> But it still doesn't seem quite right to update a task's load_avg
> before it runs for the first time.
>
> Here are the results of running hackbench before 7dc603c9028e (old
> behaviour), with 7dc603c9028e applied (existing behaviour), and after
> 7dc603c9028e with this patch on top (new behaviour),
>
> hackbench-process-sockets
>
>                         4.7.0-rc5             4.7.0-rc5             4.7.0-rc5
>                            before          7dc603c9028e                 after
> Amean    1        0.0611 (  0.00%)      0.0693 (-13.32%)      0.0600 (  1.87%)
> Amean    4        0.1777 (  0.00%)      0.1730 (  2.65%)      0.1790 ( -0.72%)
> Amean    7        0.2771 (  0.00%)      0.2816 ( -1.60%)      0.2741 (  1.08%)
> Amean    12       0.3851 (  0.00%)      0.4167 ( -8.20%)      0.3751 (  2.60%)
>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Mike Galbraith <umgwanakikbuti@xxxxxxxxx>
> Cc: Yuyang Du <yuyang.du@xxxxxxxxx>
> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> Signed-off-by: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8fb4d1942c14..4a2d3ff772f8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3142,7 +3142,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  	int migrated, decayed;
>
>  	migrated = !sa->last_update_time;
> -	if (!migrated) {
> +	if (!migrated && se->sum_exec_runtime) {
>  		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
>  			se->on_rq * scale_load_down(se->load.weight),
>  			cfs_rq->curr == se, NULL);
> --
> 2.10.0
>