Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

From: Vincent Guittot
Date: Tue May 02 2017 - 09:26:20 EST


Hi Tejun,

On Tuesday 02 May 2017 at 09:18:53 (+0200), Vincent Guittot wrote:
> On 28 April 2017 at 22:33, Tejun Heo <tj@xxxxxxxxxx> wrote:
> > Hello, Vincent.
> >
> > On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
> >> On 27 April 2017 at 00:52, Tejun Heo <tj@xxxxxxxxxx> wrote:
> >> > Hello,
> >> >
> >> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
> >> >> On 24 April 2017 at 22:14, Tejun Heo <tj@xxxxxxxxxx> wrote:
> >> >> Can the problem be on the load balance side instead ? and more
> >> >> precisely in the wakeup path ?
> >> >> After looking at the trace, it seems that task placement happens at
> >> >> wake up path and if it fails to select the right idle cpu at wake up,
> >> >> you will have to wait for a load balance which is already too late
> >> >
> >> > Oh, I was tracing most of scheduler activities and the ratios of
> >> > wakeups picking idle CPUs were about the same regardless of cgroup
> >> > membership. I can confidently say that the latency issue that I'm
> >> > seeing is from load balancer picking the wrong busiest CPU, which is
> >> > not to say that there can't be other problems.
> >>
> >> ok. Is there any trace that you can share ? your behavior seems
> >> different from mine
> >
> >

[snip]

> > You can notice that B's per-task weight is 4.409, which is way higher
> > than A's 2.779, and this is because Q014-asdf's contribution to Q014-/
> > is twice as high as it should be. The root queue's runnable avg should
>
> Are you sure that this is because of blocked load in group A? It may
> be that Q014-asdf has already had to wait before running, and its load
> still increases while it is runnable but not running.
> IIUC your trace, group A has 2 running tasks and group B only one, but
> load_balance selects B because its sgs->avg_load is higher. But
> this can also happen even if runnable_load_avg of the child cfs_rq is
> propagated correctly into the group entity, because we can have a
> situation where group A has only 1 task with a higher load than the 2
> tasks in group B; even if blocked load is not taken into account,
> load_balance will then select A.
>
> IMHO, we should rather improve load balance selection. I'm going to
> add smarter group selection in load_balance. That's something we
> should have done already, but it was difficult without load/util_avg
> propagation. It should be doable now.

Could you test the patch below, which changes load_balance?
If a group is not overloaded, which means that its threads get all the
runtime they want, we select the cfs_rq according to the number of running
threads instead of the load.

---
kernel/sched/fair.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a903276..87e3b77 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7069,7 +7069,8 @@ static unsigned long task_h_load(struct task_struct *p)
/********** Helpers for find_busiest_group ************************/

enum group_type {
- group_other = 0,
+ group_idle = 0,
+ group_other,
group_imbalanced,
group_overloaded,
};
@@ -7383,6 +7384,9 @@ group_type group_classify(struct sched_group *group,
if (sgs->group_no_capacity)
return group_overloaded;

+ if (!sgs->sum_nr_running)
+ return group_idle;
+
if (sg_imbalanced(group))
return group_imbalanced;

@@ -7476,8 +7480,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
if (sgs->group_type < busiest->group_type)
return false;

- if (sgs->avg_load <= busiest->avg_load)
+ if (sgs->group_type == group_other) {
+ /*
+ * The groups are not overloaded so there is enough cpu time
+ * for all threads. In this case, take the group with the
+ * highest number of tasks per CPU in order to improve
+ * scheduling latency
+ */
+ if ((sgs->sum_nr_running * busiest->group_weight) <=
+ (busiest->sum_nr_running * sgs->group_weight))
+ return false;
+ } else if (sgs->avg_load <= busiest->avg_load) {
return false;
+ }

if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
goto asym_packing;
@@ -7969,6 +7984,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
!check_cpu_capacity(rq, env->sd))
continue;

+ if (!rq->cfs.h_nr_running)
+ continue;
+
/*
* For the load comparisons with the other cpu's, consider
* the weighted_cpuload() scaled with the cpu capacity, so
--
2.7.4
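
To make the tasks-per-CPU test in update_sd_pick_busiest() above easier to
check in isolation, here is a minimal userspace sketch of the comparison.
struct sg_stats and more_tasks_per_cpu() are hypothetical names made up for
this example and are not kernel code; only the field names sum_nr_running,
group_weight and avg_load mirror what the patch uses. Note that the helper
returns true when the candidate is strictly busier, which is the inverse of
the "<= ... return false" form in the patch.

```c
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the scheduler's per-group stats. */
struct sg_stats {
	unsigned int sum_nr_running;	/* runnable tasks in the group */
	unsigned int group_weight;	/* number of CPUs in the group */
	unsigned long avg_load;		/* load metric, deliberately ignored here */
};

/*
 * Return true if @sgs has strictly more tasks per CPU than @busiest.
 *
 * a / wa > b / wb is evaluated as a * wb > b * wa, which avoids the
 * integer division and the rounding that comes with it.
 */
static bool more_tasks_per_cpu(const struct sg_stats *sgs,
			       const struct sg_stats *busiest)
{
	return (unsigned long)sgs->sum_nr_running * busiest->group_weight >
	       (unsigned long)busiest->sum_nr_running * sgs->group_weight;
}
```

Assuming two groups of 2 CPUs each, a group with 2 running tasks is preferred
over a group with 1 running task whatever their avg_load values are, and two
groups with the same ratio compare as "not busier", so the current busiest
group is kept.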


>
> > only contain what's currently active but because we're scaling load
> > avg which includes both active and blocked, we're ending up picking
> > group B over A.
> >