Re: [PATCH RFC] select_idle_sibling experiments

From: Mike Galbraith
Date: Wed Apr 06 2016 - 03:27:42 EST


On Tue, 2016-04-05 at 14:08 -0400, Chris Mason wrote:

> Now, on to the patch. I pushed some code around and narrowed the
> problem down to select_idle_sibling(). We have cores going into and out
> of idle fast enough that even this cut our latencies in half:

Are you using NO_HZ? If so, you may want to try the attached.

> static int select_idle_sibling(struct task_struct *p, int target)
> 				goto next;
>
> 			for_each_cpu(i, sched_group_cpus(sg)) {
> -				if (i == target || !idle_cpu(i))
> +				if (!idle_cpu(i))
> 					goto next;
> 			}
>
> IOW, by the time we get down to for_each_cpu(), the idle_cpu() check
> done at the top of the function is no longer valid.

Ok, that's only an optimization, could go if it's causing trouble.

> I tried a few variations on select_idle_sibling() that preserved the
> underlying goal of returning idle cores before idle SMT threads. They
> were all horrible in different ways, and none of them were fast.
>
> The patch below just makes select_idle_sibling pick the first idle
> thread it can find. When I ran it through production workloads here, it
> was faster than the patch we've been carrying around for the last few
> years.
>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 56b7d4b..c41baa6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4974,7 +4974,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  static int select_idle_sibling(struct task_struct *p, int target)
>  {
>  	struct sched_domain *sd;
> -	struct sched_group *sg;
>  	int i = task_cpu(p);
>
>  	if (idle_cpu(target))
> @@ -4990,24 +4989,14 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  	 * Otherwise, iterate the domains and find an elegible idle cpu.
>  	 */
>  	sd = rcu_dereference(per_cpu(sd_llc, target));
> -	for_each_lower_domain(sd) {
> -		sg = sd->groups;
> -		do {
> -			if (!cpumask_intersects(sched_group_cpus(sg),
> -						tsk_cpus_allowed(p)))
> -				goto next;
> -
> -			for_each_cpu(i, sched_group_cpus(sg)) {
> -				if (i == target || !idle_cpu(i))
> -					goto next;
> -			}
> +	if (!sd)
> +		goto done;
>
> -			target = cpumask_first_and(sched_group_cpus(sg),
> -					tsk_cpus_allowed(p));
> +	for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
> +		if (cpu_active(i) && idle_cpu(i)) {
> +			target = i;
>  			goto done;
> -next:
> -			sg = sg->next;
> -		} while (sg != sd->groups);
> +		}
>  	}
>  done:
>  	return target;
>

Ew. That may improve your latency-is-everything load, but the worst
case package walk will hurt like hell on CPUs with an insane number of
threads. That full search also turns the evil face of two-faced little
select_idle_sibling() into its only face, the one that bounces tasks
about much more than they appreciate.

Looking for an idle core first delivers the biggest throughput boost,
and only looking at target's SMT threads if you don't find one keeps
the bounce and traversal pain down to a dull roar, while still trying
for that latency win. To me, your patch looks like it trades harm to
many for a win for a few.
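
Roughly the shape below, as an illustration only: the *_sketch helpers
are invented for this mail, not functions in any tree, while sd_llc,
idle_cpu(), topology_sibling_cpumask() and friends are the existing
kernel bits.

/*
 * Illustration only, not a patch: the search order described above.
 * The *_sketch names are made up; everything they call is existing
 * kernel machinery.
 */
static int find_idle_core_sketch(struct task_struct *p, struct sched_domain *sd)
{
	int core, cpu;

	/* Walk the LLC looking for a core whose threads are all idle. */
	for_each_cpu_and(core, sched_domain_span(sd), tsk_cpus_allowed(p)) {
		bool all_idle = true;

		for_each_cpu(cpu, topology_sibling_cpumask(core)) {
			if (!idle_cpu(cpu)) {
				all_idle = false;
				break;
			}
		}
		if (all_idle)
			return core;
	}
	return -1;
}

static int find_idle_smt_sketch(struct task_struct *p, int target)
{
	int cpu;

	/* Only target's own siblings: bounded walk, stays cache affine. */
	for_each_cpu_and(cpu, topology_sibling_cpumask(target), tsk_cpus_allowed(p)) {
		if (idle_cpu(cpu))
			return cpu;
	}
	return -1;
}

static int select_idle_sibling_sketch(struct task_struct *p, int target)
{
	struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, target));
	int cpu;

	if (idle_cpu(target) || !sd)
		return target;

	/* Idle core first: that's where the throughput is. */
	cpu = find_idle_core_sketch(p, sd);
	if (cpu >= 0)
		return cpu;

	/* Then only target's threads, to chase the latency win cheaply. */
	cpu = find_idle_smt_sketch(p, target);
	if (cpu >= 0)
		return cpu;

	return target;
}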

A behavior switch would be better. select_idle_sibling() can't get any
dumber, but trying to make it smarter makes it too damn fat. As it
sits, it's aiming in the general direction of the bullseye... and
occasionally hits the wall.
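
By "behavior switch" I mean something as dumb as a sched_feat() knob,
so a latency-is-everything load can opt into the full scan while
everyone else keeps the cheap path. Again only a sketch, with made-up
feature and helper names:

/* kernel/sched/features.h -- hypothetical knob, default off. */
SCHED_FEAT(SIS_FULL_SCAN, false)

/* kernel/sched/fair.c -- sketch only, the two helpers are invented. */
static int select_idle_sibling(struct task_struct *p, int target)
{
	if (idle_cpu(target))
		return target;

	/* Full LLC scan (your patch) only when explicitly asked for. */
	if (sched_feat(SIS_FULL_SCAN))
		return select_idle_llc_scan(p, target);		/* hypothetical */

	/* Otherwise the cheaper idle-core-then-SMT search. */
	return select_idle_core_then_smt(p, target);		/* hypothetical */
}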

-Mike

sched: ratelimit nohz

Entering nohz code on every micro-idle is too expensive to bear. Skip
stopping the tick when rq->avg_idle is below sched_migration_cost, i.e.
when the CPU is expected to go busy again almost immediately (nohz_full
CPUs are left alone).

Signed-off-by: Mike Galbraith <efault@xxxxxx>
---
 include/linux/sched.h    |    5 +++++
 kernel/sched/core.c      |    8 ++++++++
 kernel/time/tick-sched.c |    2 +-
 3 files changed, 14 insertions(+), 1 deletion(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2286,6 +2286,11 @@ static inline int set_cpus_allowed_ptr(s
 #ifdef CONFIG_NO_HZ_COMMON
 void calc_load_enter_idle(void);
 void calc_load_exit_idle(void);
+#ifdef CONFIG_SMP
+extern int sched_needs_cpu(int cpu);
+#else
+static inline int sched_needs_cpu(int cpu) { return 0; }
+#endif
 #else
 static inline void calc_load_enter_idle(void) { }
 static inline void calc_load_exit_idle(void) { }
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -577,6 +577,14 @@ static inline bool got_nohz_idle_kick(vo
 	return false;
 }
 
+int sched_needs_cpu(int cpu)
+{
+	if (tick_nohz_full_cpu(cpu))
+		return 0;
+
+	return cpu_rq(cpu)->avg_idle < sysctl_sched_migration_cost;
+}
+
 #else /* CONFIG_NO_HZ_COMMON */
 
 static inline bool got_nohz_idle_kick(void)
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -676,7 +676,7 @@ static ktime_t tick_nohz_stop_sched_tick
 	} while (read_seqretry(&jiffies_lock, seq));
 	ts->last_jiffies = basejiff;
 
-	if (rcu_needs_cpu(basemono, &next_rcu) ||
+	if (sched_needs_cpu(cpu) || rcu_needs_cpu(basemono, &next_rcu) ||
 	    arch_needs_cpu() || irq_work_needs_cpu()) {
 		next_tick = basemono + TICK_NSEC;
 	} else {