Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

From: Mike Galbraith
Date: Thu Feb 21 2013 - 01:12:07 EST


On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
> [snip]
> >
> > The changes look clean and reasonable, any ideas exactly *why* it
> > speeds up?
> >
> > I.e. are there one or two key changes in the before/after logic
> > and scheduling patterns that you can identify as causing the
> > speedup?
>
> Hi, Ingo
>
> Thanks for your reply, please let me point out the key changes here
> (forgive me for not having written a good description in the cover
> letter).
>
> The performance improvement from this patch set comes from:
> 1. delaying the invocation of wake_affine().
> 2. saving the loop needed to find the proper sd.
>
> The second point is obvious, and will help a lot when the sd
> topology is deep (NUMA is supposed to make it deeper on large systems).
>
> But in my testing on a 12 cpu box, most of the benefit actually comes
> from the first point, so please let me explain it in detail.
>
> The old logic when locating affine_sd is:
>
> 	if prev_cpu != curr_cpu
> 		if wake_affine()
> 			prev_cpu = curr_cpu
> 	new_cpu = select_idle_sibling(prev_cpu)
> 	return new_cpu
>
> The new logic is the same as the old one if prev_cpu == curr_cpu, so
> let's simplify the old logic to:
>
> 	if wake_affine()
> 		new_cpu = select_idle_sibling(curr_cpu)
> 	else
> 		new_cpu = select_idle_sibling(prev_cpu)
>
> 	return new_cpu
>
> Actually that doesn't make sense.
>
> I think wake_affine() is trying to check whether moving a task from
> prev_cpu to curr_cpu will break the balance in affine_sd or not, but why
> would not breaking the balance mean that curr_cpu is better than
> prev_cpu for searching for an idle cpu?

You could argue that it's impossible to break balance by moving any task
to any idle cpu, but that would mean bouncing tasks cross node on every
wakeup is fine, which it isn't.
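
To make the two quoted forms of the old logic easier to compare, here is
a minimal user-space sketch in Python, not kernel code. The function
names old_logic(), simplified_old_logic() and forms_agree() are made up
for illustration, and wake_affine() and select_idle_sibling() are passed
in as hypothetical stubs rather than modelled on the real scheduler
functions.

```python
from itertools import product


def old_logic(prev_cpu, curr_cpu, wake_affine, select_idle_sibling):
    """Old flow as quoted: wake_affine() may redirect the search to curr_cpu."""
    if prev_cpu != curr_cpu:
        if wake_affine():
            prev_cpu = curr_cpu
    return select_idle_sibling(prev_cpu)


def simplified_old_logic(prev_cpu, curr_cpu, wake_affine, select_idle_sibling):
    """Simplified form as quoted, written for the prev_cpu != curr_cpu case."""
    if wake_affine():
        return select_idle_sibling(curr_cpu)
    return select_idle_sibling(prev_cpu)


def forms_agree():
    """Compare both forms for every stubbed wake_affine() answer and CPU pair."""
    search = lambda cpu: cpu  # hypothetical stub: the search returns its target
    for affine, prev_cpu, curr_cpu in product((True, False), range(4), range(4)):
        args = (prev_cpu, curr_cpu, lambda: affine, search)
        if old_logic(*args) != simplified_old_logic(*args):
            return False
    return True
```

With these stubs both forms pick the same search target in every case.
The only difference is that the simplified form also calls wake_affine()
when prev_cpu == curr_cpu, where its answer cannot change the result.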

> So the new logic in this patch set is:
>
> 	new_cpu = select_idle_sibling(prev_cpu)
> 	if idle_cpu(new_cpu)
> 		return new_cpu

So you tilted the scales in favor of leaving tasks in their current
package, which should benefit large footprint tasks, but should also
penalize light communicating tasks.
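
A rough Python sketch of that bias, covering only the step quoted above.
Everything in it is an assumption made for illustration: the two-package
PACKAGES layout, the toy select_idle_sibling() (target if idle, else the
first idle CPU in the same package, else target), and
new_logic_first_step(), which returns None where the unquoted remainder
of the patch would take over.

```python
# Hypothetical topology: two packages of four CPUs each.
PACKAGES = [(0, 1, 2, 3), (4, 5, 6, 7)]


def package_of(cpu):
    """Return the tuple of CPUs sharing a package with cpu."""
    for cpus in PACKAGES:
        if cpu in cpus:
            return cpus
    raise ValueError(f"unknown cpu {cpu}")


def select_idle_sibling(target, idle):
    """Toy stand-in: target if idle, else first idle CPU in its package."""
    if target in idle:
        return target
    for cpu in package_of(target):
        if cpu in idle:
            return cpu
    return target  # nothing idle nearby, fall back to the target itself


def new_logic_first_step(prev_cpu, idle):
    """Quoted part of the new flow; None means 'continue with the rest'."""
    new_cpu = select_idle_sibling(prev_cpu, idle)
    if new_cpu in idle:
        return new_cpu
    return None
```

For a task that last ran on CPU 1, new_logic_first_step(1, {3, 4})
returns 3: the wakee stays in its previous package even though CPU 4 in
the other package is idle too, and wake_affine() is never consulted.
Only when nothing in the previous package is idle, as with
new_logic_first_step(1, {4, 5}), does the decision move on.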

I suspect that much of the pgbench improvement comes from the preemption
mitigation you get from keeping a 1:N load maximally spread, which is
the perfect thing to do with such loads. In all the testing I ever did
with it in 1:N mode, preemption dominated the performance numbers. Keep
the server away from the clients and it has fewer fair competition
worries and can consume more CPU preemption free, pushing the load
collapse point strongly upward.

-Mike
