Re: better wake-balancing: respin

From: Nick Piggin
Date: Fri Oct 28 2005 - 04:19:29 EST


Chen, Kenneth W wrote:
Once upon a time, this patch was in -mm tree (2.6.13-mm1):
http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2

It is neither in Linus's official tree, nor it is in -mm anymore.

I guess I missed the objection for dropping the patch. I'm bringing

My objection for the patch is that it seems to be designed just to
improve your TPC - and I don't think we've seen results yet... or
did I miss that?

Also - by no means do I think improving TPC is wrong, but I think
such a patch may not be the right way to go. It doesn't seem to solve
your problem well.

Now you may have one of two problems. Well it definitely looks like
you are taking a lot of cache misses in try_to_wake_up - however this
won't be due to the load balancing stuff, but rather from locking the
remote CPUs runqueue and touching its runqueues, and cachelines in
the task_struct that had been last touched by the remote CPU.

In fact, if the balancing stuff in try_to_wake_up is working as it
should, then it will result in fewer "remote wakups" because tasks
will be moved to the same CPU that wakes them. Schedstats can tell
us a lot about this, BTW.

The second problem you may have is that the balancing stuff is going
haywire and actually causing tasks to move around too much. If this
is the case, then I really need to look at your workload (at least
schedstats output) and try to get things working a bit better. Knocking
half its brains out with a hammer is just going to make it perform
poorly in more cases without fixing your underlying problem.

Well - you may have a 3rd problem: that schedule and wake_up are simply
being called too often. What's going on with your workload? How many
context switches? What's the schedule profile look like? (we should get
a wake up profile too).

Basically I'd like to see a lot more information.

up this discussion again. The wake-up path is a lot hotter on numa
system running database benchmark. Even on a moderate 8P numa box,
__wake_up and try_to_wake_up is showing up as #1 and #4 hottest kernel
functions. While on a comparable 4P smp box, these two functions are
#5 and #9 respectively.


With all else being equal, an 8P box is going to have 133% more remote
wakeups than a 4P box, and each of those cacheline transfers is going
to have a higher latency. The difference in numbers when moving from
4 to 8 way isn't very surprising.

I think situation will be worse on 32P numa box in the wake up path.
I don't have any measurement on 32P setup yet, because 8P numa
performance sucks at the moment and it is a blocker for us before
proceed any bigger setup.


That's a pity. What do we suck in comparison to? What's our ratio
between actual and expected performance?

Thanks,
Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com -
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/