Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3

From: Nick Piggin
Date: Tue Mar 30 2004 - 21:10:35 EST


Erich Focht wrote:
Hi Nick,


Hi Erich,

On Tuesday 30 March 2004 11:05, Nick Piggin wrote:

I'm with Martin here, we are just about to merge all this
sched-domains stuff. So we should at least wait until after
that. And of course, *nothing* gets changed without at least
one benchmark that shows it improves something. So far
nobody has come up to the plate with that.


I thought you were talking the whole time about STREAM. That is THE
benchmark that shows you the impact of balancing at fork. And it is a
VERY relevant benchmark. Though you shouldn't run it on historical
machines like NUMAQ; no compute center in the western world will buy
NUMAQs for high performance... Andy typically runs STREAM on all CPUs
of a machine. Try it on N/2 and N/4 and so on, and you'll see the impact.
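For anyone who hasn't run it: the heart of STREAM is just a
bandwidth-bound loop like the triad below (a rough OpenMP sketch, not
the actual benchmark source; the array size is made up). With fewer
threads than CPUs, the scheduler's initial placement decides whether
each thread streams from local or remote memory:

/*
 * Rough STREAM-style triad sketch (illustrative only, not the real
 * benchmark; N is an arbitrary size). Build with e.g. gcc -O2 -fopenmp.
 */
#include <stdlib.h>
#include <omp.h>

#define N (8 * 1024 * 1024)

int main(void)
{
	double *a = malloc(N * sizeof(double));
	double *b = malloc(N * sizeof(double));
	double *c = malloc(N * sizeof(double));

	/* first touch: each thread initialises (and so places) its slice */
	#pragma omp parallel for
	for (long i = 0; i < N; i++) {
		b[i] = 1.0;
		c[i] = 2.0;
	}

	/* the triad: pure memory bandwidth; run with OMP_NUM_THREADS
	 * at all CPUs, then half, then a quarter, and compare */
	#pragma omp parallel for
	for (long i = 0; i < N; i++)
		a[i] = b[i] + 3.0 * c[i];

	free(a); free(b); free(c);
	return 0;
}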


Well yeah, but the immediate problem was that sched-domains was
*much* worse than 2.6's numasched, neither of which balances on
fork/clone. I didn't want to obscure the issue by implementing
balance on fork/clone until we had worked out exactly what the
problem was.

Anyway, once sched-domains goes in, you can basically do whatever
you like without impacting anyone else...


There are other things, like java, servers, etc. that use threads.


I'm just saying that you should have the choice. The default should be
as before, balance at exec().


Yeah well that is a very sane thing to do ;)


The point is that we have never had this before, and nobody
(until now) has been asking for it. And there are as yet no


?? Sorry, I've had balance at fork since 2001 in the NEC IA64 NUMA
kernels, and users use it intensively with OpenMP. I advertised it a
lot, asked for it, talked about it at the last OLS. Only IA64 was
considered rare big iron. I understand that the issue gets hotter if
the problem hurts on AMD64...


Sorry, I hadn't realised. I guess because you are happy with
your own stuff you haven't made much noise about it on the
list lately. I apologise.

I wonder though, why don't you just teach OpenMP to use
affinities as well? Surely that is better than relying on the
behaviour of the scheduler, even if it does balance on clone.
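Something along these lines would do it (a minimal sketch, assuming
the Linux sched_setaffinity() call and a naive one-thread-per-CPU
mapping; a real runtime would take the placement policy from the
user):

/*
 * Minimal sketch: pin each OpenMP thread to its own CPU with
 * sched_setaffinity(). The thread-number-to-CPU mapping is a naive
 * assumption for illustration. Build with gcc -O2 -fopenmp.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>

static void pin_self_to(int cpu)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	sched_setaffinity(0, sizeof(mask), &mask); /* pid 0: calling thread */
}

int main(void)
{
	#pragma omp parallel
	pin_self_to(omp_get_thread_num());

	/* parallel regions from here on run where they were pinned */
	return 0;
}

Once pinned, the scheduler's initial placement no longer matters for
those threads.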


convincing benchmarks that even show best case improvements. And
it could very easily have some bad cases.


Again: I'm talking about having the choice. The user decides. Nothing
protects you against user stupidity, but if the only choice they have
is poor automatic initial scheduling, that's not enough. And having
the fork/clone initial balancing policy means you don't need to make
your code complicated and unportable by playing with setaffinity
(which is plainly unusable when you share the machine with other
users).


If you do it by hand, you know exactly what is going to happen,
and you can turn off the balance-on-clone flags and you don't
incur the hit of pulling in remote cachelines from every CPU at
clone time to do balancing. Surely an HPC application wouldn't
mind doing that? (I guess they probably don't call clone a lot
though).
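Conceptually, the clone-time decision looks something like the sketch
below (names such as SD_BALANCE_CLONE, for_each_domain() and
find_idlest_cpu() are stand-ins for the sched-domains patches being
discussed, not verbatim kernel code). With the flag clear for a
domain, clone never touches its runqueues at all:

/*
 * Conceptual sketch only -- the flag and helper names here are
 * illustrative stand-ins, not actual mainline code of this era.
 */
static int sched_balance_clone(struct task_struct *p, int this_cpu)
{
	struct sched_domain *sd;
	int cpu = this_cpu;

	for_each_domain(this_cpu, sd) {
		if (!(sd->flags & SD_BALANCE_CLONE))
			continue;	/* flag off: no remote runqueue traffic */
		/* flag on: scanning the domain for the least loaded CPU
		 * pulls in remote runqueue cachelines */
		cpu = find_idlest_cpu(p, cpu, sd);
	}
	return cpu;
}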


And finally, HPC
applications are the very ones that should be using CPU
affinities because they are usually tuned quite tightly to the
specific architecture.


There are companies mainly selling NUMA machines for HPC (SGI?), so
this is not a niche market. Clusters of big NUMA machines are not
unusual, and they're typically not used for databases but for HPC
apps. Unfortunately proprietary UNIX is still considered to have
better features than Linux for such configurations.


Well, SGI should be doing tests soon and tuning the scheduler
to their liking. Hopefully others will too, so we'll see what
happens.


Let's just make sure we don't change defaults without any
reason...


No reason? Aaarghh... >;-)


Sorry, I mean evidence. I'm sure with a properly tuned
implementation, you could get really good speedups in lots
of places... I just want to *see* them. All I have seen so
far is Andi getting a bit better performance on something
where he can get *much* better performance by making a
trivial tweak instead.

I really don't have the software or hardware to test this
at all so I just have to sit and watch.