Re: [BUG] sched: big numa dynamic sched domain memory corruption

From: Paul Jackson
Date: Tue Aug 01 2006 - 04:23:44 EST


Andrew, Ingo,

I believe Suresh is correct, and that the mainline no longer
has this bug. So the patch that was buried in my report should
not be applied.


Suresh wrote:
> Paul can you please test the mainline code and confirm?

I have tested the mainline, 2.6.18-rc2-mm1, and I am now nearly certain
you are correct. I did not see the failure. There is a slim chance
that I botched the forward port of the tripwire code needed to catch
this bug efficiently, and so could have missed seeing the bug on that
account.

I will now try backporting the three patches you recommended for
systems running a release such as SLES10:

sched: fix group power for allnodes_domains
sched_domain: Allocate sched_group structures dynamically
sched: build_sched_domains() fix

I wish you well on any further code improvements you have planned for
this code. It is tough to understand: there are many #ifdef's, the key
sched domain arrays have an interesting memory layout that I didn't see
described much in the comments, and the memory allocation calls are
tough to unravel on error. Portions of the code could use more comments
explaining what is going on. For example, I still haven't figured out
exactly what 'cpu_power' means.
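
For what it is worth, here is my best guess at the convention, boiled
down to userspace arithmetic. The SCHED_LOAD_SCALE baseline and the
roughly 10% bonus per extra SMT sibling are only my reading of the
2.6.17-era build_sched_domains(), so take the details with a grain of
salt:

	#include <stdio.h>

	#define SCHED_LOAD_SCALE 1024	/* capacity of one ordinary CPU */

	/*
	 * Guessed convention: a physical package counts as one full CPU,
	 * plus a small bonus per extra SMT sibling, since siblings share
	 * the core's execution resources rather than doubling them.
	 */
	static unsigned long package_power(int smt_siblings)
	{
		return SCHED_LOAD_SCALE +
		       (smt_siblings - 1) * SCHED_LOAD_SCALE / 10;
	}

	int main(void)
	{
		printf("1 thread per core:  %lu\n", package_power(1));	/* 1024 */
		printf("2 threads per core: %lu\n", package_power(2));	/* 1126 */
		return 0;
	}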

The allocations of sched_group_allnodes, sched_group_phys and
sched_group_core are -big- on our ia64 SN2 systems (1024 CPUs),
and could fail once a system has been up for a while and memory is
getting tight and fragmented.
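
To put a rough number on that: each sched_group carries a full
cpumask_t, which at NR_CPUS=1024 is already 128 bytes, and each of
those arrays holds on the order of one group per possible CPU or node.
The field layout below is only my guess at the 2.6.17-era struct, so
treat this as a back-of-envelope sketch:

	#include <stdio.h>

	#define NR_CPUS		1024
	#define BITS_PER_LONG	64

	/* Guessed stand-ins for the kernel types, sized for NR_CPUS=1024. */
	struct toy_cpumask { unsigned long bits[NR_CPUS / BITS_PER_LONG]; };

	struct toy_sched_group {
		void *next;			/* circular list of groups */
		struct toy_cpumask cpumask;	/* 128 bytes of CPU mask */
		unsigned long cpu_power;
	};

	int main(void)
	{
		size_t one = sizeof(struct toy_sched_group);
		size_t arr = one * NR_CPUS;	/* one group per possible CPU */

		printf("one group:     %zu bytes\n", one);
		printf("NR_CPUS array: %zu bytes (~%zu KB)\n", arr, arr / 1024);
		return 0;
	}

A physically contiguous kmalloc of well over a hundred kilobytes per
level is exactly the sort of request that stops succeeding once memory
gets fragmented.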

It is not obvious to me from the code or comments just how sched
domains are arranged on various large systems with hyper-threading
(SMT) and/or multiple cores (MC) and/or multiple processor packages
per node, and how scheduling is affected by all this.
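
For what it is worth, my working picture (and part of the problem is
that I am not sure it is right) is that each CPU sees a nested chain of
domains, narrowest span first, roughly like this toy model:

	#include <stdio.h>

	/* Toy model only; these are not the kernel's structures, and the
	 * exact span of each level is my guess. */
	struct toy_domain {
		const char *level;
		struct toy_domain *parent;	/* next wider span, NULL at the top */
	};

	int main(void)
	{
		struct toy_domain allnodes = { "allnodes: all nodes in the system", NULL };
		struct toy_domain node     = { "node: a set of nearby nodes", &allnodes };
		struct toy_domain phys     = { "phys: the packages on one node", &node };
		struct toy_domain mc       = { "MC: the cores in one package", &phys };
		struct toy_domain smt      = { "SMT: the siblings in one core", &mc };
		struct toy_domain *sd;

		/* The balancer conceptually works outward from the narrowest span. */
		for (sd = &smt; sd; sd = sd->parent)
			printf("%s\n", sd->level);
		return 0;
	}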

This was about the third bug that has come by in this code -- which I
notice in particular when whoever hits it is playing with cpu_exclusive
cpusets. Any kernel backtrace with 'cpuset' buried in it tends to
migrate to my inbox. This latest bug was particularly nasty, as random
memory corruption bugs usually are, and cost us a bunch of hours.

Good luck.

If you are aware of any other fixes or patches, besides the above, that
we big honkin numa iron SLES10 users need for reliable operation, let
me know.

Thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.925.600.0401