Re: [GIT PULL] scheduler fixes

From: Ingo Molnar
Date: Sun May 24 2009 - 22:55:31 EST



* Yinghai Lu <yinghai@xxxxxxxxxx> wrote:

> Pekka J Enberg wrote:
> > On Mon, 18 May 2009, Linus Torvalds wrote:
> >>>> I hate that stupid bootmem allocator. I suspect we seriously
> >>>> over-use it, and that we _should_ be able to do the SL*B init
> >>>> earlier.
> >>> Hm, tempting thought - not sure how to pull it off though.
> >> As far as I can recall, one of the things that historically made us want
> >> to use the bootmem allocator even relatively late was that the real SLAB
> >> allocator had to wait until all the node information etc was initialized.
> >>
> >> That's pretty damn late. And I wonder if SLUB (and SLOB) might not need a
> >> lot less initialization, and work much earlier. Something like that might
> >> be the final nail in the coffin for SLAB, and convince me to just say
> >> "we don't support it any more".
> >
> > Ingo, here's a patch that boots a UMA+SMP+SLUB x86-64 kernel on qemu all
> > the way to userspace. It probably breaks a bunch of things for now, but
> > it is something for you to play with if you want.
> >
>
> Updated against tip/master. Also added a change to cpupri_init;
> otherwise we get:
> [ 0.000000] Memory: 523096612k/537526272k available (10461k kernel code, 656156k absent, 13773504k reserved, 7186k data, 2548k init)
> [ 0.000000] SLUB: Genslabs=14, HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=8
> [ 0.000000] ------------[ cut here ]------------
> [ 0.000000] WARNING: at kernel/lockdep.c:2282 lockdep_trace_alloc+0xaf/0xee()
> [ 0.000000] Hardware name: Sun Fire X4600 M2
> [ 0.000000] Modules linked in:
> [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-rc6-tip-01778-g0afdd0f-dirty #259
> [ 0.000000] Call Trace:
> [ 0.000000] [<ffffffff810a0274>] ? lockdep_trace_alloc+0xaf/0xee
> [ 0.000000] [<ffffffff81075ab0>] warn_slowpath_common+0x88/0xcb
> [ 0.000000] [<ffffffff81075b15>] warn_slowpath_null+0x22/0x38
> [ 0.000000] [<ffffffff810a0274>] lockdep_trace_alloc+0xaf/0xee
> [ 0.000000] [<ffffffff8110301b>] kmem_cache_alloc_node+0x38/0x14d
> [ 0.000000] [<ffffffff813ec548>] ? alloc_cpumask_var_node+0x4a/0x10a
> [ 0.000000] [<ffffffff8109eb61>] ? lockdep_init_map+0xb9/0x564
> [ 0.000000] [<ffffffff813ec548>] alloc_cpumask_var_node+0x4a/0x10a
> [ 0.000000] [<ffffffff813ec62c>] alloc_cpumask_var+0x24/0x3a
> [ 0.000000] [<ffffffff819e6306>] cpupri_init+0x7f/0x112
> [ 0.000000] [<ffffffff819e5a30>] init_rootdomain+0x72/0xb7
> [ 0.000000] [<ffffffff821facce>] sched_init+0x109/0x660
> [ 0.000000] [<ffffffff82203082>] ? kmem_cache_init+0x193/0x1b2
> [ 0.000000] [<ffffffff821dfd7a>] start_kernel+0x218/0x3f3
> [ 0.000000] [<ffffffff821df2a9>] x86_64_start_reservations+0xb9/0xd4
> [ 0.000000] [<ffffffff821df3b2>] x86_64_start_kernel+0xee/0x109
> [ 0.000000] ---[ end trace a7919e7f17c0a725 ]---
>
> Works on an 8-socket NUMA amd64 box.
>
> YH
>
> ---
>  init/main.c           |   28 ++++++++++++++++------------
>  kernel/irq/handle.c   |   23 ++++++++---------------
>  kernel/sched.c        |   34 +++++++++++++---------------------
>  kernel/sched_cpupri.c |    9 ++++++---
>  mm/slub.c             |   17 ++++++++++-------
>  5 files changed, 53 insertions(+), 58 deletions(-)

Very nice!

Would it be possible to restructure things so that kmalloc init moves
before IRQ init as well? We have a couple of ugly spots there too.

Conceptually, memory should be the first thing a kernel sets up. It
does not need IRQs, timers, the scheduler or any of the IO facilities
and abstractions. All of them need memory though - and as Linux scales
to more and more hardware via the same single image, we will get more
and more dynamic concepts like cpumask_var_t and sparse-irqs, which
want to allocate very early.
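
To make the cpumask_var_t point concrete, here is a rough userspace
model (not kernel code) of why such a type wants an allocator early.
The type and function names mirror the kernel's, but the bodies are
simplified assumptions: calloc() stands in for kmalloc(), the gfp
flags argument is dropped, NR_CPUS is an arbitrary large value, and
CPUMASK_OFFSTACK is a plain macro standing in for the config option.
Both variants hand back a zeroed mask for simplicity, which the real
allocator does not guarantee.

```c
/* Userspace model only: names mirror the kernel's, bodies are simplified. */
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS          4096   /* assumed large config */
#define CPUMASK_OFFSTACK 1      /* set to 0 for the embedded-array variant */
#define BITS_PER_LONG    (8 * sizeof(unsigned long))

struct cpumask {
	unsigned long bits[NR_CPUS / BITS_PER_LONG];
};

#if CPUMASK_OFFSTACK
/* The mask is only a pointer, so every user needs a working allocator. */
typedef struct cpumask *cpumask_var_t;

static bool alloc_cpumask_var(cpumask_var_t *mask)
{
	*mask = calloc(1, sizeof(struct cpumask));	/* stand-in for kmalloc() */
	return *mask != NULL;
}

static void free_cpumask_var(cpumask_var_t mask)
{
	free(mask);
}
#else
/* The mask is a one-element array: no allocation, usable at any boot stage. */
typedef struct cpumask cpumask_var_t[1];

static bool alloc_cpumask_var(cpumask_var_t *mask)
{
	memset(*mask, 0, sizeof(struct cpumask));
	return true;
}

static void free_cpumask_var(cpumask_var_t mask)
{
	(void)mask;	/* nothing to free */
}
#endif

/* Set the bit for one CPU in the mask. */
static void cpumask_set_cpu(unsigned int cpu, struct cpumask *m)
{
	m->bits[cpu / BITS_PER_LONG] |= 1UL << (cpu % BITS_PER_LONG);
}

/* Return true if the bit for this CPU is set. */
static bool cpumask_test_cpu(unsigned int cpu, const struct cpumask *m)
{
	return (m->bits[cpu / BITS_PER_LONG] >> (cpu % BITS_PER_LONG)) & 1UL;
}
```

With the off-stack variant every caller, including ones as early as
cpupri_init(), depends on the allocator being up, which is exactly the
pressure described above.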

setup_arch() is one huge function that sets up all architecture
details at once - but if we split a separate setup_arch_mem() out of
it, and left the rest in setup_arch (and moved it further down), we
could remove much of bootmem (especially the ugly uses).
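
A toy userspace sketch of that ordering idea, just to illustrate it.
Everything here is a stub: setup_arch_mem() is the hypothetical
split-out helper, setup_arch_rest() and the *_stub() functions are
made-up names, and the "late slab" sequence is a simplified stand-in
rather than the exact start_kernel() order. Each stage that wants
memory before the slab allocator is up is counted as a bootmem user.

```c
/* Userspace sketch of the proposed init ordering; all stages are stubs. */
#include <stdbool.h>
#include <stdio.h>

static bool slab_ready;        /* true once the "real" allocator is usable */
static int bootmem_fallbacks;  /* stages that had to allocate before that */

/* Called by every stage that wants to allocate memory. */
static void note_alloc(const char *who)
{
	if (!slab_ready) {
		bootmem_fallbacks++;
		printf("%s: slab not up yet, would need bootmem\n", who);
	}
}

static void setup_arch_mem(void)  { /* memory map, nodes, zones (stub) */ }
static void mm_init_stub(void)    { slab_ready = true; }
static void setup_arch_rest(void) { note_alloc("setup_arch (rest)"); }
static void sched_init_stub(void) { note_alloc("sched_init"); }
static void init_irq_stub(void)   { note_alloc("init_IRQ"); }

/* Simplified "allocator comes up late" ordering. */
static int boot_late_slab(void)
{
	slab_ready = false;
	bootmem_fallbacks = 0;
	setup_arch_mem();
	setup_arch_rest();
	sched_init_stub();
	init_irq_stub();
	mm_init_stub();
	return bootmem_fallbacks;
}

/* Proposed ordering: memory first, everything else afterwards. */
static int boot_mem_first(void)
{
	slab_ready = false;
	bootmem_fallbacks = 0;
	setup_arch_mem();
	mm_init_stub();
	setup_arch_rest();
	sched_init_stub();
	init_irq_stub();
	return bootmem_fallbacks;
}
```

In this toy model boot_late_slab() reports three early allocators and
boot_mem_first() reports none. The real dependencies inside
setup_arch() are of course much messier than three stubs.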

This might even be doable realistically, and we could thus librarize
bootmem and eliminate it from x86 at least. Perhaps.

Ingo