Re: [GIT PULL] scheduler fixes

From: Ingo Molnar
Date: Mon May 18 2009 - 16:21:16 EST



* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Mon, 18 May 2009, Ingo Molnar wrote:
> >
> > Something like the patch below. It also fixes ->span[] which has
> > a similar problem.
>
> Patch looks good to me.

ok. I've queued it up for .31, with your Acked-by. (which i assume
your reply implies?)

> > But ... i think this needs further clean-ups really. Either go
> > fully static, or go fully dynamic.
>
> I do agree that it would probably be good to try to avoid this
> static allocation, and allocate these data structures dynamically.
> However, if we end up having to use two different allocators
> anyway (one for bootup, and one for regular uptimes), then I think
> that would be an overall loss (compared to just the simplicity of
> statically doing this in a couple of places), rather than an
> overall win.
>
> > Would be nice if bootmem_alloc() was extended with such
> > properties - if SLAB is up (and bootmem is down) it would return
> > kmalloc(GFP_KERNEL) memory buffers.
>
> I would rather say the other way around: no "bootmem_alloc()" at
> all, but just have a regular alloc() that ends up working like the
> "SMP alternatives" code, but instead of being about SMP, it would
> be about how early in the boot sequence it is.
>
> That said, if there are just a couple of places like this that
> care, I don't think it's worth it. The static allocation isn't
> that horrible. I'd rather have a few ugly static allocations with
> comments about _why_ they look the way they do, than try to
> over-design things to look "clean".
>
> Simplicity is a good thing - even if it can then end up meaning
> special cases like this.
>
> That said, if we could move the kmalloc initialization up some
> more (and get at least the "boot node" data structures set up), and
> avoid any bootmem alloc issues _entirely_, then that would be
> good.
>
> I hate that stupid bootmem allocator. I suspect we seriously
> over-use it, and that we _should_ be able to do the SL*B init
> earlier.

Hm, tempting thought - not sure how to pull it off though.

Among the biggest users of bootmem are the mem_map[] hierarchies and
the page allocator bitmaps. Not sure we can get rid of bootmem there
- those areas are really large, physical memory is often fragmented
and we need a good NUMA sense for them as well.

We might also have a 22-architectures-to-fix problem before we can
get rid of bootmem:

$ git grep alloc_bootmem arch/ | wc -l
168

On x86 we recently switched some (but not all) early-pagetable
allocations to the 'early brk' method (which is an utterly simple
early linear allocator, for limited early dynamic allocations), but
even with that we still have ugly bootmem use - for example see the
after_bootmem hacks in arch/x86/mm/init_64.c.
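
To make the "early linear allocator" idea concrete, here is a minimal
userspace sketch of a bump allocator over a fixed, statically reserved
area. It is an illustration only, not the kernel's extend_brk(): the
names early_linear_alloc(), early_area and EARLY_AREA_SIZE are
hypothetical, and the 4096-byte size is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace analogue of an early linear allocator.
 * Names and sizes are illustrative, not taken from the kernel. */
#define EARLY_AREA_SIZE 4096

/* Statically reserved area; 64-byte base alignment means requested
 * alignments up to 64 are honoured in absolute terms. */
static _Alignas(64) unsigned char early_area[EARLY_AREA_SIZE];
static size_t early_used;   /* bytes handed out so far */

/* Hand out 'size' bytes aligned to 'align' (a power of two).
 * There is no free(): memory is only ever carved off the front.
 * Returns NULL once the reserved area is exhausted. */
static void *early_linear_alloc(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Round the current offset up to the requested alignment. */
    size_t start = (early_used + align - 1) & ~(align - 1);

    if (start > EARLY_AREA_SIZE || size > EARLY_AREA_SIZE - start)
        return NULL;

    early_used = start + size;
    return &early_area[start];
}
```

The point of the sketch is how little state such an allocator needs:
one offset into a reserved region, which is what makes it usable
before anything else is set up, and also why it only suits limited
early allocations.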

So we have these increasingly complete layers of allocators, which
bootstrap each other gradually:

- static, build-time allocations

- early-brk (see extend_brk(), RESERVE_BRK and direct use of
_brk_end in assembly code)

- e820 based early allocator (reserve_early()) to bootstrap bootmem

- bootmem - to bootstrap the page allocator [NUMA aware]

- page allocator - to bootstrap SLAB

- SLAB

that's 5 layers until we get to SLAB. Each layer has to be aware of
its own limits, has to interact with pagetable setup and has to end
up with NUMA-aware dynamic allocations as early as possible.
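
As a rough illustration of the "one alloc() that depends on how early
in boot we are" idea, here is a userspace sketch with just two stages:
a static early pool, and malloc() standing in for kmalloc(). All names
(boot_alloc(), boot_stage_advance(), STAGE_EARLY, STAGE_SLAB_UP) are
hypothetical, and swapping a function pointer is only a stand-in for
alternatives-style code patching.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical boot stages; a real kernel would have more layers. */
enum boot_stage { STAGE_EARLY, STAGE_SLAB_UP };

static unsigned char early_pool[1024];
static size_t early_pool_used;

/* Early backend: carve 16-byte-granular chunks off a static pool. */
static void *early_alloc(size_t size)
{
    size = (size + 15) & ~(size_t)15;
    if (size > sizeof(early_pool) - early_pool_used)
        return NULL;
    void *p = &early_pool[early_pool_used];
    early_pool_used += size;
    return p;
}

/* Late backend: malloc() stands in for kmalloc(GFP_KERNEL). */
static void *late_alloc(size_t size)
{
    return malloc(size);
}

static enum boot_stage current_stage = STAGE_EARLY;
static void *(*boot_alloc_impl)(size_t) = early_alloc;

/* Move to a later stage and switch the backend; never go backwards. */
static void boot_stage_advance(enum boot_stage next)
{
    assert(next >= current_stage);
    current_stage = next;
    if (next == STAGE_SLAB_UP)
        boot_alloc_impl = late_alloc;
}

/* The single entry point callers use, regardless of boot stage. */
static void *boot_alloc(size_t size)
{
    return boot_alloc_impl(size);
}

/* Helper for callers that must not free() early-pool memory. */
static int is_early_memory(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)early_pool;
    return a >= lo && a < lo + sizeof(early_pool);
}
```

Callers only ever see boot_alloc(); the hard part that the sketch
glosses over is exactly what the layers above have to deal with:
pagetable setup, NUMA placement, and what happens to early allocations
once the later allocator takes over.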

And all this complexity definitely _feels_ utterly wrong, as we
really know pretty early on what kind of memory we have and how it's
laid out amongst nodes. In the end we really just want to have the
page allocator and SL[AOQU]B.

Looks daunting.

Ingo