[MODSLAB 0/5] Modular slab allocator V3

From: Christoph Lameter
Date: Fri Sep 01 2006 - 18:31:39 EST


Modular Slab Allocator:

Why would one use this?

1. Reduced memory requirements.

Savings range from a few hundred kilobytes on i386 to 5GB on a 1024p 4TB
Altix NUMA system.

The slabifier has no caches in the sense of the slab allocator: no storage
is allocated for per cpu, shared or alien caches. A slab itself functions
as the cache. Objects are served directly from a per cpu slab (an "active"
slab). The management overhead for caches is gone.

Slabs do not contain metadata but only the payload. Metadata is kept
in the associated page struct. This means that objects can begin at the
start of a slab and are always properly aligned.
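
To illustrate, the overloaded page struct could look roughly like this.
This is a minimal sketch with illustrative field names; the actual patch
overloads existing struct page members in include/linux/mm.h:

        /* Sketch only: slab metadata lives in struct page, not in the slab. */
        struct slab_cache;

        struct page {
                unsigned long flags;    /* PG_slab marks a slab page */
                /* ... the usual struct page members ... */
                union {
                        struct {        /* valid while PG_slab is set */
                                void *freelist;          /* first free object in the slab */
                                int inuse;               /* objects allocated from this slab */
                                struct slab_cache *slab; /* owning slab cache */
                        };
                        /* ... normal page-cache usage of these words ... */
                };
        };

Because the payload starts at the very beginning of the page, object
alignment follows directly from the page alignment of the slab.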

2. No cache reaper

The current slab allocator needs to periodically check its slab caches
and move objects back into the slabs. Every 2 seconds, on every cpu, all
slab caches are scanned and objects are moved around the system. The
system cannot really enter a quiescent state.

The slabifier needs no such mechanism in the single processor case. In the
SMP case we have a per slab flusher that is active as long as processors
have active slabs. After a timeout it flushes the active slabs back into
the slab lists. If no active slabs exist then the flusher is deactivated.
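
Such a deferred flusher could be built on the kernel's delayed workqueue
machinery; the helper names in this sketch are assumptions:

        #include <linux/workqueue.h>
        #include <linux/jiffies.h>

        /* Hypothetical helpers: query and drain the per cpu active slabs. */
        extern int active_slabs_present(void);
        extern void flush_active_cpu_slabs(void);

        static void slab_flush_fn(struct work_struct *work);
        static DECLARE_DELAYED_WORK(slab_flusher, slab_flush_fn);

        static void slab_flush_fn(struct work_struct *work)
        {
                /* Push the active slabs back onto the regular slab lists. */
                flush_active_cpu_slabs();

                /* Re-arm only while some cpu still holds an active slab. */
                if (active_slabs_present())
                        schedule_delayed_work(&slab_flusher, 2 * HZ);
        }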

The cache_reaper has been a consistent trouble spot for interrupt
holdoffs and scheduling latencies throughout the SLES9 and SLES10
development cycles. I would be grateful if we did not have to deal with
that again.

3. Can use the NUMA policies of the page allocator.

The current slab allocator implements NUMA support through per node
lists of slabs. If a memory policy or cpuset restricts access to
certain nodes then the slab allocator itself must enforce these
policies. This is only partially implemented, and as a result cpusets,
memory policies and the NUMA slab do not mix too well. Per node lists
end up containing slabs that are actually located on other nodes, which
causes latencies during cache draining (in the cache reaper and
elsewhere).

The slabifier does not implement per node slabs. Instead it uses a
single global pool of partial pages. Memory policies only come into play
when the active slab is exhausted and new pages are allocated from the
page allocator. In that case the page allocator allocates the page
according to the current cpuset and memory policies, and the slabifier
then serves objects from that slab to the application. The slabifier
does not attempt to guarantee that kmalloc() allocations are node
local; it only makes a best-effort attempt. Only kmalloc_node()
guarantees that an object is allocated on a particular node.
kmalloc_node() accomplishes that by searching the partial list for a
fitting page and allocating a page from the requested node if none can
be found.
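
A sketch of that node-local lookup, under assumed structure and function
names (locking simplified):

        #include <linux/mm.h>
        #include <linux/gfp.h>
        #include <linux/list.h>
        #include <linux/spinlock.h>

        /* Sketch of a slab cache with a single global partial list. */
        struct slab_cache {
                spinlock_t list_lock;
                struct list_head partial;
                unsigned int order;     /* page order of one slab */
        };

        /*
         * Return a partial slab located on the requested node, or fall
         * back to allocating a fresh slab page directly from that node.
         */
        static struct page *get_partial_node(struct slab_cache *s, gfp_t flags,
                                             int node)
        {
                struct page *page;

                spin_lock(&s->list_lock);
                list_for_each_entry(page, &s->partial, lru) {
                        if (page_to_nid(page) == node) {
                                list_del(&page->lru);
                                spin_unlock(&s->list_lock);
                                return page;
                        }
                }
                spin_unlock(&s->list_lock);

                /* No fitting partial slab; get a new one from the right node. */
                return alloc_pages_node(node, flags, s->order);
        }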

4. Reduced number of partial slabs.

The current NUMA slab allocator keeps per node partial lists. This
means that fragmentation of slabs occurs on a per node basis: the more
nodes in the system, the more potentially partially allocated slabs.

The slabifier keeps a single global partial list. Allocations on other
nodes can cause the partial list to be shrunk. The existing slab pages
of a slab cache therefore have a higher usage rate.

The locking of the partial list is a potential scalability problem that
is addressed in the following ways:

A. The partial list is only modified (and the lock taken) when necessary.
Locking is only needed when a page enters the partial list (it was
full and the first object was freed), when it becomes completely
empty (the slab has to be freed), or when it is retrieved to become an
active slab for a particular cpu's allocations.

B. A "min_slab_order=" kernel boot option is added. This allows to increase
the size of the slab pages. Bigger slabs mean less lock taking and larger
per cpu caches. It also reduces slab fragmentation but comes with the
danger that the kernel cannot satisfy higher order allocations. However,
order 1,2,3 allocations should usually be fine. In my tests up to 32p
I have not yet seen a need to use this to reduce lock contention.
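
Wiring up such a boot option would typically use __setup(); a minimal
sketch (the variable name is an assumption):

        #include <linux/init.h>
        #include <linux/kernel.h>

        static int slab_min_order;      /* default: order-0 slab pages */

        static int __init setup_slab_min_order(char *str)
        {
                get_option(&str, &slab_min_order);
                return 1;
        }
        __setup("slab_min_order=", setup_slab_min_order);

Booting with slab_min_order=2, for example, makes every slab span at
least four contiguous pages, enlarging the per cpu cache that an active
slab represents and reducing how often the partial list lock is taken.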

5. Ability to compactify partial lists.

The slab shrinker of the slabifier can take a callback function that is
capable of moving an object. With that the slabifier is able to reduce
the number of partial slabs.
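
The interface might look roughly like the following; the signature is an
assumption for illustration, not necessarily the API in the patches:

        struct slab_cache;

        /*
         * Sketch: the caller supplies a callback that relocates one live
         * object; the shrinker uses it to empty nearly-free slabs so
         * their pages can be handed back to the page allocator.
         */
        typedef int (*slab_move_fn)(struct slab_cache *s, void *from, void *to);

        int slab_shrink(struct slab_cache *s, slab_move_fn move);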

6. Maintainability

The Modular Slab is made up of components that can be replaced
individually. Modifications are easy, and new features are easier to add.

7. Performance

The performance of the Modular Slab is roughly comparable to that of the
existing slab allocator, since both rely on managing a per cpu cache of
objects. The slab allocator does so by explicitly managing object lists;
the slabifier does it by reserving one slab per cpu for allocations.
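
For illustration, the slabifier's fast path reduces to popping an object
off the active slab's free list. A sketch with assumed names, using the
overloaded page->freelist field sketched earlier (preemption handling
omitted):

        #include <linux/mm.h>
        #include <linux/smp.h>

        struct slab_cache {
                struct page **active;   /* one active slab per cpu */
                /* ... */
        };

        extern void *slab_alloc_slow(struct slab_cache *s);     /* refill path */

        static void *slab_alloc(struct slab_cache *s)
        {
                struct page *page = s->active[smp_processor_id()];
                void *object;

                if (page && page->freelist) {
                        object = page->freelist;
                        page->freelist = *(void **)object;      /* next free object */
                        return object;
                }
                return slab_alloc_slow(s);      /* active slab empty: get a new one */
        }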

TODO:

- More performance tests than just AIM7; higher CPU counts than 32.

Changes V2->V3:

- Tested on i386 (UP + SMP), x86_64 (UP), IA64 (NUMA, up to 32p)

- Overload struct page in mm.h with slab definitions. That
reduces the macros significantly and makes code more
readable.

- Debug and optimize functions; reduce cacheline footprint.

- Add support for specifying slab_min_order= at bootup in order
to be able to influence fragmentation and lock scaling.

Changes V1-V2:
- Drop pageslab and numaslab. Drop support for VMALLOC allocations.

- Enhance slabifier with some numa capability. Bypass
free list management for slabs with a single object.
Drop slab full lists and minimize lock taking
for partial lists.

- Optimize code: Generate general slab array immediately
and pass the address of the slab cache in kmalloc(). DMA
caches remain dynamic.

- Add support for non power of 2 general caches.

- Tested on i386, x86_64 and ia64.

The main intent of this patchset is to modularize
the slab allocator so that developing additions
or modifications to the allocator layer becomes easier.
The framework enables the use of multiple slab allocators
and allows the generation of additional underlying
page allocators (as needed, for example, for mempools and
other specialized uses).

The modularization is accomplished through the use of a few
concepts from object oriented programming. Allocators are
described by methods and functions can produce new allocators
based on existing ones by modifying their methods.
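
Concretely, an allocator can be represented as a table of methods, with
a deriving function returning a copy that overrides some of them. A
minimal sketch, with all names assumed for illustration:

        #include <linux/slab.h>

        struct slab_cache;

        /* Sketch: an allocator is a table of methods. */
        struct slab_allocator {
                void *(*alloc)(struct slab_cache *s, gfp_t flags);
                void (*free)(struct slab_cache *s, void *object);
        };

        /* Provided elsewhere: defers the free past an RCU grace period. */
        void rcu_free(struct slab_cache *s, void *object);

        /* Derive an RCU-capable allocator from any existing allocator. */
        struct slab_allocator *derive_rcu_allocator(struct slab_allocator *base)
        {
                struct slab_allocator *a = kmalloc(sizeof(*a), GFP_KERNEL);

                if (!a)
                        return NULL;
                *a = *base;             /* inherit all methods ... */
                a->free = rcu_free;     /* ... then override free */
                return a;
        }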

So what the patches provide here is:

1. A framework for page allocators and slab allocators

2. Various methods to derive new allocators from old ones
(add rcu support, destructors, constructors, dma etc)

3. A layer that emulates the existing slab interface (the slabulator).

4. A layer that provides kmalloc functionality.

5. The Slabifier. This is conceptually the Simple Slab (see my RFC
from last week), but with the additional allocator modifications
possible it grows considerably and can supply most of the
functionality of the existing slab allocator, and can even go
beyond it. My tests with AIM7 seem to indicate that it is
equal in performance to the existing slab allocator.

Some new features:

1. The slabifier can flag double frees when they occur
and will attempt to continue (see the sketch after this list).

2. Ability to merge slabs of the same type.
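
A double free can be flagged, for example, by checking whether the
object is already on the slab's free list before linking it back in. A
sketch, again using the overloaded page->freelist and assumed names:

        #include <linux/mm.h>
        #include <linux/kernel.h>

        struct slab_cache;

        /* Walk the slab's free list looking for the object. */
        static int on_freelist(struct page *page, void *object)
        {
                void *p = page->freelist;       /* overloaded struct page field */

                while (p) {
                        if (p == object)
                                return 1;
                        p = *(void **)p;        /* free pointer stored in the object */
                }
                return 0;
        }

        static void slab_free(struct slab_cache *s, struct page *page, void *object)
        {
                if (on_freelist(page, object)) {
                        printk(KERN_ERR "slab: double free of %p\n", object);
                        return;         /* flag the event and attempt to continue */
                }
                /* ... normal free: link object into page->freelist ... */
        }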

Notably missing features:

- Slab Debugging
(This should be implemented by deriving a new slab allocator from
existing ones and adding the necessary processing in alloc and free).

Performance tests on an 8p and a 32p machine consistently show
performance equal to the standard slab allocator. Memory use is much
lower since there is no metadata overhead per slab.

This patchset leaves the existing slab allocator unharmed. It adds a
hook to include/linux/slab.h that redirects includes to the definitions
of the allocation framework and the slabulator.

Deactivate the "Traditional Slab allocator" in order to activate the modular
slab allocator.

More details may be found in the headers of the following patches.

