Re: The Slab Allocator (Was: Re: New dcache not using slab allocator?)

Mark Hemment
Wed, 6 Aug 1997 00:39:15 +0100 (BST)

On Tue, 5 Aug 1997, David S. Miller wrote:
> SIMP is fast because it can rely on the following premises:
> 1) The local per-cpu pool is only accessed from the current processor
> 2) Interrupts never ever try to get at a SIMP pool
> .....
> SLAB can never make these assumptions, and thus not have these
> performance characteristics. There is only one way around it and I am
> not even certain it would work or not:
> 1) Add different front end calls for the SLAB, clients calling this
> set of interfaces are informing SLAB that they never will make
> any calls to SLAB functions etc. from within an interrupt handler.
> 2) Change these new interfaces to use per-cpu SLAB object pools.
> SIMP should not be removed until SLAB can provide all of the
> advantages SIMP has and at the same level of performance. This means:

> 1) No SMP locking in the common case, xchg's are verboten as well.
> 2) No cli()'s in any code path whatsoever.
> Comments?

Changing the current SLAB to use per-engine caches is something I've
been meaning to try to do. It is certainly non-trivial.
The problem is that an object may be released by a different engine than
the one that allocated it. As an object is a member of a 'slab' it needs
to be returned to that slab, but there is no guarantee that the releasing
engine owns the slab (another engine may be releasing/allocating from the
slab). The easiest way to ensure slab ownership is with a spinlock - which
breaks David's requirements.
One solution (although ugly) is to have multiple slab management objects
for each slab, plus one extra. The 'extra' is used by a global slab pool
for a given cache, and each engine also has its own slab management
object. Objects can migrate from the global pool to a per-engine pool,
and vice-versa, but _not_ from one engine pool directly to another engine
pool. Only the global pool needs a spinlock.
This makes maintaining the partial ordering of slabs difficult - a slab
may have a different ordering in each pool!
An allocation first checks if there are any objects in the per-engine
pool. If not, it checks the global-pool (after taking the spin-lock), and
moves all the objects from a slab [in the global] into the per-engine
pool. (Moving the entire slab - which may only have a few objects, as
others might be owned by other engines - is more efficient than moving
individual objects).
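A rough user-space sketch of that allocation path (all of the names here -
engine_pool_t, pool_alloc() and so on - are illustrative only, not existing
kernel code, and the spinlock is reduced to a plain flag so the example
actually runs):

```c
#include <stddef.h>

#define POOL_MAX 16

typedef struct {
    void *objs[POOL_MAX];   /* locally free objects; engine-private, no lock */
    int   count;
} engine_pool_t;

typedef struct {
    void *objs[POOL_MAX];
    int   count;
    int   lock;             /* stands in for a real spinlock */
} global_pool_t;

/* Allocation: try the per-engine pool first (no locking on the fast path);
 * on a miss, take the global lock and migrate a batch of objects over. */
void *pool_alloc(engine_pool_t *ep, global_pool_t *gp)
{
    if (ep->count > 0)
        return ep->objs[--ep->count];        /* fast path: lock-free */

    gp->lock = 1;                            /* spin_lock() in the kernel */
    while (gp->count > 0 && ep->count < POOL_MAX)
        ep->objs[ep->count++] = gp->objs[--gp->count];
    gp->lock = 0;                            /* spin_unlock() */

    if (ep->count > 0)
        return ep->objs[--ep->count];
    return NULL;                             /* would fall back to the MM */
}
```

Moving the whole batch under one lock acquisition, rather than one object
per lock round-trip, is the efficiency argument made above.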
To be able to reap slabs from a per-engine pool into the global pool,
there will need to be a kernel thread bound to each engine (this is the
only way of avoiding a per-engine spin-lock within the allocation/release
routines), or a single reaping thread which migrates between engines. These
threads will monitor the per-engine slab pools, and move objects from
these pools into the global pool. Slabs in the global pool can be reaped
back into the MM sub-system.
(This requires changes to the scheduler to allow engine-binding, which
could also be exposed as a system call.)
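What a pass of that engine-bound reaper might look like, again as a purely
illustrative user-space sketch (struct names invented, spinlock reduced to
a flag). The point is that, because the thread runs on the owning engine,
it can drain that engine's pool without any per-engine lock:

```c
#include <stddef.h>

#define EP_MAX 16

/* Minimal stand-ins for the two pool types described above. */
struct eng_pool  { void *objs[EP_MAX]; int count; };
struct glob_pool { void *objs[EP_MAX]; int count; int lock; };

/* One reaper pass: drain the engine pool down to 'keep' objects, moving
 * the surplus into the global pool. Only the global lock is taken. */
void reap_engine_pool(struct eng_pool *ep, struct glob_pool *gp, int keep)
{
    gp->lock = 1;                     /* spin_lock() in the real kernel */
    while (ep->count > keep && gp->count < EP_MAX)
        gp->objs[gp->count++] = ep->objs[--ep->count];
    gp->lock = 0;                     /* spin_unlock() */
    /* slabs now wholly in the global pool could be reaped to the MM */
}
```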
Of course, part of the cache management object (kmem_cache_t) will also
need to be divided into per-engine parts plus a global-pool part.
There are a load of other problems as well, but I think this is just
about do-able. However, I feel such an implementation would only become
efficient on a system with many (4 or more?) engines.

Because of David's requirement of not blocking interrupts at any stage
during an allocation/release, there will need to be two versions of the
allocation/release functions - one which performs cli()/restore() and one
which doesn't. This isn't too bad.
To avoid confusion, the non-cli versions would only be called via the
SIMP. Basically, as a few people e-mailed me about, maintaining the
current SIMP API (or something similar) makes good sense. We can hide the
SLAB's non-cli() interface behind this API.
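The two entry points would pair up roughly as below - a sketch only, with
invented names (fake_cache_t, __cache_alloc_nocli()), and with the
cli()/restore_flags() pair modelled as a depth counter so it runs in user
space:

```c
#include <stddef.h>

/* Stand-ins for save_flags()/cli() and restore_flags(). */
static int irq_disable_depth;
#define save_flags_cli(f)  do { (f) = irq_disable_depth++; } while (0)
#define restore_flags(f)   do { irq_disable_depth = (f); } while (0)

typedef struct { void *free_list[8]; int count; } fake_cache_t;

/* Core allocation: assumes the caller has already made it safe to run
 * (never called from an interrupt handler) - the SIMP-style entry. */
void *__cache_alloc_nocli(fake_cache_t *c)
{
    return c->count ? c->free_list[--c->count] : NULL;
}

/* Interrupt-safe wrapper: the general-purpose entry point. */
void *cache_alloc(fake_cache_t *c)
{
    unsigned long flags;
    void *obj;

    save_flags_cli(flags);            /* cli() in the real kernel */
    obj = __cache_alloc_nocli(c);
    restore_flags(flags);
    return obj;
}
```

Clients which can promise they never allocate/release from interrupt
context call the first function (via the SIMP API); everyone else pays for
the cli().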

Implementing a front-end to the SLAB (which would be the SIMP) is not
difficult. It involves extending the kmem_cache_t structure to include
per-engine locally free indices - which would be very similar to the
current SIMP's 'postbuffer'. The problem is in refreshing (and flushing)
these indices while still maintaining a small SLAB API. There might be an
easy way of doing this, but I haven't thought of one yet.
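For the release side, the per-engine indices might look something like
this - a hypothetical sketch of the extended cache structure (percpu_cache_t
and cache_free_local() are made-up names, not the existing kmem_cache_t
fields):

```c
#include <stddef.h>

#define NR_ENGINES 4
#define INDEX_MAX  8

/* Hypothetical extension of kmem_cache_t: a small per-engine array of
 * locally free objects, in the spirit of SIMP's 'postbuffer'. */
typedef struct {
    void *local_free[NR_ENGINES][INDEX_MAX];
    int   local_count[NR_ENGINES];
    /* ... the existing slab lists and global spinlock would live here ... */
} percpu_cache_t;

/* Release: stash the object in this engine's index (lock-free), and tell
 * the caller when the index is full and must be flushed back into the
 * global slab lists (the only point the spinlock would be taken). */
int cache_free_local(percpu_cache_t *c, int engine, void *obj)
{
    c->local_free[engine][c->local_count[engine]++] = obj;
    return c->local_count[engine] == INDEX_MAX;   /* time to flush? */
}
```

The refresh/flush operations are exactly where the awkwardness lies: they
have to walk objects back to their owning slabs, which is the part the
small SLAB API doesn't currently expose.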

In summary, meeting all the requirements is not easy. If implemented,
Linux would have a memory allocator which would work well with many
engines, but as most of the kernel doesn't currently scale well this could
(and should) be considered overkill. Avoiding cli()s is important (my
'experimental' MM avoids them for many page allocations). Avoiding
spinlocks is not so important outside of the very common code paths. In
fact, if Linux heads towards a pre-emptive kernel (for real-time) some of
the assumptions made by the current SIMP (and other code) may not hold
true. (Some pre-emptive kernels assume a task [in kernel] can be
pre-empted if no locks are held by the current task, but this is looking
far into the future.)

I'll look in detail at changing the SLAB allocator next week, but I
doubt I will implement anything soon.



Mark Hemment, Unix/C Software Engineer (Contractor)
"Success has many fathers, failure is a B**TARD!" - anon