Re: Fw: [PATCH] NUMA Slab Allocator
From: Manfred Spraul
Date: Wed Mar 16 2005 - 13:42:35 EST
Hi Christoph,
Do you have profile data from your modification? What percentage of the
allocations are node-local, and what percentage come from foreign nodes?
Preferably per-cache. It shouldn't be difficult to add statistics
counters to your patch.
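A minimal sketch of what such counters could look like, written as
standalone C rather than against slab.c. The names (node_stats,
stats_count_alloc, stats_local_pct) and MAX_NODES are hypothetical and
are not part of the patch:

```c
#include <assert.h>

#define MAX_NODES 8	/* hypothetical upper bound, for illustration only */

/* Hypothetical per-cache counters; not taken from the patch. */
struct node_stats {
	unsigned long local;	/* object came from the caller's node */
	unsigned long foreign;	/* object came from some other node */
};

/*
 * Record one allocation: obj_node is the home node of the object,
 * cpu_node the node of the cpu that asked for it.
 */
static void stats_count_alloc(struct node_stats *st, int obj_node, int cpu_node)
{
	assert(obj_node >= 0 && obj_node < MAX_NODES);
	assert(cpu_node >= 0 && cpu_node < MAX_NODES);

	if (obj_node == cpu_node)
		st->local++;
	else
		st->foreign++;
}

/* Percentage of node-local allocations, 0 if nothing was counted yet. */
static unsigned int stats_local_pct(const struct node_stats *st)
{
	unsigned long total = st->local + st->foreign;

	return total ? (unsigned int)(st->local * 100 / total) : 0;
}
```

One such structure per cache would be enough to answer the per-cache
question.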
And: Can you estimate which percentage is really accessed node-local
and which percentage are long-living structures that are accessed from
all cpus in the system?
I had discussions with guys from IBM and SGI regarding a numa allocator,
and we decided that we need profile data before we can decide if we need
one:
- A node-local allocator reduces the inter-node traffic, because the
callers get node-local memory
- A node-local allocator increases the inter-node traffic, because
objects that are kfree'd on the wrong node must be returned to their
home node.
 static inline void __cache_free (kmem_cache_t *cachep, void* objp)
 {
 	struct array_cache *ac = ac_data(cachep);
+	struct slab *slabp;
 	check_irq_off();
 	objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
-	if (likely(ac->avail < ac->limit)) {
+	/* Make sure we are not freeing a object from another
+	 * node to the array cache on this cpu.
+	 */
+	slabp = GET_PAGE_SLAB(virt_to_page(objp));
This line is quite slow and should be executed only on NUMA builds.
Some kind of wrapper is required.
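A rough sketch of such a wrapper, as standalone C. Only CONFIG_NUMA is a
real kernel option here. The helper names (obj_is_off_node, lookup_slab),
the reduced struct slab, and struct object are hypothetical stand-ins,
with lookup_slab playing the role of GET_PAGE_SLAB(virt_to_page(objp)):

```c
#include <assert.h>

struct slab { int nodeid; };		/* reduced to the one field used here */
struct object { struct slab *slabp; };	/* stand-in: object knows its slab */

static int lookup_count;		/* counts how often the slow path runs */

/* Hypothetical stand-in for the slow page-to-slab lookup. */
static inline struct slab *lookup_slab(const struct object *objp)
{
	assert(objp && objp->slabp);
	lookup_count++;
	return objp->slabp;
}

#ifdef CONFIG_NUMA
/* NUMA build: pay for the lookup and compare against the local node. */
static inline int obj_is_off_node(const struct object *objp, int this_node)
{
	return lookup_slab(objp)->nodeid != this_node;
}
#else
/* Non-NUMA build: constant result, so the lookup is never executed. */
static inline int obj_is_off_node(const struct object *objp, int this_node)
{
	(void)objp;
	(void)this_node;
	return 0;
}
#endif
```

With this shape the caller stays free of #ifdefs, and the compiler can
drop the whole off-node branch on non-NUMA builds.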
+	if(unlikely(slabp->nodeid != numa_node_id())) {
+		STATS_INC_FREEMISS(cachep);
+		int nodeid = slabp->nodeid;
+		spin_lock(&(cachep->nodelists[nodeid])->list_lock);
This line is very dangerous: Every free of a wrong-node object causes a
spin_lock operation. I fear that the cache line traffic for the spinlock
might kill the performance for some workloads. I personally think that
batching is required, i.e. each cpu stores wrong-node objects in a
separate per-cpu array, and then the objects are returned as a block to
their home node.
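To illustrate the batching idea, a small standalone sketch. All names
(foreign_cache, free_foreign, flush_foreign, node_list) and the limits
are hypothetical, and the spinlock is replaced by a counter so that the
number of lock operations can be observed:

```c
#include <assert.h>

#define MAX_NODES   4	/* hypothetical, for illustration only */
#define BATCH_LIMIT 4	/* hypothetical size of the staging array */

struct slab { int nodeid; };
struct object { struct slab *slabp; };

/* Per-node list state, reduced to counters for the illustration. */
struct node_list {
	unsigned long lock_ops;	/* how often the node lock was taken */
	unsigned long returned;	/* objects handed back to this node */
};

/* Hypothetical per-cpu staging array for wrong-node objects. */
struct foreign_cache {
	unsigned int avail;
	struct object *entry[BATCH_LIMIT];
};

/* Return all staged objects, taking each node lock at most once. */
static void flush_foreign(struct foreign_cache *fc, struct node_list *nodes)
{
	unsigned int i;
	int node;

	for (node = 0; node < MAX_NODES; node++) {
		int locked = 0;

		for (i = 0; i < fc->avail; i++) {
			if (fc->entry[i]->slabp->nodeid != node)
				continue;
			if (!locked) {
				nodes[node].lock_ops++;	/* spin_lock() here */
				locked = 1;
			}
			nodes[node].returned++;
		}
		/* spin_unlock() would go here if locked */
	}
	fc->avail = 0;
}

/* Stage one wrong-node object; flush first if the array is full. */
static void free_foreign(struct foreign_cache *fc, struct node_list *nodes,
			 struct object *objp)
{
	assert(objp->slabp->nodeid >= 0 && objp->slabp->nodeid < MAX_NODES);

	if (fc->avail == BATCH_LIMIT)
		flush_foreign(fc, nodes);
	fc->entry[fc->avail++] = objp;
}
```

In this sketch, four staged frees spread over two nodes cost two lock
operations at flush time instead of four.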
-/*
- * NUMA: different approach needed if the spinlock is moved into
- * the l3 structure
You have moved the cache spinlock into the l3 structure. Have you
compared both approaches?
A global spinlock has the advantage that batching is possible in
free_block: Acquire global spinlock, return objects to all nodes in the
system, release spinlock. A node-local spinlock would mean less
contention [multiple spinlocks instead of one global lock], but far more
spin_lock/unlock calls.
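The difference in lock calls can be made concrete with a small counting
sketch. It assumes that free_block walks the batch in array order without
sorting it by node first. Both function names are hypothetical, and
node_of[i] stands for the home node of the i-th object in the batch:

```c
#include <assert.h>

/* One global lock: a single acquisition covers the whole batch. */
static unsigned int lock_ops_global(const int *node_of, unsigned int n)
{
	(void)node_of;
	return n ? 1 : 0;
}

/*
 * Per-node locks, objects processed in array order: the lock has to be
 * switched every time the home node changes from one object to the next.
 */
static unsigned int lock_ops_per_node(const int *node_of, unsigned int n)
{
	unsigned int i, ops = 0;

	for (i = 0; i < n; i++) {
		assert(node_of[i] >= 0);
		if (i == 0 || node_of[i] != node_of[i - 1])
			ops++;
	}
	return ops;
}
```

For a batch whose home nodes are 0, 1, 0, 1, 2, 2 this gives one
acquisition with the global lock and five with per-node locks.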
IIRC the conclusion from our discussion was that there are at least
four possible implementations:
- your version
- Add a second per-cpu array for off-node allocations. __cache_free
batches, free_block then returns. Global spinlock or per-node spinlock.
A patch with a global spinlock is in
http://www.colorfullife.com/~manfred/Linux-kernel/slab/patch-slab-numa-2.5.66
per-node spinlocks would require a restructuring of free_block.
- Add per-node array for each cpu for wrong node allocations. Allows
very fast batch return: each array contains memory just from one node,
useful if per-node spinlocks are used.
- do nothing. Least overhead within slab.
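A sketch of the third variant (one array per home node on each cpu), again
as standalone C with hypothetical names and limits and with the spinlock
replaced by a counter. Since every array holds objects from a single node,
a flush needs exactly one lock and no scan:

```c
#include <assert.h>

#define MAX_NODES   4	/* hypothetical, for illustration only */
#define BATCH_LIMIT 2	/* hypothetical size of each per-node array */

/* Objects staged for one particular home node. */
struct node_batch {
	unsigned int avail;
	void *entry[BATCH_LIMIT];
};

/* Per-node list state, reduced to counters for the illustration. */
struct node_list {
	unsigned long lock_ops;	/* how often the node lock was taken */
	unsigned long returned;	/* objects handed back to this node */
};

/* Hypothetical per-cpu structure: one staging array per home node. */
struct cpu_foreign {
	struct node_batch batch[MAX_NODES];
};

/* Hand one node's staged objects back under a single lock operation. */
static void flush_node(struct cpu_foreign *cf, struct node_list *nodes,
		       int node)
{
	struct node_batch *b = &cf->batch[node];

	if (!b->avail)
		return;
	nodes[node].lock_ops++;			/* spin_lock() here */
	nodes[node].returned += b->avail;	/* return the whole block */
	b->avail = 0;				/* spin_unlock() here */
}

/* Stage an off-node object in the array that belongs to its home node. */
static void free_off_node(struct cpu_foreign *cf, struct node_list *nodes,
			  void *objp, int home_node)
{
	struct node_batch *b;

	assert(home_node >= 0 && home_node < MAX_NODES);
	b = &cf->batch[home_node];
	if (b->avail == BATCH_LIMIT)
		flush_node(cf, nodes, home_node);
	b->entry[b->avail++] = objp;
}
```

The price is memory: MAX_NODES arrays per cpu and per cache instead of
one.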
I'm fairly certain that "do nothing" is the right answer for some
caches. For example the dentry-cache: The object lifetime is seconds to
minutes, the objects are stored in a global hashtable. They will be
touched from all cpus in the system, thus guaranteeing that
kmem_cache_alloc returns node-local memory won't help. But the added
overhead within slab.c will hurt.
--
Manfred