[patch 0/6] Per cpu structures for SLUB

From: Christoph Lameter
Date: Thu Aug 23 2007 - 02:47:46 EST


The following patchset introduces per cpu structures for SLUB. These
structures are very small (multiples of them may fit into one
cacheline) and, apart from performance improvements, allow us to
address several issues in SLUB (a sketch of such a structure follows
the list):

1. The number of objects per slab is no longer limited to a 16-bit
number.

2. Room is freed up in the page struct. We can avoid using the
mapping field, which allows us to get rid of the #ifdef CONFIG_SLUB
in page_mapping().

3. We will have an easier time adding new things like Peter Z.'s
reserve management.
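
To make this concrete, here is a minimal sketch of what such a per
cpu structure could look like. Field names and layout here are
illustrative assumptions, not necessarily what the patches use:

        /*
         * Sketch: per cpu fast path state for one kmem_cache on one
         * processor. Small enough that multiples of it may fit into
         * a single cacheline.
         */
        struct kmem_cache_cpu {
                void **freelist;        /* next free object in the cpu slab */
                struct page *page;      /* slab page we allocate from */
                int node;               /* node of the current slab page */
                unsigned int offset;    /* free pointer offset in an object */
                unsigned int objsize;   /* size of an object */
        };

The point of the sketch is only that the fast path state lives
outside the page struct; that is what makes points 1 and 2 above
possible.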

The RFC for this patchset was discussed on lkml a while ago:

http://marc.info/?l=linux-kernel&m=118386677704534&w=2

(And no, this patchset includes neither the use of cmpxchg_local
that we discussed recently on lkml nor the cmpxchg implementation
mentioned in the RFC.)

Performance
-----------


Norm = 2.6.23-rc3
PCPU = adds the page allocator pass through plus the per cpu
structure patches

(Lower numbers are better.)


IA64 8p 4n NUMA Altix

              Single threaded            Concurrent Alloc
          Kmalloc     Alloc/Free      Kmalloc      Alloc/Free
  Size   Norm  PCPU   Norm  PCPU     Norm  PCPU    Norm  PCPU
--------------------------------------------------------------
     8    132    84     93   104       98    90      95   106
    16     98    92     93   104      115    98      95   106
    32    112   105     93   104      146   111      95   106
    64    119   112     93   104      214   133      95   106
   128    132   119     94   104      321   163      95   106
  256+  83255   176    106   115      415   224     108   117
   512    191   176    106   115      487   341     108   117
  1024    252   246    106   115      937   609     108   117
  2048    308   292    107   115     2494  1207     108   117
  4096    341   319    107   115     2497  1217     108   117
  8192    402   380    107   115     2367  1188     108   117
16384*    560   474    106   434     4464  1904     108   478

X86_64 2p SMP (Dual Core Pentium 940)

              Single threaded            Concurrent Alloc
          Kmalloc     Alloc/Free      Kmalloc      Alloc/Free
  Size   Norm  PCPU   Norm  PCPU     Norm  PCPU    Norm  PCPU
--------------------------------------------------------------
     8    313   227    314   324      207   208     314   323
    16    202   203    315   324      209   211     312   321
    32    212   207    314   324      251   243     312   321
    64    240   237    314   326      329   306     312   321
   128    301   302    314   324      511   416     313   324
   256    498   554    327   332      970   837     326   332
   512    532   553    324   332     1025   932     326   335
  1024    705   718    325   333     1489  1231     324   330
  2048    764   767    324   334     2708  2175     324   332
 4096*   1033   476    325   674     4727   782     324   678

Notes:

Worst case:
-----------
We generally lose in the alloc/free test (x86_64: ~3%, IA64: 5-10%)
since processing overhead increases: we have to look up the per cpu
structure. Alloc/Free is simply kfree(kmalloc(size, mask)), i.e.
objects with the shortest possible lifetime. Objects would never be
used that way in practice, but the measurement is important because
it shows the worst case overhead created.
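
For reference, a minimal sketch of what the alloc/free test boils
down to (loop bound and gfp mask are illustrative):

        /* Worst case: each object is freed right after allocation. */
        for (i = 0; i < iterations; i++)
                kfree(kmalloc(size, GFP_KERNEL));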

Single Threaded:
----------------
The single threaded kmalloc test shows the behavior of a continual
stream of allocations without contention. In the SMP case the losses
are minimal. In the NUMA case we already have a winner because the
per cpu structure is placed local to the processor. So in the single
threaded case we already win around 5% just by placing things better.
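
A minimal sketch of that placement, assuming a cpu_slab array in the
kmem_cache (the name and the exact allocation path are assumptions):

        /*
         * Allocate each cpu's structure on that cpu's node so the
         * fast path only touches node local memory.
         */
        for_each_possible_cpu(cpu)
                s->cpu_slab[cpu] = kmalloc_node(sizeof(struct kmem_cache_cpu),
                                                GFP_KERNEL, cpu_to_node(cpu));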

Concurrent Alloc:
-----------------
We see varying gains of up to 50% on NUMA because we now never
update a cacheline that is in use by another processor and the data
structures are local to the processor.

The SMP case shows gains too, but smaller ones, only up to 25%
(especially since this is the smallest SMP system possible: 2 CPUs).

Page allocator pass through
---------------------------
There is a significant difference in the rows marked with a * because
of the way that allocations of page sized objects are handled. If we
handle those allocations in the slab allocator (Norm) then the
alloc/free test results are superb, since we can use the per cpu slab
to just pass a pointer back and forth. The page allocator pass
through (PCPU) shows that the page allocator may have problems with
giving back the same page after a free. Or there is something else in
the page allocator that creates significant overhead compared to
slab. Needs to be checked out, I guess.

However, the page allocator pass through is a win in the other cases
since we cut out the overhead of the slab layer on top of the page
allocator. That is the more typical load, allocating a sequence of
objects, and we should optimize for that.
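
For illustration, a rough sketch of the pass through decision (the
threshold and the helper name are assumptions, not lifted from the
patches):

        /*
         * Sketch: kmalloc requests of a page or more bypass the slab
         * layer and go straight to the page allocator.
         */
        static inline void *kmalloc_large(size_t size, gfp_t flags)
        {
                return (void *)__get_free_pages(flags, get_order(size));
        }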

(+ = Must be some cache artifact here, or code crossing a TLB
boundary; the result is reproducible.)
