Re: [rfc][patch] SLQB slab allocator

From: Eric Dumazet
Date: Fri Dec 12 2008 - 00:38:42 EST


Nick Piggin a écrit :
> (Re)introducing the SLQB allocator. Q for queued, though in reality SLAB and
> SLUB have queues of things as well, so "Q" is just a meaningless
> differentiator :)
>
> I've kept working on SLQB slab allocator because I don't agree with the
> design choices in SLUB, and I'm worried about the push to make it the
> one true allocator.
>
> My primary goal in SLQB is performance; secondary goals are order-0 page
> allocations and memory consumption.
>
> I have worked with the Linux guys at Intel to ensure that SLQB is comparable
> to SLAB in their OLTP performance benchmark. Recently that goal has been
> reached -- so SLQB performs comparably well to SLAB on that test (it's
> within the noise).
>
> I've also been comparing SLQB with SLAB and SLUB in other benchmarks, and
> trying to ensure it is as good or better. I don't know if that's always
> the case, but nothing obvious has gone wrong (it's sometimes hard to find
> meaningful benchmarks that exercise slab in interesting ways).
>
> Now it isn't exactly complete -- the debugging, tracking, stats, etc. code is
> not always in the best shape -- however, I have been focusing on performance
> of the core allocator; the rest doesn't matter much if the core code is poor.
> But it boots, works, and is pretty stable.
>
> SLQB, like SLUB and unlike SLAB, doesn't have greater than linear memory
> consumption growth with the number of CPUs or nodes.
>
> SLQB tries to be very page-size agnostic. And it tries very hard to use
> order-0 pages. This is good for both page allocator fragmentation, and
> slab fragmentation. I don't like that SLUB performs significantly worse
> with order-0 pages in some workloads.
>
> SLQB goes to some lengths to optimise remote-freeing cases (allocate on
> one CPU, free on another). It seems to work well, but there are a *lot*
> of possible ways this can be implemented especially when NUMA comes into
> play, so I'd like to know of workloads where remote freeing happens a
> lot, and perhaps look at alternative ways to do it.
>
> SLQB initialisation code attempts to be as simple and un-clever as possible.
> There are no multiple phases where different things come up. There is no
> weird self-bootstrapping stuff. It just statically allocates the structures
> required to create the slabs that allocate other slab structures.
>
> I'm going to continue working on this as I get time, and I plan to soon ask
> to have it merged. It would be great if people could comment or test it.
>

It seems really good, but will need some hours to review :)

Minor nit: you spelled "Qeued" instead of "Queued" in init/Kconfig:

+config SLQB
+ bool "SLQB (Qeued allocator)"

One of the problems I see with SLAB & SLUB is the irq masking stuff.
Some (many?) kmem_caches are only used in process context; I see no point in
disabling irqs for them.
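
To illustrate the point (a minimal kernel-style sketch, not actual
SLAB/SLUB/SLQB code; cache_fastpath_pop() is a made-up placeholder for the
per-CPU freelist pop):

#include <linux/irqflags.h>
#include <linux/preempt.h>
#include <linux/slab.h>

extern void *cache_fastpath_pop(struct kmem_cache *s);	/* made-up helper */

/*
 * What the allocators do today: irqs are disabled around the per-CPU
 * freelist manipulation so the path is also safe against allocations
 * from interrupt context.
 */
static void *alloc_irq_safe(struct kmem_cache *s)
{
	unsigned long flags;
	void *object;

	local_irq_save(flags);		/* the pushf/cli pair in the disassembly below */
	object = cache_fastpath_pop(s);
	local_irq_restore(flags);

	return object;
}

/*
 * For a cache known to be used only from process context, disabling
 * preemption would be enough to keep the per-CPU freelist consistent,
 * and the comparatively expensive irq save/restore could be skipped.
 */
static void *alloc_process_context_only(struct kmem_cache *s)
{
	void *object;

	preempt_disable();
	object = cache_fastpath_pop(s);
	preempt_enable();

	return object;
}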

I tested your patch on my 8-way HP BL460c G1, on top of my latest patch
series (linux-2.6, not net-next-2.6).

# time ./socketallocbench

real 0m1.300s
user 0m0.078s
sys 0m1.207s
# time ./socketallocbench -n 8

real 0m1.686s
user 0m0.614s
sys 0m12.737s

So no bad effect (same as SLUB).
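
Judging from the profile below, socketallocbench is essentially a
socket()/close() loop; roughly along these lines (a sketch, not the actual
benchmark source -- the iteration count and the meaning of -n as a number of
worker processes are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/wait.h>

static void worker(long loops)
{
	long i;

	for (i = 0; i < loops; i++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);	/* inet_create / tcp_v4_init_sock */

		if (fd < 0) {
			perror("socket");
			exit(1);
		}
		close(fd);					/* tcp_close / tcp_v4_destroy_sock */
	}
}

int main(int argc, char **argv)
{
	int nprocs = 1;			/* assumed meaning of -n */
	long loops = 1000000;		/* assumed iteration count */
	int i;

	if (argc == 3 && strcmp(argv[1], "-n") == 0)
		nprocs = atoi(argv[2]);

	for (i = 0; i < nprocs; i++) {
		if (fork() == 0) {
			worker(loops);
			return 0;
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}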

For the record, SLAB is really, really bad for this workload:

CPU: Core 2, speed 3000.1 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
136537 136537 10.8300 10.8300 kmem_cache_alloc
129380 265917 10.2623 21.0924 tcp_close
79696 345613 6.3214 27.4138 tcp_v4_init_sock
73873 419486 5.8596 33.2733 tcp_v4_destroy_sock
63436 482922 5.0317 38.3050 sysenter_past_esp
62140 545062 4.9289 43.2339 inet_csk_destroy_sock
56565 601627 4.4867 47.7206 kmem_cache_free
40430 642057 3.2069 50.9275 __percpu_counter_add
35742 677799 2.8350 53.7626 init_timer
35611 713410 2.8246 56.5872 copy_from_user
21616 735026 1.7146 58.3018 d_alloc
20821 755847 1.6515 59.9533 alloc_inode
19645 775492 1.5582 61.5115 alloc_fd
18935 794427 1.5019 63.0134 __fput
18922 813349 1.5009 64.5143 inet_create
18919 832268 1.5006 66.0149 sys_close
16074 848342 1.2750 67.2899 release_sock
15337 863679 1.2165 68.5064 lock_sock_nested
15172 878851 1.2034 69.7099 sock_init_data
14196 893047 1.1260 70.8359 fd_install
13677 906724 1.0849 71.9207 drop_file_write_access
13195 919919 1.0466 72.9673 dput
12768 932687 1.0127 73.9801 inotify_d_instantiate
11404 944091 0.9046 74.8846 init_waitqueue_head
11228 955319 0.8906 75.7752 sysenter_do_call
11213 966532 0.8894 76.6647 local_bh_enable_ip
10948 977480 0.8684 77.5330 __sock_create
10912 988392 0.8655 78.3986 local_bh_enable
10665 999057 0.8459 79.2445 __new_inode
10579 1009636 0.8391 80.0836 inet_release
9665 1019301 0.7666 80.8503 iput_single
9545 1028846 0.7571 81.6074 fput
7950 1036796 0.6306 82.2379 sock_release
7236 1044032 0.5740 82.8119 local_bh_disable


We can see most of the time is taken by the memset() to clear the object,
then the irq masking stuff... (A rough C sketch of the path this corresponds
to follows the annotated disassembly below.)

c0281e10 <kmem_cache_alloc>: /* kmem_cache_alloc total: 140659 10.8277 */
2414 0.1858 :c0281e10: push %ebp
7 5.4e-04 :c0281e11: mov %esp,%ebp
:c0281e13: push %edi
1454 0.1119 :c0281e14: push %esi
310 0.0239 :c0281e15: mov %eax,%esi
:c0281e17: push %ebx
368 0.0283 :c0281e18: sub $0x10,%esp
949 0.0731 :c0281e1b: mov %edx,-0x18(%ebp)
383 0.0295 :c0281e1e: mov 0x4(%ebp),%eax
1189 0.0915 :c0281e21: mov %eax,-0x14(%ebp)
1240 0.0955 :c0281e24: jmp c0281e6e <kmem_cache_alloc+0x5e>
:c0281e26: lea 0x0(%esi),%esi
:c0281e29: lea 0x0(%edi,%eiz,1),%edi
1188 0.0915 :c0281e30: mov 0x10(%esi),%eax
:c0281e33: mov (%edx,%eax,1),%eax
1483 0.1142 :c0281e36: decl (%ebx)
898 0.0691 :c0281e38: mov %eax,0x4(%ebx)
586 0.0451 :c0281e3b: mov %edx,-0x1c(%ebp)
1 7.7e-05 :c0281e3e: pushl -0x10(%ebp)
1226 0.0944 :c0281e41: popf
26385 2.0311 :c0281e42: testl $0x210d00,(%esi)
1188 0.0915 :c0281e48: je c0281ef8 <kmem_cache_alloc+0xe8>
:c0281e4e: mov -0x1c(%ebp),%eax
:c0281e51: test %eax,%eax
:c0281e53: je c0281ef8 <kmem_cache_alloc+0xe8>
:c0281e59: mov -0x14(%ebp),%ecx
:c0281e5c: mov -0x1c(%ebp),%edx
:c0281e5f: mov %esi,%eax
:c0281e61: call c0280d60 <alloc_debug_processing>
:c0281e66: test %eax,%eax
:c0281e68: jne c0281ef8 <kmem_cache_alloc+0xe8>
1205 0.0928 :c0281e6e: pushf
4888 0.3763 :c0281e6f: popl -0x10(%ebp)
319 0.0246 :c0281e72: cli
5975 0.4599 :c0281e73: nop
:c0281e74: lea 0x0(%esi,%eiz,1),%esi
:c0281e78: mov %fs:0xc068d004,%eax
1166 0.0898 :c0281e7e: mov 0x38(%esi,%eax,4),%ebx
26 0.0020 :c0281e82: mov 0x4(%ebx),%edx
662 0.0510 :c0281e85: test %edx,%edx
:c0281e87: jne c0281e30 <kmem_cache_alloc+0x20>
:c0281e89: mov 0xc(%ebx),%eax
:c0281e8c: test %eax,%eax
:c0281e8e: jne c0281ec8 <kmem_cache_alloc+0xb8>
:c0281e90: mov %ebx,%edx
:c0281e92: mov %esi,%eax
:c0281e94: call c0280010 <__cache_list_get_page>
:c0281e99: mov %eax,%edx
:c0281e9b: test %eax,%eax
:c0281e9d: jne c0281f31 <kmem_cache_alloc+0x121>
:c0281ea3: mov $0xffffffff,%ecx
1 7.7e-05 :c0281ea8: mov -0x18(%ebp),%edx
:c0281eab: mov %esi,%eax
:c0281ead: call c02815a0 <__slab_alloc_page>
:c0281eb2: test %eax,%eax
:c0281eb4: jne c0281e78 <kmem_cache_alloc+0x68>
:c0281eb6: movl $0x0,-0x1c(%ebp)
:c0281ebd: jmp c0281e3e <kmem_cache_alloc+0x2e>
:c0281ec2: lea 0x0(%esi),%esi
:c0281ec8: mov %esi,%eax
:c0281eca: mov %ebx,%edx
:c0281ecc: call c0280240 <claim_remote_free_list>
:c0281ed1: mov 0x4(%esi),%eax
:c0281ed4: shl $0x2,%eax
:c0281ed7: cmp %eax,(%ebx)
:c0281ed9: ja c0281f48 <kmem_cache_alloc+0x138>
:c0281edb: mov 0x4(%ebx),%edx
:c0281ede: mov %edx,-0x1c(%ebp)
:c0281ee1: test %edx,%edx
:c0281ee3: je c0281e90 <kmem_cache_alloc+0x80>
:c0281ee5: mov 0x10(%esi),%eax
:c0281ee8: mov (%edx,%eax,1),%eax
:c0281eeb: decl (%ebx)
:c0281eed: mov %eax,0x4(%ebx)
:c0281ef0: jmp c0281e3e <kmem_cache_alloc+0x2e>

:c0281ef5: lea 0x0(%esi),%esi
1261 0.0971 :c0281ef8: cmpw $0x0,-0x18(%ebp)
957 0.0737 :c0281efd: jns c0281f26 <kmem_cache_alloc+0x116>
627 0.0483 :c0281eff: mov -0x1c(%ebp),%eax
:c0281f02: test %eax,%eax
:c0281f04: je c0281f26 <kmem_cache_alloc+0x116>
82 0.0063 :c0281f06: mov 0xc(%esi),%esi
2 1.5e-04 :c0281f09: mov -0x1c(%ebp),%ebx
527 0.0406 :c0281f0c: mov %esi,%ecx
:c0281f0e: mov %ebx,%edi
86 0.0066 :c0281f10: shr $0x2,%ecx
602 0.0463 :c0281f13: xor %eax,%eax
1 7.7e-05 :c0281f15: mov %esi,%edx
74845 5.7614 :c0281f17: rep stos %eax,%es:(%edi)
1170 0.0901 :c0281f19: test $0x2,%dl
2 1.5e-04 :c0281f1c: je c0281f20 <kmem_cache_alloc+0x110>
:c0281f1e: stos %ax,%es:(%edi)
600 0.0462 :c0281f20: test $0x1,%dl
:c0281f23: je c0281f26 <kmem_cache_alloc+0x116>
:c0281f25: stos %al,%es:(%edi)
1171 0.0901 :c0281f26: mov -0x1c(%ebp),%eax
199 0.0153 :c0281f29: add $0x10,%esp
:c0281f2c: pop %ebx
2 1.5e-04 :c0281f2d: pop %esi
1215 0.0935 :c0281f2e: pop %edi
548 0.0422 :c0281f2f: leave
1251 0.0963 :c0281f30: ret
:c0281f31: mov 0x10(%edx),%ecx
:c0281f34: mov %ecx,-0x1c(%ebp)
:c0281f37: mov 0x10(%esi),%eax
:c0281f3a: mov (%ecx,%eax,1),%eax
:c0281f3d: mov %eax,0x10(%edx)
:c0281f40: jmp c0281e3e <kmem_cache_alloc+0x2e>
:c0281f45: lea 0x0(%esi),%esi
:c0281f48: mov %ebx,%edx
:c0281f4a: mov %esi,%eax
:c0281f4c: call c02811d0 <flush_free_list>
:c0281f51: jmp c0281edb <kmem_cache_alloc+0xcb>
:c0281f53: lea 0x0(%esi),%esi
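
For reference, the rough shape of the path this disassembly corresponds to
(the structure, field and helper names below are my reconstruction for
illustration only, not the actual SLQB source; only the called functions --
claim_remote_free_list(), flush_free_list(), __cache_list_get_page(),
__slab_alloc_page() -- are taken from the symbols above):

static void *kmem_cache_alloc_sketch(struct kmem_cache *s, gfp_t gfpflags)
{
	struct kmem_cache_list *l;
	unsigned long flags;
	void *object;

	local_irq_save(flags);			/* the pushf / cli pair */
	l = per_cpu_list(s);			/* per-CPU list, read via %fs */

	object = l->freelist;
	if (likely(object)) {			/* fast path: pop the list head */
		l->freelist = get_freepointer(s, object);
		l->nr_free--;
	} else if (l->remote_free) {		/* objects freed by other CPUs */
		claim_remote_free_list(s, l);
		if (l->nr_free > 4 * s->watermark)	/* the shl $0x2 / cmp above */
			flush_free_list(s, l);
		object = l->freelist;		/* then popped as in the fast path */
	} else {
		/*
		 * Slow path: pull an object from a partially free page
		 * (__cache_list_get_page) or allocate a fresh page and
		 * retry (__slab_alloc_page).
		 */
		object = slow_path_alloc(s, l, gfpflags);
	}

	local_irq_restore(flags);		/* pushl -0x10(%ebp) / popf */

	/*
	 * The __GFP_ZERO clearing: this is the "rep stos" at c0281f17,
	 * over half of the samples inside kmem_cache_alloc.
	 */
	if (unlikely(gfpflags & __GFP_ZERO) && object)
		memset(object, 0, s->objsize);

	return object;
}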
