[RFC] More deterministic SLOB for real time embedded systems

From: Hyeonggon Yoo
Date: Sun Oct 17 2021 - 00:29:54 EST


I've been reading SLUB/SLOB code for a while. SLUB recently became
real time compatible by reducing its locking area.

for now, SLUB is the only slab allocator for PREEMPT_RT because
it works better than SLAB on RT and SLOB uses non-deterministic method,
sequential fit.

But memory usage of SLUB is too high for systems with low memory.
So In my local repository I made SLOB to use segregated free list
method, which is more more deterministic, to provide bounded latency.

This can be done by managing list of partial pages globally
for every power of two sizes (8, 16, 32, ..., PAGE_SIZE) per NUMA nodes.
minimal allocation size is size of pointers to keep pointer of next free object
like SLUB.

By making size of objects in same page to have same size, there's no
need to iterate free blocks in a page. (Also iterating pages isn't needed)

Some cleanups and more tests (especially with NUMA/RT configs) needed,
but want to hear your opinion about the idea. Did not test on RT yet.

Below is result of benchmarks and memory usage. (on !RT)
with 13% increase in memory usage, it's nine times faster and
bounded fragmentation, and importantly provides predictable execution time.

current SLOB:
memory usage:
Slab: 7936 kB
hackbench:
Time: 263.900
Performance counter stats for 'hackbench -g 4 -l 10000':

527649.37 msec cpu-clock # 1.999 CPUs utilized
12451963 context-switches # 23.599 K/sec
251231 cpu-migrations # 476.132 /sec
4112 page-faults # 7.793 /sec
342196899596 cycles # 0.649 GHz
228439896295 instructions # 0.67 insn per cycle
3228211614 branch-misses
65667138978 cache-references # 124.452 M/sec
7406902357 cache-misses # 11.279 % of all cache refs

263.956975106 seconds time elapsed

5.213166000 seconds user
521.716737000 seconds sys

SLOB with segregated free list:
memory usage:
Slab: 8976 kB
hackbench:

Time: 28.896
Performance counter stats for 'hackbench -g 4 -l 10000':

57669.66 msec cpu-clock # 1.995 CPUs utilized
902343 context-switches # 15.647 K/sec
10569 cpu-migrations # 183.268 /sec
4116 page-faults # 71.372 /sec
72101524728 cycles # 1.250 GHz
68780577270 instructions # 0.95 insn per cycle
230133481 branch-misses
23610741192 cache-references # 409.414 M/sec
896060729 cache-misses # 3.795 % of all cache refs

28.909188264 seconds time elapsed

1.521686000 seconds user
56.105718000 seconds sys