[RFC PATCH] greatly reduce SLOB external fragmentation

From: Matt Mackall
Date: Wed Jan 09 2008 - 14:17:00 EST



On Mon, 2008-01-07 at 20:06 +0200, Pekka J Enberg wrote:
> Hi Matt,
>
> On Sun, 6 Jan 2008, Matt Mackall wrote:
> > I don't have any particular "terrible" workloads for SLUB. But my
> > attempts to simply boot with all three allocators to init=/bin/bash in,
> > say, lguest show a fair margin for SLOB.
>
> Sorry, I once again have bad news ;-). I did some testing with
>
> lguest --block=<rootfile> 32 /boot/vmlinuz-2.6.24-rc6 root=/dev/vda init=doit
>
> where rootfile is
>
> http://uml.nagafix.co.uk/BusyBox-1.5.0/BusyBox-1.5.0-x86-root_fs.bz2
>
> and the "doit" script in the guest passed as init= is just
>
> #!/bin/sh
> mount -t proc proc /proc
> cat /proc/meminfo | grep MemTotal
> cat /proc/meminfo | grep MemFree
> cat /proc/meminfo | grep Slab
>
> and the results are:
>
> [ the minimum, maximum, and average are from 10 individual runs ]
>
>                     Total (kB)     Free (kB)                  Used (kB)
>                                    min      max      average  min    max    average
> SLUB (no debug)     26536          23868    23892    23877.6  2644   2668   2658.4
> SLOB                26548          23472    23640    23579.6  2908   3076   2968.4
> SLAB (no debug)     26544          23316    23364    23343.2  3180   3228   3200.8
> SLUB (with debug)   26484          23120    23136    23127.2  3348   3364   3356.8
>
> So it seems that on average SLUB uses 300 kilobytes *less memory* (!) (which is
> roughly 1% of total memory available) after boot than SLOB for my
> configuration.
>
> One possible explanation is that the high internal fragmentation (space
> allocated but not used) of SLUB kmalloc() only affects short-lived allocations
> and thus does not show up in the more permanent memory footprint. Likewise, it
> could be that SLOB has higher external fragmentation (small blocks that are
> unavailable for allocation) from which SLUB does not suffer. Dunno, haven't
> investigated as my results are contradictory to yours.

Yep, you were definitely onto something here. Here's what I got with 10
runs of SLUB on my setup:

MemFree: 55780 kB
MemFree: 55780 kB
MemFree: 55780 kB
MemFree: 55784 kB
MemFree: 55788 kB
MemFree: 55788 kB
MemFree: 55788 kB
MemFree: 55792 kB
MemFree: 55796 kB
MemFree: 55800 kB
Avg: 55787.6

And with SLOB + my defrag fix:

MemFree: 55236 kB
MemFree: 55284 kB
MemFree: 55292 kB
MemFree: 55304 kB
MemFree: 55384 kB
MemFree: 55388 kB
MemFree: 55412 kB
MemFree: 55420 kB
MemFree: 55436 kB
MemFree: 55444 kB
Avg: 55360.0

Ouch!

So I added a bunch of statistics gathering:

counted pages 409 unused 185242 biggest 1005 fragments 1416
slob pages 410 allocs 22528 frees 12109 active 10419 allocated 932647
page scans 11249 block scans 40650
kmallocs 10247 active 5450 allocated 3484680 overhead 48836
bigpages 827 active 17
total 427 used 245

The first line tells us about SLOB's free list, which has 409 pages,
185k unused, spread into 1416 fragments. The average fragment is 130
bytes.
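For reference (the statistics patch itself isn't included in this mail),
the free-list line can be produced by a walker along these lines inside
mm/slob.c. This is only a sketch against the slob.c internals of that era
(struct slob_page, slob_next(), slob_last(), slob_units(), SLOB_UNIT,
slob_lock); the name slob_count_free_list() is made up:

/*
 * Sketch: tally pages, unused bytes, biggest fragment, and fragment count
 * for one free list.  Pass &free_slob_pages (or, with the patch below,
 * each of the two lists).
 */
static void slob_count_free_list(struct list_head *slob_list)
{
	struct slob_page *sp;
	unsigned long flags;
	unsigned long pages = 0, unused = 0, biggest = 0, fragments = 0;

	spin_lock_irqsave(&slob_lock, flags);
	list_for_each_entry(sp, slob_list, list) {
		slob_t *b = sp->free;

		pages++;
		/* Walk this page's chain of free fragments. */
		while (b) {
			unsigned long bytes = slob_units(b) * SLOB_UNIT;

			fragments++;
			unused += bytes;
			if (bytes > biggest)
				biggest = bytes;
			if (slob_last(b))
				break;
			b = slob_next(b);
		}
	}
	spin_unlock_irqrestore(&slob_lock, flags);

	printk(KERN_INFO "counted pages %lu unused %lu biggest %lu fragments %lu\n",
	       pages, unused, biggest, fragments);
}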

The next tells us that we've got 410 total SLOB pages (1 is fully
allocated), we've done 22k allocs, 12k frees, have 10k allocations
active, and 932k total memory allocated (including kmallocs). That means
our average SLOB allocation is ~90 bytes.

The kmallocs line tells us we've done 10k allocs, and 5k of them are not
yet freed. Since boot, we've requested 3.48MB of kmalloc memory (before
padding) and added on 49k of padding. Thus the average kmalloc is 340
bytes and carries 4.77 bytes of padding (about 1.4% overhead, quite good!).

SLAB and kmalloc objects >= 4k are handed straight to the page allocator
(same as SLUB); there are currently 17 such pages active.
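For those following along in mm/slob.c, here's a simplified sketch of the
kmalloc path those last two lines describe (loosely adapted from
__kmalloc_node(); NUMA handling, the zero-size case, and the page->private
bookkeeping are left out, and slob_kmalloc_sketch() is just an illustrative
name). The size header is where the padding overhead comes from, and the
page-allocator branch is what the bigpages line counts:

static void *slob_kmalloc_sketch(size_t size, gfp_t gfp)
{
	int align = max(ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);

	if (size < PAGE_SIZE - align) {
		unsigned int *m;

		/* Allocate the object plus room for an aligned size header. */
		m = slob_alloc(size + align, gfp, align, -1);
		if (!m)
			return NULL;
		*m = size;			/* kfree()/ksize() read this back */
		return (void *)m + align;	/* caller sees the aligned payload */
	}

	/* Roughly >= 4k: hand the request straight to the page allocator. */
	return (void *)__get_free_pages(gfp | __GFP_COMP, get_order(size));
}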

So in total, SLOB is using 427 pages for what optimally could fit in 245
pages. In other words, external fragmentation is horrible.

I kicked this around for a while, slept on it, and then came up with
this little hack first thing this morning:

------------
slob: split free list by size

diff -r 6901ca355181 mm/slob.c
--- a/mm/slob.c Tue Jan 08 21:01:15 2008 -0600
+++ b/mm/slob.c Wed Jan 09 12:31:59 2008 -0600
@@ -112,7 +112,9 @@ static inline void free_slob_page(struct
 /*
  * All (partially) free slob pages go on this list.
  */
-static LIST_HEAD(free_slob_pages);
+#define SLOB_BREAK_POINT 300
+static LIST_HEAD(free_slob_pages_big);
+static LIST_HEAD(free_slob_pages_small);
 
 /*
  * slob_page: True for all slob pages (false for bigblock pages)
@@ -140,9 +142,9 @@ static inline int slob_page_free(struct
 	return test_bit(PG_private, &sp->flags);
 }
 
-static inline void set_slob_page_free(struct slob_page *sp)
+static inline void set_slob_page_free(struct slob_page *sp, struct list_head *list)
 {
-	list_add(&sp->list, &free_slob_pages);
+	list_add(&sp->list, list);
 	__set_bit(PG_private, &sp->flags);
 }
 
@@ -294,12 +296,18 @@ static void *slob_alloc(size_t size, gfp
 {
 	struct slob_page *sp;
 	struct list_head *prev;
+	struct list_head *slob_list;
 	slob_t *b = NULL;
 	unsigned long flags;
 
+	slob_list = &free_slob_pages_small;
+	if (size > SLOB_BREAK_POINT)
+		slob_list = &free_slob_pages_big;
+
 	spin_lock_irqsave(&slob_lock, flags);
 	/* Iterate through each partially free page, try to find room */
-	list_for_each_entry(sp, &free_slob_pages, list) {
+
+	list_for_each_entry(sp, slob_list, list) {
 #ifdef CONFIG_NUMA
 		/*
 		 * If there's a node specification, search for a partial
@@ -321,9 +329,9 @@ static void *slob_alloc(size_t size, gfp
 		/* Improve fragment distribution and reduce our average
 		 * search time by starting our next search here. (see
 		 * Knuth vol 1, sec 2.5, pg 449) */
-		if (prev != free_slob_pages.prev &&
-				free_slob_pages.next != prev->next)
-			list_move_tail(&free_slob_pages, prev->next);
+		if (prev != slob_list->prev &&
+				slob_list->next != prev->next)
+			list_move_tail(slob_list, prev->next);
 		break;
 	}
 	spin_unlock_irqrestore(&slob_lock, flags);
@@ -341,7 +349,7 @@ static void *slob_alloc(size_t size, gfp
 		sp->free = b;
 		INIT_LIST_HEAD(&sp->list);
 		set_slob(b, SLOB_UNITS(PAGE_SIZE), b + SLOB_UNITS(PAGE_SIZE));
-		set_slob_page_free(sp);
+		set_slob_page_free(sp, slob_list);
 		b = slob_page_alloc(sp, size, align);
 		BUG_ON(!b);
 		spin_unlock_irqrestore(&slob_lock, flags);
@@ -357,6 +365,7 @@ static void slob_free(void *block, int s
 static void slob_free(void *block, int size)
 {
 	struct slob_page *sp;
+	struct list_head *slob_list;
 	slob_t *prev, *next, *b = (slob_t *)block;
 	slobidx_t units;
 	unsigned long flags;
@@ -364,6 +373,10 @@ static void slob_free(void *block, int s
 	if (unlikely(ZERO_OR_NULL_PTR(block)))
 		return;
 	BUG_ON(!size);
+
+	slob_list = &free_slob_pages_small;
+	if (size > SLOB_BREAK_POINT)
+		slob_list = &free_slob_pages_big;
 
 	sp = (struct slob_page *)virt_to_page(block);
 	units = SLOB_UNITS(size);
@@ -387,7 +400,7 @@ static void slob_free(void *block, int s
 			set_slob(b, units,
 				(void *)((unsigned long)(b +
 					SLOB_UNITS(PAGE_SIZE)) & PAGE_MASK));
-			set_slob_page_free(sp);
+			set_slob_page_free(sp, slob_list);
 			goto out;
 		}
------------

And the results are fairly miraculous, so please double-check them on
your setup. The resulting statistics change to this:

small list pages 107 unused 39622 biggest 1516 fragments 3511
big list pages 129 unused 23264 biggest 2076 fragments 232
slob pages 243 allocs 22528 frees 12108 active 10420 allocated 932079
page scans 8074 block scans 481530
kmallocs 10248 active 5451 allocated 3484220 overhead 42054
bigpages 825 active 16
total 259 used 244

and 10 runs looks like this:

MemFree: 56056 kB
MemFree: 56064 kB
MemFree: 56064 kB
MemFree: 56068 kB
MemFree: 56068 kB
MemFree: 56076 kB
MemFree: 56084 kB
MemFree: 56084 kB
MemFree: 56088 kB
MemFree: 56092 kB
Avg: 56074.4

So the average jumped by 714k from before the patch, became much more
stable, and beat SLUB by 287k. There are also 7 perfectly filled pages
now, up from 1 before. And we can't get a whole lot better than this:
we're using 259 pages for 244 pages of actual data, so our total
overhead is only 6%! For comparison, SLUB's using about 70 pages more
for the same data, so its total overhead appears to be about 35%.

By the way, the break at 300 bytes was just the first number that came
to my head but moving it around didn't seem to help. It might want to
change with page size. Knuth suggests that empirically, arena size/10 is
about the maximum allocation size to avoid fragmentation.
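If it should track page size, one illustrative (and untested) way to encode
that rule of thumb would be:

/* Hypothetical page-size-aware break point, not part of the patch above: */
#define SLOB_BREAK_POINT	(PAGE_SIZE / 10)	/* ~400 bytes with 4k pages */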

--
Mathematics is the supreme nostalgia of our time.
