in the past couple of years the buddy allocator has started to show
limitations that are hurting performance and flexibility.
eg. one of the main reasons why we keep MAX_ORDER at an almost obscenely
high level is the fact that we occasionally have to allocate big,
physically contiguous memory areas. We do not realistically expect to be
able to allocate such high-order pages after bootup, yet every page
allocation carries the cost of supporting them. And even with MAX_ORDER
at 10, large-RAM boxes have hit this limit and are hurting visibly - as
witnessed by Anton. Falling back to vmalloc() is not a high-quality
option, due to the TLB-miss overhead.
If we had an allocator that could handle large, rare but
performance-insensitive allocations, then we could decrease MAX_ORDER back
to 5 or 6, which would result in a smaller cache footprint and faster
operation of the page allocator.
the attached memarea-2.4.15-D6 patch does just this: it implements a new
'memarea' allocator which uses the buddy allocator data structures without
impacting buddy allocator performance. It has two main entry points:
struct page * alloc_memarea(unsigned int gfp_mask, unsigned int pages);
void free_memarea(struct page *area, unsigned int pages);
the main properties of the memarea allocator are:
- to be an 'unlimited size' allocator: it will find and allocate 100 GB
of physically contiguous memory if that much RAM is available.
- no alignment or size limitations either: the size does not have to be a
power of 2 as with the buddy allocator, and the alignment will be
whatever constellation the allocator finds. This property ensures that
if there is a sufficiently sized physically contiguous piece of RAM
available, the allocator will find it. The buddy allocator only finds
power-of-2 aligned and power-of-2 sized blocks.
- no impact on the performance of the page allocator. (The only (very
small) effect is the use of list_del_init() instead of list_del() when
allocating pages. This is insignificant as the initialization will be
done in two assembly instructions, touching an already present and
dirty cacheline.)
Obviously, alloc_memarea() can be pretty slow if RAM is getting full, and
it does not guarantee success, so for non-boot allocations other fallback
mechanisms have to be used, such as vmalloc(). It is not a replacement for
the buddy allocator - it's not intended for frequent use.
right now the memarea allocator is used in one place: to allocate the
pagecache hash table at boot time. [ Anton, it would be nice if you could
check it out on your large-RAM box, does it improve the hash chain
situation? ]
other candidates for alloc_memarea() usage are:
- module code segment allocation, falling back to vmalloc() on failure.
- swap map allocation, it uses vmalloc() now.
- buffer, inode, dentry, TCP hash allocations. (in case we decrease
MAX_ORDER, which the patch does not do yet.)
- those funky PCI devices that need some big chunk of physical memory.
- other uses?
alloc_memarea() tries to optimize away as much of the linear scanning of
zone mem-maps as possible, but in the worst case it has to iterate over
all pages - which can be ~256K iterations if eg. we search a 1 GB box
with 4 KB pages.
possible future improvements:
- alloc_memarea() could zap clean pagecache pages as well.
- if/once reverse pte mappings are added, alloc_memarea() could also
initiate the swapout of anonymous & dirty pages. These modifications
would make it pretty likely to succeed if the allocation size is
realistic.
- possibly add 'alignment' and 'offset' arguments to __alloc_memarea(),
to create a given alignment for the memarea, to handle really broken
hardware and perhaps get better page coloring as well.
- if we extended the buddy allocator to have a page-granularity bitmap as
well, then alloc_memarea() could search for physically contiguous page
areas *much* faster. But this creates a real runtime (and cache
footprint) overhead in the buddy allocator.
the patch also cleans up the buddy allocator code:
- cleaned up the zone structure namespace
- removed the memlist_ defines. (I originally added them to play
with FIFO vs. LIFO allocation, but we have since settled on the latter.)
- simplified code
- ( fixed index to be unsigned long in rmqueue(). This enables 64-bit
systems to have more than 16 TB of RAM in a single zone. [not quite
realistic, yet, but hey.] )
NOTE: the memarea allocator pieces are in separate chunks and are
completely non-intrusive if the filemap.c change is omitted.
i've tested the patch pretty thoroughly on big and small RAM boxes. The
patch is against 2.4.15-pre3.
Reports, comments, suggestions welcome,
Ingo
This archive was generated by hypermail 2b29 : Thu Nov 15 2001 - 21:00:29 EST