Re: [patch 1/2] mm, page_alloc: extend kernelcore and movablecore for percent

From: Matthew Wilcox
Date: Fri Feb 16 2018 - 12:10:05 EST


On Fri, Feb 16, 2018 at 10:08:28AM -0600, Christopher Lameter wrote:
> On Fri, 16 Feb 2018, Matthew Wilcox wrote:
> > I don't understand this response. I'm not suggesting mixing objects
> > of different sizes within the same page. The vast majority of slabs
> > use order-0 pages, a few use order-1 pages and larger sizes are almost
> > unheard of. I'm suggesting the slab have its own private arena of pages
> > that it uses for allocating pages to slabs; when an entire page comes
> > free in a slab, it is returned to the arena. When the arena is empty,
> > slab requests another arena from the page allocator.
>
> This just shifts the fragmentation problem because the 2M page cannot be
> released until all 4k or 8k pages within that 2M page are freed. How is
> that different from the page allocator which cannot coalesce a 2M page
> until all fragments have been released?

I'm not proposing releasing this 2MB page, unless it naturally frees up.
I'm saying that by restricting allocations to be within this 2MB page,
we prevent allocating from the adjacent 2MB page.
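
To make the arena idea a bit more concrete, here is a rough user-space
sketch in C. It is not kernel code and not a proposed patch. The names
struct arena, arena_alloc_page(), arena_free_page(), arena_is_exhausted()
and arena_is_unused() are hypothetical, and the 512-page arena size
assumes a 2MB arena made of 4KB base pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ARENA_PAGES 512	/* assumption: 2MB arena / 4KB base pages */

/* Hypothetical private arena from which slab pages are handed out. */
struct arena {
	bool used[ARENA_PAGES];	/* true if the page is in use by a slab */
	size_t nr_used;		/* number of pages currently handed out */
};

/*
 * Hand out the lowest free page in the arena. Returns the page index,
 * or -1 if no page is free; at that point the caller would have to
 * request another arena from the page allocator.
 */
static int arena_alloc_page(struct arena *a)
{
	assert(a != NULL);
	for (size_t i = 0; i < ARENA_PAGES; i++) {
		if (!a->used[i]) {
			a->used[i] = true;
			a->nr_used++;
			return (int)i;
		}
	}
	return -1;
}

/* Return a page to the arena once its slab no longer holds any objects. */
static void arena_free_page(struct arena *a, int idx)
{
	assert(a != NULL);
	assert(idx >= 0 && idx < ARENA_PAGES && a->used[idx]);
	a->used[idx] = false;
	a->nr_used--;
}

/* No free pages left: time to ask for another arena. */
static bool arena_is_exhausted(const struct arena *a)
{
	return a->nr_used == ARENA_PAGES;
}

/* Every page is free again: the arena has "naturally freed up". */
static bool arena_is_unused(const struct arena *a)
{
	return a->nr_used == 0;
}
```

The point of the sketch is only that slab pages are packed into one 2MB
region instead of being taken from wherever the page allocator happens
to be; whether the arena is ever handed back is a separate question.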

The workload I'm thinking of looks like this ... maybe the result of
running 'file' on every inode in a directory:

do {
	Allocate an inode
	Allocate a page of pagecache
} while (lots of times);

Naively, we allocate a page for the inode slab, then 3-6 pages for page
cache (depending on the filesystem), then we allocate another page for
the inode slab, then another 3-6 pages of page cache, and so on. So the
pages end up looking like this:

IPPPPPIP|PPPPIPPP|PPIPPPPP|IPPPPPIP|...

Now we need an order-3 allocation. We can't get there just by releasing
page cache pages because there are inode slab pages in there, so we need to
shrink the inode caches as well. I'm proposing:

IIIIII00|PPPPPPPP|PPPPPPPP|PPPPPPPP|PP...

and we can get our order-3 allocation just by releasing page cache pages.
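
As a quick sanity check on the two diagrams above, here is a small C
sketch that counts how many aligned order-3 blocks (8 pages) contain no
inode slab page, and so could be freed by dropping page cache alone.
The function name reclaimable_order3_blocks() is made up for this
illustration, and the model is deliberately simplistic: 'I' is an inode
slab page, 'P' is a page cache page, '0' is a free page, and '|' is
just the visual separator from the diagrams.

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_BLOCK 8	/* order-3: 2^3 contiguous, aligned pages */

/*
 * Count aligned 8-page blocks in the layout string that hold no 'I'
 * (inode slab) page. Such a block can become a free order-3 page just
 * by releasing page cache. '|' separators are skipped, and a trailing
 * partial block is ignored.
 */
static int reclaimable_order3_blocks(const char *layout)
{
	int count = 0;
	int pages_in_block = 0;
	int pinned = 0;	/* set if the current block holds a slab page */

	assert(layout != NULL);
	for (const char *p = layout; *p; p++) {
		if (*p == '|')
			continue;
		if (*p == 'I')
			pinned = 1;
		if (++pages_in_block == PAGES_PER_BLOCK) {
			if (!pinned)
				count++;
			pages_in_block = 0;
			pinned = 0;
		}
	}
	return count;
}
```

Fed the first four blocks of the naive layout,
"IPPPPPIP|PPPPIPPP|PPIPPPPP|IPPPPPIP", this returns 0. Fed the first
four blocks of the proposed layout,
"IIIIII00|PPPPPPPP|PPPPPPPP|PPPPPPPP", it returns 3.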

> The kernelcore already does something similar by limiting the
> general unmovable allocs to a section of memory.

Right! But Michal's unhappy about kernelcore (see the beginning of this
thread), and so I'm proposing an alternative.

> Maybe what we should do is raise the lowest allocation size instead and
> allocate 2^x groups of pages to certain purposes?
>
> I.e. have a base allocation size of 16k and if the alloc was a page cache
> page then use the remainder for the neighboring pages.

Yes, there are a lot of ideas like this floating around; I know Kirill's
interested in this kind of thing not just for THP but also for faultaround.