Re: missing madvise functionality

From: Nick Piggin
Date: Wed Apr 04 2007 - 04:05:23 EST


Nick Piggin wrote:
Ulrich Drepper wrote:

People might remember the thread about mysql not scaling and pointing
the finger quite happily at glibc. Well, the situation is not like that.

The problem is glibc has to work around kernel limitations. If the
malloc implementation detects that a large chunk of previously allocated
memory is now free and unused it wants to return the memory to the
system. What we currently have to do is this:

to free: mmap(PROT_NONE) over the area
to reuse: mprotect(PROT_READ|PROT_WRITE)

Yep, that's expensive, both operations need to get locks preventing
other threads from doing the same.

Some people were quick to suggest that we simply avoid the freeing in
many situations (that's what the patch submitted by Yanmin Zhang
basically does). That's no solution. One of the very good properties
of the current allocator is that it does not use much memory.


Does mmap(PROT_NONE) actually free the memory?


A solution for this problem is a madvise() operation with the following
property:

- the content of the address range can be discarded

- if an access to a page in the range happens in the future it must
succeed. The old page content can be provided or a new, empty page
can be provided

That's it. The current MADV_DONTNEED doesn't cut it because it zaps the
pages, causing *all* future reuses to create page faults. This is what
I guess happens in the mysql test case where the pages where unused and
freed but then almost immediately reused. The page faults erased all
the benefits of using one mprotect() call vs a pair of mmap()/mprotect()
calls.


Two questions.

In the case of pages being unused then almost immediately reused, why is
it a bad solution to avoid freeing? Is it that you want to avoid
heuristics because in some cases they could fail and end up using memory?

Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault
than a syscall? (including the cost of the TLB fill for the memory access
after the syscall, of course).

zapping the pages puts them on a nice LIFO cache hot list of pages that
can be quickly used when the next fault comes in, or used for any other
allocation in the kernel. Putting them on some sort of reclaim list seems
a bit pointless.

Oh, also: something like this patch would help out MADV_DONTNEED, as it
means it can run concurrently with page faults. I think the locking will
work (but needs forward porting).

BTW. and this way it becomes much more attractive than using mmap/mprotect
can ever be, because they must take mmap_sem for writing always.

You don't actually need to protect the ranges unless running with use after
free debugging turned on, do you?

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/