Re: [PATCH v2] mm: Reduce memory bloat with THP

From: Andrea Arcangeli
Date: Thu Jan 25 2018 - 17:29:20 EST


On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
> I'm trying to address many different THP issues and memory bloat is
> first among them.

You quoted redis in an earlier email, but the redis issue has nothing to
do with MADV_DONTNEED.

I can quickly explain the redis issue.

Redis uses fork() to create a read-only copy of the memory to do
snapshotting in the child, while the parent still writes to the memory.

THP CoWs in the parent have higher latency than 4k CoWs. They also take
more memory, but that's secondary: the maximum waste of memory in this
model reaches the same worst case (x2) with 4k CoWs too, so there is no
difference.
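
To make the "same worst case" point concrete, here is a minimal sketch of
the arithmetic. It is a userland model, not kernel code: the function name
cow_copied_bytes() and the offsets array are hypothetical, and it assumes
the dataset size is a multiple of the CoW granule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Bytes duplicated by CoW when the parent writes at the given byte
 * offsets and the first write to a granule copies the whole granule.
 * Assumes dataset is a multiple of granule (illustration only).
 * Returns (size_t)-1 if the bookkeeping allocation fails.
 */
size_t cow_copied_bytes(size_t dataset, size_t granule,
                        const size_t *offsets, size_t n)
{
    size_t ngranules, copied = 0;
    unsigned char *copied_flag;

    assert(granule != 0);
    ngranules = dataset / granule;
    copied_flag = calloc(ngranules, 1);
    if (!copied_flag)
        return (size_t)-1;

    for (size_t i = 0; i < n; i++) {
        size_t g = offsets[i] / granule;

        if (g < ngranules && !copied_flag[g]) {
            copied_flag[g] = 1;     /* first write: copy the granule */
            copied += granule;
        }
    }
    free(copied_flag);
    return copied;
}
```

With two sparse writes into a 4M dataset, a 4k granule copies 8k while a
2M granule copies the full 4M. Once every 4k page has been written, both
granules have copied the whole dataset, which is the x2 worst case.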

The problem is the copy-user there: it adds latency and wastes CPU.

Redis can simply use userfaultfd WP mode once it is upstream, and
then it will use 4k granularity, as the granularity of the write-protect
userfaults is up to userland to decide.

The main benefits: it can avoid the worst case degradation of using
x2 physical memory (disabling THP makes zero difference in that
regard: if storage is very slow, x2 physical memory can still be used
if very unlucky), it can throttle the WP writes (anon CoW cannot
throttle), and it can avoid fork() altogether, so it shares the same
pagetables. It can also put the "user-CoWed" pages (in the fault
handler) in front of the write queue, to be written first, using a
ring buffer for the CoWed 4k pages. That keeps memory utilization even
lower, although THP stays on at all times for all pages that didn't get
a CoW yet. This will be an optimal snapshot method, much better than
fork() no matter if 4k or THP are backing the memory.
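
The ring buffer scheme can be sketched as a single-threaded userland
model. This is only an illustration under stated assumptions, not the
userfaultfd API: struct snap, snap_begin(), snap_write() and snap_step()
are hypothetical names, snap_write() stands in for what the write-protect
fault handler would do, and the out array stands in for the snapshot file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ    4096u
#define NPAGES     8u    /* tiny dataset, illustration only */
#define RING_SLOTS 2u    /* bounds the extra memory used by the snapshot */

struct snap {
    unsigned char mem[NPAGES][PAGE_SZ];      /* live data, still written */
    unsigned char out[NPAGES][PAGE_SZ];      /* stands in for the file */
    unsigned char ring[RING_SLOTS][PAGE_SZ]; /* old data of CoWed pages */
    size_t ring_page[RING_SLOTS];            /* page held by each slot */
    size_t head, count;                      /* ring read pos, fill level */
    size_t next;                             /* sequential scan position */
    bool saved[NPAGES];                      /* written out or queued */
};

/* Start a snapshot of the current contents of s->mem. */
void snap_begin(struct snap *s)
{
    memset(s->saved, 0, sizeof(s->saved));
    s->head = s->count = s->next = 0;
}

/*
 * Write path. Returns -1 when the ring is full: the writer has to be
 * throttled until the snapshot side has drained a slot.
 */
int snap_write(struct snap *s, size_t page, size_t off, unsigned char val)
{
    assert(page < NPAGES && off < PAGE_SZ);
    if (!s->saved[page]) {
        size_t slot;

        if (s->count == RING_SLOTS)
            return -1;
        slot = (s->head + s->count) % RING_SLOTS;
        memcpy(s->ring[slot], s->mem[page], PAGE_SZ); /* 4k copy */
        s->ring_page[slot] = page;
        s->count++;
        s->saved[page] = true;
    }
    s->mem[page][off] = val;
    return 0;
}

/*
 * Snapshot side: write out one page, CoWed pages first so that their
 * ring slots are freed early. Returns 0 when nothing is left to write.
 */
int snap_step(struct snap *s)
{
    if (s->count > 0) {
        memcpy(s->out[s->ring_page[s->head]], s->ring[s->head], PAGE_SZ);
        s->head = (s->head + 1) % RING_SLOTS;
        s->count--;
        return 1;
    }
    while (s->next < NPAGES && s->saved[s->next])
        s->next++;
    if (s->next == NPAGES)
        return 0;
    memcpy(s->out[s->next], s->mem[s->next], PAGE_SZ);
    s->saved[s->next] = true;
    s->next++;
    return 1;
}
```

In this model the extra memory is capped at RING_SLOTS pages instead of
growing towards x2, and a full ring is where the writer gets throttled,
which anon CoW after fork() cannot do.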

In short, MADV_DONTNEED has nothing to do with redis. If mysql gets an
improvement, surely you can post a benchmark instead of URLs.

If you want low memory usage at the cost of potentially slower
performance overall, you should use transparent_hugepage=madvise.
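
With that setting, an app that does want THP for a given region has to
ask for it with madvise(MADV_HUGEPAGE). A minimal sketch, assuming Linux
with glibc; map_with_thp_hint() is a hypothetical helper name, and the
hint is only advisory:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map an anonymous region and hint that it should be backed by THP.
 * Returns NULL if the mapping fails. The madvise() result goes to
 * *hint_rc: if the hint fails (e.g. THP not built into the kernel) the
 * mapping stays usable with 4k pages.
 */
void *map_with_thp_hint(size_t len, int *hint_rc)
{
    void *p;

    assert(len > 0 && hint_rc != NULL);
    *hint_rc = -1;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *hint_rc = madvise(p, len, MADV_HUGEPAGE);
    return p;
}
```

Regions that never get the hint stay on 4k pages, which is where the
lower memory usage comes from.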

The cases where THP is not a good tradeoff are generally related to
lower performance in copy-user or the higher cost of compaction if the
app is only ever doing short-lived allocations.

If you post a reproducible benchmark with a real-life app that gets an
improvement with whatever change you're doing, it'll be possible to
evaluate it.

Thanks,
Andrea