Re: [PATCH v2] mm,thp,shmem: limit shmem THP alloc gfp_mask

From: Rik van Riel
Date: Thu Oct 22 2020 - 23:41:08 EST


On Thu, 2020-10-22 at 19:54 -0700, Hugh Dickins wrote:
> On Thu, 22 Oct 2020, Rik van Riel wrote:
>
> > The allocation flags of anonymous transparent huge pages can be
> controlled
> > through the files in /sys/kernel/mm/transparent_hugepage/defrag,
> which can
> > help the system from getting bogged down in the page reclaim and
> compaction
> > code when many THPs are getting allocated simultaneously.
> >
> > However, the gfp_mask for shmem THP allocations were not limited by
> those
> > configuration settings, and some workloads ended up with all CPUs
> stuck
> > on the LRU lock in the page reclaim code, trying to allocate dozens
> of
> > THPs simultaneously.
> >
> > This patch applies the same configurated limitation of THPs to
> shmem
> > hugepage allocations, to prevent that from happening.
> >
> > This way a THP defrag setting of "never" or "defer+madvise" will
> result
> > in quick allocation failures without direct reclaim when no 2MB
> free
> > pages are available.
> >
> > Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
>
> NAK in its present untested form: see below.

Oops. That issue is easy to fix, but indeed lets figure
out what the desired behavior is.

> I'm open to change here, particularly to Yu Xu's point (in other
> mail)
> about direct reclaim - we avoid that here in Google too: though it's
> not so much to avoid the direct reclaim, as to avoid the latencies of
> direct compaction, which __GFP_DIRECT_RECLAIM allows as a side-
> effect.
>
> > @@ -1887,7 +1888,8 @@ static int shmem_getpage_gfp(struct inode
> *inode, pgoff_t index,
> > }
> >
> > alloc_huge:
> > - page = shmem_alloc_and_acct_page(gfp, inode, index, true);
> > + huge_gfp = alloc_hugepage_direct_gfpmask(vma);
>
> Still looks nice: but what about the crash when vma is NULL?

That's a one line fix, but I suppose we should get the
discussion on what the code behavior should be out of
the way first :)

> Michal is right to remember pushback before, because tmpfs is a
> filesystem, and "huge=" is a mount option: in using a huge=always
> filesystem, the user has already declared a preference for huge
> pages.
> Whereas the original anon THP had to deduce that preference from sys
> tunables and vma madvice.

...

> But it's likely that they have accumulated some defrag wisdom, which
> tmpfs can take on board - but please accept that in using a huge
> mount,
> the preference for huge has already been expressed, so I don't expect
> anon THP alloc_hugepage_direct_gfpmask() choices will map one to one.

In my mind, the huge= mount options for tmpfs corresponded
to the "enabled" anon THP options, denoting a desired end
state, not necessarily how much we will stall allocations
to get there immediately.

The underlying allocation behavior has been changed repeatedly,
with changes to the direct reclaim code and the compaction
deferral code.

The shmem THP gfp_mask never tried really hard anyway,
with __GFP_NORETRY being the default, which matches what
is used for non-VM_HUGEPAGE anon VMAs.

Likewise, the direct reclaim done from the opportunistic
THP allocations done by the shmem code limited itself to
reclaiming 32 4kB pages per THP allocation.

In other words, mounting
with huge=always has never behaved
the same as the more aggressive allocations done for
MADV_HUGEPAGE VMAs.

This patch would leave shmem THP allocations for non-MADV_HUGEPAGE
mapped files opportunistic like today, and make shmem THP
allocations for files mapped with MADV_HUGEPAGE more aggressive
than today.

However, I would like to know what people think the shmem
huge= mount options should do, and how things should behave
when memory gets low, before pushing in a patch just because
it makes the system run smoother "without changing current
behavior too much".

What do people want tmpfs THP allocations to do?

--
All Rights Reversed.

Attachment: signature.asc
Description: This is a digitally signed message part