Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred

From: David Rientjes
Date: Fri Dec 23 2016 - 17:47:07 EST


On Fri, 23 Dec 2016, Michal Hocko wrote:

> > We have no way to compact memory for users who are not using
> > MADV_HUGEPAGE,
>
> Yes we have: it is defrag=always. If you do not want direct compaction
> and the resulting allocation stalls then you have to rely on kcompactd,
> which is something we should work on long term.
>

No, the point of madvise(MADV_HUGEPAGE) is for applications to tell the
kernel that they really want hugepages. Really. Everybody else either
never did direct compaction or did a substantially watered down version of
it. Now, we have a situation where you can either do direct compaction
for MADV_HUGEPAGE and nothing for anybody else, or direct compaction for
everybody. In our usecase, we want everybody to kick off background
compaction, because an order=9 allocation with __GFP_KSWAPD_RECLAIM set in
its gfp_mask is the only thing that is going to trigger background
compaction, but we are unable to do so without still incurring lengthy
pagefaults for non-MADV_HUGEPAGE users.

> > which is some customers, others require MADV_HUGEPAGE for
> > .text segment remap while loading their binary, without defrag=always or
> > defrag=defer. The problem is that we want to demand direct compact for
> > MADV_HUGEPAGE: they _really_ want hugepages, it's the point of the
> > madvise.
>
> and that is the point of defrag=madvise to give them this direct
> compaction.
>

Do you see the problem with first suggesting defrag=always at the top of
your reply and then defrag=madvise now? We cannot set both at once; that is
the entire problem with the tristate, and now quadstate, setting. We want a
combination: EVERYBODY kicks off background compaction, and applications
that really want hugepages and are fine with incurring lengthy page faults,
such as those (for the third time) remapping their .text segment and doing
madvise(MADV_HUGEPAGE) before fault, can use the madvise.
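For readers unfamiliar with the opt-in, here is a minimal userspace sketch of
"madvise(MADV_HUGEPAGE) before fault". It uses a plain anonymous mapping
rather than a remapped .text segment, and the helper name
map_with_hugepage_hint() is hypothetical. It assumes Linux with glibc. A
failed madvise() is treated as non-fatal, since the call returns EINVAL on
kernels built without THP support.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS and MADV_HUGEPAGE on glibc */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, not taken from the patch: map len bytes of
 * anonymous memory and ask for hugepages before the first touch.
 * Returns NULL if the mapping itself fails.  *advised is set to 1 if
 * the kernel accepted the hint and to 0 otherwise.
 */
static void *map_with_hugepage_hint(size_t len, int *advised)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	/* The hint must be in place before the pages are faulted in. */
	*advised = (madvise(p, len, MADV_HUGEPAGE) == 0);

	/* First touch: the page faults, and any compaction cost, occur here. */
	memset(p, 0, len);
	return p;
}
```

Whether the faults in memset() are actually served by hugepages depends on
the alignment of the mapping and on the thp enabled and defrag settings,
which is the subject of this thread.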

> > We have no setting, without this patch, to ask for background
> > compaction for everybody so that their fault does not have long latency
> > and for some customers to demand compaction.
>
> that is true and what I am trying to say is that we should aim to give
> this background compaction for everybody via kcompactd because there are
> more users than THP who might benefit from low latency high order pages
> availability.

My patch does that, we _defer_ for everybody unless you're using
madvise(MADV_HUGEPAGE) and really want hugepages. Forget defrag=never
exists, it's not important in the discussion. Forget defrag=always exists
because apps like batch jobs don't want lengthy pagefaults. We have
two options remaining:

- defrag=defer: everybody kicks off background compaction, _nobody_ does
direct compaction

- defrag=madvise: madvise(MADV_HUGEPAGE) does direct compaction,
everybody else does nothing

The point you're missing is that we _want_ defrag=defer. We really do.
We don't want to stall in the page allocator to get thp, but we want to
try to make it available in the short term. However, apps that do
madvise(MADV_HUGEPAGE), like remapping your .text segment and wanting your
text backed by hugepages and incurring the expense up front, or a
database, or a vm, _want_ hugepages now and don't care about lengthy page
faults.

The point is that I HAVE NO SETTING to get that behavior and
defrag=madvise is _not_ a solution because it requires the presence of an
app that is doing madvise(MADV_HUGEPAGE) AND faulting memory to get any
order=9 compaction.
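
The settings discussed above can be laid out as a small decision function.
This is only a model of the argument in this thread, not kernel code: the
enum names, thp_fault_action() and the "patched" flag are all hypothetical,
and each mode is reduced to the strongest action a faulting task takes.

```c
#include <stdbool.h>

/* Hypothetical names that model the discussion; not kernel identifiers. */
enum thp_defrag { DEFRAG_ALWAYS, DEFRAG_DEFER, DEFRAG_MADVISE, DEFRAG_NEVER };
enum thp_action { ACT_NONE, ACT_BACKGROUND, ACT_DIRECT };

/*
 * What a thp page fault does under each defrag setting.
 * madv_hugepage: the faulting region was marked with MADV_HUGEPAGE.
 * patched:       model the behavior requested by the patch under discussion.
 */
static enum thp_action thp_fault_action(enum thp_defrag mode,
					bool madv_hugepage, bool patched)
{
	switch (mode) {
	case DEFRAG_ALWAYS:
		return ACT_DIRECT;		/* everybody stalls */
	case DEFRAG_DEFER:
		if (patched && madv_hugepage)
			return ACT_DIRECT;	/* the requested combination */
		return ACT_BACKGROUND;		/* nobody stalls */
	case DEFRAG_MADVISE:
		return madv_hugepage ? ACT_DIRECT : ACT_NONE;
	case DEFRAG_NEVER:
		break;
	}
	return ACT_NONE;
}
```

Without the patch, no single mode returns ACT_BACKGROUND for ordinary
faults and ACT_DIRECT for MADV_HUGEPAGE faults; with it, defrag=defer does.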

> > ?????? Why does the admin care if a user's page fault wants to reclaim to
> > get high order memory?
>
> Because the whole point of the defrag knob is to allow the _administrator_
> to control how much we try to fault in THP. And the primary motivation was
> latency. The whole point of introducing the defer option was to _never_
> stall in the page fault while it still allows to kick the background
> compaction. If you really want to tweak any option then madvise would be
> more appropriate IMHO because the semantic would be still clear. Use
> direct compaction for MADV_HUGEPAGE vmas and kick in kswapd/kcompactd
> for others.
>

You want defrag=madvise to start doing background compaction for
everybody, which was never done before for existing users of
defrag=madvise? That might be possible, I don't really care, I just think
it's riskier because existing users of defrag=madvise would be opted in to
new behavior by the kernel change, without asking for it. This patch
changes defrag=defer because it's the new option and people setting the
mode know what they are getting.

I disagree with your description of what the defrag setting is intended
for. The setting of thp defrag is to optimize for apps that truly want
transparent behavior, i.e. they aren't doing madvise(MADV_HUGEPAGE). Are
they willing to incur lengthy pagefaults for thp when not doing any
madvise(2)? defrag=defer should not mean that users of
madvise(MADV_HUGEPAGE) that have clearly specified their intent should not
be allowed to try compacting memory themselves because they have indicated
they are fine with such an expense by doing the madvise(2).

This is obviously fine for Kirill, and I have users who remap their .text
segment and do madvise(MADV_HUGEPAGE) because they really want hugepages
when they are exec'd, so I'd kindly ask you to consider the real-world use
cases that require background compaction to make hugepages available for
everybody but allow apps to opt-in to take the expense of compaction on
themselves rather than your own theory of what users want.