Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

From: Johannes Weiner
Date: Mon Sep 18 2023 - 11:16:55 EST


On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> On 9/16/23 21:57, Mike Kravetz wrote:
> > On 09/15/23 10:16, Johannes Weiner wrote:
> >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> >> > In next-20230913, I started hitting the following BUG. Seems related
> >> > to this series. And, if series is reverted I do not see the BUG.
> >> >
> >> > I can easily reproduce on a small 16G VM. kernel command line contains
> >> > "hugetlb_free_vmemmap=on hugetlb_cma=4G". Then run the script,
> >> >   while true; do
> >> >     echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >> >     echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >> >     echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> >> >   done
> >> >
> >> > For the BUG below I believe it was the first (or second) 1G page creation from
> >> > CMA that triggered: cma_alloc of 1G.
> >> >
> >> > Sorry, have not looked deeper into the issue.
> >>
> >> Thanks for the report, and sorry about the breakage!
> >>
> >> I was scratching my head at this:
> >>
> >> 	/* MIGRATE_ISOLATE page should not go to pcplists */
> >> 	VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> >>
> >> because there is nothing in page isolation that prevents setting
> >> MIGRATE_ISOLATE on something that's on the pcplist already. So why
> >> didn't this trigger before already?
> >>
> >> Then it clicked: it used to only check the *pcpmigratetype* determined
> >> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
> >>
> >> Pages that get isolated while *already* on the pcplist are fine, and
> >> are handled properly:
> >>
> >> 	mt = get_pcppage_migratetype(page);
> >>
> >> 	/* MIGRATE_ISOLATE page should not go to pcplists */
> >> 	VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> >>
> >> 	/* Pageblock could have been isolated meanwhile */
> >> 	if (unlikely(isolated_pageblocks))
> >> 		mt = get_pageblock_migratetype(page);
> >>
> >> So this was purely a sanity check against the pcpmigratetype cache
> >> operations. With that gone, we can remove it.
> >
> > With the patch below applied, a slightly different workload triggers the
> > following warnings. It seems related, and appears to go away when
> > reverting the series.
> >
> > [ 331.595382] ------------[ cut here ]------------
> > [ 331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > [ 331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
>
> Initially I thought this demonstrates the possible race I was suggesting in
> reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> are trying to get a MOVABLE page from a CMA page block, which is something
> that's normally done and the pageblock stays CMA. So yeah if the warnings
> are to stay, they need to handle this case. Maybe the same can happen with
> HIGHATOMIC blocks?

Hm, I don't think that's quite it.

CMA and HIGHATOMIC have their own freelists. When MOVABLE requests dip
into CMA and HIGHATOMIC, we explicitly pass that migratetype to
__rmqueue_smallest(). This takes a chunk of e.g. CMA, expands the
remainder to the CMA freelist, then returns the page. While you get a
different mt than requested, the freelist typing should be consistent.
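
To make the typing argument concrete, here is a minimal user-space
sketch, not the kernel code. Every name in it (sketch_zone,
sketch_expand(), sketch_rmqueue_smallest(), the SK_* constants) is
made up for illustration, and the freelists are reduced to counters
per order and type. It only shows that a block taken from one typed
list has its remainders returned to that same list:

```c
#include <assert.h>
#include <stdbool.h>

#define SK_MAX_ORDER 10

/* Illustrative types only; the values do not match the kernel enum. */
enum sketch_mt { SK_MOVABLE, SK_CMA, SK_HIGHATOMIC, SK_NR_TYPES };

/* One counter per (order, type) stands in for a real freelist. */
struct sketch_zone {
	unsigned long nr_free[SK_MAX_ORDER + 1][SK_NR_TYPES];
};

/* Split a block of order 'high' down to 'low'; each split-off half
 * goes back on the list of the type we were asked to allocate from. */
static void sketch_expand(struct sketch_zone *z, int low, int high,
			  enum sketch_mt mt)
{
	while (high > low) {
		high--;
		z->nr_free[high][mt]++;
	}
}

/* Take the smallest block of at least 'order' from the 'mt' list only. */
static bool sketch_rmqueue_smallest(struct sketch_zone *z, int order,
				    enum sketch_mt mt)
{
	assert(order >= 0 && order <= SK_MAX_ORDER);

	for (int cur = order; cur <= SK_MAX_ORDER; cur++) {
		if (!z->nr_free[cur][mt])
			continue;
		z->nr_free[cur][mt]--;
		sketch_expand(z, order, cur, mt);
		return true;
	}
	return false;
}
```

With a single order-3 block on the CMA list, an order-0 request that
passes SK_CMA leaves one block each at orders 2, 1 and 0 on the CMA
list, and nothing ever lands on the MOVABLE list.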

In this splat, the migratetype passed to __rmqueue_smallest() is
MOVABLE. There is no preceding warning from del_page_from_freelist()
(Mike, correct me if I'm wrong), so we got a confirmed MOVABLE
order-10 block from the MOVABLE list. So far so good. However, when we
expand() the order-9 tail of this block to the MOVABLE list, it warns
that its pageblock type is CMA.

This means we have an order-10 page where one half is MOVABLE and the
other is CMA.
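
As a rough illustration of why only the order-9 tail trips the
warning, here is a small user-space sketch. It assumes a pageblock
order of 9 and models pageblock types as a plain array; the names
(sk_pageblock_type(), sk_expand_mismatches(), SK_TYPE_*) are
hypothetical and the check is a simplified stand-in for the one in
expand():

```c
#include <assert.h>

#define SK_PAGEBLOCK_ORDER 9	/* assumption: 512 pages per pageblock */

/* Illustrative types only; the values do not match the kernel enum. */
enum sk_type { SK_TYPE_MOVABLE, SK_TYPE_CMA };

/* Look up the type of the pageblock that contains 'pfn'. */
static enum sk_type sk_pageblock_type(const enum sk_type *pb,
				      unsigned long pfn)
{
	return pb[pfn >> SK_PAGEBLOCK_ORDER];
}

/* Walk the splits expand() would do for a block at 'pfn' and count the
 * split-off tails whose pageblock type differs from the list type 'mt'. */
static int sk_expand_mismatches(const enum sk_type *pb, unsigned long pfn,
				int low, int high, enum sk_type mt)
{
	unsigned long size = 1UL << high;
	int mismatches = 0;

	assert(low >= 0 && low <= high);

	while (high > low) {
		high--;
		size >>= 1;
		/* The upper half, starting at pfn + size, is freed back. */
		if (sk_pageblock_type(pb, pfn + size) != mt)
			mismatches++;
	}
	return mismatches;
}
```

For an order-10 block at pfn 0 whose first pageblock is MOVABLE and
whose second is CMA, only the first split (the 512-page tail at pfn
512) lands in the CMA pageblock. All smaller tails fall inside the
first pageblock, so the sketch reports exactly one mismatch.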

I don't see how the merging code in __free_one_page() could have done
that. The CMA buddy would have failed the migratetype_is_mergeable()
test and we should have left the two halves as separate order-9 blocks.
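
For reference, the merge gate I have in mind looks roughly like the
following user-space sketch. It is a simplification written from
memory, not a copy of __free_one_page(); the names (mg_is_mergeable(),
mg_can_merge(), MG_*) are invented for the example, and it assumes
that only the unmovable, movable and reclaimable types may be merged
across pageblocks:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative types only; the values do not match the kernel enum. */
enum mg_type { MG_UNMOVABLE, MG_MOVABLE, MG_RECLAIMABLE, MG_CMA, MG_ISOLATE };

/* Assumption: only the three regular types may merge across pageblocks. */
static bool mg_is_mergeable(enum mg_type mt)
{
	return mt == MG_UNMOVABLE || mt == MG_MOVABLE || mt == MG_RECLAIMABLE;
}

/* May a free block of 'order' merge with its buddy of type 'buddy_mt'? */
static bool mg_can_merge(int order, int pageblock_order,
			 enum mg_type mt, enum mg_type buddy_mt)
{
	assert(order >= 0 && pageblock_order >= 0);

	/* Below pageblock order both buddies share one pageblock. */
	if (order < pageblock_order)
		return true;
	/* Same type on both sides is always fine. */
	if (mt == buddy_mt)
		return true;
	/* Differing types merge only if both are mergeable. */
	return mg_is_mergeable(mt) && mg_is_mergeable(buddy_mt);
}
```

Under these assumptions, an order-9 MOVABLE block next to an order-9
CMA buddy is refused, which is why a mixed order-10 block should not
come out of the free path.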

I also don't see how the CMA setup could have done this because
MIGRATE_CMA is set on the range before the pages are fed to the buddy.

Mike, could you describe the workload that is triggering this?

Does this reproduce instantly and reliably?

Is there high load on the system, or is it requesting the huge page
with not much else going on?

Do you see compact_* history in /proc/vmstat after this triggers?

Could you please also provide /proc/zoneinfo, /proc/pagetypeinfo and
the hugetlb_cma= parameter you're using?

Thanks!