Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

From: Mike Rapoport
Date: Wed Nov 25 2020 - 16:04:50 EST


On Wed, Nov 25, 2020 at 08:27:21PM +0100, David Hildenbrand wrote:
> On 25.11.20 19:28, Andrea Arcangeli wrote:
> > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> >
> > What would need to call pfn_zone in between first and second stage?
> >
> > If something calls pfn_zone in between first and second stage isn't it
> > a feature if it crashes the kernel at boot?
> >
> > Note: I suggested 0xff kernel crashing "until the second stage comes
> > around" during meminit at boot, not permanently.
>
> Yes, then it makes sense - if we're able to come up with a way to
> initialize any memmap we might have - including actual memory holes that
> have a memmap.
>
> >
> > /*
> > * Use a fake node/zone (0) for now. Some of these pages
> > * (in memblock.reserved but not in memblock.memory) will
> > * get re-initialized via reserve_bootmem_region() later.
> > */
> >
> > Specifically I relied on the comment "get re-initialized via
> > reserve_bootmem_region() later".
>
> Yes, but there is a "Some of these" :)
>
> Boot a VM with "-M 4000" and observe the memmap in the last section -
> they won't get initialized a second time.
>
> >
> > I assumed the second stage overwrites the 0,0 to the real zoneid/nid
> > value, which is clearly not happening, hence it'd be preferable to get
> > a crash at boot reliably.
> >
> > Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n so the second stage
> > calling init_reserved_page(start_pfn) won't do much with
> > CONFIG_DEFERRED_STRUCT_PAGE_INIT=n but I already tried to enable
> > CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help, the
> > page->flags were still wrong for reserved pages in the "Unknown E820
> > type" region.

I think the very root cause is how e820__memblock_setup() registers
memory with memblock:

if (entry->type == E820_TYPE_SOFT_RESERVED)
memblock_reserve(entry->addr, entry->size);

if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
continue;

memblock_add(entry->addr, entry->size);

>From that point the system has inconsistent view of RAM in both
memblock.memory and memblock.reserved and, which is then translated to
memmap etc.

Unfortunately, simply adding all RAM to memblock is not possible as
there are systems that for them "the addresses listed in the reserved
range must never be accessed, or (as we discovered) even be reachable by
an active page table entry" [1].

[1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/

--
Sincerely yours,
Mike.