Re: [PATCH] mm: migrate: Fix THP's mapcount on isolation

From: David Hildenbrand
Date: Fri Nov 25 2022 - 03:55:21 EST


I think you're right. Without a page reference I don't think it is even
safe to look at struct page, at least not without synchronisation
against memory hot unplug which could remove the struct page. From a
quick glance I didn't see anything here that obviously did that though.
Memory hotplug is the offending party here. It has to make sure
that
everything else is definitely quiescent before removing the struct pages.
Otherwise you can't even try_get a refcount.

Sure, I might be missing something but how can memory hotplug possibly
synchronise against some process (eg. try_to_compact_pages) doing
try_get(pfn_to_page(random_pfn)) without that process either informing
memory hotplug that it needs pages to remain valid long enough to get a
reference or detecting that memory hotplug is in the process of
offlining pages?

It currently doesn't, and it's been mostly a known theoretical problem so far. We've been ignoring these kind of problems because they are not easy to sort out and so far never popped up in practice.

First, the correct approach is using pfn_to_online_page() instead of pfn_to_page() when in doubt whether the page might already be offline. While we're using pfn_to_online_page() already in quite some PFN walkers, changes are good that we're still missing some.

Second, essentially anybody (PFN walker) doing a pfn_to_online_page() is prone to races with memory offlining. I've discussed this in the past with Michal and Dan and one idea was to use RCU to synchronize PFN walkers and pfn_to_online_page(), essentially synchronizing clearing of the SECTION_IS_ONLINE flag.

Nobody worked on that so far because we've never had actual BUG reports. These kind of races are rare, but they are theoretically possible.


At least alloc_contig_range() and memory offlining are mutually
exclusive due to MIGRATE_ISOLTAE. I recall that ordinary memory
compaction similarly deals with isolated pageblocks (or some other
mechanism I forgot) to not race with memory offlining. Wouldn't worry
about that for now.

Sorry, agree - to be clear this discussion isn't relevant for this patch
but reviewing it got me thinking about how this works. I still don't see
how alloc_contig_range() is totally safe against memory offlining
though. From what I can see we have the following call path to set
MIGRATE_ISOLATE:

alloc_contig_range(pfn) -> start_isolate_page_range(pfn) ->
isolate_single_pageblock(pfn) -> page_zone(pfn_to_page(pfn))

There's nothing in that call stack that prevent offlining of the page,
hence the struct page may be freed by this point. Am I missing something
else here?

Good point. I even had at some point a patch that converted some pfn_to_online_page() calls in there, but discarded it.

IIRC, two alloc_contig_range() users are safe because:

1) virtio-mem synchonizes against memory offlining using the memory notifier. While memory offlining is in progress, it won't call alloc_contig_range().

2) CMA regions will always fail to offline because they have MIGRATE_CMA set.

What remains is alloc_contig_pages(), for example, used for memtrace on ppc, kfence, and allocation of gigantic pages. That might indeed be racy. At least from kfence_init_late() it's unlikely (impossible?) that we'll have concurrent memory offlining.


try_to_compact_pages() has a similar issue which is what I noticed
reviewing this patch:

Yes, I can spot pfn_to_online_page() in __reset_isolation_pfn(), but of course, are likely missing proper synchronization and checks later. I wonder if we could use the memory notifier to temporarily pause any running compaction and to restart it once memory offlining succeeded.


try_to_compact_pages() -> compact_zone_order() -> compact_zone() ->
isolate_migratepages() -> isolate_migratepages_block() ->
PageHuge(pfn_to_page(pfn))

Again I couldn't see anything in that path that would hold the page
stable long enough to safely perform the pfn_to_page() and get a
reference. And after a bit of fiddling I was able to trigger the below
oops running the compaction_test selftest whilst offlining memory so I
don't think it is safe:

Thanks for finding a reproducer. This is exactly the kind of BUG that we speculated about in the past but that we haven't seen in practice yet.


Having that said, I'd be very happy if someone would have time to work on proper synchronization and I'd be happy to help brainstorming/reviewing :)

--
Thanks,

David / dhildenb