Re: [RFC PATCH] mm: Introduce new MADV_NOMOVABLE behavior

From: Baolin Wang
Date: Thu Oct 20 2022 - 03:17:23 EST




On 10/19/2022 11:17 PM, David Hildenbrand wrote:
In one migration failure case I observed (which is not easy to
reproduce), the 'thp_migration_fail' count is 1 and the
'thp_split_page_failed' count is also 1.
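
A minimal userspace sketch of how such counters could be checked, assuming the text comes from /proc/vmstat in its usual "name value" per-line layout. The helper name vmstat_lookup() is hypothetical and is not part of any kernel or libc interface.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up a counter by name in /proc/vmstat-style text, where each line
 * has the form "name value". Returns 0 and stores the value in *out on
 * success, or -1 if the name is not found or the value cannot be parsed.
 * Hypothetical helper for illustration only.
 */
int vmstat_lookup(const char *text, const char *name, unsigned long long *out)
{
    size_t len = strlen(name);
    const char *line = text;

    while (line && *line) {
        /* Require a space after the name so that "thp_split_page"
         * does not match "thp_split_page_failed". */
        if (strncmp(line, name, len) == 0 && line[len] == ' ') {
            if (sscanf(line + len + 1, "%llu", out) == 1)
                return 0;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++; /* step past the newline to the next line */
    }
    return -1;
}
```

Sampling 'thp_migration_fail' and 'thp_split_page_failed' before and after a CMA allocation and comparing the deltas would show whether both moved together, as in the case described above.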

That means a THP in the CMA area was being migrated, but a new THP
could not be allocated due to memory fragmentation, so the THP had to
be split. However, the THP split also failed, probably because of a
temporary reference count on this THP. The temporary reference count
can be caused by dropping page caches (I observed the drop caches
operation in the system), but the shmem page caches can not be dropped
because they are already dirty at that time.

So can we try again in migrate_pages() if the THP split fails, to
mitigate the migration failure, especially when the failure reason is
a temporary reference count? Does this sound reasonable to you?
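
The retry idea can be sketched roughly as below. This is a userspace simulation only, not kernel code: struct fake_thp, try_split() and split_with_retry() are hypothetical names, and the assumption that one temporary reference is dropped between attempts is made up for illustration.

```c
/* Outcome of one simulated split attempt. */
enum split_result { SPLIT_OK, SPLIT_BUSY };

/* Hypothetical stand-in for a THP; only tracks temporary references. */
struct fake_thp {
    int extra_refs; /* temporary references held by other code paths */
};

/* The simulated split succeeds only when no temporary reference is held. */
enum split_result try_split(struct fake_thp *thp)
{
    return thp->extra_refs == 0 ? SPLIT_OK : SPLIT_BUSY;
}

/*
 * Retry the split up to max_attempts times. Returns the attempt number
 * that succeeded, or -1 if every attempt found the page still busy.
 */
int split_with_retry(struct fake_thp *thp, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (try_split(thp) == SPLIT_OK)
            return attempt;
        /* Assumption: the other holder drops one temporary
         * reference before the next attempt. */
        if (thp->extra_refs > 0)
            thp->extra_refs--;
    }
    return -1;
}
```

With one temporary reference held, the first attempt fails and the second succeeds; if the references outlive the retry budget, the failure is still reported to the caller.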

It sounds reasonable, and I understand that debugging these issues is tricky. But we really have to figure out the root cause to make these pages that are indeed movable (but only temporarily not movable for reason XYZ) movable.

We'd need some indication to retry migration longer / again.

OK. Let me try this and see if there are other possible failure cases in the products.


However, I am still worried that there are other possible cases that
cause migration failure, so avoiding CMA allocation for our case seems
more stable IMO.

Yes, I can understand that. But as one example, your approach doesn't handle the case where a page that was allocated on !CMA/!ZONE_MOVABLE gets migrated to CMA/ZONE_MOVABLE just before you try pinning the page (only to migrate it off CMA/ZONE_MOVABLE again).

Indeed. Like you said before, it is just helpful to minimize page migration for now. Maybe I can take MADV_PINNABLE into consideration when allocating new pages, such as in alloc_migration_target().

Anyway, let me try to fix the root cause first to see if it can solve our problem.

We really have to fix the root cause.

OK. Thanks for your input.