Re: [PATCH v1 6/6] mm/hwpoison: fix unpoison_memory()

From: Ding Hui
Date: Sat Jun 19 2021 - 08:36:01 EST


On 2021/6/18 4:36 下午, HORIGUCHI NAOYA(堀口 直也) wrote:
On Thu, Jun 17, 2021 at 06:00:21PM +0800, Ding Hui wrote:
On 2021/6/14 10:12, Naoya Horiguchi wrote:
From: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>

After recent soft-offline rework, error pages can be taken off from
buddy allocator, but the existing unpoison_memory() does not properly
undo the operation. Moreover, due to the recent change on
__get_hwpoison_page(), get_page_unless_zero() is hardly called for
hwpoisoned pages. So __get_hwpoison_page() mostly returns zero (meaning
to fail to grab page refcount) and unpoison just clears PG_hwpoison
without releasing a refcount. That does not lead to a critical issue
like kernel panic, but unpoisoned pages never get back to buddy (leaked
permanently), which is not good.

As I mention in [1], I'm not sure about the exactly meaning of "broken" in
unpoison_memory().

Maybe the misunderstanding is:

I think __get_hwpoison_page() mostly returns one for hwpoisoned page.
In 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages"),
page_handle_poison() is introduced, it will add refcount for all
soft-offlineed hwpoison page.
In memory_failure() for hard-offline,page_ref_inc() called on free page
too, and for used page, we do not call put_page() after get_hwpoison_page()
!= 0.
So all hwpoisoned page refcount must be great than zero when
unpoison_memory() if regardless of racy.

Hi, Ding,

Thanks for the comment. I feel that I failed to define the exact issue in
unpoison. Maybe I saw and misinterpreted some random error as unpoison's
issue during developing other hwpoison patches, so please don't take serious
my previous wrong word "broken", sorry about that.

Anyway I reconsider how to handle this 6/6, maybe it will be a clear
description of the problem, and will be simplified.


Recently I tested loop soft-offline random pages and unpoison them for days,
it works fine to me. (with bac9c6fa1f92 patched)

Thank you for testing,


Hi Naoya,

I'm afraid of my description about testing is ambiguous for others, let me clarify that I ran stress soft-offline test case from mce-test project (https://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git) for days to verify my modify about NR_FREE_PAGES (bac9c6fa1f92), without your current patchset, the case is loop soft-offline random pages and unpoison them, and it works basic fine to me.

--
Thanks,
-dinghui