Re: Re: [PATCH] mm: hugetlbfs: add hwcrp_hugepages to record memory failure on hugetlbfs

From: HORIGUCHI NAOYA(堀口 直也)
Date: Tue Jun 08 2021 - 05:13:59 EST


On Tue, Jun 08, 2021 at 10:24:50AM +0800, wangbin wrote:
> > What specific problem are you trying to solve? Are you trying to see how
> > many huge pages were hit by memory errors?
>
> Yes, I'd like to know how many huge pages are not available because of
> the memory errors. Just like HardwareCorrupted in the /proc/meminfo.
> But the HardwareCorrupted only adds one page size when a huge page is
> hit by memory errors, and mixes with normal pages. So I think we should
> add a new counter to track the memory errors on hugetlbfs.

If you can use root privileges in your use case, an easy way to get the
number of corrupted hugepages is to use page-types.c (which reads
/proc/kpageflags), as shown here:

$ page-types -b huge,hwpoison=huge,hwpoison
             flags  page-count  MB  symbolic-flags                               long-symbolic-flags
0x00000000000a8000           1   0  _______________H_G_X_______________________  compound_head,huge,hwpoison
             total           1   0
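
For illustration, roughly the same count can be sketched in Python by
reading /proc/kpageflags directly. This is only a minimal sketch, not what
page-types.c does internally: the function names are made up for this
example, the bit numbers (KPF_HUGE = 17, KPF_HWPOISON = 19) are assumed from
include/uapi/linux/kernel-page-flags.h, and reading the file still requires
root.

```python
import struct

# Bit positions assumed from include/uapi/linux/kernel-page-flags.h.
KPF_HUGE = 17
KPF_HWPOISON = 19
MASK = (1 << KPF_HUGE) | (1 << KPF_HWPOISON)


def count_matching(raw: bytes, mask: int = MASK) -> int:
    """Count 64-bit flag words in `raw` that have every bit of `mask` set."""
    usable = len(raw) - (len(raw) % 8)  # ignore a trailing partial word
    return sum(
        1
        for (flags,) in struct.iter_unpack("=Q", raw[:usable])
        if flags & mask == mask
    )


def count_hwpoison_hugepages(path: str = "/proc/kpageflags",
                             chunk_pages: int = 1 << 16) -> int:
    """Scan the per-page-frame flags file (one 64-bit word per page frame)."""
    total = 0
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_pages * 8)
            if not raw:
                break
            total += count_matching(raw)
    return total


# Usage (as root):
#   print(count_hwpoison_hugepages())
```

Each matching entry is one page frame with both the huge and hwpoison bits
set, so the result should correspond to the page-count column above.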


But I guess that many use cases do not permit access to this interface,
in which case some new accounting interface for corrupted hugepages could
be helpful, as you suggest.

Thanks,
Naoya Horiguchi