Re: [PATCH v4 4/4] selftests/mm: add tests for HWPOISON hugetlbfs read

From: Jiaqi Yan
Date: Thu Jan 11 2024 - 12:34:56 EST


On Thu, Jan 11, 2024 at 12:48 AM Muhammad Usama Anjum
<usama.anjum@xxxxxxxxxxxxx> wrote:
>
> On 1/11/24 7:32 AM, Sidhartha Kumar wrote:
> > On 1/10/24 2:15 AM, Muhammad Usama Anjum wrote:
> >> On 1/10/24 11:49 AM, Muhammad Usama Anjum wrote:
> >>> On 1/6/24 2:13 AM, Jiaqi Yan wrote:
> >>>> On Thu, Jan 4, 2024 at 10:27 PM Muhammad Usama Anjum
> >>>> <usama.anjum@xxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I'm trying to convert this test to TAP as I think the failures
> >>>>> sometimes go unnoticed on CI systems if we only depend on the
> >>>>> return value of the application. I've enabled the following
> >>>>> configurations which aren't already present in
> >>>>> tools/testing/selftests/mm/config:
> >>>>> CONFIG_MEMORY_FAILURE=y
> >>>>> CONFIG_HWPOISON_INJECT=m
> >>>>>
> >>>>> I'll send a patch to add these configs later. Right now I'm trying to
> >>>>> investigate the failure when we try to inject the poison page by
> >>>>> madvise(MADV_HWPOISON). I'm getting device busy every single time. The
> >>>>> test fails as it doesn't expect the hugetlb memory to be busy. I'm not
> >>>>> sure if the poison handling code has issues or the test isn't robust
> >>>>> enough.
> >>>>>
> >>>>> ./hugetlb-read-hwpoison
> >>>>> Write/read chunk size=0x800
> >>>>> ... HugeTLB read regression test...
> >>>>> ... ... expect to read 0x200000 bytes of data in total
> >>>>> ... ... actually read 0x200000 bytes of data in total
> >>>>> ... HugeTLB read regression test...TEST_PASSED
> >>>>> ... HugeTLB read HWPOISON test...
> >>>>> [ 9.280854] Injecting memory failure for pfn 0x102f01 at process virtual address 0x7f28ec101000
> >>>>> [ 9.282029] Memory failure: 0x102f01: huge page still referenced by 511 users
> >>>>> [ 9.282987] Memory failure: 0x102f01: recovery action for huge page: Failed
> >>>>> ... !!! MADV_HWPOISON failed: Device or resource busy
> >>>>> ... HugeTLB read HWPOISON test...TEST_FAILED
> >>>>>
> >>>>> I'm testing on v6.7-rc8. Not sure if this was working previously or not.
> >>>>
> >>>> Thanks for reporting this, Usama!
> >>>>
> >>>> I am also able to repro MADV_HWPOISON failure at "501a06fe8e4c
> >>>> (akpm/mm-stable, mm-stable) zswap: memcontrol: implement zswap
> >>>> writeback disabling."
> >>>>
> >>>> Then I checked out the earliest commit "ba91e7e5d15a (HEAD -> Base)
> >>>> selftests/mm: add tests for HWPOISON hugetlbfs read". The
> >>>> MADV_HWPOISON injection works and the test passes:
> >>>>
> >>>> ... HugeTLB read HWPOISON test...
> >>>> ... ... expect to read 0x101000 bytes of data in total
> >>>> ... !!! read failed: Input/output error
> >>>> ... ... actually read 0x101000 bytes of data in total
> >>>> ... HugeTLB read HWPOISON test...TEST_PASSED
> >>>> ... HugeTLB seek then read HWPOISON test...
> >>>> ... ... init val=4 with offset=0x102000
> >>>> ... ... expect to read 0xfe000 bytes of data in total
> >>>> ... ... actually read 0xfe000 bytes of data in total
> >>>> ... HugeTLB seek then read HWPOISON test...TEST_PASSED
> >>>> ...
> >>>>
> >>>> [ 2109.209225] Injecting memory failure for pfn 0x3190d01 at process virtual address 0x7f75e3101000
> >>>> [ 2109.209438] Memory failure: 0x3190d01: recovery action for huge page: Recovered
> >>>> ...
> >>>>
> >>>> I think something in between broke MADV_HWPOISON on hugetlbfs, and we
> >>>> should be able to figure it out via bisection (and of course by
> >>>> reading delta commits between them, probably related to page
> >>>> refcount).
> >>> Thank you for this information.
> >>>
> >>>>
> >>>> That being said, I will be on vacation from tomorrow until the end of
> >>>> next week. So I will get back to this after next weekend. Meanwhile if
> >>>> you want to go ahead and bisect the problematic commit, that will be
> >>>> very much appreciated.
> >>> I'll try to bisect and post here if I find something.
> >> Found the culprit commit by bisection:
> >>
> >> a08c7193e4f18dc8508f2d07d0de2c5b94cb39a3
> >> mm/filemap: remove hugetlb special casing in filemap.c

Thanks Usama!

> >>
> >> hugetlb-read-hwpoison started failing from this patch. I've added the
> >> author of this patch to this bug report.
> >>
> > Hi Usama,
> >
> > Thanks for pointing this out. After debugging, the below diff seems to fix
> > the issue and allows the tests to pass again. Could you test it on your
> > configuration as well just to confirm.
> >
> > Thanks,
> > Sidhartha
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index 36132c9125f9..3a248e4f7e93 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -340,7 +340,7 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > } else {
> > folio_unlock(folio);
> >
> > - if (!folio_test_has_hwpoisoned(folio))
> > + if (!folio_test_hwpoison(folio))

Sidhartha, just curious: why is this change needed? Did the behavior of
PageHasHWPoisoned change after commit
"a08c7193e4f18dc8508f2d07d0de2c5b94cb39a3"?

> > want = nr;
> > else {
> > /*
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index d8c853b35dbb..87f6bf7d8bc1 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -973,7 +973,7 @@ struct page_state {
> > static bool has_extra_refcount(struct page_state *ps, struct page *p,
> > bool extra_pins)
> > {
> > - int count = page_count(p) - 1;
> > + int count = page_count(p) - folio_nr_pages(page_folio(p));
> >
> > if (extra_pins)
> > count -= 1;
> >
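
To check my understanding of this hunk, here is a small userspace sketch of
the arithmetic before and after the change. It is not kernel code: the
function names extra_refs_old() and extra_refs_new() are made up for
illustration, and the inputs are assumptions on my side (a 2 MiB huge page
made of 512 base pages, extra_pins == true, and a page_count() of 513
inferred from the "still referenced by 511 users" line in Usama's log).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of has_extra_refcount(); the names and the
 * plain-int parameters are illustrative only, not the kernel interface.
 */

/* Arithmetic before the diff: subtract a single reference. */
int extra_refs_old(int page_count, bool extra_pins)
{
	int count = page_count - 1;

	if (extra_pins)
		count -= 1;
	return count;
}

/*
 * Arithmetic after the diff: subtract one reference per base page in the
 * folio, i.e. what folio_nr_pages(page_folio(p)) would return.
 */
int extra_refs_new(int page_count, int nr_pages, bool extra_pins)
{
	int count;

	assert(nr_pages > 0);	/* a folio always has at least one page */
	count = page_count - nr_pages;
	if (extra_pins)
		count -= 1;
	return count;
}
```

Under those assumptions, extra_refs_old(513, true) gives 511, which matches
the number in the failing log, while extra_refs_new(513, 512, true) gives 0,
so no extra reference would be reported and recovery could proceed. Please
correct me if the assumed reference counts are off.
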
> Tested the patch, it fixes the test. Please send this patch.
>
> Tested-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
>
> --
> BR,
> Muhammad Usama Anjum