Re: [PATCH v3] mm,hwpoison: return -EHWPOISON when page already poisoned

From: Aili Yao
Date: Mon Apr 05 2021 - 21:04:50 EST


On Mon, 5 Apr 2021 13:50:18 +0000
HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@xxxxxxx> wrote:

> On Fri, Apr 02, 2021 at 03:11:20PM +0000, Luck, Tony wrote:
> > >> Combined with my "mutex" patch (to get rid of races where 2nd process returns
> > >> early, but first process is still looking for mappings to unmap and tasks
> > >> to signal) this patch moves forward a bit. But I think it needs an
> > >> additional change here in kill_me_maybe() to just "return" if there is an
> > >> EHWPOISON return from memory_failure().
> > >>
> > > Got this, Thanks for your reply!
> > > I will dig into this!
> >
> > One problem with this approach is when the first task to find poison
> > fails to complete actions. Then the poison pages are not unmapped,
> > and just returning from kill_me_maybe() gets into a loop :-(
>
> Yes, that's the pain point. We need to send SIGBUS to the current process in
> the "already hardware poisoned" case of memory_failure(). SIGBUS should
> contain the error virtual address, but unfortunately walking the page table
> or using p->mce_vaddr is not always reliable now.
>
> So as a second-best approach, we can extend the "walking page table"
> approach such that we walk over the whole virtual address space to make sure
> that the number of entries pointing to the error page is exactly 1.
> If that's the case, then we can confidently send SIGBUS with it. If we find
> multiple entries pointing to the error page, then we give up guessing and
> send a normal SIGBUS to the current process. That's not worse than now,
> and I think we need to wait in the hope that the virtual address will be
> available in the MCE handler.
>
> Anyway I'll try to write a patch for this.

Yeah, the previous patch didn't address the multiple virtual address issue. If there is a way
to fix that, that would be great!
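
To make sure I read the "exactly one entry" idea right, here is a rough
userspace toy model of it. This is not kernel code and not the proposed patch:
struct toy_pte, count_poison_mappings() and the flat table standing in for the
page table walk are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one page table entry: vaddr -> pfn. */
struct toy_pte {
	unsigned long vaddr;
	unsigned long pfn;
};

/*
 * Scan the whole (toy) address space and count the entries that map
 * poisoned_pfn. *vaddr_out is written only when exactly one entry
 * matches; otherwise it is left untouched.
 */
size_t count_poison_mappings(const struct toy_pte *table, size_t n,
			     unsigned long poisoned_pfn,
			     unsigned long *vaddr_out)
{
	size_t hits = 0;
	unsigned long first = 0;
	size_t i;

	assert(table != NULL || n == 0);
	assert(vaddr_out != NULL);

	for (i = 0; i < n; i++) {
		if (table[i].pfn != poisoned_pfn)
			continue;
		if (hits == 0)
			first = table[i].vaddr;
		hits++;
	}

	if (hits == 1)
		*vaddr_out = first;
	return hits;
}

/*
 * Returns 1 when the address is unambiguous (SIGBUS could carry it),
 * 0 when we should give up guessing and send a plain SIGBUS.
 */
int sigbus_can_carry_vaddr(const struct toy_pte *table, size_t n,
			   unsigned long poisoned_pfn,
			   unsigned long *vaddr_out)
{
	return count_poison_mappings(table, n, poisoned_pfn, vaddr_out) == 1;
}
```

If the poisoned pfn is mapped once, the helper fills in that single address;
if it is mapped twice (or not at all), the address is left alone, which
corresponds to the plain SIGBUS fallback.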

--
Thanks!
Aili Yao