Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

From: Oded Gabbay
Date: Wed Feb 16 2022 - 11:59:54 EST


On Mon, Sep 21, 2020 at 3:03 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Mon, Sep 21, 2020 at 10:35:05AM +0200, Jan Kara wrote:
> > > My thinking is that to hit this issue you have to already be doing
> > > FOLL_LONGTERM, and if some driver hasn't been properly marked and
> > > regresses, the fix is to mark it.
> > >
> > > Remember, this use case requires the pin to extend after a system
> > > call, past another fork() system call, and still have data-coherence.
> > >
> > > IMHO that can only happen in the FOLL_LONGTERM case as it inherently
> > > means the lifetime of the pin is being controlled by userspace, not by
> > > the kernel. Otherwise userspace could not cause new DMA touches after
> > > fork.
> >
> > I agree that the new aggressive COW behavior is probably causing issues
> > only for FOLL_LONGTERM users. That being said, it would be nice if even
> > ordinary threaded FOLL_PIN users did not have to be that careful about
> > fork(2) and possible data loss due to COW - we certainly had reports of
> > O_DIRECT IO losing data due to fork(2) and COW, precisely because its
> > behavior is so subtle... But as I wrote above, this is not urgent since
> > that problematic behavior has existed since the beginning of O_DIRECT
> > IO in Linux.
>
> Yes, I agree - what I was thinking is to do this for FOLL_LONGTERM only
> in the rc and then a small patch to widen it for the next cycle so it
> can be tested in linux-next for a reasonable time period.
>
> Interesting to hear you confirm that the block layer has seen subtle
> user problems with this as well.
>
> Jason
>

Hi Jason, Linus,
Sorry for reviving this thread, but I've filed a bug against this change:
https://bugzilla.kernel.org/show_bug.cgi?id=215616

Over the past week, I've been bisecting a problem we hit in one of our
new demos running on our Gaudi accelerator, and after a very long
bisection I arrived at this commit.
All the details are in the bug report, but the bottom line is that this
patch somehow causes corruption when the NUMA balancing feature is
enabled AND we don't use process affinity AND we use GUP to pin pages
so our accelerator can DMA to/from system memory.

Any one of the following makes the bug disappear: disabling NUMA
balancing, using process affinity to bind to a specific NUMA node, or
reverting this patch (the first two workarounds are sketched below).
I validated the bug and the revert on kernels 5.9, 5.11 and 5.17-rc4.
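
For reference, here is a minimal sketch of those first two workarounds.
The sysctl knob and numactl flags are standard; the node number and the
./demo binary are placeholders, not our actual setup:

  # disable automatic NUMA balancing at runtime
  echo 0 > /proc/sys/kernel/numa_balancing

  # or bind the workload's CPUs and memory to a single NUMA node
  numactl --cpunodebind=0 --membind=0 ./demo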

You can see our GUP code in get_user_memory() in
drivers/misc/habanalabs/common/memory.c; a sketch of the pattern is
below. It is fairly standard, and I think I got that line from Daniel
(cc'ed on this email).
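
For those without a tree handy, here is a minimal sketch of that kind
of long-term pin. This is not the exact driver code: the helper name is
made up, and the flag combination and error handling are simplified:

#include <linux/mm.h>
#include <linux/slab.h>

/*
 * Minimal sketch of a driver-side long-term GUP pin, simplified from
 * the pattern in get_user_memory(); not the exact habanalabs code.
 */
static int sketch_pin_user_memory(u64 addr, u64 size,
				  struct page ***pages, int *npages)
{
	int rc, n = DIV_ROUND_UP(offset_in_page(addr) + size, PAGE_SIZE);

	*pages = kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
	if (!*pages)
		return -ENOMEM;

	/* pin the user pages so the device can DMA to/from them */
	rc = pin_user_pages_fast(addr & PAGE_MASK, n,
				 FOLL_WRITE | FOLL_LONGTERM, *pages);
	if (rc != n) {
		if (rc > 0)
			unpin_user_pages(*pages, rc);
		kvfree(*pages);
		return rc < 0 ? rc : -EFAULT;
	}

	*npages = n;
	return 0;
}

FOLL_LONGTERM is what marks the pin's lifetime as userspace-controlled,
which is exactly the case discussed earlier in this thread.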

I would appreciate help from the mm experts here in understanding how
to fix this, as it looks as if this simplification caused, or exposed,
some race between the NUMA migration code and GUP.

Thanks,
Oded