Re: [RFC PATCH 09/37] mm: add per-mm mmap sequence counter for speculative page fault handling.

From: Michel Lespinasse
Date: Wed Apr 07 2021 - 16:50:53 EST


On Wed, Apr 07, 2021 at 04:47:34PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 06:44:34PM -0700, Michel Lespinasse wrote:
> > The counter's write side is hooked into the existing mmap locking API:
> > mmap_write_lock() increments the counter to the next (odd) value, and
> > mmap_write_unlock() increments it again to the next (even) value.
> >
> > The counter's speculative read side is supposed to be used as follows:
> >
> > seq = mmap_seq_read_start(mm);
> > if (seq & 1)
> > 	goto fail;
> > .... speculative handling here ....
> > if (!mmap_seq_read_check(mm, seq))
> > 	goto fail;
> >
> > This API guarantees that, if none of the "fail" tests abort
> > speculative execution, the speculative code section did not run
> > concurrently with any mmap writer.
>
> So this is obviously safe, but it's also super excessive. Any change,
> anywhere, will invalidate and abort a SPF.
>
> Since you make a complete copy of the vma, you could memcmp it in its
> entirety instead of this.

Yeah, this is a deliberate choice to start with the simplest possible
approach, even though it can cause more SPF aborts than strictly
necessary.
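
For reference, the simple approach is essentially a seqcount over the
whole mm: mmap_write_lock() bumps the counter to an odd value,
mmap_write_unlock() bumps it back to even, and the speculative side
samples and rechecks it with the appropriate barriers. Roughly (a
simplified sketch of the idea, not the literal patch):

/* Write side, hooked into the mmap locking API. */
static inline void mmap_write_lock(struct mm_struct *mm)
{
	down_write(&mm->mmap_lock);
	mm->mmap_seq++;		/* odd: a writer is active */
	smp_wmb();		/* counter update visible before map changes */
}

static inline void mmap_write_unlock(struct mm_struct *mm)
{
	smp_wmb();		/* map changes visible before counter update */
	mm->mmap_seq++;		/* even again: writer done */
	up_write(&mm->mmap_lock);
}

/* Speculative read side. */
static inline unsigned long mmap_seq_read_start(struct mm_struct *mm)
{
	unsigned long seq = READ_ONCE(mm->mmap_seq);

	smp_rmb();		/* sample the counter before speculative reads */
	return seq;
}

static inline bool mmap_seq_read_check(struct mm_struct *mm,
				       unsigned long seq)
{
	smp_rmb();		/* finish speculative reads before rechecking */
	return seq == READ_ONCE(mm->mmap_seq);
}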

It's not clear to me that just comparing the original and current vma
attributes would always be sufficient - in some cases we may also have
to worry about an ABA situation, where the attributes are changed and
then changed back concurrently with the fault, so the comparison passes
even though the fault raced with a writer. However, the comparison you
suggest would be safe at least in the case where the original pte was
pte_none.
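
Concretely, the memcmp-based check would look something like this
(hypothetical helper name; assumes the fault took a full copy of the
vma at the start, and that the lockless vma lookup itself is made
safe, e.g. by RCU-freeing vmas):

/*
 * Hypothetical sketch: instead of aborting whenever the per-mm
 * sequence count changed, re-read the current vma and compare it
 * in its entirety against the copy taken at fault start.
 *
 * ABA caveat: if attributes were changed and changed back, the
 * memcmp still passes, so this is only clearly safe when the
 * original pte was pte_none.
 */
static bool spf_vma_unchanged(struct mm_struct *mm, unsigned long address,
			      const struct vm_area_struct *copy)
{
	struct vm_area_struct *vma;
	bool same;

	rcu_read_lock();		/* assumes vmas are freed via RCU */
	vma = find_vma(mm, address);	/* lockless lookup hand-waved here */
	same = vma && !memcmp(vma, copy, sizeof(*vma));
	rcu_read_unlock();

	return same;
}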

At some point, I did try implementing the optimization you suggest -
if the global counter has changed, re-fetch the current vma and
compare it against the old copy - but it was basically no help for the
(Android) workloads I looked at; too few aborts were caused by the
global counter in the first place. Note that my patchset only handles
anon and page cache faults speculatively, so generally there wasn't
much time for the counter to change.
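
Wired into the read-side protocol, that optimization amounted to
treating the counter check as a fast path and only falling back to the
vma comparison when it fails - again a sketch, using the hypothetical
helper above:

static vm_fault_t do_spf(struct mm_struct *mm, unsigned long address,
			 const struct vm_area_struct *vma_copy)
{
	unsigned long seq;

	seq = mmap_seq_read_start(mm);
	if (seq & 1)
		return VM_FAULT_RETRY;	/* writer active: take the slow path */

	/* ... speculative fault handling against vma_copy ... */

	if (!mmap_seq_read_check(mm, seq) &&
	    !spf_vma_unchanged(mm, address, vma_copy))
		return VM_FAULT_RETRY;	/* raced with a writer: retry locked */

	return 0;	/* speculation succeeded */
}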

But yes, this may not hold across all workloads, and we may want to do
something smarter if/when a problem workload is identified.