[PATCH v5 0/4] shoot lazy tlbs

From: Nicholas Piggin
Date: Mon Nov 08 2021 - 23:11:31 EST


Since v4, this fixes a kthread_use_mm refcounting bug and adds some
comments in code and changelogs around the kthread_use_mm change in
patch 1 (due to akpm's comment -- thanks).

It also adds and improves comments in code, changelogs, and Kconfig
options. The overall design is unchanged though. Please merge.

This series has suffered some issues getting agreement, so I would
like to address a few sticking points or misconceptions up front,
which hopefully can result in constructive disagreement and actual
actionable feedback.

* That the lazy mm scheme is complicated or bug prone.

This is not true. The concept is trivial and the core code is
extremely simple, basically unchanged since Linus' active_mm email
20 years ago in the 2.3 days.

This series leaves the lazy tlb switching and ->active_mm semantics
entirely unchanged. It does change the refcounting, but the effects
are hidden under wrappers. It does not add anything new for code
outside those few places to think about except that they must specify
_lazy_mm when refcounting this particular type of reference. This
is not much of a problem since lazy mm references never "escape"
from specific switching sequences and become hard to track. Refs
that go into the wider world are always normal ones (i.e., created
by explicit mmgrab or kthread_use_mm).
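
As a rough illustration of what "hidden under wrappers" means, here is
a minimal userspace C sketch. It is not the kernel code: every name in
it (struct mm_model, model_mmgrab_lazy, lazy_tlb_refcount and so on) is
a hypothetical stand-in, and the runtime flag only models the config
option that makes lazy mm refcounting optional.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's struct mm_struct. */
struct mm_model {
	atomic_int count;	/* models the mm reference count */
};

/* Models the config choice: true means lazy references are refcounted. */
static bool lazy_tlb_refcount = true;

static void model_mm_init(struct mm_model *mm)
{
	atomic_init(&mm->count, 1);	/* the creator holds one reference */
}

static int model_mm_count(struct mm_model *mm)
{
	return atomic_load(&mm->count);
}

/* Normal references: always counted. */
static void model_mmgrab(struct mm_model *mm)
{
	atomic_fetch_add(&mm->count, 1);
}

static void model_mmdrop(struct mm_model *mm)
{
	int old = atomic_fetch_sub(&mm->count, 1);

	assert(old > 0);	/* never drop below zero */
	(void)old;
}

/* Lazy references: callers use these wrappers and never test the flag. */
static void model_mmgrab_lazy(struct mm_model *mm)
{
	if (lazy_tlb_refcount)
		model_mmgrab(mm);
}

static void model_mmdrop_lazy(struct mm_model *mm)
{
	if (lazy_tlb_refcount)
		model_mmdrop(mm);
}
```

Callers only choose between the normal and the lazy helper. Whether a
lazy reference touches the count is decided in one place.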

* That membarrier code is complicated

This is true, but my series changes nothing about membarrier. It is
entirely about lazy mm, which had been virtually unchanged for many
years before membarrier existed.
membarrier code takes advantage of memory ordering in scheduler
switch code that lazy mm refcounting was providing, so this series
adds one commented smp_mb() ifdef there to replace the refcount op
being removed. That does not affect the ability to change membarrier
code in future because the refcounted path has to be accounted for
here anyway.

In other words, any change to membarrier code has to deal with the
refcounted lazy mm path that exists today, and once it does, dealing
with the non-refcounted option is trivial.
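
The ordering point can be sketched as a userspace model with
hypothetical names. This is not the actual scheduler code. It assumes
the refcount op being removed is an atomic read-modify-write on the
drop side that implies full ordering, and it uses a C11 seq_cst fence
as a stand-in for smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical model of dropping a lazy mm reference at the end of a
 * context switch. 'refcounted' models the config option; real code
 * would select one branch at build time rather than at run time.
 */
static void model_drop_lazy_ref(atomic_int *mm_count, bool refcounted)
{
	if (refcounted) {
		/* seq_cst read-modify-write: treated here as fully ordered. */
		int old = atomic_fetch_sub(mm_count, 1);

		assert(old > 0);
		(void)old;
	} else {
		/* No refcount op, so supply the ordering explicitly. */
		atomic_thread_fence(memory_order_seq_cst);
	}
}
```

Either branch leaves the same ordering guarantee for code that runs
after the switch, which is why the two paths can be reasoned about
together.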

* That active_mm should be removed from core code.

I don't know how to address this other than to say it's not a good or
well thought out idea. This is not happening, and it is certainly not
related to my series, which does not change ->active_mm semantics at
all.

* That this series provides an option for archs to enable that results
in stale ->active_mm pointers, whereby it is up to the arch to
ensure nothing dereferences those pointers.

This is FUD. It has always been false. Archs that enable
MMU_LAZY_TLB_SHOOTDOWN never ever have stale ->active_mm pointers,
ever. If active_mm is non-NULL, then that gives exactly the same
guarantees as you have today.
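
A toy userspace model of why no stale pointer can remain, under stated
assumptions: before an mm goes away, every CPU still using it lazily is
switched to a placeholder mm that is always valid. The names
(model_active_mm, model_init_mm, model_shoot_lazy_tlbs) are
hypothetical, and a plain loop stands in for whatever cross-CPU
mechanism the real code uses.

```c
#include <assert.h>

#define MODEL_NR_CPUS 4

/* Hypothetical stand-in for an address space. */
struct mm_model {
	int id;
};

/* Placeholder mm that is never freed, so pointing at it is always safe. */
static struct mm_model model_init_mm = { 0 };

/* Models the lazy active_mm pointer of whatever runs on each CPU. */
static struct mm_model *model_active_mm[MODEL_NR_CPUS];

/*
 * Before 'dying' is freed, move every CPU that still uses it lazily
 * over to the placeholder. Returns the number of CPUs switched.
 */
static int model_shoot_lazy_tlbs(struct mm_model *dying)
{
	int shot = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (model_active_mm[cpu] == dying) {
			model_active_mm[cpu] = &model_init_mm;
			shot++;
		}
	}
	return shot;
}

/* Counts CPUs whose lazy pointer still refers to 'mm'. */
static int model_cpus_using(const struct mm_model *mm)
{
	int n = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (model_active_mm[cpu] == mm)
			n++;
	return n;
}
```

After model_shoot_lazy_tlbs() returns, no per-CPU pointer refers to the
dying mm, so freeing it cannot leave a dangling reference behind.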

* That performance of IPIs or other things is a problem.

I posted actual numbers showing this was not a concern, and listed
some options that could reduce them further if needed. No numbers
were ever posted to support the other side of the argument.

* That the series is a powerpc specific thing.

Untrue. I have trivial sparc and alpha conversions; those were the
first two architectures I looked at for which I have SMP qemu
environments.

* That this series somehow prevents future changes or improvements.

It doesn't.

* That the series is very complex, code is bad or has problems.

Look at the patches. They seem pretty small and simple to me. I am
happy to address specific issues that are pointed out though, and
have done so.

* That x86 is relevant here.

This series does not touch or affect x86 in any way. x86 has gone off
and done its own horrendously complicated and under-documented thing
with active_mm and the lazy mm concept. But that has been entirely
hidden from core code by the arch context switching hooks. Core code
continues to operate on the concept of ->mm and ->active_mm, and this
series does not change that at all. x86 is no more or less divorced
from that after the series.

Nothing the series does constrains x86 or future changes to it. The
option cannot be used immediately by x86, but there is no reason x86
could not be adapted to use it, or change its scheme to something
else entirely. Where code can be adapted to be shared or made usable by
x86, I have no problem with doing that.

If I've missed something or I've got anything wrong with the above,
I'm happy to hear it.

Thanks,
Nick

Nicholas Piggin (4):
  lazy tlb: introduce lazy mm refcount helper functions
  lazy tlb: allow lazy tlb mm refcounting to be configurable
  lazy tlb: shoot lazies, a non-refcounting lazy tlb option
  powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

 Documentation/vm/active_mm.rst       |  6 ++++
 arch/Kconfig                         | 32 +++++++++++++++++
 arch/arm/mach-rpc/ecard.c            |  2 +-
 arch/powerpc/Kconfig                 |  1 +
 arch/powerpc/kernel/smp.c            |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c |  4 +--
 fs/exec.c                            |  2 +-
 include/linux/sched/mm.h             | 20 +++++++++++
 kernel/cpu.c                         |  2 +-
 kernel/exit.c                        |  2 +-
 kernel/fork.c                        | 51 ++++++++++++++++++++++++++++
 kernel/kthread.c                     | 21 +++++++-----
 kernel/sched/core.c                  | 35 +++++++++++++------
 kernel/sched/sched.h                 |  4 ++-
 14 files changed, 158 insertions(+), 26 deletions(-)

--
2.23.0