Re: [PATCH] arm64/mm: Add memory barrier for mm_cid

From: levi.yun
Date: Tue Mar 05 2024 - 16:08:48 EST


Hi Mathieu!

On 05/03/2024 20:01, Mathieu Desnoyers wrote:
On 2024-03-05 09:53, levi.yun wrote:
Currently arm64's switch_mm() doesn't always have an smp_mb()
which the core scheduler code has depended upon since commit:

     commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")

If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear()
can unset the activly used cid when it fails to observe active task after it
sets lazy_put.

By adding an smp_mb() in arm64's check_and_switch_context(),
Guarantee to observe active task after sched_mm_cid_remote_clear()
success to set lazy_put.

This comment from the original implementation of membarrier
MEMBARRIER_CMD_PRIVATE_EXPEDITED states that the original need from
membarrier was to have a full barrier between storing to rq->curr and
return to userspace:

commit 22e4ebb9758 ("membarrier: Provide expedited private command")

commit message:

    * Our TSO archs can do RELEASE without being a full barrier. Look at
      x86 spin_unlock() being a regular STORE for example.  But for those
      archs, all atomics imply smp_mb and all of them have atomic ops in
      switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
      barrier.
        * From all weakly ordered machines, only ARM64 and PPC can do RELEASE,
      the rest does indeed do smp_mb(), so there the spin_unlock() is a full
      barrier and we're good.
        * ARM64 has a very heavy barrier in switch_to(), which suffices.
        * PPC just removed its barrier from switch_to(), but appears to be
      talking about adding something to switch_mm(). So add a
      smp_mb__after_unlock_lock() for now, until this is settled on the PPC
      side.

associated code:

+               /*
+                * The membarrier system call requires each architecture
+                * to have a full memory barrier after updating
+                * rq->curr, before returning to user-space. For TSO
+                * (e.g. x86), the architecture must provide its own
+                * barrier in switch_mm(). For weakly ordered machines
+                * for which spin_unlock() acts as a full memory
+                * barrier, finish_lock_switch() in common code takes
+                * care of this barrier. For weakly ordered machines for
+                * which spin_unlock() acts as a RELEASE barrier (only
+                * arm64 and PowerPC), arm64 has a full barrier in
+                * switch_to(), and PowerPC has
+                * smp_mb__after_unlock_lock() before
+                * finish_lock_switch().
+                */

Which got updated to this by

commit 306e060435d ("membarrier: Document scheduler barrier requirements")

                /*
                 * The membarrier system call requires each architecture
                 * to have a full memory barrier after updating
+                * rq->curr, before returning to user-space.
+                *
+                * Here are the schemes providing that barrier on the
+                * various architectures:
+                * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
+                *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
+                * - finish_lock_switch() for weakly-ordered
+                *   architectures where spin_unlock is a full barrier,
+                * - switch_to() for arm64 (weakly-ordered, spin_unlock
+                *   is a RELEASE barrier),
                 */

However, rseq mm_cid has stricter requirements: the barrier needs to be
issued between store to rq->curr and switch_mm_cid(), which happens
earlier than:

- spin_unlock(),
- switch_to().

So it's fine when the architecture switch_mm happens to have that barrier
already, but less so when the architecture only provides the full barrier
in switch_to() or spin_unlock().

The issue is therefore not specific to arm64, it's actually a bug in the
rseq switch_mm_cid() implementation. All architectures that don't have
memory barriers in switch_mm(), but rather have the full barrier either in
finish_lock_switch() or switch_to() have them too late for the needs of
switch_mm_cid().

Thanks for the great detail explain!



I would recommend one of three approaches here:

A) Add smp_mb() in switch_mm_cid() for all architectures that lack that
   barrier in switch_mm().

B) Figure out if we can move switch_mm_cid() further down in the scheduler
   without breaking anything (within switch_to(), at the very end of
   finish_lock_switch() for instance). I'm not sure we can do that though
   because switch_mm_cid() touches the "prev" which is tricky after
   switch_to().

C) Add barriers in switch_mm() within all architectures that are missing it.

Thoughts ?
IMHO, A) is look good to me.

Because, In case of B), If you assume spin_unlock() for rq->lock has full memory barrier,
I'm not sure about the architecture which using queued_spin_unlock().

When I see the queued_spin_unlock()'s implementation, It implements using smp_store_relasse().
But, when we see the memory_barrier.txt describing MULTICOPY ATOMICITY,
If smp_mb__after_atomic() is implemented with smp_mb(), There might fail to observe.

Am I wrong?

Many thanks!