Re: linux-next: manual merge of the akpm-current tree with the tip tree

From: Nadav Amit
Date: Tue Aug 15 2017 - 03:52:06 EST


Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> On Mon, Aug 14, 2017 at 05:07:19AM +0000, Nadav Amit wrote:
>>>> So I'm not entirely clear about this yet.
>>>>
>>>> How about:
>>>>
>>>>
>>>>         CPU0                            CPU1
>>>>
>>>>         tlb_gather_mmu()
>>>>
>>>>         lock PTLn
>>>>         no mod
>>>>         unlock PTLn
>>>>
>>>>                                         tlb_gather_mmu()
>>>>
>>>>                                         lock PTLm
>>>>                                         mod
>>>>                                         include in tlb range
>>>>                                         unlock PTLm
>>>>
>>>>         lock PTLn
>>>>         mod
>>>>         unlock PTLn
>>>>
>>>>                                         tlb_finish_mmu()
>>>>                                           force = mm_tlb_flush_nested(tlb->mm);
>>>>                                           arch_tlb_finish_mmu(force);
>>>>
>>>>
>>>>         ... more ...
>>>>
>>>>         tlb_finish_mmu()
>>>>
>>>>
>>>>
>>>> In this case you also want CPU1's mm_tlb_flush_nested() call to return
>>>> true, right?
>>>
>>> No, because CPU 1 modified the pte and added it into the tlb range,
>>> so regardless of nested it will flush the TLB, and there is no stale
>>> TLB problem.
>
>> To clarify: the main problem that these patches address is when the first
>> CPU updates the PTE, and the second CPU sees the updated value and thinks:
>> "the PTE is already what I wanted - no flush is needed".
>
> OK, that simplifies things.
>
>> For some reason (I would assume intentionally), all the examples here first
>> "do not modify" the PTE, and then modify it - which is not an "interesting"
>> case.
>
> Depends on what you call 'interesting' :-) They are 'interesting' to
> make work from a memory ordering POV. And since I didn't get that they
> were excluded from the set, I worried.
>
> In fact, if they were to be included, I couldn't make it work at all. So
> I'm really glad to hear we can disregard them.
>
>> However, based on what I understand about the memory barriers, I think
>> there is indeed a missing barrier before the read in
>> mm_tlb_flush_nested(). IIUC, using smp_mb__after_unlock_lock() in this case,
>> before the read, would solve the problem with the least impact on systems
>> with strong memory ordering.
>
> No, all is well. If, as you say, we're naturally constrained to the case
> where we only care about prior modification, we can rely on the RCpc PTL
> locks.
>
> Consider:
>
>
>         CPU0                            CPU1
>
>                                         tlb_gather_mmu()
>
>         tlb_gather_mmu()
>           inc ----------------.
>                               | (inc is constrained by RELEASE)
>         lock PTLn             |
>         mod                   ^
>         unlock PTLn ------------------> lock PTLn
>                               v         no mod
>                               |         unlock PTLn
>                               |
>                               |         lock PTLm
>                               |         mod
>                               |         include in tlb range
>                               |         unlock PTLm
>                               |
>         (read is constrained  |
>          by ACQUIRE)          |
>                               |         tlb_finish_mmu()
>                               `---------> force = mm_tlb_flush_nested(tlb->mm);
>                                           arch_tlb_finish_mmu(force);
>
>
>         ... more ...
>
>         tlb_finish_mmu()
>
>
> Then CPU1's acquire of PTLn orders against CPU0's release of that same
> PTLn which guarantees we observe both its (prior) modified PTE and the
> mm->tlb_flush_pending increment from tlb_gather_mmu().
>
> So all we need for mm_tlb_flush_nested() to work is having acquired the
> right PTL at least once before calling it.
>
> At the same time, the decrements need to be after the TLB invalidate is
> complete; this ensures that _IF_ we observe the decrement, we must've
> also observed the corresponding invalidate.
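
For my own understanding I tried to write the first part of that argument
down as a small userspace model - C11 atomics plus a pthread mutex standing
in for the PTL, and an atomic_int standing in for mm->tlb_flush_pending.
This is only a sketch of the ordering, not the kernel code:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER; /* "PTLn" */
static int pte;                                          /* protected by ptl */
static atomic_int tlb_flush_pending;

static void *cpu0(void *arg)
{
        /* tlb_gather_mmu(): increment before touching any PTE */
        atomic_fetch_add_explicit(&tlb_flush_pending, 1, memory_order_relaxed);

        pthread_mutex_lock(&ptl);
        pte = 1;                        /* mod */
        pthread_mutex_unlock(&ptl);     /* RELEASE orders both the inc and the mod */

        /* ... flush, then decrement (not modelled) ... */
        return NULL;
}

static void *cpu1(void *arg)
{
        int saw_mod, pending;

        atomic_fetch_add_explicit(&tlb_flush_pending, 1, memory_order_relaxed);

        pthread_mutex_lock(&ptl);       /* ACQUIRE pairs with cpu0's unlock */
        saw_mod = pte;                  /* already modified -> "no mod" */
        pthread_mutex_unlock(&ptl);

        /* mm_tlb_flush_nested(): a plain read after the ACQUIRE is enough */
        pending = atomic_load_explicit(&tlb_flush_pending, memory_order_relaxed);

        /* If we saw cpu0's PTE store, we must also see its increment. */
        if (saw_mod && pending < 2)
                fprintf(stderr, "impossible: saw the mod but not the inc\n");
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;

        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
}

If cpu1 never acquires a lock that cpu0 released after its increment (the
PTLm-only case from your first example), the model gives no such guarantee -
which, as clarified above, is fine because that CPU flushes everything it
modified itself anyway.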
>
> Something like the below is then sufficient.
>
> ---
> Subject: mm: Clarify tlb_flush_pending barriers
> From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Date: Fri, 11 Aug 2017 16:04:50 +0200
>
> Better document the ordering around tlb_flush_pending.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> ---
> include/linux/mm_types.h | 78 +++++++++++++++++++++++++++--------------------
> 1 file changed, 45 insertions(+), 33 deletions(-)
>
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -526,30 +526,6 @@ extern void tlb_gather_mmu(struct mmu_ga
> extern void tlb_finish_mmu(struct mmu_gather *tlb,
> unsigned long start, unsigned long end);
>
> -/*
> - * Memory barriers to keep this state in sync are graciously provided by
> - * the page table locks, outside of which no page table modifications happen.
> - * The barriers are used to ensure the order between tlb_flush_pending updates,
> - * which happen while the lock is not taken, and the PTE updates, which happen
> - * while the lock is taken, are serialized.
> - */
> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> -{
> - /*
> - * Must be called with PTL held; such that our PTL acquire will have
> - * observed the store from set_tlb_flush_pending().
> - */
> - return atomic_read(&mm->tlb_flush_pending) > 0;
> -}
> -
> -/*
> - * Returns true if there are two above TLB batching threads in parallel.
> - */
> -static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
> -{
> - return atomic_read(&mm->tlb_flush_pending) > 1;
> -}
> -
> static inline void init_tlb_flush_pending(struct mm_struct *mm)
> {
> atomic_set(&mm->tlb_flush_pending, 0);
> @@ -558,7 +534,6 @@ static inline void init_tlb_flush_pendin
> static inline void inc_tlb_flush_pending(struct mm_struct *mm)
> {
> atomic_inc(&mm->tlb_flush_pending);
> -
> /*
> * The only time this value is relevant is when there are indeed pages
> * to flush. And we'll only flush pages after changing them, which
> @@ -580,24 +555,61 @@ static inline void inc_tlb_flush_pending
> * flush_tlb_range();
> * atomic_dec(&mm->tlb_flush_pending);
> *
> - * So the =true store is constrained by the PTL unlock, and the =false
> - * store is constrained by the TLB invalidate.
> + * Where the increment is constrained by the PTL unlock, it thus
> + * ensures that the increment is visible if the PTE modification is
> + * visible. After all, if there is no PTE modification, nobody cares
> + * about TLB flushes either.
> + *
> + * This very much relies on users (mm_tlb_flush_pending() and
> + * mm_tlb_flush_nested()) only caring about _specific_ PTEs (and
> + * therefore specific PTLs), because with SPLIT_PTE_PTLOCKS and RCpc
> + * locks (PPC) the unlock of one doesn't order against the lock of
> + * another PTL.
> + *
> + * The decrement is ordered by the flush_tlb_range(), such that
> + * mm_tlb_flush_pending() will not return false unless all flushes have
> + * completed.
> */
> }
>
> -/* Clearing is done after a TLB flush, which also provides a barrier. */
> static inline void dec_tlb_flush_pending(struct mm_struct *mm)
> {
> /*
> - * Guarantee that the tlb_flush_pending does not not leak into the
> - * critical section, since we must order the PTE change and changes to
> - * the pending TLB flush indication. We could have relied on TLB flush
> - * as a memory barrier, but this behavior is not clearly documented.
> + * See inc_tlb_flush_pending().
> + *
> + * This cannot be smp_mb__before_atomic() because smp_mb() simply does
> + * not order against TLB invalidate completion, which is what we need.
> + *
> + * Therefore we must rely on tlb_flush_*() to guarantee order.
> */
> - smp_mb__before_atomic();
> atomic_dec(&mm->tlb_flush_pending);
> }
>
> +static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> +{
> + /*
> + * Must be called after having acquired the PTL; orders against that
> + * PTLs release and therefore ensures that if we observe the modified
> + * PTE we must also observe the increment from inc_tlb_flush_pending().
> + *
> + * That is, it only guarantees to return true if there is a flush
> + * pending for _this_ PTL.
> + */
> + return atomic_read(&mm->tlb_flush_pending);
> +}
> +
> +static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
> +{
> + /*
> + * Similar to mm_tlb_flush_pending(), we must have acquired the PTL
> + * for which there is a TLB flush pending in order to guarantee
> + * we've seen both that PTE modification and the increment.
> + *
> + * (no requirement on actually still holding the PTL, that is irrelevant)
> + */
> + return atomic_read(&mm->tlb_flush_pending) > 1;
> +}
> +
> struct vm_fault;
>
> struct vm_special_mapping {
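
Just to make sure I read the end result correctly: if I remember the current
-mm code, the caller side then stays roughly like this (from memory, so the
exact arch_tlb_finish_mmu() signature may differ):

void tlb_finish_mmu(struct mmu_gather *tlb,
                    unsigned long start, unsigned long end)
{
        /*
         * If another thread is concurrently batching flushes for this mm,
         * we may have skipped PTEs it already modified, so force a flush
         * of the full range instead of only what we gathered ourselves.
         */
        bool force = mm_tlb_flush_nested(tlb->mm);

        arch_tlb_finish_mmu(tlb, start, end, force);
        dec_tlb_flush_pending(tlb->mm);
}

i.e. by the time we get here we have taken (and released) every PTL we
looked at while gathering, so the read in mm_tlb_flush_nested() is ordered
as you describe.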

Thanks for the detailed explanation. I will pay more attention next time.