Re: [BUG?] X86 arch_tlbbatch_flush() seems to be lacking mm_tlb_flush_nested() integration

From: Mel Gorman
Date: Mon Oct 17 2022 - 11:02:57 EST


On Sat, Oct 15, 2022 at 04:47:16PM -0700, Linus Torvalds wrote:
> On Fri, Oct 14, 2022 at 8:51 PM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
> >
> > Unless I am missing something, flush_tlb_batched_pending() would be
> > called and do the flushing at this point, no?
>
> Ahh, yes.
>
> That seems to be doing the right thing, although looking a bit more at
> it, I think it might be improved.
>

To be fair, I originally got it wrong and Nadav caught it almost 2 years
later. However, I think the current behaviour is still ok.

> At least in the zap_pte_range() case, instead of doing a synchronous
> TLB flush if there are pending batched flushes, it might be better if
> flush_tlb_batched_pending() would set the "need_flush_all" bit in the
> mmu_gather structure.
>

I think setting need_flush_all would miss the case where no PTEs were
updated due to a race during unmap. I think it would be safer to check for
a deferred TLB flush in mm_tlb_flush_nested, but I didn't dig too deep.
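
To make the concern concrete, here is a rough user-space sketch. It is not
kernel code: every toy_* name is hypothetical, and it assumes a gather that
skips its final flush when it cleared no PTEs, so that need_flush_all only
widens a flush that was already going to happen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model, not kernel code. All names are hypothetical. */
struct toy_mm {
    bool flush_pending;    /* reclaim deferred a TLB flush for this mm */
    bool last_flush_full;  /* was the most recent flush a full flush? */
    int  flushes;          /* TLB flushes issued so far */
};

struct toy_gather {
    struct toy_mm *mm;
    bool cleared_ptes;     /* at least one PTE was actually cleared */
    bool need_flush_all;   /* widen the flush to the whole mm */
};

/*
 * Finish the gather. Assumption of this sketch: the flush is skipped
 * when nothing was cleared, unless check_pending is set and a deferred
 * flush is outstanding.
 */
static void toy_gather_finish(struct toy_gather *tlb, bool check_pending)
{
    assert(tlb != NULL && tlb->mm != NULL);

    bool pending = check_pending && tlb->mm->flush_pending;

    if (!tlb->cleared_ptes && !pending)
        return;  /* need_flush_all alone does not trigger a flush */

    tlb->mm->flushes++;
    tlb->mm->last_flush_full = tlb->need_flush_all || pending;
    tlb->mm->flush_pending = false;
}

/* Returns true if a deferred flush is still outstanding afterwards. */
static bool toy_unmap(struct toy_mm *mm, int ptes_cleared, bool check_pending)
{
    struct toy_gather tlb = { .mm = mm };

    if (mm->flush_pending)
        tlb.need_flush_all = true;  /* instead of a synchronous flush */

    /* A racing unmap may find nothing left to clear. */
    tlb.cleared_ptes = ptes_cleared > 0;

    toy_gather_finish(&tlb, check_pending);
    return mm->flush_pending;
}
```

In this model, toy_unmap(&mm, 0, false) with a flush pending returns true
(the deferred flush is never issued), while passing check_pending = true
issues it even though no PTEs were cleared.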

> That would possibly avoid that extra TLB flush entirely - since
> *normally* zap_page_range() will cause a TLB flush anyway.
>
> Maybe it doesn't matter.
>

While it could be better, I still think the simple approach is sufficient
and it can be used in each affected area. For example, move_ptes does not
use mmu_gather, so either it would have to be converted to use mmu_gather,
or there would have to be both mmu_gather and !mmu_gather detection of TLB
flushes deferred from reclaim context, and I'm not sure it's worth it.
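
As a rough user-space sketch of why the simple approach fits every caller
(again not kernel code; all sketch_* names are hypothetical), the same
helper can be called from a path that would use a gather structure and from
one that would not:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model, not kernel code. All names are hypothetical. */
struct sketch_mm {
    bool flush_pending;  /* reclaim deferred a TLB flush for this mm */
    int  flushes;        /* TLB flushes issued so far */
};

/* The simple approach: flush right away if reclaim left one pending. */
static void sketch_flush_pending(struct sketch_mm *mm)
{
    assert(mm != NULL);

    if (!mm->flush_pending)
        return;

    mm->flushes++;
    mm->flush_pending = false;
}

/* Stand-in for a path that would use a gather structure. */
static void sketch_zap(struct sketch_mm *mm, int ptes)
{
    sketch_flush_pending(mm);
    if (ptes > 0)
        mm->flushes++;   /* the flush the unmap would issue anyway */
}

/* Stand-in for a path with no gather structure, like move_ptes. */
static void sketch_move(struct sketch_mm *mm, int ptes)
{
    sketch_flush_pending(mm);
    if (ptes > 0)
        mm->flushes++;   /* the flush the move would issue anyway */
}
```

The cost is visible in the model too: with a flush pending and PTEs to
clear, sketch_zap() issues two flushes where one would have done, which is
the spurious flush under discussion.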

Once reclaim is active, performance is already slightly degraded as TLBs
are being flushed anyway, and it's possible that active pages are being
reclaimed that will have to be refaulted, which is even more costly. For the
scenario Jann was concerned with, pages belonging to the task are being
reclaimed while mmap/munmap operations are also happening. munmap/mmap
is sufficiently expensive that a spurious flush due to parallel reclaim
should have negligible additional overhead and I'd be surprised if the
additional runtime cost can be reliably measured.

--
Mel Gorman
SUSE Labs