Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

From: Gang Li
Date: Fri May 05 2023 - 08:29:15 EST


Hi,

I found that in `ghes_unmap`, which runs while holding a spinlock, arm64 and
x86 use different strategies for flushing the TLB.

# arm64 call trace:
```
holding a spin lock
ghes_unmap
clear_fixmap
__set_fixmap
flush_tlb_kernel_range
```
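
For reference, here is a simplified sketch of the arm64 side as I read it
(paraphrased from arch/arm64/mm/mmu.c; the exact code differs between kernel
versions). The clearing branch is the one ghes_unmap hits:

```
/* Simplified sketch of arm64 __set_fixmap(), not verbatim kernel code. */
void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t flags)
{
	unsigned long addr = __fix_to_virt(idx);
	pte_t *ptep = fixmap_pte(addr);

	if (pgprot_val(flags)) {
		/* set_fixmap(): install the mapping */
		set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
	} else {
		/* clear_fixmap(): tear down the PTE, then do a broadcast
		 * (inner-shareable) invalidation of the range */
		pte_clear(&init_mm, addr, ptep);
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
	}
}
```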

# x86 call trace:
```
holding a spin lock
ghes_unmap
clear_fixmap
__set_fixmap
mmu.set_fixmap
native_set_fixmap
__native_set_fixmap
set_pte_vaddr
set_pte_vaddr_p4d
__set_pte_vaddr
flush_tlb_one_kernel
```
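
And the tail of the x86 side (paraphrased from arch/x86/mm/init_64.c, again
simplified and version-dependent):

```
/* Simplified sketch of x86 __set_pte_vaddr(), not verbatim kernel code. */
static void __set_pte_vaddr(pud_t *pud, unsigned long vaddr, pte_t new_pte)
{
	pmd_t *pmd = pmd_offset(pud, vaddr);
	pte_t *pte = pte_offset_kernel(pmd, vaddr);

	set_pte(pte, new_pte);

	/* Flush only this one mapping, and only on the local CPU. */
	flush_tlb_one_kernel(vaddr);
}
```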

As we can see, ghes_unmap on arm64 eventually calls flush_tlb_kernel_range
to broadcast TLB invalidation, whereas on x86 it ends up in
flush_tlb_one_kernel.

Why does arm64 need to broadcast TLB invalidation in ghes_unmap when only
one CPU has accessed this memory area?

Mark Rutland explained this in
https://lore.kernel.org/lkml/369d1be2-d418-1bfb-bfc2-b25e4e542d76@xxxxxxxxxxxxx/:

> The architecture (arm64) allows a CPU to allocate TLB entries at any time
> for any reason, for any valid translation table entries reachable from the
> root in TTBR{0,1}_ELx. That can be due to speculation, prefetching, and/or
> other reasons.
>
> Due to that, it doesn't matter whether or not a CPU explicitly accesses a
> memory location -- TLB entries can be allocated regardless. Consequently,
> the spinlock doesn't make any difference.


So arm64 broadcasts TLB invalidation in ghes_unmap because TLB entries can be
allocated regardless of whether a CPU explicitly accesses the memory.
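
To make "broadcast" concrete, here is a simplified sketch of arm64's
flush_tlb_kernel_range as I understand it (paraphrased from
arch/arm64/include/asm/tlbflush.h). The "is" (Inner Shareable) suffix on the
TLBI means the hardware broadcasts the invalidation to all CPUs, so no IPIs
are needed, but it is still a system-wide operation:

```
/* Simplified sketch of arm64 flush_tlb_kernel_range(), not verbatim. */
static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
	unsigned long addr;

	if ((end - start) > (MAX_TLBI_OPS * PAGE_SIZE)) {
		flush_tlb_all();
		return;
	}

	start = __TLBI_VADDR(start, 0);
	end = __TLBI_VADDR(end, 0);

	dsb(ishst);
	for (addr = start; addr < end; addr += 1 << (PAGE_SHIFT - 12))
		__tlbi(vaale1is, addr);	/* broadcast to the Inner Shareable domain */
	dsb(ish);
	isb();
}
```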

Then why doesn't x86 broadcast TLB invalidation in ghes_unmap? Is there a
difference between x86 and arm64 in their TLB allocation and invalidation
strategies?
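
For comparison, my (possibly incomplete) understanding of the x86 side,
sketched from arch/x86/mm/tlb.c, is that flush_tlb_one_kernel only touches
the local CPU's TLB:

```
/* Simplified sketch of x86 flush_tlb_one_kernel(), not verbatim. */
void flush_tlb_one_kernel(unsigned long addr)
{
	/* INVLPG (or a paravirt equivalent) on the current CPU only. */
	flush_tlb_one_user(addr);

	/* With PTI, other PCID contexts on this CPU are only marked for
	 * invalidation later, still without any cross-CPU broadcast. */
	if (static_cpu_has(X86_FEATURE_PTI))
		this_cpu_write(cpu_tlbstate.invalidate_other, true);
}
```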

Thanks,
Gang Li