Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
From: Gang Li
Date: Fri May 05 2023 - 08:29:15 EST
Hi,
I found that in `ghes_unmap`, which runs while holding a spinlock, arm64
and x86 use different strategies for flushing the TLB.
# arm64 call trace:
```
holding a spin lock
ghes_unmap
clear_fixmap
__set_fixmap
flush_tlb_kernel_range
```
# x86 call trace:
```
holding a spin lock
ghes_unmap
clear_fixmap
__set_fixmap
mmu.set_fixmap
native_set_fixmap
__native_set_fixmap
set_pte_vaddr
set_pte_vaddr_p4d
__set_pte_vaddr
flush_tlb_one_kernel
```
As we can see, ghes_unmap on arm64 eventually calls
flush_tlb_kernel_range, which broadcasts the TLB invalidation to all
CPUs. On x86, however, ghes_unmap calls flush_tlb_one_kernel, which only
invalidates on the local CPU.
Why does arm64 need to broadcast TLB invalidation in ghes_unmap, when
only one CPU has accessed this memory area?
Mark Rutland said in
https://lore.kernel.org/lkml/369d1be2-d418-1bfb-bfc2-b25e4e542d76@xxxxxxxxxxxxx/:

> The architecture (arm64) allows a CPU to allocate TLB entries at any
> time for any reason, for any valid translation table entries reachable
> from the root in TTBR{0,1}_ELx. That can be due to speculation,
> prefetching, and/or other reasons.
>
> Due to that, it doesn't matter whether or not a CPU explicitly accesses
> a memory location -- TLB entries can be allocated regardless.
> Consequently, the spinlock doesn't make any difference.
So arm64 broadcasts TLB invalidation in ghes_unmap because TLB entries
can be allocated regardless of whether a CPU explicitly accesses the
memory.
Why doesn't x86 broadcast TLB invalidation in ghes_unmap? Is there any
difference between x86 and arm64 in their TLB allocation and
invalidation strategies?
Thanks,
Gang Li