[QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

From: Gang Li
Date: Wed Apr 26 2023 - 23:28:27 EST


Hi all,

I have encountered a performance issue on our ARM64 machine, which seems
to be caused by the flush_tlb_kernel_range.

Here is the stack on the ARM64 machine:

# ARM64:
```
ghes_unmap
clear_fixmap
__set_fixmap
flush_tlb_kernel_range
```

As we can see, the ARM64 implementation eventually calls
flush_tlb_kernel_range, which flushes the TLB on all cores. However, on
AMD64, the implementation calls flush_tlb_one_kernel instead.

# AMD64:
```
ghes_unmap
clear_fixmap
__set_fixmap
mmu.set_fixmap
native_set_fixmap
__native_set_fixmap
set_pte_vaddr
set_pte_vaddr_p4d
__set_pte_vaddr
flush_tlb_one_kernel
```

On our ARM64 machine, flush_tlb_kernel_range is causing a noticeable
performance degradation.

This arm64 patch said:
https://lore.kernel.org/all/20161201135112.15396-1-fu.wei@xxxxxxxxxx/
(commit 9f9a35a7b654e006250530425eb1fb527f0d32e9)

```
/*
* Despite its name, this function must still broadcast the TLB
* invalidation in order to ensure other CPUs don't end up with junk
* entries as a result of speculation. Unusually, its also called in
* IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
* TLB broadcasting, then we're in trouble here.
*/
static inline void arch_apei_flush_tlb_one(unsigned long addr)
{
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}
```

1. I am curious to know the reason behind the design choice of flushing
the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
the TLB on a single core. Are there any TLB design details that make a
difference here?

2. Is it possible to let the ARM64 to flush the TLB on just one core,
similar to the AMD64?

3. If so, would there be any potential drawbacks or limitations to
making such a change?

Thanks,

Gang Li