Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

From: Gang Li
Date: Fri May 05 2023 - 05:49:37 EST


This series of emails accidentally dropped the Cc list, so I am forwarding
the lost emails to the mailing list.

On 2023/4/28 17:27, Mark Rutland wrote:


> Hi,
> 
> Just to check -- did you mean to drop the other Ccs? It would be good to keep
> this discussion on-list if possible.
> 
> On Fri, Apr 28, 2023 at 01:49:46PM +0800, Gang Li wrote:
> > On 2023/4/27 15:30, Mark Rutland wrote:
> > > On Thu, Apr 27, 2023 at 11:26:50AM +0800, Gang Li wrote:
> > > > 1. I am curious to know the reason behind the design choice of flushing
> > > > the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
> > > > the TLB on a single core. Are there any TLB design details that make a
> > > > difference here?
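
[For illustration, the distinction in question comes down to which
invalidation instruction each architecture's teardown path issues: arm64's
"*is" TLBI operations broadcast to every CPU in the Inner Shareable domain,
while x86's invlpg only affects the executing CPU. A minimal sketch, assuming
the behavior described above; the helper names are hypothetical and this is
not the kernel's actual clear_fixmap code:]

	/* arm64: "is" (Inner Shareable) TLBI ops broadcast to all CPUs. */
	static inline void arm64_flush_kernel_page(unsigned long vaddr)
	{
		unsigned long arg = vaddr >> 12;	/* VA[55:12] operand encoding */

		asm volatile("dsb ishst");		/* make the PTE update visible to walkers */
		asm volatile("tlbi vaale1is, %0" : : "r" (arg));	/* every CPU */
		asm volatile("dsb ish");		/* wait for completion system-wide */
		asm volatile("isb");
	}

	/* x86-64: invlpg invalidates only the executing CPU's TLB; remote
	 * CPUs would need an IPI-based shootdown, which this path skips. */
	static inline void x86_flush_kernel_page_local(unsigned long vaddr)
	{
		asm volatile("invlpg (%0)" : : "r" (vaddr) : "memory");
	}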

> > > I don't know why arm64 only clears this on a single CPU.
> > 
> > Sorry, I'm a bit confused.
> > 
> > Did you mean you don't know why *amd64* only clears this on a single
> > CPU?

> Yes, sorry; I meant to say "amd64" rather than "arm64" here.

> > Looks like I should ask the amd64 folks 😉

> 😉

> > > On arm64 we *must* invalidate the TLB on all CPUs, as the kernel page
> > > tables are shared by all CPUs, and the architectural Break-Before-Make
> > > rules require the TLB to be invalidated between two valid (but distinct)
> > > entries.
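
[To make Break-Before-Make concrete, here is a minimal sketch of replacing
one valid kernel PTE with another; set_pte()/__pte() stand in for the real
helpers, and this is illustrative rather than the kernel's implementation:]

	static void bbm_replace(pte_t *ptep, unsigned long vaddr, pte_t new_pte)
	{
		unsigned long arg = vaddr >> 12;

		/* 1. Break: install an invalid entry first. */
		set_pte(ptep, __pte(0));

		/*
		 * 2. Invalidate the old entry on *all* CPUs: the kernel tables
		 *    are shared, so any CPU may hold a stale translation.
		 */
		asm volatile("dsb ishst");
		asm volatile("tlbi vaale1is, %0" : : "r" (arg));
		asm volatile("dsb ish");

		/* 3. Make: only now is it safe to install the new entry. */
		set_pte(ptep, new_pte);
		asm volatile("dsb ishst");	/* make the new entry visible */
	}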

> > ghes_unmap is protected by a spin_lock, so only one core can access this
> > memory area at a time. My understanding is that there will be no TLB
> > entries for this memory area on other cores.
> > 
> > Is it because arm64 has speculative execution? Even if a core does not
> > hold the spin_lock, can its TLB still cache translations for the critical
> > section?
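
[The pattern being asked about, as a rough sketch with hypothetical
map_fixmap_slot()/unmap_fixmap_slot() helpers, loosely modeled on the ghes
fixmap usage: the lock serializes software accesses to the slot, but places
no constraint on when another CPU's page-table walker loads a translation
for it:]

	static DEFINE_SPINLOCK(ghes_lock);

	static void read_error_record(phys_addr_t paddr, void *buf, size_t len)
	{
		void *vaddr;

		spin_lock(&ghes_lock);
		vaddr = map_fixmap_slot(paddr);	/* hypothetical: install a valid PTE */
		memcpy(buf, vaddr, len);
		unmap_fixmap_slot(vaddr);	/* hypothetical: clear the PTE, flush the TLB */
		spin_unlock(&ghes_lock);

		/*
		 * Other CPUs never dereference vaddr, yet their TLBs may have
		 * cached the translation while the PTE was briefly valid.
		 */
	}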

> The architecture allows a CPU to allocate TLB entries at any time, for any
> reason, for any valid translation table entries reachable from the root in
> TTBR{0,1}_ELx. That can be due to speculation, prefetching, and/or other
> reasons.
> 
> Due to that, it doesn't matter whether or not a CPU explicitly accesses a
> memory location -- TLB entries can be allocated regardless. Consequently, the
> spinlock doesn't make any difference.
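
[Concretely, the hazard this implies, as a hypothetical two-CPU timeline
assuming an unmap that only flushed the local TLB:]

	/*
	 *   CPU A (holds ghes_lock)            CPU B (never takes the lock)
	 *   -----------------------            ----------------------------
	 *   map: PTE := <paddr0, valid>
	 *                                      walker speculatively loads the
	 *                                      now-valid PTE into its TLB
	 *   memcpy from the fixmap slot
	 *   unmap: PTE := invalid
	 *   local-only TLB invalidate          stale entry survives in B's TLB
	 *
	 *   Later, any CPU remaps the slot to <paddr1>: without
	 *   Break-Before-Make plus broadcast invalidation, CPU B's TLB can
	 *   hold a translation that disagrees with the live page tables.
	 */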

> Thanks,
> Mark.