Re: [PATCH 0/3] TLB flush multiple pages per IPI v5

From: Linus Torvalds
Date: Wed Jun 10 2015 - 12:17:31 EST


On Wed, Jun 10, 2015 at 6:13 AM, Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
>
> Assuming the page tables are cache-hot... And hot here does not mean
> L3 cache, but higher. But a memory intensive workload can easily
> violate that.

In practice, no.

You'll spend all your time on the actual data cache misses; the TLB
misses won't add anything noticeable on top of them.

And if your access patterns are even *remotely* cache-friendly (ie
_not_ spending all your time just waiting for regular data cache
misses), then a radix-tree-like page table like Intel's will have much
better locality in the page tables than in the actual data. So again,
the TLB misses won't be your big problem.

There may be pathological cases where you just look at one word per
page, but let's face it, we don't optimize for pathological or
unrealistic cases.

And the thing is, you need to look at the costs. Single-page
invalidation taking hundreds of cycles? Yeah, we definitely need to
take the downside of trying to be clever into account.

If the invalidation was really cheap, the rules might change. As it
is, I really don't think there is any question about this.

That's particularly true when the single-page invalidation approach
has lots of *software* overhead too - not just the complexity, but
even the "obvious" cost of feeding the list of pages to be invalidated
across CPU's. Think about it - there are cache misses there too, and
because we do those across CPU's, those cache misses are *mandatory*.

So trying to avoid a few TLB misses by forcing mandatory cache misses
and extra complexity, and by doing lots of 200+ cycle operations?
Really? In what universe does that sound like a good idea?
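(Do the math: even a short list of single-page invalidations at 200+
cycles each quickly adds up to over a thousand cycles - and that's
before you count the mandatory cache misses on top.)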

Quite frankly, I can pretty much *guarantee* that you didn't actually
think about any real numbers, you've just been taught that fairy-tale
of "TLB misses are expensive". As if TLB entries were somehow sacred.

If somebody can show real numbers on a real workload, that's one
thing. But in the absence of those real numbers, we should go for the
simple and straightforward code, and not try to make up "but but but
it could be expensive" issues.

Trying to avoid IPI's by batching them up sounds like a very obvious
win. I absolutely believe that is worth it. I just don't believe it is
worth it trying to pass a lot of page detail data around. The number
of individual pages flushed should be *very* controlled.
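To make that concrete, here's a rough sketch of the kind of capped
batching I mean - the names and the cap are made up for illustration,
not taken from the patch series:

    #define MAX_BATCH_PAGES 7   /* small on purpose - see the cacheline argument below */

    struct tlb_batch {
            unsigned long nr;                       /* pages queued so far */
            unsigned long addr[MAX_BATCH_PAGES];    /* their addresses */
    };

    /* Queue one page for remote invalidation; on overflow, just flush it all. */
    static int batch_add_page(struct tlb_batch *b, unsigned long addr)
    {
            if (b->nr >= MAX_BATCH_PAGES)
                    return 0;       /* caller falls back to a full TLB flush */
            b->addr[b->nr++] = addr;
            return 1;
    }

The moment the list stops being tiny, you fall back to the full flush
and stop paying the per-page costs.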

Here's a suggestion: the absolute maximum number of TLB entries we
flush should be something we can describe in one single cacheline.
Strive to minimize the number of cachelines we have to transfer for
the IPI, instead of trying to minimize the number of TLB entries. So
maybe something along the lines of "one word of flags (all or nothing,
or a range), or up to seven individual addresses to be flushed". I
personally think even seven invlpg's is likely too much, but with the
"single cacheline" rule at least there's a concrete basis for the limit.

So anyway, I like the patch series. I just think that the final patch
- the one that actually saves the addresses and limits things to
BATCH_TLBFLUSH_SIZE - should itself be limited.

Linus