Re: [RESEND PATCH v8 0/8] Reduce TLB flushes by 94% by improving folio migration

From: Byungchul Park
Date: Sun Mar 03 2024 - 21:39:57 EST


On Thu, Feb 29, 2024 at 10:33:44AM +0100, David Hildenbrand wrote:
> On 29.02.24 10:28, Byungchul Park wrote:
> > On Mon, Feb 26, 2024 at 12:06:05PM +0900, Byungchul Park wrote:
> > > Hi everyone,
> > >
> > > While I'm working with a tiered memory system e.g. CXL memory, I have
> > > been facing migration overhead esp. TLB shootdown on promotion or
> > > demotion between different tiers. Yeah.. most TLB shootdowns on
> > > migration through hinting fault can be avoided thanks to Huang Ying's
> > > work, commit 4d4b6d66db ("mm,unmap: avoid flushing TLB in batch if PTE
> > > is inaccessible"). See the following link:
> > >
> > > https://lore.kernel.org/lkml/20231115025755.GA29979@xxxxxxxxxxxxxxxxxxx/
> > >
> > > However, it's only for ones using hinting fault. I thought it'd be much
> > > better if we have a general mechanism to reduce the number of TLB
> > > flushes and TLB misses, that we can ultimately apply to any type of
> > > migration. I have tried it only for tiering so far, though.
> > >
> > > I'm suggesting a mechanism called MIGRC, which stands for 'Migration
> > > Read Copy', to reduce TLB flushes by keeping both the source and
> > > destination folios that participated in a migration until all required
> > > TLB flushes are done, only if those folios are not mapped with
> > > write-permission PTEs.
> > >
> > > To achieve that:
> > >
> > > 1. For the folios that map only to non-writable TLB entries, prevent
> > > TLB flush at migration by keeping both source and destination
> > > folios, which will be handled later at a better time.
> > >
> > > 2. When any non-writable TLB entry changes to writable e.g. through
> > > fault handler, give up migrc mechanism so as to perform TLB flush
> > > required right away.
> > >
> > > I observed a big improvement in the number of TLB flushes and TLB
> > > misses in the following evaluation using XSBench:
> > >
> > > 1. itlb flush was reduced by 93.9%.
> > > 2. dtlb thread was reduced by 43.5%.
> > > 3. stlb flush was reduced by 24.9%.
> >
> > Hi guys,
>
> Hi,
>
> >
> > The TLB flush reduction is 25% ~ 94%, IMO, it's unbelievable.
>
> Can't we find at least one benchmark that shows an actual improvement on
> some system?

XSBench is closer to a real workload. It is used for performance
analysis on high performance computing architectures; it is not a micro
benchmark that only exercises the TLB.

XSBench : https://github.com/ANL-CESAR/XSBench

Beyond the TLB numbers, the performance improvement is small but
clearly positive, as you can see in the results I shared.

Byungchul

> Staring at the number of TLB flushes is nice, but if it does not affect
> actual performance of at least one benchmark why do we even care?
>
> "12 files changed, 597 insertions(+), 59 deletions(-)"
>
> is not negligible and needs proper review.
>
> That review needs motivation. The current numbers do not seem to be
> motivating enough :)
>
> --
> Cheers,
>
> David / dhildenb