Re: [v4 0/3] Reduce TLB flushes under some specific conditions

From: Byungchul Park
Date: Thu Nov 09 2023 - 20:32:32 EST


On Thu, Nov 09, 2023 at 01:20:29PM +0800, Huang, Ying wrote:
> Byungchul Park <byungchul@xxxxxx> writes:
>
> > Hi everyone,
> >
> > While working with CXL memory, I have been facing migration overhead,
> > especially TLB shootdowns on promotion or demotion between different
> > tiers. Most TLB shootdowns on migration through hinting faults can be
> > avoided thanks to Huang Ying's work, commit 4d4b6d66db ("mm,unmap: avoid
> > flushing TLB in batch if PTE is inaccessible").
> >
> > However, that only covers migrations triggered by hinting faults. It
> > would be much better to have a general mechanism that reduces the
> > number of TLB flushes and TLB misses and can be applied to any type of
> > migration, though for now I have tried it only for tiering migration.
> >
> > I'm suggesting a mechanism that reduces TLB flushes by keeping both the
> > source and destination folios involved in a migration until all the
> > required TLB flushes have been done, but only if none of those folios
> > are mapped by PTE entries with write permission. The work is based on
> > v6.6-rc5.
> >
> > Can you believe it? With the workload I tested, XSBench, I saw the
> > number of full TLB flushes reduced by about 80% and iTLB misses reduced
> > by about 50%, and wall-clock performance consistently improved by at
> > least 1%. I believe it would help even more with other or real-world
> > workloads. I'd appreciate it if you could let me know if I'm missing
> > something.
>
> Can you help to test the effect of commit 7e12beb8ca2a ("migrate_pages:
> batch flushing TLB") for your test case? To test it, you can revert it
> and compare the performance before and after the revert.

I will.

> And, how do you trigger migration when testing XSBench? Use a tiered
> memory system, and migrate pages between DRAM and CXL memory back and
> forth? If so, how many pages will you migrate for each migration?

Honestly, I've been focusing on the number of migrations and TLB
flushes. I will get back to you.

Byungchul

> --
> Best Regards,
> Huang, Ying
>
> >
> > Byungchul
> >
> > ---
> >
> > Changes from v3:
> >
> > 1. Drop the kconfig option, CONFIG_MIGRC, and remove the sysctl
> > knob, migrc_enable. (suggested by Nadav)
> > 2. Remove the optimization that skips CPUs which have already
> > performed the needed TLB flushes for any reason when migrc
> > flushes the TLB, because I could not measure a performance
> > difference with and without it. (suggested by Nadav)
> > 3. Minimize arch-specific code. While at it, move all the migrc
> > declarations and inline functions from include/linux/mm.h to
> > mm/internal.h. (suggested by Dave Hansen, Nadav)
> > 4. Split out the part that pauses migrc when the system is under
> > high memory pressure into a separate patch. (suggested by Nadav)
> > 5. Rename:
> > a. arch_tlbbatch_clean() to arch_tlbbatch_clear(),
> > b. tlb_ubc_nowr to tlb_ubc_ro,
> > c. migrc_try_flush_free_folios() to migrc_flush_free_folios(),
> > d. migrc_stop to migrc_pause.
> > (suggested by Nadav)
> > 6. Use the ->lru list_head instead of introducing a new
> > llist_head. (suggested by Nadav)
> > 7. Use non-atomic page-flag operations where it is safe.
> > (suggested by Nadav)
> > 8. Use the stack instead of keeping a pointer to 'struct
> > migrc_req' in struct task, since it is only manipulated
> > locally. (suggested by Nadav)
> > 9. Replace a lot of simple functions with inline functions
> > placed in a header, mm/internal.h. (suggested by Nadav)
> > 10. Add sufficient additional comments. (suggested by Nadav)
> > 11. Remove a lot of wrapper functions. (suggested by Nadav)
> >
> > Changes from RFC v2:
> >
> > 1. Remove the additional field in struct page. To do that,
> > union migrc's list with the lru field and add a page flag.
> > I know a new page flag is something we don't like to add,
> > but there is no choice because migrc has to distinguish
> > folios under its control from others. To mitigate the
> > concern, migrc is restricted to 64-bit systems.
> > 2. Remove the internal object allocator that I had introduced
> > to minimize the impact on the system; a ton of tests showed
> > it made no difference.
> > 3. Stop migrc from working when the system is under high memory
> > pressure, e.g. about to perform direct reclaim. Under
> > conditions where swap is heavily used, I found the system
> > suffered a regression without this control.
> > 4. Exclude folios with pte_dirty() == true from migrc's
> > interest so that migrc can stay simpler.
> > 5. Combine several tightly coupled patches into one.
> > 6. Add sufficient comments for better review.
> > 7. Manage migrc's requests per node instead of globally.
> > 8. Add the TLB miss improvement to the commit message.
> > 9. Test with more CPUs (4 -> 16) to see a bigger improvement.
> >
> > Changes from RFC:
> >
> > 1. Fix a bug triggered when a destination folio of a previous
> > migration becomes a source folio of the next migration before
> > it has been handled properly, which left the folio's state
> > inconsistent. Fixed it.
> > 2. Split the patch set into more pieces so that folks can
> > review it better. (Suggested by Nadav Amit)
> > 3. Fix incorrect barrier usage, e.g. smp_mb__after_atomic().
> > (Suggested by Nadav Amit)
> > 4. Try to add sufficient comments to explain the patch set
> > better. (Suggested by Nadav Amit)
> >
> > Byungchul Park (3):
> > mm/rmap: Recognize read-only TLB entries during batched TLB flush
> > mm: Defer TLB flush by keeping both src and dst folios at migration
> > mm: Pause migrc mechanism at high memory pressure
> >
> > arch/x86/include/asm/tlbflush.h | 3 +
> > arch/x86/mm/tlb.c | 11 ++
> > include/linux/mm_types.h | 21 +++
> > include/linux/mmzone.h | 9 ++
> > include/linux/page-flags.h | 4 +
> > include/linux/sched.h | 7 +
> > include/trace/events/mmflags.h | 3 +-
> > mm/internal.h | 78 ++++++++++
> > mm/memory.c | 11 ++
> > mm/migrate.c | 266 ++++++++++++++++++++++++++++++++
> > mm/page_alloc.c | 30 +++-
> > mm/rmap.c | 35 ++++-
> > 12 files changed, 475 insertions(+), 3 deletions(-)