Re: [RFC 2/2] mm: Defer TLB flush by keeping both src and dst folios at migration

From: Huang, Ying
Date: Mon Aug 14 2023 - 21:30:22 EST


Byungchul Park <byungchul@xxxxxx> writes:

> Implementation of CONFIG_MIGRC, which stands for 'Migration Read Copy'.
>
> We always face migration overhead at either promotion or demotion while
> working with tiered memory, e.g. CXL memory, and found out that TLB
> shootdown is quite a big part of it that should be eliminated if possible.
>
> Fortunately, the TLB flush can be deferred or even skipped if both the
> source and destination folios of a migration are kept until all the
> required TLB flushes have been done, but of course only if the target
> PTE entries have read-only permission, or more precisely, don't have
> write permission. Otherwise, no doubt the folio might get corrupted.
>
> To achieve that:
>
> 1. For folios that have only non-writable TLB entries, defer the
> TLB flush by keeping both the source and destination folios during
> migration; the flush is handled later at a better time.
>
> 2. When any non-writable TLB entry changes to writable, e.g. through
> the fault handler, give up the CONFIG_MIGRC mechanism and perform
> the required TLB flush right away.
>
> 3. TLB flushes can be skipped entirely if all the flushes required to
> free the duplicated folios have already been done for any reason,
> not necessarily by migration itself.
>
> 4. Adjust the watermark check routine, __zone_watermark_ok(), by the
> number of duplicated folios, because those folios can be freed
> and made available right away through the appropriate TLB flushes.
>
> 5. Perform the TLB flushes and free the duplicated folios pending
> them if the page allocation routine is in trouble due to memory
> pressure, even more aggressively for high-order allocations.
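
Points 2 and 3 hinge on a generation counter: a deferred-flush request
records the global generation when it is published, each CPU records the
generation it had read when it last completed a full local flush, and a
CPU can be dropped from the remote-flush mask once its recorded value has
caught up with the request's. A rough, self-contained user-space model of
that check (illustrative only; migrc_gen, migrc_done and before() mirror
names used in the patch below, everything else is made up):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* bumped whenever a deferred-flush request is published */
static int migrc_gen;
/* the generation each CPU had observed when it last flushed locally */
static int migrc_done[NR_CPUS];

/* wraparound-safe "a is older than b" */
static bool before(int a, int b)
{
	return a - b < 0;
}

/* a CPU still needs an IPI only if it hasn't flushed since the request */
static bool cpu_needs_flush(int cpu, int req_gen)
{
	return before(migrc_done[cpu], req_gen);
}

int main(void)
{
	int req_gen = ++migrc_gen;	/* a migration publishes a request */

	migrc_done[1] = migrc_gen;	/* CPU 1 did a full flush afterwards */

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu%d: %s\n", cpu,
		       cpu_needs_flush(cpu, req_gen) ? "needs flush" : "skip");
	return 0;
}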

Is the optimization restricted to page migration only? Can it be used
in other places, like page reclaiming?

> The measurement result:
>
> Architecture - x86_64
> QEMU - kvm enabled, host cpu, 2nodes((4cpus, 2GB)+(cpuless, 6GB))
> Linux Kernel - v6.4, numa balancing tiering on, demotion enabled
> Benchmark - XSBench with no parameter changed
>
> run 'perf stat' using events:
> (FYI, process-wide result ~= system-wide result (-a option))
> 1) itlb.itlb_flush
> 2) tlb_flush.dtlb_thread
> 3) tlb_flush.stlb_any
>
> run 'cat /proc/vmstat' and pick up:
> 1) pgdemote_kswapd
> 2) numa_pages_migrated
> 3) pgmigrate_success
> 4) nr_tlb_remote_flush
> 5) nr_tlb_remote_flush_received
> 6) nr_tlb_local_flush_all
> 7) nr_tlb_local_flush_one
>
> BEFORE - mainline v6.4
> ==========================================
>
> $ perf stat -e itlb.itlb_flush,tlb_flush.dtlb_thread,tlb_flush.stlb_any ./XSBench
>
> Performance counter stats for './XSBench':
>
> 426856 itlb.itlb_flush
> 6900414 tlb_flush.dtlb_thread
> 7303137 tlb_flush.stlb_any
>
> 33.500486566 seconds time elapsed
> 92.852128000 seconds user
> 10.526718000 seconds sys
>
> $ cat /proc/vmstat
>
> ...
> pgdemote_kswapd 1052596
> numa_pages_migrated 1052359
> pgmigrate_success 2161846
> nr_tlb_remote_flush 72370
> nr_tlb_remote_flush_received 213711
> nr_tlb_local_flush_all 3385
> nr_tlb_local_flush_one 198679
> ...
>
> AFTER - mainline v6.4 + CONFIG_MIGRC
> ==========================================
>
> $ perf stat -e itlb.itlb_flush,tlb_flush.dtlb_thread,tlb_flush.stlb_any ./XSBench
>
> Performance counter stats for './XSBench':
>
> 179537 itlb.itlb_flush
> 6131135 tlb_flush.dtlb_thread
> 6920979 tlb_flush.stlb_any

It appears that the number of "itlb.itlb_flush" events changes a lot, but
not the other 2 events. Is that because the text segment of the executable
file is mapped read-only, while most other pages are mapped read-write?
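
For reference, working that out from the numbers quoted above:

  itlb.itlb_flush:         426856 ->  179537  (~58% fewer)
  tlb_flush.dtlb_thread:  6900414 -> 6131135  (~11% fewer)
  tlb_flush.stlb_any:     7303137 -> 6920979  (~5% fewer)

so only the iTLB flush count drops substantially.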

> 30.396700625 seconds time elapsed
> 80.331252000 seconds user
> 10.303761000 seconds sys
>
> $ cat /proc/vmstat
>
> ...
> pgdemote_kswapd 1044602
> numa_pages_migrated 1044202
> pgmigrate_success 2157808
> nr_tlb_remote_flush 30453
> nr_tlb_remote_flush_received 88840
> nr_tlb_local_flush_all 3039
> nr_tlb_local_flush_one 198875
> ...
>
> Signed-off-by: Byungchul Park <byungchul@xxxxxx>
> ---
> arch/x86/include/asm/tlbflush.h | 7 +
> arch/x86/mm/tlb.c | 52 ++++++
> include/linux/mm.h | 30 ++++
> include/linux/mm_types.h | 34 ++++
> include/linux/mmzone.h | 6 +
> include/linux/sched.h | 4 +
> init/Kconfig | 12 ++
> mm/internal.h | 10 ++
> mm/memory.c | 9 +-
> mm/migrate.c | 287 +++++++++++++++++++++++++++++++-
> mm/mm_init.c | 1 +
> mm/page_alloc.c | 16 ++
> mm/rmap.c | 92 ++++++++++
> 13 files changed, 555 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 63504cde364b..da987c15049e 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -279,9 +279,16 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> }
>
> extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
> +extern void arch_tlbbatch_clean(struct arch_tlbflush_unmap_batch *batch);
> extern void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
> struct arch_tlbflush_unmap_batch *bsrc);
>
> +#ifdef CONFIG_MIGRC
> +extern void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen);
> +#else
> +static inline void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen) {}
> +#endif
> +
> static inline bool pte_flags_need_flush(unsigned long oldflags,
> unsigned long newflags,
> bool ignore_access)
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 69d145f1fff1..54f98a50fd59 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1210,9 +1210,40 @@ STATIC_NOPV void native_flush_tlb_local(void)
> native_write_cr3(__native_read_cr3());
> }
>
> +#ifdef CONFIG_MIGRC
> +DEFINE_PER_CPU(int, migrc_done);
> +
> +static inline int migrc_tlb_local_begin(void)
> +{
> + int ret = atomic_read(&migrc_gen);
> +
> + smp_mb__after_atomic();
> + return ret;
> +}
> +
> +static inline void migrc_tlb_local_end(int gen)
> +{
> + smp_mb();
> + WRITE_ONCE(*this_cpu_ptr(&migrc_done), gen);
> +}
> +#else
> +static inline int migrc_tlb_local_begin(void)
> +{
> + return 0;
> +}
> +
> +static inline void migrc_tlb_local_end(int gen)
> +{
> +}
> +#endif
> +
> void flush_tlb_local(void)
> {
> + unsigned int gen;
> +
> + gen = migrc_tlb_local_begin();
> __flush_tlb_local();
> + migrc_tlb_local_end(gen);
> }
>
> /*
> @@ -1237,6 +1268,22 @@ void __flush_tlb_all(void)
> }
> EXPORT_SYMBOL_GPL(__flush_tlb_all);
>
> +#ifdef CONFIG_MIGRC
> +static inline bool before(int a, int b)
> +{
> + return a - b < 0;
> +}
> +
> +void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen)
> +{
> + int cpu;
> +
> + for_each_cpu(cpu, &batch->cpumask)
> + if (!before(READ_ONCE(*per_cpu_ptr(&migrc_done, cpu)), gen))
> + cpumask_clear_cpu(cpu, &batch->cpumask);
> +}
> +#endif
> +
> void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> {
> struct flush_tlb_info *info;
> @@ -1265,6 +1312,11 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> put_cpu();
> }
>
> +void arch_tlbbatch_clean(struct arch_tlbflush_unmap_batch *batch)
> +{
> + cpumask_clear(&batch->cpumask);
> +}
> +
> void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst,
> struct arch_tlbflush_unmap_batch *bsrc)
> {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ce77080c79..e1f6e1fdab18 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3816,4 +3816,34 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> }
> #endif
>
> +#ifdef CONFIG_MIGRC
> +void migrc_init_page(struct page *p);
> +bool migrc_pending(struct folio *f);
> +void migrc_shrink(struct llist_head *h);
> +void migrc_req_start(void);
> +void migrc_req_end(void);
> +bool migrc_req_processing(void);
> +bool migrc_try_flush(void);
> +void migrc_try_flush_dirty(void);
> +struct migrc_req *fold_ubc_nowr_migrc_req(void);
> +void free_migrc_req(struct migrc_req *req);
> +int migrc_pending_nr_in_zone(struct zone *z);
> +
> +extern atomic_t migrc_gen;
> +extern struct llist_head migrc_reqs;
> +extern struct llist_head migrc_reqs_dirty;
> +#else
> +static inline void migrc_init_page(struct page *p) {}
> +static inline bool migrc_pending(struct folio *f) { return false; }
> +static inline void migrc_shrink(struct llist_head *h) {}
> +static inline void migrc_req_start(void) {}
> +static inline void migrc_req_end(void) {}
> +static inline bool migrc_req_processing(void) { return false; }
> +static inline bool migrc_try_flush(void) { return false; }
> +static inline void migrc_try_flush_dirty(void) {}
> +static inline struct migrc_req *fold_ubc_nowr_migrc_req(void) { return NULL; }
> +static inline void free_migrc_req(struct migrc_req *req) {}
> +static inline int migrc_pending_nr_in_zone(struct zone *z) { return 0; }
> +#endif
> +
> #endif /* _LINUX_MM_H */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 306a3d1a0fa6..3be66d3eabd2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -228,6 +228,10 @@ struct page {
> #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> int _last_cpupid;
> #endif
> +#ifdef CONFIG_MIGRC
> + struct llist_node migrc_node;
> + unsigned int migrc_state;
> +#endif

We cannot enlarge "struct page".
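
For scale (assuming the common x86_64 layout where struct page is 64 bytes,
one per 4 KiB page): adding a struct llist_node (8 bytes) plus an unsigned
int (4 bytes, padded) grows it to 80 bytes, i.e. roughly an extra 0.4% of
all physical memory spent on the new fields alone, and struct page no
longer fits in a single 64-byte cache line.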

> } _struct_page_alignment;
>
> /*
> @@ -1255,4 +1259,34 @@ enum {
> /* See also internal only FOLL flags in mm/internal.h */
> };
>
> +#ifdef CONFIG_MIGRC
> +struct migrc_req {
> + /*
> + * pages pending for TLB flush
> + */
> + struct llist_head pages;
> +
> + /*
> + * llist_node of the last page in pages llist
> + */
> + struct llist_node *last;
> +
> + /*
> + * for hanging onto migrc_reqs llist
> + */
> + struct llist_node llnode;
> +
> + /*
> + * architecture specific batch information
> + */
> + struct arch_tlbflush_unmap_batch arch;
> +
> + /*
> + * when the request hung onto migrc_reqs llist
> + */
> + int gen;
> +};
> +#else
> +struct migrc_req {};
> +#endif
> #endif /* _LINUX_MM_TYPES_H */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index a4889c9d4055..1ec79bb63ba7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -958,6 +958,9 @@ struct zone {
> /* Zone statistics */
> atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
> atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
> +#ifdef CONFIG_MIGRC
> + atomic_t migrc_pending_nr;
> +#endif
> } ____cacheline_internodealigned_in_smp;
>
> enum pgdat_flags {
> @@ -1371,6 +1374,9 @@ typedef struct pglist_data {
> #ifdef CONFIG_MEMORY_FAILURE
> struct memory_failure_stats mf_stats;
> #endif
> +#ifdef CONFIG_MIGRC
> + atomic_t migrc_pending_nr;
> +#endif
> } pg_data_t;
>
> #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2232b2cdfce8..d0a46089959d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1323,6 +1323,10 @@ struct task_struct {
>
> struct tlbflush_unmap_batch tlb_ubc;
> struct tlbflush_unmap_batch tlb_ubc_nowr;
> +#ifdef CONFIG_MIGRC
> + struct migrc_req *mreq;
> + struct migrc_req *mreq_dirty;
> +#endif
>
> /* Cache last used pipe for splice(): */
> struct pipe_inode_info *splice_pipe;
> diff --git a/init/Kconfig b/init/Kconfig
> index 32c24950c4ce..f4882c1be364 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -907,6 +907,18 @@ config NUMA_BALANCING_DEFAULT_ENABLED
> If set, automatic NUMA balancing will be enabled if running on a NUMA
> machine.
>
> +config MIGRC
> + bool "Deferring TLB flush by keeping read copies on migration"
> + depends on ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> + depends on NUMA_BALANCING
> + default n
> + help
> + TLB flush is necessary when PTE changes by migration. However,
> + TLB flush can be deferred if both copies of the src page and
> + the dst page are kept until TLB flush if they are non-writable.
> + System performance will be improved especially in case that
> + promotion and demotion type of migration is heavily happening.
> +
> menuconfig CGROUPS
> bool "Control Group support"
> select KERNFS
> diff --git a/mm/internal.h b/mm/internal.h
> index b90d516ad41f..a8e3168614d6 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -841,6 +841,8 @@ void try_to_unmap_flush(void);
> void try_to_unmap_flush_dirty(void);
> void flush_tlb_batched_pending(struct mm_struct *mm);
> void fold_ubc_nowr(void);
> +int nr_flush_required(void);
> +int nr_flush_required_nowr(void);
> #else
> static inline void try_to_unmap_flush(void)
> {
> @@ -854,6 +856,14 @@ static inline void flush_tlb_batched_pending(struct mm_struct *mm)
> static inline void fold_ubc_nowr(void)
> {
> }
> +static inline int nr_flush_required(void)
> +{
> + return 0;
> +}
> +static inline int nr_flush_required_nowr(void)
> +{
> + return 0;
> +}
> #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
>
> extern const struct trace_print_flags pageflag_names[];
> diff --git a/mm/memory.c b/mm/memory.c
> index f69fbc251198..061f23e34d69 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3345,6 +3345,12 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>
> vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
>
> + if (vmf->page)
> + folio = page_folio(vmf->page);
> +
> + if (folio && migrc_pending(folio))
> + migrc_try_flush();
> +
> /*
> * Shared mapping: we are guaranteed to have VM_WRITE and
> * FAULT_FLAG_WRITE set at this point.
> @@ -3362,9 +3368,6 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> return wp_page_shared(vmf);
> }
>
> - if (vmf->page)
> - folio = page_folio(vmf->page);
> -
> /*
> * Private mapping: create an exclusive anonymous page copy if reuse
> * is impossible. We might miss VM_WRITE for FOLL_FORCE handling.
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 01cac26a3127..944c7e179288 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -58,6 +58,244 @@
>
> #include "internal.h"
>
> +#ifdef CONFIG_MIGRC
> +static int sysctl_migrc_enable = 1;
> +#ifdef CONFIG_SYSCTL
> +static int sysctl_migrc_enable_handler(struct ctl_table *table, int write,
> + void *buffer, size_t *lenp, loff_t *ppos)
> +{
> + struct ctl_table t;
> + int err;
> + int enabled = sysctl_migrc_enable;
> +
> + if (write && !capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + t = *table;
> + t.data = &enabled;
> + err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
> + if (err < 0)
> + return err;
> + if (write)
> + sysctl_migrc_enable = enabled;
> + return err;
> +}
> +
> +static struct ctl_table migrc_sysctls[] = {
> + {
> + .procname = "migrc_enable",
> + .data = NULL, /* filled in by handler */
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = sysctl_migrc_enable_handler,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_ONE,
> + },
> + {}
> +};
> +
> +static int __init migrc_sysctl_init(void)
> +{
> + register_sysctl_init("vm", migrc_sysctls);
> + return 0;
> +}
> +late_initcall(migrc_sysctl_init);
> +#endif
> +
> +/*
> + * TODO: Yeah, it's a non-sense magic number. This simple value manages
> + * to work conservatively anyway. However, the value needs to be
> + * tuned and adjusted based on the internal condition of memory
> + * management subsystem later.
> + *
> + * Let's start with a simple value for now.
> + */
> +static const int migrc_pending_max = 512; /* unit: page */
> +
> +atomic_t migrc_gen;
> +LLIST_HEAD(migrc_reqs);
> +LLIST_HEAD(migrc_reqs_dirty);
> +
> +enum {
> + MIGRC_STATE_NONE,
> + MIGRC_SRC_PENDING,
> + MIGRC_DST_PENDING,
> +};
> +
> +#define MAX_MIGRC_REQ_NR 4096
> +static struct migrc_req migrc_req_pool_static[MAX_MIGRC_REQ_NR];
> +static atomic_t migrc_req_pool_idx = ATOMIC_INIT(-1);
> +static LLIST_HEAD(migrc_req_pool_llist);
> +static DEFINE_SPINLOCK(migrc_req_pool_lock);
> +
> +static struct migrc_req *alloc_migrc_req(void)
> +{
> + int idx = atomic_read(&migrc_req_pool_idx);
> + struct llist_node *n;
> +
> + if (idx < MAX_MIGRC_REQ_NR - 1) {
> + idx = atomic_inc_return(&migrc_req_pool_idx);
> + if (idx < MAX_MIGRC_REQ_NR)
> + return migrc_req_pool_static + idx;
> + }
> +
> + spin_lock(&migrc_req_pool_lock);
> + n = llist_del_first(&migrc_req_pool_llist);
> + spin_unlock(&migrc_req_pool_lock);
> +
> + return n ? llist_entry(n, struct migrc_req, llnode) : NULL;
> +}
> +
> +void free_migrc_req(struct migrc_req *req)
> +{
> + llist_add(&req->llnode, &migrc_req_pool_llist);
> +}
> +
> +static bool migrc_full(int nid)
> +{
> + struct pglist_data *node = NODE_DATA(nid);
> +
> + if (migrc_pending_max == -1)
> + return false;
> +
> + return atomic_read(&node->migrc_pending_nr) >= migrc_pending_max;
> +}
> +
> +void migrc_init_page(struct page *p)
> +{
> + WRITE_ONCE(p->migrc_state, MIGRC_STATE_NONE);
> +}
> +
> +/*
> + * The list should be isolated before.
> + */
> +void migrc_shrink(struct llist_head *h)
> +{
> + struct page *p;
> + struct llist_node *n;
> +
> + n = llist_del_all(h);
> + llist_for_each_entry(p, n, migrc_node) {
> + if (p->migrc_state == MIGRC_SRC_PENDING) {
> + struct pglist_data *node;
> + struct zone *zone;
> +
> + node = NODE_DATA(page_to_nid(p));
> + zone = page_zone(p);
> + atomic_dec(&node->migrc_pending_nr);
> + atomic_dec(&zone->migrc_pending_nr);
> + }
> + WRITE_ONCE(p->migrc_state, MIGRC_STATE_NONE);
> + folio_put(page_folio(p));
> + }
> +}
> +
> +bool migrc_pending(struct folio *f)
> +{
> + return READ_ONCE(f->page.migrc_state) != MIGRC_STATE_NONE;
> +}
> +
> +static void migrc_expand_req(struct folio *fsrc, struct folio *fdst)
> +{
> + struct migrc_req *req;
> + struct pglist_data *node;
> + struct zone *zone;
> +
> + req = fold_ubc_nowr_migrc_req();
> + if (!req)
> + return;
> +
> + folio_get(fsrc);
> + folio_get(fdst);
> + WRITE_ONCE(fsrc->page.migrc_state, MIGRC_SRC_PENDING);
> + WRITE_ONCE(fdst->page.migrc_state, MIGRC_DST_PENDING);
> +
> + if (llist_add(&fsrc->page.migrc_node, &req->pages))
> + req->last = &fsrc->page.migrc_node;
> + llist_add(&fdst->page.migrc_node, &req->pages);
> +
> + node = NODE_DATA(folio_nid(fsrc));
> + zone = page_zone(&fsrc->page);
> + atomic_inc(&node->migrc_pending_nr);
> + atomic_inc(&zone->migrc_pending_nr);
> +
> + if (migrc_full(folio_nid(fsrc)))
> + migrc_try_flush();
> +}
> +
> +void migrc_req_start(void)
> +{
> + struct migrc_req *req;
> + struct migrc_req *req_dirty;
> +
> + if (WARN_ON(current->mreq || current->mreq_dirty))
> + return;
> +
> + req = alloc_migrc_req();
> + req_dirty = alloc_migrc_req();
> +
> + if (!req || !req_dirty)
> + goto fail;
> +
> + arch_tlbbatch_clean(&req->arch);
> + init_llist_head(&req->pages);
> + req->last = NULL;
> + current->mreq = req;
> +
> + arch_tlbbatch_clean(&req_dirty->arch);
> + init_llist_head(&req_dirty->pages);
> + req_dirty->last = NULL;
> + current->mreq_dirty = req_dirty;
> + return;
> +fail:
> + if (req_dirty)
> + free_migrc_req(req_dirty);
> + if (req)
> + free_migrc_req(req);
> +}
> +
> +void migrc_req_end(void)
> +{
> + struct migrc_req *req = current->mreq;
> + struct migrc_req *req_dirty = current->mreq_dirty;
> +
> + WARN_ON((!req && req_dirty) || (req && !req_dirty));
> +
> + if (!req || !req_dirty)
> + return;
> +
> + if (llist_empty(&req->pages)) {
> + free_migrc_req(req);
> + } else {
> + req->gen = atomic_inc_return(&migrc_gen);
> + llist_add(&req->llnode, &migrc_reqs);
> + }
> + current->mreq = NULL;
> +
> + if (llist_empty(&req_dirty->pages)) {
> + free_migrc_req(req_dirty);
> + } else {
> + req_dirty->gen = atomic_inc_return(&migrc_gen);
> + llist_add(&req_dirty->llnode, &migrc_reqs_dirty);
> + }
> + current->mreq_dirty = NULL;
> +}
> +
> +bool migrc_req_processing(void)
> +{
> + return current->mreq && current->mreq_dirty;
> +}
> +
> +int migrc_pending_nr_in_zone(struct zone *z)
> +{
> + return atomic_read(&z->migrc_pending_nr);
> +}
> +#else
> +static const int sysctl_migrc_enable;
> +static bool migrc_full(int nid) { return true; }
> +static void migrc_expand_req(struct folio *fsrc, struct folio *fdst) {}
> +#endif
> +
> bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> {
> struct folio *folio = folio_get_nontail_page(page);
> @@ -383,6 +621,9 @@ static int folio_expected_refs(struct address_space *mapping,
> struct folio *folio)
> {
> int refs = 1;
> +
> + refs += migrc_pending(folio) ? 1 : 0;
> +
> if (!mapping)
> return refs;
>
> @@ -1060,6 +1301,12 @@ static void migrate_folio_undo_src(struct folio *src,
> bool locked,
> struct list_head *ret)
> {
> + /*
> + * TODO: There might be folios already pending for migrc.
> + * However, there's no way to cancel those on failure for now.
> + * Let's reflect the requirement when needed.
> + */
> +
> if (page_was_mapped)
> remove_migration_ptes(src, src, false);
> /* Drop an anon_vma reference if we took one */
> @@ -1627,10 +1874,17 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
> LIST_HEAD(unmap_folios);
> LIST_HEAD(dst_folios);
> bool nosplit = (reason == MR_NUMA_MISPLACED);
> + bool migrc_cond1;
>
> VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
> !list_empty(from) && !list_is_singular(from));
>
> + migrc_cond1 = sysctl_migrc_enable &&
> + ((reason == MR_DEMOTION && current_is_kswapd()) ||
> + reason == MR_NUMA_MISPLACED);
> +
> + if (migrc_cond1)
> + migrc_req_start();
> for (pass = 0; pass < nr_pass && (retry || large_retry); pass++) {
> retry = 0;
> large_retry = 0;
> @@ -1638,6 +1892,10 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
> nr_retry_pages = 0;
>
> list_for_each_entry_safe(folio, folio2, from, lru) {
> + int nr_required;
> + bool migrc_cond2;
> + bool migrc;
> +
> /*
> * Large folio statistics is based on the source large
> * folio. Capture required information that might get
> @@ -1671,8 +1929,14 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
> continue;
> }
>
> + nr_required = nr_flush_required();
> rc = migrate_folio_unmap(get_new_page, put_new_page, private,
> folio, &dst, mode, reason, ret_folios);
> + migrc_cond2 = nr_required == nr_flush_required() &&
> + nr_flush_required_nowr() &&
> + !migrc_full(folio_nid(folio));
> + migrc = migrc_cond1 && migrc_cond2;
> +
> /*
> * The rules are:
> * Success: folio will be freed
> @@ -1722,9 +1986,11 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
> nr_large_failed += large_retry;
> stats->nr_thp_failed += thp_retry;
> rc_saved = rc;
> - if (list_empty(&unmap_folios))
> + if (list_empty(&unmap_folios)) {
> + if (migrc_cond1)
> + migrc_req_end();
> goto out;
> - else
> + } else
> goto move;
> case -EAGAIN:
> if (is_large) {
> @@ -1742,6 +2008,13 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
> case MIGRATEPAGE_UNMAP:
> list_move_tail(&folio->lru, &unmap_folios);
> list_add_tail(&dst->lru, &dst_folios);
> +
> + if (migrc)
> + /*
> + * XXX: On migration failure,
> + * extra TLB flush might happen.
> + */
> + migrc_expand_req(folio, dst);
> break;
> default:
> /*
> @@ -1760,6 +2033,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
> stats->nr_failed_pages += nr_pages;
> break;
> }
> + fold_ubc_nowr();
> }
> }
> nr_failed += retry;
> @@ -1767,6 +2041,15 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
> stats->nr_thp_failed += thp_retry;
> stats->nr_failed_pages += nr_retry_pages;
> move:
> + /*
> + * Should be prior to try_to_unmap_flush() so that
> + * migrc_try_flush() that will be performed later based on the
> + * gen # assigned in migrc_req_end(), can take benefit of the
> + * TLB flushes in try_to_unmap_flush().
> + */
> + if (migrc_cond1)
> + migrc_req_end();
> +
> /* Flush TLBs for all unmapped folios */
> try_to_unmap_flush();
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 7f7f9c677854..87cbddc7d780 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -558,6 +558,7 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
> page_mapcount_reset(page);
> page_cpupid_reset_last(page);
> page_kasan_tag_reset(page);
> + migrc_init_page(page);
>
> INIT_LIST_HEAD(&page->lru);
> #ifdef WANT_PAGE_VIRTUAL
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47421bedc12b..167dadb0d817 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3176,6 +3176,11 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> long min = mark;
> int o;
>
> + /*
> + * There are pages that can be freed by migrc_try_flush().
> + */
> + free_pages += migrc_pending_nr_in_zone(z);
> +
> /* free_pages may go negative - that's OK */
> free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
>
> @@ -4254,6 +4259,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> unsigned int zonelist_iter_cookie;
> int reserve_flags;
>
> + migrc_try_flush();
> restart:
> compaction_retries = 0;
> no_progress_loops = 0;
> @@ -4769,6 +4775,16 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
> if (likely(page))
> goto out;
>
> + if (order && migrc_try_flush()) {
> + /*
> + * Try again after freeing migrc's pending pages in case
> + * of high order allocation.
> + */
> + page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
> + if (likely(page))
> + goto out;
> + }
> +
> alloc_gfp = gfp;
> ac.spread_dirty_pages = false;
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d18460a48485..5b251eb01cd4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -606,6 +606,86 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
>
> #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>
> +#ifdef CONFIG_MIGRC
> +static bool __migrc_try_flush(struct llist_head *h)
> +{
> + struct arch_tlbflush_unmap_batch arch;
> + struct llist_node *reqs;
> + struct migrc_req *req;
> + struct migrc_req *req2;
> + LLIST_HEAD(pages);
> +
> + reqs = llist_del_all(h);
> + if (!reqs)
> + return false;
> +
> + arch_tlbbatch_clean(&arch);
> +
> + /*
> + * TODO: Optimize the time complexity.
> + */
> + llist_for_each_entry_safe(req, req2, reqs, llnode) {
> + struct llist_node *n;
> +
> + arch_migrc_adj(&req->arch, req->gen);
> + arch_tlbbatch_fold(&arch, &req->arch);
> +
> + n = llist_del_all(&req->pages);
> + llist_add_batch(n, req->last, &pages);
> + free_migrc_req(req);
> + }
> +
> + arch_tlbbatch_flush(&arch);
> + migrc_shrink(&pages);
> + return true;
> +}
> +
> +bool migrc_try_flush(void)
> +{
> + bool ret;
> +
> + if (migrc_req_processing()) {
> + migrc_req_end();
> + migrc_req_start();
> + }
> + ret = __migrc_try_flush(&migrc_reqs);
> + ret = ret || __migrc_try_flush(&migrc_reqs_dirty);
> +
> + return ret;
> +}
> +
> +void migrc_try_flush_dirty(void)
> +{
> + if (migrc_req_processing()) {
> + migrc_req_end();
> + migrc_req_start();
> + }
> + __migrc_try_flush(&migrc_reqs_dirty);
> +}
> +
> +struct migrc_req *fold_ubc_nowr_migrc_req(void)
> +{
> + struct tlbflush_unmap_batch *tlb_ubc_nowr = &current->tlb_ubc_nowr;
> + struct migrc_req *req;
> + bool dirty;
> +
> + if (!tlb_ubc_nowr->nr_flush_required)
> + return NULL;
> +
> + dirty = tlb_ubc_nowr->writable;
> + req = dirty ? current->mreq_dirty : current->mreq;
> + if (!req) {
> + fold_ubc_nowr();
> + return NULL;
> + }
> +
> + arch_tlbbatch_fold(&req->arch, &tlb_ubc_nowr->arch);
> + tlb_ubc_nowr->nr_flush_required = 0;
> + tlb_ubc_nowr->writable = false;
> + return req;
> +}
> +#endif
> +
> void fold_ubc_nowr(void)
> {
> struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
> @@ -621,6 +701,16 @@ void fold_ubc_nowr(void)
> tlb_ubc_nowr->writable = false;
> }
>
> +int nr_flush_required(void)
> +{
> + return current->tlb_ubc.nr_flush_required;
> +}
> +
> +int nr_flush_required_nowr(void)
> +{
> + return current->tlb_ubc_nowr.nr_flush_required;
> +}
> +
> /*
> * Flush TLB entries for recently unmapped pages from remote CPUs. It is
> * important if a PTE was dirty when it was unmapped that it's flushed
> @@ -648,6 +738,8 @@ void try_to_unmap_flush_dirty(void)
>
> if (tlb_ubc->writable || tlb_ubc_nowr->writable)
> try_to_unmap_flush();
> +
> + migrc_try_flush_dirty();
> }
>
> /*