[PATCH 4/4] mm: Send one IPI per CPU to TLB flush pages that were recently unmapped

From: Mel Gorman
Date: Tue Jun 09 2015 - 13:32:46 EST

Next message: Mel Gorman: "[PATCH 3/4] mm: Defer flush of writable TLB entries"
Previous message: Mel Gorman: "[PATCH 1/4] x86, mm: Trace when an IPI is about to be sent"
In reply to: Mel Gorman: "[PATCH 1/4] x86, mm: Trace when an IPI is about to be sent"
Next in thread: Mel Gorman: "[PATCH 3/4] mm: Defer flush of writable TLB entries"

When unmapping pages, an IPI is sent to flush all TLB entries on CPUs that
potentially have a valid TLB entry. There are many circumstances where
this happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate CPUs.
This forces processes running on the affected CPUs to refill their TLB entries.
This is an unpredictable cost as it heavily depends on the workloads,
the timing and the exact CPU used.

This patch uses a structure similar in principle to a pagevec to collect
a list of PFNs and CPUs that require flushing. It then sends one IPI per
CPU that was mapping any of those pages to flush the list of PFNs. A new
TLB flush helper is required for this and one is added for x86. Other
architectures will need to decide if batching like this is both safe and
worth the overhead.
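
To illustrate the batching described above, the following is a minimal
userspace sketch, not the kernel code in this patch. It assumes a 64-bit
word standing in for struct cpumask. The names unmap_batch_model, model_add
and model_flush, and the two counters, are hypothetical and exist only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BATCH_SIZE 32u	/* mirrors BATCH_TLBFLUSH_SIZE in the patch */

/* Userspace model only: a 64-bit word stands in for struct cpumask. */
struct unmap_batch_model {
	uint64_t cpumask;		/* CPUs that may cache a tracked PFN */
	unsigned int nr_pages;		/* number of valid entries in pfns[] */
	unsigned long pfns[BATCH_SIZE];	/* pages waiting for a flush */
	unsigned long ipis_sent;	/* model statistic: simulated IPIs */
	unsigned long page_flushes;	/* model statistic: per-CPU page flushes */
};

/* Drain the batch: one simulated IPI per CPU set in the mask. */
static void model_flush(struct unmap_batch_model *b)
{
	uint64_t mask = b->cpumask;

	if (!b->nr_pages)
		return;

	/* Clear the lowest set bit on each pass, so one pass per CPU. */
	for (; mask; mask &= mask - 1) {
		b->ipis_sent++;
		b->page_flushes += b->nr_pages;
	}
	b->cpumask = 0;
	b->nr_pages = 0;
}

/* Queue one PFN and merge in the CPUs that may hold its translation. */
static void model_add(struct unmap_batch_model *b, unsigned long pfn,
		      uint64_t mm_cpus)
{
	assert(b->nr_pages < BATCH_SIZE);

	b->cpumask |= mm_cpus;
	b->pfns[b->nr_pages++] = pfn;

	/* A full batch is drained immediately, as in the patch. */
	if (b->nr_pages == BATCH_SIZE)
		model_flush(b);
}
```

In this model, adding 40 PFNs that alternate between two CPUs drains the
batch once when it fills at 32 entries, which costs two simulated IPIs
rather than one IPI per unmapped page.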

There is a direct cost to tracking the PFNs, both in memory and in the
individual PFN flushes. In the absolute worst case, the kernel flushes
individual PFNs and none of the active TLB entries were being used. Hence,
these results reflect the full cost without any of the benefit of preserving
existing entries.

On a 4-socket machine the results were:

                                        4.1.0-rc6             4.1.0-rc6
                                    batchdirty-v6         batchunmap-v6
Ops lru-file-mmap-read-elapsed    121.27 (  0.00%)      118.79 (  2.05%)

             4.1.0-rc6        4.1.0-rc6
         batchdirty-v6    batchunmap-v6
User            620.84           608.48
System         4245.35          4152.89
Elapsed         122.65           120.15

In this case the workload completed faster and there was less CPU overhead,
but as it's a NUMA machine there are a lot of factors at play. It's easier
to quantify on a single-socket machine:

                                        4.1.0-rc6             4.1.0-rc6
                                    batchdirty-v6         batchunmap-v6
Ops lru-file-mmap-read-elapsed     20.35 (  0.00%)       21.52 ( -5.75%)

             4.1.0-rc6        4.1.0-rc6
       batchdirty-v6r5  batchunmap-v6r5
User             58.02            60.70
System           77.57            81.92
Elapsed          22.14            23.16

That shows the workload takes 5.75% longer to complete with a similar
increase in the system CPU usage.
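
The percentages quoted in the tables are relative to the batchdirty-v6
baseline, with a negative value meaning the patched kernel was slower. As a
sketch of that arithmetic only (pct_gain is a hypothetical helper, not part
of the patch or of the benchmark tooling):

```c
#include <assert.h>

/*
 * Percentage gain of 'patched' over 'baseline' for a lower-is-better
 * metric such as elapsed time. Negative means 'patched' was slower.
 */
static double pct_gain(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (baseline - patched) / baseline * 100.0;
}
```

With the single-socket figures, pct_gain(20.35, 21.52) is approximately
-5.75 and, for system CPU time, pct_gain(77.57, 81.92) is approximately
-5.61, which is the "similar increase" referred to above.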

Some overhead from tracking the PFNs and flushing individual pages is
expected. That cost can be quantified, but the indirect savings from
preserving active, unrelated TLB entries cannot. Whether this matters
depends on whether the workload was using those entries and whether they
would be used before a context switch, but targeting the TLB flushes is
the conservative and safer choice.

Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
---
 arch/x86/include/asm/tlbflush.h |  2 ++
 include/linux/sched.h           | 12 ++++++++++--
 init/Kconfig                    | 10 ++++------
 mm/rmap.c                       | 25 +++++++++++++------------
 4 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cd791948b286..10c197a649f5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -152,6 +152,8 @@ static inline void __flush_tlb_one(unsigned long addr)
  * and page-granular flushes are available only on i486 and up.
  */

+#define flush_local_tlb_addr(addr) __flush_tlb_single(addr)
+
 #ifndef CONFIG_SMP

 /* "_up" is for UniProcessor.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6b787a7f6c38..4dbffe0a1868 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1289,6 +1289,9 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };

+/* Matches SWAP_CLUSTER_MAX but redefined to limit header dependencies */
+#define BATCH_TLBFLUSH_SIZE 32UL
+
 /* Track pages that require TLB flushes */
 struct tlbflush_unmap_batch {
 	/*
@@ -1297,8 +1300,13 @@ struct tlbflush_unmap_batch {
 	 */
 	struct cpumask cpumask;

-	/* True if any bit in cpumask is set */
-	bool flush_required;
+	/*
+	 * The number and list of pfns to be flushed. PFNs are tracked instead
+	 * of struct pages to avoid multiple page->pfn lookups by each CPU that
+	 * receives an IPI in percpu_flush_tlb_batch_pages.
+	 */
+	unsigned int nr_pages;
+	unsigned long pfns[BATCH_TLBFLUSH_SIZE];

 	/*
 	 * If true then the PTE was dirty when unmapped. The entry must be
diff --git a/init/Kconfig b/init/Kconfig
index 6e6fa4842250..095b3d470c3f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -904,12 +904,10 @@ config ARCH_SUPPORTS_NUMA_BALANCING
 	bool

 #
-# For architectures that prefer to flush all TLBs after a number of pages
-# are unmapped instead of sending one IPI per page to flush. The architecture
-# must provide guarantees on what happens if a clean TLB cache entry is
-# written after the unmap. Details are in mm/rmap.c near the check for
-# should_defer_flush. The architecture should also consider if the full flush
-# and the refill costs are offset by the savings of sending fewer IPIs.
+# For architectures that have a local TLB flush for a PFN without knowledge
+# of the VMA. The architecture must provide guarantees on what happens if
+# a clean TLB cache entry is written after the unmap. Details are in mm/rmap.c
+# near the check for should_defer_flush.
 config ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	bool

diff --git a/mm/rmap.c b/mm/rmap.c
index 1e36b2fb3e95..0085b0eb720c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -586,15 +586,12 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 static void percpu_flush_tlb_batch_pages(void *data)
 {
-	/*
-	 * All TLB entries are flushed on the assumption that it is
-	 * cheaper to flush all TLBs and let them be refilled than
-	 * flushing individual PFNs. Note that we do not track mm's
-	 * to flush as that might simply be multiple full TLB flushes
-	 * for no gain.
-	 */
+	struct tlbflush_unmap_batch *tlb_ubc = data;
+	unsigned int i;
+
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
-	local_flush_tlb();
+	for (i = 0; i < tlb_ubc->nr_pages; i++)
+		flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT);
 }

 /*
@@ -608,10 +605,10 @@ void try_to_unmap_flush(void)
 	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
 	int cpu;

-	if (!tlb_ubc || !tlb_ubc->flush_required)
+	if (!tlb_ubc || !tlb_ubc->nr_pages)
 		return;

-	trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, -1UL);
+	trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);

 	cpu = get_cpu();
 	if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
@@ -622,7 +619,7 @@ void try_to_unmap_flush(void)
 			percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);
 	}
 	cpumask_clear(&tlb_ubc->cpumask);
-	tlb_ubc->flush_required = false;
+	tlb_ubc->nr_pages = 0;
 	tlb_ubc->writable = false;
 	put_cpu();
 }
@@ -642,7 +639,8 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
 	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;

 	cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
-	tlb_ubc->flush_required = true;
+	tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page);
+	tlb_ubc->nr_pages++;

 	/*
 	 * If the PTE was dirty then it's best to assume it's writable. The
@@ -651,6 +649,9 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
 	 */
 	if (writable)
 		tlb_ubc->writable = true;
+
+	if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
+		try_to_unmap_flush();
 }

 /*
--
2.3.5
