[RFC] mm, THP: Map read-only text segments using large THP pages

From: William Kucharski
Date: Mon May 14 2018 - 09:20:47 EST


One of the downsides of THP as currently implemented is that it only supports
large page mappings for anonymous pages.

I embarked upon this prototype on the theory that it would be advantageous to
be able to map large ranges of read-only text pages using THP as well.

The idea is that the kernel will attempt to allocate and map the range using a
PMD sized THP page upon first fault; if the allocation is successful the page
will be populated (at present using a call to kernel_read()) and the page will
be mapped at the PMD level. If memory allocation fails, the page fault routines
will drop through to the conventional PAGESIZE-oriented routines for mapping
the faulting page.

Since this approach will map a PMD size block of the memory map at a time, we
should see a slight uptick in time spent in disk I/O but a substantial drop in
page faults as well as a reduction in iTLB misses as address ranges will be
mapped with the larger page. Analysis of a test program that consists of a very
large text area (483,138,032 bytes in size) that thrashes D$ and I$ shows this
does occur and there is a slight reduction in program execution time.

The text segment as seen from readelf:

LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x000000001ccc19f0 0x000000001ccc19f0 R E 0x200000

As currently implemented for test purposes, the prototype will only use large
pages to map an executable with a particular filename ("testr"), enabling easy
comparison of the same executable using 4K and 2M (x64) pages on the same
kernel. It is understood that this is just a proof of concept implementation
and much more work regarding enabling the feature and overall system usage of
it would need to be done before it was submitted as a kernel patch. However, I
felt it would be worthy to send it out as an RFC so I can find out whether
there are huge objections from the community to doing this at all, or a better
understanding of the major concerns that must be assuaged before it would even
be considered. I currently hardcode CONFIG_TRANSPARENT_HUGEPAGE to the
equivalent of "always" and bypass some checks for anonymous pages by simply
#ifdefing the code out; obviously I would need to determine the right thing to
do in those cases.

Current comparisons of 4K vs 2M pages as generated by "perf stat -d -d -d -r10"
follow; the 4K pagesize program was named "foo" and the 2M pagesize program
"testr" (as noted above) - please note that these numbers do vary from run to
run, but the orders of magnitude of the differences between the two versions
remain relatively constant:

4K Pages:
=========
Performance counter stats for './foo' (10 runs):

307054.450421 task-clock:u (msec) # 1.000 CPUs utilized ( +- 0.21% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
7,728 page-faults:u # 0.025 K/sec ( +- 0.00% )
1,401,295,823,265 cycles:u # 4.564 GHz ( +- 0.19% ) (30.77%)
562,704,668,718 instructions:u # 0.40 insn per cycle ( +- 0.00% ) (38.46%)
20,100,243,102 branches:u # 65.461 M/sec ( +- 0.00% ) (38.46%)
2,628,944 branch-misses:u # 0.01% of all branches ( +- 3.32% ) (38.46%)
180,885,880,185 L1-dcache-loads:u # 589.100 M/sec ( +- 0.00% ) (38.46%)
40,374,420,279 L1-dcache-load-misses:u # 22.32% of all L1-dcache hits ( +- 0.01% ) (38.46%)
232,184,583 LLC-loads:u # 0.756 M/sec ( +- 1.48% ) (30.77%)
23,990,082 LLC-load-misses:u # 10.33% of all LL-cache hits ( +- 1.48% ) (30.77%)
<not supported> L1-icache-loads:u
74,897,499,234 L1-icache-load-misses:u ( +- 0.00% ) (30.77%)
180,990,026,447 dTLB-loads:u # 589.440 M/sec ( +- 0.00% ) (30.77%)
707,373 dTLB-load-misses:u # 0.00% of all dTLB cache hits ( +- 4.62% ) (30.77%)
5,583,675 iTLB-loads:u # 0.018 M/sec ( +- 0.31% ) (30.77%)
1,219,514,499 iTLB-load-misses:u # 21840.71% of all iTLB cache hits ( +- 0.01% ) (30.77%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u

307.093088771 seconds time elapsed ( +- 0.20% )

2M Pages:
=========
Performance counter stats for './testr' (10 runs):

289504.209769 task-clock:u (msec) # 1.000 CPUs utilized ( +- 0.19% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
598 page-faults:u # 0.002 K/sec ( +- 0.03% )
1,323,835,488,984 cycles:u # 4.573 GHz ( +- 0.19% ) (30.77%)
562,658,682,055 instructions:u # 0.43 insn per cycle ( +- 0.00% ) (38.46%)
20,099,662,528 branches:u # 69.428 M/sec ( +- 0.00% ) (38.46%)
2,877,086 branch-misses:u # 0.01% of all branches ( +- 4.52% ) (38.46%)
180,899,297,017 L1-dcache-loads:u # 624.859 M/sec ( +- 0.00% ) (38.46%)
40,209,140,089 L1-dcache-load-misses:u # 22.23% of all L1-dcache hits ( +- 0.00% ) (38.46%)
135,968,232 LLC-loads:u # 0.470 M/sec ( +- 1.56% ) (30.77%)
6,704,890 LLC-load-misses:u # 4.93% of all LL-cache hits ( +- 1.92% ) (30.77%)
<not supported> L1-icache-loads:u
74,955,673,747 L1-icache-load-misses:u ( +- 0.00% ) (30.77%)
180,987,794,366 dTLB-loads:u # 625.165 M/sec ( +- 0.00% ) (30.77%)
835 dTLB-load-misses:u # 0.00% of all dTLB cache hits ( +- 14.35% ) (30.77%)
6,386,207 iTLB-loads:u # 0.022 M/sec ( +- 0.42% ) (30.77%)
51,929,869 iTLB-load-misses:u # 813.16% of all iTLB cache hits ( +- 1.61% ) (30.77%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u

289.551551387 seconds time elapsed ( +- 0.20% )

A check of /proc/meminfo with the test program running shows the large mappings:

ShmemPmdMapped: 471040 kB

FAQ:
====
Q: What kernel is the prototype based on?
A: 4.14.0-rc7

Q: What is the biggest issue you haven't addressed?
A: Given this is a prototype, there are many. Aside from the fact that I
only map large pages for an executable of a specific name ("testr"), the
code must be integrated with large page size support in the page cache
as currently multiple iterations of an executable would each use their
own individually allocated THP pages and those pages filled with data
using kernel_read(), which allows for performance characterization but
would never be acceptable for a production kernel.

A good example of the large page support required is the ext4 support
outlined in:

https://www.mail-archive.com/linux-block@xxxxxxxxxxxxxxx/msg04012.html

There also need to be configuration options to enable this code at all,
likely only for file systems that support large pages, and more
reasonable fixes for the assumptions that all large THP pages are
anonymous assertions in rmap.c (for the prototype I just "#if 0" them out.)

Q: Which processes get their text as large pages?
A: At this point with this implementation it's any process with a read-only
text area of the proper size/alignment.

An attempt is made to align the address for non-MAP_FIXED addresses.

I do not make any attempt to move mappings that take up a majority of a
large page to a large page; I only map a large page if the address
aligns and the map size is larger than or equal to a large page.

Q: Which architectures has this been tested on?
A: At present, only x64.

Q: How about architectures (ARM, for instance) with multiple large page
sizes that are reasonable for text mappings?
A: At present a "large page" is just PMD size; it would be possible with
additional effort to allow for mapping using PUD-sized pages.

Q: What about the use of non-PMD large page sizes (on non-x86 architectures)?
A: I haven't looked into that; I don't have an answer as to how to best
map a page that wasn't sized to be a PMD or PUD.

Signed-off-by: William Kucharski <william.kucharski@xxxxxxxxxx>

===============================================================

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed113ea..f4ac381 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -146,8 +146,8 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
return -EINVAL;

- vma_len = (loff_t)(vma->vm_end - vma->vm_start);
- len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
+ vma_len = (loff_t)(vma->vm_end - vma->vm_start); /* length of VMA */
+ len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); /* add vma->vm_pgoff * PAGESIZE */
/* check for overflow */
if (len < vma_len)
return -EINVAL;
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 87067d2..353bec8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -80,13 +80,15 @@ extern struct kobj_attribute shmem_enabled_attr;
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define HPAGE_PMD_SHIFT PMD_SHIFT
-#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
-#define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
-
-#define HPAGE_PUD_SHIFT PUD_SHIFT
-#define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
-#define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
+#define HPAGE_PMD_SHIFT PMD_SHIFT
+#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
+#define HPAGE_PMD_OFFSET (HPAGE_PMD_SIZE - 1)
+#define HPAGE_PMD_MASK (~(HPAGE_PMD_OFFSET))
+
+#define HPAGE_PUD_SHIFT PUD_SHIFT
+#define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
+#define HPAGE_PUD_OFFSET (HPAGE_PUD_SIZE - 1)
+#define HPAGE_PUD_MASK (~(HPAGE_PUD_OFFSET))

extern bool is_vma_temporary_stack(struct vm_area_struct *vma);

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1981ed6..7b61c92 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -445,6 +445,14 @@ subsys_initcall(hugepage_init);

static int __init setup_transparent_hugepage(char *str)
{
+#if 1
+ set_bit(TRANSPARENT_HUGEPAGE_FLAG,
+ &transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+ &transparent_hugepage_flags);
+ printk("THP permanently set ON\n");
+ return 1;
+#else
int ret = 0;
if (!str)
goto out;
@@ -471,6 +479,7 @@ static int __init setup_transparent_hugepage(char *str)
if (!ret)
pr_warn("transparent_hugepage= cannot parse, ignored\n");
return ret;
+#endif
}
__setup("transparent_hugepage=", setup_transparent_hugepage);

@@ -532,8 +541,11 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,

if (addr)
goto out;
+
+#if 0
if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
goto out;
+#endif

addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
if (addr)
diff --git a/mm/memory.c b/mm/memory.c
index a728bed..fc352d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3506,7 +3506,99 @@ late_initcall(fault_around_debugfs);
* fault_around_pages() value (and therefore to page order). This way it's
* easier to guarantee that we don't cross page table boundaries.
*/
-static int do_fault_around(struct vm_fault *vmf)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static
+int do_fault_around_thp(struct vm_fault *vmf)
+{
+ struct file *file = vmf->vma->vm_file;
+ unsigned long address = vmf->address;
+ pgoff_t start_pgoff = vmf->pgoff;
+ pgoff_t end_pgoff;
+ int ret = VM_FAULT_FALLBACK;
+ int off;
+
+ /*
+ * vmf->address will be the higher of (fault address & HPAGE_PMD_MASK)
+ * or the start of the VMA.
+ */
+ vmf->address = max((address & HPAGE_PMD_MASK), vmf->vma->vm_start);
+
+ /*
+ * Not a candidate if the start address calculated above isnt properly
+ * aligned
+ */
+ if (vmf->address & HPAGE_PMD_OFFSET)
+ goto dfa_thp_out;
+
+ off = ((address - vmf->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+ start_pgoff -= off;
+
+ /*
+ * end_pgoff is either end of page table or end of vma
+ * or fault_around_pages() from start_pgoff, depending what is
+ * smallest.
+ */
+ end_pgoff = start_pgoff -
+ ((vmf->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+ PTRS_PER_PTE - 1;
+ end_pgoff = min3(end_pgoff, vma_pages(vmf->vma) + vmf->vma->vm_pgoff - 1,
+ start_pgoff + PTRS_PER_PTE - 1);
+
+ /*
+ * Check to see if we could map this request with a large THP page
+ * instead.
+ */
+ if (((strncmp(file->f_path.dentry->d_name.name, "testr", 5) == 0)) &&
+ pmd_none(*vmf->pmd) &&
+ ((end_pgoff - start_pgoff) >=
+ ((HPAGE_PMD_SIZE >> PAGE_SHIFT) - 1))) {
+ struct page *page;
+
+ page = alloc_pages_vma(vmf->gfp_mask | __GFP_COMP |
+ __GFP_NORETRY, HPAGE_PMD_ORDER, vmf->vma,
+ vmf->address, numa_node_id(), 1);
+
+ if ((likely(page)) && (PageTransCompound(page))) {
+ ssize_t bytes_read;
+ void *pg_vaddr;
+
+ prep_transhuge_page(page);
+ pg_vaddr = page_address(page);
+
+ if (likely(pg_vaddr)) {
+ loff_t loff = (loff_t)
+ (start_pgoff << PAGE_SHIFT);
+ bytes_read = kernel_read(file, pg_vaddr,
+ HPAGE_PMD_SIZE, &loff);
+ VM_BUG_ON(bytes_read != HPAGE_PMD_SIZE);
+
+ smp_wmb(); /* See comment in __pte_alloc() */
+ ret = alloc_set_pte(vmf, NULL, page);
+
+ if (likely(ret == 0)) {
+ VM_BUG_ON_PAGE(pmd_none(*vmf->pmd),
+ page);
+ vmf->page = page;
+ ret = VM_FAULT_NOPAGE;
+ goto dfa_thp_out;
+ }
+ }
+
+ put_page(page);
+ }
+ }
+
+dfa_thp_out:
+ vmf->address = address;
+ VM_BUG_ON(vmf->pte != NULL);
+ return ret;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+
+static
+int do_fault_around(struct vm_fault *vmf)
{
unsigned long address = vmf->address, nr_pages, mask;
pgoff_t start_pgoff = vmf->pgoff;
@@ -3566,6 +3658,21 @@ static int do_read_fault(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
int ret = 0;

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /*
+ * Check to see if we could map this request with a large THP page
+ * instead.
+ */
+ if ((vma_pages(vmf->vma) >= PTRS_PER_PMD) &&
+ ((strncmp(vmf->vma->vm_file->f_path.dentry->d_name.name,
+ "testr", 5)) == 0)) {
+ ret = do_fault_around_thp(vmf);
+
+ if (ret == VM_FAULT_NOPAGE)
+ return ret;
+ }
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
/*
* Let's call ->map_pages() first and use ->fault() as fallback
* if page by the offset is not ready to be mapped (cold cache or
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506f..1c281d7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1327,6 +1327,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
struct mm_struct *mm = current->mm;
int pkey = 0;

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ unsigned long thp_maywrite = VM_MAYWRITE;
+#endif
+
*populate = 0;

if (!len)
@@ -1361,7 +1365,32 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
/* Obtain the address to map to. we verify (or select) it and ensure
* that it represents a valid section of the address space.
*/
- addr = get_unmapped_area(file, addr, len, pgoff, flags);
+
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /*
+ *
+ * If THP is enabled, and it's a read-only executable that is
+ * MAP_PRIVATE mapped, call the appropriate thp function to perhaps get a
+ * large page aligned virtual address, otherwise use the normal routine.
+ *
+ * Note the THP routine will return a normal page size aligned start
+ * address in some cases.
+ */
+ if ((prot & PROT_READ) && (prot & PROT_EXEC) && (!(prot & PROT_WRITE)) &&
+ (len >= HPAGE_PMD_SIZE) && (flags & MAP_PRIVATE) &&
+ ((!(flags & MAP_FIXED)) || (!(addr & HPAGE_PMD_OFFSET)))) {
+ addr = thp_get_unmapped_area(file, addr, len, pgoff,
+ flags);
+ if (addr && (!(addr & HPAGE_PMD_OFFSET)))
+ thp_maywrite = 0;
+ } else {
+#endif
+ addr = get_unmapped_area(file, addr, len, pgoff, flags);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ }
+#endif
+
if (offset_in_page(addr))
return addr;

@@ -1376,7 +1405,11 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
* of the memory object, so we don't do any here.
*/
vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ mm->def_flags | VM_MAYREAD | thp_maywrite | VM_MAYEXEC;
+#else
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
+#endif

if (flags & MAP_LOCKED)
if (!can_do_mlock())
diff --git a/mm/rmap.c b/mm/rmap.c
index b874c47..4fc24f8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1184,7 +1184,9 @@ void page_add_file_rmap(struct page *page, bool compound)
}
if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
goto out;
+#if 0
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+#endif
__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
} else {
if (PageTransCompound(page) && page_mapping(page)) {
@@ -1224,7 +1226,9 @@ static void page_remove_file_rmap(struct page *page, bool compound)
}
if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
goto out;
+#if 0
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+#endif
__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
} else {
if (!atomic_add_negative(-1, &page->_mapcount))