Re: [PATCH 2/2] hugepage: Allow parallelization of the hugepagefault path

From: Mel Gorman
Date: Wed Jan 26 2011 - 04:25:01 EST


On Tue, Jan 25, 2011 at 02:34:14PM +1100, Anton Blanchard wrote:
> From: David Gibson <dwg@xxxxxxxxxxx>
>
> At present, the page fault path for hugepages is serialized by a
> single mutex. This is used to avoid spurious out-of-memory conditions
> when the hugepage pool is fully utilized (two processes or threads can
> race to instantiate the same mapping with the last hugepage from the
> pool, the race loser returning VM_FAULT_OOM). This problem is
> specific to hugepages, because it is normal to want to use every
> single hugepage in the system - with normal pages we simply assume
> there will always be a few spare pages which can be used temporarily
> until the race is resolved.
>
> Unfortunately this serialization also means that clearing of hugepages
> cannot be parallelized across multiple CPUs, which can lead to very
> long process startup times when using large numbers of hugepages.
>
> This patch improves the situation by replacing the single mutex with a
> table of mutexes, selected based on a hash of the address_space and
> file offset being faulted (or mm and virtual address for MAP_PRIVATE
> mappings).
>
> From: Anton Blanchard <anton@xxxxxxxxx>
>
> Forward ported and made a few changes:
>
> - Use the Jenkins hash to scatter the hash, better than using just the
> low bits.
>
> - Always round num_fault_mutexes to a power of two to avoid an expensive
> modulus in the hash calculation.
>
> I also tested this patch on a 64 thread POWER6 box using a simple parallel
> fault testcase:
>
> http://ozlabs.org/~anton/junkcode/parallel_fault.c
>
> Command line options:
>
> parallel_fault <nr_threads> <size in kB> <skip in kB>
>
> First the time taken to fault 48GB of 16MB hugepages:
> # time hugectl --heap ./parallel_fault 1 50331648 16384
> 11.1 seconds
>
> Now the same test with 64 concurrent threads:
> # time hugectl --heap ./parallel_fault 64 50331648 16384
> 8.8 seconds
>
> Hardly any speedup. Finally the 64 concurrent threads test with this patch
> applied:
> # time hugectl --heap ./parallel_fault 64 50331648 16384
> 0.7 seconds
>
> We go from 8.8 seconds to 0.7 seconds, an improvement of 12.6x.
>
> Signed-off-by: David Gibson <dwg@xxxxxxxxxxx>
> Signed-off-by: Anton Blanchard <anton@xxxxxxxxx>

I haven't tested this patch yet, but I would typically test it by running
multiple parallel instances of "make func" from libhugetlbfs. In particular,
I would be looking out for counter corruption. Has something like this been
done? I know hugetlb_lock protects the counters, but the locking in there has
turned into a bit of a mess, so it's easy to miss something.

> ---
>
> Index: powerpc.git/mm/hugetlb.c
> ===================================================================
> --- powerpc.git.orig/mm/hugetlb.c 2011-01-25 13:20:49.311405902 +1100
> +++ powerpc.git/mm/hugetlb.c 2011-01-25 13:45:54.437235053 +1100
> @@ -21,6 +21,7 @@
> #include <linux/rmap.h>
> #include <linux/swap.h>
> #include <linux/swapops.h>
> +#include <linux/jhash.h>
>
> #include <asm/page.h>
> #include <asm/pgtable.h>
> @@ -54,6 +55,13 @@ static unsigned long __initdata default_
> static DEFINE_SPINLOCK(hugetlb_lock);
>
> /*
> + * Serializes faults on the same logical page. This is used to
> + * prevent spurious OOMs when the hugepage pool is fully utilized.
> + */
> +static unsigned int num_fault_mutexes;
> +static struct mutex *htlb_fault_mutex_table;
> +
> +/*
> * Region tracking -- allows tracking of reservations and instantiated pages
> * across the pages in a mapping.
> */
> @@ -1764,6 +1772,8 @@ module_exit(hugetlb_exit);
>
> static int __init hugetlb_init(void)
> {
> + int i;
> +
> /* Some platform decide whether they support huge pages at boot
> * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
> * there is no such support
> @@ -1790,6 +1800,12 @@ static int __init hugetlb_init(void)
>
> hugetlb_register_all_nodes();
>
> + num_fault_mutexes = roundup_pow_of_two(2 * num_possible_cpus());
> + htlb_fault_mutex_table =
> + kmalloc(num_fault_mutexes * sizeof(struct mutex), GFP_KERNEL);

And if this fails? It would be unusual, I know, but num_possible_cpus() could
conceivably be large enough to prevent kmalloc() from granting the request.
Do you need to do something similar to profile_init() here instead?
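
To make the failure case concrete, here is a rough userspace sketch of the
sizing and allocation step with the return value checked. It is not the
kernel code: malloc() stands in for kmalloc(), and struct fake_mutex,
roundup_pow2() and fault_mutex_table_init() are hypothetical names invented
for illustration. A profile_init()-style fallback to another allocator could
be added where the sketch returns an error.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct mutex so the sketch builds in userspace. */
struct fake_mutex {
	int locked;
};

static unsigned int num_fault_mutexes;
static struct fake_mutex *fault_mutex_table;

/* Round n up to the next power of two; small n only, no overflow handling. */
static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Size the table at 2 * nr_cpus rounded up to a power of two, and report
 * an allocation failure to the caller instead of dereferencing NULL.
 * Returns 0 on success, -1 if the allocation failed.
 */
static int fault_mutex_table_init(unsigned int nr_cpus)
{
	unsigned int i, n = roundup_pow2(2 * nr_cpus);
	struct fake_mutex *table = malloc(n * sizeof(*table));

	if (!table)
		return -1;

	for (i = 0; i < n; i++)
		table[i].locked = 0;

	/* Publish the table only once every entry is initialised. */
	fault_mutex_table = table;
	num_fault_mutexes = n;
	return 0;
}
```

The point is only that the init path needs somewhere to go when the
allocation fails; what the right fallback is in hugetlb_init() is a separate
question.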

> + for (i = 0; i < num_fault_mutexes; i++)
> + mutex_init(&htlb_fault_mutex_table[i]);
> +
> return 0;
> }
> module_init(hugetlb_init);
> @@ -2616,6 +2632,27 @@ backout_unlocked:
> goto out;
> }
>
> +static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + struct address_space *mapping,
> + unsigned long pagenum, unsigned long address)

pagenum could be anything. Leave it as idx or index, because then it's easier
to guess that it's the result of vma_hugecache_offset().
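
For illustration, this is roughly how the key selection reads with the
parameter named idx. It is a userspace sketch under assumptions: mix_key() is
a hypothetical mixing function standing in for jhash2(), pointers are passed
as plain 64-bit integers, and fault_slot() is an invented name. It also shows
why a power-of-two table size lets a mask replace the modulus.

```c
#include <stdint.h>

/* Hypothetical mixing step standing in for the kernel's jhash2(). */
static uint32_t mix_key(uint64_t a, uint64_t b)
{
	uint64_t x = (a * 0x9E3779B97F4A7C15ULL) ^ b;

	x ^= x >> 33;
	x *= 0xFF51AFD7ED558CCDULL;
	x ^= x >> 33;
	return (uint32_t)x;
}

/*
 * Pick a slot in a table of nr_slots mutexes. Shared mappings key on
 * (mapping, idx); private mappings key on (mm, address >> huge_shift).
 * nr_slots must be a power of two so that the mask equals hash % nr_slots.
 */
static uint32_t fault_slot(int shared, uint64_t mapping, uint64_t idx,
			   uint64_t mm, uint64_t address,
			   unsigned int huge_shift, uint32_t nr_slots)
{
	uint32_t hash;

	if (shared)
		hash = mix_key(mapping, idx);
	else
		hash = mix_key(mm, address >> huge_shift);

	return hash & (nr_slots - 1);
}
```

Two private faults inside the same 16MB hugepage (huge_shift of 24) shift to
the same key and so land on the same slot, which is the serialization the
table has to preserve.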

> +{
> + unsigned long key[2];
> + u32 hash;
> +
> + if ((vma->vm_flags & VM_SHARED)) {
> + key[0] = (unsigned long)mapping;
> + key[1] = pagenum;
> + } else {
> + key[0] = (unsigned long)mm;
> + key[1] = address >> huge_page_shift(h);
> + }
> +
> + hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
> +
> + return hash & (num_fault_mutexes - 1);
> +}
> +
> int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, unsigned int flags)
> {
> @@ -2624,8 +2661,10 @@ int hugetlb_fault(struct mm_struct *mm,
> int ret;
> struct page *page = NULL;
> struct page *pagecache_page = NULL;
> - static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> struct hstate *h = hstate_vma(vma);
> + struct address_space *mapping;
> + unsigned long idx;
> + u32 hash;
>
> ptep = huge_pte_offset(mm, address);
> if (ptep) {
> @@ -2642,12 +2681,16 @@ int hugetlb_fault(struct mm_struct *mm,
> if (!ptep)
> return VM_FAULT_OOM;
>
> + mapping = vma->vm_file->f_mapping;
> + idx = vma_hugecache_offset(h, vma, address);
> +
> /*
> * Serialize hugepage allocation and instantiation, so that we don't
> * get spurious allocation failures if two CPUs race to instantiate
> * the same page in the page cache.
> */
> - mutex_lock(&hugetlb_instantiation_mutex);
> + hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
> + mutex_lock(&htlb_fault_mutex_table[hash]);
> entry = huge_ptep_get(ptep);
> if (huge_pte_none(entry)) {
> ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> @@ -2716,7 +2759,7 @@ out_page_table_lock:
> unlock_page(page);
>
> out_mutex:
> - mutex_unlock(&hugetlb_instantiation_mutex);
> + mutex_unlock(&htlb_fault_mutex_table[hash]);
>
> return ret;
> }
>

I didn't spot anything wrong, but I'd be happier knowing that multiple
parallel "make func" runs from libhugetlbfs had been tested as well.

--
Mel Gorman
Linux Technology Center
IBM Dublin Software Lab