Re: [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings

From: Adam Litke
Date: Mon Apr 21 2008 - 15:05:16 EST



On Mon, 2008-04-21 at 19:36 +0100, Mel Gorman wrote:
> MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time. This is
> so that all future faults will be guaranteed to succeed. Applications are not
> expected to use mlock() as this can result in poor NUMA placement.
>
> MAP_PRIVATE mappings do not reserve pages. This can result in an application
> being SIGKILLed later if a large page is not available at fault time. This
> makes huge page usage very ill-advised in some cases, as the unexpected
> application failure is intolerable. Forcing potentially poor placement with
> mlock() is not a great solution either.
>
> This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings, similarly
> to what happens for MAP_SHARED mappings. Once mmap() succeeds, the application
> developer knows that future faults will also succeed. However, there is no
> guarantee that children of the process will be able to write-fault the same
> mapping. The assumption is being made that the majority of applications that
> fork() either use MAP_SHARED as an IPC mechanism or are calling exec().
>
> Opinions?

This is a very sound idea in my opinion, and definitely a step in the right
direction toward more reliable huge pages. With this patch (and my
comments below addressed), the only remaining cause of unexpected SIGKILLs
would be copy-on-write.
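
For reference, the copy-on-write case I mean looks something like this (a
minimal userspace sketch; the /mnt/huge mount point, file name and 2MB huge
page size are just assumptions for illustration):

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)  /* assumed huge page size */

int main(void)
{
        int fd = open("/mnt/huge/testfile", O_CREAT | O_RDWR, 0600);
        char *p;

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /*
         * With the proposed patch, this mmap() fails cleanly if the
         * huge pages cannot be reserved, instead of the process being
         * SIGKILLed at first touch.
         */
        p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        p[0] = 1;               /* parent faults in its private copy */

        if (fork() == 0) {
                /*
                 * The child's write forces a copy-on-write of the huge
                 * page. No reservation covers this allocation, so the
                 * child can still be SIGKILLed here if no huge page is
                 * free -- the remaining hole.
                 */
                p[0] = 2;
                _exit(0);
        }
        wait(NULL);
        return 0;
}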

> @@ -40,6 +40,34 @@ static int hugetlb_next_nid;
> */
> static DEFINE_SPINLOCK(hugetlb_lock);
>
> +/* Helpers to track the number of pages reserved for a MAP_PRIVATE vma */
> +static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma)
> +{
> +        if (!(vma->vm_flags & VM_MAYSHARE))
> +                return (unsigned long)vma->vm_private_data;
> +        return 0;
> +}
> +
> +static void adjust_vma_resv_huge_pages(struct vm_area_struct *vma,
> +                                        int delta)
> +{
> +        BUG_ON((unsigned long)vma->vm_private_data > 100);
> +        WARN_ON_ONCE(vma->vm_flags & VM_MAYSHARE);
> +        if (!(vma->vm_flags & VM_MAYSHARE)) {
> +                unsigned long reserve;
> +                reserve = (unsigned long)vma->vm_private_data + delta;
> +                vma->vm_private_data = (void *)reserve;
> +        }
> +}
> +
> +static void set_vma_resv_huge_pages(struct vm_area_struct *vma,
> +                                        unsigned long reserve)
> +{
> +        BUG_ON((unsigned long)vma->vm_private_data > 100);

I assume you're just playing it safe for this RFC, but surely a 100-page
maximum reservation is not sufficient (especially since we have a whole
unsigned long to work with). Also, I am not sure a BUG_ON is an
appropriate response to exceeding the maximum.
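
Something along these lines (an untested sketch of the direction I mean;
HUGETLB_MAX_RESV is a made-up name for whatever limit, if any, we really
want to enforce) would let the caller unwind gracefully instead:

/*
 * Sketch: refuse the reservation rather than BUG() on an arbitrary cap.
 * HUGETLB_MAX_RESV is hypothetical -- pick a real policy here.
 */
static int set_vma_resv_huge_pages(struct vm_area_struct *vma,
                                   unsigned long reserve)
{
        if (WARN_ON_ONCE(vma->vm_flags & VM_MAYSHARE))
                return -EINVAL;
        if (reserve > HUGETLB_MAX_RESV)
                return -ENOMEM;
        vma->vm_private_data = (void *)reserve;
        return 0;
}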

<snip>

> @@ -1223,52 +1305,30 @@ static long region_truncate(struct list_
>         return chg;
> }
>
> -static int hugetlb_acct_memory(long delta)
> +int hugetlb_reserve_pages(struct inode *inode,
> +                                long from, long to,
> +                                struct vm_area_struct *vma)
> {
> -        int ret = -ENOMEM;
> +        long ret, chg;
>
> -        spin_lock(&hugetlb_lock);
>         /*
> -         * When cpuset is configured, it breaks the strict hugetlb page
> -         * reservation as the accounting is done on a global variable. Such
> -         * reservation is completely rubbish in the presence of cpuset because
> -         * the reservation is not checked against page availability for the
> -         * current cpuset. Application can still potentially OOM'ed by kernel
> -         * with lack of free htlb page in cpuset that the task is in.
> -         * Attempt to enforce strict accounting with cpuset is almost
> -         * impossible (or too ugly) because cpuset is too fluid that
> -         * task or memory node can be dynamically moved between cpusets.
> -         *
> -         * The change of semantics for shared hugetlb mapping with cpuset is
> -         * undesirable. However, in order to preserve some of the semantics,
> -         * we fall back to check against current free page availability as
> -         * a best attempt and hopefully to minimize the impact of changing
> -         * semantics that cpuset has.
> +         * Shared mappings and read-only mappings should base their reservation
> +         * on the number of pages that are already allocated on behalf of the
> +         * file. Private mappings that are writable need to reserve the full
> +         * area. Note that a read-only private mapping that subsequently calls
> +         * mprotect() to make it read-write may not work reliably.
>          */
> -        if (delta > 0) {
> -                if (gather_surplus_pages(delta) < 0)
> -                        goto out;
> -
> -                if (delta > cpuset_mems_nr(free_huge_pages_node)) {
> -                        return_unused_surplus_pages(delta);
> -                        goto out;
> -                }
> +        if (vma->vm_flags & VM_SHARED)
> +                chg = region_chg(&inode->i_mapping->private_list, from, to);
> +        else {
> +                if (vma->vm_flags & VM_MAYWRITE)
> +                        chg = to - from;
> +                else
> +                        chg = region_chg(&inode->i_mapping->private_list,
> +                                                from, to);
> +                set_vma_resv_huge_pages(vma, chg);

To promote reliability, might it be advisable to just reserve the pages
regardless of VM_MAYWRITE? Otherwise we might want to consider
reserving the pages in hugetlb_change_protection().
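
If we go the mprotect() route, I am picturing something like the following
(untested sketch against your patch; vma_resv_huge_pages() is the helper
you introduce above, and the reservation call itself is elided):

void hugetlb_change_protection(struct vm_area_struct *vma,
                unsigned long address, unsigned long end, pgprot_t newprot)
{
        /*
         * If I read mprotect_fixup() right, vma->vm_flags has already
         * been updated to the new protections by the time we get here.
         * A private mapping going writable with no reservation is the
         * case the patch leaves uncovered.
         */
        if (!(vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE) &&
            !vma_resv_huge_pages(vma)) {
                /*
                 * ... reserve (end - address) >> HPAGE_SHIFT pages here.
                 * One wrinkle: this function returns void, so failing
                 * the reservation would need plumbing back up to the
                 * mprotect() path.
                 */
        }

        /* existing PTE walk unchanged below */
}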

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
