Re: [PATCH] mm/hugetlb: expand restore_reserve_on_error functionality

From: Mike Kravetz
Date: Wed Jun 02 2021 - 21:08:13 EST


On 6/2/21 5:38 PM, Mina Almasry wrote:
> On Wed, Jun 2, 2021 at 4:50 PM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>>
>> The routine restore_reserve_on_error is called to restore reservation
>> information when an error occurs after page allocation. The routine
>> alloc_huge_page modifies the mapping reserve map and potentially the
>> reserve count during allocation. If code calling alloc_huge_page
>> encounters an error after allocation and needs to free the page, the
>> reservation information needs to be adjusted.
>>
>> Currently, restore_reserve_on_error only takes action on pages for which
>> the reserve count was adjusted (HPageRestoreReserve flag). There is
>> nothing wrong with these adjustments. However, alloc_huge_page ALWAYS
>> modifies the reserve map during allocation even if the reserve count is
>> not adjusted. This can cause issues as observed during development of
>> this patch [1].
>>
>> One specific series of operations causing an issue is:
>> - Create a shared hugetlb mapping
>> Reservations for all pages created by default
>> - Fault in a page in the mapping
>> Reservation exists so reservation count is decremented
>> - Punch a hole in the file/mapping at index previously faulted
>> Reservation and any associated pages will be removed
>> - Allocate a page to fill the hole
>> No reservation entry, so reserve count unmodified
>> Reservation entry added to map by alloc_huge_page
>> - Error after allocation and before instantiating the page
>> Reservation entry remains in map
>> - Allocate a page to fill the hole
>> Reservation entry exists, so decrement reservation count
>> This will cause a reservation count underflow as the reservation count
>> was decremented twice for the same index.
>>
>> A user would observe a very large number for HugePages_Rsvd in
>> /proc/meminfo. This would also likely cause subsequent allocations of
>> hugetlb pages to fail as it would 'appear' that all pages are reserved.
>>
>> This sequence of operations is unlikely to happen; however, it was
>> easily reproduced and observed using hacked up code as described in [1].
>>
>> Address the issue by having the routine restore_reserve_on_error take
>> action on pages where HPageRestoreReserve is not set. In this case, we
>> need to remove any reserve map entry created by alloc_huge_page. A new
>> helper routine vma_del_reservation assists with this operation.
>>
>> There are three callers of alloc_huge_page which do not currently call
>> restore_reserve_on_error before freeing a page on error paths. Add
>> those missing calls.
>>
>> [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@xxxxxxxxxx/
>> Fixes: 96b96a96ddee ("mm/hugetlb: fix huge page reservation leak in private mapping error paths")
>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>> Cc: <stable@xxxxxxxxxxxxxxx>
>> ---
>> fs/hugetlbfs/inode.c | 1 +
>> include/linux/hugetlb.h | 2 +
>> mm/hugetlb.c | 120 ++++++++++++++++++++++++++++++++--------
>> 3 files changed, 100 insertions(+), 23 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index 21f7a5c92e8a..926eeb9bf4eb 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -735,6 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>> __SetPageUptodate(page);
>> error = huge_add_to_page_cache(page, mapping, index);
>> if (unlikely(error)) {
>> + restore_reserve_on_error(h, &pseudo_vma, addr, page);
>> put_page(page);
>> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>> goto out;
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index 44721df20e89..b375b31f60c4 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -627,6 +627,8 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
>> unsigned long address);
>> int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>> pgoff_t idx);
>> +void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
>> + unsigned long address, struct page *page);
>>
>> /* arch callback */
>> int __init __alloc_bootmem_huge_page(struct hstate *h);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 9a616b7a8563..36b691c3eabf 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -2259,12 +2259,18 @@ static void return_unused_surplus_pages(struct hstate *h,
>> * be restored when a newly allocated huge page must be freed. It is
>> * to be called after calling vma_needs_reservation to determine if a
>> * reservation exists.
>> + *
>> + * vma_del_reservation is used in error paths where an entry in the reserve
>> + * map was created during huge page allocation and must be removed. It is to
>> + * be called after calling vma_needs_reservation to determine if a reservation
>> + * exists.
>> */
>> enum vma_resv_mode {
>> VMA_NEEDS_RESV,
>> VMA_COMMIT_RESV,
>> VMA_END_RESV,
>> VMA_ADD_RESV,
>> + VMA_DEL_RESV,
>> };
>> static long __vma_reservation_common(struct hstate *h,
>> struct vm_area_struct *vma, unsigned long addr,
>> @@ -2308,11 +2314,21 @@ static long __vma_reservation_common(struct hstate *h,
>> ret = region_del(resv, idx, idx + 1);
>> }
>> break;
>> + case VMA_DEL_RESV:
>> + if (vma->vm_flags & VM_MAYSHARE) {
>> + region_abort(resv, idx, idx + 1, 1);
>> + ret = region_del(resv, idx, idx + 1);
>> + } else {
>> + ret = region_add(resv, idx, idx + 1, 1, NULL, NULL);
>> + /* region_add calls of range 1 should never fail. */
>> + VM_BUG_ON(ret < 0);
>> + }
>> + break;
>
> I haven't tested, but at first glance I don't think this quite works,
> no? The thing is that alloc_huge_page() does a
> vma_commit_reservation() which does region_add() regardless of whether
> vm_flags & VM_MAYSHARE is set, so to unroll that we need a function
> that does a region_del() regardless of whether vm_flags & VM_MAYSHARE
> is set. I wonder if the root of the bug is the unconditional
> region_add() that vma_commit_reservation() does.

Do give it a test. I believe I tested in the same way you did, but I
could be mistaken.

We may not want to always unroll. vma_del_reservation is only called in
the path where HPageRestoreReserve is not set. So, no reservation entry
was present in the reserve map before the allocation. alloc_huge_page
likely added an entry via vma_commit_reservation. Since there was
an error and we will be freeing the page, we need to make sure no
reservation exists. As you know, making sure a reservation does not
exist is different for shared and private mappings.
- For shared mappings, we need to make sure there is not an entry in the
reserve map.
- For private mappings, we need to make sure there is an entry in the
reserve map.
That is why vma_del_reservation does a region_del for shared mappings
and a region_add for private mappings.
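
The double decrement, and the effect of cleaning up the stale reserve map
entry on the error path, can be sketched with a tiny userspace model of the
bookkeeping for the shared-mapping case. Everything here (model_alloc,
replay, and so on) is an invented toy for illustration only, not kernel
code:

```c
/* Toy model of hugetlb reservation bookkeeping for a shared mapping.
 * All names are hypothetical; this is not the kernel implementation. */
#include <assert.h>
#include <stdbool.h>

#define NR_PAGES 4

static bool map_has_entry[NR_PAGES]; /* reserve map: one entry per index */
static int resv_count;               /* analogue of HugePages_Rsvd       */

/* alloc_huge_page analogue: consume a reservation if one exists, then
 * unconditionally (re)add a reserve map entry on commit. */
static void model_alloc(int idx)
{
	if (map_has_entry[idx])
		resv_count--;          /* reservation consumed */
	map_has_entry[idx] = true;     /* vma_commit_reservation analogue */
}

/* Hole punch analogue: the page and its reserve map entry go away. */
static void model_punch_hole(int idx)
{
	map_has_entry[idx] = false;
}

/* Proposed error-path cleanup for a shared mapping: ensure no entry
 * survives a failed allocation (vma_del_reservation analogue). */
static void model_restore_on_error_shared(int idx)
{
	map_has_entry[idx] = false;
}

/* Replay the failing sequence from the commit message. When 'fixed' is
 * true, the error path removes the stale reserve map entry. */
static int replay(bool fixed)
{
	resv_count = NR_PAGES;
	for (int i = 0; i < NR_PAGES; i++)
		map_has_entry[i] = true;   /* reservations created at mmap */

	model_alloc(0);                    /* fault in page 0: count 4 -> 3 */
	model_punch_hole(0);               /* punch hole at index 0 */
	model_alloc(0);                    /* refill hole: no entry, count kept */
	/* error after allocation; the page is freed */
	if (fixed)
		model_restore_on_error_shared(0);
	model_alloc(0);                    /* refill hole again */
	return resv_count;
}
```

Without the cleanup, the stale entry makes the second refill decrement the
count a second time for the same index, leaving it one lower than it should
be; with the cleanup, the count ends where it started after the first fault.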


> Also, I believe hugetlb_unreserve_pages() calls region_del() directly,
> and doesn't go through the vma_*_reservation() interface. If you're
> adding a delete function it may be nice to convert that to use the new
> function as well.
>
> I'm happy to take this fix over and submit a v2, since I think I
> understand the problem and can readily reproduce the issue (I just add
> the warning and some prints and run the userfaultfd tests).

Do run your test. I am interested to see if you experience issues.
--
Mike Kravetz