Re: [PATCH 04/10] mm/hugetlb: Move swap entry handling into vma lock when faulted

From: Mike Kravetz
Date: Mon Dec 05 2022 - 17:17:11 EST


On 11/29/22 14:35, Peter Xu wrote:
> In hugetlb_fault(), there used to have a special path to handle swap entry
> at the entrance using huge_pte_offset(). That's unsafe because
> huge_pte_offset() for a pmd sharable range can access freed pgtables if
> without any lock to protect the pgtable from being freed after pmd unshare.

Thanks. That special path has been there for a long time. I should have given
it more thought when adding lock. However, I was only considering the 'stale'
pte case as opposed to freed.

> Here the simplest solution to make it safe is to move the swap handling to
> be after the vma lock being held. We may need to take the fault mutex on
> either migration or hwpoison entries now (also the vma lock, but that's
> really needed), however neither of them is hot path.
>
> Note that the vma lock cannot be released in hugetlb_fault() when the
> migration entry is detected, because in migration_entry_wait_huge() the
> pgtable page will be used again (by taking the pgtable lock), so that also
> need to be protected by the vma lock. Modify migration_entry_wait_huge()
> so that it must be called with vma read lock held, and properly release the
> lock in __migration_entry_wait_huge().
>
> Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> ---
> include/linux/swapops.h | 6 ++++--
> mm/hugetlb.c | 32 +++++++++++++++-----------------
> mm/migrate.c | 25 +++++++++++++++++++++----
> 3 files changed, 40 insertions(+), 23 deletions(-)

Looks good with one small change noted below,

Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>

>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 27ade4f22abb..09b22b169a71 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -335,7 +335,8 @@ extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> unsigned long address);
> #ifdef CONFIG_HUGETLB_PAGE
> -extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
> +extern void __migration_entry_wait_huge(struct vm_area_struct *vma,
> + pte_t *ptep, spinlock_t *ptl);
> extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
> #endif /* CONFIG_HUGETLB_PAGE */
> #else /* CONFIG_MIGRATION */
> @@ -364,7 +365,8 @@ static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> unsigned long address) { }
> #ifdef CONFIG_HUGETLB_PAGE
> -static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
> +static inline void __migration_entry_wait_huge(struct vm_area_struct *vma,
> + pte_t *ptep, spinlock_t *ptl) { }
> static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
> #endif /* CONFIG_HUGETLB_PAGE */
> static inline int is_writable_migration_entry(swp_entry_t entry)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dfe677fadaf8..776e34ccf029 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5826,22 +5826,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> int need_wait_lock = 0;
> unsigned long haddr = address & huge_page_mask(h);
>
> - ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
> - if (ptep) {
> - /*
> - * Since we hold no locks, ptep could be stale. That is
> - * OK as we are only making decisions based on content and
> - * not actually modifying content here.
> - */
> - entry = huge_ptep_get(ptep);
> - if (unlikely(is_hugetlb_entry_migration(entry))) {
> - migration_entry_wait_huge(vma, ptep);
> - return 0;
> - } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> - return VM_FAULT_HWPOISON_LARGE |
> - VM_FAULT_SET_HINDEX(hstate_index(h));
> - }
> -

Before acquiring the vma_lock, there is this comment:

/*
* Acquire vma lock before calling huge_pte_alloc and hold
* until finished with ptep. This prevents huge_pmd_unshare from
* being called elsewhere and making the ptep no longer valid.
*
* ptep could have already be assigned via hugetlb_walk(). That
* is OK, as huge_pte_alloc will return the same value unless
* something has changed.
*/

The second sentence in that comment about ptep being already assigned no
longer applies and can be deleted.

--
Mike Kravetz