Re: [RFC PATCH] kvm: Use huge pages for DAX-backed files

From: Dan Williams
Date: Mon Oct 29 2018 - 18:25:58 EST


On Mon, Oct 29, 2018 at 2:07 PM Barret Rhoden <brho@xxxxxxxxxx> wrote:
>
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
>
> DAX pages are not PageTransCompound. The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not. For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>
> For DAX, we can check the page table itself. Actually, we might always
> be able to walk the page table, even for PageTransCompound pages, but
> it's probably a little slower.
>
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock (grabbed before checking
> the mmu seq). Based on the other comments about not worrying about a
> pmd split, we might be able to safely walk the page table without
> holding the mm sem.
>
> This patch relies on kvm_is_reserved_pfn() being false for DAX pages,
> which I've hacked up for testing this code. That change should
> eventually happen:
>
> https://lore.kernel.org/lkml/20181022084659.GA84523@tiger-server/
>
> Another issue is that kvm_mmu_zap_collapsible_spte() also uses
> PageTransCompoundMap() to detect huge pages, but we don't have a way to
> get the HVA easily. Can we just aggressively zap DAX pages there?
>
> Alternatively, is there a better way to track at the struct page level
> whether or not a page is huge-mapped? Maybe the DAX huge pages mark
> themselves as TransCompound or something similar, and we don't need to
> special case DAX/ZONE_DEVICE pages.
>
> Signed-off-by: Barret Rhoden <brho@xxxxxxxxxx>
> ---
> arch/x86/kvm/mmu.c | 71 +++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 70 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index cf5f572f2305..9f3e0f83a2dd 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3152,6 +3152,75 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> return -EFAULT;
> }
>
> +static unsigned long pgd_mapping_size(struct mm_struct *mm, unsigned long addr)
> +{
> + pgd_t *pgd;
> + p4d_t *p4d;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + return 0;
> +
> + p4d = p4d_offset(pgd, addr);
> + if (!p4d_present(*p4d))
> + return 0;
> + if (p4d_huge(*p4d))
> + return P4D_SIZE;
> +
> + pud = pud_offset(p4d, addr);
> + if (!pud_present(*pud))
> + return 0;
> + if (pud_huge(*pud))
> + return PUD_SIZE;
> +
> + pmd = pmd_offset(pud, addr);
> + if (!pmd_present(*pmd))
> + return 0;
> + if (pmd_huge(*pmd))
> + return PMD_SIZE;
> +
> + pte = pte_offset_map(pmd, addr);
> + if (!pte_present(*pte))
> + return 0;
> + return PAGE_SIZE;
> +}
> +
> +static bool pfn_is_pmd_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> + struct page *page = pfn_to_page(pfn);
> + unsigned long hva, map_sz;
> +
> + if (!is_zone_device_page(page))
> + return PageTransCompoundMap(page);
> +
> + /*
> + * DAX pages do not use compound pages. The page should have already
> + * been mapped into the host-side page table during try_async_pf(), so
> + * we can check the page tables directly.
> + */
> + hva = gfn_to_hva(kvm, gfn);
> + if (kvm_is_error_hva(hva))
> + return false;
> +
> + /*
> + * Our caller grabbed the KVM mmu_lock with a successful
> + * mmu_notifier_retry, so we're safe to walk the page table.
> + */
> + map_sz = pgd_mapping_size(current->mm, hva);
> + switch (map_sz) {
> + case PMD_SIZE:
> + return true;
> + case P4D_SIZE:
> + case PUD_SIZE:
> + printk_once(KERN_INFO "KVM THP promo found a very large page");

Why not allow PUD_SIZE? The device-dax interface supports PUD mappings.

> + return false;
> + }
> + return false;
> +}

The above two functions are similar to what we need to do to determine
the blast radius of a memory error; see dev_pagemap_mapping_shift() and
its usage in add_to_kill().

> +
> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> gfn_t *gfnp, kvm_pfn_t *pfnp,
> int *levelp)
> @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> */
> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> level == PT_PAGE_TABLE_LEVEL &&
> - PageTransCompoundMap(pfn_to_page(pfn)) &&
> + pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&

I'm wondering, if we're adding an explicit is_zone_device_page() check
in this path to determine the page mapping size, whether that check can
replace the kvm_is_reserved_pfn() check. In other words, the goal of
fixing up PageReserved() was to preclude the need for DAX-page special
casing in KVM, but if we already need to add some special casing for
page size determination, we might as well drop the
kvm_is_reserved_pfn() dependency as well.