[RFC PATCH 00/19] hugetlb support for KVM guest_mem

From: Ackerley Tng
Date: Tue Jun 06 2023 - 15:04:23 EST


Hello,

This patchset builds upon a soon-to-be-published WIP patchset that Sean
published at https://github.com/sean-jc/linux/tree/x86/kvm_gmem_solo, mentioned
at [1].

The tree can be found at:
https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1

In this patchset, hugetlb support for KVM's guest_mem (aka gmem) is introduced,
allowing VM private memory (for confidential computing) to be backed by hugetlb
pages.

guest_mem provides userspace with a handle, with which userspace can allocate
and deallocate memory for confidential VMs without mapping the memory into
userspace.

Why use hugetlb instead of introducing a new allocator, like gmem does for 4K
and transparent hugepages?

+ hugetlb provides the following useful functionality, which would otherwise
have to be reimplemented:
+ Allocation of hugetlb pages at boot time, including
+ Parsing of kernel boot parameters to configure hugetlb
+ Tracking of usage in hstate
+ gmem will share the same system-wide pool of hugetlb pages, so users
don't have to have separate pools for hugetlb and gmem
+ Page accounting with subpools
+ hugetlb pages are tracked in subpools, which gmem uses to reserve
pages from the global hstate
+ Memory charging
+ hugetlb provides code that charges memory to cgroups
+ Reporting: hugetlb usage and availability are available at /proc/meminfo,
etc

The first 11 patches in this patchset is a series of refactoring to decouple
hugetlb and hugetlbfs.

The central thread binding the refactoring is that some functions (like
inode_resv_map(), inode_subpool(), inode_hstate(), etc) rely on a hugetlbfs
concept, that the resv_map, subpool, hstate, are in a specific field in a
hugetlb inode.

Refactoring to parametrize functions by hstate, subpool, resv_map will allow
hugetlb to be used by gmem and in other places where these data structures
aren't necessarily stored in the same positions in the inode.

The refactoring proposed here is just the minimum required to get a
proof-of-concept working with gmem. I would like to get opinions on this
approach before doing further refactoring. (See TODOs)

TODOs:

+ hugetlb/hugetlbfs refactoring
+ remove_inode_hugepages() no longer needs to be exposed, it is hugetlbfs
specific and used only in inode.c
+ remove_mapping_hugepages(), remove_inode_single_folio(),
hugetlb_unreserve_pages() shouldn't need to take inode as a parameter
+ Updating inode->i_blocks can be refactored to a separate function and
called from hugetlbfs and gmem
+ alloc_hugetlb_folio_from_subpool() shouldn't need to be parametrized by
vma
+ hugetlb_reserve_pages() should be refactored to be symmetric with
hugetlb_unreserve_pages()
+ It should be parametrized by resv_map
+ alloc_hugetlb_folio_from_subpool() could perhaps use
hugetlb_reserve_pages()?
+ gmem
+ Figure out if resv_map should be used by gmem at all
+ Probably needs more refactoring to decouple resv_map from hugetlb
functions

Questions for the community:

1. In this patchset, every gmem file backed with hugetlb is given a new
subpool. Is that desirable?
+ In hugetlbfs, a subpool always belongs to a mount, and hugetlbfs has one
mount per hugetlb size (2M, 1G, etc)
+ memfd_create(MFD_HUGETLB) effectively returns a full hugetlbfs file, so it
(rightfully) uses the hugetlbfs kernel mounts and their subpools
+ I gave each file a subpool mostly to speed up implementation and still be
able to reserve hugetlb pages from the global hstate based on the gmem
file size.
+ gmem, unlike hugetlbfs, isn't meant to be a full filesystem, so
+ Should there be multiple mounts, one for each hugetlb size?
+ Will the mounts be initialized on boot or on first gmem file creation?
+ Or is one subpool per gmem file fine?
2. Should resv_map be used for gmem at all, since gmem doesn't allow userspace
reservations?

[1] https://lore.kernel.org/lkml/ZEM5Zq8oo+xnApW9@xxxxxxxxxx/

---

Ackerley Tng (19):
mm: hugetlb: Expose get_hstate_idx()
mm: hugetlb: Move and expose hugetlbfs_zero_partial_page
mm: hugetlb: Expose remove_inode_hugepages
mm: hugetlb: Decouple hstate, subpool from inode
mm: hugetlb: Allow alloc_hugetlb_folio() to be parametrized by subpool
and hstate
mm: hugetlb: Provide hugetlb_filemap_add_folio()
mm: hugetlb: Refactor vma_*_reservation functions
mm: hugetlb: Refactor restore_reserve_on_error
mm: hugetlb: Use restore_reserve_on_error directly in filesystems
mm: hugetlb: Parametrize alloc_hugetlb_folio_from_subpool() by
resv_map
mm: hugetlb: Parametrize hugetlb functions by resv_map
mm: truncate: Expose preparation steps for truncate_inode_pages_final
KVM: guest_mem: Refactor kvm_gmem fd creation to be in layers
KVM: guest_mem: Refactor cleanup to separate inode and file cleanup
KVM: guest_mem: hugetlb: initialization and cleanup
KVM: guest_mem: hugetlb: allocate and truncate from hugetlb
KVM: selftests: Add basic selftests for hugetlbfs-backed guest_mem
KVM: selftests: Support various types of backing sources for private
memory
KVM: selftests: Update test for various private memory backing source
types

fs/hugetlbfs/inode.c | 102 ++--
include/linux/hugetlb.h | 86 ++-
include/linux/mm.h | 1 +
include/uapi/linux/kvm.h | 25 +
mm/hugetlb.c | 324 +++++++-----
mm/truncate.c | 24 +-
.../testing/selftests/kvm/guest_memfd_test.c | 33 +-
.../testing/selftests/kvm/include/test_util.h | 14 +
tools/testing/selftests/kvm/lib/test_util.c | 74 +++
.../kvm/x86_64/private_mem_conversions_test.c | 38 +-
virt/kvm/guest_mem.c | 488 ++++++++++++++----
11 files changed, 882 insertions(+), 327 deletions(-)

--
2.41.0.rc0.172.g3f132b7071-goog