Re: [PATCH v13 10/24] gunyah: vm_mgr: Add/remove user memory regions

From: Elliot Berman
Date: Tue Jul 18 2023 - 22:29:50 EST


Hi Will,

On 7/14/2023 5:13 AM, Will Deacon wrote:
On Thu, Jul 13, 2023 at 01:28:34PM -0700, Elliot Berman wrote:
On 6/22/2023 4:56 PM, Elliot Berman wrote:
On 6/7/2023 8:54 AM, Elliot Berman wrote:
On 6/5/2023 7:18 AM, Will Deacon wrote:
On Fri, May 19, 2023 at 10:02:29AM -0700, Elliot Berman wrote:
The user interface design for *shared* memory aligns with
KVM_SET_USER_MEMORY_REGION.
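
For reference, this is roughly how a VMM uses that KVM interface from
userspace. It's a minimal, illustrative sketch: error handling is
omitted, and the guest-physical address and size are arbitrary.

#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);
	int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
	size_t size = 2 * 1024 * 1024;

	/* Plain anonymous memory backs the guest. */
	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	struct kvm_userspace_memory_region region = {
		.slot = 0,
		.flags = 0,		/* or e.g. KVM_MEM_READONLY */
		.guest_phys_addr = 0x80000000,
		.memory_size = size,
		.userspace_addr = (unsigned long)mem,
	};

	/* KVM takes no longterm pin here; the pages stay under normal
	 * mm control (swap, migration, ...) via MMU notifiers. */
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}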

I don't think it does. For example, file mappings don't work (as above),
you're placing additional rlimit requirements on the caller, read-only
memslots are not functional, the memory cannot be swapped or migrated,
dirty logging doesn't work, etc. pKVM is in the same boat, but that's
why we're not upstreaming this part in its current form.


I thought pKVM was only holding off on upstreaming the changes related
to guest-private memory?

I understood we want to use the restricted memfd for providing
guest-private memory (Gunyah calls this "lending memory"). When I went
through the changes, I gathered that KVM is using the restricted memfd
only for guest-private memory and not for shared memory. Thus, I dropped
support for lending memory to the guest VM and retained only the shared
memory support in this series. I'd like to merge what we can today and
introduce the guest-private memory support in tandem with the restricted
memfd; I don't see much reason to delay the series.

Right, protected guests will use the new restricted memfd ("guest mem"
now, I think?), but non-protected guests should implement the existing
interface *without* the need for the GUP pin on guest memory pages. Yes,
that means full support for MMU notifiers so that these pages can be
managed properly by the host kernel. We're working on that for pKVM, but
it requires a more flexible form of memory sharing than what we currently
have, so that e.g. the zero page can be shared between multiple entities.
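
To illustrate what "full support for MMU notifiers" means in practice,
here is a hedged sketch (not pKVM or Gunyah code; gh_stage2_unmap() and
the other gh_* names are hypothetical). The hypervisor driver subscribes
to invalidations on the VMM's address space and tears down stage-2
mappings in response, which leaves the host mm free to swap or migrate
the backing pages:

#include <linux/mmu_notifier.h>

/* Hypothetical hypervisor call that removes a range from the guest's
 * stage-2 page tables and flushes the TLBs. */
static void gh_stage2_unmap(unsigned long start, unsigned long size);

static int gh_invalidate_range_start(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *range)
{
	/* The host mm is about to unmap or move these pages; drop them
	 * from stage-2 first so the guest faults instead of using a
	 * stale translation. */
	gh_stage2_unmap(range->start, range->end - range->start);
	return 0;
}

static const struct mmu_notifier_ops gh_mmu_ops = {
	.invalidate_range_start = gh_invalidate_range_start,
};

static struct mmu_notifier gh_mn = { .ops = &gh_mmu_ops };

/* Called once per VM with the VMM's mm; a real driver would embed the
 * notifier in a per-VM structure rather than use a global. */
static int gh_register_notifier(struct mm_struct *mm)
{
	return mmu_notifier_register(&gh_mn, mm);
}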

Gunyah doesn't support swapping pages out while the guest is running,
and the design of Gunyah isn't meant to give the host kernel full
control over the S2 page tables for its guests. As best I can tell from
reading the respective drivers, ACRN and Nitro Enclaves both GUP-pin
guest memory pages before giving them to the guest, so I don't think
this requirement from Gunyah is particularly unusual.
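
For concreteness, the longterm pin pattern under discussion looks like
this. It is a sketch with illustrative names, not the actual Gunyah,
ACRN, or Nitro code:

#include <linux/mm.h>
#include <linux/slab.h>

static long gh_pin_region(unsigned long userspace_addr, int npages,
			  struct page ***pagesp)
{
	struct page **pages;
	int pinned;

	pages = kvcalloc(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* FOLL_LONGTERM tells the mm that these pages may stay pinned
	 * indefinitely; this is exactly what prevents swap and
	 * migration for the lifetime of the VM. */
	pinned = pin_user_pages_fast(userspace_addr, npages,
				     FOLL_WRITE | FOLL_LONGTERM, pages);
	if (pinned != npages) {
		if (pinned > 0)
			unpin_user_pages(pages, pinned);
		kvfree(pages);
		return pinned < 0 ? pinned : -EFAULT;
	}

	*pagesp = pages;
	return 0;
}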


I read up on and dug into MMU notifiers some more, and I don't think
they match Gunyah's feature set today. We don't allow the host to
freely manage a VM's pages, because that would require the guest VM to
place a level of trust in the host. Once a page is given to the guest,
it stays given for the lifetime of the VM. Allowing the host to replace
pages in the guest memory map isn't part of the security model of any
VM we run on Gunyah. With that requirement, longterm pinning looks like
the correct approach to me.

Is my approach of longterm pinning correct, given that Gunyah doesn't
allow the host to freely swap pages?

No, I really don't think a longterm GUP pin is the right approach for this.
GUP pins in general are horrible for the mm layer, but required for cases
such as DMA where I/O faults are unrecoverable. Gunyah is not a good
justification for such a hack, and I don't think you get to choose which
parts of the Linux mm you want and which bits you don't.

In other words, either carve out your memory and pin it that way, or
implement the proper hooks for the mm to do its job.

I talked to the team about whether we can extend Gunyah to support
this. We have plans to support sharing/lending individual pages to the
guest as it faults on them. That support also allows (unprotected)
pages to be removed from the VM. We'll need to pin the pages of the VM
configuration device tree blob only temporarily, while the VM is being
created; those pages can be unpinned once the VM starts. I'll work on
this.
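
A sketch of the temporary-pin part of that plan. The gh_vm structure
and function names here are hypothetical, not the real driver code; the
point is that the DTB pages need only a short-lived pin (no
FOLL_LONGTERM), unlike memory pinned for the whole lifetime of the VM:

#include <linux/kernel.h>
#include <linux/mm.h>

struct gh_vm {
	unsigned long dtb_userspace_addr;
	size_t dtb_size;
	struct page **dtb_pages;	/* array sized for the DTB */
	int dtb_npages;
};

/* Pin the DTB pages only while the VM configuration is being parsed. */
static int gh_vm_pin_dtb(struct gh_vm *vm)
{
	vm->dtb_npages = DIV_ROUND_UP(vm->dtb_size, PAGE_SIZE);

	/* Short-lived pin: the caller must check that all pages were
	 * pinned and undo a partial pin on failure. */
	return pin_user_pages_fast(vm->dtb_userspace_addr, vm->dtb_npages,
				   FOLL_WRITE, vm->dtb_pages);
}

/* Once the VM has started, the configuration has been consumed and the
 * pages can go back under normal mm control. */
static void gh_vm_unpin_dtb(struct gh_vm *vm)
{
	unpin_user_pages_dirty_lock(vm->dtb_pages, vm->dtb_npages, true);
}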

Thanks for the feedback!

- Elliot