Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

From: Isaku Yamahata
Date: Thu Mar 07 2024 - 21:20:03 EST


On Thu, Mar 07, 2024 at 05:28:20PM -0800,
Sean Christopherson <seanjc@xxxxxxxxxx> wrote:

> On Thu, Mar 07, 2024, David Matlack wrote:
> > On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > > > +:Returns: 0 on success, <0 on error
> > > > > > +
> > > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > > > +
> > > > > > +::
> > > > > > +
> > > > > > + struct kvm_memory_mapping {
> > > > > > + __u64 base_gfn;
> > > > > > + __u64 nr_pages;
> > > > > > + __u64 flags;
> > > > > > + __u64 source;
> > > > > > + };
> > > > > > +
> > > > > > + /* For kvm_memory_mapping:: flags */
> > > > > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> > > > > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> > > > > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
> > > > >
> > > > > I am not sure what's the good of having "FLAG_USER"?
> > > > >
> > > > > This ioctl is called from userspace, thus I think we can just treat this always
> > > > > as user-fault?
> > > >
> > > > The point is how to emulate kvm page fault as if vcpu caused the kvm page
> > > > fault. Not we call the ioctl as user context.
> > >
> > > Sorry I don't quite follow. What's wrong if KVM just append the #PF USER
> > > error bit before it calls into the fault handler?
> > >
> > > My question is, since this is ABI, you have to tell how userspace is
> > > supposed to use this. Maybe I am missing something, but I don't see how
> > > USER should be used here.
> >
> > If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
> > is meaningless, PFERR_USER_MASK is only relevant for shadow paging.
>
> +1
>
> > KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
> > populated with writes (which avoids just faulting in the zero-page for
> > anon or tmpfs backed memslots), while also allowing populating read-only
> > memslots.
> >
> > I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.
>
> It would midly be interesting for something like the NX hugepage mitigation.
>
> For the initial implementation, I don't think the ioctl() should specify
> protections, period.
>
> VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections.
> And for guest_memfd, finer grained control in general, and long term compatibility
> with other features that are in-flight or proposed, I would rather userspace specify
> RWX protections via KVM_SET_MEMORY_ATTRIBUTES. Oh, and dirty logging would be a
> pain too.
>
> KVM doesn't currently support execute-only (XO) or !executable (RW), so I think
> we can simply define KVM_MAP_MEMORY to behave like a read fault. E.g. map RX,
> and add W if all underlying protections allow it.
>
> That way we can defer dealing with things like XO and RW *if* KVM ever does gain
> support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which
> will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow
> for XO memory.
>
> And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if
> KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and
> RX combinations, e.g. if there's a use-case for write-protecting guest_memfd
> regions.
>
> We can always expand the uAPI, but taking away functionality is much harder, if
> not impossible.

Ok, let me drop all the flags. Here is the updated one.

4.143 KVM_MAP_MEMORY
------------------------

:Capability: KVM_CAP_MAP_MEMORY
:Architectures: none
:Type: vcpu ioctl
:Parameters: struct kvm_memory_mapping(in/out)
:Returns: 0 on success, < 0 on error

Errors:

====== =============================================================
EINVAL vcpu state is not in TDP MMU mode or is in guest mode.
Currently, this ioctl is restricted to TDP MMU.
EAGAIN The region is only processed partially. The caller should
issue the ioctl with the updated parameters.
EINTR An unmasked signal is pending. The region may be processed
partially. If `nr_pages` > 0, the caller should issue the
ioctl with the updated parameters.
====== =============================================================

KVM_MAP_MEMORY populates guest memory before the VM starts to run. Multiple
vcpus can call this ioctl simultaneously. It may result in the error of EAGAIN
due to race conditions.

::

struct kvm_memory_mapping {
__u64 base_gfn;
__u64 nr_pages;
__u64 flags;
__u64 source;
};

KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`,
`nr_pages`) in the underlying mapping. `source` is an optional user pointer. If
`source` is not NULL and the underlying technology supports it, the memory
contents of `source` are copied into the guest memory. The backend may encrypt
it. `flags` must be zero. It's reserved for future use.

When the ioctl returns, the input values are updated. If `nr_pages` is large,
it may return EAGAIN or EINTR for pending signal and update the values
(`base_gfn` and `nr_pages`. `source` if not zero) to point to the remaining
range.

--
Isaku Yamahata <isaku.yamahata@xxxxxxxxxxxxxxx>