Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

From: Vishal Annapurve
Date: Thu May 18 2023 - 21:07:59 EST


On Thu, May 11, 2023 at 1:22 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> ...
> Ah, you're effectively suggesting a hybrid model where the file is the single
> source of truth for what's private versus shared, and KVM gets pfns through
> direct communication with the backing store via the file descriptor, but userspace
> can still control things via mmap() and friends.
>
> If you're not suggesting a backdoor, i.e. KVM still gets private pfns via hvas,
> then we're back at Kirill's series, because otherwise there's no easy way for KVM
> to retrieve the pfn.
>

Yeah, I was hinting at using the backdoor, where KVM still gets
private pfns via HVAs.
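
Roughly something like the below is what I had in mind (sketch only,
not from any posted series; backdoor_gfn_to_pfn() is a made-up name,
the rest are existing KVM/GUP helpers), i.e. the pfn is still resolved
through the userspace mapping while the file stays the source of truth
for private vs. shared:

/*
 * Rough sketch only: resolve a private gfn through its HVA the same
 * way KVM resolves ordinary memory (gfn -> hva -> pfn).
 */
static kvm_pfn_t backdoor_gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
{
        unsigned long hva = gfn_to_hva(kvm, gfn);
        struct page *page;

        if (kvm_is_error_hva(hva))
                return KVM_PFN_ERR_FAULT;

        /* Pin the page through the userspace mapping (the "backdoor"). */
        if (get_user_pages_fast(hva, 1, FOLL_WRITE, &page) != 1)
                return KVM_PFN_ERR_FAULT;

        return page_to_pfn(page);
}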

> A form of this was also discussed, though I don't know how much of the discussion
> happened on-list.
>
> KVM actually does something like this for s390's Ultravisor (UV), which is quite
> a bit like TDX (UV is a trusted intermediary) except that it handles faults much,
> much more gracefully. Specifically, when the untrusted host attempts to access a
> secure page, a fault occurs and the kernel responds by telling UV to export the
> page. The fault is gracefully handled even for kernel accesses
> (see do_secure_storage_access()). The kernel does BUG() if the export fails when
> handling a fault from kernel context, but my understanding is that export can fail
> if and only if there's a fatal error elsewhere, i.e. the UV essentially _ensures_
> success, and goes straight to BUG()/panic() if something goes wrong.
>
> On the guest side, accesses to exported (swapped) secure pages generate intercepts
> and KVM faults in the page. To do so, KVM freezes the page/folio refcount, tells
> the UV to import the page, and then unfreezes the page/folio. But very crucially,
> when _anything_ in the untrusted host attempts to access the secure page, the
> above fault handling for untrusted host accesses kicks in. In other words, the
> guest can cause thrash, but can't bring down the host.
>

Yeah, this is very similar to what I was trying to propose, except
that in this case the backing store, i.e. guest_mem, would have to
leave the fault unhandled (i.e. fail it) for untrusted host accesses
to private ranges of the guest_mem file.
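
I.e. roughly something like this for the guest_mem ->fault handler
(purely illustrative sketch; guest_mem_priv, gmem_range_is_private()
and gmem_get_shared_folio() are made-up names), so that any
mmap()-based access from the untrusted host to a private offset fails
instead of being silently served:

static vm_fault_t guest_mem_fault(struct vm_fault *vmf)
{
        struct guest_mem_priv *priv = vmf->vma->vm_file->private_data;
        pgoff_t index = vmf->pgoff;
        struct folio *folio;

        /* Untrusted host access to a private range: leave it unhandled. */
        if (gmem_range_is_private(priv, index))
                return VM_FAULT_SIGBUS;

        folio = gmem_get_shared_folio(priv, index);
        if (IS_ERR(folio))
                return VM_FAULT_OOM;

        vmf->page = folio_file_page(folio, index);
        return 0;
}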

> TDX on the other hand silently poisons memory, i.e. doesn't even generate a
> synchronous fault. Thus the kernel needs to be 100% perfect on preventing _any_
> accesses to private memory from the host, and doing that is non-trivial and
> invasive.
>
> SNP does synchronously fault, but automatically converting in the #PF handler
> got NAK'd[*] for good reasons, e.g. SNP doesn't guarantee conversion success as the
> guest can trigger concurrent RMP modifications. So the end result ends up being
> the same as TDX, host accesses need to be completely prevented.
>
> Again, this is all doable, but costly. And IMO, provides very little value.

With this hybrid approach, where KVM has backdoor access to the pfns,
do we see a scenario where the host can bypass the guest_mem
restrictions and still access the private ranges via their HVAs? One
possibility is that these pages are mapped in the IOMMU (while they
are shared) and then get converted to private without being unmapped
from the IOMMU. Maybe KVM could disallow converting ranges that are
pinned for DMA (not sure if there is a way to detect that).
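
For the detection part, maybe something along these lines (again just
a sketch with a made-up conversion helper; folio_maybe_dma_pinned()
only catches FOLL_PIN users, so this would be best-effort at most):

/* Sketch: refuse shared -> private conversion of DMA-pinned pages. */
static int gmem_convert_range_to_private(struct folio *folio)
{
        /*
         * Best-effort: only detects pages pinned with FOLL_PIN,
         * e.g. long-term pins taken for device DMA.
         */
        if (folio_maybe_dma_pinned(folio))
                return -EBUSY;

        /* ... unmap from the shared view and mark the range private ... */
        return 0;
}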

A few additional benefits here:
1) The pKVM use case could possibly be handled by this series without
the need for additional modifications.
2) Handling UPM for normal VMs could possibly get simpler, as this
hybrid approach can allow preserving the contents across conversions.

>
> Allowing things like mbind() is nice-to-have at best, as implementing fbind()
> isn't straightforward, and is arguably valuable to have irrespective of this
> discussion, e.g. to allow userspace to say "use this policy regardless of what
> process maps the file".
>

Agreed, having mbind() supported is not a significant gain given the cost here.

> Using a common memory pool (same physical page is used for both shared and private)
> is a similar story. There are plenty of existing controls to limit userspace/guest
> memory usage and to deal with OOM scenarios, so barring egregious host accounting
> and/or limiting bugs, which would affect _all_ VM types, the worst case scenario
> is that a VM is terminated because host userspace is buggy. On the flip side, using
> a common pool brings complexity into the kernel, as backing stores would need to
> be taught to deny access to a subset of pages in their mappings, and in multiple
> paths, e.g. faults, read()/write() and similar, page migration, swap, etc.

In this case, though, the only backing store that would need to be
modified is guest_mem.

>
> [*] https://lore.kernel.org/linux-mm/8a244d34-2b10-4cf8-894a-1bf12b59cf92@xxxxxxxxxxxxxxxx
>
> > > Issues that led to us abandoning the "map with special !Present PTEs" approach:
> > >
> > > - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
> > > is inefficient and inflexible compared to software defined structures, especially
> > > for the expected use cases for CoCo guests.
> > >
> > > - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
> > > let alone a 1:1 pfn:gfn mapping.
> >
> > Maybe KVM can ensure that each page of the guest_mem file is
> > associated with a single memslot.
>
> This is a hard NAK. Guest physical address space is guaranteed to have holes
> and/or be discontiguous, e.g. for the PCI hole at the top of lower memory. Allowing
> only a single binding would prevent userspace from backing all (or large chunks)
> of guest memory with a single file.
>

Poor choice of words on my side. I meant to suggest that KVM can
ensure that any given page of the guest_mem file is associated with at
most one memslot.
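
E.g. something like the below could enforce that at binding time
(sketch only; the bindings xarray and gmem_bind_range() are
illustrative names, and the multi-index store needs
CONFIG_XARRAY_MULTI):

/* Sketch: refuse to bind a guest_mem range any memslot already owns. */
static int gmem_bind_range(struct xarray *bindings, pgoff_t start,
                           pgoff_t end, struct kvm_memory_slot *slot)
{
        pgoff_t index = start;

        /* 'end' is exclusive; reject any overlap with an existing binding. */
        if (xa_find(bindings, &index, end - 1, XA_PRESENT))
                return -EEXIST;

        return xa_err(xa_store_range(bindings, start, end - 1, slot,
                                     GFP_KERNEL));
}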

> ...
> That'd work for the hybrid model (fd backdoor with pseudo mmap() support), but
> not for a generic VMA-based implementation. If the file isn't the single source
> of truth, then forcing all mappings to go away simply can't work.
>
> > > #3 is also a limiter. E.g. if a guest is primarily backed by 1GiB pages, keeping
> > > the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
> > > and possibly even if the guest converts a few MiB of memory.
> >
> > This caveat maybe can be lived with as shared ranges most likely will
> > not be backed by 1G pages anyways, possibly causing IO performance to
> > get hit. This possibly needs more discussion about conversion
> > granularity used by guests.
>
> Yes, it's not the end of the world. My point is that separating shared and private
> memory provides more flexibility. Maybe that flexibility never ends up being
> super important, but at the same time we shouldn't willingly paint ourselves into
> a corner.

There are some performance implications with the split approach here.

This flexibility actually comes at the cost of effectively managing
double allocation. As the mapping granularity for shared memory
increases, it gets harder to cap the amount of double allocation, so
in practice it comes down to always using a smaller granularity for
shared memory, and likewise for the private memory backing converted
ranges. At the same time, performance requirements will generally push
for larger mapping granularities, depending on the scale of usage.

In general, private memory (and likewise the shared memory on the
corresponding conversions) will always need to be hole punched to
ensure that double allocation does not happen. So, even if this is
something for the future, backing private memory with hugetlbfs pages
under the split model effectively makes it impossible to cap the
double allocation. I am not sure whether 1G pages can be handled
better with the hybrid model, but it may be worth checking.
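
For reference, the hole punch on conversion would just be fallocate()
from userspace, along these lines (illustrative only; the fd and
offset handling depends on the final uAPI):

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Illustrative only: on conversion, free the side that is no longer in
 * use so the same guest range is never backed twice.
 */
static int release_old_backing(int fd, off_t offset, off_t len)
{
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, len);
}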

The split shared/private memory approach also increases uncertainty
around memory management in general, since the same amount of memory
that was available earlier is first freed back to the system and then
allocated from it again. E.g. even if huge pages were around when
private memory was initially allocated, further allocations keep
increasing the likelihood of not being able to use a huge page to back
the memory, even when the whole huge page is private/shared.