Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

From: Vishal Annapurve
Date: Wed May 10 2023 - 19:04:02 EST


On Wed, May 10, 2023 at 2:39 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, May 10, 2023, Vishal Annapurve wrote:
> > On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > ...
> > > cold. I poked around a bit to see how we could avoid reinventing all of that
> > > infrastructure for fd-only memory, and the best idea I could come up with is
> > > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> > > allow "mapping" fd-only memory, but ensure that memory is never actually present
> > > from hardware's perspective.
> > >
> >
> > I am most likely missing a lot of context here and possibly venturing
> > into an infeasible/already shot down direction here.
>
> Both :-)
>
> > But I would still like to get this discussed here before we move on.
> >
> > I am wondering if it would make sense to implement
> > restricted_mem/guest_mem file to expose both private and shared memory
> > regions, in line with Kirill's original proposal now that the file
> > implementation is controlled by KVM.
> >
> > Thinking from userspace perspective:
> > 1) Userspace creates guest mem files and is able to mmap() them, but all
> > accesses to these files result in faults as no memory is allowed to
> > be mapped into userspace VMM pagetables.
>
> Never mapping anything into the userspace page table is infeasible. Technically
> it's doable, but it'd effectively require all of the work of an fd-based approach
> (and probably significantly more), _and_ it'd require touching core mm code.
>
> VMAs don't provide hva=>pfn information, they're the kernel's way of implementing
> the abstraction provided to userspace by mmap(), mprotect() etc. Among many other
> things, a VMA describes properties of what is mapped, e.g. hugetlbfs versus
> anonymous, where memory is mapped (virtual address), how memory is mapped, e.g.
> RWX protections, etc. But a VMA doesn't track the physical address, that info
> is all managed through the userspace page tables.
>
> To allow userspace to mmap() memory but not access it (without
> redoing how the kernel fundamentally manages virtual=>physical mappings), the
> simplest approach is to install PTEs into userspace page tables, but never mark
> them Present in hardware, i.e. prevent actually accessing the backing memory.
> This is exactly what Kirill's series in link [3] below implemented.
>

Maybe it's simpler to do this when the mmap()ed regions are backed by files.

I see that shmem has fault handlers for accesses to VMA regions
associated with its files. In theory, a file implementation can always
choose not to allocate physical pages for such faults (similar to the
F_SEAL_FAULT_AUTOALLOCATE proposal that was discussed earlier).
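
Something along these lines is what I have in mind (a rough sketch in
the shmem vm_ops style; gmem_range_is_private() and gmem_fault_shared()
are hypothetical helpers, not existing code):

/*
 * Refuse to materialize pages for offsets that are currently private;
 * only shared offsets fall through to normal page allocation.
 */
static vm_fault_t gmem_fault(struct vm_fault *vmf)
{
	struct file *file = vmf->vma->vm_file;

	if (gmem_range_is_private(file, vmf->pgoff))
		return VM_FAULT_SIGBUS;

	return gmem_fault_shared(vmf);
}

static const struct vm_operations_struct gmem_vm_ops = {
	.fault = gmem_fault,
};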

> Issues that led to us abandoning the "map with special !Present PTEs" approach:
>
> - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
> is inefficient and inflexible compared to software defined structures, especially
> for the expected use cases for CoCo guests.
>
> - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
> let alone a 1:1 pfn:gfn mapping.

Maybe KVM can ensure that each page of the guest_mem file is
associated with a single memslot. When HVAs are registered, they can
be associated with offsets into guest_mem files.
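
For instance (a hypothetical sketch; the gmem_file layout is an
assumption, and bind/unbind callers are assumed to serialize against
each other, locking omitted):

/*
 * Track which memslot owns each guest_mem offset so the file can
 * enforce a 1:1 page:memslot association.
 */
struct gmem_file {
	struct xarray bindings; /* file offset -> struct kvm_memory_slot * */
};

static int gmem_bind_range(struct gmem_file *gmem, pgoff_t start,
			   pgoff_t nr_pages, struct kvm_memory_slot *slot)
{
	pgoff_t i;
	void *entry;

	/* Refuse to bind offsets that another memslot already owns. */
	for (i = start; i < start + nr_pages; i++)
		if (xa_load(&gmem->bindings, i))
			return -EEXIST;

	for (i = start; i < start + nr_pages; i++) {
		entry = xa_store(&gmem->bindings, i, slot, GFP_KERNEL);
		if (xa_is_err(entry))
			return xa_err(entry);
	}
	return 0;
}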

>
> - Does not work for memory that isn't backed by 'struct page', e.g. if devices
> gain support for exposing encrypted memory regions to guests.
>
> - Poking into the VMAs to convert memory would likely be less performant due
> to using infrastructure that is much "heavier", e.g. would require taking
> mmap_lock for write.

Converting memory doesn't necessarily need to poke holes into the VMA;
it could instead just unmap the page tables, the same way they are
unmapped when holes are punched in mmap()ed files to free the backing
file offsets.
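
From the userspace side, the analogy looks something like this (a
sketch assuming the guest_mem fd supports hole punching; conversion
would perform only the page-table zap, not the freeing):

#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>

/*
 * Punching a hole both frees the backing offsets and zaps any PTEs
 * mapping them; a shared->private conversion would need only the
 * "zap the page tables" half of this operation.
 */
static void punch_hole(int gmem_fd, off_t offset, off_t len)
{
	if (fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      offset, len))
		err(1, "fallocate(PUNCH_HOLE)");
}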

>
> In short, shoehorning this into mmap() requires fighting how the kernel works at
> pretty much every step, and in the end, adding e.g. fbind() is a lot easier.
>
> > 2) Userspace registers mmaped HVA ranges with KVM with additional
> > KVM_MEM_PRIVATE flag
> > 3) Userspace converts memory attributes and this memory conversion
> > allows userspace to access shared ranges of the file because those are
> > allowed to be faulted in from guest_mem. Shared to private conversion
> > unmaps the file ranges from userspace VMM pagetables.
> > 4) Granularity of userspace pagetable mappings for shared ranges will
> > have to be dictated by KVM guest_mem file implementation.
> >
> > Caveat here is that once private pages are mapped into userspace view.
> >
> > Benefits here:
> > 1) Userspace view remains consistent while still being able to use HVA ranges
> > 2) It would be possible to use HVA based APIs from userspace to do
> > things like binding.
> > 3) Double allocation wouldn't be a concern since hva ranges and gpa
> > ranges possibly map to the same HPA ranges.
>
> #3 isn't entirely correct. If a different process (call it "B") maps shared memory,
> and then the guest converts that memory from shared to private, the backing pages
> for the previously shared mapping will still be mapped by process B unless
> userspace ensures that process B also unmaps on conversion.
>

Ideally, this should be handled by something like unmap_mapping_range().
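
For example (a minimal sketch; the wrapper name is made up), zapping
the converted range out of every process that has the file mapped,
including "process B" above, without freeing the backing memory:

/*
 * Zap the converted range from the page tables of all processes
 * mapping the guest_mem file; the backing pages themselves stay
 * allocated.
 */
static void gmem_zap_user_mappings(struct file *file, pgoff_t start,
				   pgoff_t nr_pages)
{
	unmap_mapping_range(file->f_mapping,
			    (loff_t)start << PAGE_SHIFT,
			    (loff_t)nr_pages << PAGE_SHIFT,
			    /* even_cows */ 1);
}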

> #3 is also a limiter. E.g. if a guest is primarily backed by 1GiB pages, keeping
> the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
> and possibly even if the guest converts a few MiB of memory.

This caveat can maybe be lived with, as shared ranges will most likely
not be backed by 1GiB pages anyway, possibly at the cost of some IO
performance. This probably needs more discussion about the conversion
granularity used by guests.

>
> > > Code is available here if folks want to take a look before any kind of formal
> > > posting:
> > >
> > > https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> > >
> > > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@xxxxxxxxxx
> > > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@xxxxxxxxxxxxxxx