Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

From: Sean Christopherson
Date: Wed Mar 13 2024 - 14:38:46 EST


On Wed, Mar 13, 2024, Christian König wrote:
> Am 13.03.24 um 16:47 schrieb Sean Christopherson:
> > [SNIP]
> > > It can trivially be that userspace only maps 4KiB of some 2MiB piece of
> > > memory the driver has allocate.
> > >
> > > > I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
> > > > we should instead fix the TTM allocations. And David pointed out that that was
> > > > tried and got NAK'd.
> > > Well as far as I can see Christoph rejects the complexity coming with the
> > > approach of sometimes grabbing the reference and sometimes not.
> > Unless I've wildly misread multiple threads, that is not Christoph's objection.
> > From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%2Fy@xxxxxxxxxxxxx):
> >
> > On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig<hch@xxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> > > > KVM needs to be aware of non-refcounted struct page memory no matter what; see
> > > > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
> > > > non-reference-counted pages").  I don't think it makes any sense whatsoever to
> > > > remove that code and assume every driver in existence will do the right thing.
> > >
> > > Agreed.
> > >
> > > >
> > > > With the cleanups done, playing nice with non-refcounted paged instead of outright
> > > > rejecting them is a wash in terms of lines of code, complexity, and ongoing
> > > > maintenance cost.
> > >
> > > I tend to strongly disagree with that, though.  We can't just let these
> > > non-refcounted pages spread everywhere and instead need to fix their
> > > usage.
>
> And I can only repeat myself that I completely agree with Christoph here.

I am so confused. If you agree with Christoph, why not fix the TTM allocations?

> > > And I have to agree that this is extremely odd.
> > Yes, it's odd and not ideal. But with nested virtualization, KVM _must_ "map"
> > pfns directly into the guest via fields in the control structures that are
> > consumed by hardware. I.e. pfns are exposed to the guest in an "out-of-band"
> > structure that is NOT part of the stage-2 page tables. And wiring those up to
> > the MMU notifiers is extremely difficult for a variety of reasons[*].
> >
> > Because KVM doesn't control which pfns are mapped this way, KVM's compromise is
> > to grab a reference to the struct page while the out-of-band mapping exists, i.e.
> > to pin the page to prevent use-after-free.
>
> Wait a second, as far as I know this approach doesn't work correctly in all
> cases. See here as well: https://lwn.net/Articles/930667/
>
> The MMU notifier is not only to make sure that pages don't go away and
> prevent use after free, but also that PTE updates correctly wait for ongoing
> DMA transfers.
>
> Otherwise quite a bunch of other semantics doesn't work correctly either.
>
> > And KVM's historical ABI is to support
> > any refcounted page for these out-of-band mappings, regardless of whether the
> > page was obtained by gup() or follow_pte().
>
> Well see the discussion from last year the LWN article summarizes.

Oof. I suspect that in practice, no one has observed issues because the pages
in question are heavily used by the guest and don't get evicted from the page cache.

> I'm not an expert for KVM but as far as I can see what you guys do here is
> illegal and can lead to corruption.
>
> Basically you can't create a second channel to modify page content like
> that.

Well, KVM can, it just needs to honor mmu_notifier events in order to be 100%
robust.

That said, while this might be motivation to revisit tying the out-of-band mappings
to the mmu_notifier, it has no bearing on the outcome of Christoph's objection.
Because (a) AIUI, Christoph is objecting to mapping non-refcounted struct page
memory *anywhere*, and (b) in this series, KVM will explicitly disallow non-
refcounted pages from being mapped in the out-of-band structures (see below).

> > Thus, to support non-refouncted VM_PFNMAP pages without breaking existing userspace,
> > KVM resorts to conditionally grabbing references and disllowing non-refcounted
> > pages from being inserted into the out-of-band mappings.
>
> Why not only refcount the pages when out of band mappings are requested?

That's also what this series effectively does. By default, KVM will disallow
inserting *any* non-refcounted memory into the out-of-band mappings.

"By default" because there are use cases where the guest memory pool is hidden
from the kernel at boot, and is fully managed by userspace. I.e. userspace is
effectively responsible/trusted to not free/change the mappings for an active
guest.