Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

From: Sean Christopherson
Date: Wed Mar 13 2024 - 11:48:01 EST


On Wed, Mar 13, 2024, Christian König wrote:
> Am 13.03.24 um 15:48 schrieb Sean Christopherson:
> > On Wed, Mar 13, 2024, Christian König wrote:
> > > Am 13.03.24 um 14:34 schrieb Sean Christopherson:
> > > > What Christoph is objecting to is that, in this series, KVM is explicitly adding
> > > > support for mapping non-compound (huge)pages into KVM guests. David is arguing
> > > > that Christoph's objection to _KVM_ adding support is unfair, because the real
> > > > problem is that the kernel already maps such pages into host userspace. I.e. if
> > > > the userspace mapping ceases to exist, then there are no mappings for KVM to follow
> > > > and propagate to KVM's stage-2 page tables.
> > > And I have to agree with Christoph that this doesn't make much sense. KVM
> > > should *never* map (huge) pages from VMAs marked with VM_PFNMAP into KVM
> > > guests in the first place.
> > >
> > > What it should do instead is to mirror the PFN from the host page tables
> > > into the guest page tables.
> > That's exactly what this series does. Christoph is objecting to KVM playing nice
> > with non-compound hugepages, as he feels that such mappings should not exist
> > *anywhere*.
>
> Well, Christoph is right: those mappings shouldn't exist, and they also don't
> exist.
>
> What happens here is that a driver has allocated some contiguous memory to
> do DMA with. And then some page table is pointing to a PFN inside that
> memory because userspace needs to provide parameters for the DMA transfer.
>
> This is *not* a mapping of a non-compound hugepage, it's simply a PTE
> pointing to some PFN.

Yes, I know. And David knows. By "such mappings" I did not mean "huge PMD mappings
that point at non-compound pages", I meant "any mapping in the host userspace
VMAs and page tables that points at memory that is backed by a larger-than-order-0,
non-compound allocation".

And even then, the whole larger-than-order-0 mapping is not something we on the
KVM side care about, at all. The _only_ new thing KVM is trying to do in this
series is to allow mapping non-refcounted struct page memory into KVM guests.
Those details were brought up purely because they provide context on how/why such
non-refcounted pages exist.

> It can trivially be that userspace only maps 4KiB of some 2MiB piece of
> memory the driver has allocated.
>
> > I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
> > we should instead fix the TTM allocations. And David pointed out that that was
> > tried and got NAK'd.
>
> Well as far as I can see Christoph rejects the complexity coming with the
> approach of sometimes grabbing the reference and sometimes not.

Unless I've wildly misread multiple threads, that is not Christoph's objection.
From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%2Fy@xxxxxxxxxxxxx):

On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
>
> On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> > KVM needs to be aware of non-refcounted struct page memory no matter what; see
> > CVE-2021-22543 and commit f8be156be163 ("KVM: do not allow mapping valid but
> > non-reference-counted pages").  I don't think it makes any sense whatsoever to
> > remove that code and assume every driver in existence will do the right thing.
>
> Agreed.
>
> >
> > With the cleanups done, playing nice with non-refcounted pages instead of outright
> > rejecting them is a wash in terms of lines of code, complexity, and ongoing
> > maintenance cost.
>
> I tend to strongly disagree with that, though.  We can't just let these
> non-refcounted pages spread everywhere and instead need to fix their
> usage.

> And I have to agree that this is extremely odd.

Yes, it's odd and not ideal. But with nested virtualization, KVM _must_ "map"
pfns directly into the guest via fields in the control structures that are
consumed by hardware. I.e. pfns are exposed to the guest in an "out-of-band"
structure that is NOT part of the stage-2 page tables. And wiring those up to
the MMU notifiers is extremely difficult for a variety of reasons[*].

Because KVM doesn't control which pfns are mapped this way, KVM's compromise is
to grab a reference to the struct page while the out-of-band mapping exists, i.e.
to pin the page to prevent use-after-free. And KVM's historical ABI is to support
any refcounted page for these out-of-band mappings, regardless of whether the
page was obtained by gup() or follow_pte().

Thus, to support non-refcounted VM_PFNMAP pages without breaking existing userspace,
KVM resorts to conditionally grabbing references and disallowing non-refcounted
pages from being inserted into the out-of-band mappings.
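
As a rough illustration of the policy described above, here is a user-space toy
model. It is not code from this series: struct toy_page, enum toy_map_kind,
toy_map_page() and toy_unmap_page() are all made-up names, and the stage-2 case
is simplified on the assumption that MMU notifiers tear the mapping down, so no
long-term pin is modelled there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; NOT the kernel's definition. */
struct toy_page {
	bool refcounted;	/* false models a page whose refcount must not be touched */
	int refcount;		/* only meaningful when refcounted is true */
};

/* Hypothetical mapping kinds, named after the two cases in this mail. */
enum toy_map_kind {
	TOY_MAP_STAGE2,		/* assumed to be invalidated via MMU notifiers */
	TOY_MAP_OUT_OF_BAND,	/* e.g. a pfn written into a control structure */
};

/*
 * Returns true if the mapping is allowed.  *pinned reports whether a
 * reference was taken that toy_unmap_page() must later drop.
 */
static bool toy_map_page(struct toy_page *page, enum toy_map_kind kind,
			 bool *pinned)
{
	*pinned = false;

	/* Stage-2: allowed either way, no long-term pin in this model. */
	if (kind == TOY_MAP_STAGE2)
		return true;

	/* Out-of-band: nothing invalidates the mapping, so a pin is required. */
	if (!page->refcounted)
		return false;

	page->refcount++;
	*pinned = true;
	return true;
}

/* Drops the reference taken by toy_map_page(), if there was one. */
static void toy_unmap_page(struct toy_page *page, bool pinned)
{
	if (!pinned)
		return;

	assert(page->refcounted && page->refcount > 0);
	page->refcount--;
}
```

In this model a refcounted page can back either kind of mapping and is pinned
only for the out-of-band case, while a non-refcounted page is accepted for
stage-2 and rejected for out-of-band use.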

But again, I don't think these details are relevant to Christoph's objection.

[*] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@xxxxxxxxxx