Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages

From: Christian König
Date: Thu Mar 14 2024 - 05:20:43 EST


Sending this out once more since the AMD email servers have converted it to HTML mail again :(

Grrr,
Christian.

Am 14.03.24 um 10:18 schrieb Christian König:
Am 13.03.24 um 18:26 schrieb Sean Christopherson:
On Wed, Mar 13, 2024, Christian König wrote:
Am 13.03.24 um 16:47 schrieb Sean Christopherson:
[SNIP]
It can trivially be that userspace only maps 4KiB of some 2MiB piece of
memory the driver has allocated.

I.e. Christoph is (implicitly) saying that instead of modifying KVM to play nice,
we should instead fix the TTM allocations. And David pointed out that that was
tried and got NAK'd.
Well as far as I can see Christoph rejects the complexity coming with the
approach of sometimes grabbing the reference and sometimes not.
Unless I've wildly misread multiple threads, that is not Christoph's objection.
From v9 (https://lore.kernel.org/all/ZRpiXsm7X6BFAU%2Fy@xxxxxxxxxxxxx):

On Sun, Oct 1, 2023 at 11:25 PM Christoph Hellwig<hch@xxxxxxxxxxxxx> wrote:
>
> On Fri, Sep 29, 2023 at 09:06:34AM -0700, Sean Christopherson wrote:
> > KVM needs to be aware of non-refcounted struct page memory no matter what; see
> > CVE-2021-22543 and, commit f8be156be163 ("KVM: do not allow mapping valid but
> > non-reference-counted pages").  I don't think it makes any sense whatsoever to
> > remove that code and assume every driver in existence will do the right thing.
>
> Agreed.
>
> >
> > With the cleanups done, playing nice with non-refcounted pages instead of outright
> > rejecting them is a wash in terms of lines of code, complexity, and ongoing
> > maintenance cost.
>
> I tend to strongly disagree with that, though.  We can't just let these
> non-refcounted pages spread everywhere and instead need to fix their
> usage.
And I can only repeat myself that I completely agree with Christoph here.
I am so confused. If you agree with Christoph, why not fix the TTM allocations?

Because the TTM allocation isn't broken in any way.

See in some configurations TTM even uses the DMA API for those allocations and that is actually something Christoph coded.

What Christoph is really pointing out is that absolutely nobody should put non-refcounted pages into a VMA, but again this isn't something TTM does. What TTM does instead is work with the PFN and put that into the VMA.

It's just that then KVM comes along and converts the PFN back into a struct page again and that is essentially what causes all the problems, including CVE-2021-22543.

And I have to agree that this is extremely odd.
Yes, it's odd and not ideal. But with nested virtualization, KVM _must_ "map"
pfns directly into the guest via fields in the control structures that are
consumed by hardware. I.e. pfns are exposed to the guest in an "out-of-band"
structure that is NOT part of the stage-2 page tables. And wiring those up to
the MMU notifiers is extremely difficult for a variety of reasons[*].

Because KVM doesn't control which pfns are mapped this way, KVM's compromise is
to grab a reference to the struct page while the out-of-band mapping exists, i.e.
to pin the page to prevent use-after-free.
Wait a second, as far as I know this approach doesn't work correctly in all
cases. See here as well: https://lwn.net/Articles/930667/

The MMU notifier is not only there to make sure that pages don't go away
and to prevent use-after-free, but also to make sure that PTE updates
correctly wait for ongoing DMA transfers.

Otherwise quite a bunch of other semantics don't work correctly either.

And KVM's historical ABI is to support
any refcounted page for these out-of-band mappings, regardless of whether the
page was obtained by gup() or follow_pte().
Well see the discussion from last year the LWN article summarizes.
Oof. I suspect that in practice, no one has observed issues because the pages
in question are heavily used by the guest and don't get evicted from the page cache.

Well in this case I strongly suggest to block the problematic cases.

It just sounds like KVM never ran into problems because nobody is doing any of those problematic cases. If that's true it should be doable to change the UAPI and say this specific combination is forbidden because it could result in data corruption.

I'm not an expert for KVM but as far as I can see what you guys do here is
illegal and can lead to corruption.

Basically you can't create a second channel to modify page content like
that.
Well, KVM can, it just needs to honor mmu_notifier events in order to be 100%
robust.

Yeah, completely agree.

That said, while this might be motivation to revisit tying the out-of-band mappings
to the mmu_notifier, it has no bearing on the outcome of Christoph's objection.
Because (a) AIUI, Christoph is objecting to mapping non-refcounted struct page
memory *anywhere*, and (b) in this series, KVM will explicitly disallow non-
refcounted pages from being mapped in the out-of-band structures (see below).

Thus, to support non-refcounted VM_PFNMAP pages without breaking existing userspace,
KVM resorts to conditionally grabbing references and disallowing non-refcounted
pages from being inserted into the out-of-band mappings.
Why not only refcount the pages when out of band mappings are requested?
That's also what this series effectively does. By default, KVM will disallow
inserting *any* non-refcounted memory into the out-of-band mappings.

Ok, then that's basically where my confusion comes from. I can't really criticize the KVM patch because I don't really understand all the details.

I'm only pointing out from a very high level view how memory management works and that I see some conflict with what KVM does.

As far as I can tell the cleanest approach for KVM would be to have two completely separate handling paths:

1. Using GUP to handle the out-of-band mappings: this path grabs references and honors all the GUP restrictions via the appropriate flags. When VM_PFNMAP is set, those mappings are rejected. That should also avoid any trouble with file-backed mappings.

2. Using follow_pte() plus an MMU notifier; this path can even handle VMAs with the VM_PFNMAP and VM_IO flags set.

"By default" because there are use cases where the guest memory pool is hidden
from the kernel at boot, and is fully managed by userspace. I.e. userspace is
effectively responsible/trusted to not free/change the mappings for an active
guest.

Wow, could that potentially crash the kernel?

Cheers,
Christian.