Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

From: Kirill A. Shutemov
Date: Mon May 31 2021 - 16:07:14 EST


On Wed, May 26, 2021 at 07:46:52PM +0000, Sean Christopherson wrote:
> On Fri, May 21, 2021, Kirill A. Shutemov wrote:
> > Hi Sean,
> >
> > The core patch of the approach we've discussed before is below. It
> > introduce a new page type with the required semantics.
> >
> > The full patchset can be found here:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git kvm-unmapped-guest-only
> >
> > but only the patch below is relevant for TDX. QEMU patch is attached.
>
> Can you post the whole series?

I hoped to get it posted as part of TDX host enabling.

As it is the feature is incomplete for pure KVM. I didn't implement on KVM
side checks that provided by TDX module/hardware, so nothing prevents the
same page to be added to multiple KVM instances.

> The KVM behavior and usage of FOLL_GUEST is very relevant to TDX.

The patch can be found here:

https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git/commit/?h=kvm-unmapped-guest-only&id=2cd6c2c20528696a46a2a59383ca81638bf856b5

> > CONFIG_HAVE_KVM_PROTECTED_MEMORY has to be changed to what is appropriate
> > for TDX and FOLL_GUEST has to be used in hva_to_pfn_slow() when running
> > TDX guest.
>
> This behavior in particular is relevant; KVM should provide FOLL_GUEST iff the
> access is private or the VM type doesn't differentiate between private and
> shared.

I added FOL_GUEST if the KVM instance has the feature enabled.

On top of that TDX-specific code has to check that the page is in fact
PageGuest() before inserting it into private SEPT.

The scheme makes sure that user-accessible memory cannot be not added as
private to TD.

> > When page get inserted into private sept we must make sure it is
> > PageGuest() or SIGBUS otherwise.
>
> More KVM feedback :-)
>
> Ideally, KVM will synchronously exit to userspace with detailed information on
> the bad behavior, not do SIGBUS. Hopefully that infrastructure will be in place
> sooner than later.
>
> https://lkml.kernel.org/r/YKxJLcg/WomPE422@xxxxxxxxxx

My experiments are still v5.11, but I can rebase to whatever needed once
the infrastructure hits upstream.

> > Inserting PageGuest() into shared is fine, but the page will not be accessible
> > from userspace.
>
> Even if it can be functionally fine, I don't think we want to allow KVM to map
> PageGuest() as shared memory. The only reason to map memory shared is to share
> it with something, e.g. the host, that doesn't have access to private memory, so
> I can't envision a use case.
>
> On the KVM side, it's trivially easy to omit FOLL_GUEST for shared memory, while
> always passing FOLL_GUEST would require manually zapping. Manual zapping isn't
> a big deal, but I do think it can be avoided if userspace must either remap the
> hva or define a new KVM memslot (new gpa->hva), both of which will automatically
> zap any existing translations.
>
> Aha, thought of a concrete problem. If KVM maps PageGuest() into shared memory,
> then KVM must ensure that the page is not mapped private via a different hva/gpa,
> and is not mapped _any_ other guest because the TDX-Module's 1:1 PFN:TD+GPA
> enforcement only applies to private memory. The explicit "VM_WRITE | VM_SHARED"
> requirement below makes me think this wouldn't be prevented.

Hm. I didn't realize that TDX module doesn't prevent the same page to be
used as shared and private at the same time.

Omitting FOLL_GUEST for shared memory doesn't look like a right approach.
IIUC, it would require the kernel to track what memory is share and what
private, which defeat the purpose of the rework. I would rather enforce
!PageGuest() when share SEPT is populated in addition to enforcing
PageGuest() fro private SEPT.

Do you see any problems with this?

> Oh, and the other nicety is that I think it would avoid having to explicitly
> handle PageGuest() memory that is being accessed from kernel/KVM, i.e. if all
> memory exposed to KVM must be !PageGuest(), then it is also eligible for
> copy_{to,from}_user().

copy_{to,from}_user() enforce by setting PTE entries to PROT_NONE.
Or do I miss your point?

>
> > Any feedback is welcome.
> >
> > -------------------------------8<-------------------------------------------
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
> > Date: Fri, 16 Apr 2021 01:30:48 +0300
> > Subject: [PATCH] mm: Introduce guest-only pages
> >
> > PageGuest() pages are only allowed to be used as guest memory. Userspace
> > is not allowed read from or write to such pages.
> >
> > On page fault, PageGuest() pages produce PROT_NONE page table entries.
> > Read or write there will trigger SIGBUS. Access to such pages via
> > syscall leads to -EIO.
> >
> > The new mprotect(2) flag PROT_GUEST translates to VM_GUEST. Any page
> > fault to VM_GUEST VMA produces PageGuest() page.
> >
> > Only shared tmpfs/shmem mappings are supported.
>
> Is limiting this to tmpfs/shmem only for the PoC/RFC, or is it also expected to
> be the long-term behavior?

I expect it to be enough to cover all relevant cases, no?

Note that MAP_ANONYMOUS|MAP_SHARED also fits here.

--
Kirill A. Shutemov