RE: [PATCH 00/10] Handle set_memory_XXcrypted() errors

From: Michael Kelley (LINUX)
Date: Thu Oct 19 2023 - 13:05:45 EST


From: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx> Sent: Tuesday, October 17, 2023 1:25 PM
>
> Shared pages should never return to the page allocator, or future usage of
> the pages may allow for the contents to be exposed to the host. They may
> also cause the guest to crash if the page is used in way disallowed by HW
> (i.e. for executable code or as a page table).
>
> Normally set_memory() call failures are rare. But on TDX
> set_memory_XXcrypted() involves calls to the untrusted VMM, and an attacker
> could fail these calls such that:
> 1. set_memory_encrypted() returns an error and leaves the pages fully
> shared.
> 2. set_memory_decrypted() returns an error, but the pages are actually
> full converted to shared.
>
> This means that patterns like the below can cause problems:
> void *addr = alloc();
> int fail = set_memory_decrypted(addr, 1);
> if (fail)
> free_pages(addr, 0);
>
> And:
> void *addr = alloc();
> int fail = set_memory_decrypted(addr, 1);
> if (fail) {
> set_memory_encrypted(addr, 1);
> free_pages(addr, 0);
> }
>
> Unfortunately these patterns are all over the place. And what the
> set_memory() callers should do in this situation is not clear either. They
> shouldn’t use them as shared because something clearly went wrong, but
> they also need to fully reset the pages to private to free them. But, the
> kernel needs the VMMs help to do this and the VMM is already being
> uncooperative around the needed operations. So this isn't guaranteed to
> succeed and the caller is kind of stuck with unusable pages.
>
> Looking at QEMU/KVM as an example, these VMM converstion failures either
> indicates an attempt to attack the guest, or resource constraints on the
> host. Preventing a DOS attack is out of scope for the coco threat model.
> So this leaves the host resource constraint cause. When similar resource
> constraints are encountered in the host, KVM punts the problem to
> userspace and QEMU terminates the guest. When similar problems are
> detected inside set_memory(), SEV issues a command to terminate the guest.
>
> This all makes it appealing to simply panic (via tdx_panic() call
> which informs the host what is happening) when observing troublesome VMM
> behavior around the memory conversion. It is:
> - Consistent with similar behavior on SEV side.
> - Generally more consistent with how host resource constraints are handled
> (at least in QEMU/KVM)
> - Would be a more foolproof defense against the attack scenario.
>
> Never-the-less, doing so would be an instance of the “crash the kernel for
> security reasons” pattern. This is a big reason, and crashing is not fully
> needed because the unusable pages could just be leaked (as they already
> are in some cases). So instead, this series does a tree-wide search and
> fixes the callers to handle the error by leaking the pages. Going forward
> callers will need to handle the set_memory() errors correctly in order to
> not reintroduce the issue.
>
> I think there are some points for both sides, and we had some internal
> discussion on the right way to handle it. So I've tried to characterize
> both arguments. I'm interested to hear opinions on which is the best.

I'm more in favor of the "simply panic" approach. What you've done
in your Patch 1 and Patch 2 is an intriguing way to try to get the memory
back into a consistent state. But I'm concerned that there are failure
modes that make it less than 100% foolproof (more on that below). If
we can't be sure that the memory is back in a consistent state, then the
original problem isn't fully solved. I'm also not sure of the value of
investing effort to ensure that some errors cases are handled without
panic'ing. The upside benefit of not panic'ing seems small compared to
the downside risk of leaking guest VM data to the host.

My concern about Patches 1 and 2 is that the encryption bit in the PTE
is not a reliable indicator of the state that the host thinks the page is
in. Changing the state requires two steps (in either order): 1) updating
the guest VM PTEs, and 2) updating the host's view of the page state.
Both steps may be done on a range of pages. If #2 fails, the guest
doesn't know which pages in the batch were updated and which were
not, so the guest PTEs may not match the host state. In such a case,
set_memory_encrypted() could succeed based on checking the
PTEs when in fact the host still thinks some of the pages are shared.
Such a mismatch will produce a guest panic later on if the page is
referenced.

As you pointed out, the SEV code, and specifically the SEV-SNP code,
terminates the VM if there is a failure. That's in pvalidate_pages().
You've described your changes as being for TDX, but there's also
the Hyper-V version of handling private <-> shared transitions
which makes use of a paravisor for both SEV-SNP and TDX. The
Hyper-V versions could also fail due to resource constraints in
the paravisor, and so has the same issues as TDX, even when
running on SEV-SNP hardware.

In general, it's hard to anticipate all the failure modes that can occur
currently. Additional failure modes could be added in the future,
and taking into account malicious behavior by the host makes it
even worse. That leads me back to my conclusion that just taking
the panic is best.

>
> I’ve marked the hyperv guest parts in this as RFC, both because I can’t
> test them and I believe Linux TDs can’t run on hyperv yet due to some
> missing support. I would appreciate a correction on this if it’s wrong.

Linux TDs can run on Hyper-V today, though it may require a Hyper-V
version that isn't officially released yet. We have it working internally
at Microsoft and I think at Intel as well. There *are* still a couple of
Linux patches waiting to be accepted upstream to run without a
paravisor. In any case, your concerns about testing are valid -- it's
probably easier for one of us at Microsoft to test the Hyper-V guest
parts if we continue down the path you have proposed.

I've looked through the other patches in the series, and have a few
minor comments on the Hyper-V parts. But I'll hold those pending
an overall conclusion on whether to pursue this approach.

If you send a new version of the series, please include the
linux-hyperv@xxxxxxxxxxxxxxx mailing list as well on all the
patches.

Michael

>
> Rick Edgecombe (10):
> mm: Add helper for freeing decrypted memory
> x86/mm/cpa: Reject incorrect encryption change requests
> kvmclock: Use free_decrypted_pages()
> swiotlb: Use free_decrypted_pages()
> ptp: Use free_decrypted_pages()
> dma: Use free_decrypted_pages()
> hv: Use free_decrypted_pages()
> hv: Track decrypted status in vmbus_gpadl
> hv_nstvsc: Don't free decrypted memory
> uio_hv_generic: Don't free decrypted memory
>
> arch/s390/include/asm/set_memory.h | 1 +
> arch/x86/kernel/kvmclock.c | 2 +-
> arch/x86/mm/pat/set_memory.c | 41 +++++++++++++++++++++++++++++-
> drivers/hv/channel.c | 18 ++++++++-----
> drivers/hv/connection.c | 13 +++++++---
> drivers/net/hyperv/netvsc.c | 7 +++--
> drivers/ptp/ptp_kvm_x86.c | 2 +-
> drivers/uio/uio_hv_generic.c | 12 ++++++---
> include/linux/dma-map-ops.h | 3 ++-
> include/linux/hyperv.h | 1 +
> include/linux/set_memory.h | 13 ++++++++++
> kernel/dma/contiguous.c | 2 +-
> kernel/dma/swiotlb.c | 11 +++++---
> 13 files changed, 101 insertions(+), 25 deletions(-)
>
> --
> 2.34.1