Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system

From: Paolo Bonzini
Date: Tue Jul 26 2022 - 06:27:16 EST


On 7/26/22 10:33, Andrei Vagin wrote:
We can think about restricting the list of system calls that this hypercall can
execute. In the user-space changes for gVisor, we have a list of system calls
that are not executed via this hypercall. For example, sigprocmask is never
executed by this hypercall, because the kvm vcpu has its own signal mask.
Another example is the ioctl syscall, because it could be one of the kvm
ioctls.
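
As a rough illustration of such a restriction, here is a minimal C sketch of a
denylist check. The names hc_denied_syscalls and hc_syscall_allowed are
hypothetical and do not come from the patch series or from gVisor; only the two
syscalls named above are listed, and the real list is presumably longer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/*
 * Hypothetical denylist of syscalls that must not go through the hypercall.
 * Only the two examples from the discussion are included here.
 */
static const long hc_denied_syscalls[] = {
    SYS_rt_sigprocmask, /* the vcpu thread has its own signal mask */
    SYS_ioctl,          /* could be a KVM ioctl on the VM or vcpu fd */
};

/* Return true if syscall number nr is not on the denylist. */
bool hc_syscall_allowed(long nr)
{
    size_t n = sizeof(hc_denied_syscalls) / sizeof(hc_denied_syscalls[0]);

    for (size_t i = 0; i < n; i++) {
        if (hc_denied_syscalls[i] == nr)
            return false;
    }
    return true;
}

/* Small self-check that exercises the denylist. */
void hc_syscall_allowed_self_check(void)
{
    assert(!hc_syscall_allowed(SYS_ioctl));
    assert(hc_syscall_allowed(SYS_getpid));
}
```

A denylist matches the description above; an allowlist would be the more
conservative choice if the check were enforced on the host side.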

The main issue I have is that the system call addresses are not translated.

On one hand, I understand why it's done like this; it's pretty much impossible to do it without duplicating half of the sentry in the host kernel. And the KVM API you're adding is certainly sensible.

On the other hand this makes the hypercall even more specialized, as it depends on the guest's memslot layout, and not self-sufficient, in the sense that the sandbox isn't secure without prior copying and validation of arguments in guest ring0.
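
To make the validation requirement concrete, here is a minimal C sketch of the
kind of bounds check guest ring0 would need to apply to a pointer argument
before issuing the hypercall. The struct hc_region and the function hc_range_ok
are hypothetical names for illustration only; the sketch assumes a single
contiguous region that the sandbox is allowed to hand to the host, which is a
simplification of a real memslot layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one address range the guest may pass down. */
struct hc_region {
    uint64_t start; /* first valid address */
    uint64_t len;   /* length in bytes */
};

/*
 * Return true if [addr, addr + size) lies entirely inside the region.
 * Written with subtractions only, so addr + size can never wrap around.
 */
bool hc_range_ok(const struct hc_region *r, uint64_t addr, uint64_t size)
{
    if (addr < r->start)
        return false;

    uint64_t off = addr - r->start;

    if (off > r->len)
        return false;

    return size <= r->len - off;
}

/* Small self-check covering an in-range and an out-of-range buffer. */
void hc_range_ok_self_check(void)
{
    struct hc_region r = { .start = 0x1000, .len = 0x1000 };

    assert(hc_range_ok(&r, 0x1000, 0x1000));
    assert(!hc_range_ok(&r, 0x1800, 0x801));
}
```

A check like this only covers the address range; the arguments would also have
to be copied before validation so that another vcpu cannot change them
afterwards.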

== Host Ring3/Guest ring0 mixed mode ==

This is how the gVisor KVM platform works right now. We don’t have a separate
hypervisor; the Sentry performs its functions. The Sentry creates a KVM virtual
machine instance, sets it up, and handles VMEXITs. As a result, the Sentry runs
in both the host ring3 and the guest ring0 and can transparently switch between
these two contexts. In this scheme, the sentry syscall time is 3600ns.
This is for the case when a system call is made from gr0.

The benefit of this approach is that only the first system call triggers a
vmexit, and all subsequent syscalls are executed natively on the host.

But it has downsides:
* Each sentry system call triggers a full exit to hr3.
* Each vmenter/vmexit requires triggering a signal, which is expensive.
* It cannot support Confidential Computing (SEV-ES/SGX). The Sentry
has to be fully enclosed in a VM to be able to support these technologies.

== Execute system calls from a user-space VMM ==

In this case, the Sentry always runs in the VM, and a syscall handler in gr0
triggers a vmexit to transfer control to the VMM (a user process running in
hr3). The VMM executes the required system call and transfers control back to
the Sentry. We can say that it implements the suggested hypercall in
user space.

The sentry syscall time is 2100ns in this case.

The new hypercall does the same but without switching to the host ring 3. It
reduces the sentry syscall time to 1000ns.

Yeah, ~3000 clock cycles is what I would expect.
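
For reference, the conversion behind that estimate can be sketched in a few
lines of C. The 3 GHz clock is an assumption (no CPU frequency is stated in
this thread), and ns_to_cycles is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a duration in nanoseconds to clock cycles for a CPU running at
 * cpu_mhz megahertz: cycles = ns * MHz / 1000.
 */
uint64_t ns_to_cycles(uint64_t ns, uint64_t cpu_mhz)
{
    return ns * cpu_mhz / 1000;
}

/* Self-check: 1000ns at an assumed 3000 MHz is 3000 cycles. */
void ns_to_cycles_self_check(void)
{
    assert(ns_to_cycles(1000, 3000) == 3000);
}
```

Under the same assumption, the 2100ns and 3600ns figures quoted above
correspond to roughly 6300 and 10800 cycles.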

What does it translate to in terms of benchmarks? For example, a simple netperf/UDP_RR benchmark.

Paolo