Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

From: Alexander Graf
Date: Tue Oct 20 2020 - 06:01:07 EST




On 19.10.20 19:15, Mathieu Desnoyers wrote:


----- On Oct 17, 2020, at 2:10 PM, Andy Lutomirski luto@xxxxxxxxxx wrote:

On Fri, Oct 16, 2020 at 6:40 PM Jann Horn <jannh@xxxxxxxxxx> wrote:

[adding some more people who are interested in RNG stuff: Andy, Jason,
Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this
concerns some pretty fundamental API stuff related to RNG usage]

On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin
<acatan@xxxxxxxxxx> wrote:
- Background

The VM Generation ID is a feature defined by Microsoft (paper:
http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
multiple hypervisor vendors.

The feature is required in virtualized environments by apps that work
with local copies/caches of world-unique data such as random values,
uuids, monotonically increasing counters, etc.
Such apps can be negatively affected by VM snapshotting when the VM
is either cloned or returned to an earlier point in time.

The VM Generation ID is a simple concept meant to alleviate the issue
by providing a unique ID that changes each time the VM is restored
from a snapshot. The hw provided UUID value can be used to
differentiate between VMs or different generations of the same VM.

- Problem

The VM Generation ID is exposed through an ACPI device by multiple
hypervisor vendors but neither the vendors or upstream Linux have no
default driver for it leaving users to fend for themselves.

Furthermore, simply finding out about a VM generation change is only
the starting point of a process to renew internal states of possibly
multiple applications across the system. This process could benefit
from a driver that provides an interface through which orchestration
can be easily done.

- Solution

This patch is a driver which exposes the Virtual Machine Generation ID
via a char-dev FS interface that provides ID update sync and async
notification, retrieval and confirmation mechanisms:

When the device is 'open()'ed a copy of the current vm UUID is
associated with the file handle. 'read()' operations block until the
associated UUID is no longer up to date - until HW vm gen id changes -
at which point the new UUID is provided/returned. Nonblocking 'read()'
uses EWOULDBLOCK to signal that there is no _new_ UUID available.

'poll()' is implemented to allow polling for UUID updates. Such
updates result in 'EPOLLIN' events.

Subsequent read()s following a UUID update no longer block, but return
the updated UUID. The application needs to acknowledge the UUID update
by confirming it through a 'write()'.
Only on writing back to the driver the right/latest UUID, will the
driver mark this "watcher" as up to date and remove EPOLLIN status.

'mmap()' support allows mapping a single read-only shared page which
will always contain the latest UUID value at offset 0.

It would be nicer if that page just contained an incrementing counter,
instead of a UUID. It's not like the application cares *what* the UUID
changed to, just that it *did* change and all RNGs state now needs to
be reseeded from the kernel, right? And an application can't reliably
read the entire UUID from the memory mapping anyway, because the VM
might be forked in the middle.

So I think your kernel driver should detect UUID changes and then turn
those into a monotonically incrementing counter. (Probably 64 bits
wide?) (That's probably also a little bit faster than comparing an
entire UUID.)

An option might be to put that counter into the vDSO, instead of a
separate VMA; but I don't know how the other folks feel about that.
Andy, do you have opinions on this? That way, normal userspace code
that uses this infrastructure wouldn't have to mess around with a
special device at all. And it'd be usable in seccomp sandboxes and so
on without needing special plumbing. And libraries wouldn't have to
call open() and mess with file descriptor numbers.

The vDSO might be annoyingly slow for this. Something like the rseq
page might make sense. It could be a generic indication of "system
went through some form of suspend".

This might indeed fit nicely as an extension of my KTLS prototype (extensible rseq):

https://lore.kernel.org/lkml/20200925181518.4141-1-mathieu.desnoyers@xxxxxxxxxxxx/

There are a few ways we could wire things up. One might be to add the
UUID field into the extended KTLS structure (so it's always updated after it
changes on next return to user-space). For this I assume that the Linux scheduler

I think one that that became apparent in the discussion in this thread was that we want a Linux internal generation counter rather than expose the UUID verbatim. That way, we don't give away potential secrets to user space and we can support other architectures more easily.

within the guest VM always preempts all threads before a VM is suspended (is that
indeed true ?).

The VM does not know that it gets snapshotted. It only knows that it gets resumed (through this interface).

This leads to one important question though: how is the UUID check vs commit operation
made atomic with respect to suspend ? Unless we use rseq critical sections in assembly,
where the kernel will abort the rseq critical section on preemption, I don't see how we
can ensure that the UUID value does not change right after it has been checked, before
the "commit" side-effect. And what is the expected "commit" side-effect ? Is it a store
to a variable in user-space memory, or is it issuing a system call which sends a packet over
the network ?

I think the easiest answer I could come up with here would be "make it a u32". Then you can just access it atomically anywhere, no?

The burden on user space with such an interface is still pretty high though. All user space that wants to do a "transaction" based on secrets would now need to read the generation ID at the beginning of the transaction and double check whether it's still the same at the end of it (e.g. before sending out a network packet based on a key derived from randomness?).


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879