Re: [PATCH v7 1/2] drivers/misc: sysgenid: add system generation id driver

From: Alexander Graf
Date: Wed Feb 24 2021 - 18:24:04 EST




On 24.02.21 23:41, Michael S. Tsirkin wrote:

On Wed, Feb 24, 2021 at 02:45:03PM +0100, Alexander Graf wrote:
The above should try harder to explain which things need to be
scrubbed and why. For example, I personally don't really know what is
the OpenSSL session token example and what makes it vulnerable. I guess
snapshots can attack each other?




Here's a simple example of a workflow that submits transactions
to a database and wants to avoid duplicate transactions.
This does not require overseer magic. It does however require
a correct genid from hypervisor, so no mmap tricks work.



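int genid, oldgenid;
read(&genid);                      /* pseudocode: fetch current genid */
start:
oldgenid = genid;
transid = submit_transaction();    /* pseudocode */
read(&genid);
if (genid != oldgenid) {
	/* generation changed mid-flight: possible duplicate */
	revert_transaction(transid);
	goto start;
}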

I'm not sure I fully follow. For starters, if this is a VM local database, I
don't think you'd care about the genid. If it's a remote database, your
connection would get dropped already at the point when you clone/resume,
because TCP and your connection state machine will get really confused when
you suddenly have a different IP address or two consumers of the same stream
:).

But for the sake of the argument, let's assume you can have a connectionless
database connection that maintains its own connection uniqueness logic.

Right. E.g. not uncommon with REST APIs. They survive disconnect easily
and use cookies or such.

That
database connector would need to understand how to abort the connection (and
thus the transaction!) when the generation changes.

The point is that instead of all that, you discover that the transaction
is a duplicate and revert it.


And that's logic you
would do with the read/write/notify mechanism. So your main loop would check
for reads on the genid fd and after sending a connection termination, notify
the overlord that it's safe to use the VM now.

The OpenSSL case (with mmap) is for libraries that are stateless and cannot
guarantee that they receive a genid notification event in a timely manner.

Since you asked, this is mainly important for the PRNG. Imagine an https
server. You create a snapshot. You resume from that snapshot. OpenSSL is
fully initialized with a user space PRNG randomness pool that it considers
safe to consume. However, that means your first connection after resume will
be 100% predictable randomness wise.

I wonder whether something similar is possible here. I.e. use the secret
to encrypt stuff but check the gen ID before actually sending data.
If it changed re-encrypt. Hmm?

I don't see why you would though. Once you control the application level, just use the event based API. That's the much easier one to use. The mmap one is really just there to cover cases where you don't own the main event loop, but can't afford the syscall overhead on every invocation to check if the genid changed.



The mmap mechanism allows the PRNG to reseed after a genid change. Because
we don't have an event mechanism for this code path, that can happen minutes
after the resume. But that's ok, we "just" have to ensure that nobody is
consuming secret data at the point of the snapshot.
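
To make the mmap case concrete, a library-side check might be sketched as
follows. None of this is from the patch: struct guarded_prng, prng_next()
and demo_seed() are hypothetical names, the counter is assumed to be 32 bits
wide, and `genid` is assumed to point into the mmap()ed genid page (any
uint32_t works for trying it out). xorshift64 is only a stand-in generator
and not a CSPRNG; a real fresh_seed would pull entropy from the kernel, for
example via getrandom(2).

```c
#include <stdint.h>

struct guarded_prng {
	const volatile uint32_t *genid;	/* assumed: mmap()ed generation counter */
	uint32_t seen_genid;		/* generation the pool was seeded under */
	uint64_t state;			/* generator state, must be non-zero */
	uint64_t (*fresh_seed)(void);	/* hypothetical entropy source */
	unsigned reseeds;		/* how often a generation change was seen */
};

/* Demo-only seed source: distinct non-zero value per call, NOT random. */
static uint64_t demo_seed(void)
{
	static uint64_t calls;
	return 0x9E3779B97F4A7C15ULL * ++calls;
}

static uint64_t prng_next(struct guarded_prng *p)
{
	uint32_t now = *p->genid;	/* plain memory load, no syscall */

	if (now != p->seen_genid) {
		/* Resumed from a snapshot: the old pool may be shared. */
		p->state = p->fresh_seed();
		if (p->state == 0)
			p->state = 1;	/* xorshift must not start at zero */
		p->seen_genid = now;
		p->reseeds++;
	}

	/* xorshift64 step */
	p->state ^= p->state << 13;
	p->state ^= p->state >> 7;
	p->state ^= p->state << 17;
	return p->state;
}
```

Note that this only narrows the window. A snapshot taken between the load
and the use of the output is still not caught, which is why nobody should be
consuming secret data at the point of the snapshot.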


Something I am still not clear on is whether it's really important to
skip the system call here. If not I think it's prudent to just stick
to read for now, I think there's a slightly lower chance that
it will get misused. mmap which gives you a laggy gen id value
really seems like it would be hard to use correctly.

The read is not any less racy than the mmap. The real "safety" of the read interface comes from the acknowledge path. And that path requires you to be part of the event loop.









+Simplifying assumption - safety prerequisite
+--------------------------------------------
+
+**Control the snapshot flow**, disallow snapshots coming at arbitrary
+moments in the workload lifetime.
+
+Use a system-level overseer entity that quiesces the system before
+snapshot, and post-snapshot-resume oversees that software components
+have readjusted to new environment, to the new generation. Only after,
+will the overseer un-quiesce the system and allow active workloads.
+
+Software components can choose whether they want to be tracked and
+waited on by the overseer by using the ``SYSGENID_SET_WATCHER_TRACKING``
+IOCTL.
+
+The sysgenid framework standardizes the API for system software to
+find out about needing to readjust and at the same time provides a
+mechanism for the overseer entity to wait for everyone to be done, the
+system to have readjusted, so it can un-quiesce.
+
+Example snapshot-safe workflow
+------------------------------
+
+1) Before taking a snapshot, quiesce the VM/container/system. Exactly
+ how this is achieved is very workload-specific, but the general
+ description is to get all software to an expected state where their
+ event loops dry up and they are effectively quiesced.

If you have the ability to do this by communicating with
all processes, e.g. through a unix domain socket,
why do you need the rest of the stuff in the kernel?
Quiescing is a harder problem than waking up.

That depends. Think of a typical VM workload. Let's take the web server
example again. You can preboot the full VM and snapshot it as is. As long as
you don't allow any incoming connections, you can guarantee that the system
is "quiesced" well enough for the snapshot.

Well you can use a firewall or such to block incoming packets,
but I am not at all sure that means e.g. all socket buffers
are empty.

If it's a fresh VM that only started the web server and did nothing else, there shouldn't be anything in its socket buffers :).

I agree that it won't allow us to cover 100% of all cases automatically and seamlessly. I can't think of any solution that does - if you can think of something I'm all ears. But this API at least gives us a path to slowly move the ecosystem to a point where applications and libraries can enable themselves to become VM/container clone aware. Today we don't even give them the opportunity to self-adjust.


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879