Re: [PATCH 1/3] mm: userfaultfd: add new UFFDIO_SIGBUS ioctl

From: David Hildenbrand
Date: Fri May 19 2023 - 04:39:19 EST


On 18.05.23 22:38, Axel Rasmussen wrote:
On Thu, May 18, 2023 at 9:05 AM Peter Xu <peterx@xxxxxxxxxx> wrote:

On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote:

On Wed, May 17, 2023 at 3:20 PM Peter Xu <peterx@xxxxxxxxxx> wrote:

On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
<axelrasmussen@xxxxxxxxxx> wrote:

So the basic way to use this new feature is:

- On the new host, the guest's memory is registered with userfaultfd, in
either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can
communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
so any future accesses will SIGBUS. Because the pte is now "present",
future accesses won't generate more userfaultfd events, they'll just
SIGBUS directly.

I want to clarify the SIGBUS mechanism here when KVM is involved,
keeping in mind that we need to be able to inject an MCE into the
guest for this to be useful.

1. vCPU gets an EPT violation --> KVM attempts GUP.
2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
3. KVM finds that GUP failed and returns -EFAULT.

This is different than if GUP found poison, in which case KVM will
actually queue up a SIGBUS *containing the address of the fault*, and
userspace can use it to inject an appropriate MCE into the guest. With
UFFDIO_SIGBUS, we are missing the address!

I see three options:
1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
this is pointless.
2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
accesses, just like how we get repeated signals for real poison.
3. Use this in conjunction with the additional KVM EFAULT info that
Anish proposed (the first part of [1]).

I think option 3 is fine. :)

Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)

I just remember Axel mentioned this in the commit message, and just in case
this is why option 4) was ruled out:

They expect that once poisoned, pages can never become
"un-poisoned". So, when we live migrate the VM, we need to preserve
the poisoned status of these pages.

Just to supplement on this point: we do have unpoison (echoing to
"debug/hwpoison/hwpoison_unpoison"), or am I wrong?

If I read unpoison_memory() correctly, once there is a real hardware
memory corruption (hw_memory_failure will be set), unpoison will stop
working and return EOPNOTSUPP.

I know some cloud providers evacuating VMs once a single memory error
happens, so not supporting unpoison is probably not a big deal for
them. BUT others do keep VM running until more errors show up later,
which could be long after the 1st error.

We're talking about postcopy migrating a VM has poisoned page on src,
rather than on dst host, am I right? IOW, the dest hwpoison should be
fake.

If so, then I would assume that's the case where all the pages on the dest
host is still all good (so hw_memory_failure not yet set, or I doubt the
judgement of being a migration target after all)?

The other thing is even if dest host has hw poisoned page, I'm not sure
whether hw_memory_failure is the only way to solve this.

I saw that this is something got worked on before from Zhenwei, David used
to have some reasoning on why it was suggested like using a global knob:

https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@xxxxxxxxxx/

Two major issues here afaics:

- Zhenwei's approach only considered x86 hwpoison - it relies on kpte
having !present in entries but that's x86 specific rather than generic
to memory_failure.c.

- It is _assumed_ that hwpoison injection is for debugging only.

I'm not sure whether you can fix 1) by some other ways, e.g., what if the
host just remember all the hardware poisoned pfns (or remember
soft-poisoned ones, but then here we need to be careful on removing them
from the list when it's hwpoisoned for real)? It sounds like there's
opportunity on providing a generic solution rather than relying on
!pte_present().

For 2) IMHO that's not a big issue, you can declare it'll be used in !debug
but production systems so as to boost the feature importance with a real
use case.

So far I'd say it'll be great to leverage what it's already there in linux
and make it as generic as possible. The only issue is probably
CAP_ADMIN... not sure whether we can have some way to provide !ADMIN
somehow, or you can simply work around this issue.

As you mention below I think the key distinction is the scope - I
think MADV_HWPOISON affects the whole system, including other
processes.

For our purposes, we really just want to "poison" this particular
virtual address (the HVA, from the VM's perspective), not even other
mappings of the same shared memory. I think that behavior is different
from MADV_HWPOISON, at least.

MADV_HWPOISON really is the wrong interface to use. See "man madvise".

We don't want to allow arbitrary users to hwpoison+offline absolutely healthy physical memory, which is what MADV_HWPOISON is all about.

As you say, we want to turn an unpopulated (!present) virtual address to mimic like we had a MCE on a page that would have been previously mapped here: install a hwpoison marker without actually poisoning any present page. In fact, we'd even want to fail if there *is* something mapped.

Sure, one could teach MADV_HWPOISON to allow unprivileged users to do that for !present PTE entries, and fail for unprivileged users if there is a present PTE entry. I'm not sure if that's the cleanest approach, though, and a new MADV as suggested in this thread would eventually be cleaner.

--
Thanks,

David / dhildenb