Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

From: Gavin Shan
Date: Fri Aug 26 2022 - 02:05:52 EST


Hi Marc,

On 8/25/22 6:57 AM, Marc Zyngier wrote:
On Wed, 24 Aug 2022 17:21:50 +0100,
Peter Xu <peterx@xxxxxxxxxx> wrote:

On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
On Wed, 24 Aug 2022 00:19:04 +0100,
Peter Xu <peterx@xxxxxxxxxx> wrote:

On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
Atomicity doesn't guarantee ordering, unfortunately.

Right, sorry to be misleading. The "atomicity" part I was trying to say
the kernel will always see consistent update on the fields.

The ordering should also be guaranteed, because things must happen with
below sequence:

(1) kernel publish dirty GFN data (slot, offset)
(2) kernel publish dirty GFN flag (set to DIRTY)
(3) user sees DIRTY, collects (slots, offset)
(4) user sets it to RESET
(5) kernel reads RESET

Maybe. Maybe not. The reset could well be sitting in the CPU write
buffer for as long as it wants and not be seen by the kernel if the
read occurs on another CPU. And that's the crucial bit: single-CPU is
fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
on collection, and global on reset (this seems like a bad decision,
but it is too late to fix this).

Regarding the last statement, that's something I had question too and
discussed with Paolo, even though at that time it's not a outcome of
discussing memory ordering issues.

IIUC the initial design was trying to avoid tlb flush flood when vcpu
number is large (each RESET per ring even for one page will need all vcpus
to flush, so O(N^2) flushing needed). With global RESET it's O(N). So
it's kind of a trade-off, and indeed until now I'm not sure which one is
better. E.g., with per-ring reset, we can have locality too in userspace,
e.g. the vcpu thread might be able to recycle without holding global locks.

I don't get that. On x86, each CPU must perform the TLB invalidation
(there is an IPI for that). So whether you do a per-CPU scan of the
ring or a global scan is irrelevant: each entry you find in any of the
rings must result in a global invalidation, since you've updated the
PTE to make the page RO.

The same is true on ARM, except that the broadcast is done in HW
instead of being tracked in SW.

Buy anyway, this is all moot. The API is what it is, and it isn't
going to change any time soon. All we can do is add some
clarifications to the API for the more relaxed architectures, and make
sure the kernel behaves accordingly.

[...]

It may be safe, but it isn't what the userspace API promises.

The document says:

After processing one or more entries in the ring buffer, userspace calls
the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that
the kernel will reprotect those collected GFNs. Therefore, the ioctl
must be called *before* reading the content of the dirty pages.

I'd say it's not an explicit promise, but I think I agree with you that at
least it's unclear on the behavior.

This is the least problematic part of the documentation. The bit I
literally choke on is this:

<quote>
It's not necessary for userspace to harvest the all dirty GFNs at once.
However it must collect the dirty GFNs in sequence, i.e., the userspace
program cannot skip one dirty GFN to collect the one next to it.
</quote>

This is the core of the issue. Without ordering rules, the consumer on
the other side cannot observe the updates correctly, even if userspace
follows the above to the letter. Of course, the kernel itself must do
the right thing (I guess it amounts to the kernel doing a
load-acquire, and userspace doing a store-release -- effectively
emulating x86...).

Since we have a global recycle mechanism, most likely the app (e.g. current
qemu impl) will use the same thread to collect/reset dirty GFNs, and
trigger the RESET ioctl(). In that case it's safe, IIUC, because no
cross-core ops.

QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):

if (total) {
ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
assert(ret == total);
}

I think the assert() should never trigger as mentioned above. But ideally
maybe it should just be a loop until cleared gfns match total.

Right. If userspace calls the ioctl on every vcpu, then things should
work correctly. It is only that the overhead is higher than what it
should be if multiple vcpus perform a reset at the same time.


In other words, without further straightening of the API, this doesn't
work as expected on relaxed memory architectures. So before this gets
enabled on arm64, this whole ordering issue must be addressed.

How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the
possibility of recycling partial of the pages, especially when collection
and the ioctl() aren't done from the same thread?

I'd rather tell people about the ordering rules. That will come at
zero cost on x86.

Any suggestions will be greatly welcomed.

I'll write a couple of patch when I get the time, most likely next
week. Gavin will hopefully be able to take them as part of his series.


Thanks, Marc. Please let me know where I can check out the patches
when they're ready. I can include the patches into this series in
next revision :)

Thanks,
Gavin