Re: [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed

From: Sean Christopherson
Date: Fri Jan 26 2024 - 12:20:57 EST


On Fri, Jan 26, 2024, Vitaly Kuznetsov wrote:
> > +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> > +{
> > + /*
> > + * The async #PF is "done", but KVM must wait for the work item itself,
> > + * i.e. async_pf_execute(), to run to completion. If KVM is a module,
> > + * KVM must ensure *no* code owned by the KVM (the module) can be run
> > + * after the last call to module_put(), i.e. after the last reference
> > + * to the last vCPU's file is put.
> > + *
>
> Do I understand correctly that the problem is also present on the
> "normal" path, i.e.:
>
> KVM_REQ_APF_READY
> kvm_check_async_pf_completion()
> kmem_cache_free(,work)
>
> on one CPU can actually finish _before_ work is fully flushed on the
> other (async_pf_execute() has already added an item to 'done' list but
> hasn't completed)? Is it just the fact that the window of opportunity
> to get the freed item re-purposed is so short that no real issue was
> ever noticed?

Sort of? It's not a problem with accessing a freed object; the issue is that
async_pf_execute() can still be executing while KVM-the-module is unloaded.

The reason the "normal" path is problematic is that it removes the
work entry from async_pf.done and from async_pf.queue, i.e. makes it invisible
to kvm_arch_destroy_vm(). So to hit the bug where the "normal" path processes
the completed work, userspace would need to terminate the VM (i.e. exit() or
close all fds), and KVM would need to completely tear down the VM, all before
async_pf_execute() finished its last few instructions.

Which is possible, just comically unlikely :-)
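
To make it concrete, the tail of async_pf_execute() looks roughly like this
(from memory, trimmed), and those last few lines are exactly what the flush
has to wait for:

        spin_lock(&vcpu->async_pf.lock);
        list_add_tail(&apf->link, &vcpu->async_pf.done);
        spin_unlock(&vcpu->async_pf.lock);

        ...

        /*
         * The vCPU can see and free the item as soon as it's on 'done', and
         * userspace can then put the last vCPU file reference, so without
         * the flush nothing guarantees kvm.ko is still loaded when the
         * lines below execute.
         */
        trace_kvm_async_pf_completed(addr, cr2_or_gpa);

        __kvm_vcpu_wake_up(vcpu);

        mmput(mm);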

> In that case I'd suggest we emphasize that in the comment as currently it
> sounds like kvm_arch_destroy_vm() is the only problematic path.

How's this?

/*
* The async #PF is "done", but KVM must wait for the work item itself,
* i.e. async_pf_execute(), to run to completion. If KVM is a module,
* KVM must ensure *no* code owned by KVM (the module) can be run
* after the last call to module_put(). Note, flushing the work item
* is always required when the item is taken off the completion queue.
* E.g. even if the vCPU handles the item in the "normal" path, the VM
* could be terminated before async_pf_execute() completes.
*
* Wake all events skip the queue and go straight to 'done', i.e. they
* don't need to be flushed (but sanity check that the work wasn't queued).
*/
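
For reference, the wake-all case never initializes the embedded work_struct;
kvm_async_pf_wakeup_all() does roughly this (from memory, trimmed):

        work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC);
        if (!work)
                return -ENOMEM;

        work->wakeup_all = true;
        INIT_LIST_HEAD(&work->queue); /* for list_del to work */

        spin_lock(&vcpu->async_pf.lock);
        list_add_tail(&work->link, &vcpu->async_pf.done);
        spin_unlock(&vcpu->async_pf.lock);

i.e. INIT_WORK() is never called and work.func stays NULL, which is what the
WARN_ON_ONCE() below keys off of.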

> > + * Wake all events skip the queue and go straight done, i.e. don't
> > + * need to be flushed (but sanity check that the work wasn't queued).
> > + */
> > + if (work->wakeup_all)
> > + WARN_ON_ONCE(work->work.func);
> > + else
> > + flush_work(&work->work);
> > + kmem_cache_free(async_pf_cache, work);
> > }
> >
> > void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > #else
> > if (cancel_work_sync(&work->work)) {
> > mmput(work->mm);
> > - kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> > kmem_cache_free(async_pf_cache, work);
> > }
> > #endif
> > @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > list_first_entry(&vcpu->async_pf.done,
> > typeof(*work), link);
> > list_del(&work->link);
> > - kmem_cache_free(async_pf_cache, work);
> > +
> > + spin_unlock(&vcpu->async_pf.lock);
> > +
> > + /*
> > + * The async #PF is "done", but KVM must wait for the work item
> > + * itself, i.e. async_pf_execute(), to run to completion. If
> > + * KVM is a module, KVM must ensure *no* code owned by the KVM
> > + * (the module) can be run after the last call to module_put(),
> > + * i.e. after the last reference to the last vCPU's file is put.
> > + */

Doh, this is a duplicate comment, I'll delete it.
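
I.e. that part of the loop ends up as just:

        spin_unlock(&vcpu->async_pf.lock);
        kvm_flush_and_free_async_pf_work(work);
        spin_lock(&vcpu->async_pf.lock);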

> > + kvm_flush_and_free_async_pf_work(work);
> > + spin_lock(&vcpu->async_pf.lock);
> > }
> > spin_unlock(&vcpu->async_pf.lock);
> >
> > @@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
> >
> > list_del(&work->queue);
> > vcpu->async_pf.queued--;
> > - kmem_cache_free(async_pf_cache, work);
> > + kvm_flush_and_free_async_pf_work(work);
> > }
> > }
> >
> > @@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > work->arch = *arch;
> > work->mm = current->mm;
> > mmget(work->mm);
> > - kvm_get_kvm(work->vcpu->kvm);
> >
> > INIT_WORK(&work->work, async_pf_execute);
>
> Reviewed-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
>
> --
> Vitaly
>