Re: [PATCH v6 02/12] Halt vcpu if page it tries to access is swapped out.

From: Gleb Natapov
Date: Thu Oct 07 2010 - 13:47:44 EST


On Thu, Oct 07, 2010 at 11:50:08AM +0200, Avi Kivity wrote:
> On 10/04/2010 05:56 PM, Gleb Natapov wrote:
> >If a guest accesses swapped out memory do not swap it in from vcpu thread
> >context. Schedule work to do swapping and put vcpu into halted state
> >instead.
> >
> >Interrupts will still be delivered to the guest, and if an interrupt
> >causes a reschedule, the guest will continue to run another task.
> >
> >
> >+
> >+static bool can_do_async_pf(struct kvm_vcpu *vcpu)
> >+{
> >+	if (unlikely(!irqchip_in_kernel(vcpu->kvm) ||
> >+		     kvm_event_needs_reinjection(vcpu)))
> >+		return false;
> >+
> >+	return kvm_x86_ops->interrupt_allowed(vcpu);
> >+}
>
> Strictly speaking, if the cpu can handle NMIs it can take an apf?
>
We can always do apf, but if the vcpu can't do anything, why bother? For
the NMI watchdog, yes, maybe it is worth allowing apf if NMI is allowed.

> >@@ -5112,6 +5122,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > if (unlikely(r))
> > goto out;
> >
> >+ kvm_check_async_pf_completion(vcpu);
> >+ if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
> >+ /* Page is swapped out. Do synthetic halt */
> >+ r = 1;
> >+ goto out;
> >+ }
> >+
>
> Why do it here in the fast path? Can't you halt the cpu when
> starting the page fault?
The page fault may complete before guest re-entry. We do not want to halt
the vcpu in that case.
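
The ordering argument can be illustrated with a small user-space model. This is only a sketch: every name in it is hypothetical, it is single-threaded, and it assumes that reaping a completed fault makes the vcpu runnable again, which the quoted hunk does not show.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names come from the patch. */
enum model_mp_state { MODEL_RUNNABLE, MODEL_HALTED };

struct model_vcpu {
	enum model_mp_state mp_state;
	int pending;	/* async faults still being served */
	int done;	/* async faults finished, not yet reaped */
};

/* Fault path: the page is swapped out, so queue work and mark the halt. */
static void model_page_not_present(struct model_vcpu *v)
{
	assert(v);
	v->pending++;
	v->mp_state = MODEL_HALTED;
}

/* Worker side: the page is now resident. */
static void model_work_finished(struct model_vcpu *v)
{
	assert(v && v->pending > 0);
	v->pending--;
	v->done++;
}

/* Assumed behaviour: reaping a completion makes the vcpu runnable again. */
static void model_check_completion(struct model_vcpu *v)
{
	if (v->done > 0) {
		v->done = 0;
		v->mp_state = MODEL_RUNNABLE;
	}
}

/* Returns true when the vcpu should take the synthetic halt. */
static bool model_enter_guest(struct model_vcpu *v)
{
	model_check_completion(v);
	return v->mp_state == MODEL_HALTED;
}
```

If model_work_finished() runs between model_page_not_present() and model_enter_guest(), the entry path sees the completion first and no halt is taken. Halting at fault time would miss that window.
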
>
> I guess the apf threads can't touch mp_state, but they can have a
> KVM_REQ to trigger the check.
That would require a KVM_REQ check on the fast path, so what's the
difference performance-wise?

>
> > if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> > inject_pending_event(vcpu);
> >
> >@@ -5781,6 +5798,9 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
> >
> > kvm_make_request(KVM_REQ_EVENT, vcpu);
> >
> >+ kvm_clear_async_pf_completion_queue(vcpu);
> >+ memset(vcpu->arch.apf.gfns, 0xff, sizeof vcpu->arch.apf.gfns);
>
> An ordinary for loop is less tricky, even if it means one more line.
>
> >
> >@@ -6040,6 +6064,7 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
> > int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
> > {
> > return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE
> >+ || !list_empty_careful(&vcpu->async_pf.done)
> > || vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
> > || vcpu->arch.nmi_pending ||
> > (kvm_arch_interrupt_allowed(vcpu)&&
>
> Unrelated, shouldn't kvm_arch_vcpu_runnable() look at
> vcpu->requests? Specifically KVM_REQ_EVENT?
I think KVM_REQ_EVENT is covered by checking nmi and interrupt queue
here.

>
> >+static void kvm_add_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> >+{
> >+	u32 key = kvm_async_pf_hash_fn(gfn);
> >+
> >+	while (vcpu->arch.apf.gfns[key] != -1)
> >+		key = kvm_async_pf_next_probe(key);
>
> Not sure what that -1 converts to on i386 where gfn_t is u64.
Will check.
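
For reference, standard C converts the int -1 to the unsigned 64-bit type before the comparison, so it becomes all ones regardless of the width of int or long. That is the same value the memset with 0xff leaves in each slot. The user-space sketch below checks this and also shows the plain-loop reset. It assumes gfn_t is a 64-bit unsigned type; APF_SLOTS, APF_EMPTY, the hash and the probe step are hypothetical stand-ins, not the patch's helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t gfn_t;		/* assumption: 64-bit even on a 32-bit build */

#define APF_SLOTS 64		/* hypothetical size, must be a power of two */
#define APF_EMPTY (~(gfn_t)0)	/* explicit name for the "free slot" value */

struct apf_table {
	gfn_t gfns[APF_SLOTS];
	unsigned int used;
};

/* Reset with an ordinary loop, as suggested in the review. */
static void apf_reset_loop(struct apf_table *t)
{
	for (size_t i = 0; i < APF_SLOTS; i++)
		t->gfns[i] = APF_EMPTY;
	t->used = 0;
}

/* Reset by filling every byte with 0xff, as the patch does. */
static void apf_reset_memset(struct apf_table *t)
{
	memset(t->gfns, 0xff, sizeof t->gfns);
	t->used = 0;
}

/* Stand-in hash and probe step; the patch's real helpers are not shown. */
static uint32_t apf_hash(gfn_t gfn)
{
	return (uint32_t)(gfn & (APF_SLOTS - 1));
}

static uint32_t apf_next_probe(uint32_t key)
{
	return (key + 1) & (APF_SLOTS - 1);
}

/* Linear probing insert; returns the slot that was used. */
static uint32_t apf_add(struct apf_table *t, gfn_t gfn)
{
	uint32_t key = apf_hash(gfn);

	assert(t->used < APF_SLOTS);	/* otherwise the loop never ends */
	/* The int -1 is converted to gfn_t, i.e. all ones, before comparing. */
	while (t->gfns[key] != -1)
		key = apf_next_probe(key);
	t->gfns[key] = gfn;
	t->used++;
	return key;
}
```

Both resets leave the table in the same state, and two gfns that hash to the same slot land in adjacent slots. A named constant such as APF_EMPTY would make the comparison self-explanatory.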

> >+
> >+void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> >+				     struct kvm_async_pf *work)
> >+{
> >+	vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> >+
> >+	if (work == kvm_double_apf)
> >+		trace_kvm_async_pf_doublefault(kvm_rip_read(vcpu));
> >+	else {
> >+		trace_kvm_async_pf_not_present(work->gva);
> >+
> >+		kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
> >+	}
> >+}
>
> Just have vcpu as the argument for tracepoints to avoid
> unconditional kvm_rip_read (slow on Intel), and call kvm_rip_read()
> in tp_fast_assign(). Similarly you can pass work instead of
> work->gva, though that's not nearly as important.
>
Will do.
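
The point of passing the vcpu is that the slow register read then happens only when the tracepoint is enabled. A user-space model of the difference, as a rough sketch: the names are hypothetical, and trace_enabled stands in for whatever switch the real tracepoint uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; slow_reads counts the expensive register reads. */
struct model_vcpu {
	uint64_t rip;
	unsigned int slow_reads;
};

static bool trace_enabled;		/* stand-in for the tracepoint switch */
static uint64_t last_traced_rip;	/* stand-in for the trace buffer */

/* Models a read that is expensive, like kvm_rip_read() on Intel. */
static uint64_t model_rip_read(struct model_vcpu *v)
{
	v->slow_reads++;
	return v->rip;
}

/* Eager form: the caller has already paid for the read. */
static void model_trace_eager(uint64_t rip)
{
	if (trace_enabled)
		last_traced_rip = rip;
}

/* Lazy form: the read happens in the assign step, only when enabled. */
static void model_trace_lazy(struct model_vcpu *v)
{
	assert(v);
	if (trace_enabled)
		last_traced_rip = model_rip_read(v);
}
```

With tracing off, model_trace_eager(model_rip_read(&v)) still performs one slow read, while model_trace_lazy(&v) performs none.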

> >+
> >+TRACE_EVENT(
> >+ kvm_async_pf_not_present,
> >+ TP_PROTO(u64 gva),
> >+ TP_ARGS(gva),
>
> Do you actually have a gva with tdp? With nested virtualization,
> how do you interpret this gva?
With tdp it is a gpa, just as tdp_page_fault gets a gpa where the shadow
page version gets a gva. Nested virtualization is too complex to interpret.

> >+
> >+TRACE_EVENT(
> >+ kvm_async_pf_completed,
> >+ TP_PROTO(unsigned long address, struct page *page, u64 gva),
> >+ TP_ARGS(address, page, gva),
>
> What does address mean? There's also gva?
>
hva.

> >+
> >+ TP_STRUCT__entry(
> >+ __field(unsigned long, address)
> >+ __field(struct page*, page)
> >+ __field(u64, gva)
> >+ ),
> >+
> >+ TP_fast_assign(
> >+ __entry->address = address;
> >+ __entry->page = page;
> >+ __entry->gva = gva;
> >+ ),
>
> Recording a struct page * in a tracepoint? Userspace can read this
> entry, better to do the page_to_pfn() here.
>
OK.

>
> >+void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> >+{
> >+	/* cancel outstanding work queue item */
> >+	while (!list_empty(&vcpu->async_pf.queue)) {
> >+		struct kvm_async_pf *work =
> >+			list_entry(vcpu->async_pf.queue.next,
> >+				   typeof(*work), queue);
> >+		cancel_work_sync(&work->work);
> >+		list_del(&work->queue);
> >+		if (!work->page) /* work was canceled */
> >+			kmem_cache_free(async_pf_cache, work);
> >+	}
>
> Are you holding any lock here?
>
> If not, what protects vcpu->async_pf.queue?
Nothing. It is accessed only from vcpu thread.

> If yes, cancel_work_sync() will need to acquire it too (in case work
> is running now and needs to take the lock, and cancel_work_sync()
> needs to wait for it) -> deadlock.
>
Work never touches this list.

> >+
> >+ /* do alloc nowait since if we are going to sleep anyway we
> >+ may as well sleep faulting in page */
> /*
> * multi
> * line
> * comment
> */
>
> (but a good one, this is subtle)
>
> I missed where you halt the vcpu. Can you point me at the function?
>
> Note this is a synthetic halt and must not be visible to live
> migration, or we risk live migrating a halted state which doesn't
> really exist.
>
> Might be simplest to drain the apf queue on any of the save/restore ioctls.
>
So that "info cpu" will interfere with apf? Migration should work in
the regular way. The apf state should not be migrated since it has no
meaning on the destination. I'll make sure the synthetic halt state does
not interfere with migration.
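
One possible shape for this, sketched as a user-space model, is to keep the synthetic halt in its own flag so that the state reported to userspace never shows it. This is an assumption for illustration only, not what the patch does; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; not taken from the patch. */
enum model_mp_state { MODEL_RUNNABLE, MODEL_HALTED };

struct model_vcpu {
	enum model_mp_state mp_state;	/* architectural, may be migrated */
	bool apf_halted;		/* synthetic, local to this host */
};

/* The async fault path sets the private flag, not mp_state. */
static void model_apf_halt(struct model_vcpu *v)
{
	assert(v);
	v->apf_halted = true;
}

/* The page arrived, so drop the synthetic halt. */
static void model_apf_done(struct model_vcpu *v)
{
	assert(v);
	v->apf_halted = false;
}

/* The run loop blocks on either kind of halt. */
static bool model_should_block(const struct model_vcpu *v)
{
	return v->mp_state == MODEL_HALTED || v->apf_halted;
}

/* What a save ioctl would report: only the architectural state. */
static enum model_mp_state model_get_mp_state(const struct model_vcpu *v)
{
	return v->mp_state;
}
```

In this model a state query during an outstanding fault reports MODEL_RUNNABLE and does not need to drain the queue.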

--
Gleb.