Re: [PATCH] KVM: SVM: Update destination when updating pi irte

From: Joao Martins
Date: Thu May 18 2023 - 05:02:31 EST




On 18/05/2023 09:19, Maxim Levitsky wrote:
> On Thu, 2023-05-18 at 11:58 +0800, dengqiao.joey wrote:
>> On Wed, May 17, 2023 at 11:32 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>>> On 17/05/2023 13:04, dengqiao.joey wrote:
>>>> This seems to be a different issue. Joao's patch tries to fix the issue
>>>> that IRTE is not changed. The problem I encountered is that IRTE does
>>>> get changed, thus the destination field is also cleared by
>>>> amd_iommu_activate_guest_mode().
>>>
>>> Whether the amd_iommu_activate_guest_mode() clears the destination field or not
>>> doesn't change the situation I think. So I don't think that is the problem
>>> you're having. amd_iommu_activate_guest_mode() will always make IRTEs with
>>> isRun=0. Which means your VF interrupts get injected via GALog events and the
>>> DestID is meaningless there :/ Even if you leave the old affinity there with
>>> the isRun=1 bit set, it's still wrong, as by the time you run the vcpu the wrong
>>> affinity will be used by the VF. More fundamentally, I am not sure that on IRTE
>>> routing updates in the VM you can look at vcpu->cpu the way you are using it; it
>>> could be invalid when vcpus are blocked/preempted. You're changing IRTEs
>>> for the whole VM in pi_update_irte()
>>>
>>> My changes essentially ensure that IRTEs are always updated with the
>>> right GATag set in the IRTE ... and then when the vCPU migrates, wakes or
>>> blocks+unblocks, it will change IRTE[DestID] to the new pcpu (which the vcpu is
>>> running on) via amd_iommu_update_ga(). So there's a good chance my changes fix
>>> your affinity issue.
>>>
>>> Joao
>>
>> Sorry, I forgot to mention before that in my application scenario the vcpu runs on
>> a dedicated isolated cpu and uses "mwait" instead of "halt". So it will not be migrated
>> or blocked/woken, and it rarely gets preempted by other threads. I think your patch cannot fix
>> my issue because the vcpu rarely gets the opportunity to execute amd_iommu_update_ga()
>> from vcpu_load().
>>
>> So each time the interrupt affinity is updated I can observe the loss of VF interrupt.
>> After applying my patch, the problem is solved.
>>
>
> Just to see if I understand the issue correctly:
>
> vcpu 'affinity' of a hardware interrupt to a guest vCPU
> (which in the AMD case is pretty much the address of the apic backing page + vector)
>
> changes in these cases:
>
> 1. when a new VFIO device is attached to the VM and the guest enables MSI,
> which triggers attachment of an irqfd that is matched with the eventfd from the
> hardware device, and an irq bypass is created.
>
> 2. when the guest changes the MSI/MSI-X settings - this is similar to 1,
> and can happen any time
>
> 3. When the vCPU is migrated.
>
> As I understand it, in cases 1 and 2 avic_pi_update_irte is called and indeed,
> if the vCPU is already running and in MWAIT/HLT state especially, then
> it will no longer receive doorbell kicks, and the GA log handler won't do anything
> to this vCPU either since it is running (it just does kvm_vcpu_wake_up)
>
That's by design I think, as the IOMMU will still update the vAPIC table at the
specified vector. The problem is that it is less efficient, but it will still
inject the interrupt correctly, as activate_guest_mode will also update the
vector and the new descriptor data representing the new vCPU APIC. The only
thing a vCPU affinity change invalidates is the doorbell destination, which is
no longer right, and thus we need to update it on the next vmentry.
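
To make that split concrete, here is a rough model of which IRTE pieces each
path touches, per my reading of the AMD IOMMU guest-mode handling. The struct
and helpers below are only an illustrative sketch, not the literal
drivers/iommu/amd code:

/* Sketch only -- simplified model, field names are illustrative. */
struct irte_ga_sketch {
	bool	guest_mode;	/* deliver into the vAPIC backing page */
	bool	ga_log_intr;	/* log a GALog event when !is_run */
	bool	is_run;		/* target vCPU running: ring the doorbell */
	u32	ga_tag;		/* identifies the (VM, vCPU) pair */
	u64	ga_root_ptr;	/* vAPIC backing page address */
	u8	vector;
	u32	destination;	/* pCPU APIC ID to ring the doorbell on */
};

/* avic_pi_update_irte() path (MSI/affinity update): */
static void activate_guest_mode_sketch(struct irte_ga_sketch *irte,
				       u32 ga_tag, u64 ga_root_ptr, u8 vector)
{
	irte->guest_mode  = true;
	irte->ga_log_intr = true;
	irte->ga_tag      = ga_tag;
	irte->ga_root_ptr = ga_root_ptr;
	irte->vector      = vector;
	irte->is_run      = false;	/* delivery via GALog from here on */
}

/* avic_vcpu_load() -> amd_iommu_update_ga() path (vCPU scheduled in): */
static void update_ga_sketch(struct irte_ga_sketch *irte, u32 pcpu_apic_id)
{
	irte->destination = pcpu_apic_id;
	irte->is_run      = true;	/* doorbell again, no GALog */
}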

>
> In case 3, the vCPU is loaded again eventually and that calls the
> avic_update_iommu_vcpu_affinity which calls into iommu 'amd_iommu_update_ga'
> to update target vCPU in the iommu remap entry.
>
Correct.

I think the issue here is that affinity updates invalidate the current IRTE
destination, which makes delivery depend on vCPU wakeups; but if the vCPU never
exits, the IRTE is never marked as Running again until the next vcpu_load().

>
> Now, MSI/MSI-X config space updates usually don't happen often
> (e.g. if the guest doesn't migrate interrupts often, or the device has per-cpu interrupts (e.g. nvme)),
> and vcpu loads/puts also mask this, especially since the MSI config space update itself is done
> via a userspace vmexit (e.g. the qemu machine model) which also triggers vcpu load/puts.
>
Yeap.

> I think that we do need a flag indicating if the vCPU is currently running, and if yes,
> then use svm->vcpu.cpu (or put -1 in it when it is not running, or something)
> (currently vcpu->cpu remains set when the vCPU is put).
>
> In other words, if a vCPU is running, then avic_pi_update_irte should put the correct pCPU number,
> and if it races with vCPU put/load, then the later one should win and put the correct value.
> This can be done either with a lock or with barriers.
>
If this could be done, it would remove cost from other places and avoid this
little dance with the GALog (and avoid its usage, as it's not the greatest
design aspect of the IOMMU). We anyway already need to do IRT flushes whenever
we update any piece of the IRTE, but we need some care there too to avoid
invalidating too much (which is just as expensive, and per-vCPU).

> In its current form, this patch is wrong since if the target vCPU is sleeping, it will tell
> the iommu to send doorbells to the last pCPU the target was running on.
>
Exactly my point.

> FYI, I had the same issue with my nested avic implementation - when we have a nested vCPU run,
> I need to translate the shadow physid table with the pCPU numbers, and at the same time,
> vCPU migration also triggers an update to the same table. I used a lock in the first implementation
> but it can be improved.
>

Interesting.

> As for the variable, I used a private svm variable, 'nested_avic_active', which was true only when
> the vCPU was loaded and its nested avic activated.
>
I had a different take on it, largely based on how we do other things (e.g.
pfncache updates). By the time we update vCPU affinity we will inevitably
enable the GALog with today's semantics, until IRTEs/vCPUs are marked as isRunning
again. So what I was doing was asking the vCPUs to go out of guest mode (but not
those that are blocked, as that case is covered) and requesting that the individual
vCPUs update their affinity. This would be another way.

It is still quite expensive (as many IPIs as vCPUs updated), but it works as
intended and the guest will immediately see the right vcpu affinity. But I honestly
prefer going towards your suggestion (via vcpu->cpu) if we can have some
assurance that vcpu->cpu is safe to use in pi_update_irte when protected against
preemption/blocking of the vCPU. I would suggest we get some smarts into
changing affinity, such that when we update the GATag/GAVector/GARootPtr we
also put a pCPU in there. Perhaps we could even remove activate_guest_mode()
and just batch all these updates; a rough sketch of what I mean follows below.
If you have some changes on the vcpu::cpu side I am happy to try them out on
the AMD IOMMU side on top of them.
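
To illustrate that direction, here is a purely hypothetical sketch (reusing the
irte_ga_sketch model from above; neither the running_cpu field nor this helper
exist today). The idea is that the affinity update would carry the target pCPU
when the vCPU is known to be running, so GATag/vector/root pointer and
destination/isRunning get programmed in a single IRTE write instead of leaving
the entry !isRunning until the next vcpu_load():

/* Hypothetical: extra data the KVM side would pass down on IRTE updates. */
struct pi_update_sketch {
	u32	ga_tag;
	u8	vector;
	u64	vapic_backing_page;	/* ga_root_ptr */
	int	running_cpu;		/* pCPU APIC ID, or -1 if not running */
};

/* Hypothetical replacement for activate_guest_mode() on this path. */
static void activate_guest_mode_on_cpu_sketch(struct irte_ga_sketch *irte,
					      struct pi_update_sketch *pi)
{
	irte->guest_mode  = true;
	irte->ga_log_intr = true;	/* only consulted when !is_run */
	irte->ga_tag      = pi->ga_tag;
	irte->ga_root_ptr = pi->vapic_backing_page;
	irte->vector      = pi->vector;

	if (pi->running_cpu >= 0) {
		/* vCPU is running: keep doorbell delivery, no GALog needed. */
		irte->destination = pi->running_cpu;
		irte->is_run      = true;
	} else {
		/* Fall back to GALog until the next vcpu_load(). */
		irte->is_run      = false;
	}
	/* Caller still issues a single IRT flush for the entry afterwards. */
}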

The snippet below isn't exactly a formal RFC patch (it's split better in my
local branch), but I am attaching the diffs anyhow for discussion purposes.

------------------------------->8----------------------------------
Subject: KVM: SVM: Avoid GALog when updating IRTE affinity

Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx>
---
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fb9d1f2d6136..2dd41dbb51cb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -113,6 +113,8 @@
KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_HV_TLB_FLUSH \
KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_APICV_AFFINITY_UPDATE \
+ KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT)

#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -1731,6 +1733,8 @@ struct kvm_x86_ops {
* Returns vCPU specific APICv inhibit reasons
*/
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+ void (*update_apicv_affinity)(struct kvm_vcpu *vcpu);
};
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index cfc8ab773025..535619a94770 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -39,6 +39,7 @@
* as far as hardware is concerned.
*/
#define AVIC_VCPU_ID_MASK AVIC_PHYSICAL_MAX_INDEX_MASK
+#define AVIC_VCPU_ID_NR (AVIC_PHYSICAL_MAX_INDEX_MASK + 1)

#define AVIC_VM_ID_SHIFT HWEIGHT32(AVIC_PHYSICAL_MAX_INDEX_MASK)
#define AVIC_VM_ID_MASK (GENMASK(31, AVIC_VM_ID_SHIFT) >> AVIC_VM_ID_SHIFT)
@@ -878,8 +879,10 @@ get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
uint32_t guest_irq, bool set)
{
+ DECLARE_BITMAP(vcpu_bitmap, AVIC_VCPU_ID_NR);
struct kvm_kernel_irq_routing_entry *e;
struct kvm_irq_routing_table *irq_rt;
+ bool update_affinity = false;
int idx, ret = 0;

if (!kvm_arch_has_assigned_device(kvm) ||
@@ -933,8 +936,16 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
* we can reference to them directly when we update vcpu
* scheduling information in IOMMU irte.
*/
- if (!ret && pi.is_guest_mode)
+ if (!ret && pi.is_guest_mode) {
svm_ir_list_add(svm, &pi);
+ if (!update_affinity) {
+ bitmap_zero(vcpu_bitmap,
+ AVIC_VCPU_ID_NR);
+ update_affinity = true;
+ }
+ __set_bit(svm->vcpu.vcpu_id,
+ vcpu_bitmap);
+ }
} else {
/* Use legacy mode in IRTE */
struct amd_iommu_pi_data pi;
@@ -979,6 +990,14 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
ret = 0;
out:
srcu_read_unlock(&kvm->irq_srcu, idx);
+ if (update_affinity) {
+ preempt_disable();
+ local_irq_enable();
+ kvm_make_vcpus_request_mask(kvm, KVM_REQ_APICV_AFFINITY_UPDATE,
+ vcpu_bitmap);
+ local_irq_disable();
+ preempt_enable();
+ }
return ret;
}

@@ -1012,6 +1031,21 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
return ret;
}

+void avic_update_affinity(struct kvm_vcpu *vcpu)
+{
+ int h_physical_id = kvm_cpu_get_apicid(vcpu->cpu);
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u64 entry;
+
+ entry = READ_ONCE(*(svm->avic_physical_id_cache));
+
+ /* Nothing to do if IsRunning == '0' due to vCPU blocking. */
+ if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
+ return;
+
+ avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
+}
+
void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
u64 entry;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index ca32389f3c36..2f3eb13ac248 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4935,6 +4935,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {

.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
+ .update_apicv_affinity = avic_update_affinity,
};

/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index f44751dd8d5d..c6c47c5c7ad8 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -706,6 +706,7 @@ void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
void avic_ring_doorbell(struct kvm_vcpu *vcpu);
unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu);
void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void avic_update_affinity(struct kvm_vcpu *vcpu);


/* sev.c */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ceb7c5e9cf9e..62fe938ea1c2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10458,6 +10458,18 @@ static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
static_call_cond(kvm_x86_set_apic_access_page_addr)(vcpu);
}

+void kvm_vcpu_update_apicv_affinity(struct kvm_vcpu *vcpu)
+{
+ if (!lapic_in_kernel(vcpu))
+ return;
+
+ if (!kvm_x86_ops.update_apicv_affinity)
+ return;
+
+ kvm_x86_ops.update_apicv_affinity(vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv_affinity);
+
void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
{
smp_send_reschedule(vcpu->cpu);
@@ -10586,6 +10598,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
vcpu_load_eoi_exitmap(vcpu);
if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
kvm_vcpu_reload_apic_access_page(vcpu);
+ if (kvm_check_request(KVM_REQ_APICV_AFFINITY_UPDATE, vcpu))
+ kvm_vcpu_update_apicv_affinity(vcpu);
if (kvm_check_request(KVM_REQ_HV_CRASH, vcpu)) {
vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
vcpu->run->system_event.type = KVM_SYSTEM_EVENT_CRASH;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cb5c13eee193..cd28aabd28f3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -317,6 +317,7 @@ bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,

return called;
}
+EXPORT_SYMBOL_GPL(kvm_make_vcpus_request_mask);

bool kvm_make_all_cpus_request_except(struct kvm *kvm, unsigned int req,
struct kvm_vcpu *except)