Re: [PATCH] KVM: SVM: Update destination when updating pi irte

From: Maxim Levitsky
Date: Thu May 18 2023 - 04:20:49 EST


On Thu, 2023-05-18 at 11:58 +0800, dengqiao.joey wrote:
> On Wed, May 17, 2023 at 11:32 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
> > On 17/05/2023 13:04, dengqiao.joey wrote:
> > > This seems to be a different issue. Joao's patch tries to fix the issue
> > > that IRTE is not changed. The problem I encountered is that IRTE does
> > > get changed, thus the destination field is also cleared by
> > > amd_iommu_activate_guest_mode().
> >
> > Whether amd_iommu_activate_guest_mode() clears the destination field or not
> > doesn't change the situation, I think. So I don't think that is the problem
> > you're having. amd_iommu_activate_guest_mode() will always make IRTEs with
> > isRun=0, which means your VF interrupts get injected via GALog events and the
> > DestID is meaningless there :/ Even if you leave the old affinity there with
> > the isRun=1 bit set, it's still wrong, as by the time you run the vCPU the wrong
> > affinity will be used by the VF. More fundamentally, I am not sure that on IRTE
> > routing updates in the VM you can look at vcpu->cpu the way you are using it;
> > it could be potentially invalid when vCPUs are blocked/preempted. You're
> > changing IRTEs for the whole VM in pi_update_irte().
> >
> > My changes essentially handle the fact that IRTEs will always be updated with
> > the right GATag set in the IRTE ... and then when the vCPU migrates, wakes or
> > blocks+unblocks, it will change IRTE[DestID] to the new pCPU (which the vCPU is
> > running on) via amd_iommu_update_ga(). So there's a good chance my changes fix
> > your affinity issue.
> >
> > Joao
>
> Sorry, I forgot to mention before that in my application scenario, the vCPU runs on
> a dedicated isolated CPU and uses "mwait" instead of "halt". So it will not be migrated
> or blocked/woken, and it rarely gets preempted by other threads. I think your patch cannot
> fix my issue because the vCPU rarely gets the opportunity to execute amd_iommu_update_ga()
> from vcpu_load().
>
> So each time the interrupt affinity is updated, I can observe the loss of VF interrupts.
> After applying my patch, the problem is solved.
>
>

Just to see if I understand the issue correctly:

the 'affinity' of a hardware interrupt to a guest vCPU
(which in the AMD case is pretty much the address of the APIC backing page + a vector)

changes in these cases:

1. when a new VFIO device is attached to the VM and the guest enables MSI,
which triggers attachment of an irqfd that is matched with the eventfd from the
hardware device, and an irq bypass is created (see the sketch below).

2. when the guest changes the MSI/MSI-X settings - this is similar to 1
and can happen at any time.

3. When the vCPU is migrated.
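
(For cases 1 and 2, the entry point is the irq bypass consumer callback on x86;
roughly, simplified from memory, error handling and the assignment bookkeeping
omitted:)

int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
                                     struct irq_bypass_producer *prod)
{
        struct kvm_kernel_irqfd *irqfd =
                container_of(cons, struct kvm_kernel_irqfd, consumer);

        irqfd->producer = prod;

        /* on SVM this lands in avic_pi_update_irte() */
        return static_call(kvm_x86_pi_update_irte)(irqfd->kvm,
                                                   prod->irq, irqfd->gsi, 1);
}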

As I understand it, in cases 1 and 2 avic_pi_update_irte() is called and indeed,
if the vCPU is already running (especially if it is in guest MWAIT/HLT state), then
it will no longer receive doorbell kicks, and the GA log handler won't do anything
for this vCPU either, since it is running (the handler just does kvm_vcpu_wake_up()).
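
(This happens because avic_pi_update_irte() ends up reprogramming the entry via
amd_iommu_activate_guest_mode(), which rewrites the IRTE with is_run=0 and does
not preserve the destination; roughly, simplified from memory of the
drivers/iommu/amd code, details may differ:)

int amd_iommu_activate_guest_mode(void *data)
{
        struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
        struct irte_ga *entry = (struct irte_ga *) ir_data->entry;

        /* the whole entry is cleared ... */
        entry->lo.val = 0;
        entry->hi.val = 0;

        /* ... and only the guest mode fields are filled back in,
         * so is_run stays 0 and the old destination is lost */
        entry->lo.fields_vapic.guest_mode  = 1;
        entry->lo.fields_vapic.ga_log_intr = 1;
        entry->hi.fields.ga_root_ptr       = ir_data->ga_root_ptr;
        entry->hi.fields.vector            = ir_data->ga_vector;
        entry->lo.fields_vapic.ga_tag      = ir_data->ga_tag;

        /* ... the entry is then written back to the remapping table */
        ...
}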


In case 3, the vCPU is eventually loaded again, and that calls
avic_update_iommu_vcpu_affinity(), which calls into the IOMMU's amd_iommu_update_ga()
to update the target pCPU in the IOMMU remapping entry.
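
(amd_iommu_update_ga() is roughly this, again simplified from memory:)

int amd_iommu_update_ga(int cpu, bool is_run, void *data)
{
        struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
        struct irte_ga *entry = (struct irte_ga *) ir_data->entry;

        if (!entry || !entry->lo.fields_vapic.guest_mode)
                return 0;

        if (cpu >= 0) {
                entry->lo.fields_vapic.destination =
                                APICID_TO_IRTE_DEST_LO(cpu);
                entry->hi.fields.destination =
                                APICID_TO_IRTE_DEST_HI(cpu);
        }
        entry->lo.fields_vapic.is_run = is_run;

        /* ... the entry is then written back to the remapping table */
        ...
}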


Now it is possible that the MSI/MSI-X config space updates usually don't happen often
(e.g. if the guest doesn't migrate interrupts often, or the device has per-CPU interrupts (e.g. nvme)),
and also vCPU loads/puts mask this, especially since the MSI config space update itself is done
via a userspace vmexit (e.g. the qemu machine model), which also triggers vCPU loads/puts.

I think that we do need a flag indicating whether the vCPU is currently running, and if yes,
then use svm->vcpu.cpu (or store -1 in it when the vCPU is not running, or something like that)
(currently vcpu->cpu remains set when the vCPU is put).

In other words, if a vCPU is running, then avic_pi_update_irte() should program the correct
pCPU number, and if it races with vCPU put/load, the later update should win and leave the
correct value in place. This can be done either with a lock or with barriers.
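
Something along these lines (a hypothetical, untested sketch; the 'loaded_cpu'
field and reusing ir_list_lock for this are my invention, not existing code):

/* in struct vcpu_svm: int loaded_cpu; == -1 while the vCPU is not loaded */

static void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
        struct vcpu_svm *svm = to_svm(vcpu);
        unsigned long flags;

        spin_lock_irqsave(&svm->ir_list_lock, flags);
        svm->loaded_cpu = cpu;
        /* walks svm->ir_list and calls amd_iommu_update_ga(cpu, true, ...) */
        avic_update_iommu_vcpu_affinity(vcpu, cpu, true);
        spin_unlock_irqrestore(&svm->ir_list_lock, flags);
}

static void avic_vcpu_put(struct kvm_vcpu *vcpu)
{
        struct vcpu_svm *svm = to_svm(vcpu);
        unsigned long flags;

        spin_lock_irqsave(&svm->ir_list_lock, flags);
        svm->loaded_cpu = -1;
        avic_update_iommu_vcpu_affinity(vcpu, -1, false);
        spin_unlock_irqrestore(&svm->ir_list_lock, flags);
}

and then the avic_pi_update_irte() path would, under the same lock, do:

        amd_iommu_update_ga(svm->loaded_cpu, svm->loaded_cpu >= 0, ir->data);

so whichever of the two writers runs last programs the IRTE with a consistent
pCPU/is_run pair.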

In its current form, this patch is wrong: if the target vCPU is sleeping, it will tell the IOMMU
to send doorbells to the last pCPU the target was running on.

FYI, I had the same issue in my nested AVIC implementation - when a nested vCPU runs,
I need to translate the shadow physid table with the pCPU numbers, and at the same time
vCPU migration also triggers updates to the same table. I used a lock in the first
implementation, but it can be improved.

As for the variable, I used a private svm field, 'nested_avic_active', which was true only
while the vCPU was loaded and its nested AVIC was activated.

Best regards,
Maxim Levitsky