Re: [PATCH] KVM: SVM: Update destination when updating pi irte

From: Alejandro Jimenez
Date: Fri Jul 21 2023 - 14:45:07 EST



On 7/19/23 11:57, Sean Christopherson wrote:
> On Fri, Jul 14, 2023, Joao Martins wrote:
> > +Suravee, +Alejandro
> >
> > On 29/06/2023 23:35, Sean Christopherson wrote:
> > > On Thu, May 18, 2023, Joao Martins wrote:
> > > > On 18/05/2023 09:19, Maxim Levitsky wrote:
> > > > > I think that we do need a flag indicating whether the vCPU is
> > > > > currently running, and if yes, then use svm->vcpu.cpu (or put -1 in
> > > > > it when it is not running, or something); currently vcpu->cpu
> > > > > remains set when the vCPU is put.
> > > > >
> > > > > In other words, if a vCPU is running, then avic_pi_update_irte
> > > > > should put the correct pCPU number, and if it raced with vCPU
> > > > > put/load, then the later one should win and put the correct value.
> > > > > This can be done either with a lock or barriers.
> > > >
> > > > If this could be done, it could remove cost from other places and
> > > > avoid this little dance of the GA log (and avoid its usage, as it's
> > > > not the greatest design aspect of the IOMMU). We anyway already need
> > > > to do IRT flushes whenever we update any piece of the IRTE, but we
> > > > need some care there too to avoid invalidating too much (which is
> > > > just as expensive, and per-vCPU).
> > > > ...
> > > >
> > > > Still quite expensive (as many IPIs as vCPUs updated), but it works
> > > > as intended and the guest immediately sees the right vCPU affinity.
> > > > But I honestly prefer going towards your suggestion (via vcpu.pcpu),
> > > > if we can have some assurance that vcpu.cpu is safe to use in
> > > > pi_update_irte when protected against preemption/blocking of the
> > > > vCPU.
> > >
> > > I think we have all the necessary info, and even a handy dandy
> > > spinlock to ensure ordering. Disclaimers: compile tested only, I know
> > > almost nothing about the IOMMU side of things, and I don't know if I
> > > understood the needs for the !IsRunning cases.
> >
> > I was avoiding grabbing that lock, but now that I think about it, it
> > shouldn't do much harm.
> >
> > My only concern has mostly been whether we mark the IRTE IsRunning=1 on
> > a vCPU that is about to block, as then the doorbell rung by the IOMMU
> > won't do anything to the guest. But IIUC the physical ID cache read-once
> > should cover that.
>
> Acquiring ir_list_lock in avic_vcpu_{load,put}() when modifying
> AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK is the key to avoiding ordering
> issues. E.g. without the spinlock, READ_ONCE() wouldn't prevent
> svm_ir_list_add() from racing with avic_vcpu_{load,put}() and ultimately
> shoving stale data into the IRTE.
>
> It *should* actually be safe to drop the READ_ONCE() since
> acquiring/releasing the spinlock will prevent multiple loads from
> observing different values. I kept them mostly to keep the diff small,
> and to be conservative.
>
> The WRITE_ONCE() needs to stay to ensure that hardware doesn't see
> inconsistent information due to store tearing.
>
> If this patch works, I think it makes sense to follow up with a cleanup
> patch to drop the READ_ONCE() and add comments explaining why KVM uses
> WRITE_ONCE() but not READ_ONCE().
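
To make the scheme concrete for anyone following along, below is a rough
sketch of the pattern described above. This is my own illustration, not the
actual patch: avic_update_is_running() and the avic_physical_id_entry pointer
are made-up names, while ir_list_lock and the mask macros are the real
identifiers used by KVM's AVIC code.

--
/*
 * Sketch: the vCPU load/put path and the IRTE update path serialize on the
 * same per-vCPU lock, so whichever runs last writes the final pCPU/IsRunning
 * state, and WRITE_ONCE() prevents the 64-bit store from tearing while the
 * CPU/IOMMU may be concurrently reading the entry.
 */
static void avic_update_is_running(struct vcpu_svm *svm, int cpu, bool run)
{
	unsigned long flags;
	u64 entry;

	spin_lock_irqsave(&svm->ir_list_lock, flags);

	entry = READ_ONCE(*svm->avic_physical_id_entry);
	entry &= ~(AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK |
		   AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
	if (run)
		entry |= (cpu & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK) |
			 AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;

	/* Store tearing here would let hardware see a half-updated entry. */
	WRITE_ONCE(*svm->avic_physical_id_entry, entry);

	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
}
--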
I tested the patch on top of v6.5-rc1, so I could also use the host kernel parameter "amd_iommu=irtcachedis" from:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=66419036f68a

and found no issues running a 380 vCPU guest (using idle=poll), with an mlx VF, on a Genoa host.

I didn't run an interrupt-intensive workload, but I stressed the affinity changes in a tight loop on the guest by running:

--
while true; do
    # pick a random guest vCPU
    rcpu=$(($RANDOM % $(nproc)))

    # the mlx5_comp* IRQs are in the 35-49 range
    rirq=$((35 + $RANDOM % (50 - 35)))

    echo $rcpu > /proc/irq/$rirq/smp_affinity_list
done
--

As suggested by Joao, I checked for 'spurious' GA Log events, i.e. events received while the target vCPU is running. The window for this to happen is quite tight with the new changes, so after 100k affinity changes there are only 2 reported GA Log events on the guest:

--
@galog_events: 2
@galog_on_running[303]: 1
@galog_on_running[222]: 1

@vcpuHist:
[128, 256)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

--

whereas when running with an unpatched host kernel a significant number are detected:

--
@galog_events: 2476

[...]

@vcpuHist:
[0]                    2 |                                                    |
[1]                    1 |                                                    |
[2, 4)                11 |                                                    |
[4, 8)                13 |                                                    |
[8, 16)               51 |@@@                                                 |
[16, 32)              99 |@@@@@                                               |
[32, 64)             213 |@@@@@@@@@@@@                                        |
[64, 128)            381 |@@@@@@@@@@@@@@@@@@@@@@                              |
[128, 256)           869 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)           834 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |

--

The script I used makes assumptions about the strict order in which the probes fire, but given that the observed behavior matches the prediction, the assumptions seem sound. It is pasted below in case there are concerns about the logic.

Thank you,

Alejandro

---

#!/usr/bin/bpftrace

BEGIN
{
    printf("Tracing GALog events between affinity updates... Hit Ctrl-C to end.\n");
    zero(@irteModified);
    zero(@galog_events);
    zero(@galog_on_running);
}

/*
 * amd_iommu_activate_guest_mode() is mostly called from
 * amd_ir_set_vcpu_affinity() to process vCPU affinity updates, but it is
 * also called by avic_set_pi_irte_mode() when re-enabling AVIC after
 * inhibition, so this data is unreliable during boot, when most of the
 * inhibition events take place.
 */
kprobe:amd_iommu_activate_guest_mode
{
    /*
     * $vcpu_gatag = ((struct amd_ir_data *)arg0)->ga_tag;
     * amd_iommu_activate_guest_mode(void *data) takes the struct
     * amd_ir_data pointer as its only argument, hence arg0.
     * pahole -C amd_ir_data --hex drivers/iommu/amd/iommu.o
     * shows the offset of the u32 ga_tag field is 0x40.
     * The AVIC GATAG encodes the vCPU ID in the 9 LSBs.
     */
    $vcpu_gatag = (*(uint32 *)(arg0 + 0x40)) & 0x1FF;

    @irteModified[$vcpu_gatag] = 1;
}

tracepoint:kvm:kvm_avic_ga_log
{
    $vcpuid = args->vcpuid;

    @galog_events = count();

    /*
     * The GA Log reports an entry, so the IRTE IsRunning bit is expected
     * to be 0. The question is whether it was cleared by an affinity
     * update and has not yet been restored by a subsequent call to
     * amd_iommu_update_ga.
     */
    if (@irteModified[$vcpuid] != 0) {
        @galog_on_running[$vcpuid] = count();

        @vcpuHist = hist($vcpuid);
        //@vcpuHist = lhist($vcpuid, 0, 380, 1);
    }
}


kprobe:amd_iommu_update_ga
{
    /*
     * $vcpu_gatag = ((struct amd_ir_data *)arg2)->ga_tag;
     * amd_iommu_update_ga(int cpu, bool is_run, void *data) takes the
     * struct amd_ir_data pointer as its third argument, hence arg2
     * (and isRun is arg1).
     * pahole -C amd_ir_data --hex drivers/iommu/amd/iommu.o
     * shows the offset of the u32 ga_tag field is 0x40.
     * The AVIC GATAG encodes the vCPU ID in the 9 LSBs.
     */
    $vcpu_gatag = (*(uint32 *)(arg2 + 0x40)) & 0x1FF;

    /*
     * With the guest running with idle=poll, avic_vcpu_put() should not
     * be called, and any GA Log events detected must be spurious, i.e.
     * targeting a vCPU that is currently running. Only clear the flag
     * when setting IsRun to 1 (as via avic_vcpu_load() or
     * svm_ir_list_add()), to capture the spurious GA Log events.
     * arg1 ==> isRun
     */
    if (arg1 != 0) {
        @irteModified[$vcpu_gatag] = 0;
    }
}

END
{
    clear(@irteModified);
    print(@galog_events);
    print(@galog_on_running);
    print(@vcpuHist);
}
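
As an aside, for anyone auditing the 0x1FF masks in the probes above: this
matches my reading of the GATAG layout in arch/x86/kvm/svm/avic.c, where the
low 9 bits of the tag carry the vCPU ID since the GATag rework for 512 vCPUs.
A hand-written C rendering of the decode; treat the constants as a sketch and
double-check them against avic.c, since they are transcribed here from memory:

--
#include <stdint.h>

#define AVIC_VCPU_ID_MASK   0x1FFu  /* low 9 bits of the GATAG: vCPU ID */
#define AVIC_VM_ID_SHIFT    9       /* remaining bits: VM ID */

/* Mirrors AVIC_GATAG_TO_VCPUID(): the mask applied in the bpftrace probes. */
static inline uint32_t gatag_to_vcpuid(uint32_t ga_tag)
{
	return ga_tag & AVIC_VCPU_ID_MASK;
}

/* Mirrors AVIC_GATAG_TO_VMID(). */
static inline uint32_t gatag_to_vmid(uint32_t ga_tag)
{
	return ga_tag >> AVIC_VM_ID_SHIFT;
}
--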