RE: [PATCH v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI

From: Dexuan Cui
Date: Fri Nov 11 2022 - 19:58:45 EST


> From: Lorenzo Pieralisi <lpieralisi@xxxxxxxxxx>
> Sent: Friday, November 11, 2022 1:56 AM
> > ...
> > + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
> > + * to always return the same dummy vCPU, because a second call to
> > + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
> > + * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> > + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
>
> Why ? Can't you fix _that_ ? Why can't the initial call to
> hv_compose_msi_msg() determine the _real_ target vCPU ?

Unfortunately, I can't fix this, because of the way x86's irq management
code works. This is out of the control of the pci-hyperv driver.

hv_compose_msi_msg() uses irq_data_get_effective_affinity_mask() to get
the "effective" mask.

On x86, when the irq is initialized, irq_data_update_effective_affinity() is
called from apic_update_irq_cfg() -- please refer to the below debug code.

When hv_compose_msi_msg() is invoked the first time, it's from
pci_alloc_irq_vectors(), and the x86 irq code always passes CPU0 to pci-hyperv.
Please see the "cpumask_first(cpu_online_mask)" in
vector_assign_managed_shutdown() in the debug diff below.

When hv_compose_msi_msg() is invoked the second time, it's from
request_irq(), and the x86 irq code passes the real effective CPU to pci-hyperv.
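
For reference, here's a minimal sketch of how the target vCPU is derived
from the irq's effective affinity mask in hv_compose_msi_msg(). This is my
simplification, not the actual driver code: the body of
hv_compose_msi_req_get_cpu() and the pick_target_vp() helper are
illustrative only.

/* kernel context: linux/irq.h, linux/cpumask.h, asm/mshyperv.h */

static int hv_compose_msi_req_get_cpu(const struct cpumask *affinity)
{
	/* First online CPU in the effective affinity mask */
	return cpumask_first_and(affinity, cpu_online_mask);
}

static int pick_target_vp(struct irq_data *data)
{
	const struct cpumask *dest = irq_data_get_effective_affinity_mask(data);
	int cpu = hv_compose_msi_req_get_cpu(dest);

	/* Linux CPU number -> Hyper-V vCPU number sent to the vPCI VSP */
	return hv_cpu_number_to_vp_number(cpu);
}

On the first call the effective mask only contains CPU0 (set by
vector_assign_managed_shutdown()); on the second call it contains the
real target CPU.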

I added the debug code below and pasted the trimmed output after it.

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -179,6 +179,7 @@ static void vector_assign_managed_shutdown(struct irq_data *irqd)
unsigned int cpu = cpumask_first(cpu_online_mask);

apic_update_irq_cfg(irqd, MANAGED_IRQ_SHUTDOWN_VECTOR, cpu);
+ WARN(irqd->irq >= 24, "cdx: vector_assign_managed_shutdown: irq=%d, cpu=%d\n", irqd->irq, cpu);
}

static int reserve_managed_vector(struct irq_data *irqd)
@@ -251,6 +252,7 @@ assign_vector_locked(struct irq_data *irqd, const struct cpumask *dest)
return vector;
apic_update_vector(irqd, vector, cpu);
apic_update_irq_cfg(irqd, vector, cpu);
+ WARN(irqd->irq >= 24, "cdx: assign_vector_locked: irq=%d, cpu=%d\n", irqd->irq, cpu);

return 0;
}
@@ -328,6 +330,7 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
return vector;
apic_update_vector(irqd, vector, cpu);
apic_update_irq_cfg(irqd, vector, cpu);
+ WARN(irqd->irq >= 24, "cdx: assign_managed_vector: irq=%d, cpu=%d\n", irqd->irq, cpu);
return 0;
}

--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1803,6 +1803,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
u32 size;
int ret;

+ WARN(1, "cdx: hv_compose_msi_msg: irq=%d\n", data->irq);
/* Reuse the previous allocation */
if (data->chip_data) {
int_desc = data->chip_data;

1 cdx: vector_assign_managed_shutdown: irq=24, cpu=0
2 WARNING: CPU: 3 PID: 87 at arch/x86/kernel/apic/vector.c:182 vector_assign_managed_shutdown+0x53/0x60
3 RIP: 0010:vector_assign_managed_shutdown+0x53/0x60
4 reserve_irq_vector_locked+0x41/0xa0
5 x86_vector_alloc_irqs+0x298/0x460
6 irq_domain_alloc_irqs_hierarchy+0x1b/0x50
7 irq_domain_alloc_irqs_parent+0x17/0x30
8 msi_domain_alloc+0x83/0x150
9 irq_domain_alloc_irqs_hierarchy+0x1b/0x50
10 __irq_domain_alloc_irqs+0xdf/0x320
11 __msi_domain_alloc_irqs+0x103/0x3e0
12 msi_domain_alloc_irqs_descs_locked+0x3e/0x90
13 pci_msi_setup_msi_irqs+0x2d/0x40
14 __pci_enable_msix_range+0x2fd/0x420
15 pci_alloc_irq_vectors_affinity+0xb0/0x110
16 nvme_reset_work+0x1cf/0x1170
17 process_one_work+0x21f/0x3f0
18 worker_thread+0x4a/0x3c0
19 kthread+0xff/0x130
20 ret_from_fork+0x22/0x30
21
22 cdx: vector_assign_managed_shutdown: irq=24, cpu=0
23 WARNING: CPU: 3 PID: 87 at arch/x86/kernel/apic/vector.c:182 vector_assign_managed_shutdown+0x53/0x60
24 RIP: 0010:vector_assign_managed_shutdown+0x53/0x60
25 x86_vector_activate+0x149/0x1e0
26 __irq_domain_activate_irq+0x58/0x90
27 __irq_domain_activate_irq+0x38/0x90
28 irq_domain_activate_irq+0x2d/0x50
29 __msi_domain_alloc_irqs+0x1bb/0x3e0
30 msi_domain_alloc_irqs_descs_locked+0x3e/0x90
31 pci_msi_setup_msi_irqs+0x2d/0x40
32 __pci_enable_msix_range+0x2fd/0x420
33 pci_alloc_irq_vectors_affinity+0xb0/0x110
34 nvme_reset_work+0x1cf/0x1170
35 process_one_work+0x21f/0x3f0
36 worker_thread+0x4a/0x3c0
37 kthread+0xff/0x130
38 ret_from_fork+0x22/0x30
39
40
41 cdx: hv_compose_msi_msg: irq=24
42 WARNING: CPU: 3 PID: 87 at drivers/pci/controller/pci-hyperv.c:1806 hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
43 RIP: 0010:hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
44 irq_chip_compose_msi_msg+0x32/0x50
45 msi_domain_activate+0x45/0xa0
46 __irq_domain_activate_irq+0x58/0x90
47 irq_domain_activate_irq+0x2d/0x50
48 __msi_domain_alloc_irqs+0x1bb/0x3e0
49 msi_domain_alloc_irqs_descs_locked+0x3e/0x90
50 pci_msi_setup_msi_irqs+0x2d/0x40
51 __pci_enable_msix_range+0x2fd/0x420
52 pci_alloc_irq_vectors_affinity+0xb0/0x110
53 nvme_reset_work+0x1cf/0x1170
54 process_one_work+0x21f/0x3f0
55 worker_thread+0x4a/0x3c0
56 kthread+0xff/0x130
57 ret_from_fork+0x22/0x30
58
59
60
61 cdx: assign_vector_locked: irq=24, cpu=3
62 WARNING: CPU: 0 PID: 87 at arch/x86/kernel/apic/vector.c:255 assign_vector_locked+0x160/0x170
63 RIP: 0010:assign_vector_locked+0x160/0x170
64 assign_irq_vector_any_locked+0x6a/0x150
65 x86_vector_activate+0x93/0x1e0
66 __irq_domain_activate_irq+0x58/0x90
67 __irq_domain_activate_irq+0x38/0x90
68 irq_domain_activate_irq+0x2d/0x50
69 irq_activate+0x29/0x30
70 __setup_irq+0x2e5/0x780
71 request_threaded_irq+0x112/0x180
72 pci_request_irq+0xa3/0xf0
73 queue_request_irq+0x74/0x80
74 nvme_reset_work+0x408/0x1170
75 process_one_work+0x21f/0x3f0
76 worker_thread+0x4a/0x3c0
77 kthread+0xff/0x130
78 ret_from_fork+0x22/0x30
79
80 cdx: hv_compose_msi_msg: irq=24
81 WARNING: CPU: 0 PID: 87 at drivers/pci/controller/pci-hyperv.c:1806 hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
82 RIP: 0010:hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
83 irq_chip_compose_msi_msg+0x32/0x50
84 msi_domain_activate+0x45/0xa0
85 __irq_domain_activate_irq+0x58/0x90
86 irq_domain_activate_irq+0x2d/0x50
87 irq_activate+0x29/0x30
88 __setup_irq+0x2e5/0x780
89 request_threaded_irq+0x112/0x180
90 pci_request_irq+0xa3/0xf0
91 queue_request_irq+0x74/0x80
92 nvme_reset_work+0x408/0x1170
93 process_one_work+0x21f/0x3f0
94 worker_thread+0x4a/0x3c0
95 kthread+0xff/0x130
96 ret_from_fork+0x22/0x30

> > + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
> > + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
> > + * the same pCPU, even though the vCPUs will be spread out by later calls
> > + * to hv_irq_unmask(), but that is the best we can do now.
> > + *
> > + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
>
> "current" Hyper-V means nothing, remove it or provide versioning
> information. Imagine yourself reading this comment some time
> in the future.

Good point. @Wei, can you please help fix this? I think we can change
"With current Hyper-V" to "With Hyper-V in Nov 2022".

BTW, it's hard to provide the exact versioning info, because technically
there are many versions of Hyper-V...

> I can't claim to understand how this MSI vCPU to pCPU mapping is made to
> work in current code but I can't ack this patch sorry, if you feel like
> it is good to merge it it is your and Hyper-V maintainers call, feel
> free to go ahead - I can review PCI hyper-V changes that affect PCI
> and IRQs core APIs, I don't know Hyper-V internals.
>
> Lorenzo

I understand. Thanks!

I discussed the issue with the Hyper-V team. I believe I understand it now,
and this patch is the right thing to have.
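
For context, below is a minimal sketch of how the dummy vCPU can be
round-robin'ed for the multi-MSI case (illustrative only -- the function
name, the spinlock, and the exact wrap logic are my assumptions, not
necessarily identical to the patch):

/* kernel context: linux/spinlock.h, linux/cpumask.h */

/*
 * Pick a different online CPU each time, so the dummy vCPUs (and hence
 * the pCPUs chosen by Hyper-V) for multi-MSI devices are spread out.
 */
static int hv_compose_multi_msi_req_get_cpu(void)
{
	static DEFINE_SPINLOCK(multi_msi_cpu_lock);
	/* -1 means starting with CPU 0 */
	static int cpu_next = -1;
	unsigned long flags;
	int cpu;

	spin_lock_irqsave(&multi_msi_cpu_lock, flags);
	cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids,
				     false);
	cpu = cpu_next;
	spin_unlock_irqrestore(&multi_msi_cpu_lock, flags);

	return cpu;
}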

@Wei, Bjorn spotted two typos in the commit message, and Lorenzo
suggested a change to the above "current". Can you please help
fix these and merge the patch? Or let me know if it'd be easier for me
to send v4.

Thanks,
Dexuan