vector space exhaustion on 4.14 LTS kernels

From: Josh Hunt
Date: Mon Nov 19 2018 - 17:35:24 EST


Hi Thomas,

We have a class of machines that appears to be exhausting the interrupt vector space on cpus 0 and 1, which causes breakage later on when drivers try to set irq affinity. The boxes are running the 4.14 LTS kernel.
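For context, here is my (possibly wrong) mental model of the failure: as I read the 4.14 __assign_irq_vector, the search walks CPUs in numerical order and takes the first one with a free vector, so allocations pile up on the low-numbered CPUs. The userspace toy below is not kernel code; NR_CPUS and NR_AVAIL_VECTORS are made-up demo values (the real per-CPU vector space is much larger), and it only models the first-fit search order:

/*
 * Toy model of first-fit vector allocation -- not the kernel code,
 * just an illustration of why low-numbered CPUs exhaust first.
 */
#include <stdio.h>

#define NR_CPUS            4
#define NR_AVAIL_VECTORS   8

static int used[NR_CPUS];       /* vectors already allocated per CPU */

/* First-fit: scan CPUs in numerical order, take the first with room. */
static int assign_vector_first_fit(void)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                if (used[cpu] < NR_AVAIL_VECTORS) {
                        used[cpu]++;
                        return cpu;
                }
        }
        return -1;      /* maps to -ENOSPC in the real allocator */
}

int main(void)
{
        /* Allocate more irqs than a single CPU can hold. */
        for (int irq = 0; irq < 12; irq++)
                printf("irq %2d -> cpu %d\n", irq, assign_vector_first_fit());

        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                printf("cpu %d: %d/%d vectors used\n",
                       cpu, used[cpu], NR_AVAIL_VECTORS);
        return 0;
}

Running this, cpu 0 fills completely and cpu 1 takes the overflow while cpus 2 and 3 stay empty, which matches the shape of what we see on these boxes.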

I instrumented 4.14 and here's what I see:

[ 28.328849] __assign_irq_vector: irq:512 cpu:0 mask:ff,ffffffff onlinemask:ff,ffffffff vector:0
[ 28.329847] __assign_irq_vector: irq:512 cpu:2 vector:222 cfgvect:0 off:14 old_domain:00,00000000 domain:00,00000000 vector_search:00,00000004 update
[ 28.329847] default_cpu_mask_to_apicid: irq:512 mask:00,00000004
...
[ 31.729154] __assign_irq_vector: irq:512 cpu:0 mask:ff,ffffffff onlinemask:ff,ffffffff vector:222
[ 31.729154] __assign_irq_vector: irq:512 cpu:0 mask:ff,ffffffff vector_cpumask:00,00000001 vector:222
...
[ 31.729154] __assign_irq_vector: irq:512 cpu:2 vector:00,00000004 domain:00,00000004 success
[ 31.729154] default_cpu_mask_to_apicid: irq:512 hwirq:512 mask:00,00000004
[ 31.729154] apic_set_affinity: irq:512 mask:ff,ffffffff err:0
...
[ 32.818152] mlx5_irq_set_affinity_hint: 0: irq:512 mask:00,00000001
...
[ 39.531242] __assign_irq_vector: irq:512 cpu:0 mask:00,00000001 onlinemask:ff,ffffffff vector:222
[ 39.531244] __assign_irq_vector: irq:512 cpu:0 mask:00,00000001 vector_cpumask:00,00000001 vector:222
[ 39.531245] __assign_irq_vector: irq:512 cpu:0 vector:00,00000001 domain:00,00000004
...
[ 39.531384] __assign_irq_vector: irq:512 cpu:0 vector:37 current_vector:37 next_cpu2
[ 39.531385] __assign_irq_vector: irq:512 cpu:128 searched:00,00000001 vector:00,00000000 continue
[ 39.531386] apic_set_affinity: irq:512 mask:00,00000001 err:-28

The err:-28 in the last line is -ENOSPC, i.e. apic_set_affinity could not find a free vector on cpu 0. The affinity values at that point:

root@xxxxxxxxxxxxx:/proc/irq/512# grep . *
affinity_hint:00,00000001
effective_affinity:00,00000004
effective_affinity_list:2
grep: mlx5_comp0@pci:0000:65:00.1: Is a directory
node:0
smp_affinity:ff,ffffffff
smp_affinity_list:0-39
spurious:count 3
spurious:unhandled 0
spurious:last_unhandled 0 ms

I noticed your change a0c9259dc4e1 ("irq/matrix: Spread interrupts on allocation"), and it sounds like what we're hitting. Booting 4.19 shows the problem is gone. I haven't booted 4.15 yet, but I can do that to confirm the above commit is what resolves this.

Since 4.14 doesn't have the matrix allocator, this isn't a trivial backport. I was wondering a) whether you agree with my assessment, and b) whether there are any plans to resolve this in the 4.14 allocator? If not, I can attempt to backport the idea to 4.14 and spread the interrupts around on allocation, along the lines of the sketch at the end of this mail.
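To make that concrete, the rough shape of what I'd try is below. It is not the irq-matrix code from a0c9259dc4e1, just the same toy model as above with the search policy changed to pick the CPU with the most free vectors instead of the first CPU that has any:

/*
 * Same toy model, but with a "spread on allocation" policy: pick the
 * CPU with the most free vectors rather than the first CPU with any
 * free vector. Not the actual irq-matrix code, just the idea.
 */
#include <stdio.h>

#define NR_CPUS            4
#define NR_AVAIL_VECTORS   8

static int used[NR_CPUS];       /* vectors already allocated per CPU */

static int assign_vector_spread(void)
{
        int best_cpu = -1, best_free = 0;

        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                int nfree = NR_AVAIL_VECTORS - used[cpu];

                if (nfree > best_free) {
                        best_free = nfree;
                        best_cpu = cpu;
                }
        }
        if (best_cpu >= 0)
                used[best_cpu]++;
        return best_cpu;        /* -1 when every CPU is exhausted */
}

int main(void)
{
        for (int irq = 0; irq < 12; irq++)
                printf("irq %2d -> cpu %d\n", irq, assign_vector_spread());

        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                printf("cpu %d: %d/%d vectors used\n",
                       cpu, used[cpu], NR_AVAIL_VECTORS);
        return 0;
}

With this policy allocations land evenly across the mask, so cpu 0 keeps free vectors and a later affinity request targeting cpu 0 doesn't fail with -ENOSPC. Whether that maps cleanly onto the 4.14 __assign_irq_vector search loop is exactly what I'd like your opinion on.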

Thanks
Josh