RE: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

From: Dexuan Cui
Date: Mon Jul 19 2021 - 16:59:38 EST


> From: Saeed Mahameed <saeed@xxxxxxxxxx>
> Sent: Monday, July 19, 2021 1:18 PM
> > > ...
> > > It turns out that adding "intremap=off" can work around the issue!
> > >
> > > The root cause is still not clear yet. I don't know why Windows is
> > > good here.
> >
> > The card is stuck in the FW, maybe Saeed knows why. I tried your
> > scenario and it worked for me.
> >
> > Thanks
>
> I don't think the FW is stuck since we see the cmd completion after
> timeout, this means that the 1st interrupt from the device got lost.
>
> "wait_func_handle_exec_timeout:1062:(pid 1416): cmd[0]:
> CREATE_EQ(0x301) recovered after timeout"
>
> the fact that this happens on 5.14 and 5.4 kernels and the issue is
> worked around via bringing the cpus online, or disabling intremap,
> means that there is something wrong with the interrupt remapping
> mechanism, maybe the interrupt is being delivered on an offline cpu ?
> is this a qemu/VM guest or a bare metal host ?

Thanks for the replies!

This is a bare metal x86-64 host with Intel CPUs. Yes, I believe the
issue is in the IOMMU Interrupt Remapping mechanism rather than in the
NIC driver. I just don't understand why bringing the CPUs online and
then offline again works around the issue. I'm trying to dump the IOMMU
IR table entries to look for any errors.
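
A complementary user-space check for the "interrupt delivered to an
offline cpu" theory is to compare each IRQ's effective affinity with
the set of online CPUs. The sketch below is only an illustration and
was not run as part of this thread. It assumes the kernel exposes
/proc/irq/<N>/effective_affinity_list and /sys/devices/system/cpu/online;
the function names are made up for the example. It reports the kernel's
view only, so a stale entry in the hardware IR table would not show up.

```python
#!/usr/bin/env python3
"""List IRQs whose effective affinity includes a CPU that is not online."""
import glob
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list string such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or trailing comma
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def irqs_on_offline_cpus(online, affinities):
    """Return {irq: offline_cpus} for IRQs that target CPUs outside `online`."""
    return {irq: cpus - online
            for irq, cpus in affinities.items()
            if cpus - online}


def read_effective_affinities(proc_irq="/proc/irq"):
    """Read effective_affinity_list for every IRQ directory under /proc/irq."""
    result = {}
    pattern = os.path.join(proc_irq, "[0-9]*", "effective_affinity_list")
    for path in glob.glob(pattern):
        try:
            irq = int(os.path.basename(os.path.dirname(path)))
            with open(path) as f:
                result[irq] = parse_cpu_list(f.read())
        except (OSError, ValueError):
            continue  # IRQ vanished or unexpected content; skip it
    return result


if __name__ == "__main__":
    with open("/sys/devices/system/cpu/online") as f:
        online = parse_cpu_list(f.read())
    suspects = irqs_on_offline_cpus(online, read_effective_affinities())
    for irq, bad in sorted(suspects.items()):
        print(f"IRQ {irq}: effective affinity includes offline CPUs {sorted(bad)}")
```

Running this right after the mlx5_core probe with maxcpus=8 and seeing
no output would suggest the kernel believes every vector points at an
online CPU, which would shift suspicion further toward the IR table
contents themselves.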

Thanks,
Dexuan