Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken

From: Saeed Mahameed
Date: Fri Jun 12 2020 - 18:02:05 EST


On Fri, 2020-06-12 at 15:09 +0200, Niklas Schnelle wrote:
> Hello Parav, Hello Saeed,
>
> our CI system for IBM Z Linux found a hang[0] when hot unplugging a
> ConnectX-4 Lx VF from a z/VM guest
> in Linus' current tree and added during the merge window.
> Sadly it didn't happen all the time which sent me on the wrong path
> for two full git bisects.
>
> Anyway, I've now tracked this down to the following commit which when
> reverted
> fixes the issue:
>
> 41798df9bfca ("net/mlx5: Drain wq first during PCI device removal")
>
> Looking at the diff I'd say the likely culprit is that before
> the commit the order of calls was:
>
> mlx5_unregister_device(dev)
> mlx5_drain_health_wq(dev)
>
> But with the commit it becomes
>
> mlx5_drain_health_wq(dev)
> mlx5_unregister_device(dev)
>
> So without really knowing anything about these functions I would
> guess that with the device still registered the drained
> queue does not remain empty as new entries are added.
> Does that sound plausible to you?
>

I don't think it is related, maybe this is similar to some issues
addressed lately by Shay's patches:

https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-2-saeedm@xxxxxxxxxxxx/
https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-3-saeedm@xxxxxxxxxxxx/

net/mlx5: drain health workqueue in case of driver load error
net/mlx5: Fix fatal error handling during device load

> Best regards,
> Niklas Schnelle
>
> [0] dmesg output:
> [ 36.447442] mlx5_core 0000:00:00.0: poll_health:694:(pid 0): Fatal
> error 1 detected
> [ 36.447450] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[0] 0xffffffff
> [ 36.447453] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[1] 0xffffffff
> [ 36.447455] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[2] 0xffffffff
> [ 36.447458] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[3] 0xffffffff
> [ 36.447461] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[4] 0xffffffff
> [ 36.447463] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0):
> assert_exit_ptr 0xffffffff
> [ 36.447467] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0):
> assert_callra 0xffffffff
> [ 36.447471] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0):
> fw_ver 65535.65535.65535
> [ 36.447475] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0):
> hw_id 0xffffffff
> [ 36.447478] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0):
> irisc_index 255
> [ 36.447492] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0):
> synd 0xff: unrecognized error
> [ 36.447621] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0):
> ext_synd 0xffff
> [ 36.447624] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0):
> raw fw_ver 0xffffffff
> [ 36.447885] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B,
> anc=0, erc=0, rsid=0
> [ 36.447897] zpci: 0000:00:00.0: Event 0x303 reconfigured PCI
> function 0x514
> [ 47.099220] mlx5_core 0000:00:00.0: poll_health:709:(pid 0):
> device's health compromised - reached miss count
> [ 47.099228] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[0] 0xffffffff
> [ 47.099231] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[1] 0xffffffff
> [ 47.099234] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[2] 0xffffffff
> [ 47.099236] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[3] 0xffffffff
> [ 47.099239] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[4] 0xffffffff
> [ 47.099241] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0):
> assert_exit_ptr 0xffffffff
> [ 47.099245] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0):
> assert_callra 0xffffffff
> [ 47.099249] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0):
> fw_ver 65535.65535.65535
> [ 47.099253] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0):
> hw_id 0xffffffff
> [ 47.099256] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0):
> irisc_index 255
> [ 47.099327] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0):
> synd 0xff: unrecognized error
> [ 47.099329] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0):
> ext_synd 0xffff
> [ 47.099330] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0):
> raw fw_ver 0xffffffff
> [ 100.539106] mlx5_core 0000:00:00.0: wait_func:991:(pid 121):
> 2RST_QP(0x50a) timeout. Will cause a leak of a command resource
> [ 100.539118] infiniband mlx5_0: destroy_qp_common:2525:(pid 121):
> mlx5_ib: modify QP 0x00072c to RESET failed
> [ 141.499325] mlx5_core 0000:00:00.0: wait_func:991:(pid 32):
> QUERY_VPORT_COUNTER(0x770) timeout. Will cause a leak of a command
> resource
> [ 161.978957] mlx5_core 0000:00:00.0: wait_func:991:(pid 121):
> DESTROY_QP(0x501) timeout. Will cause a leak of a command resource

Shay's patches also came to avoid such command timeouts.