[REGRESSION] mlx5: Driver remove during hot unplug is broken

From: Niklas Schnelle
Date: Fri Jun 12 2020 - 09:10:02 EST


Hello Parav, Hello Saeed,

our CI system for IBM Z Linux found a hang[0] when hot unplugging a ConnectX-4 Lx VF from a z/VM guest.
The regression is in Linus' current tree and was introduced during the merge window.
Sadly, it didn't happen every time, which sent me down the wrong path for two full git bisects.

Anyway, I've now tracked this down to the following commit, which, when reverted,
fixes the issue:

41798df9bfca ("net/mlx5: Drain wq first during PCI device removal")

Looking at the diff I'd say the likely culprit is that before
the commit the order of calls was:

mlx5_unregister_device(dev)
mlx5_drain_health_wq(dev)

But with the commit it becomes:

mlx5_drain_health_wq(dev)
mlx5_unregister_device(dev)

So, without really knowing anything about these functions, I would
guess that with the device still registered, the drained
queue does not stay empty because new entries keep being added.
Does that sound plausible to you?
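
To make the guess concrete, here is a toy user-space model in Python. It is
not driver code: ToyDevice, poll_health() and remove() are made-up names, and
the assumption that health work can only be queued while the device is still
registered is exactly the unverified guess from above.

```python
from collections import deque


class ToyDevice:
    """Toy stand-in for the device; none of this is real mlx5 code."""

    def __init__(self):
        self.registered = True
        self.health_wq = deque()

    def poll_health(self):
        # Assumption (my guess, unverified): new health work is only
        # queued while the device is still registered.
        if self.registered:
            self.health_wq.append("health_work")

    def drain_health_wq(self):
        # Empty the queue once; nothing here stops later additions.
        self.health_wq.clear()

    def unregister_device(self):
        self.registered = False


def remove(dev, drain_first):
    """Run both teardown steps with one health poll firing in between.

    Returns the number of work items left on the queue afterwards.
    """
    steps = [dev.drain_health_wq, dev.unregister_device]
    if not drain_first:
        steps.reverse()
    steps[0]()
    dev.poll_health()  # the poller fires between the two steps
    steps[1]()
    return len(dev.health_wq)


if __name__ == "__main__":
    print("unregister, then drain:", remove(ToyDevice(), drain_first=False))
    print("drain, then unregister:", remove(ToyDevice(), drain_first=True))
```

In this model the old order leaves the queue empty, while the new order leaves
one item behind. Whether the real driver behaves like this is what I am asking.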

Best regards,
Niklas Schnelle

[0] dmesg output:
[ 36.447442] mlx5_core 0000:00:00.0: poll_health:694:(pid 0): Fatal error 1 detected
[ 36.447450] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[0] 0xffffffff
[ 36.447453] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[1] 0xffffffff
[ 36.447455] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[2] 0xffffffff
[ 36.447458] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[3] 0xffffffff
[ 36.447461] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[4] 0xffffffff
[ 36.447463] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0): assert_exit_ptr 0xffffffff
[ 36.447467] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0): assert_callra 0xffffffff
[ 36.447471] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): fw_ver 65535.65535.65535
[ 36.447475] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0): hw_id 0xffffffff
[ 36.447478] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0): irisc_index 255
[ 36.447492] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0): synd 0xff: unrecognized error
[ 36.447621] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0): ext_synd 0xffff
[ 36.447624] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0): raw fw_ver 0xffffffff
[ 36.447885] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=0
[ 36.447897] zpci: 0000:00:00.0: Event 0x303 reconfigured PCI function 0x514
[ 47.099220] mlx5_core 0000:00:00.0: poll_health:709:(pid 0): device's health compromised - reached miss count
[ 47.099228] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[0] 0xffffffff
[ 47.099231] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[1] 0xffffffff
[ 47.099234] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[2] 0xffffffff
[ 47.099236] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[3] 0xffffffff
[ 47.099239] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): assert_var[4] 0xffffffff
[ 47.099241] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0): assert_exit_ptr 0xffffffff
[ 47.099245] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0): assert_callra 0xffffffff
[ 47.099249] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): fw_ver 65535.65535.65535
[ 47.099253] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0): hw_id 0xffffffff
[ 47.099256] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0): irisc_index 255
[ 47.099327] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0): synd 0xff: unrecognized error
[ 47.099329] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0): ext_synd 0xffff
[ 47.099330] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0): raw fw_ver 0xffffffff
[ 100.539106] mlx5_core 0000:00:00.0: wait_func:991:(pid 121): 2RST_QP(0x50a) timeout. Will cause a leak of a command resource
[ 100.539118] infiniband mlx5_0: destroy_qp_common:2525:(pid 121): mlx5_ib: modify QP 0x00072c to RESET failed
[ 141.499325] mlx5_core 0000:00:00.0: wait_func:991:(pid 32): QUERY_VPORT_COUNTER(0x770) timeout. Will cause a leak of a command resource
[ 161.978957] mlx5_core 0000:00:00.0: wait_func:991:(pid 121): DESTROY_QP(0x501) timeout. Will cause a leak of a command resource