Re: [syzbot] unregister_netdevice: waiting for DEV to become free (7)

From: Guoqing Jiang
Date: Wed Nov 23 2022 - 04:49:20 EST




On 11/22/22 11:28 AM, wangyufen wrote:

在 2022/11/22 10:13, Jason Gunthorpe 写道:
On Fri, Nov 18, 2022 at 02:28:53PM +0100, Dmitry Vyukov wrote:
On Fri, 18 Nov 2022 at 12:39, syzbot
<syzbot+5e70d01ee8985ae62a3b@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

Hello,

syzbot found the following issue on:

HEAD commit:    9c8774e629a1 net: eql: Use kzalloc instead of kmalloc/memset
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=17bf6cc8f00000
kernel config: https://syzkaller.appspot.com/x/.config?x=9eb259db6b1893cf
dashboard link: https://syzkaller.appspot.com/bug?extid=5e70d01ee8985ae62a3b
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1136d592f00000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1193ae64f00000

Bisection is inconclusive: the issue happens on the oldest tested release.

bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=167c33a2f00000
final oops: https://syzkaller.appspot.com/x/report.txt?x=157c33a2f00000
console output: https://syzkaller.appspot.com/x/log.txt?x=117c33a2f00000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+5e70d01ee8985ae62a3b@xxxxxxxxxxxxxxxxxxxxxxxxx

iwpm_register_pid: Unable to send a nlmsg (client = 2)
infiniband syj1: RDMA CMA: cma_listen_on_dev, error -98
unregister_netdevice: waiting for vlan0 to become free. Usage count = 2

+RDMA maintainers

There are 4 reproducers and all contain:

r0 = socket$nl_rdma(0x10, 0x3, 0x14)
sendmsg$RDMA_NLDEV_CMD_NEWLINK(...)

Also the preceding print looks related (a bug in the error handling
path there?):

infiniband syj1: RDMA CMA: cma_listen_on_dev, error -98

I'm pretty sure it is an rxe bug

ib_device_set_netdev() will hold the netdev until the caller destroys
the ib_device

rxe calls it during rxe_register_device() because the user asked for a
stacked ib_device on top of the netdev

Presumably rxe needs to have a notifier to also self destroy the rxe
device if the underlying net device is to be destroyed?

Can someone from rxe check into this?

The following patch may fix the issue:

--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -4049,6 +4049,9 @@ int rdma_listen(struct rdma_cm_id *id, int backlog)
        return 0;
 err:
        id_priv->backlog = 0;
+       if (id_priv->cma_dev)
+               cma_release_dev(id_priv);
+
        /*
         * All the failure paths that lead here will not allow the req_handler's
         * to have run.


But it is the caller's responsibility to destroy it since commit dd37d2f59eb8.

The causes are as follows:

rdma_listen()
  rdma_bind_addr()
    cma_acquire_dev_by_src_ip()
      cma_attach_to_dev()
        _cma_attach_to_dev()
          cma_dev_get()

Thanks for the analysis.

And for the two callers of cma_listen_on_dev, looks they have
different behaviors with regard to handling failure.

1. cma_listen_on_all which calls both
            list_del_init(&to_destroy->device_item)
    and
            rdma_destroy_id(&to_destroy->id)

2. cma_add_one invokes cma_process_remove to delete to_destroy,
cma_process_remove call both list_del_init(&id_priv->listen_item)
and list_del_init(&id_priv->device_item), but it doesn't call
rdma_destroy_id(&dev_id_priv->id) which is also different with
_cma_cancel_listens.

I am wondering if this is needed.

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index cc2222b85c88..48e283d1389b 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -5231,6 +5231,7 @@ static void cma_process_remove(struct cma_device *cma_dev)
                cma_id_get(id_priv);
                mutex_unlock(&lock);

+               rdma_destroy_id(&dev_id_priv->id);
                cma_send_device_removal_put(id_priv);

                mutex_lock(&lock);

Thanks,
Guoqing