RE: [PATCH net] iavf: Do not restart Tx queues after reset task failure

From: Keller, Jacob E
Date: Thu Nov 10 2022 - 16:14:16 EST




> -----Original Message-----
> From: Leon Romanovsky <leon@xxxxxxxxxx>
> Sent: Thursday, November 10, 2022 1:07 PM
> To: Jakub Kicinski <kuba@xxxxxxxxxx>
> Cc: ivecera <ivecera@xxxxxxxxxx>; Keller, Jacob E <jacob.e.keller@xxxxxxxxx>;
> netdev@xxxxxxxxxxxxxxx; sassmann@xxxxxxxxxx; Piotrowski, Patryk
> <patryk.piotrowski@xxxxxxxxx>; SlawomirX Laba <slawomirx.laba@xxxxxxxxx>;
> Brandeburg, Jesse <jesse.brandeburg@xxxxxxxxx>; Nguyen, Anthony L
> <anthony.l.nguyen@xxxxxxxxx>; David S. Miller <davem@xxxxxxxxxxxxx>; Eric
> Dumazet <edumazet@xxxxxxxxxx>; Paolo Abeni <pabeni@xxxxxxxxxx>;
> intel-wired-lan@xxxxxxxxxxxxxxxx; open list <linux-kernel@xxxxxxxxxxxxxxx>
> Subject: Re: [PATCH net] iavf: Do not restart Tx queues after reset task failure
>
> On Thu, Nov 10, 2022 at 12:24:18PM -0800, Jakub Kicinski wrote:
> > On Thu, 10 Nov 2022 19:07:02 +0200 Leon Romanovsky wrote:
> > > > > Yes I think you're right. A ton of people check it without the
> > > > > > lock but I think that's not strictly safe. Is dev_close safe to
> > > > > call when netif_running is false? Why not just remove the check
> > > > > and always call dev_close then.
> > > >
> > > > > Checking a bit value (like netif_running()) is much cheaper than
> > > > > unconditionally taking a global lock like RTNL.
> > >
> > > > This cheap operation is racy and is performed in a
> > > > non-performance-critical path.
> >
> > To be clear - the rtnl_lock around the entire if is still racy
> > in the grand scheme of things, no? What's stopping someone from
> > bringing the device right back up after you drop the lock?
>

I think the reset flow uses netif_device_detach() to detach the device before reset. Is that enough to prevent other calls to dev_close outside the driver?

Also, perhaps we should avoid re-attaching the device if the reset fails...

> I want to believe that there is some sort of state machine that won't
> allow simple toggling of dev_close/dev_open. If it doesn't, rtnl_lock
> users should audit their code for possible toggling right after that
> lock is dropped.
>

I think the key is that dev_open and dev_close are normally triggered by netlink messages from iproute2. So if we close the device, it's possible userspace could immediately open it again. I think that isn't allowed while the device is detached, though, so we should stay closed until we re-attach, at which point dev_open can fail by noticing the VF is disabled.
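
To make that sequence concrete, here is a rough userspace model of it. This is only a sketch, not driver code: every model_* name is made up for illustration, the pthread mutex merely stands in for RTNL, and it assumes (as I suggest above) that an open attempt is refused with -ENODEV while the device is detached.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two net_device state bits discussed here. */
struct model_netdev {
	bool present;	/* device attached (cleared by a detach) */
	bool running;	/* device is up, i.e. what netif_running() reports */
};

/* Stands in for RTNL; this is a plain mutex, not the kernel lock. */
static pthread_mutex_t model_rtnl = PTHREAD_MUTEX_INITIALIZER;

/* Models a userspace "bring the link up" request arriving over netlink. */
int model_dev_open(struct model_netdev *dev)
{
	int ret = 0;

	assert(dev != NULL);
	pthread_mutex_lock(&model_rtnl);
	if (!dev->present)
		ret = -ENODEV;		/* assumed: open is refused while detached */
	else
		dev->running = true;
	pthread_mutex_unlock(&model_rtnl);
	return ret;
}

/* Models the failure path of the reset task. */
void model_reset_failed(struct model_netdev *dev)
{
	assert(dev != NULL);
	pthread_mutex_lock(&model_rtnl);
	dev->present = false;		/* stay detached: no re-attach after a failed reset */
	if (dev->running)		/* the running check is made under the lock */
		dev->running = false;	/* models the effect of dev_close() */
	pthread_mutex_unlock(&model_rtnl);
}
```

In this model, once model_reset_failed() has run, a later model_dev_open() cannot bring the device back up, because it stays detached even after the lock is dropped.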


> Anyway, this discussion reminds me of our devl_lock debate, where we had
> completely opposite views on whether the rtnl_lock model is the right one.
> https://lore.kernel.org/netdev/20211101073259.33406da3@kicinski-fedora-PC1C0HJN/
>
> Let's not start arguing again, we had enough back then. :)
>
> Thanks