Re: [PATCH 05/14] mwifiex: re-register wiphy across reset

From: Brian Norris
Date: Tue Jun 27 2017 - 16:49:16 EST


On Thu, Jun 22, 2017 at 03:02:34PM +0200, Johannes Berg wrote:
> On Wed, 2017-06-21 at 11:27 -0700, Brian Norris wrote:

> > > Without checking the code now, it seems entirely plausible that
> > > this is
> > > holding some lock that would lock out the control path entirely,
> > > for
> > > the duration until the wiphy is actually unregistered?
> > >
> > > Actually, you can't unregister with the relevant locks held
> > > (without
> > > causing deadlocks), so perhaps it's marking the wiphy as
> > > unavailable so
> > > that all operations fail?
> >
> > One of the above two sounds along the right line. But it's something
> > I couldn't really figure out how to do quite right.
> >
> > Dumb question: how would I mark the wiphy as unavailable? Is there
> > something I can do at the cfg80211 level? Or would I really have to
> > guard all the cfg80211 entry points into mwifiex with a flag or lock?
>
> There isn't really a good way to do this. You can, of course, call
> wiphy_unregister(), but if you could do that you'd already have the
> problem solved, I think?

That's probably along the right track. There are still some things we'd
need to do properly before that though, and this is where all the
problems are so far. (Also, this is what Kalle was already objecting to;
he didn't think we should be unregistering/recreating the wiphy, but I
think he ended up softening on that a bit.)

For one, I still expect I should be removing the wireless devs before
unregistering the wiphy, no? Otherwise, there will be existing wdevs
backed by an unregistered wiphy?

And that gets to the heart of another bug: deleting interfaces (e.g.,
"iw dev foo del") races with a lot of stuff -- for example, see

mwifiex_process_sta_event() ->
EVENT_EXT_SCAN_REPORT ->
netif_running(priv->netdev)

Because mwifiex_del_virtual_intf() doesn't stop any outstanding
commands, we can be both deleting the netdev and processing scans for
it.

> I'm not really familiar enough with the context this happens in - can't
> you let all the operations that try to talk to the firmware fail
> (because the firmware is dead, or whatever) and then call
> wiphy_unregister()?

Yes, something like that, barring some of the other bugs mentioned.

> > Also, IIUC, we need to wait for all control paths to complete (or
> > cancel) before we can free up the associated resources; so just
> > marking "unavailable" isn't enough.
>
> Yeah, I suppose so. Though if you just do all the freeing after
> wiphy_unregister() it'll do that for you?

Yes, I think so. Then part of the problem is probably that some of the
current "cancel command" logic is tied up with the "free command
structures" logic. So we're freeing some stuff too early.

Anyway, those sorts of bugs aside, IIUC the full sequence for teardown
should probably be something like:

1. Stop TX queues
2. Cancel outstanding commands (let them fail or finish, etc.) -- but
DON'T free their backing resources yet
3. Remove wdevs
4. wiphy_unregister()
5. Free up resources

Current problems are at least:

* we don't do step 4 in the right place (if at all; see this patch)
* step 2 mixes in "free"ing resources too early
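
To make the ordering concrete, here is a rough user-space model of the
five steps above. It is only a sketch: every struct and function name is
made up for illustration, and none of it is real mwifiex or cfg80211
code. The assert()s just encode "this step must not run before that
one", in particular that resources are only freed after the (modeled)
wiphy_unregister().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_CMDS 4

/* Hypothetical stand-in for a firmware command node. */
struct cmd_model {
	bool pending;		/* still outstanding to the firmware */
	bool resources_live;	/* backing resources not yet freed */
};

/* Hypothetical stand-in for the adapter state. */
struct adapter_model {
	bool tx_stopped;
	bool wdevs_present;
	bool wiphy_registered;
	struct cmd_model cmds[NUM_CMDS];
};

static void model_init(struct adapter_model *a)
{
	a->tx_stopped = false;
	a->wdevs_present = true;
	a->wiphy_registered = true;
	for (size_t i = 0; i < NUM_CMDS; i++) {
		a->cmds[i].pending = true;
		a->cmds[i].resources_live = true;
	}
}

/* Step 1 */
static void stop_tx_queues(struct adapter_model *a)
{
	a->tx_stopped = true;
}

/* Step 2: cancel only; deliberately does NOT free anything. */
static void cancel_cmds(struct adapter_model *a)
{
	assert(a->tx_stopped);
	for (size_t i = 0; i < NUM_CMDS; i++)
		a->cmds[i].pending = false;
}

/* Step 3 */
static void remove_wdevs(struct adapter_model *a)
{
	for (size_t i = 0; i < NUM_CMDS; i++)
		assert(!a->cmds[i].pending);
	a->wdevs_present = false;
}

/* Step 4: models wiphy_unregister(); no wdev may still exist. */
static void unregister_wiphy_model(struct adapter_model *a)
{
	assert(!a->wdevs_present);
	a->wiphy_registered = false;
}

/* Step 5: free only once nothing can reach the control path. */
static void free_resources(struct adapter_model *a)
{
	assert(!a->wiphy_registered);
	for (size_t i = 0; i < NUM_CMDS; i++)
		a->cmds[i].resources_live = false;
}

static void teardown(struct adapter_model *a)
{
	stop_tx_queues(a);
	cancel_cmds(a);
	remove_wdevs(a);
	unregister_wiphy_model(a);
	free_resources(a);
}
```

In this model, the "step 2 frees too early" problem would correspond to
cancel_cmds() also clearing resources_live, i.e. doing step 5's work
while the wdevs and wiphy are still around.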

Brian