Re: [PATCH 0/2] net: stmmac: allow sharing MDIO lines

From: Russell King (Oracle)
Date: Tue Aug 08 2023 - 15:26:54 EST


On Tue, Aug 08, 2023 at 04:09:11PM +0200, Bartosz Golaszewski wrote:
> On Tue, Aug 8, 2023 at 3:26 PM Russell King (Oracle)
> <linux@xxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, Aug 08, 2023 at 10:13:09AM +0200, Bartosz Golaszewski wrote:
> > > Ok so upon some further investigation, the actual culprit is in stmmac
> > > platform code - it always tries to register an MDIO bus - independent
> > > of whether there is an actual mdio child node - unless the MAC is
> > > marked explicitly as having a fixed-link.
> > >
> > > When I fixed that, MAC1's probe is correctly deferred until MAC0 has
> > > created the MDIO bus.
> > >
> > > Even so, isn't it useful to actually reference the shared MDIO bus in some way?
> > >
> > > If the schematics look something like this:
> > >
> > > --------           -------
> > > | MAC0 |--MDIO-----| PHY |
> > > -------- |     |   -------
> > >          |     |
> > > -------- |     |   -------
> > > | MAC1 |--     ----| PHY |
> > > --------           -------
> > >
> > > Then it would make sense to model it on the device tree?
> >
> > So I think what you're saying is that MAC0 and MAC1 each have an MDIO
> > bus master, and the hardware designer decided to tie both together to
> > a single set of clock and data lines, which then go to two PHYs.
>
> The schematics I have are not very clear on that, but now that you
> mention this, it's most likely the case.
>
> >
> > In that case, I would strongly advise only registering one MDIO bus,
> > and avoid registering the second one - thereby preventing any issues
> > caused by both MDIO bus masters trying to talk at the same time.
> >
>
> I sent a patch for that earlier today.
>
> > The PHYs should be populated in firmware on just one of the buses.
> >
> > You will also need to ensure that whatever registers the bus
> > makes sure that the clocks necessary for communicating on the bus
> > are under control of the MDIO bus code and not the ethernet MAC
> > code. We've run into problems in the past where this has not been
> > the case, and it means - taking your example above - that when MAC1
> > wants to talk to its PHY, if MAC0 isn't alive it can't.
>
> Good point, but it's worse than that: when MAC0 is unbound, it will
> unregister the MDIO bus and destroy all PHY devices. These are not
> refcounted, so they will literally vanish from under MAC1. I'm not
> sure how this can be dealt with.

That has been a problem in the past, where an MII bus has been
registered by a driver, and then because its probe defers, the MII
bus gets torn down.

The "simple" solution to this is... try to avoid registering the MII
bus until you're sure that the probing will not defer. It is far from
perfect, since there's still the opportunity to unbind the driver,
causing the MII bus to vanish along with the PHYs.
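
As a rough illustration of that ordering, here is a small userspace
model (not kernel code). Every step that can defer runs first and the
bus is registered last, so a deferral never tears down a bus that
already exists. All names below (fake_mac, get_deferrable_resource(),
probe_bus_first(), probe_bus_last() and so on) are hypothetical, and
EPROBE_DEFER is defined locally with the kernel's value.

```c
#include <stdbool.h>

#define EPROBE_DEFER 517	/* kernel's value, defined locally for this model */

/* Hypothetical model of one MAC driver instance. */
struct fake_mac {
	bool resource_ready;	/* e.g. a clock or reset provider has probed */
	bool bus_registered;
	int bus_registrations;	/* how often the bus was created */
	int bus_teardowns;	/* how often it was destroyed again */
};

/* Stand-in for any lookup that may return -EPROBE_DEFER. */
static int get_deferrable_resource(const struct fake_mac *mac)
{
	return mac->resource_ready ? 0 : -EPROBE_DEFER;
}

static void register_bus(struct fake_mac *mac)
{
	mac->bus_registered = true;
	mac->bus_registrations++;
}

static void unregister_bus(struct fake_mac *mac)
{
	mac->bus_registered = false;
	mac->bus_teardowns++;
}

/* Problematic order: bus first, then a step that may defer. */
static int probe_bus_first(struct fake_mac *mac)
{
	int ret;

	register_bus(mac);
	ret = get_deferrable_resource(mac);
	if (ret) {
		unregister_bus(mac);	/* PHYs on the bus vanish with it */
		return ret;
	}
	return 0;
}

/* Preferred order: everything that can defer, then the bus. */
static int probe_bus_last(struct fake_mac *mac)
{
	int ret;

	ret = get_deferrable_resource(mac);
	if (ret)
		return ret;	/* nothing registered yet, nothing to undo */

	register_bus(mac);
	return 0;
}
```

In this model a deferred probe_bus_first() creates and destroys the bus
once per attempt, while a deferred probe_bus_last() never touches it.
Neither variant does anything about the unbind case.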

I have mentioned trying to address the issue of PHY drivers being
unbound in the past, and there have been some improvements with that,
but if the phy_device vanishes while something is using it, it
certainly will not end well. phylib is not the only case of this;
there are numerous other instances. One of the recent ones that
I happened to be reminded of today is the pcs-rzn1-miic thing...
If you have a look at miic_create() and consider what would happen
if:

	if (!pdev || !platform_get_drvdata(pdev))
		return ERR_PTR(-EPROBE_DEFER);

... another thread ended up executing miic_remove() for this
platform device at this very point ...

	miic_port = kzalloc(sizeof(*miic_port), GFP_KERNEL);
	if (!miic_port)
		return ERR_PTR(-ENOMEM);

	miic = platform_get_drvdata(pdev);
	device_link_add(dev, miic->dev, DL_FLAG_AUTOREMOVE_CONSUMER);

The devm allocation for "miic" would be freed. If miic_remove()
completes before the second platform_get_drvdata() call, that call
returns NULL and we oops the kernel on the miic->dev dereference.
If it happens after that call, miic ends up a stale pointer.
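
To make the window concrete, here is a deterministic userspace model of
the check-then-use pattern (not the driver's code). The interleave
callback stands in for the other thread running remove between the
check and the second lookup. All names (fake_pdev, fake_miic,
get_drvdata(), fake_remove(), create_check_then_use()) are hypothetical,
and the model returns an error code at the point where the real code
would dereference the pointer.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the platform device and its driver data. */
struct fake_miic {
	int id;
};

struct fake_pdev {
	struct fake_miic *drvdata;
};

static struct fake_miic *get_drvdata(const struct fake_pdev *pdev)
{
	return pdev->drvdata;
}

/* What a concurrent remove would do to the device in this model. */
static void fake_remove(struct fake_pdev *pdev)
{
	pdev->drvdata = NULL;
}

/*
 * Check, then look the data up a second time. 'interleave' is called in
 * the window between the two to model another thread; pass NULL for no
 * race. Returns 0 on success, -1 if the initial check fails, and -2 if
 * the second lookup came back NULL.
 */
static int create_check_then_use(struct fake_pdev *pdev,
				 void (*interleave)(struct fake_pdev *))
{
	struct fake_miic *miic;

	if (!pdev || !get_drvdata(pdev))
		return -1;

	if (interleave)
		interleave(pdev);	/* the "other thread" runs here */

	miic = get_drvdata(pdev);	/* second lookup, after the check */
	if (!miic)
		return -2;		/* the real code would oops here */

	return 0;
}
```

Reading the pointer only once would avoid the NULL case in this model,
but not the stale pointer: that needs the lookup and the use to be
serialised against remove, or the object to be reference counted.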

It's an unlikely race, but it's still a race. Sadly, the kernel
is getting riddled with things like this. I used to point these
things out, but having been shouted down many times I've given
up raising it.

Another example is the direct rendering manager bridge code
(drm_bridge).

I suggest taking a similar approach and not caring too much about
this, for your own sanity... provided it doesn't actually cause a
problem!

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!