Re: [PATCH] net: phy: Don't disable irqs on shutdown if WoL is enabled

From: Russell King (Oracle)
Date: Tue Aug 08 2023 - 18:57:15 EST


On Tue, Aug 08, 2023 at 02:59:41PM -0700, Florian Fainelli wrote:
> On 8/8/23 14:53, Jakub Kicinski wrote:
> > On Fri, 4 Aug 2023 09:17:57 +0200 Uwe Kleine-König wrote:
> > > Most PHYs signal WoL using an interrupt. So disabling interrupts breaks
> > > WoL at least on PHYs covered by the marvell driver. So skip disabling
> > > irqs on shutdown if WoL is enabled.
> > >
> > > While at it also explain the motivation that irqs are disabled at all.
> > >
> > > Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure")
> > > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@xxxxxxxxxxxxxx>
> >
> > What do we do with this one? It sounded like Russell was leaning
> > towards a revert?
>
> Yes, though I believe this will create a different kind of regression for
> what Iona was addressing initially. Then it becomes a choice of which
> regression do we consider to be the worst to handle until something better
> comes up.
>
> Russell what are your thoughts?

In this situation where a fix for a problem is provided which then
causes a regression by fixing that problem, I've seen Linus T state
that it means the fix was incorrect. That seems entirely sensible.

We are, of course, in the situation where reverting the commit
restores the old behaviour and thus fixes a regression, but causes
a regression for another user.

If it is possible to quickly come up with a fix that avoids any
regression to either use case, then that is obviously preferable.
However, if that's not possible, then it seems going back to the
original situation (i.o.w. reverting) is sensible.

Now, the fact is that many PHYs do use their interrupts to signal
that a wake-up happened, and disabling the IRQ from the PHY will
prevent WoL from working. Other PHYs have a separate pin for this.
Two recent examples are AR8035, which only has a single interrupt
pin which covers all interrupts from that PHY, and AR8031 or AR8033
which have a separate WOL_INT pin which might be used - or the
main interrupt pin.

If we hibernate the system, then people how have configured WoL
are going to expect it to work - but disabling the ability for
the PHY to raise an interrupt will prevent it.

So, clearly always disabling PHY interrupts can have a detrimental
effect on the ability to wake a system up using WoL - where the PHY
interrupt is used to signal WoL to the rest of the system.

Now, if waking the system up from hibernation using WoL involves
the PHY asserting its interrupt pin, then the system must be
capable of dealing with the PHY asserting its interrupt while the
system is booting. Remember that the way Linux hibernation works,
that boot is just the same as a regular boot right up through the
normal kernel initialisation. It is only towards the end that the
kernel detects the signature in swap space, and then does the
funky stuff to resume from the saved data.

So, during that boot, the system has to cope with that interrupt
having been asserted by the PHY hardware. Either system firmware
has to recognise that was the wake-up event and deal with it (e.g.
disabling the interrupt source) before passing control to the
kernel, or the kernel has to be able to cope with that interrupt
being stuck at active state until the PHY driver can deal with it.

Obviously, if WoL is not enabled or supported, then disabling the
PHY interrupt should be harmless - but that will have the effect
of masking any issues that a platform may have until PHY based
WoL has been enabled.

Also, don't forget that we have this kexec thing - and the
.shutdown methods will be called just before handing control to
the new kernel.

Uwe's patch solves the problem that he's experiencing - because
it makes the interrupt disabling dependent on the WoL configuration.

However, Ioana's problem would still remain - and enabling WoL on
that platform will make it reappear - and thus it still needs to be
properly fixed.

If that problem is properly fixed, then we don't need to disable
PHY interrupts, which means a revert would be the right approach.

Honestly, I don't know what would be best - and I don't believe we've
heard from Ioana about the problem that was trying to be addressed
(e.g. exactly when it happened and why.)

If I had to guess:
- the PHY in question may be sharing an interrupt with another device
- when that other device probes and claims its interrupt, an interrupt
storm ensues
- the interrupt layer disables the interrupt input, rendering both the
PHY and other device unusable.

I think I've covered all the possibilities, all the issues, outcomes,
and the politics as far as Linus T would state. I'm also quite sure
that there will be no way to satisfy everyone!

Bearing in mind that it is holiday season, and we're at -rc5, I
think we should give Ioana a bit more time to respond before we
make a decision. Maybe a little over a week? If we don't hear anything,
then I think following our established policy and reverting would be
the correct way forward.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!