Re: [PATCH] net: phy: Don't disable irqs on shutdown if WoL is enabled

From: Ioana Ciornei
Date: Wed Aug 09 2023 - 10:22:07 EST


On Tue, Aug 08, 2023 at 11:56:56PM +0100, Russell King (Oracle) wrote:
> On Tue, Aug 08, 2023 at 02:59:41PM -0700, Florian Fainelli wrote:
> > On 8/8/23 14:53, Jakub Kicinski wrote:
> > > On Fri, 4 Aug 2023 09:17:57 +0200 Uwe Kleine-König wrote:
> > > > Most PHYs signal WoL using an interrupt. So disabling interrupts breaks
> > > > WoL at least on PHYs covered by the marvell driver. So skip disabling
> > > > irqs on shutdown if WoL is enabled.
> > > >
> > > > While at it also explain the motivation that irqs are disabled at all.
> > > >
> > > > Fixes: e2f016cf7751 ("net: phy: add a shutdown procedure")
> > > > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@xxxxxxxxxxxxxx>
> > >
> > > What do we do with this one? It sounded like Russell was leaning
> > > towards a revert?
> >
> > Yes, though I believe this will create a different kind of regression for
> > what Iona was addressing initially. Then it becomes a choice of which
> > regression do we consider to be the worst to handle until something better
> > comes up.
> >
> > Russell what are your thoughts?
>
> In this situation where a fix for a problem is provided which then
> causes a regression by fixing that problem, I've seen Linus T state
> that it means the fix was incorrect. That seems entirely sensible.
>
> We are, of course, in the situation where reverting the commit
> restores the old behaviour and thus fixes a regression, but causes
> a regression for another user.
>
> If it is possible to quickly come up with a fix that avoids any
> regression to either use case, then that is obviously preferable.
> However, if that's not possible, then it seems going back to the
> original situation (i.o.w. reverting) is sensible.
>
> Now, the fact is that many PHYs do use their interrupts to signal
> that a wake-up happened, and disabling the IRQ from the PHY will
> prevent WoL from working. Other PHYs have a separate pin for this.
> Two recent examples are AR8035, which only has a single interrupt
> pin which covers all interrupts from that PHY, and AR8031 or AR8033
> which have a separate WOL_INT pin which might be used - or the
> main interrupt pin.
>
> If we hibernate the system, then people how have configured WoL
> are going to expect it to work - but disabling the ability for
> the PHY to raise an interrupt will prevent it.
>
> So, clearly always disabling PHY interrupts can have a detrimental
> effect on the ability to wake a system up using WoL - where the PHY
> interrupt is used to signal WoL to the rest of the system.
>
> Now, if waking the system up from hibernation using WoL involves
> the PHY asserting its interrupt pin, then the system must be
> capable of dealing with the PHY asserting its interrupt while the
> system is booting. Remember that the way Linux hibernation works,
> that boot is just the same as a regular boot right up through the
> normal kernel initialisation. It is only towards the end that the
> kernel detects the signature in swap space, and then does the
> funky stuff to resume from the saved data.
>
> So, during that boot, the system has to cope with that interrupt
> having been asserted by the PHY hardware. Either system firmware
> has to recognise that was the wake-up event and deal with it (e.g.
> disabling the interrupt source) before passing control to the
> kernel, or the kernel has to be able to cope with that interrupt
> being stuck at active state until the PHY driver can deal with it.
>
> Obviously, if WoL is not enabled or supported, then disabling the
> PHY interrupt should be harmless - but that will have the effect
> of masking any issues that a platform may have until PHY based
> WoL has been enabled.
>
> Also, don't forget that we have this kexec thing - and the
> .shutdown methods will be called just before handing control to
> the new kernel.
>
> Uwe's patch solves the problem that he's experiencing - because
> it makes the interrupt disabling dependent on the WoL configuration.
>
> However, Ioana's problem would still remain - and enabling WoL on
> that platform will make it reappear - and thus it still needs to be
> properly fixed.
>
> If that problem is properly fixed, then we don't need to disable
> PHY interrupts, which means a revert would be the right approach.

Yes, my initial problem would still remain if WoL is enabled on any of
the PHYs that have a shared IRQ line.

The problem is that I find this combination - shared IRQ and WoL enabled
- impossible to make it work. Maybe we could look into denying WoL from
being enabled on PHYs which have a shared interrupt line.

>
> Honestly, I don't know what would be best - and I don't believe we've
> heard from Ioana about the problem that was trying to be addressed
> (e.g. exactly when it happened and why.)
>
> If I had to guess:
> - the PHY in question may be sharing an interrupt with another device
> - when that other device probes and claims its interrupt, an interrupt
> storm ensues
> - the interrupt layer disables the interrupt input, rendering both the
> PHY and other device unusable.
>

That's a perfect summary of the problem that I was trying to fix.

The board in question is a LS1021ATSN which has two AR8031 PHYs that
share an interrupt line. In case only one of the PHYs is probed and
there are pending interrupts on the PHY#2 an IRQ storm will happen
since there is no entity to clear the interrupt from PHY#2's registers.
PHY#1's driver will get stuck in .handle_interrupt() indefinitely.

> I think I've covered all the possibilities, all the issues, outcomes,
> and the politics as far as Linus T would state. I'm also quite sure
> that there will be no way to satisfy everyone!
>
> Bearing in mind that it is holiday season, and we're at -rc5, I
> think we should give Ioana a bit more time to respond before we
> make a decision. Maybe a little over a week? If we don't hear anything,
> then I think following our established policy and reverting would be
> the correct way forward.
>