Re: [net-next] net: dsa: qca8k: only change the MIB_EN bit in MODULE_EN register

From: Vladimir Oltean
Date: Mon Nov 08 2021 - 16:59:34 EST


On Mon, Nov 08, 2021 at 10:56:51PM +0100, Robert Marko wrote:
> On Mon, Nov 8, 2021 at 10:46 PM Vladimir Oltean <olteanv@xxxxxxxxx> wrote:
> >
> > On Mon, Nov 08, 2021 at 10:39:27PM +0100, Robert Marko wrote:
> > > On Mon, Nov 8, 2021 at 10:18 PM Vladimir Oltean <olteanv@xxxxxxxxx> wrote:
> > > >
> > > > On Mon, Nov 08, 2021 at 10:10:19PM +0100, Robert Marko wrote:
> > > > > On Mon, Nov 8, 2021 at 9:21 PM Vladimir Oltean <olteanv@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Timed out waiting for ACK/NACK from John.
> > > > > >
> > > > > > On Thu, Nov 04, 2021 at 01:49:27PM +0100, Robert Marko wrote:
> > > > > > > From: Gabor Juhos <j4g8y7@xxxxxxxxx>
> > > > > > >
> > > > > > > The MIB module needs to be enabled in the MODULE_EN register in
> > > > > > > order to make it to counting. This is done in the qca8k_mib_init()
> > > > > > > function. However instead of only changing the MIB module enable
> > > > > > > bit, the function writes the whole register. As a side effect other
> > > > > > > internal modules gets disabled.
> > > > > >
> > > > > > Please be more specific.
> > > > > > The MODULE_EN register contains these other bits:
> > > > > > BIT(0): MIB_EN
> > > > > > BIT(1): ACL_EN (ACL module enable)
> > > > > > BIT(2): L3_EN (Layer 3 offload enable)
> > > > > > BIT(10): SPECIAL_DIP_EN (Enable special DIP (224.0.0.x or ff02::1) broadcast
> > > > > > 0 = Use multicast DP
> > > > > > 1 = Use broadcast DP)
> > > > > >
> > > > > > >
> > > > > > > Fix up the code to only change the MIB module specific bit.
> > > > > >
> > > > > > Clearing which one of the above bits bothers you? The driver for the
> > > > > > qca8k switch supports neither layer 3 offloading nor ACLs, and I don't
> > > > > > really know what this special DIP packet/header is).
> > > > > >
> > > > > > Generally the assumption for OF-based drivers is that one should not
> > > > > > rely on any configuration done by prior boot stages, so please explain
> > > > > > what should have worked but doesn't.
> > > > >
> > > > > Hi,
> > > > > I think that the commit message wasn't clear enough and that's my fault for not
> > > > > fixing it up before sending.
> > > >
> > > > Yes, it is not. If things turn out to need changing, you should resend
> > > > with an updated commit message.
> > > >
> > > > > MODULE_EN register has 3 more bits that aren't documented in the QCA8337
> > > > > datasheet but only in the IPQ4019 one but they are there.
> > > > > Those are:
> > > > > BIT(31) S17C_INT (This one is IPQ4019 specific)
> > > > > BIT(9) LOOKUP_ERR_RST_EN
> > > > > BIT(10) QM_ERR_RST_EN
> > > >
> > > > Are you sure that BIT(10) is QM_ERR_RST_EN on IPQ4019? Because in the
> > > > QCA8334 document I'm looking at, it is SPECIAL_DIP_EN.
> > >
> > > Sorry, QM_ERR_RST_EN is BIT(8) and it as well as LOOKUP_ERR_RST_EN should
> > > be exactly the same on QCA833x switches as well as IPQ4019 uses a
> > > variant of QCA8337N.
> > > >
> > > > > Lookup and QM bits as well as the DIP default to 1 while the INT bit is 0.
> > > > >
> > > > > Clearing the QM and Lookup bits is what is bothering me, why should we clear HW
> > > > > default bits without mentioning that they are being cleared and for what reason?
> > > >
> > > > To be fair, BIT(9) is marked as RESERVED and documented as being set to 1,
> > > > so writing a zero is probably not very smart.
> > > >
> > > > > We aren't depending on the bootloader or whatever configuring the switch, we are
> > > > > even invoking the HW reset before doing anything to make sure that the
> > > > > whole networking
> > > > > subsystem in IPQ4019 is back to HW defaults to get rid of various
> > > > > bootloader hackery.
> > > > >
> > > > > Gabor found this while working on IPQ4019 support and to him and to me it looks
> > > > > like a bug.
> > > >
> > > > A bug with what impact? I don't have a description of those bits that
> > > > get unset. What do they do, what doesn't work?
> > >
> > > LOOKUP_ERR_RST_EN:
> > > 1b1:Enableautomatic software reset by hardware due to
> > > lookup error.
> > >
> > > QM_ERR_RST_EN:
> > > 1b1:enableautomatic software reset by hardware due to qm
> > > error.
> > >
> > > So clearing these 2 disables the built-in error recovery essentially.
> > >
> > > To me clearing the bits even if they are not breaking something now
> > > should at least have a comment in the code that indicates that it's intentional
> > > for some reason.
> > > I wish John would explain the logic behind this.
> >
> > That sounds... aggressive. Have you or Gabor exercised this error path?
> > What is supposed to happen? Is software prepared for the hardware to
> > automatically reset?
>
> I am not trying to be aggressive, but to me, this is either a bug or they are
> intentionally cleaned but it's not documented.
> Have tried triggering the QM error, but couldn't hit it even when
> doing crazy stuff.
> It should be nearly impossible to hit it, but it's there to prevent
> the switch from just locking up
> under extreme conditions (At least that is how I understand it).
>
> I don't think the driver currently even monitors the QM registers at all.
> I can understand clearing these bits intentionally, but it's gotta be
> documented otherwise
> somebody else is gonna think is a bug/mistake/whatever in the code.

Oh no no, I'm not saying that you're aggressive, but the hardware
behavior of automatically performing a software reset.

The driver keeps state. If the switch just resets by itself, what do you
think will continue to work fine afterwards? The code path needs testing.
I am not convinced that a desynchronized software state is any better
than a lockup.