Re: [PATCH v2 Resent 6/6] i3c: master: svc: fix random hot join failure since timeout error

From: Frank Li
Date: Fri Oct 20 2023 - 15:58:43 EST


On Fri, Oct 20, 2023 at 07:03:37PM +0200, Miquel Raynal wrote:
> Hi Frank,
>
> Frank.li@xxxxxxx wrote on Fri, 20 Oct 2023 11:47:48 -0400:
>
> > On Fri, Oct 20, 2023 at 05:20:06PM +0200, Miquel Raynal wrote:
> > > Hi Frank,
> > >
> > > Frank.li@xxxxxxx wrote on Fri, 20 Oct 2023 10:47:52 -0400:
> > >
> > > > On Fri, Oct 20, 2023 at 04:35:25PM +0200, Miquel Raynal wrote:
> > > > > Hi Frank,
> > > > >
> > > > > Frank.li@xxxxxxx wrote on Fri, 20 Oct 2023 10:18:55 -0400:
> > > > >
> > > > > > On Fri, Oct 20, 2023 at 04:06:45PM +0200, Miquel Raynal wrote:
> > > > > > > Hi Frank,
> > > > > > >
> > > > > > > Frank.li@xxxxxxx wrote on Thu, 19 Oct 2023 11:39:42 -0400:
> > > > > > >
> > > > > > > > On Thu, Oct 19, 2023 at 08:44:52AM +0200, Miquel Raynal wrote:
> > > > > > > > > Hi Frank,
> > > > > > > > >
> > > > > > > > > Frank.Li@xxxxxxx wrote on Wed, 18 Oct 2023 11:59:26 -0400:
> > > > > > > > >
> > > > > > > > > > master side report:
> > > > > > > > > > silvaco-i3c-master 44330000.i3c-master: Error condition: MSTATUS 0x020090c7, MERRWARN 0x00100000
> > > > > > > > > >
> > > > > > > > > > BIT 20: TIMEOUT error
> > > > > > > > > > The module has stalled too long in a frame. This happens when:
> > > > > > > > > > - The TX FIFO or RX FIFO is not handled and the bus is stuck in the
> > > > > > > > > > middle of a message,
> > > > > > > > > > - No STOP was issued and between messages,
> > > > > > > > > > - IBI manual is used and no decision was made.
> > > > > > > > >
> > > > > > > > > I am still not convinced this should be ignored in all cases.
> > > > > > > > >
> > > > > > > > > Case 1 is a problem because the hardware failed somehow.
> > > > > > > >
> > > > > > > > > But so far, the current code takes no action to handle this case.
> > > > > > >
> > > > > > > Yes, but if you detect an issue and ignore it, it's not better than
> > > > > > > reporting it without handling it. Instead of totally ignoring this I
> > > > > > > would at least write a debug message (identical to what's below) before
> > > > > > > returning false, even though I am not convinced unconditionally
> > > > > > > returning false here is wise. If you fail a hardware sequence because
> > > > > > > you added a printk, it's a problem. Maybe you consider this line as
> > > > > > > noise, but I believe it's still an error condition. Maybe, however,
> > > > > > > this bit gets set after the whole sequence, and this is just a "bus
> > > > > > > is idle" condition. If that's the case, then you need some
> > > > > > > additional heuristics to properly ignore the bit?
> > > > > > >
> > > > > >
> > > > > >  	dev_err(master->dev,
> > > > > >  		"Error condition: MSTATUS 0x%08x, MERRWARN 0x%08x\n",
> > > > > >  		mstatus, merrwarn);
> > > > > > +
> > > > > > +	/* ignore timeout error */
> > > > > > +	if (merrwarn & SVC_I3C_MERRWARN_TIMEOUT)
> > > > > > +		return false;
> > > > > > +
> > > > > >
> > > > > > Is it okay to move the SVC_I3C_MERRWARN_TIMEOUT check after dev_err?
> > > > >
> > > > > I think you mentioned earlier that the problem was not the printk but
> > > > > the return value. So perhaps there is a way to know if the timeout
> > > > > happened after a transaction and was legitimate or not?
> > > >
> > > > The error message just annoys the user; it doesn't impact function.
> > > > But returning false lets the IBI thread keep running, which avoids a
> > > > deadlock.
> > > >
> > > > >
> > > > > In any case we should probably lower the log level for this error.
> > > >
> > > > Only SVC_I3C_MERRWARN_TIMEOUT is a warning.
> > > >
> > > > Maybe the logic below is better:
> > > >
> > > > 	if (merrwarn & SVC_I3C_MERRWARN_TIMEOUT) {
> > > > 		dev_dbg(master->dev,
> > > > 			"Error condition: MSTATUS 0x%08x, MERRWARN 0x%08x\n",
> > > > 			mstatus, merrwarn);
> > > > 		return false;
> > > > 	}
> > > >
> > > > 	dev_err(master->dev,
> > > > 		"Error condition: MSTATUS 0x%08x, MERRWARN 0x%08x\n",
> > > > 		mstatus, merrwarn);
> > > > ....
> > > >
> > >
> > > Yes, this looks better but I wonder if we should add an additional
> > > condition to just return false in this case;
> >
> > What additional condition can we check?
>
> Well, you're the one bothered with an error case which is not a real
> error. You're saying "this error is never a problem" and I am saying
> that I believe it is not a problem in your particular case, but in
> general there might be situations where it *is* a problem. So you need
> to find proper conditions to check against in order to determine
> whether this is just an info with no consequence or an error.

I checked the R** code for this TIMEOUT. It is quite simple: the bit is set
to 1 if SDA is low for over 100us, if I understand correctly. I also checked
that if I add a delay before emitting STOP, TIMEOUT gets set. (A read can
auto-emit STOP according to RDTERM, so I only saw TIMEOUT on write
transactions.)

TIMEOUT just means that the condition "I3C bus's SDA low for over 100us" has
happened since 1 was last written to TIMEOUT.
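
To make that behaviour concrete, here is a minimal user-space sketch of a
sticky, write-1-to-clear TIMEOUT bit. It is only a model of how I read the
hardware, not driver code: struct fake_ctrl and all fake_* helpers are
hypothetical names, and the 100us limit is my understanding rather than a
documented value. Bit 20 matches the MERRWARN 0x00100000 value in the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 20 of MERRWARN, i.e. 0x00100000 as seen in the log. */
#define MERRWARN_TIMEOUT	(UINT32_C(1) << 20)

/* Assumption: limit as I understand it, not a documented value. */
#define SDA_LOW_LIMIT_US	100u

/* Hypothetical stand-in for the controller's MERRWARN register. */
struct fake_ctrl {
	uint32_t merrwarn;
};

/* Hardware side: latch TIMEOUT once SDA stayed low longer than the limit. */
static void fake_sda_low_for(struct fake_ctrl *c, unsigned int us)
{
	if (us > SDA_LOW_LIMIT_US)
		c->merrwarn |= MERRWARN_TIMEOUT;
}

/* Software side: write-1-to-clear, only the written bits are cleared. */
static void fake_merrwarn_write(struct fake_ctrl *c, uint32_t val)
{
	c->merrwarn &= ~val;
}

static bool fake_timeout_pending(const struct fake_ctrl *c)
{
	return !!(c->merrwarn & MERRWARN_TIMEOUT);
}

/* Usage: the bit stays set across later short frames until cleared. */
static void fake_self_check(void)
{
	struct fake_ctrl c = { 0 };

	fake_sda_low_for(&c, 50);		/* normal frame */
	assert(!fake_timeout_pending(&c));

	fake_sda_low_for(&c, 150);		/* e.g. delay before STOP */
	assert(fake_timeout_pending(&c));

	fake_sda_low_for(&c, 10);		/* later frame, bit is sticky */
	assert(fake_timeout_pending(&c));

	fake_merrwarn_write(&c, MERRWARN_TIMEOUT);
	assert(!fake_timeout_pending(&c));
}
```

The point of the model is that a set bit only says the condition happened at
some time since the last clear; it does not say which transaction caused it.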

I think "I3C bus's SDA low for over 100us" means nothing to Linux drivers.

I think there is NO situation where it *is* a problem. Even if it were a
problem, there is NO way to resolve it on the Linux driver side. And I think
it has already happened many times silently.

Frank

>
> > > something saying "this
> > > timeout is legitimate and has no impact".
> >
> > Add a comment "this timeout is legitimate and has no impact", or print
> > that with dev_dbg?
>
> No I'm talking about the additional heuristics.
>
> Thanks,
> Miquèl