Re: [RFC PATCH 01/10] firmware: arm_scmi: Reset properly xfer SCMI status

From: Cristian Marussi
Date: Tue Jun 08 2021 - 06:10:59 EST


Hi Sudeep,

On Mon, Jun 07, 2021 at 07:27:54PM +0100, Sudeep Holla wrote:
> On Mon, Jun 07, 2021 at 07:01:37PM +0100, Cristian Marussi wrote:
> > On Mon, Jun 07, 2021 at 06:38:09PM +0100, Sudeep Holla wrote:
> > > On Sun, Jun 06, 2021 at 11:12:23PM +0100, Cristian Marussi wrote:
> > > > When an SCMI command transfer fails due to some protocol issue an SCMI
> > > > error code is reported inside the SCMI message payload itself and it is
> > > > then retrieved and transcribed by the specific transport layer into the
> > > > xfer.hdr.status field by transport specific .fetch_response().
> > > >
> > > > The core SCMI transport layer never explicitly reset xfer.hdr.status,
> > > > so when an xfer is reused, if a transport misbehaved in handling such
> > > > status field, we risk to see an invalid ghost error code.
> > > >
> > > > Reset xfer.hdr.status to SCMI_SUCCESS right before each transfer is
> > > > started.
> > > >
> > >
> > > Any particular reason why it can't be part of xfer_get_init which has other
> > > initialisations ? If none, please move it there.
> > >
> >
> > Well it was there initially then I moved it here.
> >
> > The reason is mostly the same as the reason for the other patch in this
> > series that adds a reinit_completion() in this same point: the core does
> > not forbid to reuse an xfer multiple times, once obtained with xfer_get()
> > or xfer_get_init(), and indeed some protocols do such a thing: they
> > implements such do_xfer looping and bails out on error.
> >
>
> Makes sense. But it is okay to retain xfer->transfer_id for every transfer
> in such a loop ?
>
No you are right and indeed I saw that anomaly, but I have not addressed
it since, even if wrong, it is harmless and transfer_id is really used
only for debugging/profiling, while the missing reinit_completion is
potentially broken.

> > In the way that it is implemented now in protocols poses no problem
> > indeed because the do_xfer loop bails out on error and the xfer is put,
> > but as soon as some protocol is implemented that violates this common
> > practice and it just keeps on reuse an xfer after an error fo other
> > do_xfers() this breaks...so it seemed more defensive to just reinit the
> > completion and the status before each send.
>
> Fair enough. But they use it to send same message I guess, may be if it
> gave error or something ? I would like to really know such a sequence
> instead of assisting that 😉.
>

So the current real 'looping do_xfer' behavior is safe and so this missing
reinit is only potentially broken in the future, and we cannot really
know now in advance about some future protocol needs, but it seems as of now
wrong that you'll want to keep going on and reuse an xfer for the same command
after an error in your loop.

On the other side we allow such behaviour, so I thought was good to
provide a safe net if it is misused.

But, beside this patches, that, as said, are more defensive that strictly
needed as of now, I think now it's worth mentioning that this same 'issue'
affects also, as an example, the new mechanism I introduced later in this
same series to always use monotonically increasing sequence number for
outgoing messages.

In that case I stick to the current behavior and I assign such monotonically
increasing sequence numbers to message during xfer_get, but the potential
issue is the same: if a do_xfer loop is used you end up reusing the same
seq_num for multiple do_xfers (so defeating really the mechanism itself
that aims not to reuse immediately the most recently used seq_num).

In that case I did this to keep it simple and to avoid placing more burden
on tx path by picking and assigning a seq_num upon each transfer...but, again,
also this behavior of picking a seq_num only at xfer_get is NOT really broken
as of now even for do_xfer loops since we bail out on error and you won't
really reuse that xfer.

It's just that in this seq_num selection case seems to add a lot of burden
and complexity if moved to the do_xfer phase, while status/reinit seemed
to me cheaper to move it in the do_xfer so I tried to play defensive.

At the end, in general I would say that all of these ops (status/reinit/
seq_nums/transfer_id) DO really belong logically to the do_xfer phase more than
to the xfer_get/xfer_get_init, but in reality we can cope with having them
@xfer_get/get_init and this keeps things simple and reduce burden, especially
in the monotonic seq_nums case: so I am not so sure anymore if it is fine to
move reinit/status to the do_xfer, as proposed here, while keeping seq_nums
(for good reasons) to the xfer_get phase, because we'd use 2 different strategies
to address similar issues.

I would say: just keep reinit and status in the xfer_get phase instead and
maybe warn somehow if a failed xfer is detected being reused. (but this
would anyway need a check in every tx transaction to see if status != SUCCESS
so is it worth ?)

Lot of overthinking for a one-liner :D ... sorry

Thanks,
Cristian


> --
> Regards,
> Sudeep