Re: [PATCH v2] firmware: arm_scmi: Avoid to call mbox_client_txdone on txdone_irq mode

From: Cristian Marussi
Date: Wed Mar 06 2024 - 11:08:23 EST


On Wed, Mar 06, 2024 at 09:49:16AM +0000, Cristian Marussi wrote:
> On Wed, Mar 06, 2024 at 06:13:48AM +0000, Flash Liu (劉炳傳) wrote:
> > Hi Cristian,
> >
> > Kindly ping :)
> >
>
> Hi Pin-Chuan,
>
> sorry for the delay...I have NOT forgot :D, indeed I was just testing
> yesterday with some mailbox IP of ours equipped with a TxAck IRQ and I
> would have some question for you because I've seen some anomalies while
> using this: does your solution work reliably in your setup ALSO when
> multiple SCMI transactions are attempted ? (like cpufreq issuing cmds
> while polling a sensor from some other thread)
>
> ...I'll dig deeper today in this matter, but my current suspect is that
> using the mailbox TXAck IRQ to let the controller tick the internal mailbox
> queues does not play well with SCMI, since the SCMI TX channel is really the
> SCMI "a2p" bidirectional channel and it is associated with just only one shmem
> area used to hold both the egressing CMD and to receive the incoming REPLY
> from the platform: so if you let the controller tick the queues as soon as you
> received the TXAck you are telling the mailbox subsystem to queue another message
> on the same area while you are not really done, since only the client know
> when it's done with the whole SCMI transaction and the reply has been fetched.
>
> Indeed, for these reasons we have the BUSY/FREE bit in the shmem to protect it
> from pending new requests until the previous one has completed, but when the
> waited-for reply comes in, the platform would have cleared the BUSY bit and
> let the new queued message overwrite the pending reply prematurely, and one
> message is lost...
>
> ...but as said I want to delve deeper into this, as of now just suppositions
> and maybe I am just missing something more that has to be configured
> properly...
>
> Thanks,
> Cristian
>

Hi again :D,

so articulating more on my supposition that TxAck-capable mailboxes and
SCMI do not play well together (and would not be worth either really...)

Consider the following scenario.

1. scmi: mbox_send_message(X) is called from SCMI stack to send mesg-X
on the a2p channel (a command)

2. mbox: mesg-X is
2a. queued by mbox subsystem [add_to_rbuf(X)]
2b. submitted for transmission [msg_submit(X)]
2c. prepared by SCMI clk->tx_prepare
2d. finally sent via mhu driver .send_data
2e. mesg-X is now an active_req for mbox and in-flight for SCMI

3. scmi: ANOTHER mesg-Y tx is attempted via mbox_send_message(Y)

4. mbox: mesg-Y is
4a. queued by mbox_subsys [add_to_rbuf()]
4b. NOT submitted since there is already an active_req=mesg-X pending

Any further SCMI mesg TX attempt will behave similarly: queued/not_submitted
till at some in the future someone calls the txdone routines, which in turn
calls tx_tick()...this SOMEONE can be the client, like it is now, or the
controller if it is configured to use the TxAck IRQ (as per-your-patch)...
..so lets see what happen in your TxAck enabled case.

5. TxAck IRQ received, transmission of mesg-X has been successfull
(NOTE that SCMI at this point is still waiting for a reply to mesg-X..)

5a. controller calls mbox_chan_txdone()
5b. mbox in turn calls tx_tick
5c. active_req is cleared and next queued mesg-Y is submitted
5d. mesg-Y transmission gets anyway stuck on cl->tx_prepare since
we check the SCMI shmem BUSY bit and busy-loop there till it
clears: this clearing can happen ONLY after the mesg-X reply
has come through, since it is the platform SCMI server that
clears it having delivered the reply in the shmem.

6. platform SCMI server replies to mesg-X finally:
6a. platform writes reply in shmem
6b. platform clears BUSY bit

-- note SCMI stack is still waiting for a reply at this point...
so waiting for an IRQ OR by simply spinning on that same BUSY bit
if polling mode was requested for the transaction....

...lets assume you are in IRQ mode:

7. mesg-Y sender which was spinning on BUSY bit (blocked on tx_prepare)
is immediately cleared to send and so tx_prepare can proceed further
and completely overwrite the just received mesg-X, which is now LOST

.in case you were polling I guess you will have anyway some corruption
due to the race between the polling-mesg-X-receiver retrieving the reply
and the same tx_prepare codeflow as above...

Indeed, the spec says that you should protect the channel till the reply
has been retrieved from the SCMI (even after the BUSY bit is cleared), and
in other transport we DO have some form of locks, BUT here in mailboxes
there is not since it is NOT needed IF you stick to the non-TxAck original
behaviour, since the tx_tick, as it is now, will be run by the SCMI stack
ONLY after it has waited for mesg-X and retrieved the mesg-X-reply payload
..not before.

Instead, if you enable the TxAck mode you are basically letting the controller
itself issue the tx_tick(), which means "previous TX is done, please proceed
with the next", BUT the current TX is really NOT done at all as intended
by the client (SCMI), since the reply is missing and the only entity which
can have the whole picture about when a transaction is completed (or timed-out)
is the SCMI client.

As said, I think the fundamental clash is between what the mailbox
subsystem considers a TXDONE event (and related actions) and what
instead is considered a completed transaction on the SCMI a2p channel:
i.e. CMD_sent + REPLY_retrieved.

At the end, anyway, would it be worth in any way to leverage such TxAck
capabilities (somehow) of a mailbox in the SCMI world ?

I mean, even if we make this work, what is supposed to happen better and faster
when using a TxAck instead of a TX_polling mode like it is now ?

..because the SCMI stack cannot really do anything with this information in
this case, given that there is just one single a2p_shem area for sending command
and receiving replies...it has just to wait anyway even after the
TxAck...

.maybe it is a bit more power-favourable to sleep_wait for the TxAck IRQ
instead of polling the MHU regs ?... other than this the TxAck means nothing
really in the context of the SCMI world, since you cannot safely queue anything
more till the previous exchange has fully completed...

..in other non-SCMI scenarios that I experimented with, it really makes a
difference havig the TxAck since it avoids all the internal polling/requeueing
dance in the mailbox subsystem, but in this case I think is all made useless by
the way SCMI/SMT based transport works...i.e. using a single shmem area for
the a2p channel..

..not really a short and sweet email... :P ... any feedback or further
ideas are welcome anyway...especially if you can prove that all of the
above is somehow wrong, or that there is a good reason to leverage the
TxAck capable mailboxes :D

Thanks,
Cristian