Re: [PATCH] fsi: occ: Improve response status checking

From: Joel Stanley
Date: Mon Feb 07 2022 - 05:12:12 EST


On Mon, 31 Jan 2022 at 15:29, Eddie James <eajames@xxxxxxxxxxxxx> wrote:
>
>
> On 1/30/22 23:56, Joel Stanley wrote:
> > On Mon, 10 Jan 2022 at 15:58, Eddie James <eajames@xxxxxxxxxxxxx> wrote:
> >> If the driver sequence number coincidentally equals the previous
> >> command response sequence number, the driver may proceed with
> >> fetching the entire buffer before the OCC has processed the current
> >> command. To be sure the correct response is obtained, check the
> >> command type and also retry if any of the response parameters have
> >> changed when the rest of the buffer is fetched.
> >>
> >> Signed-off-by: Eddie James <eajames@xxxxxxxxxxxxx>
> >> ---
> >> drivers/fsi/fsi-occ.c | 63 ++++++++++++++++++++++++++++++-------------
> >> 1 file changed, 44 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/drivers/fsi/fsi-occ.c b/drivers/fsi/fsi-occ.c
> >> index 7eaab1be0aa4..67569282dd69 100644
> >> --- a/drivers/fsi/fsi-occ.c
> >> +++ b/drivers/fsi/fsi-occ.c
> >> @@ -451,6 +451,15 @@ static int occ_trigger_attn(struct occ *occ)
> >> return rc;
> >> }
> >>
> >> +static void fsi_occ_print_timeout(struct occ *occ, struct occ_response *resp,
> >> + u8 seq_no, u8 cmd_type)
> >> +{
> >> + dev_err(occ->dev,
> >> + "resp timeout status=%02x seq=%d cmd=%d, our seq=%d cmd=%d\n",
> >> + resp->return_status, resp->seq_no, resp->cmd_type, seq_no,
> >> + cmd_type);
> >> +}
> >> +
> >> int fsi_occ_submit(struct device *dev, const void *request, size_t req_len,
> >> void *response, size_t *resp_len)
> >> {
> >> @@ -461,12 +470,14 @@ int fsi_occ_submit(struct device *dev, const void *request, size_t req_len,
> >> struct occ_response *resp = response;
> >> size_t user_resp_len = *resp_len;
> >> u8 seq_no;
> >> + u8 cmd_type;
> >> u16 checksum = 0;
> >> u16 resp_data_length;
> >> const u8 *byte_request = (const u8 *)request;
> >> - unsigned long start;
> >> + unsigned long end;
> >> int rc;
> >> size_t i;
> >> + bool retried = false;
> >>
> >> *resp_len = 0;
> >>
> >> @@ -478,6 +489,8 @@ int fsi_occ_submit(struct device *dev, const void *request, size_t req_len,
> >> return -EINVAL;
> >> }
> >>
> >> + cmd_type = byte_request[1];
> >> +
> >> /* Checksum the request, ignoring first byte (sequence number). */
> >> for (i = 1; i < req_len - 2; ++i)
> >> checksum += byte_request[i];
> >> @@ -509,30 +522,30 @@ int fsi_occ_submit(struct device *dev, const void *request, size_t req_len,
> >> if (rc)
> >> goto done;
> >>
> >> - /* Read occ response header */
> >> - start = jiffies;
> >> +retry:
> >> + end = jiffies + timeout;
> >> do {
> >> + /* Read occ response header */
> >> rc = occ_getsram(occ, 0, resp, 8);
> >> if (rc)
> >> goto done;
> >>
> >> if (resp->return_status == OCC_RESP_CMD_IN_PRG ||
> >> resp->return_status == OCC_RESP_CRIT_INIT ||
> >> - resp->seq_no != seq_no) {
> >> - rc = -ETIMEDOUT;
> >> -
> >> - if (time_after(jiffies, start + timeout)) {
> >> - dev_err(occ->dev, "resp timeout status=%02x "
> >> - "resp seq_no=%d our seq_no=%d\n",
> >> - resp->return_status, resp->seq_no,
> >> - seq_no);
> >> + resp->seq_no != seq_no || resp->cmd_type != cmd_type) {
> > You're testing for two different types of conditions. The first is
> > when the SBE is busy doing something else:
> >
> > if (resp->return_status == OCC_RESP_CMD_IN_PRG ||
> > resp->return_status == OCC_RESP_CRIT_INIT ||
> >
> > And the others are when the message is not for the current user:
> >
> > resp->seq_no != seq_no || resp->cmd_type != cmd_type) {
> >
> > Should we be separating them out? It makes sense that the first means
> > we should keep trying. For the second case should we bail straight
> > away, instead of waiting for the timeout?
>
>
> They're really the same thing actually. If the sequence number or
> command type is incorrect, it means the OCC is processing a different
> command, and we need to wait for it to get to our command.

And we sometimes see one but not the other (ie, the return_status
doesn't cover all cases)?

>
>
> >
> > How often do we see one vs the other?
> >
> >> + if (time_after(jiffies, end)) {
> >> + fsi_occ_print_timeout(occ, resp, seq_no,
> >> + cmd_type);
> >> + rc = -ETIMEDOUT;
> >> goto done;
> >> }
> >>
> >> set_current_state(TASK_UNINTERRUPTIBLE);
> >> schedule_timeout(wait_time);
> >> + } else {
> >> + break;
> >> }
> >> - } while (rc);
> >> + } while (true);
> > Use while (true) instead of do { } while (true) to make it clearer
> > what's going on. Or refactor it to put the time_after in the while(),
> > as this is what the loop is waiting on.
>
>
> OK, I went in circles (pun intended) working on this loop, but I'll try
> and make it look better.
>
>
> >
> >> /* Extract size of response data */
> >> resp_data_length = get_unaligned_be16(&resp->data_length);
> >> @@ -543,17 +556,29 @@ int fsi_occ_submit(struct device *dev, const void *request, size_t req_len,
> >> goto done;
> >> }
> >>
> >> - dev_dbg(dev, "resp_status=%02x resp_data_len=%d\n",
> >> - resp->return_status, resp_data_length);
> >> -
> >> - /* Grab the rest */
> >> + /* Now get the entire response; get header again in case it changed */
> >> if (resp_data_length > 1) {
> >> - /* already got 3 bytes resp, also need 2 bytes checksum */
> >> - rc = occ_getsram(occ, 8, &resp->data[3], resp_data_length - 1);
> >> + rc = occ_getsram(occ, 0, resp, resp_data_length + 7);
> >> if (rc)
> >> goto done;
> >> +
> >> + if (resp->return_status == OCC_RESP_CMD_IN_PRG ||
> >> + resp->return_status == OCC_RESP_CRIT_INIT ||
> >> + resp->seq_no != seq_no || resp->cmd_type != cmd_type) {
> >> + if (!retried) {
> >> + retried = true;
> >> + goto retry;
> > Not sure about this.
> >
> > How often did this situation come up?
> >
> > Did you consider instead returning an error here?
>
>
> Well I can't say it's frequent, but hitting this condition was what
> drove making this change in the first place. It needs to be handled.

I was concerned about the pattern of using goto back up the function.

Would it make more sense the have the caller retry, instead of adding
retries in the sbefifo driver?

>
> Previously if this occurrred, we got a checksum error, so it effectively
> already returned an error.
>
>
> >
> >> + }
> >> +
> >> + fsi_occ_print_timeout(occ, resp, seq_no, cmd_type);
> >> + rc = -ETIMEDOUT;
> >> + goto done;
> >> + }
> >> }
> >>
> >> + dev_dbg(dev, "resp_status=%02x resp_data_len=%d\n",
> >> + resp->return_status, resp_data_length);
> >> +
> >> occ->client_response_size = resp_data_length + 7;
> >> rc = occ_verify_checksum(occ, resp, resp_data_length);
> >>
> >> @@ -598,7 +623,7 @@ static int occ_probe(struct platform_device *pdev)
> >> occ->version = (uintptr_t)of_device_get_match_data(dev);
> >> occ->dev = dev;
> >> occ->sbefifo = dev->parent;
> >> - occ->sequence_number = 1;
> >> + occ->sequence_number = (u8)((jiffies % 0xff) + 1);
> > This is interesting. You didn't mention this in the commit message;
> > you're trying to get a random number for the sequence number?
>
>
> Yea, this reduces the chances of hitting that retry above. If it's
> always 1, then every time the driver is bound it tries the first command
> with the same sequence number. This is a problem when FSI scanning with
> the host already running, as the driver gets unbound/rebound several
> times in a row, and we easily hit the checksum problem, since we proceed
> to get the full response even though it's not for the latest command,
> and then the buffer is updated at the same time. So using a non-zero
> random number is very helpful.

Makes sense. Perhaps do something like this instead of hand rolling it?

get_random_bytes(occ->sequence_number, 1);

If you could add some of your explanations to the commit message, I'd
like to get this fix merged soon.

Cheers,

Joel



>
>
> Thanks,
>
> Eddie
>
>
> >
> >> mutex_init(&occ->occ_lock);
> >>
> >> if (dev->of_node) {
> >> --
> >> 2.27.0
> >>