Re: [PATCH v2 4/5] nvme-fc: Make initial connect attempt synchronous

From: Hannes Reinecke
Date: Tue Jun 27 2023 - 02:51:56 EST


On 6/27/23 08:39, Hannes Reinecke wrote:
On 6/27/23 08:18, Daniel Wagner wrote:
On Mon, Jun 26, 2023 at 02:33:18PM +0300, Dan Carpenter wrote:
@@ -2943,6 +2943,8 @@ nvme_fc_create_io_queues(struct nvme_fc_ctrl *ctrl)
      /* force put free routine to ignore io queues */
      ctrl->ctrl.tagset = NULL;
+    if (ret > 0)
+        ret = -EIO;

All these checks for ret > 0 make me unhappy.  I don't understand how
they are a part of the commit.

We have two types of error message types in the nvme subsystem. The negative
values are the usual ones and positive ones are nvme protocol errors.

For example if the authentication fails because of invalid credentials when
doing the authentication nvmf_connect_admin_queue() will return a value of
NVME_SC_AUTH_REQUIRED. This is also the value which gets propagated to this
point here. The problem is any positive error code is interpreted as a valid
pointer later in the code, which results in a crash.

I have tried to look at the context and I think maybe you are working
around the fact that qla_nvme_ls_req() returns QLA_FUNCTION_FAILED on
error.

The auth blktests are exercising the error path here and that's why I added
this check. BTW, we already use in other places, this is not completely new in
this subsystem.

Also the qla_nvme_ls_req() function EINVAL on error.  I just wrote a
commit message saying that none of the callers cared but I missed that
apparently gets returned to nvme_fc_init_ctrl().  :/
https://lore.kernel.org/all/49866d28-4cfe-47b0-842b-78f110e61aab@moroto.mountain/

Thank!

Let's just fix qla_nvme_ls_req() instead of working around it here.

And let's add a WARN_ON_ONCE() somewhere to prevent future bugs.

This makes sense for the driver APIs. Though for the core nvme subsystem this
needs to be discusses/redesigned how to handle the protocol errors first.

I would stick with the 'normal' nvme syntax of having negative errors as internal errors (ie errnos), '0' for no error, and positive numbers as NVMe protocol errors.
As such I would also advocate to not map NVMe protocol errors onto error numbers but rather fix the callers to not do a pointer conversion.

Aw. Now I see it.

It's the ->create_ctrl() callback which will always return a controller pointer or an error value.
If we were to return a protocol error we would need to stick it into the controller structure itself. But if we doing that then we'll be ending up with a non-existing controller, ie we'd be returning a structure for a dead controller. Not exactly pretty, but it would allow us to improve
the userland API to return the NVMe protocol error by reading from the
fabrics device; the controller structure itself would be cleaned up when closing that device.

Hmm.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman