Re: [PATCH 1/1] block: System crashes when cpu hotplug + bouncing port

From: Daniel Wagner
Date: Tue Jun 29 2021 - 07:50:32 EST


On Tue, Jun 29, 2021 at 06:06:21PM +0800, Ming Lei wrote:
> > No, I don't see any errors. I am still trying to reproduce it on real
> > hardware. The setup with blktests running in Qemu did work with all
> > patches applied (the ones from me and your patches).
> >
> > About the error argument: Later in the code path, e.g. in
> > __nvme_submit_sync_cmd() transport errors (incl. canceled request) are
> > handled as well, hence the upper layer will see errors during connection
> > attempts. My point is, there is nothing special about the connection
> > attempt failing. We have error handling code in place and the above
> > state machine has to deal with it.
>
> My two patches not only avoid the kernel panic, but also allow the
> request to be allocated successfully, so the connect io queue request can
> be submitted to the driver even though all CPUs in hctx->cpumask are
> offline, and then nvmef can be set up properly.
>
> That is the difference from yours, which fails the request allocation, so
> connect io queues can't be done, and the whole host can't be set up
> successfully and becomes a brick. The point is that cpu offline shouldn't
> cause the setup of nvme fc/rdma/tcp/loop to fail.

Right, I think I see your point now.
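
To spell out the difference as I understand it, here is a toy userspace
model, not kernel code. The toy_* names and the plain bitmask
representation are made up for illustration, and the fallback rule is
only my reading of the description above, not the actual patch:

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS 8
#define TOY_NO_CPU  (-1)

/* Lowest set bit in a CPU mask, or TOY_NO_CPU when the mask is empty. */
static int toy_first_cpu(uint32_t mask)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return TOY_NO_CPU;
}

/*
 * Strict policy: only an online CPU mapped to the hctx is acceptable.
 * If every mapped CPU is offline the allocation fails, and the caller
 * (here: the connect path) never gets a request to submit.
 */
static int toy_pick_cpu_strict(uint32_t hctx_mask, uint32_t online_mask)
{
	return toy_first_cpu(hctx_mask & online_mask);
}

/*
 * Fallback policy: prefer an online mapped CPU, but if there is none,
 * still pick a mapped CPU so the allocation succeeds and the connect
 * request can be handed to the driver.
 */
static int toy_pick_cpu_fallback(uint32_t hctx_mask, uint32_t online_mask)
{
	int cpu = toy_first_cpu(hctx_mask & online_mask);

	if (cpu == TOY_NO_CPU)
		cpu = toy_first_cpu(hctx_mask);
	return cpu;
}
```

With hctx_mask = 0x0c (CPUs 2 and 3) and online_mask = 0x03 (CPUs 0
and 1), the strict variant has nothing to return while the fallback
variant still yields CPU 2. That is the case where failing the
allocation would leave the io queues unconnected.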

> > Anyway, avoiding the if in the hotpath is a good thing. I just don't
> > think your argument about no error can happen is correct.
>
> Again, it isn't related to avoiding the if, and it isn't in the hotpath
> at all.

I mixed up blk_mq_alloc_request() with blk_mq_alloc_request_hctx().

Thanks for the explanation. I'll keep trying to replicate the problem
on real hardware and see if these patches mitigate it.

Thanks,
Daniel