Re: [PATCH] nvme_core: scan namespaces asynchronously

From: Sagi Grimberg
Date: Wed Jan 17 2024 - 09:19:19 EST




On 1/16/24 21:14, stuart hayes wrote:


On 1/12/2024 1:36 PM, stuart hayes wrote:



On 04/01/2024 18:47, Keith Busch wrote:
On Thu, Jan 04, 2024 at 10:38:26AM -0600, Stuart Hayes wrote:
Currently NVMe namespaces are scanned serially, so it can take a long time
for all of a controller's namespaces to become available, especially with a
slower (fabrics) interface with a large number (~1000) of namespaces.

Use async function calls to make namespace scanning happen in parallel,
and add a (boolean) module parameter "async_ns_scan" to enable this.

Hm, we're not doing a whole lot of blocking IO to bring up a namespace,
so I'm a little surprised it makes a noticeable difference. How much time
improvement are you observing by parallelizing the scan? Is there a
tipping point in number of namespaces where inline scanning is better
than asynchronous? And if it is a meaningful gain, let's not introduce
another module parameter to disable it.

I don't think it is a good idea, since some of the namespace characteristics must be validated during re-connection time, for example.
I actually prepared a patch that makes sure we sync the ns scanning before kicking the ns blk queue to avoid that situation.
For example, if for some reason ns1 changes its UUID, then we must remove it and open a new bdev instead. We can't kick old requests to it...



Sorry for the delayed response--I thought I could get exact data on how long it takes with and
without the patch before I responded, but it is taking a while (I'm having to rely on someone else
to do the testing).  I'll respond with the data as soon as I get it--hopefully it won't be too
much longer.  The time it takes to scan namespaces adds up when there are 1000 namespaces and
you have a fabrics controller on a network that isn't too fast.

I don't expect there would be any reason to disable this.  I only put the module parameter to
disable it in case there was some unforeseen issue, but I can remove that.

To Max Gurtovoy--this patch wouldn't change when or how namespaces are validated... it just
puts the actual scan work function on a workqueue so the scans can happen in parallel.  It will
do the same work to scan, at the same point, and it will wait for all the scanning to finish
before proceeding.  I don't understand how this patch would make the situation you mention any
worse.
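
The structure described above (hand each namespace scan off as independent work,
then wait for all of it to finish before proceeding) can be illustrated with a
small userspace sketch. This is not the patch and not kernel code: the names
scan_one_namespace, scan_all_serial and scan_all_async, the worker count, and the
sleep that stands in for a blocking round trip to the controller are all
assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def scan_one_namespace(nsid, round_trip_s):
    """Hypothetical stand-in for one namespace scan: block for one round trip."""
    time.sleep(round_trip_s)
    return nsid


def scan_all_serial(nsids, round_trip_s):
    """Scan namespaces one after another; total time grows with len(nsids)."""
    return [scan_one_namespace(nsid, round_trip_s) for nsid in nsids]


def scan_all_async(nsids, round_trip_s, workers=32):
    """Queue every scan, then wait for all of them before returning."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(scan_one_namespace, nsid, round_trip_s)
                   for nsid in nsids]
        # result() blocks until that scan is done, so nothing after this
        # function can run before every namespace has been scanned.
        return [f.result() for f in futures]
```

Both functions do the same per-namespace work and return only once every scan
has completed; the only difference is whether the waits overlap.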


I have numbers for the namespace scan time improvement.  Below is the amount of time it took for
all of the namespaces to show up when connecting to a controller with 1002 namespaces:

network latency   time without patch    time with patch
  0                        6s                 1s
 50                      210s                10s
100                      417s                18s
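
For reference, the speedup implied by each row above is plain arithmetic on the
reported times. The snippet below only restates the table; the function name
speedups is made up for illustration, and the latency column is copied as
reported, without assuming a unit.

```python
# (network latency as reported, seconds without patch, seconds with patch)
RESULTS = [
    (0, 6, 1),
    (50, 210, 10),
    (100, 417, 18),
]


def speedups(results):
    """Return (latency, ratio) pairs, ratio = time without / time with patch."""
    return [(latency, without_s / with_s)
            for latency, without_s, with_s in results]


for latency, ratio in speedups(RESULTS):
    print(f"latency {latency:>3}: {ratio:.1f}x faster with the patch")
```

That works out to roughly 6x, 21x and 23x for the three rows.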


That is a big improvement. I wouldn't say that 1000+ namespaces
is a common requirement. But the improvement speaks for itself.