blk-mq: takes hours for scsi scanning finish when thousands of LUNs

From: Zhangqing Luo
Date: Mon Oct 19 2015 - 10:40:23 EST


Hi, Jens,Tejun

Current problem we meet:

When Multiple Queue is used for scsi, the period between each LUN probe
is increasing as the number of block request queue goes up, eventually
it takes hours for Initiator to finish scanning thousands of LUNs.

Kernel version we're using: 4.1.6-14
I tested with MainLine, it also has same issue.

My analysis:

As for each new LUN, scsi_mq_alloc_queue is called when Multiple Queue
is enabled for scsi to allocate and initialize the new queue, the process is:
->blk_mq_init_queue
->blk_mq_init_allocated_queue
->blk_mq_add_queue_tag_set
->blk_mq_update_tag_set_depth

blk_mq_update_tag_set_depth calls blk_mq_freeze_queue for each
queue in the set tag list, if the queue is already in the tag list,
we can see that it takes tens of ticks to freeze the queue, if there are
thousands of queues in the list, it will take more than ten seconds(HZ=1000)
for a new queue to finish its initialization, this is really a high delay,

my question here is: why blk_mq_update_tag_set_depth updates
each queue in the tag list?
if this thing must be done, as the code below shows just changing flags
depending on 'shared' variable why shouldn't we store the previous result of 'shared'
and compare with the current result, if it's unchanged, nothing will be done
and avoid looping all queues in list.

static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
{
struct blk_mq_hw_ctx *hctx;
struct request_queue *q;
bool shared;
int i;

if (set->tag_list.next == set->tag_list.prev)
shared = false;
else
shared = true;

list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_freeze_queue(q);

queue_for_each_hw_ctx(q, hctx, i) {
if (shared)
hctx->flags |= BLK_MQ_F_TAG_SHARED;
else
hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
}
blk_mq_unfreeze_queue(q);
}
}

Deep insight:
I dig more into the delay and find more details, for each already existing
queue, the delay comes from blk_mq_freeze_queue_wait function
because the request queue->mq_usage_counter is not zero, blocked on
it and woken up by percpu_ref_switch_to_atomic_rcu after a grace period,
let me explain how this happens.

Although for each new allocating queue, PERCPU_REF_INIT_ATOMIC flag is set
when allocating the percpu ref, but when registering queue,
blk_mq_finish_init changes it to PERCPU by calling percpu_ref_switch_to_percpu,

So every time blk_mq_freeze_queue_start, it runs in this way

blk_mq_freeze_queue_start
->percpu_ref_kill->percpu_ref_kill_and_confirm
->__percpu_ref_switch_to_atomic
->call_rcu_sched(&ref->rcu,percpu_ref_switch_to_atomic_rcu)

and blk_mq_freeze_queue_wait blocks on queue->mq_usage_counter
as it is not zero, and wake up by percpu_ref_switch_to_atomic_rcu
after a grace period


My question here is why should we change ref to PERCPU at blk_mq_finish_init?
because of this changing, delay appears.

My suggestion:

Don't change ref to PERCPU by blk_mq_finish_init, keeping
it in ATOMIC, this can solve the problem.
Or optimize blk_mq_update_tag_set_depth to avoid such delay

The patch below is one possible fix, as I can't see if it can introduce
other regression issue, so could you guy look into it and give some advice?

/Thanks



0001-Avoid-long-delay-for-scsi-scanning-when-thousands-of.patch