Re: [PATCH RFC] scsi: core: remove unused 'restarts' from scsi_device

From: Yu Kuai
Date: Mon Nov 28 2022 - 01:09:10 EST


Hi,

On 2022/11/28 12:12, Ming Lei wrote:
On Mon, Nov 28, 2022 at 11:35:18AM +0800, Yu Kuai wrote:


On 2022/11/28 11:27, Ming Lei wrote:
On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
Hi,

On 2022/11/18 19:30, Yu Kuai wrote:
From: Yu Kuai <yukuai3@xxxxxxxxxx>

During code review, I found that 'restarts' is not useful anymore after
the following commits:

1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
is a reason to kick")
2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
is the last request")
3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
and completion")

Now, if getting budget ever fails, the block layer will make sure to
trigger a new queue run for the hctx. Hence there is no need to run the
queue from the SCSI layer in this case.
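
As a minimal sketch of that dispatch path (the struct and helper names
below are invented for illustration; they only model what
blk_mq_dispatch_rq_list() and scsi_mq_get_budget() do conceptually):

    #include <stdbool.h>

    struct hctx_model {
            int queued;         /* requests waiting on this hctx             */
            int budget;         /* stands in for the LUN-wide device budget  */
            bool rerun_pending; /* stands in for blk_mq_delay_run_hw_queue() */
    };

    static bool get_budget(struct hctx_model *h)
    {
            if (h->budget == 0)
                    return false;
            h->budget--;
            return true;
    }

    /* Roughly what blk_mq_dispatch_rq_list() does after commit ab3cee3762e5 */
    static void dispatch(struct hctx_model *h)
    {
            bool no_budget = false;

            while (h->queued > 0) {
                    if (!get_budget(h)) {
                            no_budget = true; /* budget exhausted, stop issuing */
                            break;
                    }
                    h->queued--;              /* "issue" one request            */
            }

            /*
             * "No budget" is now a reason to kick the queue again, so forward
             * progress no longer depends on the SCSI layer counting 'restarts'
             * and re-running the queue itself.
             */
            if (no_budget)
                    h->rerun_pending = true;
    }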


But scsi_run_queue_async() needs to run all hw queues, because the budget
is actually LUN/request-queue wide.

Why does the hw queue need to be run if getting budget never failed on
this hw queue?

Because all hw queues share the queue-wide budget; once budget becomes
available, all hw queues are re-run. A hw queue won't actually be
scheduled if it has nothing to run, see blk_mq_run_hw_queue().
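
A toy model of that point (lun_model, run_hw_queue and
put_budget_and_restart are made-up names, not kernel symbols): the budget
is per-LUN, so any hctx may have been the one that failed to get it,
while the early exit for an idle hctx keeps the extra runs cheap.

    #define NR_HW_QUEUES 128

    struct lun_model {
            int budget;                /* shared, device-wide budget */
            int pending[NR_HW_QUEUES]; /* requests waiting per hctx  */
    };

    /* Mirrors blk_mq_run_hw_queue()'s early exit for an idle hctx. */
    static void run_hw_queue(struct lun_model *lun, int hctx)
    {
            if (lun->pending[hctx] == 0)
                    return;
            while (lun->pending[hctx] > 0 && lun->budget > 0) {
                    lun->budget--;
                    lun->pending[hctx]--;
            }
    }

    /* Re-running "all" hw queues on budget release: any hctx may have
     * starved, but idle ones return immediately. */
    static void put_budget_and_restart(struct lun_model *lun)
    {
            lun->budget++;
            for (int i = 0; i < NR_HW_QUEUES; i++)
                    run_hw_queue(lun, i);
    }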

I still don't get why all hw queues should be re-run. Is this just for
performance, or is it fixing a bug? And I'm not sure this behavior is
good for performance.

Does anyone have suggestions about this patch?

More info on why I tried to remove this:

While testing megaraid with 4 NVMe drives and the none elevator, the
default queue_depth is 128. When I test it with fio, 128 jobs and
iodepth 1, bandwidth is about 4 Gib/s; however, with 128 jobs and
iodepth 2, bandwidth drops to about 0.8 Gib/s. With this patch applied,
bandwidth stays at 4 Gib/s in the latter case.

What are .can_queue and nr_hw_queues in your setting?
test cmd:
fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022
-rwmixread=70 -refill_buffers -filename=/dev/sdg -numjobs=128 -size=1TB
-runtime=60s -bs=4k -iodepth=2 -rw=randwrite

test environment:
arm64 Kunpeng-920, 128 CPUs
megaraid with 4 NVMe drives, 128 hctxs, and a queue_depth of 128

From your setting, megaraid should set ->host_tagset, which means there
are only 128 tags for all 4 NVMe drives (the 128 hw queues share those
128 tags too).
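
A rough picture of the setup being described, using the standard
Scsi_Host fields; the numbers mirror the test report above, not the
megaraid_sas source, and example_host_setup() is a hypothetical helper:

    #include <scsi/scsi_host.h>

    /* Hypothetical helper, only to show which fields matter here. */
    static void example_host_setup(struct Scsi_Host *shost)
    {
            shost->can_queue    = 128; /* host-wide depth: 128 tags in total   */
            shost->nr_hw_queues = 128; /* one hctx per CPU on the 128-core box */
            shost->host_tagset  = 1;   /* SCSI core then shares one tag space  */
                                       /* across all hctxs, so 128 hw queues   */
                                       /* and 4 NVMe devices compete for the   */
                                       /* same 128 tags                        */
    }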

That looks like a really bad setting.

You're right that it is bad, but the point here is to trigger get-budget
failures frequently. If I set queue_depth to 2048 and use 128 numjobs
with 32 iodepth, performance is still bad.

BTW, why do you drive the NVMe devices via megaraid instead of the nvme driver?

And by the way, after Jan's patch "blk-mq: Improve performance of non-mq
IO schedulers with multiple HW queues", scsi_run_queue_async() can only
guarantee to run the hw queue for the current CPU, not all the hw queues.

That isn't true; each hctx is still run in the case of the none and kyber schedulers.

Yes, but the current default elevator, deadline, is shared across hctxs.
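
As a rough illustration of that exchange (toy types and helper names, not
the blk-mq source): after Jan's patch, only the hctx mapped to the
current CPU is kicked when the elevator is shared across hctxs (such as
mq-deadline), while with none or kyber every hctx is still run.

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model of a request queue with per-CPU hw queues. */
    struct q_model {
            bool sq_sched;      /* elevator shared across hctxs, e.g. mq-deadline */
            int  nr_hw_queues;
            int  this_cpu_hctx; /* index of the hctx mapped to the current CPU    */
    };

    static void run_hctx(int idx)
    {
            printf("run hctx %d\n", idx);
    }

    /* Roughly the decision blk_mq_run_hw_queues() makes after Jan's patch. */
    static void run_hw_queues(struct q_model *q)
    {
            if (q->sq_sched) {
                    run_hctx(q->this_cpu_hctx); /* shared elevator: local hctx only */
                    return;
            }
            for (int i = 0; i < q->nr_hw_queues; i++)
                    run_hctx(i);                /* none/kyber: every hctx           */
    }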

thanks,
Ming
