Re: stalling IO regression since linux 5.12, through 5.18

From: Yu Kuai
Date: Thu Sep 01 2022 - 04:19:27 EST


On 2022/09/01 16:03, Jan Kara wrote:
On Thu 01-09-22 15:02:03, Yu Kuai wrote:
Hi, Chris

On 2022/08/20 15:00, Ming Lei wrote:
On Fri, Aug 19, 2022 at 03:20:25PM -0400, Chris Murphy wrote:


On Thu, Aug 18, 2022, at 1:24 AM, Ming Lei wrote:
On Thu, Aug 18, 2022 at 12:27:04AM -0400, Chris Murphy wrote:


On Thu, Aug 18, 2022, at 12:18 AM, Chris Murphy wrote:
On Thu, Aug 18, 2022, at 12:12 AM, Chris Murphy wrote:
On Wed, Aug 17, 2022, at 11:41 PM, Ming Lei wrote:

OK, can you post the blk-mq debugfs log after you trigger it on v5.17?

Same boot, 3rd log. But the load is above 300 so I kinda need to sysrq+b soon.

https://drive.google.com/file/d/1375H558kqPTdng439rvG6LuXXWPXLToo/view?usp=sharing


Also please test the following one too:


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5ee62b95f3e5..d01c64be08e2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1991,7 +1991,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
 		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 			blk_mq_run_hw_queue(hctx, true);
-		else if (needs_restart && needs_resource)
+		else if (needs_restart && (needs_resource ||
+				blk_mq_is_shared_tags(hctx->flags)))
 			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);

 		blk_mq_update_dispatch_busy(hctx, true);
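
For readers following along, the re-run decision that this hunk changes can be modelled as a small pure function. This is only a sketch of the branch shown in the diff: the toy_* names, the struct and the "patched" switch are made up for illustration and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of the branch; not a kernel type. */
enum toy_rerun { TOY_NO_RERUN, TOY_RUN_NOW, TOY_RUN_DELAYED };

/* Hypothetical snapshot of the flags the branch looks at. */
struct toy_dispatch_state {
	bool needs_restart;
	bool no_tag;
	bool wait_list_empty;	/* stands in for list_empty_careful(...) */
	bool needs_resource;
	bool shared_tags;	/* stands in for blk_mq_is_shared_tags(...) */
};

/* Mirrors the if/else-if from the diff; "patched" selects the '+' lines. */
static enum toy_rerun toy_rerun_decision(const struct toy_dispatch_state *s,
					 bool patched)
{
	assert(s != NULL);

	if (!s->needs_restart || (s->no_tag && s->wait_list_empty))
		return TOY_RUN_NOW;
	/* needs_restart is known to be true from here on */
	if (s->needs_resource || (patched && s->shared_tags))
		return TOY_RUN_DELAYED;
	return TOY_NO_RERUN;
}
```

In this model the only case that changes is RESTART set, no resource shortage, shared tags: before the patch nothing is scheduled, with the patch a delayed run is.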



With just this patch on top of 5.17.0, it still hangs. I've captured the block debugfs log:
https://drive.google.com/file/d/1ic4YHxoL9RrCdy_5FNdGfh_q_J3d_Ft0/view?usp=sharing

The log is similar to the previous one; the only difference is that
RESTART is not set.

Here is another patch, merged in v5.18, which also fixes an IO stall; feel free to test it:

8f5fea65b06d blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues

Have you tried this patch?

We hit a similar problem in our tests, and I'm fairly sure about what
happens when the stall occurs.

Our test environment: NVMe with the bfq IO scheduler.

How the IO stalls:

1. hctx1 dispatches a rq from the in-service bfqq, and that bfqq becomes
empty. The dispatch somehow fails, the rq is inserted into
hctx1->dispatch, and new run work is queued.

2. Another hctx tries to dispatch a rq. However, the in-service bfqq is
empty, so bfq_dispatch_request() returns NULL, and thus
blk_mq_delay_run_hw_queues() is called.

3. Because of the problem described in the above patch, the run work from
"hctx1" can be stalled.
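
The three steps above can be sketched with a toy model. Everything here is an assumption made for illustration: the toy_* names are invented, time is counted in abstract ticks, the other hctx is assumed to re-delay all queues on every tick, and re-arming is assumed to overwrite the expiry of already-pending work. The skip_pending switch models the idea in the patch title (do not extend the delay of an hctx whose run work is already pending); it is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a delayed work item; not the kernel's struct delayed_work. */
struct toy_work {
	bool pending;
	int expires;	/* tick at which the work would run */
};

/* (Re)arms the timer even if the work is already pending. */
static void toy_mod_delayed_work(struct toy_work *w, int now, int delay)
{
	w->pending = true;
	w->expires = now + delay;
}

/*
 * One hctx asks every hctx to run after "delay" ticks.
 * With skip_pending set, already-armed work keeps its original expiry.
 */
static void toy_delay_run_all(struct toy_work *works, int n, int now,
			      int delay, bool skip_pending)
{
	for (int i = 0; i < n; i++) {
		if (skip_pending && works[i].pending)
			continue;
		toy_mod_delayed_work(&works[i], now, delay);
	}
}

/*
 * Returns the tick at which hctx1's run work (index 0) first runs,
 * or -1 if it never runs within "horizon" ticks.
 */
static int toy_first_run_tick(bool skip_pending, int horizon)
{
	struct toy_work works[2] = { { false, 0 }, { false, 0 } };
	const int delay = 3;

	assert(horizon > 0);

	/* Step 1: hctx1 queues its run work, due at tick 1. */
	toy_mod_delayed_work(&works[0], 0, 1);

	for (int now = 0; now < horizon; now++) {
		/* Step 3: does hctx1's run work get to execute on this tick? */
		if (works[0].pending && works[0].expires <= now)
			return now;
		/* Step 2: the other hctx finds nothing to dispatch and re-delays all. */
		toy_delay_run_all(works, 2, now, delay, skip_pending);
	}
	return -1;
}
```

In this model, toy_first_run_tick(false, 100) returns -1 because the expiry of hctx1's work keeps moving forward, while toy_first_run_tick(true, 100) returns 1 because the pending work is left alone.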

The above patch should fix this IO stall. However, it seems to me that
bfq does have a problem: the in-service bfqq doesn't expire in the
following situation:

1. dispatched rqs don't complete
2. no new rq is issued to bfq

And I guess:
3. there are requests queued in other bfqqs
?

Yes, of course, other bfqqs still have requests. But the current
implementation is flawed: even if other bfqqs don't have requests,
bfq_asymmetric_scenario() can still return true because
num_groups_with_pending_reqs > 0. We tried to fix this; however, there
seems to be some misunderstanding with Paolo, and the fix has not been
applied to mainline yet...
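
To make the flaw concrete, here is a heavily simplified sketch. It assumes the check is roughly an OR of a few conditions, one of which is num_groups_with_pending_reqs > 0. The struct, the toy_* names and the other_queues_with_reqs field are hypothetical and only serve the illustration; the real function looks at more state than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of the scheduler state; not struct bfq_data. */
struct toy_bfq_state {
	bool varied_queue_weights;
	bool multiple_classes_busy;
	int num_groups_with_pending_reqs;
	int other_queues_with_reqs;	/* requests waiting in other bfqqs */
};

/*
 * Simplified shape of the check: note that other_queues_with_reqs
 * is never consulted, so the result can be true with nothing else to serve.
 */
static bool toy_asymmetric_scenario(const struct toy_bfq_state *s)
{
	assert(s != NULL);

	return s->varied_queue_weights || s->multiple_classes_busy ||
	       s->num_groups_with_pending_reqs > 0;
}
```

With both flags false, other_queues_with_reqs == 0 and num_groups_with_pending_reqs == 1, the sketch still returns true, which is the case described above.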

Thanks,
Kuai

Otherwise I don't see a point in expiring the current bfqq, because there's
nothing bfq could do anyway. But under normal circumstances request
completion should not take so long, so I don't think it would really be
worth implementing some special mechanism for this in bfq.

Honza