Re: [RFC PATCH V2 5/6] ublk_drv: consider recovery feature in aborting mechanism

From: Ziyang Zhang
Date: Sun Sep 04 2022 - 07:24:03 EST

Next message: Dominique Martinet: "[PATCH] net/9p: use a dedicated spinlock for trans_fd"
Previous message: Gabriel Francisco: "Re: [Regression] Bug 216420 - Unable to handle kernel NULL pointer dereference at virtual address 00000008"
In reply to: Ming Lei: "Re: [RFC PATCH V2 5/6] ublk_drv: consider recovery feature in aborting mechanism"
Next in thread: Ming Lei: "Re: [RFC PATCH V2 5/6] ublk_drv: consider recovery feature in aborting mechanism"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2022/9/3 21:30, Ming Lei wrote:
> On Wed, Aug 31, 2022 at 11:54 PM ZiyangZhang
> <ZiyangZhang@xxxxxxxxxxxxxxxxx> wrote:
>>
>> We change the default behavior of aborting machenism. Now monitor_work
>> will not be manually scheduled by ublk_queue_rq() or task_work after a
>> ubq_daemon or process is dying(PF_EXITING). The monitor work should
>> find a dying ubq_daemon or a crash process by itself. Then, it can
>
> Looks you don't consider one dying ubq_daemon as one crash candidate.
> Most io implementation is done in the ubq pthread, so it should be
> covered by the crash recovery.
>
>> start the aborting machenism. We do such modification is because we want
>> to strictly separate the STOP_DEV procedure and monitor_work. More
>> specifically, we ensure that monitor_work must not be scheduled after
>> we start deleting gendisk and ending(aborting) all inflight rqs. In this
>> way we are easy to consider recovery feature and unify it into existing
>> aborting mechanism. Really we do not want too many "if can_use_recovery"
>> checks.
>
> Frankly speaking, not sure we need to invent new wheel for the
> 'aborting' mechanism.
>
> In theory, you needn't change the current monitor work and cancel
> dev/queue. What you need is how to handle the dying ubq daemon:
>
> 1) without user recovery, delete disk if any ubq daemon is died.
>
> 2) with user recovery:
> - quiesce request queue and wait until all inflight requests are
> requeued(become IDLE);
> - call io_uring_cmd_done for any active io slot
> - send one kobj_uevent(KOBJ_CHANGE) to notify userspace for handling
> the potential crash; if it is confirmed as crash by userspace,
> userspace will send command to handle it.
> (this way will simplify userspace too, since we can add one utility
> and provide it via udev script for handling rec
>
> Checking one flag lockless is usually not safe, also not sure why we
> need such flag here, and the original check is supposed to work.
>
> overy)
>
>>
>> With recovery feature disabled and after a ubq_daemon crash:
>> (1) monitor_work notices the crash and schedules stop_work
>
> driver can't figure out if it is crash, and it can just see if the
> ubq deamon is died or not. And crash detection logic should be done
> in userspace, IMO.
>
>> (2) stop_work calls ublk_stop_dev()
>> (3) In ublk_stop_dev():
>> (a) It sets 'force_abort', which prevents new rqs in ublk_queue_rq();
>
> Please don't add new flag in fast path lockless, and the original check
> is supposed to be reused for recovery feature.
>
>> If ublk_queue_rq() does not see it, rqs can still be ended(aborted)
>> in fallback wq.
>> (b) Then it cancels monitor_work;
>> (c) Then it schedules abort_work which ends(aborts) all inflight rqs.
>> (d) At the same time del_gendisk() is called.
>> (e) Finally, we complete all ioucmds.
>>
>> Note: we do not change the existing behavior with reocvery disabled. Note
>> that STOP_DEV ctrl-cmd can be processed without reagrd to monitor_work.
>>
>> With recovery feature enabled and after a process crash:
>> (1) monitor_work notices the crash and all ubq_daemon are dying.
>> We do not consider a "single" ubq_daemon(pthread) crash. Please send
>> STOP_DEV ctrl-cmd which calling ublk_stop_dev() for this case.
>
> Can you consider why you don't consider it as one crash? IMO, most of
> userspace block logic is run in ubq_daemon, so it is reasonable to
> consider it.
>
> ublk_reinit_dev() is supposed to be run in standalone context, just like
> ublk_stop_dev(), we need monitor_work to provide forward progress,
> so don't run wait in monitor work.
>
> And please don't change this model for making forward progress.
>
>

Hi, Ming.

I will take your advice and provide V4 soon. Here is the new design:

(0) No modification in fast patch. We just requeue rqs with a dying ubq_daemon
and schedule monitor_work immediately.

BTW: I think here we should call 'blk_mq_delay_kick_requeue_list()' after
requeuing a rq. Otherwise del_gendisk() in ublk_stop_dev() hangs.

(1) Add quiesce_work, which is scheduled by monitor_work after a ubq_daemon
is dying with recovery enabled.

(2) quiesce_work runs ublk_quiesce_dev(). It accquires the ub lock, and
quiescses the request queue(only once). On each dying ubq, call
ublk_quiesce_queue(). It waits until all inflight rqs(ACTIVE) are
requeued(IDLE). Finally it completes all ioucmds.
Note: So We need to add a per ubq flag 'quiesced', which means
we have done this 'quiesce && clean' stuff on the ubq.

(3) After the request queue is quiesced, change ub's state to STATE_QUIESCED.
This state can be checked by GET_DEV_INFO ctrl-cmd, just like STATE_LIVE. So
user can detect a crash by sending GET_DEV_INFO and getting STATE_QUIESCED
back.

BTW, I'm unsure that sending one kobj_uevent(KOBJ_CHANGE) really helps. Users
have may ways to detect a dying process/pthread. For example, they can 'ps'
ublksrv_pid or check ub's state by GET_DEV_INFO ctrl-cmd. Anyway, this work
can be done in the future. We can introduce a better way to detect a crash.
For this patchset, let's focus on how to deal with a dying ubq_daemon.
Do you agree?

(4) Do not change ublk_stop_dev(). BTW, ublk_stop_dev() and ublk_quiescd_dev()
exclude each other by accqiring ub lock.

(5) ublk_stop_dev() has to consider a quiesced ubq. It should unquiesce request
queue(only once) if it is quiesced. There is nothing else ublk_stop_dev()
has to do. Inflight rqs requeued before will be aborted naturally by
del_gendisk().

(6) ublk_quiesce_dev() cannot be run after gendisk is removed(STATE_DEAD).

(7) No need to run ublk_quiesce_queue() on a 'quiesced' ubq by checking the flag.
Note: I think this check is safe here.

(8) START_USER_RECOVERY needs to consider both a dying process and pthread(ubq_daemon).

For a dying process, it has to reset ub->dev_info.ublksrv_pid and ub->mm. This can
by done by passing a qid = -1 in ctrl-cmd. We should make sure all ubq_daemons
are dying in this case.

otherwise it is a dying pthread. Only this ubq is reinit. Users may send many
START_USER_RECOVERY with different qid to recover many ubqs.

Thanks for reviewing patches.

Regards,
Zhang.

Next message: Dominique Martinet: "[PATCH] net/9p: use a dedicated spinlock for trans_fd"
Previous message: Gabriel Francisco: "Re: [Regression] Bug 216420 - Unable to handle kernel NULL pointer dereference at virtual address 00000008"
In reply to: Ming Lei: "Re: [RFC PATCH V2 5/6] ublk_drv: consider recovery feature in aborting mechanism"
Next in thread: Ming Lei: "Re: [RFC PATCH V2 5/6] ublk_drv: consider recovery feature in aborting mechanism"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]