Re: [PATCH] zram: do not lookup algorithm in backends table

From: Andrew Morton
Date: Wed Jun 22 2022 - 15:19:48 EST


On Thu, 23 Jun 2022 00:20:05 +0900 Sergey Senozhatsky <senozhatsky@xxxxxxxxxxxx> wrote:

> On (22/06/22 11:35), Sergey Senozhatsky wrote:
> > Always use crypto_has_comp() so that crypto can lookup module,
> > call usermodhelper to load the modules, wait for usermodhelper
> > to finish and so on. Otherwise crypto will do all of these steps
> > under CPU hot-plug lock and this looks like too much stuff to
> > handle under the CPU hot-plug lock. Besides this can end up in
> > a deadlock when usermodhelper triggers a code path that attempts
> > to lock the CPU hot-plug lock, that zram already holds.
>
> And we think that we (not exactly "we", our partners) actually
> see a deadlock. It goes something like this:
>
> - path A. zram grabs CPU hot-plug lock, execs /sbin/modprobe from crypto
> and waits for modprobe to finish

Nope, can't do that.

> disksize_store
> zcomp_create
> __cpuhp_state_add_instance
> __cpuhp_state_add_instance_cpuslocked
> zcomp_cpu_up_prepare
> crypto_alloc_base
> crypto_alg_mod_lookup
> call_usermodehelper_exec
> wait_for_completion_killable
> do_wait_for_common
> schedule

The usermode helper is free to do anything it wants, including
operations that take the CPU hotplug lock. Or operations which might
in the future be changed to take that lock.

> - path B. async work kthread that brings in scsi device. It wants to
> register CPUHP states at some point, and it needs the CPU hot-plug
> lock for that, which is owned by zram.
>
> async_run_entry_fn
> scsi_probe_and_add_lun
> scsi_mq_alloc_queue
> blk_mq_init_queue
> blk_mq_init_allocated_queue
> blk_mq_realloc_hw_ctxs
> __cpuhp_state_add_instance
> __cpuhp_state_add_instance_cpuslocked
> mutex_lock
> schedule
>
> - path C. modprobe sleeps, waiting for all aync works to finish.
>
> load_module
> do_init_module
> async_synchronize_full
> async_synchronize_cookie_domain
> schedule
>
> And none can make any progress.
>
> So I think we need to move crypto_alg_mod_lookup()->call_usermodehelper_exec()
> out of CPU hot-plug lock and pre-load modules in advance, before we grab the
> hot-plug lock.

If the locking is fixed, why is there still a need to preload modules?