Re: [PATCH V2] lib/group_cpus.c: avoid to acquire cpu hotplug lock in group_cpus_evenly

From: Chengming Zhou
Date: Fri Aug 18 2023 - 03:00:30 EST


Hi,

On 2023/8/18 09:52, Ming Lei wrote:
> group_cpus_evenly() could be part of a storage driver's error handler,
> such as the nvme driver's, which may run during CPU hotplug, when a
> storage queue has to drain its pending IOs because all CPUs associated
> with the queue are offline and the queue is becoming inactive. And
> handling IO needs the error handler to provide forward progress.
>
> This causes a deadlock:
>
> 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
> handler is waiting for inflight IO
>
> 2) error handler is waiting for CPU hotplug lock
>
> 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because
> error handling can't provide forward progress.
>
> Solve the deadlock by not holding the CPU hotplug lock in
> group_cpus_evenly(), in which a two-stage spread is done: 1) the 1st
> stage is over all present CPUs; 2) the 2nd stage is over all other CPUs.
>
> It turns out the two-stage spread just needs a consistent
> 'cpu_present_mask', so remove the CPU hotplug lock by storing the mask in
> a local cache. This doesn't change correctness, because all CPUs are
> still covered.
>
> Cc: Keith Busch <kbusch@xxxxxxxxxx>
> Cc: linux-nvme@xxxxxxxxxxxxxxxxxxx
> Cc: linux-block@xxxxxxxxxxxxxxx
> Reported-by: Yi Zhang <yi.zhang@xxxxxxxxxx>
> Reported-by: Guangwu Zhang <guazhang@xxxxxxxxxx>
> Tested-by: Guangwu Zhang <guazhang@xxxxxxxxxx>
> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> ---
> V2:
> - fix "Cc: block list"
> - add tested-by tag
>
> lib/group_cpus.c | 22 ++++++++++++++++------
> 1 file changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index aa3f6815bb12..15006e79196f 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -348,6 +348,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
> {
> unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
> cpumask_var_t *node_to_cpumask;
> + cpumask_var_t local_cpu_present_mask;
> cpumask_var_t nmsk, npresmsk;
> int ret = -ENOMEM;
> struct cpumask *masks = NULL;
> @@ -355,6 +356,16 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
> if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
> return NULL;
>
> + if (!zalloc_cpumask_var(&local_cpu_present_mask, GFP_KERNEL))
> + goto fail_local_pres_mask;
> +
> + /*
> + * Make a local cache of 'cpu_present_mask', so the two stages
> + * spread can observe consistent 'cpu_present_mask' without holding
> + * cpu hotplug lock.
> + */
> + cpumask_copy(local_cpu_present_mask, cpu_present_mask);
> +

Maybe we can reuse npresmsk instead of allocating another cpumask?
In the first stage: npresmsk = cpu_present_mask
In the second stage: npresmsk = cpu_possible_mask & ~npresmsk
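
A rough userspace sketch of the idea, with a plain 64-bit integer standing
in for a cpumask. The type and helper names here are made up for
illustration and are not kernel API; in the kernel I would expect this to
map to cpumask_copy() followed by cpumask_andnot() with npresmsk as both
destination and second source, but that is only an assumption until tested.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit N set means CPU N is in the mask. */
typedef uint64_t mask_t;

/*
 * Stage 1: take a single snapshot of the present mask. Everything that
 * follows uses the snapshot, so later changes to the live mask do not matter.
 */
static mask_t stage1_mask(mask_t present_now)
{
	return present_now;
}

/*
 * Stage 2: reuse the same variable instead of a second allocation,
 * i.e. npresmsk = possible & ~npresmsk.
 */
static mask_t stage2_mask(mask_t possible, mask_t npresmsk)
{
	return possible & ~npresmsk;
}
```

Since the second stage is derived from the same snapshot as the first, the
two masks stay disjoint and together still cover every possible CPU, even
if the real cpu_present_mask changes between the stages.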

> if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
> goto fail_nmsk;
>
> @@ -366,13 +377,11 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
> if (!masks)
> goto fail_node_to_cpumask;
>
> - /* Stabilize the cpumasks */
> - cpus_read_lock();
> build_node_to_cpumask(node_to_cpumask);
>
> /* grouping present CPUs first */
> ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
> - cpu_present_mask, nmsk, masks);
> + local_cpu_present_mask, nmsk, masks);
> if (ret < 0)
> goto fail_build_affinity;
> nr_present = ret;
> @@ -387,15 +396,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
> curgrp = 0;
> else
> curgrp = nr_present;
> - cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
> + cpumask_andnot(npresmsk, cpu_possible_mask, local_cpu_present_mask);
> ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
> npresmsk, nmsk, masks);
> if (ret >= 0)
> nr_others = ret;
>
> fail_build_affinity:
> - cpus_read_unlock();
> -
> if (ret >= 0)
> WARN_ON(nr_present + nr_others < numgrps);

This fail_build_affinity label no longer seems to be needed.

The patch looks good to me:

Reviewed-by: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>

Thanks.

>
> @@ -406,6 +413,9 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
> free_cpumask_var(npresmsk);
>
> fail_nmsk:
> + free_cpumask_var(local_cpu_present_mask);
> +
> + fail_local_pres_mask:
> free_cpumask_var(nmsk);
> if (ret < 0) {
> kfree(masks);