Re: [PATCH 6/7] workqueue: Implement system-wide max_active enforcement for unbound workqueues

From: Tejun Heo
Date: Wed Dec 27 2023 - 18:06:43 EST


Hello, Lai.

On Wed, Dec 27, 2023 at 10:51:42PM +0800, Lai Jiangshan wrote:
> static int pwq_calculate_max_active(struct pool_workqueue *pwq)
> {
> + int pwq_nr_online_cpus;
> + int max_active;
> +
> /*
> * During [un]freezing, the caller is responsible for ensuring
> * that pwq_adjust_max_active() is called at least once after
> @@ -4152,7 +4158,18 @@ static int pwq_calculate_max_active(struct pool_workqueue *pwq)
> if ((pwq->wq->flags & WQ_FREEZABLE) && workqueue_freezing)
> return 0;
>
> - return pwq->wq->saved_max_active;
> + if (!(pwq->wq->flags & WQ_UNBOUND))
> + return pwq->wq->saved_max_active;
> +
> + pwq_nr_online_cpus = cpumask_weight_and(pwq->pool->attrs->__pod_cpumask, cpu_online_mask);
> + max_active = DIV_ROUND_UP(pwq->wq->saved_max_active * pwq_nr_online_cpus, num_online_cpus());

So, the problem with this approach is that we can end up segmenting
max_active into too many, too small pieces. Imagine a system with an AMD
EPYC 9754 - 256 threads spread across 16 L3 caches. Let's say there's a
workqueue used for IO (e.g. encryption) with the default CACHE
affinity_scope and max_active of 2 * nr_cpus, which isn't uncommon for this
type of workqueue.

The above code would limit each L3 domain to 32 concurrent work items. Let's
say a thread which is pinned to a CPU is issuing a lot of concurrent writes
with the expectation of being able to saturate all the CPUs. It won't be
able to even get close. The expected behavior is saturating all 256 CPUs on
the system. The resulting behavior would be saturating an eighth of them.
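
For reference, here's the arithmetic behind those numbers using the formula
quoted above (a rough sketch, assuming all CPUs online and an even split of
16 CPUs per L3 pod):

	/* saved_max_active = 2 * nr_cpus = 2 * 256 = 512 */
	pwq_nr_online_cpus = 16;			/* one L3's worth of CPUs */
	max_active = DIV_ROUND_UP(512 * 16, 256);	/* = 32 per L3 pod */

A submitter pinned to one CPU queues all of its work items on that CPU's pod
pwq, so it's capped at 32 in-flight work items - an eighth of the 256 CPUs
it's trying to saturate.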

The crux of the problem is that the desired worker pool domain and
max_active enforcement domain don't match. We want to be fine grained with
the former but pretty close to the whole system for the latter.

Thanks.

--
tejun