Re: [PATCH 09/24] workqueue: Make unbound workqueues to use per-cpu pool_workqueues

From: Dennis Dalessandro
Date: Mon May 22 2023 - 08:31:59 EST


On 5/22/23 2:41 AM, Leon Romanovsky wrote:
> On Thu, May 18, 2023 at 02:16:54PM -1000, Tejun Heo wrote:
>> A pwq (pool_workqueue) represents an association between a workqueue and a
>> worker_pool. When a work item is queued, the workqueue selects the pwq to
>> use, which in turn determines the pool, and queues the work item to the pool
>> through the pwq. pwq is also what implements the maximum concurrency limit -
>> @max_active.
>>
>> As a per-cpu workqueue should be associated with a different worker_pool on
>> each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
>> However, unbound workqueues were sharing a pwq within each NUMA node by
>> default. The sharing has several downsides:
>>
>> * Because @max_active is per-pwq, the meaning of @max_active changes
>> depending on the machine configuration and whether workqueue NUMA locality
>> support is enabled.
>>
>> * Makes per-cpu and unbound code deviate.
>>
>> * Gets in the way of making workqueue CPU locality awareness more flexible.
>>
>> This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
>> workqueues do by making the following changes:
>>
>> * wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
>> just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
>> workqueues.
>>
>> * numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
>> the specified pwq to the target CPU's wq->cpu_pwq.
>>
>> * apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
>> unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
>> This makes the return value of wq_calc_node_cpumask() unnecessary. It now
>> returns void.
>>
>> * @max_active now means the same thing for both per-cpu and unbound
>> workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
>> documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
>> used in workqueue implementation and will be removed later.
>>
>> * All unbound pwq operations which used to be per-numa-node are now per-cpu.
>>
>> For most unbound workqueue users, this shouldn't cause noticeable changes.
>> Work item issue and completion will be a small bit faster, flush_workqueue()
>> would become a bit more expensive, and the total concurrency limit would
>> likely become higher. All @max_active==1 use cases are currently being
>> audited for conversion into alloc_ordered_workqueue() and they shouldn't be
>> affected once the audit and conversion is complete.
>>
>> One area where the behavior change may be more noticeable is
>> workqueue_congested() as the reported congestion state is now per CPU
>> instead of NUMA node. There are only two users of this interface -
>> drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
>> cc'd. Inputs on the behavior change would be very much appreciated.
>
> At least for hfi1, it seems like your changes won't cause any
> differences, as the NUMA node is expected to be connected to the closest
> CPU anyway in setups relevant to hfi1.
>
> Dennis, am I right?
>
> Thanks

I can see this changing when things are considered congested, since congestion
is now tracked per CPU rather than per NUMA node. However, that seems like a
good thing for hfi1.
The purpose of the code in hfi1 is to decide if QP processing should yield the
CPU and allow other QPs to make progress.
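
To illustrate the granularity change, here is a minimal userspace sketch. It is
a toy model, not kernel code: every toy_* name, the 4-CPU/2-node layout, and the
rule "congested once more items are queued than max_active allows" are
assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy topology: 4 CPUs split evenly across 2 NUMA nodes. */
#define TOY_NR_CPUS    4
#define TOY_NR_NODES   2
#define TOY_MAX_ACTIVE 2   /* stands in for @max_active, applied per pwq */

enum toy_mode { TOY_PER_NODE, TOY_PER_CPU };

struct toy_wq {
	enum toy_mode mode;
	/* One counter per pwq; per-node mode only uses the first TOY_NR_NODES. */
	int queued[TOY_NR_CPUS];
};

static int toy_cpu_to_node(int cpu)
{
	return cpu / (TOY_NR_CPUS / TOY_NR_NODES);
}

/* Pick the pwq a CPU maps to: shared per node (old) or private per CPU (new). */
static int toy_pwq_index(const struct toy_wq *wq, int cpu)
{
	return wq->mode == TOY_PER_NODE ? toy_cpu_to_node(cpu) : cpu;
}

static void toy_queue_work(struct toy_wq *wq, int cpu)
{
	wq->queued[toy_pwq_index(wq, cpu)]++;
}

/* Assumed rule: congested when items are waiting beyond the pwq's limit. */
static bool toy_congested(const struct toy_wq *wq, int cpu)
{
	return wq->queued[toy_pwq_index(wq, cpu)] > TOY_MAX_ACTIVE;
}
```

In this model, queueing three items from CPU 0 makes CPU 1 look congested in
per-node mode, because both CPUs share a pwq, but not in per-CPU mode. A yield
decision made on CPU 1 would then reflect only CPU 1's own backlog.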

Acked-by: Dennis Dalessandro <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx>