Re: [PATCH v4 09/10] workqueue: Implement system-wide nr_active enforcement for unbound workqueues

From: Marek Szyprowski
Date: Tue Jan 30 2024 - 17:30:32 EST


Dear All,

On 29.01.2024 19:14, Tejun Heo wrote:
> From: Tejun Heo <tj@xxxxxxxxxx>
> Date: Mon, 29 Jan 2024 08:11:25 -1000
>
> A pool_workqueue (pwq) represents the connection between a workqueue and a
> worker_pool. One of the roles that a pwq plays is enforcement of the
> max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
> workqueues to use per-cpu pool_workqueues"), there was one pwq per CPU
> for per-cpu workqueues and one per NUMA node for unbound workqueues, which
> was a natural result of per-cpu workqueues being served by per-cpu pools and
> unbound workqueues by per-NUMA pools.
>
> In terms of max_active enforcement, this was, while not perfect, workable.
> For per-cpu workqueues, it was fine. For unbound workqueues, it wasn't great:
> NUMA machines would get an effective max_active multiplied by the number of
> nodes. Still, that didn't cause huge problems because NUMA machines are
> relatively rare and the node count is usually pretty low.
>
> However, cache layouts are more complex now and sharing a worker pool across
> a whole node didn't really work well for unbound workqueues. Thus, a series
> of commits culminating in 8639ecebc9b1 ("workqueue: Implement non-strict
> affinity scope for unbound workqueues") implemented a more flexible affinity
> mechanism for unbound workqueues which enables using e.g. last-level-cache
> aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
> workqueues to use per-cpu pool_workqueues") made unbound workqueues use
> per-cpu pwqs like per-cpu workqueues.
>
> While the change was necessary to enable more flexible affinity scopes, this
> came with the side effect of blowing up the effective max_active for unbound
> workqueues. Before, the effective max_active for unbound workqueues was
> multiplied by the number of nodes. After, by the number of CPUs.
>
> 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
> pool_workqueues") claims that this should generally be okay. It is okay for
> users which self-regulate their concurrency level, which are the vast
> majority; however, there are enough use cases which actually depend on
> max_active to prevent the level of concurrency from going bonkers, including several IO
> handling workqueues that can issue a work item for each in-flight IO. With
> targeted benchmarks, the misbehavior can easily be exposed as reported in
> http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
>
> Unfortunately, there is no way to express what these use cases need using
> per-cpu max_active. A single CPU may issue most of the in-flight IOs, so we
> don't want to set max_active too low, but as soon as we increase it a bit we
> can end up with an unreasonable number of in-flight work items when many CPUs
> issue IOs at the same time. i.e. the acceptable lowest max_active is higher
> than the acceptable highest max_active.
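>
> For illustration (the numbers are hypothetical, not from any specific
> report): if a workqueue should keep roughly a few dozen IOs in flight
> system-wide on a 128-CPU machine, even a per-cpu max_active of 1 already
> permits 128 concurrent work items when every CPU issues IO, while a single
> CPU that happens to issue most of the IO is then capped at just 1, so the
> per-cpu knob cannot satisfy both ends.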
>
> Ideally, max_active for an unbound workqueue should be system-wide so that
> the users can regulate the total level of concurrency regardless of node and
> cache layout. The reasons workqueue hasn't implemented that yet are:
>
> - Once max_active enforcement is decoupled from pool boundaries, chaining
> execution after a work item finishes requires inter-pool operations, which
> would mean nasty lock dancing.
>
> - Sharing a single nr_active count across the whole system can be pretty
> expensive on NUMA machines.
>
> - Per-pwq enforcement had been more or less okay while we were using
> per-node pools.
>
> It looks like we can no longer avoid decoupling max_active enforcement from
> pool boundaries. This patch implements system-wide nr_active mechanism with
> the following design characteristics:
>
> - To avoid sharing a single counter across multiple nodes, the configured
> max_active is split across nodes according to the proportion of each
> workqueue's online effective CPUs per node, e.g. a node with twice as many
> online effective CPUs will get twice the portion of max_active. (A
> simplified sketch of this split follows the list below.)
>
> - Workqueue used to be able to process a chain of interdependent work items
> that was as long as max_active. We can't do this anymore as max_active is
> distributed across the nodes. Instead, a new parameter min_active is
> introduced which determines the minimum level of concurrency within a node
> regardless of how max_active distribution comes out to be.
>
> It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE, which is 8.
> This can lead to a higher effective max_active than configured and also
> deadlocks if a workqueue was depending on being able to handle chains of
> interdependent work items that are longer than 8.
>
> I believe these should be fine given that the number of CPUs in each NUMA
> node is usually higher than 8 and a work item chain longer than 8 is pretty
> unlikely. However, if these assumptions turn out to be wrong, we'll need
> to add an interface to adjust min_active.
>
> - Each unbound wq has an array of struct wq_node_nr_active which tracks
> per-node nr_active. When one of its pwqs wants to run a work item, the pwq
> has to obtain the matching node's nr_active. If the node is already at its
> max_active, the pwq is queued on wq_node_nr_active->pending_pwqs. As work
> items finish, the completion path round-robins the pending pwqs, activating
> the first inactive work item of each, which involves some pool lock dancing
> and kicking other pools. It's not the simplest code but doesn't look too bad.
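>
> To make the split concrete, here is a minimal, self-contained C sketch of
> the per-node max_active distribution described above. It is illustrative
> only; the function name, the exact rounding, and the 2-node example are
> assumptions for this sketch, not the actual kernel implementation.
>
> #include <stdio.h>
>
> #define WQ_DFL_MIN_ACTIVE	8
>
> /* Per-node share of wq_max, proportional to the node's effective online
>  * CPUs and clamped below by wq_min (min_active). Illustrative only. */
> static int node_max_active(int wq_max, int wq_min, int node_cpus,
> 			   int total_cpus)
> {
> 	int share;
>
> 	if (!total_cpus)
> 		return wq_min;
> 	/* round up so the per-node shares never sum below wq_max */
> 	share = (wq_max * node_cpus + total_cpus - 1) / total_cpus;
> 	return share > wq_min ? share : wq_min;
> }
>
> int main(void)
> {
> 	/* hypothetical 2-node machine: 12 + 4 effective online CPUs */
> 	int cpus[] = { 12, 4 };
> 	int total = cpus[0] + cpus[1];
> 	int wq_max = 256;	/* configured max_active */
> 	int wq_min = wq_max < WQ_DFL_MIN_ACTIVE ? wq_max : WQ_DFL_MIN_ACTIVE;
>
> 	for (int n = 0; n < 2; n++)
> 		printf("node%d: max_active %d\n", n,
> 		       node_max_active(wq_max, wq_min, cpus[n], total));
> 	return 0;	/* prints 192 for node0, 64 for node1 */
> }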
>
> v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
>
> - wq_adjust_max_active() is now protected by wq->mutex instead of
> wq_pool_mutex.
>
> v3: - wq_node_max_active() used to calculate per-node max_active on the fly
> based on system-wide CPU online states. Lai pointed out that this can
> lead to skewed distributions for workqueues with restricted cpumasks.
> Update the max_active distribution to use per-workqueue effective
> online CPU counts instead of system-wide and cache the calculation
> results in node_nr_active->max.
>
> v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
>
> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
> Reported-by: Naohiro Aota <Naohiro.Aota@xxxxxxx>
> Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
> Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
> Reviewed-by: Lai Jiangshan <jiangshanlai@xxxxxxxxx>

This patch landed in linux-next (next-20240130) as commit 5797b1c18919
("workqueue: Implement system-wide nr_active enforcement for unbound
workqueues"). Unfortunately, it causes the following regression on the ARM64
RK3568 SoC based Odroid-M1 board:

Unable to handle kernel paging request at virtual address ffff0002100296e0
Mem abort info:
  ESR = 0x0000000096000005
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x05: level 1 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
  CM = 0, WnR = 0, TnD = 0, TagAccess = 0
  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
swapper pgtable: 4k pages, 48-bit VAs, pgdp=000000000255a000
[ffff0002100296e0] pgd=18000001ffff7003, p4d=18000001ffff7003,
pud=0000000000000000
Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0-rc2-next-20240130+ #14392
Hardware name: Hardkernel ODROID-M1 (DT)
pstate: 600000c9 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : wq_update_node_max_active+0x50/0x1fc
lr : wq_update_node_max_active+0x1f0/0x1fc
..
Call trace:
 wq_update_node_max_active+0x50/0x1fc
 apply_wqattrs_commit+0xf0/0x114
 apply_workqueue_attrs_locked+0x58/0xa0
 alloc_workqueue+0x5ac/0x774
 workqueue_init_early+0x460/0x540
 start_kernel+0x258/0x684
 __primary_switched+0xb8/0xc0
Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

This wasn't trivial to bisect, because next-20240130 suffers from other
regressions:
https://lore.kernel.org/all/30ddedc9-0829-4a99-9cb1-39190937981c@xxxxxxxxxxx/
but reverting this change, together with the ones mentioned in that thread,
fixes all the issues observed on top of today's linux-next release. Let me
know if there is anything I can do to help fix this issue.

> ---
> include/linux/workqueue.h | 35 +++-
> kernel/workqueue.c | 341 ++++++++++++++++++++++++++++++++++----
> 2 files changed, 341 insertions(+), 35 deletions(-)

> ...


Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland