Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug

From: Peter Zijlstra
Date: Fri Dec 11 2020 - 07:44:04 EST


On Fri, Dec 11, 2020 at 11:39:21AM +0000, Vincent Donnefort wrote:
> Hi Valentin,
>
> On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
> > Per-CPU kworkers forcefully migrated away by hotplug via
> > workqueue_offline_cpu() can end up spawning more kworkers via
> >
> > manage_workers() -> maybe_create_worker()
> >
> > Workers created at this point will be bound using
> >
> > pool->attrs->cpumask
> >
> > which in this case is wrong, as the hotplug state machine already migrated
> > all pinned kworkers away from this CPU. This ends up triggering the BUG_ON
> > condition is sched_cpu_dying() (i.e. there's a kworker enqueued on the
> > dying rq).
> >
> > Special-case workers being attached to DISASSOCIATED pools and bind them to
> > cpu_active_mask, mimicking them being present when workqueue_offline_cpu()
> > was invoked.
> >
> > Link: https://lore.kernel.org/r/ff62e3ee994efb3620177bf7b19fab16f4866845.camel@xxxxxxxxxx
> > Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
>
> Isn't the problem introduced by 1cf12e0 ("sched/hotplug: Consolidate
> task migration on CPU unplug") ?
>
> Previously we had:
>
> AP_WORKQUEUE_ONLINE -> set POOL_DISASSOCIATED
> ...
> TEARDOWN_CPU -> clear CPU in cpu_online_mask
> |
> |-AP_SCHED_STARTING -> migrate_tasks()
> |
> AP_OFFLINE
>
> worker_attach_to_pool(), is "protected" by the cpu_online_mask in
> set_cpus_allowed_ptr(). IIUC, now, the tasks being migrated before the
> cpu_online_mask is actually flipped, there's a window, between
> CPUHP_AP_SCHED_WAIT_EMPTY and CPUHP_TEARDOWN_CPU where a kworker can wake-up
> a new one, for the hotunplugged pool that wouldn't be caught by the
> hotunplug migration.

Yes, very much so, however the commit Valentin picked was supposed to
preemptively fix this. So we can consider this a fix for the fix.

But I don't mind an alternative or perhaps even second Fixes tag on
this.