Re: [PATCH v4 1/2] driver core: Introduce device_link_wait_removal()

From: Rafael J. Wysocki
Date: Wed Mar 06 2024 - 10:57:43 EST


On Wed, Mar 6, 2024 at 4:24 PM Herve Codina <herve.codina@xxxxxxxxxxx> wrote:
>
> Hi Rafael,
>
> On Wed, 6 Mar 2024 13:48:37 +0100
> "Rafael J. Wysocki" <rafael@xxxxxxxxxx> wrote:
>
> > On Wed, Mar 6, 2024 at 9:51 AM Herve Codina <herve.codina@xxxxxxxxxxx> wrote:
> > >
> > > The commit 80dd33cf72d1 ("drivers: base: Fix device link removal")
> > > introduces a workqueue to release the consumer and supplier devices used
> > > in the devlink.
> > > In the job queued, devices are release and in turn, when all the
> > > references to these devices are dropped, the release function of the
> > > device itself is called.
> > >
> > > Nothing is present to provide some synchronisation with this workqueue
> > > in order to ensure that all ongoing releasing operations are done and
> > > so, some other operations can be started safely.
> > >
> > > For instance, in the following sequence:
> > > 1) of_platform_depopulate()
> > > 2) of_overlay_remove()
> > >
> > > During the step 1, devices are released and related devlinks are removed
> > > (jobs pushed in the workqueue).
> > > During the step 2, OF nodes are destroyed but, without any
> > > synchronisation with devlink removal jobs, of_overlay_remove() can raise
> > > warnings related to missing of_node_put():
> > > ERROR: memory leak, expected refcount 1 instead of 2
> > >
> > > Indeed, the missing of_node_put() call is going to be done, too late,
> > > from the workqueue job execution.
> > >
> > > Introduce device_link_wait_removal() to offer a way to synchronize
> > > operations waiting for the end of devlink removals (i.e. end of
> > > workqueue jobs).
> > > Also, as a flushing operation is done on the workqueue, the workqueue
> > > used is moved from a system-wide workqueue to a local one.
> > >
> > > Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
> >
> > No, it is not fixed by this patch.
>
> Was explicitly asked by Saravana on v1 review:
> https://lore.kernel.org/linux-kernel/CAGETcx9uP86EHyKJNifBMd23oCsA+KpMa+e36wJEEnHDve+Avg@xxxxxxxxxxxxxx/

Well, I don't think this is a valid request, sorry.

> The commit 80dd33cf72d1 introduces the workqueue and so some asynchronous tasks
> on removal.
> This patch and the next one allows to re-sync execution waiting for jobs in
> the workqueue when it is needed.

I get this, but still, this particular individual patch by itself
doesn't fix anything. Do you agree with this?

If somebody applies this patch without the next one in the series,
they will not get any change in behavior, so the tag is at least
misleading.

You can claim that the next patch on top of this one fixes things, so
adding a Fixes tag to the other patch would be fine.

There is an explicit dependency between them (the second patch is not
even applicable without the first one, or if it is, the resulting code
won't compile anyway), but you can make a note to the maintainer that
they need to go to -stable together.

> >
> > In fact, the only possibly observable effect of this patch is the
> > failure when the allocation of device_link_wq fails AFAICS.
> >
> > > Cc: stable@xxxxxxxxxxxxxxx
> >
> > So why?
>
> Cc:stable is needed as this patch is a prerequisite of patch 2 (needed
> to fix the asynchronous workqueue task issue).

Dependencies like this can be expressed in "Cc: stable" tags.
Personally, I'd do it like this:

Cc: stable@xxxxxxxxxxxxxxx # 5.13: Depends on the first patch in the series