Re: [PATCH v3 2/2] of: overlay: Synchronize of_overlay_remove() with the devlink removals

From: Nuno Sá
Date: Mon Mar 04 2024 - 10:34:48 EST


On Mon, 2024-03-04 at 09:22 -0600, Rob Herring wrote:
> On Thu, Feb 29, 2024 at 12:18:49PM +0100, Nuno Sá wrote:
> > On Thu, 2024-02-29 at 11:52 +0100, Herve Codina wrote:
> > > In the following sequence:
> > >   1) of_platform_depopulate()
> > >   2) of_overlay_remove()
> > >
> > > During the step 1, devices are destroyed and devlinks are removed.
> > > During the step 2, OF nodes are destroyed but
> > > __of_changeset_entry_destroy() can raise warnings related to missing
> > > of_node_put():
> > >   ERROR: memory leak, expected refcount 1 instead of 2 ...
> > >
> > > Indeed, during the devlink removals performed at step 1, the removal
> > > itself releasing the device (and the attached of_node) is done by a job
> > > queued in a workqueue and so, it is done asynchronously with respect to
> > > function calls.
> > > When the warning is present, of_node_put() will be called but wrongly
> > > too late from the workqueue job.
> > >
> > > In order to be sure that any ongoing devlink removals are done before
> > > the of_node destruction, synchronize the of_overlay_remove() with the
> > > devlink removals.
> > >
> > > Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
> > > Cc: stable@xxxxxxxxxxxxxxx
> > > Signed-off-by: Herve Codina <herve.codina@xxxxxxxxxxx>
> > > ---
> > >  drivers/of/overlay.c | 10 +++++++++-
> > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> > > index 2ae7e9d24a64..7a010a62b9d8 100644
> > > --- a/drivers/of/overlay.c
> > > +++ b/drivers/of/overlay.c
> > > @@ -8,6 +8,7 @@
> > >  
> > >  #define pr_fmt(fmt) "OF: overlay: " fmt
> > >  
> > > +#include <linux/device.h>
> >
> > This is clearly up to the DT maintainers to decide but, IMHO, I would very
> > much prefer to see fwnode.h included in here rather than directly device.h
> > (so yeah, renaming the function to fwnode_*).
>
> IMO, the DT code should know almost nothing about fwnode because that's
> the layer above it. But then overlay stuff is kind of a layer above the
> core DT code too.

Yeah, my reasoning is just that it may be better than knowing about device.h
code... But maybe I'm wrong :)

>
> > But yeah, I might be biased by my own series :)
> >
> > >  #include <linux/kernel.h>
> > >  #include <linux/module.h>
> > >  #include <linux/of.h>
> > > @@ -853,6 +854,14 @@ static void free_overlay_changeset(struct overlay_changeset *ovcs)
> > >  {
> > >   int i;
> > >  
> > > + /*
> > > + * Wait for any ongoing device link removals before removing some of
> > > + * nodes. Drop the global lock while waiting
> > > + */
> > > + mutex_unlock(&of_mutex);
> > > + device_link_wait_removal();
> > > + mutex_lock(&of_mutex);
> >
> > I'm still not convinced we need to drop the lock. What happens if someone
> > else grabs the lock while we are in device_link_wait_removal()? Can we
> > guarantee that we can't screw things badly?
>
> It is also just ugly because it's the callers of
> free_overlay_changeset() that hold the lock and now we're releasing it
> behind their back.
>
> As device_link_wait_removal() is called before we touch anything, can't
> it be called before we take the lock? And do we need to call it if
> applying the overlay fails?
>

My natural feeling was to put it right before checking the node refcount... and
I would still like to see proof that there's any potential deadlock. I have not
checked the code, but the issue with calling it before we take the lock is that
the device links likely won't be removed yet, because the overlay removal path
(which unbinds devices from drivers) needs to run under the lock?
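
To make the ordering question concrete, here is a minimal userspace sketch of
the race, not kernel code. Every name in it (fake_of_mutex, release_job,
fake_wait_removal, fake_overlay_remove) is a hypothetical stand-in, a pthread
plays the role of the workqueue job, and it assumes the release job never
takes the lock itself, which is exactly the point still open in this thread:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins: none of these are kernel APIs. */
static pthread_mutex_t fake_of_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_int node_refcount;
static pthread_t release_worker;

/* Plays the role of the queued job that drops the device's node reference. */
static void *release_job(void *arg)
{
	(void)arg;
	atomic_fetch_sub(&node_refcount, 1);
	return NULL;
}

/* Step 1 analog: queue the asynchronous release and return immediately. */
static int fake_depopulate(void)
{
	return pthread_create(&release_worker, NULL, release_job, NULL);
}

/* Analog of waiting for pending removals: block until the job has run. */
static void fake_wait_removal(void)
{
	int rc = pthread_join(release_worker, NULL);

	assert(rc == 0);
	(void)rc;
}

/*
 * Step 2 analog: returns the refcount seen at the point where the real code
 * would check it. Waiting while holding the lock is only safe here because
 * release_job() never takes fake_of_mutex (assumption of this sketch).
 */
int fake_overlay_remove(int wait_first)
{
	int seen;

	/* One reference owned by the node, one still held by the device. */
	atomic_store(&node_refcount, 2);
	if (fake_depopulate() != 0)
		return -1;

	pthread_mutex_lock(&fake_of_mutex);
	if (wait_first)
		fake_wait_removal();
	seen = atomic_load(&node_refcount);
	pthread_mutex_unlock(&fake_of_mutex);

	if (!wait_first)
		fake_wait_removal();	/* always reap the worker thread */
	return seen;
}
```

With wait_first set, fake_overlay_remove() always observes a refcount of 1.
Without it, the observed value depends on scheduling, which is the analog of
the intermittent "expected refcount 1 instead of 2" warning.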

- Nuno Sá