Re: DRM-managed resources / devm_drm_dev_alloc leaking resources

From: Maxime Ripard
Date: Fri Nov 18 2022 - 11:16:15 EST


Hi,

Javier and I looked into it some more today, and I think we have a
better idea of what is going on.

On Thu, Nov 17, 2022 at 05:53:11PM +0100, Maxime Ripard wrote:
> After trying to get more kunit tests for KMS, I found out that the
> recent kunit helpers we merged to create a DRM device [1] are broken and
> won't free their device-managed and DRM-managed resources.
>
> With some help from Thomas, we've dug into this and it turns out that if
> we allocate a device with root_device_register, initialise our drm
> device with devm_drm_dev_alloc(), register it using drm_dev_register(),
> unregister it using drm_dev_unregister/drm_dev_unplug and then remove
> the parent device, neither the device managed nor the DRM managed
> actions are run.
>
> root_device_register initializes the device by eventually calling
> device_initialize() which sets the initial reference count of the root
> device to 1 [2]. devm_drm_dev_alloc() then comes in, drm_dev_init() will
> increase the root device refcount [3] and initialize our DRM device to 1
> [4]. drm_dev_register(), through drm_minor_register() and device_add(),
> will increase the root device refcount [5].
>
> When unrolling things, drm_dev_unregister(), through
> drm_minor_unregister() and device_del(), will give up its reference [6].
> root_device_unregister(), through device_unregister(), will also give up
> its own [7].
>
> So we end up with this for the reference counts:
>
> +------------------------+-------------+------------+
> | funcs | root device | DRM device |
> +------------------------+-------------+------------+
> | root_device_register | 1 | N/A |
> | devm_drm_dev_alloc | 2 | 1 |
> | drm_dev_register | 3 | 1 |
> | drm_dev_unregister | 2 | 1 |
> | root_device_unregister | 1 | 1 |
> +------------------------+-------------+------------+
>
> If we go back to the list of reference taken, the root device reference
> and the initial drm_device reference, both taken by devm_drm_dev_alloc()
> through drm_dev_init(), haven't been put back.
>
> If we look at the drm_dev_init code(), we can see that it sets up a
> DRM-managed action [8] that will put back the device reference [9]. The
> DRM-managed code is executed by the drm_managed_cleanup() function, that
> is executed as part of a release hook [10] executed once we give up the
> final reference to the DRM device [11].
>
> If we go back a little, the final reference to the DRM device is
> actually the initial one setup by devm_drm_dev_alloc(). This function
> has superseded drm_dev_alloc(), with the documentation that we do need a
> final drm_dev_put() to put back our final reference [12].
>
> devm_drm_dev_alloc() is a more convenient variant that has been
> introduced explicitly to not require that drm_dev_put(), and states it
> as such in the documentation [13]. It does so by adding a device-managed
> action that will call drm_dev_put() [14].
>
> Device-managed actions are ran as part devres_release_all() that is
> called by device_release() [15], itself being run when the last
> reference on the device is put back [16][17][18].
>
> So if we sum things up, the DRM device will only give its last root
> device reference when the last DRM device reference will be put back,
> and the last DRM device reference will be put back when the last device
> reference will be put back, which sounds very circular to me, with both
> ending up in a deadlock scenario.
>
> I've added two kunit tests that demonstrate the issue: we register a
> device, allocate and register a DRM device, register a DRM-managed
> action, remove the DRM device and the parent device, and wait for the
> action to execute. drm_register_unregister_with_devm_test() uses the
> broken(?) devm_drm_dev_alloc and is failing.
> drm_register_unregister_test uses the deprecated drm_dev_alloc() that
> requires an explicit call to drm_dev_put() which works fine.
>
> It's also worth noting that Thomas tested with simpledrm and it seems to
> work fine.

Indeed, the transition from simpledrm to a full-blown DRM driver handles
this properly. It's using a platform_device_unregister() [1] and
eventually device_del() [2][3]. That part is handled just like
root_device_unregister() [4][5]. Basically, both will call device_del(),
and then device_put(), so device_del() is called while holding a
reference.

As we've seen before, at this point the DRM driver still holds a
reference to the device as well.

device_del() will call bus_remove_device() [6], which will be skipped
for the root device since it doesn't have a bus [7].

It will then call device_release_driver() [8], which is basically forwarded
to __device_release_driver() [9][10], that will call device_unbind_cleanup() [11].

device_unbind_cleanup() calls devres_release_all() directly [12], that
runs all the device-managed actions [13]. And it does so WHILE THERE'S
STILL A REFCOUNT OF 2!

I would expect the call to devres_release_all to happen only in
device_release, once all the device reference have been put back. Not 4
functions in as a side effect, and while there's still some active
references.

> Using a platform_device instead of the root_device doesn't
> change anything to the outcome in my tests, so there might be a more
> subtle behaviour involved.

This one has the same symptom but a different cause. I was just
registering a platform_device but it wasn't bound to any driver. While
it's valid to do so according to that comment [13], it doesn't have any
driver so the check for the driver in device_release_driver() [8], and
never hits device_unbind_cleanup().

Thanks again to Thomas and Javier for their help
Maxime

1: https://elixir.bootlin.com/linux/latest/source/drivers/video/aperture.c#L199
2: https://elixir.bootlin.com/linux/latest/source/drivers/base/platform.c#L793
3: https://elixir.bootlin.com/linux/latest/source/drivers/base/platform.c#L751
4: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L4153
5: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L3733
6: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L3704
7: https://elixir.bootlin.com/linux/latest/source/drivers/base/bus.c#L511
8: https://elixir.bootlin.com/linux/latest/source/drivers/base/bus.c#L529
9: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L1298
10: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L1275
11: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L1255
12: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L530
13: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L2312

Attachment: signature.asc
Description: PGP signature