Re: [CRAZY-RFF] debugfs: track open files and release on remove

From: Greg KH
Date: Fri Oct 09 2020 - 04:03:12 EST


On Fri, Oct 09, 2020 at 09:53:06AM +0200, Johannes Berg wrote:
> [RFF = Request For Flaming]
>
> THIS IS PROBABLY COMPLETELY CRAZY.
>
> Currently, if a module is unloaded while debugfs files are being
> kept open, things crash since the ->release() method is called
> only at the actual release, despite the proxy_fops, and then the
> code is no longer around.
>
> The correct way to fix this is to annotate all the debugfs fops
> with .owner= THIS_MODULE, but as a learning exercise and to see
> how much hate I can possibly receive, I figured I'd try to work
> around this in debugfs itself.

For lots of debugfs files, .owner should already be set, if you use the
DEFINE_SIMPLE_ATTRIBUTE() or DEFINE_DEBUGFS_ATTRIBUTE() macros.

But yes, not all.

> First, if the fops have a release method and no owner, we track
> all the open files - currently using a very simple array data
> structure for it linked into struct debugfs_fsdata. This required
> changing the API of debugfs_file_get() and debugfs_file_put() to
> pass the struct file * to it.
>
> Then, once we know which files are still open at remove time, we
> can call all their release functions, and avoid calling them from
> the proxy_fops. Of course still call them from the proxy_fops if
> the file is still around, and clean up, so the release isn't all
> deferred to the end, but done as soon as possible.
>
> This introduces a potential locking issue, we could have a fops
> struct that takes a lock in its release that the same module is
> also taking around debugfs_remove(). To flag these issues using
> lockdep, take inode_lock_shared() in full_proxy_release(), see
> the comment there. If this triggers for some fops, add the owner
> as it should have been in the first place.
>
> Over adding the owner everywhere this has two small advantages:
> 1) you can unload the module while a debugfs file is open;
> 2) no need to change hundreds of places in the kernel :-)

I thought the proxy-ops stuff was supposed to fix this issue already.
Why isn't it, what is broken in them that causes this to still crash?

And of course, removing kernel modules is never a guaranteed operation,
nor is it anything that ever happens automatically, so is this really an
issue? :)

thanks,

greg k-h