Re: [PATCH] module: print module name on refcount error

From: Jean Delvare
Date: Mon Jul 03 2023 - 06:45:44 EST


On Sat, 1 Jul 2023 17:57:27 +0200, Jean Delvare wrote:
> On Fri, 30 Jun 2023 16:05:33 -0700, Luis Chamberlain wrote:
> > On Mon, Jun 26, 2023 at 12:32:52PM +0200, Jean Delvare wrote:
> > > If module_put() triggers a refcount error, include the culprit
> > > module name in the warning message, to ease further investigation of
> > > the issue.
> > >
> > > Signed-off-by: Jean Delvare <jdelvare@xxxxxxx>
> > > Suggested-by: Michal Hocko <mhocko@xxxxxxxx>
> > > Cc: Luis Chamberlain <mcgrof@xxxxxxxxxx>
> > > ---
> > > kernel/module/main.c | 4 +++-
> > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > --- linux-6.3.orig/kernel/module/main.c
> > > +++ linux-6.3/kernel/module/main.c
> > > @@ -850,7 +850,9 @@ void module_put(struct module *module)
> > >  	if (module) {
> > >  		preempt_disable();
> > >  		ret = atomic_dec_if_positive(&module->refcnt);
> > > -		WARN_ON(ret < 0);	/* Failed to put refcount */
> > > +		WARN(ret < 0,
> > > +		     KERN_WARNING "Failed to put refcount for module %s\n",
> > > +		     module->name);
> > >  		trace_module_put(module, _RET_IP_);
> > >  		preempt_enable();
> > >  	}
> >
> > The mod struct is in fact allocated: we first read the ELF
> > passed by userspace, then allocate space for struct module
> > when reading the ELF section ".gnu.linkonce.this_module". We cache
> > the ELF section index in info->index.mod, and finally copy the module
> > into the allocated space with move_module().
> >
> > In linux-next code this is much more clear now.
> >
> > What prevents us from racing to free the module and thus invalidating
> > the name?
> >
> > For instance the delete_module() system call could be hammered, with
> > tons of threads racing in try_stop_module(); eventually one of
> > them could win and free_module() would kick in.
> >
> > What prevents code from racing the free with a random module_put()
> > called by some other piece of code?
> >
> > I realize this may imply that even the existing code is racy.
>
> You are the maintainer so I'll trust your expertise, but this is how I
> understand it: if we hit this WARN, this means reference counting is
> screwed. If this is an underflow, we still have a reference to the
> module while refcnt is zero, meaning the module could be removed at any
> time. This is inherent to the issue we are reporting, and not related
> to the proposed change. The name is just one field of struct module,
> refcnt is in the very same situation already.
>
> So the whole piece of code is best effort reporting and assumes (both
> before and after my proposed change) that nobody attempted to unload
> the module yet.

I thought some more about it, and one potential problem with my proposed
change arises if the module has indeed already been freed and its memory
already reused for a different purpose. We are in trouble already (we
just called atomic_dec_if_positive() on a random memory location), but the
WARN message could become very messy if the memory where module->name
used to reside no longer contains any string terminator (binary zero).

So we probably want to play it safe and add a length limitation when
printing the module name. Something like:

	WARN(ret < 0,
	     KERN_WARNING "Failed to put refcount for module %.*s\n",
	     (int)MODULE_NAME_LEN, module->name);
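
To illustrate what the precision buys us, here is a minimal userspace
sketch (not kernel code). NAME_LEN and format_name() are hypothetical
stand-ins I made up for the example, with NAME_LEN playing the role of
MODULE_NAME_LEN. It relies only on the standard printf rule that "%.*s"
reads at most the given number of bytes, so the source buffer does not
need a terminator within that range:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for MODULE_NAME_LEN, kept small for the demo. */
#define NAME_LEN 8

/*
 * Format "module <name>" into out. The "%.*s" precision caps the read
 * at NAME_LEN bytes, so name may lack a NUL terminator as long as at
 * least NAME_LEN bytes of it are readable.
 * Returns the snprintf() result (length of the formatted string).
 */
static int format_name(char *out, size_t outlen, const char *name)
{
	return snprintf(out, outlen, "module %.*s", (int)NAME_LEN, name);
}
```

With a properly terminated short name the output is unchanged. With a
buffer full of non-zero bytes, the output stops after NAME_LEN
characters instead of running off into whatever follows. This only
bounds the length of the message; it does nothing about the contents
being garbage, which is acceptable for best-effort reporting.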

--
Jean Delvare
SUSE L3 Support