Re: [PATCH] x86/resctrl: Fix memory leak on kernfs dir removal

From: Willem de Bruijn
Date: Mon Oct 26 2020 - 12:50:25 EST


On Mon, Oct 26, 2020 at 12:24 PM Reinette Chatre
<reinette.chatre@xxxxxxxxx> wrote:
>
> +Xiaochen
>
> Hi Willem,
>
> As you described in the report you sent directly to us there are indeed
> more issues than the one described here surrounding the kernfs node
> reference counting in resctrl. Xiaochen is actively working on patch(es)
> for all the issues and you could continue working with him ... now
> externally?

Great to hear. I wasn't aware of that. Of course. Externally or
off-list first, whichever you prefer.

For reference, one other issue occurs on mount/umount:

for i in {1..200000}; do
mount -t resctrl resctrl /sys/fs/resctrl;
umount /sys/fs/resctrl;
done

> On 10/26/2020 8:09 AM, Willem de Bruijn wrote:
> > From: Willem de Bruijn <willemb@xxxxxxxxxx>
> >
> > Resctrl takes an extra kernfs ref on directory entries, to access
> > the entry on cleanup in rdtgroup_kn_unlock after removing the entire
> > subtree with kfree_remove.
> >
> > But the path takes an extra ref both on mkdir and on rmdir.
> >
>
> On resource group (control as well as monitoring) creation via a mkdir
> an extra kernfs node reference is obtained to ensure that the rdtgroup
> structure remains accessible for the rdtgroup_kn_unlock() calls where it
> is removed on deletion. This symmetry ties the resource group's lifetime
> with the kernfs node. The extra kernfs node reference count is dropped
> by kernfs_put() in rdtgroup_kn_unlock() as is documented in the comment
> removed by this patch.
>
> As you state there is an extra reference obtained in rmdir, that is
> unnecessary.
>
> > The kernfs_get on mkdir causes a memleak in the unlikely exit with
> > error in the same function, as no extra kernfs_put exists and no extra
> > rdtgroup_kn_unlock occurs.
>
> This is a bug.
>
> >
> > More importantly, essentially the same happens in the normal path, as
> > this simple program demonstrates:
> >
> > for i in {1..200000}; do
> > mkdir /sys/fs/resctrl/task1
> > rmdir /sys/fs/resctrl/task1
> > done
> > slabtop
> >
> > When taking an extra ref for the duration of kernfs_remove, it is
> > easiest to reason about when holding this extra ref as short as
> > possible. For that, the refcnt on error reason and free on umount
> > (rmdir_all_sub), remove the first kernfs_get on mkdir, leaving the
> > other on rmdir.
>
> rmdir_all_sub() may be prevented from just removing the resource group
> if there are any waiters. In this case the resource group would be
> removed by rdtgroup_kn_unlock() by the last waiter at which point a
> reference would be dropped. With this patch there would be no reference
> to drop.

Ah, indeed. It would be easier to reason about if rdtgroup_kn_lock_live
takes an extra ref that rdtgroup_kn_unlock releases? But either way.
I had certainly missed that path.

> Indeed, there is another issue where the kfree(rdtgrp) in
> rmdir_all_sub() (the case when there are no waiters) is missing a
> kernfs_put(). Xiaochen is meticulously working through all of this.
>
> >
> > As the caller of rdtgroup_rmdir, kernfs_iop_rmdir, itself takes a
> > reference on the kernfs object, the extra reference is possibly not
> > needed at all.
>
> This is not obvious to me. Are you referring to
> kernfs_iop_rmdir()->kernfs_get_active(kn)? That is a different reference
> (kn->active as opposed to kn->count)?

I thought that would have the same effect of ensuring that kn can
be dereferenced safely throughout rdtgroup_rmdir. But judging from the
WARN_ONCE in kernfs_put the rules on count vs active are not quite that
simple.