Re: [PATCH] kernfs: attach uuid for every kernfs and report it in fsid

From: Amir Goldstein
Date: Wed Jul 12 2023 - 01:44:12 EST


On Wed, Jul 12, 2023 at 1:04 AM Ivan Babrou <ivan@xxxxxxxxxxxxxx> wrote:
>
> On Tue, Jul 11, 2023 at 2:49 AM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> >
> > On Tue, Jul 11, 2023 at 12:21 AM Ivan Babrou <ivan@xxxxxxxxxxxxxx> wrote:
> > >
> > > On Mon, Jul 10, 2023 at 12:40 PM Greg Kroah-Hartman
> > > <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Jul 10, 2023 at 11:33:38AM -0700, Ivan Babrou wrote:
> > > > > The following two commits added the same thing for tmpfs:
> > > > >
> > > > > * commit 2b4db79618ad ("tmpfs: generate random sb->s_uuid")
> > > > > * commit 59cda49ecf6c ("shmem: allow reporting fanotify events with file handles on tmpfs")
> > > > >
> > > > > Having fsid allows using fanotify, which is especially handy for cgroups,
> > > > > where one might be interested in knowing when they are created or removed.
> > > > >
> > > > > Signed-off-by: Ivan Babrou <ivan@xxxxxxxxxxxxxx>
> > > > > ---
> > > > > fs/kernfs/mount.c | 13 ++++++++++++-
> > > > > 1 file changed, 12 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > > > > index d49606accb07..930026842359 100644
> > > > > --- a/fs/kernfs/mount.c
> > > > > +++ b/fs/kernfs/mount.c
> > > > > @@ -16,6 +16,8 @@
> > > > > #include <linux/namei.h>
> > > > > #include <linux/seq_file.h>
> > > > > #include <linux/exportfs.h>
> > > > > +#include <linux/uuid.h>
> > > > > +#include <linux/statfs.h>
> > > > >
> > > > > #include "kernfs-internal.h"
> > > > >
> > > > > @@ -45,8 +47,15 @@ static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry)
> > > > > return 0;
> > > > > }
> > > > >
> > > > > +int kernfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> > > > > +{
> > > > > + simple_statfs(dentry, buf);
> > > > > + buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b);
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > const struct super_operations kernfs_sops = {
> > > > > - .statfs = simple_statfs,
> > > > > + .statfs = kernfs_statfs,
> > > > > .drop_inode = generic_delete_inode,
> > > > > .evict_inode = kernfs_evict_inode,
> > > > >
> > > > > @@ -351,6 +360,8 @@ int kernfs_get_tree(struct fs_context *fc)
> > > > > }
> > > > > sb->s_flags |= SB_ACTIVE;
> > > > >
> > > > > + uuid_gen(&sb->s_uuid);
> > > >
> > > > Since kernfs has as lot of nodes (like hundreds of thousands if not more
> > > > at times, being created at boot time), did you just slow down creating
> > > > them all, and increase the memory usage in a measurable way?
> > >
> > > This is just for the superblock, not every inode. The memory increase
> > > is one UUID per kernfs instance (there are maybe 10 of them on a basic
> > > system), which is trivial. Same goes for CPU usage.
> > >
> > > > We were trying to slim things down, what userspace tools need this
> > > > change? Who is going to use it, and what for?
> > >
> > > The one concrete thing is ebpf_exporter:
> > >
> > > * https://github.com/cloudflare/ebpf_exporter
> > >
> > > I want to monitor cgroup changes, so that I can have an up to date map
> > > of inode -> cgroup path, so that I can resolve the value returned from
> > > bpf_get_current_cgroup_id() into something that a human can easily
> > > grasp (think system.slice/nginx.service). Currently I do a full sweep
> > > to build a map, which doesn't work if a cgroup is short lived, as it
> > > just disappears before I can resolve it. Unfortunately, systemd
> > > recycles cgroups on restart, changing inode number, so this is a very
> > > real issue.
> > >
> > > There's also this old wiki page from systemd:
> > >
> > > * https://freedesktop.org/wiki/Software/systemd/Optimizations
> > >
> > > Quoting from there:
> > >
> > > > Get rid of systemd-cgroups-agent. Currently, whenever a systemd cgroup runs empty a tool "systemd-cgroups-agent" is invoked by the kernel which then notifies systemd about it. The need for this tool should really go away, which will save a number of forked processes at boot, and should make things faster (especially shutdown). This requires introduction of a new kernel interface to get notifications for cgroups running empty, for example via fanotify() on cgroupfs.
> > >
> > > So a similar need to mine, but for different systemd-related needs.
> > >
> > > Initially I tried adding this for cgroup fs only, but the problem felt
> > > very generic, so I pivoted to having it in kernfs instead, so that any
> > > kernfs based filesystem would benefit.
> > >
> > > Given pretty much non-existing overhead and simplicity of this, I
> > > think it's a change worth doing, unless there's a good reason to not
> > > do it. I cc'd plenty of people to make sure it's not a bad decision.
> > >
> >
> > I agree. I think it was a good decision.
> > I have some followup questions though.
> >
> > I guess your use case cares about the creation of cgroups?
> > as long as the only way to create a cgroup is via vfs
> > vfs_mkdir() -> ... cgroup_mkdir()
> > fsnotify_mkdir() will be called.
> > Is that a correct statement?
>
> As far as I'm aware, this is the only way. We have the cgroups mailing
> list CC'd to confirm.
>
> I checked systemd and docker as real world consumers and both use
> mkdir and are visible in fanotify with this patch applied.
>
> > Because if not, then explicit fsnotify_mkdir() calls may be needed
> > similar to tracefs/debugfs.
> >
> > I don't think that the statement holds for dieing cgroups,
> > so explicit fsnotify_rmdir() are almost certainly needed to make
> > inotify/fanotify monitoring on cgroups complete.
> >
> > I am on the fence w.r.t making the above a prerequisite to merging
> > your patch.
> >
> > One the one hand, inotify monitoring of cgroups directory was already
> > possible (I think?) with the mentioned shortcomings for a long time.
> >
> > On the other hand, we have an opportunity to add support to fanotify
> > monitoring of cgroups directory only after the missing fsnotify hooks
> > are added, making fanotify API a much more reliable option for
> > monitoring cgroups.
> >
> > So I am leaning towards requiring the missing fsnotify hooks before
> > attaching a unique fsid to cgroups/kernfs.
>
> Unless somebody responsible for cgroups says there's a different way
> to create cgroups, I think this requirement doesn't apply.
>

I was more concerned about the reliability of FAN_DELETE for
dieing cgroups without an explicit rmdir() from userspace.

> > In any case, either with or without the missing hooks, I would not
> > want this patch merged until Jan had a chance to look at the
> > implications and weigh in on the missing hooks question.
> > Jan is on vacation for three weeks, so in the meanwhile, feel free
> > to implement and test the missing hooks or wait for his judgement.
>
> Sure, I can definitely wait.
>
> > On an unrelated side topic,
> > I would like to point your attention to this comment in the patch that
> > was just merged to v6.5-rc1:
> >
> > 69562eb0bd3e ("fanotify: disallow mount/sb marks on kernel internal pseudo fs")
> >
> > /*
> > * mount and sb marks are not allowed on kernel internal pseudo fs,
> > * like pipe_mnt, because that would subscribe to events on all the
> > * anonynous pipes in the system.
> > *
> > * SB_NOUSER covers all of the internal pseudo fs whose objects are not
> > * exposed to user's mount namespace, but there are other SB_KERNMOUNT
> > * fs, like nsfs, debugfs, for which the value of allowing sb and mount
> > * mark is questionable. For now we leave them alone.
> > */
> >
> > My question to you, as the only user I know of for fanotify FAN_REPORT_FID
> > on SB_KERNMOUNT, do you have plans to use a mount or filesystem mark
> > to monitor cgroups? or only inotify-like directory watches?
>
> My plan is to use FAN_MARK_FILESYSTEM for the whole cgroup mount,
> since that is what I was able to make work with my limited
> understanding of the whole fanotify thing. I started with
> fanotify_fid.c example from here:
>
> * https://man7.org/linux/man-pages/man7/fanotify.7.html
>
> My existing code does the mark this way (on v6.5-rc1 with my patch applied):
>
> ret = fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_ONLYDIR |
> FAN_MARK_FILESYSTEM, FAN_CREATE | FAN_DELETE | FAN_ONDIR, AT_FDCWD,
> argv[1]);
>
> My goal is to set a watch for all cgroups and drop capabilities, so
> that I can keep monitoring for events while being unprivileged.

As long as you do not need to open_by_handle_at() or as long as you
retain CAP_DAC_READ_SEARCH.

> As far
> as I'm aware, this sort of recursive monitoring without races isn't
> possible with inode level monitoring (I might be wrong here).

You are not wrong.

>
> I do get -EINVAL for FAN_MARK_MOUNT instead of FAN_MARK_FILESYSTEM.

Right. FAN_CREATE | FAN_DELETE not supported with FAN_MARK_MOUNT.

Thanks,
Amir.