Re: [PATCH V6 00/10] namespaces: log namespaces per task

From: Richard Guy Briggs
Date: Mon Apr 27 2015 - 22:06:15 EST


On 15/04/24, Eric W. Biederman wrote:
> Richard Guy Briggs <rgb@xxxxxxxxxx> writes:
> > On 15/04/22, Richard Guy Briggs wrote:
> >> On 15/04/20, Eric W. Biederman wrote:
> >> > Richard Guy Briggs <rgb@xxxxxxxxxx> writes:
> >> >
> >> > > The purpose is to track namespace instances in use by logged processes from the
> >> > > perspective of init_*_ns by logging the namespace IDs (device ID and namespace
> >> > > inode - offset).
> >> >
> >> > In broad strokes the user interface appears correct.
> >> >
> >> > Things that I see that concern me:
> >> >
> >> > - After Als most recent changes these inodes no longer live in the proc
> >> > superblock so the device number reported in these patches is
> >> > incorrect.
> >>
> >> Ok, found the patchset you're talking about:
> >> 3d3d35b kill proc_ns completely
> >> e149ed2 take the targets of /proc/*/ns/* symlinks to separate fs
> >> f77c801 bury struct proc_ns in fs/proc
> >> 33c4294 copy address of proc_ns_ops into ns_common
> >> 6344c43 new helpers: ns_alloc_inum/ns_free_inum
> >> 6496452 make proc_ns_operations work with struct ns_common * instead of void *
> >> 3c04118 switch the rest of proc_ns_operations to working with &...->ns
> >> ff24870 netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
> >> 58be2825 make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
> >> 435d5f4 common object embedded into various struct ....ns
> >>
> >> Ok, I've got some minor jigging to do to get inum too...
> >
> > Do I even need to report the device number anymore since I am concluding
> > s_dev is never set (or always zero) in the nsfs filesystem by
> > mount_pseudo() and isn't even mountable?
>
> We still need the dev. We do have a device number get_anon_bdev fills it in.

Fine, it has a device number. There appears to be only one of these
allocated per kernel. I can get it from &nsfs->fs_supers (and take the
first instance given by hlist_for_each_entry and verify there are no
others). Why do I need it, again?

> > In fact, I never needed to
> > report the device since proc ida/idr and inodes are kernel-global and
> > namespace-oblivious.
>
> This is the bit I really want to keep to be forward looking. If we
> every need to preserve the inode numbers across a migration we could
> have different super blocks with different inode numbers for the same
> namespace.

I don't quite follow your argument here, but can accept that in the
future we might add other namespace devices. I wonder if we might do
that augmentation later and leave out the device number for now...

> >> > - I am nervous about audit logs being flooded with users creating lots
> >> > of namespaces. But that is more your lookout than mine.
> >>
> >> There was a thought to create a filter to en/disable this logging...
> >> It is an auxiliary record to syscalls, so they can be ignored by userspace tools.
> >>
> >> > - unshare is not logging when it creates new namespaces.
> >>
> >> They are all covered:
> >> sys_unshare > unshare_userns > create_user_ns
> >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_mnt_ns
> >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_utsname > clone_uts_ns
> >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_ipcs > get_ipc_ns
> >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_pid_ns > create_pid_namespace
> >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_net_ns
>
> Then why the special change to fork? That was not reflected on
> the unshare path as far as I could see.

Fork can specify more than one CLONE flag at once, so collecting them
all in one statementn seemed helpful. setns can only set one at a time.

> >> > As small numbers are nice and these inodes all live in their own
> >> > superblock now we should be able to remove the games with
> >> > PROC_DYNAMIC_FIRST and just use small numbers for these inodes
> >> > everywhere.
> >>
> >> That is compelling if I can untangle the proc inode allocation code from the
> >> ida/idr. Should be as easy as defining a new ns_alloc_inum (and ns_free_inum)
> >> to use instead of proc_alloc_inum with its own ns_inum_ida and ns_inum_lock,
> >> then defining a NS_DYNAMIC_FIRST and defining NS_{IPC,UTS,USER,PID}_INIT_INO in
> >> the place of the existing PROC_*_INIT_INO.
>
> Something like that. Just a new ida/idr allocator specific to that
> superblock.
>
> Yeah. It is somewhere on my todo, but I have been prioritizing getting
> the bugs that look potentially expoloitable fixed in the mount
> namespace. Al made things nice for one case but left a mess for a bunch
> of others.
>
> >> > I honestly don't know how much we are going to care about namespace ids
> >> > during migration. So far this is not a problem that has come up.
> >>
> >> Not for CRIU, but it will be an issue for a container auditor that aggregates
> >> information from individually auditted hosts.
> >>
> >> > I don't think migration becomes a practical concern (other than
> >> > interface wise) until achieve a non-init namespace auditd. The easy way
> >> > to handle migration would be to log a setns of every process from their
> >> > old namespaces to their new namespaces. As you appear to have a setns
> >> > event defined.
> >>
> >> Again, this would be taken care of by a layer above that is container-aware
> >> across multiple hosts.
>
>
> >> > How to handle the more general case beyond audit remains unclear. I
> >> > think it will be a little while yet before we start dealing with
> >> > migrating applications that care. When we do we will either need to
> >> > generate some kind of hot-plug event that userspace can respond to and
> >> > discover all of the appropriate file-system nodes have changed, or we
> >> > will need to build a mechanism in the kernel to preserve these numbers.
> >>
> >> I don't expect to need to preserve these numbers. The higher layer application
> >> will be able to do that translation.
>
> We need to be very aware of what is happening.
>
> The situation I am concerned about looks something like.
>
> Program A:
> fd1 = open(/proc/self/ns/net);
> fstat(fd1, &stat1)
>
> ... later ...
>
> fd2 = open(/var/run/netns/johnny);
> fstat(fd2, &stat2);
>
> if ((stat1.st_dev == stat2.st_dev) &&
> (stat1.st_ino == stat2.st_ino)) {
> /* Same netns do something... */
> }
>
>
> What happens when we migrate Program A with it's cached stat data of
> of a network namespace file?
>
> This requires either a hotplug event that Program A listens to or that
> the inode number and device number are preserved across migration.
>
> Exactly what we do depends on where we are when it comes up. But this
> is not something some layer about the program can abstract it all out so
> we don't need to worry about it.

Ok, understood, we can't just punt this one to a higher layer...

So this comes back to a question above, which is how do we determine
which device it is from? Sounds like we need something added to
ns_common or one of the 6 namespace types structs.

> Eric

- RGB

--
Richard Guy Briggs <rbriggs@xxxxxxxxxx>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/