RE: Could not mount sysfs when enable userns but disable netns

From: chenhanxiao@xxxxxxxxxxxxxx
Date: Mon Jul 14 2014 - 05:32:49 EST




> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm@xxxxxxxxxxxx]
> Sent: Saturday, July 12, 2014 12:29 AM
> To: Serge E. Hallyn
> Cc: Chen, Hanxiao/陈 晗霄; Serge Hallyn (serge.hallyn@xxxxxxxxxx); Greg
> Kroah-Hartman; containers@xxxxxxxxxxxxxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: Could not mount sysfs when enable userns but disable netns
>
> "Serge E. Hallyn" <serge@xxxxxxxxxx> writes:
>
> > Quoting chenhanxiao@xxxxxxxxxxxxxx (chenhanxiao@xxxxxxxxxxxxxx):
> >> Hello,
> >>
> >> How to reproduce:
> >> 1. Prepare a container with userns enabled and netns disabled.
> >> 2. Use libvirt-lxc to start the container.
> >> 3. libvirt cannot mount sysfs, so the container fails to start.
> >>
> >> Then I found that
> >> commit 7dc5dbc879bd0779924b5132a48b731a0bc04a1e says:
> >> "Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights
> >> over the net namespace."
> >>
> >> But why should we check sysfs mount permission over the net namespace?
> >> We've already checked CAP_SYS_ADMIN.
>
> We already checked capable(CAP_SYS_ADMIN) and it failed.

But on my machine, the capable(CAP_SYS_ADMIN) check did not reject the
mount; it failed in kobj_ns_current_may_mount() instead.

I added some printks to sysfs_mount():
     if (!(flags & MS_KERNMOUNT)) {
-        if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
+        if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type)) {
+            printk(KERN_WARNING "Failed in capable\n");
             return ERR_PTR(-EPERM);
+        }

-        if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
+        if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET)) {
+            printk(KERN_WARNING "Failed in kobj_ns_current_may_mount\n");
             return ERR_PTR(-EPERM);
+        }

And found this in the log:
Jul 14 09:55:26 localhost systemd: Starting Container lxc-chx.
Jul 14 09:55:26 localhost systemd-machined: New machine lxc-chx.
Jul 14 09:55:26 localhost systemd: Started Container lxc-chx.
Jul 14 09:55:26 localhost kernel: [ 784.044709] Failed in kobj_ns_current_may_mount
Jul 14 09:55:26 localhost systemd-machined: Machine lxc-chx terminated.

>
> >> What is the relationship between sysfs and the net namespace?
> >> Or is this check a little redundant?
>
> You want a bind mount not a new fresh mount.
>

Yes, we need to modify libvirt's code to handle sysfs
when userns is enabled but netns is disabled.
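
For illustration, here is a minimal userspace sketch of that kind of
fallback: try a fresh sysfs mount first and, if the kernel refuses with
EPERM, bind mount the existing /sys instead. This is not libvirt code.
The names setup_sysfs, mount_fn and stub_deny_fresh are hypothetical, and
the mount flags are assumptions about what a container would want.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Same signature as mount(2), so a stub can stand in for it when testing. */
typedef int (*mount_fn)(const char *source, const char *target,
                        const char *fstype, unsigned long flags,
                        const void *data);

enum sysfs_result { SYSFS_FRESH, SYSFS_BIND, SYSFS_FAILED };

/*
 * Hypothetical helper: prefer a fresh sysfs instance, fall back to a
 * bind mount of the existing /sys when the fresh mount is refused.
 */
static enum sysfs_result setup_sysfs(mount_fn do_mount, const char *target)
{
    /* Fresh mount: needs privilege over the current net namespace. */
    if (do_mount("sysfs", target, "sysfs",
                 MS_NOSUID | MS_NOEXEC | MS_NODEV, NULL) == 0)
        return SYSFS_FRESH;

    /* Only EPERM is treated as "not allowed here, try the bind mount". */
    if (errno != EPERM)
        return SYSFS_FAILED;

    /* Assumption: a recursive bind keeps whatever is mounted below /sys. */
    if (do_mount("/sys", target, NULL, MS_BIND | MS_REC, NULL) == 0)
        return SYSFS_BIND;

    return SYSFS_FAILED;
}

/*
 * Stub that mimics the refusal seen in this thread: any non-bind mount
 * fails with EPERM, bind mounts succeed. Lets the logic run unprivileged.
 */
static int stub_deny_fresh(const char *source, const char *target,
                           const char *fstype, unsigned long flags,
                           const void *data)
{
    (void)source; (void)target; (void)fstype; (void)data;
    if (!(flags & MS_BIND)) {
        errno = EPERM;
        return -1;
    }
    return 0;
}
```

In real use the caller would pass mount itself, e.g.
setup_sysfs(mount, "/path/to/newroot/sys"), where the target path is a
placeholder. With the stub, the helper ends up on the bind-mount path.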

Thanks,
- Chen

> When looking at how evil actors could abuse things it turned out that in
> some circumstances the root user (before a user namespace is created)
> needs to control the policy on which filesystems may be mounted. There
> are files in sysfs and in proc that you never want to see in a chroot
> jail, as they just create more surface area to attack.
>
> The only reason for creating a new fresh mount of sysfs is to get access
> to /sys/class/net. So to keep things simple we restrict creation of
> that mount to cases where the mounter has permissions over the network
> namespace, and cases where nothing interesting is mounted on top of
> sysfs.
>
> If a new /sys/class/net is not needed it is possible to bind mount the
> existing copy of sysfs to the new location without loss of
> functionality.
>
> > It is not redundant. The whole point is that after clone(CLONE_NEWUSER)
> > you get a newly filled set of capabilities. But you should not have
> > privileges over the host's network namespace. After you unshare a new
> > network namespace, you *should* have privilege over it. So the fact
> > that we've already checked CAP_SYS_ADMIN means nothing, because the
> > capabilities need to be targeted.
>
> Exactly. The tests are failing because the caller is not the global root
> and so the code is properly failing the permission checks.
>
> Eric
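
The point about targeted capabilities can be modeled with a small toy
program. This is a simplified sketch, not kernel code: the structs and
function names below are made up for illustration, and the model ignores
the rule that the creator of a user namespace gets capabilities over it
from the parent namespace. It only shows why CAP_SYS_ADMIN held inside a
new user namespace does not reach a network namespace owned by the host's
user namespace.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects; field names are illustrative only. */
struct user_ns { const struct user_ns *parent; };
struct net_ns  { const struct user_ns *owner; };
struct task {
    const struct user_ns *user_ns;  /* user namespace the task lives in */
    const struct net_ns *net_ns;    /* network namespace the task uses  */
    bool cap_sys_admin;             /* CAP_SYS_ADMIN in its own user_ns */
};

/*
 * Simplified targeted check: the capability applies to target only if the
 * task's user namespace is target itself or one of its ancestors.
 */
static bool has_admin_over(const struct task *t, const struct user_ns *target)
{
    for (const struct user_ns *ns = target; ns != NULL; ns = ns->parent)
        if (ns == t->user_ns)
            return t->cap_sys_admin;
    return false;
}

/* Model of the net namespace check applied to a fresh sysfs mount. */
static bool may_mount_sysfs(const struct task *t)
{
    return has_admin_over(t, t->net_ns->owner);
}

/* userns enabled, netns disabled: the task still uses the host's netns. */
static bool allowed_with_host_netns(void)
{
    const struct user_ns host_user = { NULL };
    const struct user_ns ct_user = { &host_user };
    const struct net_ns host_net = { &host_user };
    const struct task t = { &ct_user, &host_net, true };
    return may_mount_sysfs(&t);
}

/* userns and netns both enabled: the netns is owned by the new userns. */
static bool allowed_with_own_netns(void)
{
    const struct user_ns host_user = { NULL };
    const struct user_ns ct_user = { &host_user };
    const struct net_ns ct_net = { &ct_user };
    const struct task t = { &ct_user, &ct_net, true };
    return may_mount_sysfs(&t);
}
```

In this model the first case is refused and the second is allowed, which
matches the behaviour described in the thread.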