Re: pivot_root(".", ".") and the fchdir() dance

From: Eric W. Biederman
Date: Mon Oct 07 2019 - 11:47:22 EST


"Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:

> Hello Eric,
>
> On 9/30/19 2:42 PM, Eric W. Biederman wrote:
>> "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:
>>
>>> Hello Eric,
>>>
>>> A ping on my question below. Could you take a look please?
>>>
>>> Thanks,
>>>
>>> Michael
>>>
>>>>>>> The concern from our conversation at the container mini-summit was that
>>>>>>> there is a pathology if in your initial mount namespace all of the
>>>>>>> mounts are marked MS_SHARED like systemd does (and is almost necessary
>>>>>>> if you are going to use mount propagation), that if new_root itself
>>>>>>> is MS_SHARED then unmounting the old_root could propagate.
>>>>>>>
>>>>>>> So I believe the desired sequence is:
>>>>>>>
>>>>>>>>>> chdir(new_root);
>>>>>>> +++ mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>>>>>>> pivot_root(".", ".");
>>>>>>>>>> umount2(".", MNT_DETACH);
>>>>>>>
>>>>>>> The change to new new_root could be either MS_SLAVE or MS_PRIVATE. So
>>>>>>> long as it is not MS_SHARED the mount won't propagate back to the
>>>>>>> parent mount namespace.
>>>>>>
>>>>>> Thanks. I made that change.
>>>>>
>>>>> For what it is worth. The sequence above without the change in mount
>>>>> attributes will fail if it is necessary to change the mount attributes
>>>>> as "." is both put_old as well as new_root.
>>>>>
>>>>> When I initially suggested the change I saw "." was new_root and forgot
>>>>> "." was also put_old. So I thought there was a silent danger without
>>>>> that sequence.
>>>>
>>>> So, now I am a little confused by the comments you added here. Do you
>>>> now mean that the
>>>>
>>>> mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>
>>>> call is not actually necessary?
>>
>> Apologies for being slow getting back to you.
>
> NP. Thanks for your reply.
>
>> To my knowledge there are two cases where pivot_root is used.
>> - In the initial mount namespace from a ramdisk when mounting root.
>> This is the original use case and somewhat historical as rootfs
>> (aka an initial ramfs) may not be unmounted.
>>
>> - When setting up a new mount namespace to jettison all of the mounts
>> you don't need.
>>
>> The sequence:
>>
>> chdir(new_root);
>> pivot_root(".", ".");
>> umount2(".", MNT_DETACH);
>>
>> is perfect for both use cases (as nothing needs to be known about the
>> directory layout of the new root filesystem).
>>
>> In the case when you are setting up a new mount namespace propogating
>> changes in the mount layout to another mount namespace is fatal. But
>> that is not a concern for using that pivot_root sequence above because
>> pivot_root will fail deterministically if
>> 'mount("", ".", MS_SLAVE | MS_REC, NULL)' is needed but not specified.
>>
>> So I would document the above sequence of three system calls in the
>> man-page.
>
> Okay. I've changed the example to be just those three calls.
>
>> I would document that pivot_root will fail if propagation would occur.
>
> Yep. That's in the page already.
>
>> I would document in pivot_root or under unshare(CLONE_NEWNS) that if
>> mount propagation is enabled (the default with systemd) that you
>> need to call 'mount("", "/", MS_SLAVE | MS_REC, NULL);' or
>> 'mount("", "/", MS_PRIVATE | MS_REC, NULL);' after creating a mount
>> namespace. Or mounts will propagate backwards, which is usually
>> not what people want.
>
> Thanks. Instead, I have added the following text to
> mount_namespaces(7), the page that is referred to by both clone(2) and
> unshare(2) in their discussions of CLONE_NEWNS:
>
> An application that creates a new mount namespace
> directly using clone(2) or unshare(2) may desire to preâ
> vent propagation of mount events to other mount namesâ
> paces (as is is done by unshare(1)). This can be done by
> changing the propagation type of mount points in the new
> namesapace to either MS_SLAVE or MS_PRIVATE. using a
> call such as the following:
>
> mount(NULL, "/", MS_SLAVE | MS_REC, NULL);

Yes.

>> Creating of a mount namespace in a user namespace automatically does
>> 'mount("", "/", MS_SLAVE | MS_REC, NULL);' if the starting mount
>> namespace was not created in that user namespace. AKA creating
>> a mount namespace in a user namespace does the unshare for you.
>
> Oh -- I had forgotten that detail. But it is documented
> (by you, I think) in mount_namespaces(7):
>
> * A mount namespace has an owner user namespace. A
> mount namespace whose owner user namespace is differâ
> ent from the owner user namespace of its parent mount
> namespace is considered a less privileged mount namesâ
> pace.
>
> * When creating a less privileged mount namespace,
> shared mounts are reduced to slave mounts. (Shared
> and slave mounts are discussed below.) This ensures
> that mappings performed in less privileged mount
> namespaces will not propagate to more privileged mount
> namespaces.
>
> There's one point that description that troubles me. There is a
> reference to "parent mount namespace", but as I understand things
> there is no parental relationship among mount namespaces instances
> (or am I wrong?). Should that wording not be rather something
> like "the mount namespace of the process that created this mount
> namespace"?

How about "the mount namespace this mount namespace started as a copy of"

You are absolutely correct there is no relationship between mount
namespaces. There is just the propagation tree between mounts. (Which
acts similarly to a parent/child relationship but is not at all the same
thing).

Eric