Re: pivot_root(".", ".") and the fchdir() dance

From: Michael Kerrisk (man-pages)
Date: Tue Aug 06 2019 - 15:35:54 EST


Hello Aleksa,

On 8/5/19 3:37 PM, Aleksa Sarai wrote:
> On 2019-08-05, Michael Kerrisk (man-pages) <mtk.manpages@xxxxxxxxx> wrote:
>> On 8/5/19 12:36 PM, Aleksa Sarai wrote:
>>> On 2019-08-01, Michael Kerrisk (man-pages) <mtk.manpages@xxxxxxxxx> wrote:
>>>> I'd like to add some documentation about the pivot_root(".", ".")
>>>> idea, but I have a doubt/question. In the lxc_pivot_root() code we
>>>> have these steps
>>>>
>>>> oldroot = open("/", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
>>>> newroot = open(rootfs, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
>>>>
>>>> fchdir(newroot);
>>>> pivot_root(".", ".");
>>>>
>>>> fchdir(oldroot); // ****
>>>
>>> This one is "required" because (as the pivot_root(2) man page states),
>>> it's technically not guaranteed by the kernel that the process's cwd
>>> will be the same after pivot_root(2):
>>>
>>>> pivot_root() may or may not change the current root and the current
>>>> working directory of any processes or threads which use the old root
>>>> directory.
>>>
>>> Now, if it turns out that we can rely on the current behaviour (and the
>>> man page you're improving is actually inaccurate on this point) then
>>> you're right that this fchdir(2) isn't required.
>>
>> I'm not sure that I follow your logic here. In the old manual page
>> text that you quote above, it says that [pivot_root() may change the
>> CWD of any processes that use the old root directory]. Two points
>> there:
>>
>> (1) the first fchdir() has *already* changed the CWD of the calling
>> process to the new root directory, and
>
> Right, I (and presumably the LXC folks as well) must've missed the
> qualifier on the end of the sentence and was thinking that it said "you
> can't trust CWD after pivot_root(2)".
>
> My follow-up was going to be that we need to be in the old root to
> umount, but as you mentioned that shouldn't be necessary either since
> the umount will apply to the stacked mount (which is somewhat
> counter-intuitively the earlier mount not the later one -- I will freely
> admit that I don't understand all of the stacked and tucked mount
> logic in VFS).

Your not alone. I don't follow that code easily either. But, looking
at the order that the mounts were stacked in /proc/PID/mountinfo
helped clarify things for me.

>> (2) the manual page implied but did not explicitly say that the
>> CWD of processes using the old root may be changed *to the new root
>> directory* (rather than changed to some arbitrary location!);
>> presumably, omitting to mention that detail explicitly in the manual
>> page was an oversight, since that is indeed the kernel's behavior.
>>
>> The point is, the manual page was written 19 years ago and has
>> barely been changed since then. Even at the time that the system
>> call was officially released (in Linux 2.4.0), the manual page was
>> already inaccurate in a number of details, since it was written
>> about a year beforehand (during the 2.3 series), and the
>> implementation already changed by the time of 2.4.0, but the manual
>> page was not changed then (or since, but I'm working on that).
>>
>> The behavior has in practice always been (modulo the introduction
>> of mount namespaces in 2001/2002):
>>
>> pivot_root() changes the root directory and the current
>> working directory of each process or thread in the same
>> mount namespace to new_root if they point to the old root
>> directory.
>>
>> Given that this has been the behavior since Linux 2.4.0 was
>> released, it improbable that this will ever change, since,
>> notwithstanding what the manual page says, this would be an ABI
>> breakage.
>>
>> I hypothesize that the original manual page text, written before
>> the system call was even officially released, reflects Werner's
>> belief at the time that perhaps in the not too distant future
>> the implementation might change. But, 18 years on from 2.4.0,
>> it has not.
>>
>> And arguably, the manual page should reflect that reality,
>> describing what the kernel actually does, rather than speculating
>> that things might (after 19 years) still sometime change.
>
> I wasn't aware of the history of the man page, and took it as gospel
> that we should avoid making assumptions about current's CWD surrounding
> a pivot_root(2). Given the history (and that it appears the behaviour
> was never intended to be changed after being merged), we should
> definitely drop that text to avoid the confusion which has already
> caused us container folks to implement this in a
> more-convoluted-than-necessary fashion.
>
> In case you haven't noticed already, you might want to also send a patch
> to util-linux to also update pivot_root(8) which makes the same mistake
> in its text:
>
>> Note that, depending on the implementation of pivot_root, root and cwd
>> of the caller may or may not change.
>
> Then again, it's also possible this text is independently just as vague
> for other reasons.

I think that page was also written by Werner, back in the day. So I
think it's vague for the same reasons.

>>> And this one is required because we are in @oldroot at this point, due
>>> to the first fchdir(2). If we don't have the first one, then switching
>>> from "." to "/" in the mount/umount2 calls should fix the issue.
>>
>> See my notes above for why I therefore think that the second fchdir()
>> is also not needed (and therefore why switching from "." to "/" in the
>> mount()/umount2() calls is unnecessary.
>
> My gut feeling reading this was that operating on "." will result in you
> doing the later mount operations on @newroot (since "." is @newroot) not
> on the stacked mount which isn't your CWD.
>
> *But* my gut feeling is obviously wrong (since you've tested it), and I
> will again admit I don't understand quite how CWD references interact
> with mount operations -- especially in the context of stacked mounts.
>
>> Do you agree with my analysis?
>
> Minus the mount bits that I'm not too sure about (I defer to
> Christian/Serge/et al on that point), it seems reasonable to me.

Okay.

> My only real argument for keeping it the way it is, is that it's much
> easier (for me, at least) to understand the semantics with explicit
> fchdir(2)s. But that's not really a good reason to continue doing it the
> way we do it now -- if it's documented in ah man page that'd be more than
> sufficient to avoid confusion when reading snippets that do it without
> the fchdir(2)s.

Yes. I'm wanting to get a some feedback/confirmation from others
before I finalize the changes ti the manual page. The feedback from
you and Philipp has already been helpful. I'm hoping that Serge or
Andy might also chip in.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/