Re: [q] chdir/chroot by dentry and not by name.

From: Alexander Viro (aviro@redhat.com)
Date: Mon Apr 17 2000 - 04:30:09 EST


On Mon, 17 Apr 2000, Jamie Lokier wrote:

> > You _should_ _not_ chdir() by dentry. At all. Think what will happen if
> > filesystem will be mounted in two different places, sharing the dentry tree.
> > Dentry is not enough. And lookup_dentry() is going out - walk_name() is the
> > replacement.
>
> <raise eyebrow>
>
> Dentries have completely changed their meaning during the second feature
> freeze this year?

Not completely. See below for a rough description of the new architecture -
it's not too different from the old one.

> How is it you're able to thoroughly mangle this stuff
> and I can't get a simple DT_DIR patch looked at?

<voice tone="nasty">
        For that you can be grateful to devfs. I would be _glad_ to postpone
these changes. But now they became pretty much mandatory - thanks to the fs with
sufficient set of methods and need to do multiple mounts.
#include <threads/ritual/ork/protecting/devfs/pointless.h>
</voice>

> fchdir() does still work without races doesn't it?

Sure. And chdir() is also not going away.

        OK, there we go. First of all, the main problem with multiple mounts
is the following:

1. We should never have more than one dentry for a writable directory.

Print it and hang it on the wall. It's a fundamental requirement. There is
no way to work around it in our VFS. I tried to invent a scheme that would
allow that for more than a year. And I've done most of namespace-related code
in our VFS since the moment when Bill Hawes stopped working on it, so I suspect
that right now I have the best working knowledge of that stuff. There is no
fscking way to survive multiple dentries for writable directory without major
lossage. Period.

2. Consequently, we should not have several dentry trees for the same
filesystem.

3. Consequently, if we want to have the filesystem mounted in several
places we have to share the dentry tree. Including the root dentry.

4. Consequently, ->d_covers and ->d_mounts are BAD ideas. We have to separate
that information (mount linkage) from the struct dentry.

Notice that the unified tree consists of chunks that come from individual
filesystems. And the linkage between them (what is mounted where) is already
kept in a different way than the linkage between dentries within a chunk.
Chunks themselves form a tree. Each node in that tree corresponds to one
mountpoint and thus to the directory tree of the filesystem mounted there.
1--4 means that we have to separate that 'tree of chunks' from dentry trees
and make the nodes in that tree _refer_ to dentry trees. Moreover, we must
allow several chunks to refer to the same dentry tree - that's precisely
what we get for multiple mounts.
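
A minimal user-space sketch of that separation, for illustration only. The
toy_* names are hypothetical and are not the kernel's structures; the point is
just that a chunk refers to a dentry tree instead of being part of it, so two
chunks can share one tree.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, NOT the kernel's definitions. */
struct toy_dentry {
    const char *name;
    struct toy_dentry *parent;      /* linkage inside one filesystem only */
};

/* One node of the "tree of chunks": one mount of one filesystem. */
struct toy_chunk {
    struct toy_chunk *parent;       /* chunk we are mounted in, NULL at the top */
    struct toy_dentry *mountpoint;  /* dentry in the parent chunk that we cover */
    struct toy_dentry *root;        /* root of the (possibly shared) dentry tree */
};

/* Two chunks are mounts of the same dentry tree when they share its root. */
static int toy_same_tree(const struct toy_chunk *a, const struct toy_chunk *b)
{
    assert(a != NULL && b != NULL);
    return a->root == b->root;
}
```

Mounting the same filesystem twice is then two toy_chunk nodes with different
mountpoints and the same root pointer; no dentry is duplicated.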

Let's explore what changes it would require. First of all, dentry becomes
insufficient for walking through the unified tree. _If_ we want to do such
a walk we also need to know which chunk we are talking about. HOWEVER, we
rarely need such walking. Most of the kernel couldn't care less for the
chunk we are in - if you want rmdir() you don't care about the mounting, you
just want the bloody dentry and that's enough. Even more so for read/write/
stat/lstat/readlink/almost everything. Almost all operations are local to
their individual filesystem and don't care where and how many times it's
mounted. So we can keep using dentries almost everywhere we used to.

        Now, let's see what _will_ change. First of all, we should carry the
information about the chunk we are in through the lookup. Just as we carry
the pointer to dentry we are in. Not a big deal. We should be able to keep
track of crossing the chunk boundaries, but we have to do it anyway. However,
we should know which chunk we are in when we start walking. IOW, we need
to
        a) know the chunk where the cwd is.
        b) know the chunk where the root is.
Easy enough - we need to extend fs_struct a bit and take care to set both
"dentry" and "chunk" components upon chdir()/chroot()/fchdir(). Not a big
deal for the first two, but the third requires storing the chunk in the
struct file of an opened directory. IOW,
        c) in addition to f_dentry we need to store the chunk.
Fine. There are not that many places where we open files (see below). We also
need to
        d) know which chunk we are in when we follow the link.
Trivial, since we keep track of it in the sole caller of ->follow_link() anyway.
That's it for namespace walking.
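
Points (a)--(c) can be sketched roughly as follows. This is a user-space toy
under assumed names (toy_pos, toy_fs, toy_file are hypothetical, not the real
fs_struct or struct file layout); it only shows that "dentry" and "chunk"
travel as a pair and are always set together.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins, NOT the kernel's definitions. */
struct toy_dentry { const char *name; };
struct toy_chunk  { struct toy_dentry *root; };

/* A position in the unified tree: a dentry alone is no longer enough. */
struct toy_pos {
    struct toy_chunk  *chunk;
    struct toy_dentry *dentry;
};

/* (a) and (b): the per-process root and cwd carry both components. */
struct toy_fs {
    struct toy_pos root;
    struct toy_pos pwd;
};

/* (c): an open directory remembers the chunk it was opened through. */
struct toy_file {
    struct toy_pos where;
};

static void toy_set_pwd(struct toy_fs *fs, struct toy_pos pos)
{
    assert(pos.chunk != NULL && pos.dentry != NULL);
    fs->pwd = pos;                  /* both halves change together */
}

/* fchdir() analogue: the chunk comes from the open file, not from a lookup. */
static void toy_fchdir(struct toy_fs *fs, const struct toy_file *dir)
{
    toy_set_pwd(fs, dir->where);
}
```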

        Another problem is that we need to know the chunk if we ever want to
know the full pathname of an object. It adds to the list above
        e) chunk where the swap component sits.
That's it. The rest is covered in (a)--(c).
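
To see why the pathname needs the chunk, here is a hedged user-space sketch
(toy_path and the toy_* structures are hypothetical, not a kernel interface).
It walks rootwards from a (chunk, dentry) pair and hops to the parent chunk
whenever it reaches the root of the current one; the same dentry yields
different paths depending on the chunk it is reached through.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model, NOT the kernel's definitions. */
struct toy_dentry { const char *name; struct toy_dentry *parent; };
struct toy_chunk {
    struct toy_chunk *parent;       /* NULL for the top of the unified tree */
    struct toy_dentry *mountpoint;  /* dentry covered in the parent chunk */
    struct toy_dentry *root;        /* root dentry of the mounted tree */
};

/* Builds the path backwards from the end of buf; returns NULL on failure. */
static char *toy_path(const struct toy_chunk *chunk,
                      const struct toy_dentry *d, char *buf, size_t len)
{
    char *end;

    assert(chunk != NULL && d != NULL && buf != NULL);
    if (len < 2)
        return NULL;
    end = buf + len;
    *--end = '\0';
    for (;;) {
        size_t n;

        if (d == chunk->root) {
            if (chunk->parent == NULL)
                break;              /* top of the unified tree */
            d = chunk->mountpoint;  /* cross the mountpoint rootwards */
            chunk = chunk->parent;
            continue;
        }
        if (d->parent == NULL)
            return NULL;            /* dentry is not under this chunk's root */
        n = strlen(d->name);
        if ((size_t)(end - buf) < n + 1)
            return NULL;            /* buffer too small */
        end -= n;
        memcpy(end, d->name, n);
        *--end = '/';
        d = d->parent;
    }
    if (*end == '\0')
        *--end = '/';               /* the position was the top root itself */
    return end;
}
```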

        Now, (ignoring the stuff with the places where we open files) we
need to choose the structure that would represent nodes in the "tree of
chunks". Fortunately, we already had such a structure - struct vfsmount.
It was a natural candidate for the per-mountpoint stuff, just as struct
super_block is for mountpoint-independent data. That required moving the
quota options into struct super_block (obviously). With that done we got
the material for chunks tree.
        What do we need to know about every chunk? Well, obviously we
need the dentry of the mountpoint, the root dentry of the mounted fs and the
parent chunk. That allows trivial crossing of mountpoints, erm, rootwards.
For crossing them in the other direction (into the mounted fs) we need a bit
more - several mountpoints _may_ (normally will not, but we have to account
for that) have the same dentry (in different chunks, indeed). So we have a
_set_ of chunks over the dentry of a mountpoint. They all have different
parent chunks, thus crossing the mountpoint turns into
        find a chunk that
                belongs to the set over the current dentry and
                has its parent equal to the current chunk.
Data structure for that is a separate story and the final choice will take
profiling for different uses, but one of the trivial (and efficient in
normal cases) variants is a cyclic list of vfsmounts anchored in the dentry.
In the absence of the case when two mountpoints have the same dentry (in
different chunks) it's as efficient as the old scheme was.
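
The cyclic-list variant could look roughly like this in a user-space toy. All
names here (toy_mount, toy_attach, toy_lookup_mount, toy_follow_up,
toy_follow_down) are hypothetical and the field layout is an assumption, not
the actual struct vfsmount.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mount;

/* Toy model, NOT the kernel's definitions. */
struct toy_dentry {
    struct toy_dentry *parent;
    struct toy_mount *mounts;       /* anchor of the cyclic list, or NULL */
};

struct toy_mount {
    struct toy_mount *parent;       /* parent chunk */
    struct toy_dentry *mountpoint;  /* dentry we sit on, in the parent chunk */
    struct toy_dentry *root;        /* root dentry of the mounted fs */
    struct toy_mount *next;         /* next mount over the same dentry */
};

/* Link m into the cyclic list anchored in its mountpoint dentry. */
static void toy_attach(struct toy_mount *m)
{
    struct toy_dentry *d = m->mountpoint;

    assert(d != NULL);
    if (d->mounts == NULL) {
        m->next = m;
        d->mounts = m;
    } else {
        m->next = d->mounts->next;
        d->mounts->next = m;
    }
}

/* Find the chunk over d whose parent is the current chunk. */
static struct toy_mount *toy_lookup_mount(struct toy_mount *cur,
                                          struct toy_dentry *d)
{
    struct toy_mount *m = d->mounts;

    if (m == NULL)
        return NULL;
    do {
        if (m->parent == cur)
            return m;
        m = m->next;
    } while (m != d->mounts);
    return NULL;
}

/* Cross into the mounted fs; returns 1 if the position changed. */
static int toy_follow_down(struct toy_mount **mnt, struct toy_dentry **d)
{
    struct toy_mount *child = toy_lookup_mount(*mnt, *d);

    if (child == NULL)
        return 0;
    *mnt = child;
    *d = child->root;
    return 1;
}

/* Cross rootwards from the root of a chunk; returns 1 if crossed. */
static int toy_follow_up(struct toy_mount **mnt, struct toy_dentry **d)
{
    if (*d != (*mnt)->root || (*mnt)->parent == NULL)
        return 0;
    *d = (*mnt)->mountpoint;
    *mnt = (*mnt)->parent;
    return 1;
}
```

With a single mount over a dentry the list has one element and the lookup is
one comparison, which matches the claim about the normal case.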
        As for opening files, the problem was in the binfmt drivers,
mostly due to the fact that do_execve() left the opening to ->load_binary().
Which was a BAD idea, since it led to code duplication in all of them.
Fixed by shifting the opening into exec.c and passing struct file instead of
struct dentry.

        That's mostly it as far as design counts - everything else was a
matter of coding, choosing decent interfaces, etc. and is a matter of
putting it into the tree in small steps. A large part is already there and
I'd rather postpone the description of all the gory details until all this
stuff goes in. Infrastructure is already there (almost all - the last piece
has been sent to Linus and it deals with the <expletives>
/usr/gnuemul/{solaris,etc.} handling). Once it is in I'll post the
description of the new interfaces.
        Main part of pending patches consists of almost complete rewrite of
fs/super.c, so changes in that area are _not_ a good idea right now.
Filesystems are already there - in that part all changes are already done,
except the changes in autofs - it is intimately tied to the mount-related
stuff. I have this stuff done, so it's not going to be a problem.
        Resulting design gives a lot of interesting opportunities, e.g. it
allows storing all metadata in the dentry tree without bothering with
'backplane' trees a-la current procfs. Other obvious things include loopback
mounts (add a new vfsmount and we are done) and dealing with filesystems not
visible to any user (create a vfsmount visible only to the kernel and you
are done).
        Folks, could you please wait until the lookup interface settles
down? I promise to give a complete description of the interfaces and data
structures once the thing is there. Right now it's in transit. I hope that
the mess above gives some idea of where we are moving to - it definitely
contains all the crucial ideas.
                                                                Al
