Re: Linux-2.2.2-pre2..

Alexander Viro (viro@math.psu.edu)
Sun, 7 Feb 1999 03:54:21 -0500 (EST)


On Sat, 6 Feb 1999, Linus Torvalds wrote:

>
>
> On Sun, 7 Feb 1999, Alexander Viro wrote:
> >
> > Could you elaborate? I'd see the point if we were to split the *method*,
> > but I don't see the reason for splitting the VFS-level code. The method has
> > to distinguish between directory/non-directory moves anyway - '..' flipping
> > and the emptiness check are fs-specific. Ditto for nlink overflow checking
> > (BTW, it asks for an additional field in struct super, IMHO. Say,
> > s_max_nlink).
>
> Basically, splitting the method is what we should do for 2.3.x, but it's
> just not feasible for 2.2.x at this stage.
>
> But that's why I want the VFS layer split - so that when we _do_ split the
> method, the VFS layer is all ready for it. Because basically a directory
> rename _is_ very different from a normal file rename.

OK. I'm still not convinced that it's needed, but let it be.

> > BTW, we'll have to redo it in 2.3 - serialization is way too rough right
> > now and we can easily make it finer if we use d_flags to mark the source
> > of a rename() in progress
>
> No, you can't just mark the rename in process - if you really want to
> split up the per-fs rename semaphore into finer-grained setups so that you
> can do multiple directory moves concurrently, then you have to poison not
> just the source, you have to poison the whole path down from the source
> down to the root of the filesystem.

See below.

> > (if there are no marked points on the path from
> > target to the closest common ancestor of target and source we can go ahead
> > without waiting for other renames).
>
> Right, but the only sane way of detecting this is to poison down to the
> root, and if you hit another process' poison bit you'd roll back and wait
> with the rename.

Erm, no. Look: our condition is that no x which is an ancestor of target
but not an ancestor of source is marked. Now, let x0 be the closest
ancestor of target that either is the root or is marked. The condition
turns into (x0 is an ancestor of source). That is,

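retry:
	/* climb from target to the nearest marked dentry (or the root) */
	l = target;
	while (l != l->d_parent && !IS_SOURCE(l) && l != source)
		l = l->d_parent;
	if (l == source)		/* target is under source */
		return -EINVAL;
	if (l->d_parent == l)		/* reached the root */
		goto we_are_OK;
	/* l is marked: fine only if it is also an ancestor of source */
	m = source->d_parent;
	while (m->d_parent != m && m != l)
		m = m->d_parent;
	if (m != l) {
		wait_on_rename(l);
		goto retry;
	}
we_are_OK:
	mark_source(source);
	/* go ahead */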

That's it. No need for excessive marking. And we get O(depth) complexity,
same as in is_subdir() (whose check is covered here by the l==source test).
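
For anyone who wants to poke at the walk outside the kernel, here is a
rough user-space sketch of it. Everything in it is an assumption made for
illustration: struct node stands in for a dentry, rename_source stands in
for the d_flags mark, and rename_check() returns -EAGAIN at the point
where the fragment above would call wait_on_rename() and retry. It is not
kernel code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry; the root is its own parent. */
struct node {
    struct node *parent;
    int rename_source;      /* nonzero while this node is being renamed */
};

/*
 * Returns 0 if the rename may go ahead, -EINVAL if target is source or
 * lies under it, and -EAGAIN where the kernel version would sleep in
 * wait_on_rename() and retry.
 */
static int rename_check(const struct node *source, const struct node *target)
{
    const struct node *l = target;
    const struct node *m;

    /* Climb from target until we hit the root, a marked node, or source. */
    while (l != l->parent && !l->rename_source && l != source)
        l = l->parent;
    if (l == source)
        return -EINVAL;
    if (l->parent == l)
        return 0;           /* reached the root without meeting a mark */

    /* l is marked: acceptable only if it is also an ancestor of source. */
    m = source->parent;
    while (m->parent != m && m != l)
        m = m->parent;
    return (m == l) ? 0 : -EAGAIN;
}
```

On success the caller would set rename_source on the source node (the
mark_source() step) before doing the actual move. Both loops only climb
parent pointers, so the cost stays O(depth).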

> HOWEVER - I don't actually think it makes much sense to try to aim for
> tons of concurrency in directory renaming. It's not as if it is a very
> common operation anyway, so the per-fs lock works wonderfully well (and
> doesn't impact any other operations).
>
>
> > ObRaces: assume that /foo/bar/a and /foo/bar/b are both symlinks to "..".
> > Now, unlink("/foo/bar/a/bar/b") and unlink("/foo/bar/b/bar/a") shouldn't
> > *both* succeed, right?
>
> Why not? How you got to the point is meaningless. You don't unlink a
> _path_, you unlink the last entry it points to.

Ahem... How do you like the second example? We get successful
lookup on a path that *never* existed.

> Basically, if you have senseless users, there's _no_ point in the kernel
> trying to second-guess what they mean. You'll get it wrong anyway, and
> you'll add a lot of complexity trying to get it right.

Sorry, I wasn't clear enough on that point. It's *not*
second-guessing. I want to (a) obtain an SMP-clean namei.c/dcache.c and
(b) simplify the race-prevention code. There is a lot of such code in the
VFS and in the methods, and we can *seriously* cut down on it if we solve
all ordering/atomicity issues in the lookup phase, leaving the filesystems
in a situation where they can simply go ahead and not worry about
interference at all. I think that we abuse i_sem on directories. Heck,
reread your own comment in the rmdir() code ;-) This stuff belongs at the
dcache level, not the inode one.

> Basic rule: make it as complex as you have to, but no more.

Yup. That's why I want to do this thing. It makes the situation (and code)
less complex. Anyway, it's *not* a 2.2.early issue and I think that it's
not a 2.2 issue at all. If we have rename() serialization in the VFS we'll
be able to change it without touching (and breaking) filesystems. I'm glad
that this pain in the ass will go away now. If you really want a
description of the lookup atomicity stuff I can sit down and turn my notes
and comments into coherent text in a week or so, but I don't think
that the time is right. I'd rather wait with it at least until March/April.
Save tomorrow for tomorrow.

BTW, I'd really like to hear comments from the AFFS folks re hardlinks on
directories. In its current form AFFS is pretty insecure and I can royally
screw the system up if I have write permission on a single directory on a
mounted AFFS. Do we want the VFS to be able to deal with multiple links to
a directory? If the answer is 'yes' we'd better start doing it. Otherwise
we need to do something with the AFFS stuff. IMHO the former is Wrong(tm).
Who maintains AFFS these days?
Cheers,
Al
