Re: [patch 3/6] vfs: mountinfo stable peer group id

From: Al Viro
Date: Thu Mar 20 2008 - 17:43:44 EST


On Wed, Mar 19, 2008 at 07:37:51PM +0100, Miklos Szeredi wrote:

> > Argh... OK, I'll try to put something together tonight, after I get some
> > sleep - 31 hours of uptime _suck_ ;-/
>
> Gosh, yes.

Folks, we have some problems. I've sat down to write documentation on
locking rules, etc. for struct vfsmount and there are some serious
turds in the area. Among the other things we have interesting race
between shrink_submounts() and copy_tree() - the former calls
propagate_mount_busy(), which walks ->mnt_share and friends. The
latter adds elements to those. Wouldn't be a problem, except that
shrink_submounts() only holds vfsmount_lock while everything else
(including copy_tree()) uses namespace_sem to protect those.

There's another interesting issue with shrink_submounts() - it can
put vfsmounts back on the wrong list. If e.g. cifs is mounted under
nfs and does an automount, umount of nfs might push cifs expirables
*to* *nfs* *list*.

I'm not sure that I understand Trond's intentions with that stuff;
what exactly are we trying to achieve and why the hell is it fs-specific
to start with?

Another interesting thing is locking in mark_mounts_for_expiry(); we
go to all sorts of convolutions that are, AFAICS, not needed anymore.
We used to need rather interesting logics back when we had per-namespace
semaphores; thus grabbing the namespace, dropping vfsmount_lock, dealing
with races in case if something manages to walk into the potential
victim, etc.

I think that having the damn thing global *and* having umount_tree()
callable under vfsmount_lock (we unhash everything in the subtree and
put it on the kill list, so that nothing can walk into those suckers
via hash lookup; release_mounts() called after we are done (and after
releasing vfsmount_lock *and* namespace_sem) will take care of actual
talking to filesystems) removes the need of all that crap.

IOW, mark_mounts_for_expiry() should do the following:
grabbing namespace_sem exclusive
grab vfsmount_lock
walk the list as it does now, except that it should do the right
check from the very beginning (propagate_mount_busy())
without dropping the vfsmount_lock, go through the collected list,
calling umount_tree()
drop the locks
do release_mounts()
The second pass is needed since umount_tree() might do interesting things
to expiry list, so we make life easier for ourselves by leaving that to
second pass when we just want to drain the resulting list until it's empty.

Does anybody see holes in the above?

shrink_submounts() is _probably_ similar (lock/collect/umount_tree on all/
unlock/release_mounts), but I'm not sure if I understand WTF is really
attempted in there.

Is there any reason why we do that in ->umount_begin() and not *after*
it, unconditionally, straight from do_umount()? AFAICS, the only reason
why it's done from fs-specific code is figuring out which mount-list
should the stuff go back to, and that's both broken *and* not needed
with sanitized locking as above. While we are at it, I'd rather return
->umount_begin() to its previous prototype, TYVM - the less filesystem
sees vfsmounts, the better off we all are...

Comments? If nobody objects, I'm going to do that in vfs-fixes branch
and then push to Linus...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/