Re: [REVIEW][PATCH] mount: In propagate_umount handle overlapping mount propagation trees

From: Andrei Vagin
Date: Fri Oct 21 2016 - 05:07:24 EST


On Tue, Oct 18, 2016 at 10:46:40PM -0500, Eric W. Biederman wrote:
>
> Adrei Vagin pointed out that time to executue propagate_umount can go
> non-linear (and take a ludicrious amount of time) when the mount
> propogation trees of the mounts to be unmunted by a lazy unmount
> overlap.
>
> While investigating the horrible performance I realized that in
> the case overlapping mount trees since the addition of locked
> mount support the code has been failing to unmount all of the
> mounts it should have been unmounting.
>
> Make the walk of the mount propagation trees nearly linear by using
> MNT_MARK to mark pieces of the mount propagation trees that have
> already been visited, allowing subsequent walks to skip over
> subtrees.
>
> Make the processing of mounts order independent by adding a list of
> mount entries that need to be unmounted, and simply adding a mount to
> that list when it becomes apparent the mount can safely be unmounted.
> For mounts that are locked on other mounts but otherwise could be
> unmounted move them from their parnets mnt_mounts to mnt_umounts so
> that if and when their parent becomes unmounted these mounts can be
> added to the list of mounts to unmount.
>
> Add a final pass to clear MNT_MARK and to restore mnt_mounts
> from mnt_umounts for anything that did not get unmounted.
>
> Add the functions propagation_visit_next and propagation_revisit_next
> to coordinate walking of the mount tree and setting and clearing the
> mount mark.
>
> The skipping of already unmounted mounts has been moved from
> __lookup_mnt_last to mark_umount_candidates, so that the new
> propagation functions can notice when the propagation tree passes
> through the initial set of unmounted mounts. Except in umount_tree as
> part of the unmounting process the only place where unmounted mounts
> should be found are in unmounted subtrees. All of the other callers
> of __lookup_mnt_last are from mounted subtrees so the not checking for
> unmounted mounts should not affect them.
>
> A script to generate overlapping mount propagation trees:
> $ cat run.sh
> mount -t tmpfs test-mount /mnt
> mount --make-shared /mnt
> for i in `seq $1`; do
> mkdir /mnt/test.$i
> mount --bind /mnt /mnt/test.$i
> done
> cat /proc/mounts | grep test-mount | wc -l
> time umount -l /mnt
> $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done
>
> Here are the performance numbers with and without the patch:
>
> mhash | 8192 | 8192 | 8192 | 131072 | 131072 | 104857 | 104857
> mounts | before | after | after (sys) | after | after (sys) | after | after (sys)
> -------------------------------------------------------------------------------------
> 1024 | 0.071s | 0.020s | 0.000s | 0.022s | 0.004s | 0.020s | 0.004s
> 2048 | 0.184s | 0.022s | 0.004s | 0.023s | 0.004s | 0.022s | 0.008s
> 4096 | 0.604s | 0.025s | 0.020s | 0.029s | 0.008s | 0.026s | 0.004s
> 8912 | 4.471s | 0.053s | 0.020s | 0.051s | 0.024s | 0.047s | 0.016s
> 16384 | 34.826s | 0.088s | 0.060s | 0.081s | 0.048s | 0.082s | 0.052s
> 32768 | | 0.216s | 0.172s | 0.160s | 0.124s | 0.160s | 0.096s
> 65536 | | 0.819s | 0.726s | 0.330s | 0.260s | 0.338s | 0.256s
> 131072 | | 4.502s | 4.168s | 0.707s | 0.580s | 0.709s | 0.592s
>
> Andrei Vagin reports fixing the performance problem is part of the
> work to fix CVE-2016-6213.
>
> A script for a pathlogical set of mounts:
>
> $ cat pathological.sh
>
> mount -t tmpfs base /mnt
> mount --make-shared /mnt
> mkdir -p /mnt/b
>
> mount -t tmpfs test1 /mnt/b
> mount --make-shared /mnt/b
> mkdir -p /mnt/b/10
>
> mount -t tmpfs test2 /mnt/b/10
> mount --make-shared /mnt/b/10
> mkdir -p /mnt/b/10/20
>
> mount --rbind /mnt/b /mnt/b/10/20
>
> unshare -Urm sleep 2
> umount -l /mnt/b
> wait %%
>
> $ unsahre -Urm pathlogical.sh
>
> Cc: stable@xxxxxxxxxxxxxxx
> Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount")
> Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
> Reported-by: Andrei Vagin <avagin@xxxxxxxxxx>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
> ---
>
> Barring some stupid mistake this looks like it fixes both the performance
> and the correctness issues I was able to spot earlier. Andrei if you
> could give this version a look over I would appreciate it.

Eric, could you try out this script:

[root@fc24 mounts]# cat run.sh
set -e -m

mount -t tmpfs zdtm /mnt
mkdir -p /mnt/1 /mnt/2
mount -t tmpfs zdtm /mnt/1
mount --make-shared /mnt/1
mkdir /mnt/1/1

iteration=30
if [ -n "$1" ]; then
iteration=$1
fi

for i in `seq $iteration`; do
mount --bind /mnt/1/1 /mnt/1/1 &
done

ret=0
for i in `seq $iteration`; do
wait -n || ret=1
done

[ "$ret" -ne 0 ] && {
time umount -l /mnt/1
exit 0
}

mount --rbind /mnt/1 /mnt/2
mount --make-slave /mnt/2
mount -t tmpfs zzz /mnt/2/1

nr=`cat /proc/self/mountinfo | grep zdtm | wc -l`
echo -n "umount -l /mnt/1 -> $nr "
/usr/bin/time -f '%E' umount -l /mnt/1

nr=`cat /proc/self/mountinfo | grep zdtm | wc -l`
echo -n "umount -l /mnt/2 -> $nr "
/usr/bin/time -f '%E' umount -l /mnt/2

[root@fc24 mounts]# unshare -Urm sh run.sh 4

It hangs up on my host with this patch.

NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [umount:789]
Modules linked in: nfsv3 nfs fscache bridge stp llc ebtable_filter
ebtables ip6table_filter ip6_tables ppdev crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel virtio_balloon i2c_piix4 parport_pc parport
acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc
virtio_net virtio_blk virtio_console crc32c_intel serio_raw virtio_pci
virtio_ring virtio ata_generic pata_acpi
CPU: 0 PID: 789 Comm: umount Not tainted 4.9.0-rc1+ #137
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24
04/01/2014
task: ffff88007c11c100 task.stack: ffffc900007c0000
RIP: 0010:[<ffffffff8125bd87>] [<ffffffff8125bd87>]
__lookup_mnt_last+0x67/0x80
RSP: 0018:ffffc900007c3db0 EFLAGS: 00000286
RAX: ffff88007a5f0900 RBX: ffff88007b136620 RCX: ffff88007a3e2900
RDX: ffff88007a3e2900 RSI: ffff88007b136600 RDI: ffff88007b136600
RBP: ffffc900007c3dc0 R08: ffff880036df5850 R09: ffffffff81249664
R10: ffff88007bd84c38 R11: 0000000100000000 R12: ffff88007bce3f00
R13: ffffc900007c3e00 R14: ffff88007bce3f00 R15: 00007ffe54245328
FS: 00007ff465de0840(0000) GS:ffff88007fc00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f86b328128 CR3: 000000007ba23000 CR4: 00000000003406f0
Stack:
ffff88007b136600 ffff88007b136480 ffffc900007c3df0 ffffffff8126a70a
ffffc900007c3e58 ffffc900007c3e10 ffffc900007c3e00 ffff880079f99980
ffffc900007c3e48 ffffffff8126b0d5 ffff88007a3e2660 ffff88007a3e24e0
Call Trace:
[<ffffffff8126a70a>] propagation_visit_child.isra.8+0x5a/0xd0
[<ffffffff8126b0d5>] propagate_umount+0x65/0x2e0
[<ffffffff8125a76e>] umount_tree+0x2be/0x2d0
[<ffffffff8125b75f>] do_umount+0x13f/0x340
[<ffffffff8125c3ce>] SyS_umount+0x10e/0x120
[<ffffffff817ba837>] entry_SYSCALL_64_fastpath+0x1a/0xa9

Thanks,
Andrei

>
> Unless we can find a problem I am going to call this the final version.
>
> fs/mount.h | 1 +
> fs/namespace.c | 7 +-
> fs/pnode.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++-----------
> fs/pnode.h | 2 +-
> 4 files changed, 165 insertions(+), 43 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index d2e25d7b64b3..00fe0d1d6ba7 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -58,6 +58,7 @@ struct mount {
> struct mnt_namespace *mnt_ns; /* containing namespace */
> struct mountpoint *mnt_mp; /* where is it mounted */
> struct hlist_node mnt_mp_list; /* list mounts with the same mountpoint */
> + struct list_head mnt_umounts; /* list of children that are being unmounted */
> #ifdef CONFIG_FSNOTIFY
> struct hlist_head mnt_fsnotify_marks;
> __u32 mnt_fsnotify_mask;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e6c234b1a645..73801391bb00 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -237,6 +237,7 @@ static struct mount *alloc_vfsmnt(const char *name)
> INIT_LIST_HEAD(&mnt->mnt_slave_list);
> INIT_LIST_HEAD(&mnt->mnt_slave);
> INIT_HLIST_NODE(&mnt->mnt_mp_list);
> + INIT_LIST_HEAD(&mnt->mnt_umounts);
> #ifdef CONFIG_FSNOTIFY
> INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
> #endif
> @@ -650,13 +651,11 @@ struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry)
> p = __lookup_mnt(mnt, dentry);
> if (!p)
> goto out;
> - if (!(p->mnt.mnt_flags & MNT_UMOUNT))
> - res = p;
> + res = p;
> hlist_for_each_entry_continue(p, mnt_hash) {
> if (&p->mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry)
> break;
> - if (!(p->mnt.mnt_flags & MNT_UMOUNT))
> - res = p;
> + res = p;
> }
> out:
> return res;
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 234a9ac49958..15e30e861a14 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -390,56 +390,153 @@ void propagate_mount_unlock(struct mount *mnt)
> }
>
> /*
> - * Mark all mounts that the MNT_LOCKED logic will allow to be unmounted.
> + * get the next mount in the propagation tree (that has not been visited)
> + * @m: the mount seen last
> + * @origin: the original mount from where the tree walk initiated
> + *
> + * Note that peer groups form contiguous segments of slave lists.
> + * We rely on that in get_source() to be able to find out if
> + * vfsmount found while iterating with propagation_next() is
> + * a peer of one we'd found earlier.
> */
> -static void mark_umount_candidates(struct mount *mnt)
> +static struct mount *propagation_visit_child(struct mount *last_child,
> + struct mount *origin_child)
> {
> - struct mount *parent = mnt->mnt_parent;
> - struct mount *m;
> + struct mount *m = last_child->mnt_parent;
> + struct mount *origin = origin_child->mnt_parent;
> + struct dentry *mountpoint = origin_child->mnt_mountpoint;
> + struct mount *child;
>
> - BUG_ON(parent == mnt);
> + /* Has this part of the propgation tree already been visited? */
> + if (IS_MNT_MARKED(last_child))
> + return NULL;
>
> - for (m = propagation_next(parent, parent); m;
> - m = propagation_next(m, parent)) {
> - struct mount *child = __lookup_mnt_last(&m->mnt,
> - mnt->mnt_mountpoint);
> - if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) {
> - SET_MNT_MARK(child);
> + SET_MNT_MARK(last_child);
> +
> + /* are there any slaves of this mount? */
> + if (!list_empty(&m->mnt_slave_list)) {
> + m = first_slave(m);
> + goto check_slave;
> + }
> + while (1) {
> + struct mount *master = m->mnt_master;
> +
> + if (master == origin->mnt_master) {
> + struct mount *next = next_peer(m);
> + while (1) {
> + if (next == origin)
> + return NULL;
> + child = __lookup_mnt_last(&next->mnt, mountpoint);
> + if (child && !IS_MNT_MARKED(child))
> + return child;
> + next = next_peer(next);
> + }
> + } else {
> + while (1) {
> + if (m->mnt_slave.next == &master->mnt_slave_list)
> + break;
> + m = next_slave(m);
> + check_slave:
> + child = __lookup_mnt_last(&m->mnt, mountpoint);
> + if (child && !IS_MNT_MARKED(child))
> + return child;
> + }
> }
> +
> + /* back at master */
> + m = master;
> }
> }
>
> /*
> - * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
> - * parent propagates to.
> + * get the next mount in the propagation tree (that has not been revisited)
> + * @m: the mount seen last
> + * @origin: the original mount from where the tree walk initiated
> + *
> + * Note that peer groups form contiguous segments of slave lists.
> + * We rely on that in get_source() to be able to find out if
> + * vfsmount found while iterating with propagation_next() is
> + * a peer of one we'd found earlier.
> */
> -static void __propagate_umount(struct mount *mnt)
> +static struct mount *propagation_revisit_child(struct mount *last_child,
> + struct mount *origin_child)
> {
> - struct mount *parent = mnt->mnt_parent;
> - struct mount *m;
> + struct mount *m = last_child->mnt_parent;
> + struct mount *origin = origin_child->mnt_parent;
> + struct dentry *mountpoint = origin_child->mnt_mountpoint;
> + struct mount *child;
>
> - BUG_ON(parent == mnt);
> + /* Has this part of the propgation tree already been revisited? */
> + if (!IS_MNT_MARKED(last_child))
> + return NULL;
>
> - for (m = propagation_next(parent, parent); m;
> - m = propagation_next(m, parent)) {
> + CLEAR_MNT_MARK(last_child);
>
> - struct mount *child = __lookup_mnt_last(&m->mnt,
> - mnt->mnt_mountpoint);
> - /*
> - * umount the child only if the child has no children
> - * and the child is marked safe to unmount.
> - */
> - if (!child || !IS_MNT_MARKED(child))
> - continue;
> - CLEAR_MNT_MARK(child);
> - if (list_empty(&child->mnt_mounts)) {
> - list_del_init(&child->mnt_child);
> - child->mnt.mnt_flags |= MNT_UMOUNT;
> - list_move_tail(&child->mnt_list, &mnt->mnt_list);
> + /* are there any slaves of this mount? */
> + if (!list_empty(&m->mnt_slave_list)) {
> + m = first_slave(m);
> + goto check_slave;
> + }
> + while (1) {
> + struct mount *master = m->mnt_master;
> +
> + if (master == origin->mnt_master) {
> + struct mount *next = next_peer(m);
> + while (1) {
> + if (next == origin)
> + return NULL;
> + child = __lookup_mnt_last(&next->mnt, mountpoint);
> + if (child && IS_MNT_MARKED(child))
> + return child;
> + next = next_peer(next);
> + }
> + } else {
> + while (1) {
> + if (m->mnt_slave.next == &master->mnt_slave_list)
> + break;
> + m = next_slave(m);
> + check_slave:
> + child = __lookup_mnt_last(&m->mnt, mountpoint);
> + if (child && IS_MNT_MARKED(child))
> + return child;
> + }
> }
> +
> + /* back at master */
> + m = master;
> }
> }
>
> +static void start_umount_propagation(struct mount *child,
> + struct list_head *to_umount)
> +{
> + do {
> + struct mount *parent = child->mnt_parent;
> +
> + if ((child->mnt.mnt_flags & MNT_UMOUNT) ||
> + !list_empty(&child->mnt_mounts))
> + return;
> +
> + if (!IS_MNT_LOCKED(child))
> + list_move_tail(&child->mnt_child, to_umount);
> + else
> + list_move_tail(&child->mnt_child, &parent->mnt_umounts);
> +
> + child = NULL;
> + if (IS_MNT_MARKED(parent))
> + child = parent;
> + } while (child);
> +}
> +
> +static void end_umount_propagation(struct mount *child)
> +{
> + struct mount *parent = child->mnt_parent;
> +
> + if (!list_empty(&parent->mnt_umounts))
> + list_splice_tail_init(&parent->mnt_umounts, &parent->mnt_mounts);
> +}
> +
> +
> /*
> * collect all mounts that receive propagation from the mount in @list,
> * and return these additional mounts in the same list.
> @@ -447,14 +544,39 @@ static void __propagate_umount(struct mount *mnt)
> *
> * vfsmount lock must be held for write
> */
> -int propagate_umount(struct list_head *list)
> +void propagate_umount(struct list_head *list)
> {
> struct mount *mnt;
> + LIST_HEAD(to_umount);
> + LIST_HEAD(tmp_list);
> +
> + /* Find candidates for unmounting */
> + list_for_each_entry(mnt, list, mnt_list) {
> + struct mount *child;
> + for (child = propagation_visit_child(mnt, mnt); child;
> + child = propagation_visit_child(child, mnt))
> + start_umount_propagation(child, &to_umount);
> + }
>
> - list_for_each_entry_reverse(mnt, list, mnt_list)
> - mark_umount_candidates(mnt);
> + /* Begin unmounting */
> + while (!list_empty(&to_umount)) {
> + mnt = list_first_entry(&to_umount, struct mount, mnt_child);
>
> - list_for_each_entry(mnt, list, mnt_list)
> - __propagate_umount(mnt);
> - return 0;
> + list_del_init(&mnt->mnt_child);
> + mnt->mnt.mnt_flags |= MNT_UMOUNT;
> + list_move_tail(&mnt->mnt_list, &tmp_list);
> +
> + if (!list_empty(&mnt->mnt_umounts))
> + list_splice_tail_init(&mnt->mnt_umounts, &to_umount);
> + }
> +
> + /* Cleanup the mount propagation tree */
> + list_for_each_entry(mnt, list, mnt_list) {
> + struct mount *child;
> + for (child = propagation_revisit_child(mnt, mnt); child;
> + child = propagation_revisit_child(child, mnt))
> + end_umount_propagation(child);
> + }
> +
> + list_splice_tail(&tmp_list, list);
> }
> diff --git a/fs/pnode.h b/fs/pnode.h
> index 550f5a8b4fcf..38c6cdb96b34 100644
> --- a/fs/pnode.h
> +++ b/fs/pnode.h
> @@ -41,7 +41,7 @@ static inline void set_mnt_shared(struct mount *mnt)
> void change_mnt_propagation(struct mount *, int);
> int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
> struct hlist_head *);
> -int propagate_umount(struct list_head *);
> +void propagate_umount(struct list_head *);
> int propagate_mount_busy(struct mount *, int);
> void propagate_mount_unlock(struct mount *);
> void mnt_release_group_id(struct mount *);
> --
> 2.10.1
>