Re: [PATCH v2 10/61] kernel/fork: Use maple tree for dup_mmap() during forking

From: Liam Howlett
Date: Fri Aug 20 2021 - 13:46:25 EST


* Hillf Danton <hdanton@xxxxxxxx> [210820 00:05]:
> On Thu, 19 Aug 2021 13:32:58 +0000 Liam R. Howlett wrote:
> >* Hillf Danton <hdanton@xxxxxxxx> [210818 22:01]:
> >> On Wed, 18 Aug 2021 14:54:29 +0000 Liam R. Howlett wrote:
> >> >* Hillf Danton <hdanton@xxxxxxxx> [210818 04:36]:
> >> >> On Tue, 17 Aug 2021 15:47:11 +0000 Liam R. Howlett wrote:
> >> >> >
> >> >> > static inline void mmget(struct mm_struct *mm)
> >> >> > {
> >> >> > + mt_set_in_rcu(&mm->mm_mt);
> >> >> > atomic_inc(&mm->mm_users);
> >> >> > }
> >> >> >
> >> >> > static inline bool mmget_not_zero(struct mm_struct *mm)
> >> >> > {
> >> >> > + /*
> >> >> > + * There is a race below during task tear down that can cause the =
> >maple
> >> >> > + * tree to enter rcu mode with only a single user. If this race
> >> >> > + * happens, the result would be that the maple tree nodes would re=
> >main
> >> >> > + * active for an extra RCU read cycle.
> >> >> > + */
> >> >> > + mt_set_in_rcu(&mm->mm_mt);
> >> >> > return atomic_inc_not_zero(&mm->mm_users);
> >> >> > }
> >> >>
> >> >> Nit, leave the mm with zero refcount intact.
> >> >>
> >> >> if (atomic_inc_not_zero(&mm->mm_users)) {
> >> >> mt_set_in_rcu(&mm->mm_mt);
> >> >> return true;
> >> >> }
> >> >> return false;
> >> >
> >> >Thanks for looking at this.
> >> >
> >> >I thought about that, but came up with the following scenario:
> >> >
> >> >thread 1 thread 2
> >> >mmget(mm)
> >> > mmget_not_zero() enter..
> >> > atomic_inc_not_zero(&mm->mm_users)
> >> >mmput(mm)
> >> > mt_set_in_rcu(&mm->mm_mt);
> >> > return true;
> >> >
> >>=20
> >> At first glance, given the above mmget, mmput will not hurt anyone.
> >
> >In the case above, the maple tree enters RCU mode with a single user.
> >This will have the result of nodes being freed in RCU mode which is the
> >same result as what happens as it is written, except the racing thread
> >wins (in this case). I thought this was what you were trying to fix?
> >
> >>=20
> >> >
> >> >So I think the above does not remove the race, but does add instructions
> >>=20
> >> If the mm refcount drops to one after mmput then it is one before
> >> atomic_inc_not_zero() which only ensures the mm is stable afterwards
> >> until mmput again.
> >
> >Right. The race we are worried about is the zero referenced mm. If
> >mm->mm_users is safe, then mm->mm_mt is also safe.
> >
> >>=20
> >> >to each call to mmget_not_zero(). I thought the race of having nodes
> >> >sitting around for an rcu read cycle was worth the trade off.
> >>=20
> >> Is one ounce of the mm stability worth two pounds? Or three?
> >
> >I don't see a stability problem with the way it is written.
>
> On the maple tree side, I see in
> [PATCH v2 05/61] Maple Tree: Add new data structure
>
> + * MAPLE_USE_RCU Operate in read/copy/update mode for multi-readers
>
> <...>
>
> +/**
> + * mt_set_in_rcu() - Switch the tree to RCU safe mode.
> + */
> +static inline void mt_set_in_rcu(struct maple_tree *mt)
> +{
> + if (mt_in_rcu(mt))
> + return;
> +
> + mtree_lock(mt);
> + mt->ma_flags |= MAPLE_USE_RCU;
> + mtree_unlock(mt);
> +}
>
> and on the mm side, however, if atomic_inc_not_zero(&mm->mm_users) fails
> then who can be the "RCU multi-readers"?

There wouldn't be one. But the consequence is that the maple tree nodes
will remain active for one extra RCU cycle. This is why there is a big
comment above mmget_not_zero() explaining how this race exists but will
cause no issue.

>
> >Your change does not remove the race.
>
> If atomic_inc_not_zero() fails then there is no pre-condition in any form
> for race; if it succeeds then the race window is slammed.

My example shows how we can still end up with the tree being in RCU mode
with a single thread. atomic_inc_not_zero() succeeds, then the other
thread drops the reference counter before the maple tree enters RCU
mode.

>
> >Can you explain how the stability is affected negatively by the way it
> >is written?
>
> Hard to find the correct answer without knowing why you prefer to update
> the flags for mm->mm_mt with mm->mm_users dropping down to ground.

I prefer to update the flags for mm->mm_mt first because the fallout is
minimal, however the alternative is to increase the execution time for
the vast majority of calls. The trade off just isn't worth it,
especially since the tree may be left in RCU mode with a single user
anyways. There is no stability issue here.

Thank you,
Liam