Re: [GIT PULL] x86/shstk for 6.4

From: Dave Hansen
Date: Mon May 15 2023 - 17:36:58 EST


On 5/12/23 14:55, Linus Torvalds wrote:
>> There's always a race there because mm->mm_users can always get bumped
>> after the fork()er checks it.
> Ugh. I was assuming that if they don't already have a reference to the
> mm, they can't get one (becasue the '1' comes from 'current->mm', and
> nobody else has that mm).
>
> But maybe that's just not true. Looking around, we have things like
>
> pages->source_mm = current->mm;
> mmgrab(pages->source_mm);
>
> that creates a ref to the mm with a grab, and then later it gets used.
>
> So maybe the "no races can happen" is limited to *both* mm_users and
> mm_count being 1. What would increment it in that case, when 'current'
> is obviously busy forking?
>
> That "both are one" might still be the common case for fork(). Hmm?

get_task_mm() is the devil here. It goes right from having a
task_struct to bumping ->mm_users, no ->mm_count needed. It also bumps
->mm_users while holding task_lock(), which means we can't do something
simple like take mmap_lock in there to avoid racing with fork().

I did hack something together that seems to work for fork() and
get_task_mm(). Basically, we let get_task_mm()'s legacy behavior to be
the fast path. But it diverts over to a slow path if get_task_mm()
notices that an mm's refcounts and mmap_lock are consistent with a
fork() happening elsewhere.

The slow path releases the task_lock() and acquires mmap_lock so it can
sleep until the (potential) fork() is finished.

On the other side, the fork() code checks ->mm_users and ->mm_count. It
can now depend on them being stable because it holds mmap_lock and it
diverts the get_task_mm() callers over to the slow path.

This works for two important cases:

1. get_task_mm() callers since they now conditionally use mmap_lock
2. mmgrab() -> mmget_not_zero() users that later take the mmap_lock

I'm also fairly sure it misses some cases outside of those two. The
patch is also quite ugly. The "->task_doing_fast_fork" mechanism is
pure hackery, for instance.

This seems to stay on the fast paths pretty well, even with 'top' or
some other /proc poking going on. In the end, this is balancing the
extra cost of the get_task_mm() slow path with reduced atomic cost in
the fork() path. It looks promising so far.

Is this worth pursuing?