Re: [GIT PULL] x86/shstk for 6.4

From: Linus Torvalds
Date: Mon May 08 2023 - 19:48:13 EST


On Mon, May 8, 2023 at 4:31 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Yeah, right now the non-shadow-stack ptep_set_wrprotect() can just be
> an atomic clear_bit(), which turns into just
>
> lock andb $-3, (%reg)
>
> and I guess that would inevitably become a horror of a cmpxchg loop
> when you need to move the dirty bit to the SW dirty on CPU's where the
> dirty bit can come in late.
>
> How very very horrid.

Oh, maybe not.

So the nice thing here is that we actually *do* have the old value
(although we don't pass it in to ptep_set_wrprotect(), so while the

lock andb $-3,(%reg)

looks simple and efficient, I do think that it wouldn't necessarily be
any worse to replace it with a

lock cmpxchg %new,%old,(%reg)

and a well-predicted branch for the (very very unlikely) failure case.

So maybe unifying these two areas wouldn't be too bad.

In fact, looking at this code a bit more made me realize that we
probably should always have special-cased the common "is the source MM
single-threaded" case.

Pretty much nobody does "fork()" from a threaded app, because it's not
portable anyway.

So:

- the nasty racy case is already exceedingly rare, and we're wasting
huge amounts of time on using a locked access here when in 99% of all
cases we shouldn't do that at all!

- we would probably be *much* better off with a "if (mm->count == 1)"
test that goes off and does *not* do the atomic case (which also
doesn't have any worries about dirty bits). I'll take a well-predicted
conditional branch over an atomic op any day of the week

- once you do that, the nasty racy case isn't even in the critical
path any more, so it might as well do the unconditional "cmpxchg" loop
that works on all CPU's.

And yes, old CPU's may still set the dirty bit *later*, but we can
indeed make the rule be that the *kernel* never sets that "read-only
and dirty" combination, and then on new CPU's the HW guarantees it
doesn't set it either.

How does that sound? I think we'd potentially speed up fork() quite
noticeably by not doing the locked accesses.

Maybe I just think I'm clever, but I'm actually very stupid and am not
thinking of some obvious problem case.

For example, maybe the mm count doesn't end up being 1 commonly after
all, because we're sharing it with the idle thread or something like
that.

Or maybe there's something even more obviously wrong with the above idea.

Linus