Re: [GIT PULL] x86/shstk for 6.4

From: Linus Torvalds
Date: Mon May 08 2023 - 19:31:33 EST


On Mon, May 8, 2023 at 3:57 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> There's a wrinkle to enforcing that universally. From the SDM's
> "ACCESSED AND DIRTY FLAGS" section:
>
> If software on one logical processor writes to a page while
> software on another logical processor concurrently clears the
> R/W flag in the paging-structure entry that maps the page,
> execution on some processors may result in the entry’s dirty
> flag being set.

I was actually wondering about that.

I had this memory that we've done special things in the past to make
sure that the dirty bit is guaranteed stable (ie the whole
"ptep_clear()" dance). But I wasn't sure.

> This behavior is gone on shadow stack CPUs

Ok, so Intel has actually tightened up the rules on setting dirty, and
now guarantees that it will set dirty only if the pte is actually
writable?

> We could probably tolerate the cost for some of the users like ksm. But
> I can't think of a way to do it without making fork() suffer. fork() of
> course modifies the PTE (RW->RO) and flushes the TLB now. But there
> would need to be a Present=0 PTE in there somewhere before the TLB flush.

Yeah, we don't want to make fork() any worse than it already is. No
question about that.

But if we make the rule be that having the exact dirty bit vs rw bit
semantics only matters for CPUs that do the shadow stack thing, and on
*those* CPUs it's ok to not go through the dance, can we then come up
with a sequence that works for everybody?

> So, the rule would be something like:
>
> The *kernel* will never itself create Write=0,Dirty=1 PTEs
>
> That won't prevent the hardware from still being able to do it behind
> our backs on older CPUs. But it does avoid a few of the special cases.

Right. So looking at the fork() case as a nasty example, right now we have

ptep_set_wrprotect()

on the source pte of a fork(), which atomically just clears the WRITE
bit (and thus guarantees that dirty bits cannot get lost, simply
because it doesn't matter if some other CPU atomically sets another
bit concurrently).

On the destination we don't have any races with concurrent accesses,
and just do entirely non-atomic

pte = pte_wrprotect(pte);

and then eventually (after other bit games) do

set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);

and basically you're saying that there is no possible common sequence
for that ptep_set_wrprotect() that doesn't penalize some case.

Hmm.

Yeah, right now the non-shadow-stack ptep_set_wrprotect() can just be
an atomic clear_bit(), which turns into just

lock andb $-3, (%reg)

and I guess that would inevitably become a horror of a cmpxchg loop
when you need to move the dirty bit to the SW dirty bit on CPUs where
the dirty bit can come in late.
dirty bit can come in late.

How very very horrid.

Linus