Re: [GIT PULL] x86/shstk for 6.4

From: Dave Hansen
Date: Mon May 08 2023 - 18:57:20 EST


On 5/6/23 13:09, Linus Torvalds wrote:
> Now, my reaction here is
>
> - the whole shadow stack notion of "dirty but not writable is a magic
> marker" is *DISGUSTING*. It's wrong.
>
> Whatever Intel designer that came up with that - instead of just
> picking another bit for the *HARDWARE* to check - should be ashamed.
>
> Now we have to pick a software bit instead, and play games for
> this. BAD BAD BAD.
>
> I'm assuming this is something where Microsoft went "we already
> don't have that, and we want all the sw bits for sw, so do this". But
> from a design standpoint it's just nasty.

Heh, I won't name names. But, yeah, it was something like that.

> - But if we have to play those games, just *play* them. Do it all
> unconditionally, and make the x86-64 rules be that "dirty but not
> writable" is something we should never have.

There's a wrinkle to enforcing that universally. From the SDM's
"ACCESSED AND DIRTY FLAGS" section:

If software on one logical processor writes to a page while
software on another logical processor concurrently clears the
R/W flag in the paging-structure entry that maps the page,
execution on some processors may result in the entry’s dirty
flag being set.

This behavior is gone on shadow stack CPUs, but it does exist on older
ones. We could theoretically stop being exposed to it by transitioning
all PTE operations that today do:

1. RW => RO (usually more than one)
2. TLB flush

to instead take a trip through Present=0 first:

1. RW => Present=0
2. TLB flush
3. Present=0 => RO

This is similar to what we already do for Dirty=1->0 transitions.
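
To make the two orderings concrete, here is a rough user-space model. It is
only a sketch: the Present, R/W and Dirty bit positions are the architectural
ones, but model_tlb_flush() and both wrprotect_*() helpers are made-up names
for illustration, and the model is single-threaded, so it shows the ordering
of the steps rather than the race itself.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural x86 PTE flag positions. */
#define PTE_PRESENT (UINT64_C(1) << 0)
#define PTE_RW      (UINT64_C(1) << 1)
#define PTE_DIRTY   (UINT64_C(1) << 6)

/* Hypothetical stand-in for a TLB flush; it only counts calls. */
static int tlb_flushes;

static void model_tlb_flush(void)
{
	tlb_flushes++;
}

/*
 * Today's ordering: RW => RO, then flush. On older CPUs a concurrent
 * writer may still get Dirty set on the entry after the R/W flag has
 * been cleared, which leaves Write=0,Dirty=1 behind.
 */
static uint64_t wrprotect_direct(uint64_t *ptep)
{
	*ptep &= ~PTE_RW;		/* 1. RW => RO */
	model_tlb_flush();		/* 2. TLB flush */
	return *ptep;
}

/*
 * Alternative ordering: take a trip through Present=0 first, so the
 * entry is not usable while stale translations may still exist.
 */
static uint64_t wrprotect_via_not_present(uint64_t *ptep)
{
	uint64_t old = *ptep;
	uint64_t ro;

	*ptep = 0;			/* 1. RW => Present=0 */
	model_tlb_flush();		/* 2. TLB flush */
	ro = old & ~PTE_RW;
	assert(ro & PTE_PRESENT);	/* caller passed a present PTE */
	*ptep = ro;			/* 3. Present=0 => RO */
	return ro;
}
```

A real implementation would need an atomic exchange for step 1 and would have
to handle faults that arrive while the entry is not present, which is where
the cost discussed below comes from.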

We could probably tolerate the cost for some of the users like ksm. But
I can't think of a way to do it without making fork() suffer. fork() of
course modifies the PTE (RW->RO) and flushes the TLB now. But there
would need to be a Present=0 PTE in there somewhere before the TLB flush.

That fundamentally means there needs to be a second look at the PTEs and
some fault handling for folks that do read-only accesses to the PTEs
during the Present=0 window.

That said, there are some places like:

pte_mksaveddirty()
and
pte_clear_saveddirty()

that are doing _extra_ things on shadow stack systems. That stuff could
be made the common case without functionally breaking any old systems.

So, the rule would be something like:

The *kernel* will never itself create Write=0,Dirty=1 PTEs

That won't prevent the hardware from still being able to do it behind
our backs on older CPUs. But it does avoid a few of the special cases.
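
As a rough illustration of that rule, here is a user-space model of the
helpers doing the Dirty/SavedDirty shuffle unconditionally. It is a sketch
under assumptions, not the code in the series: the model_*() functions are
simplified stand-ins named after pte_mksaveddirty() and
pte_clear_saveddirty(), and the SavedDirty bit position is picked arbitrarily
from the software-available range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural R/W and Dirty positions. */
#define PTE_RW          (UINT64_C(1) << 1)
#define PTE_DIRTY       (UINT64_C(1) << 6)
/* Assumed software bit for illustration only. */
#define PTE_SAVED_DIRTY (UINT64_C(1) << 58)

/* True for the Write=0,Dirty=1 combination the kernel would never create. */
static bool pte_is_write0_dirty1(uint64_t pte)
{
	return (pte & (PTE_RW | PTE_DIRTY)) == PTE_DIRTY;
}

/* Move Dirty into the software SavedDirty bit. */
static uint64_t model_mksaveddirty(uint64_t pte)
{
	if (pte & PTE_DIRTY) {
		pte &= ~PTE_DIRTY;
		pte |= PTE_SAVED_DIRTY;
	}
	return pte;
}

/* Move SavedDirty back into the hardware Dirty bit. */
static uint64_t model_clear_saveddirty(uint64_t pte)
{
	if (pte & PTE_SAVED_DIRTY) {
		pte &= ~PTE_SAVED_DIRTY;
		pte |= PTE_DIRTY;
	}
	return pte;
}

/* Write-protect without ever producing Write=0,Dirty=1. */
static uint64_t model_wrprotect(uint64_t pte)
{
	pte &= ~PTE_RW;
	pte = model_mksaveddirty(pte);
	assert(!pte_is_write0_dirty1(pte));
	return pte;
}

/* Make writable again and restore the hardware Dirty bit. */
static uint64_t model_mkwrite(uint64_t pte)
{
	pte |= PTE_RW;
	return model_clear_saveddirty(pte);
}

/* Dirty tracking has to consult both bits. */
static bool model_pte_dirty(uint64_t pte)
{
	return (pte & (PTE_DIRTY | PTE_SAVED_DIRTY)) != 0;
}
```

The assert in model_wrprotect() only covers PTEs the software builds. On older
CPUs the hardware could still set Dirty on a Write=0 entry afterwards, so code
reading PTEs could not rely on the invariant there.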