Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

From: Andy Lutomirski
Date: Mon Mar 11 2024 - 20:03:13 EST




On Mon, Mar 11, 2024, at 4:56 PM, Nadav Amit wrote:
>> On 12 Mar 2024, at 1:41, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>
>> On Mon, Mar 11, 2024, at 4:34 PM, Dave Hansen wrote:
>>> On 3/11/24 15:17, Andy Lutomirski wrote:
>>>> I *think* that all x86 implementations won't fill the TLB for a
>>>> non-accessed page without also setting the accessed bit,
>>>
>>> That's my understanding as well. The SDM is a little more obtuse about it:
>>>
>>>> Whenever the processor uses a paging-structure entry as part of
>>>> linear-address translation, it sets the accessed flag in that entry
>>>> (if it is not already set).
>>>
>>> but it's there.
>>>
>>> But if we start needing Accessed=1 to be accurate, clearing those PTEs
>>> gets more expensive because it needs to be atomic to lock out the page
>>> walker. It basically needs to start getting treated similarly to what
>>> is done for Dirty=1 on userspace PTEs. Not the end of the world, of
>>> course, but one more source of overhead.
>>
>> In my fantasy land where I understand the x86 paging machinery, suppose we're in finish_task_switch(), and suppose prev is Not Horribly Buggy (TM). In particular, suppose that no other CPU is concurrently (non-speculatively!) accessing prev's stack. Prev can't be running, because whatever magic lock prevents it from being migrated hasn't been released yet. (I have no idea what lock this is, but it had darned well better exist so prev isn't migrated before switch_to() even returns.)
>>
>> So the current CPU is not accessing the memory, and no other CPU is accessing the memory, and BPF doesn't exist, so no one is being utterly daft and a kernel read probe, and perf isn't up to any funny business, etc. And a CPU will never *speculatively* set the accessed bit (I told you it's fantasy land), so we just do it unlocked:
>>
>> if (!pte->accessed) {
>> *pte = 0;
>> reuse the memory;
>> }
>>
>> What could possibly go wrong?
>>
>> I admit this is not the best idea I've ever had, and I will not waste anyone's time by trying very hard to defend it :)
>>
>
> Just a thought: you don’t care if someone only reads from the stack's
> page (you can just install another page later). IOW: you only care if
> someone writes.
>
> So you can look on the dirty-bit, which is not being set speculatively
> and save yourself one problem.

Doesn't this buy a new problem? Install a page, run the thread without using the page but speculatively load the PTE as read-only into the TLB, context-switch out the thread, (entirely safely and correctly) determine that the page wasn't written, clear the PTE, use the page for something else and fill it with things that aren't zero, run the thread again, and read from that address through the stale TLB entry. Now it has some other thread's data!
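
To make that sequence concrete, here is a toy userspace model of it. Nothing in it is kernel code: struct model_mm and the model_* helpers are made-up names, a "page" is a single byte, and the "TLB" is one cached pointer. It only shows the ordering problem, under the assumption that the TLB can hold a translation the dirty bit never recorded.

```c
#include <stddef.h>

/*
 * Hypothetical model, not kernel code: a "page" is one byte, and both
 * the "PTE" and the "TLB entry" are just pointers to a page.
 */
struct model_mm {
	unsigned char *pte;	/* current translation, NULL if not present */
	unsigned char *tlb;	/* cached translation, NULL if nothing cached */
};

/* Speculative fill: caches the translation without touching any data. */
static void model_spec_fill(struct model_mm *mm)
{
	mm->tlb = mm->pte;
}

/* Clear the PTE but leave the TLB alone, as in the sequence above. */
static void model_unmap_noflush(struct model_mm *mm)
{
	mm->pte = NULL;
}

/*
 * Read through the TLB if it has an entry, otherwise walk the "page
 * table" first. Returns -1 to stand in for a page fault.
 */
static int model_read(struct model_mm *mm)
{
	if (!mm->tlb)
		mm->tlb = mm->pte;
	return mm->tlb ? *mm->tlb : -1;
}
```

Calling model_spec_fill(), then model_unmap_noflush(), then storing a new value into the page and calling model_read() returns the new owner's value instead of faulting. Resetting mm->tlb to NULL first (the flush) makes the same read return -1.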

One might slightly credibly argue that this isn't a problem -- a read between RSP and the bottom of the area that one nominally considers to be the stack is allowed to return arbitrary garbage, especially in the kernel where there's no red zone (until someone builds a kernel with a red zone on a FRED system, hmm) -- but this is still really weird. If you *write* in that area, the CPU hopefully puts the *correct* value in the TLB and life goes on, but how much do you trust anyone to have validated what happens when a PTE is present, writable and clean but the TLB contains a stale entry pointing somewhere else? And is it really okay to do this to the poor kernel?

If we're going to add a TLB flush on context switch, then (a) we are being rather silly and (b) we might as well just use atomics to play with the accessed bit instead, I think.
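
For reference, a minimal sketch of what "use atomics to play with the accessed bit" could look like, written as a userspace C11 model rather than real kernel code. The bit positions match the x86 PTE layout (Present is bit 0, R/W is bit 1, Accessed is bit 5); model_pte_t, model_try_reclaim() and model_walker_touch() are hypothetical names invented for this sketch, and the "page walker" is just another function doing an atomic update.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions follow the x86 PTE layout; the rest is a model. */
#define PTE_PRESENT	(UINT64_C(1) << 0)
#define PTE_RW		(UINT64_C(1) << 1)
#define PTE_ACCESSED	(UINT64_C(1) << 5)

/* Hypothetical stand-in for a real PTE slot. */
typedef _Atomic uint64_t model_pte_t;

/*
 * Try to tear down the mapping. Returns true if the PTE was cleared
 * while Accessed was still 0, so the page may be reused. Returns false
 * if Accessed was already set, or if the entry changed between the
 * load and the compare-exchange.
 */
static bool model_try_reclaim(model_pte_t *pte)
{
	uint64_t old = atomic_load(pte);

	if (old & PTE_ACCESSED)
		return false;

	/* Fails if the "page walker" modified the entry in the meantime. */
	return atomic_compare_exchange_strong(pte, &old, 0);
}

/* Hypothetical model of the page walker setting Accessed on use. */
static void model_walker_touch(model_pte_t *pte)
{
	uint64_t old = atomic_load(pte);

	/* Only a present entry can be used for translation. */
	while ((old & PTE_PRESENT) &&
	       !atomic_compare_exchange_weak(pte, &old, old | PTE_ACCESSED))
		;
}
```

The point of the compare-exchange is that the check and the clear are one step: if the walker sets Accessed after the load, the exchange fails and the page is left mapped. This is the extra cost Dave mentioned, and the sketch says nothing about stale TLB entries, which still need their own handling.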