Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

From: Nadav Amit
Date: Mon Mar 11 2024 - 19:56:40 EST




> On 12 Mar 2024, at 1:41, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> On Mon, Mar 11, 2024, at 4:34 PM, Dave Hansen wrote:
>> On 3/11/24 15:17, Andy Lutomirski wrote:
>>> I *think* that all x86 implementations won't fill the TLB for a
>>> non-accessed page without also setting the accessed bit,
>>
>> That's my understanding as well. The SDM is a little more obtuse about it:
>>
>>> Whenever the processor uses a paging-structure entry as part of
>>> linear-address translation, it sets the accessed flag in that entry
>>> (if it is not already set).
>>
>> but it's there.
>>
>> But if we start needing Accessed=1 to be accurate, clearing those PTEs
>> gets more expensive because it needs to be atomic to lock out the page
>> walker. It basically needs to start getting treated similarly to what
>> is done for Dirty=1 on userspace PTEs. Not the end of the world, of
>> course, but one more source of overhead.
>
> In my fantasy land where I understand the x86 paging machinery, suppose we're in finish_task_switch(), and suppose prev is Not Horribly Buggy (TM). In particular, suppose that no other CPU is concurrently (non-speculatively!) accessing prev's stack. Prev can't be running, because whatever magic lock prevents it from being migrated hasn't been released yet. (I have no idea what lock this is, but it had darned well better exist so prev isn't migrated before switch_to() even returns.)
>
> So the current CPU is not accessing the memory, and no other CPU is accessing the memory, and BPF doesn't exist, so no one is being utterly daft and a kernel read probe, and perf isn't up to any funny business, etc. And a CPU will never *speculatively* set the accessed bit (I told you it's fantasy land), so we just do it unlocked:
>
> if (!pte->accessed) {
> *pte = 0;
> reuse the memory;
> }
>
> What could possibly go wrong?
>
> I admit this is not the best idea I've ever had, and I will not waste anyone's time by trying very hard to defend it :)
>

Just a thought: you don’t care if someone only reads from the stack's page (you can just install another page later). IOW: you only care if someone writes.

So you can look at the dirty bit instead, which is not set speculatively, and save yourself one problem.
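
Roughly what I have in mind, as a user-space model rather than kernel code. The helper names and the plain uint64_t PTE are made up for illustration; only the bit positions (Accessed = bit 5, Dirty = bit 6) are the architectural x86 ones. It ignores the TLB flush and the atomicity against the page walker that Dave mentioned, so take it as a sketch of the check, not a patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* x86 PTE flag bits: bit 5 is Accessed, bit 6 is Dirty. */
#define PTE_ACCESSED (UINT64_C(1) << 5)
#define PTE_DIRTY    (UINT64_C(1) << 6)

/* Hypothetical helper: true if nothing was ever written through this PTE. */
static bool stack_page_is_clean(uint64_t pte)
{
	return (pte & PTE_DIRTY) == 0;
}

/*
 * Hypothetical helper: drop the mapping if the page was never written.
 * Accessed is deliberately ignored: a read-only (or speculative) access
 * leaves nothing worth keeping, and a fresh page can be installed later.
 * Returns true if the caller may reuse the backing memory.
 */
static bool try_reclaim_stack_page(uint64_t *pte)
{
	assert(pte != NULL);

	if (!stack_page_is_clean(*pte))
		return false;	/* someone wrote to it: keep the page */

	*pte = 0;		/* unlocked store, as in the quoted pseudocode */
	return true;
}
```

A PTE with Accessed=1 and Dirty=0 gets reclaimed here, whereas the accessed-bit version of the check would have kept it.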