Re: [PATCH 07/32] mm: Bring back vmalloc_exec

From: Andy Lutomirski
Date: Sat Jun 17 2023 - 16:35:39 EST




On Sat, Jun 17, 2023, at 1:08 PM, Kent Overstreet wrote:
> On Sat, Jun 17, 2023 at 12:19:41PM -0700, Andy Lutomirski wrote:
>> On Sat, Jun 17, 2023, at 8:34 AM, Kent Overstreet wrote:

>> Great, then propose a model where the codegen operates in an
>> extra-safe protected context. Or pre-generate the most common
>> variants, have them pull their constants from memory instead of
>> immediates, and use that.
>
> I'll do no such nonsense.

You can avoid generating code beyond what gcc generates at all, or you can pre-generate code but not on an ongoing basis at runtime, or you can generate code at runtime correctly. I don't think there are many other options.

>
>> > If what you were saying was true, it would be an issue any time we
>> > mapped in new executable code for userspace - minor page faults would be
>> > stupidly slow.
>>
>> I literally mentioned this in the email.
>
> No, you didn't. Feel free to link or cite if you think otherwise.

"It's clear that a way to do this without
serializing must exist, because that's what happens when code is paged
in from a user program."

>> > text_poke() doesn't even send IPIs.
>>
>> text_poke() and the associated machinery is unbelievably complicated.
>
> It's not that bad.

This is a useless discussion.

>
> The only reference to IPIs in text_poke() is the comment that indicates
> that flush_tlb_mm_range() may sometimes do IPIs, but explicitly
> indicates that it does _not_ do IPIs the way text_poke() is using it.
>
>> Also, arch/x86/kernel/alternative.c contains:
>>
>> void text_poke_sync(void)
>> {
>> on_each_cpu(do_sync_core, NULL, 1);
>> }
>
> ...which is for modifying code that is currently being executed, not the
> text_poke() or text_poke_copy() paths.
>
>>
>> The magic in text_poke() was developed over the course of years, and
>> Intel architects were involved.
>>
>> (And I think some text_poke() stuff uses RCU, which is another way to
>> sync without IPI. I doubt the performance characteristics are
>> appropriate for bcachefs, but I could be wrong.)
>
> No, it doesn't use RCU.

It literally says in alternative.c:

* Not safe against concurrent execution; useful for JITs to dump
* new code blocks into unused regions of RX memory. Can be used in
* conjunction with synchronize_rcu_tasks() to wait for existing
* execution to quiesce after having made sure no existing functions
* pointers are live.

I don't know whether any callers actually do this. I didn't look.

>
>> > I think you've been misled about some things :)
>>
>> I wish.
>
> Given your comments on text_poke(), I think you are. You're confusing
> synchronization requirements for _self modifying_ code with the
> synchronization requirements for writing new code to memory, and then
> executing it.

No, you are misunderstanding the difference.

Version A:

User mmap()s an executable file (DSO, whatever). At first, there is either no PTE or a not-present PTE. At some point, in response to a page fault or just the kernel prefetching, the kernel fills in the backing page and then creates the PTE. From the CPU's perspective, the virtual address atomically transitions from having nothing there to having the final code there. It works (despite the manual having nothing to say about this case). It's also completely unavoidable.

Version B:

Kernel vmallocs some space *and populates the pagetables*. There is backing storage, that is executable (or it's a non-NX system, although those are quite rare these days).

Because the CPU hates you, it speculatively executes that code. (Maybe you're under attack. Maybe you're just unlucky. Doesn't matter.) It populates the instruction cache, remembers the decoded instructions, etc. It does all the things that make the manual say scary things about serialization. It notices that the speculative execution was wrong and backs it out, but nothing is invalidated.

Now you write code into there. Either you do this from a different CPU or you do it at a different linear address, so the magic hardware that invalidates for you does not trigger.

Now you jump into that code, and you tell yourself that it was new code because it was all zeros before and you never intentionally executed it. But the CPU could not care less what you think, and you lose.

>
> And given that bcachefs is not doing anything new here - we're doing a
> more limited form of what BPF is already doing - I don't think this is
> even the appropriate place for this discussion. There is a new
> executable memory allocator being developed and posted, which is
> expected to wrap text_poke() in an arch-independent way so that
> allocations can share pages, and so that we can remove the need to have
> pages mapped both writeable and executable.

I don't really care what BPF is doing, and BPF may well have the same problem.

But if I understood what bcachefs is doing, it's creating code vastly more frequently than BPF, in response to entirely unprivileged operations from usermode. It's a whole different amount of exposure.

>
> If you've got knowledge you wish to share on how to get cache coherency
> right, I think that might be a more appropriate thread.

I'll look.