Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK

From: Joe Mario
Date: Wed Apr 19 2023 - 12:01:22 EST




On 4/19/23 11:09 AM, Matthew Wilcox wrote:
> On Wed, Apr 19, 2023 at 11:07:04AM -0400, Waiman Long wrote:
>> On 4/18/23 23:46, Matthew Wilcox wrote:
>>> On Tue, Apr 18, 2023 at 09:16:37PM -0400, Waiman Long wrote:
>>>>  1) App runs creating lots of threads.
>>>>  2) It mmap's 256K pages of anonymous memory.
>>>>  3) It writes executable code to that memory.
>>>>  4) It calls mprotect() with PROT_EXEC on that memory so
>>>>     it can subsequently execute the code.
>>>>
>>>> The above mprotect() will fail if the mmap'd region's VMA gets merged with
>>>> the VMA for one of the thread stacks.  That's because the default RHEL
>>>> SELinux policy is to not allow executable stacks.
>>> By the way, this is a daft policy. The policy you really want is
>>> EXEC|WRITE is not allowed. A non-writable stack is useless, so it's
>>> actually a superset of your current policy. Forbidding _simultaneous_
>>> write and executable is just good programming. This way, you don't need
>>> to care about the underlying VMA's current permissions, you just need
>>> to do:
>>>
>>> if ((prot & (PROT_EXEC|PROT_WRITE)) == (PROT_EXEC|PROT_WRITE))
>>> 	return -EACCES;
>>
>> I am not totally sure if the application changes the VMA to read-only first.
>> Even if it does that, it highlights another possible issue when an anonymous
>> VMA is merged with a stack VMA. Either the mprotect() to write-protect the
>> VMA will fail or the application will segfault if it writes stuff to the
>> stack. This particular issue is not related to SELinux. It provides another
>> good reason to avoid merging a stack VMA with an anonymous VMA.
>
> mprotect will split the VMA into two VMAs, one that is
> PROT_READ|PROT_WRITE and one that is PROT_READ|PROT_EXEC.
>

But in this case, the latter still has PROT_WRITE.

This was reported by a large data analytics customer. They started getting infrequent random crashes in code they haven't touched in 10 years.

One of the threads in their program mmaps a large region using PROT_READ|PROT_WRITE, and that region just happens to be merged with the thread's stack.

Then they copy a small snippet of code to a location somewhere within that mapped region. For the one page that contains that code, they mprotect it to PROT_READ|PROT_WRITE|PROT_EXEC. I recall they're still reading and writing data elsewhere on that page.
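
For reference, here is a minimal sketch of that sequence as I understand it. It is not the customer's code. The function names are hypothetical, and the region size and page index are placeholders. It also cannot force the new mapping to merge with a thread stack VMA, since that depends on where the kernel places it. The first helper is just the simultaneous write+exec test from earlier in the thread.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: 1 if the request asks for write and exec together. */
static int wants_write_and_exec(int prot)
{
	return (prot & (PROT_EXEC | PROT_WRITE)) == (PROT_EXEC | PROT_WRITE);
}

/*
 * Hypothetical helper: map region_pages pages of anonymous RW memory, then
 * request RWX on the single page at page_index, as the application does for
 * the page holding its copied code.
 * Returns 0 on success, otherwise the errno of the failing call.
 */
static int map_then_make_one_page_rwx(size_t region_pages, size_t page_index)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *base;
	size_t len;
	int err = 0;

	if (page <= 0 || region_pages == 0 || page_index >= region_pages)
		return EINVAL;
	if (region_pages > SIZE_MAX / (size_t)page)
		return EINVAL;
	len = region_pages * (size_t)page;

	base = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return errno;

	/* The rest of the region stays RW; only this one page becomes RWX. */
	if (mprotect(base + page_index * (size_t)page, (size_t)page,
		     PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
		err = errno;

	munmap(base, len);
	return err;
}
```

If the region had been merged with a stack VMA and the policy denies executable stacks, I would expect the mprotect() step to be the call that returns non-zero.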

Joe