Re: [PATCH v2 1/2] mm/userfaultfd: Support WP on multiple VMAs

From: Muhammad Usama Anjum
Date: Thu Feb 16 2023 - 01:26:06 EST


On 2/16/23 2:45 AM, Peter Xu wrote:
> On Wed, Feb 15, 2023 at 12:08:11PM +0500, Muhammad Usama Anjum wrote:
>> Hi Peter,
>>
>> Thank you for your review!
>>
>> On 2/15/23 2:50 AM, Peter Xu wrote:
>>> On Tue, Feb 14, 2023 at 01:49:50PM +0500, Muhammad Usama Anjum wrote:
>>>> On 2/14/23 2:11 AM, Peter Xu wrote:
>>>>> On Mon, Feb 13, 2023 at 10:50:39PM +0500, Muhammad Usama Anjum wrote:
>>>>>> On 2/13/23 9:54 PM, Peter Xu wrote:
>>>>>>> On Mon, Feb 13, 2023 at 09:31:23PM +0500, Muhammad Usama Anjum wrote:
>>>>>>>> mwriteprotect_range() errors out if [start, end) doesn't fall in one
>>>>>>>> VMA. We are facing a use case where multiple VMAs are present in one
>>>>>>>> range of interest. For example, the following pseudocode reproduces the
>>>>>>>> error which we are trying to fix:
>>>>>>>>
>>>>>>>> - Allocate memory of size 16 pages with PROT_NONE with mmap
>>>>>>>> - Register userfaultfd
>>>>>>>> - Change protection of the first half (1 to 8 pages) of memory to
>>>>>>>> PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
>>>>>>>> - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
>>>>>>>> out.
>>>>>>>>
>>>>>>>> This is a simple use case where user may or may not know if the memory
>>>>>>>> area has been divided into multiple VMAs.
>>>>>>>>
>>>>>>>> Reported-by: Paul Gofman <pgofman@xxxxxxxxxxxxxxx>
>>>>>>>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
>>>>>>>> ---
>>>>>>>> Changes since v1:
>>>>>>>> - Correct the start and ending values passed to uffd_wp_range()
>>>>>>>> ---
>>>>>>>> mm/userfaultfd.c | 38 ++++++++++++++++++++++----------------
>>>>>>>> 1 file changed, 22 insertions(+), 16 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>>>>>>>> index 65ad172add27..bccea08005a8 100644
>>>>>>>> --- a/mm/userfaultfd.c
>>>>>>>> +++ b/mm/userfaultfd.c
>>>>>>>> @@ -738,9 +738,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>>>>>>>> unsigned long len, bool enable_wp,
>>>>>>>> atomic_t *mmap_changing)
>>>>>>>> {
>>>>>>>> + unsigned long end = start + len;
>>>>>>>> + unsigned long _start, _end;
>>>>>>>> struct vm_area_struct *dst_vma;
>>>>>>>> unsigned long page_mask;
>>>>>>>> int err;
>>>>>>>
>>>>>>> I think this needs to be initialized or it can return anything when range
>>>>>>> not mapped.
>>>>>> It is being initialized to -EAGAIN already. It is not visible in this patch.
>>>>>
>>>>> I see, though -EAGAIN doesn't look suitable at all. The old retcode for
>>>>> !vma case is -ENOENT, so I think we'd better keep using it if we want to
>>>>> have this patch.
>>>> I'll update in next version.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> + VMA_ITERATOR(vmi, dst_mm, start);
>>>>>>>>
>>>>>>>> /*
>>>>>>>> * Sanitize the command parameters:
>>>>>>>> @@ -762,26 +765,29 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>>>>>>>> if (mmap_changing && atomic_read(mmap_changing))
>>>>>>>> goto out_unlock;
>>>>>>>>
>>>>>>>> - err = -ENOENT;
>>>>>>>> - dst_vma = find_dst_vma(dst_mm, start, len);
>>>>>>>> + for_each_vma_range(vmi, dst_vma, end) {
>>>>>>>> + err = -ENOENT;
>>>>>>>>
>>>>>>>> - if (!dst_vma)
>>>>>>>> - goto out_unlock;
>>>>>>>> - if (!userfaultfd_wp(dst_vma))
>>>>>>>> - goto out_unlock;
>>>>>>>> - if (!vma_can_userfault(dst_vma, dst_vma->vm_flags))
>>>>>>>> - goto out_unlock;
>>>>>>>> + if (!dst_vma->vm_userfaultfd_ctx.ctx)
>>>>>>>> + break;
>>>>>>>> + if (!userfaultfd_wp(dst_vma))
>>>>>>>> + break;
>>>>>>>> + if (!vma_can_userfault(dst_vma, dst_vma->vm_flags))
>>>>>>>> + break;
>>>>>>>>
>>>>>>>> - if (is_vm_hugetlb_page(dst_vma)) {
>>>>>>>> - err = -EINVAL;
>>>>>>>> - page_mask = vma_kernel_pagesize(dst_vma) - 1;
>>>>>>>> - if ((start & page_mask) || (len & page_mask))
>>>>>>>> - goto out_unlock;
>>>>>>>> - }
>>>>>>>> + if (is_vm_hugetlb_page(dst_vma)) {
>>>>>>>> + err = -EINVAL;
>>>>>>>> + page_mask = vma_kernel_pagesize(dst_vma) - 1;
>>>>>>>> + if ((start & page_mask) || (len & page_mask))
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>>
>>>>>>>> - uffd_wp_range(dst_mm, dst_vma, start, len, enable_wp);
>>>>>>>> + _start = (dst_vma->vm_start > start) ? dst_vma->vm_start : start;
>>>>>>>> + _end = (dst_vma->vm_end < end) ? dst_vma->vm_end : end;
>>>>>>>>
>>>>>>>> - err = 0;
>>>>>>>> + uffd_wp_range(dst_mm, dst_vma, _start, _end - _start, enable_wp);
>>>>>>>> + err = 0;
>>>>>>>> + }
>>>>>>>> out_unlock:
>>>>>>>> mmap_read_unlock(dst_mm);
>>>>>>>> return err;
>>>>>>>
>>>>>>> This whole patch also changes the abi, so I'm worried whether there can be
>>>>>>> app that relies on the existing behavior.
>>>>>> Even if a app is dependent on it, this change would just don't return error
>>>>>> if there are multiple VMAs under the hood and handle them correctly. Most
>>>>>> apps wouldn't care about VMAs anyways. I don't know if there would be any
>>>>>> drastic behavior change, other than the behavior becoming nicer.
>>>>>
>>>>> So this logic existed since the initial version of uffd-wp. It has a good
>>>>> thing that it strictly checks everything and it makes sense since uffd-wp
>>>>> is per-vma attribute. In short, the old code fails clearly.
>>>>>
>>>>> While the new proposal is not: if -ENOENT we really have no idea what
>>>>> happened at all; some ranges can be wr-protected but we don't know where
>>>>> starts to go wrong.
>>>> The return error codes can be made to return in better way somewhat. The
>>>> return error codes shouldn't block a correct functionality enhancement patch.
>>>>
>>>>>
>>>>> Now I'm looking at the original problem..
>>>>>
>>>>> - Allocate memory of size 16 pages with PROT_NONE with mmap
>>>>> - Register userfaultfd
>>>>> - Change protection of the first half (1 to 8 pages) of memory to
>>>>> PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
>>>>> - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
>>>>> out.
>>>>>
>>>>> Why the user app should wr-protect 16 pages at all?
>>>> Taking arguments from Paul here.
>>>>
>>>> The app is free to insert guard pages inside the range (with PROT_NONE) and
>>>> change the protection of memory freely. Not sure why it is needed to
>>>> complicate things by denying any flexibility. We should never restrict what
>>>> is possible and what not. All of these different access attributes and
>>>> their any combination of interaction _must_ work without question. The
>>>> application should be free to change protection on any sub-range and it
>>>> shouldn't break the PAGE_IS_WRITTEN + UFFD_WRITE_PROTECT promise which
>>>> PAGEMAP_IOCTL (patches are in progress) and UFFD makes.
>>>
>>> Because uffd-wp has a limitation on e.g. it cannot nest so far. I'm fine
>>> with allowing mprotect() happening, but just to mention so far it cannot do
>>> "any combinations" yet.
>>>
>>>>
>>>>>
>>>>> If so, uffd_wp_range() will be ran upon a PROT_NONE range which doesn't
>>>>> make sense at all, no matter whether the user is aware of vma concept or
>>>>> not... because it's destined that it's a vain effort.
>>>> It is not a vain effort. The user want to watch/find the dirty pages of a
>>>> range while working with it: reserve and watch at once while Write
>>>> protecting or un-protecting as needed. There may be several different use
>>>> cases. Inserting guard pages to catch out of range access, map something
>>>> only when it is needed; unmap or PROT_NONE pages when they are set free in
>>>> the app etc.
>>>
>>> Fair enough.
>>>
>>>>
>>>>>
>>>>> So IMHO it's the user app needs fixing here, not the interface? I think
>>>>> it's the matter of whether the monitor is aware of mprotect() being
>>>>> invoked.
>>>> No. The common practice is to allocate a big memory chunk at once and have
>>>> own allocator over it (whether it is some specific allocator in a game or a
>>>> .net allocator with garbage collector). From the usage point of view it is
>>>> very limiting to demand constant memory attributes for the whole range.
>>>>
>>>> That said, if we do have the way to do exactly what we want with reset
>>>> through pagemap fd and it is just UFFD ioctl will be working differently,
>>>> it is not a blocker of course, just weird api design.
>>>
>>> Do you mean you'll disable ENGAGE_WP && !GET in your other series? Yes, if
>>> this will service your goal, it'll be perfect to remove that interface.
>> No, we cannot remove it.
>
> If this patch can land, I assume ioctl(UFFDIO_WP) can start to service the
> dirty tracking purpose, then why do you still need "ENGAGE_WP && !GET"?
We don't need it. We need the following operations only:
1 GET
2 ENGAGE_WP + GET
When we have these two operations, we had added the following as well:
3 ENGAGE_WP + !GET or only ENGAGE_WP
This (3) can be removed from ioctl(PAGEMAP_IOCTL) if reviewers ask. I can
remove it in favour of this ioctl(UFFDIO_WP) patch for sure.

>
> Note, I'm not asking to drop ENGAGE_WP entirely, only when !GET.
>
>>
>>>
>>>>
>>>>>
>>>>> In short, I hope we're working on things that helps at least someone, and
>>>>> we should avoid working on things that does not have clear benefit yet.
>>>>> With the WP_ENGAGE new interface being proposed, I just didn't see any
>>>>> benefit of changing the current interface, especially if the change can
>>>>> bring uncertainties itself (e.g., should we fail upon !uffd-wp vmas, or
>>>>> should we skip?).
>>>> We can work on solving uncertainties in case of error conditions. Fail if
>>>> !uffd-wp vma comes.
>>>
>>> Let me try to double check with you here:
>>>
>>> I assume you want to skip any vma that is not mapped at all, as the loop
>>> already does so. So it'll succeed if there're memory holes.
>>>
>>> You also want to explicitly fail if some vma is not registered with uffd-wp
>>> when walking the vma list, am I right? IOW, the tracee _won't_ ever have a
>>> chance to unregister uffd-wp itself, right?
>> Yes, fail if any VMA doesn't have uffd-wp. This fail means the
>> write-protection or un-protection failed on a region of memory with error
>> -ENOENT. This is already happening in this current patch. The unregister
>> code would remain same. The register and unregister ioctls are already
>> going over all the VMAs in a range. I'm not rigid on anything. Let me
>> define the interface below.
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Is this for the new pagemap effort? Can this just be done in the new
>>>>>>> interface rather than changing the old?
>>>>>> We found this bug while working on pagemap patches. It is already being
>>>>>> handled in the new interface. We just thought that this use case can happen
>>>>>> pretty easily and unknowingly. So the support should be added.
>>>>>
>>>>> Thanks. My understanding is that it would have been reported if it
>>>>> affected any existing uffd-wp user.
>>>> I would consider the UFFD WP a recent functionality and it may not being
>>>> used in wide range of app scenarios.
>>>
>>> Yes I think so.
>>>
>>> Existing users should in most cases be applying the ioctl upon valid vmas
>>> somehow. I think the chance is low that someone relies on the errcode to
>>> make other decisions, but I just cannot really tell because the user app
>>> can do many weird things.
>> Correct. The user can use any combination of operation
>> (mmap/mprotect/uffd). They must work in harmony.
>
> No uffd - that's exactly what I'm saying: mprotect is fine here, but uffd
> is probably not, not in a nested way. When you try to UFFDIO_REGISTER upon
> some range that already been registered (by the tracee), it'll fail for you
> immediately:
>
> /*
> * Check that this vma isn't already owned by a
> * different userfaultfd. We can't allow more than one
> * userfaultfd to own a single vma simultaneously or we
> * wouldn't know which one to deliver the userfaults to.
> */
> ret = -EBUSY;
> if (cur->vm_userfaultfd_ctx.ctx &&
> cur->vm_userfaultfd_ctx.ctx != ctx)
> goto out_unlock;
>
> So if this won't work for you, then AFAICT uffd-wp won't work for you (just
> like soft-dirty but in another way, sorry), at least not until someone
> starts to work on the nested.
I was referring to a case where user registers the WP on multiple VMAs and
all the VMAs haven't been registered before. It would work. Right?

>
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Also mwriteprotect_range() gives a pretty straight forward way to WP or
>>>>>> un-WP a range. Async WP can be used in coordination with pagemap file
>>>>>> (PM_UFFD_WP flag in PTE) as well. There may be use cases for it. On another
>>>>>> note, I don't see any use cases of WP async and PM_UFFD_WP flag as
>>>>>> !PM_UFFD_WP flag doesn't give direct information if the page is written for
>>>>>> !present pages.
>>>>>
>>>>> Currently we do maintain PM_UFFD_WP even for swap entries, so if it was
>>>>> written then I think we'll know even if the page was swapped out:
>>>>>
>>>>> } else if (is_swap_pte(pte)) {
>>>>> if (pte_swp_uffd_wp(pte))
>>>>> flags |= PM_UFFD_WP;
>>>>> if (pte_marker_entry_uffd_wp(entry))
>>>>> flags |= PM_UFFD_WP;
>>>>>
>>>>> So it's working?
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Side note: in your other pagemap series, you can optimize "WP_ENGAGE &&
>>>>>>> !GET" to not do generic pgtable walk at all, but use what it does in this
>>>>>>> patch for the initial round or wr-protect.
>>>>>> Yeah, it is implemented with some optimizations.
>>>>>
>>>>> IIUC in your latest public version is not optimized, but I can check the
>>>>> new version when it comes.
>>>> I've some optimizations in next version keeping the code lines minimum. The
>>>> best optimization would be to add a page walk dedicated for this engaging
>>>> write-protect. I don't want to do that. Lets try to improve this patch in
>>>> how ever way possible. So that WP from UFFD ioctl can be used.
>>>
>>> If you want to do this there, I think it can be simply/mostly a copy-paste
>>> of this patch over there by looping over the vmas and apply the protections.
>> I wouldn't want to do this. PAGEMAP IOCTL performing an operation
>> anonymously to UFFD_WP wouldn't look good when we can improve UFFD_WP itself.
>>
>>>
>>> I think it's fine to do it with ioctl(UFFDIO_WP), but I want to be crystal
>>> clear on what interface you're looking for before changing it, and let's
>>> define it properly.
>> Thank you. Let me crystal clear why we have sent this patch and what
>> difference we want.
>>
>> Just like UFFDIO_REGISTER and UFFDIO_UNREGISTER don't care if the requested
>> range has multiple VMAs in the memory region, we want the same thing with
>> UFFDIO_WRITEPROTCET. It looks more uniform and obvious to us as user
>> doesn't care about VMAs. The user only cares about the memory he wants to
>> write-protect. So just update the inside code of UFFDIO_WRITEPROTECT such
>> that it starts to act like UFFDIO_REGISTER/UFFDIO_UNREGISTER. It shouldn't
>> complain if there are multiple VMAs involved under the hood.
>>
>> This patch is visiting all the VMAs in the memory range. The attached
>> patches are my wip v3. If you feel like there can be better way to achieve
>> this, please don't hesitate to send the v3 yourself.
>
> As I said I think you have a point, and let's cross finger this is fine
> (which I mostly agree with).
Thank you so much! I'm sending the v3 in a while.

>
> We can fail on !uffd-wp vmas, sounds reasonable to me. But let's sync up
> with above to make sure this works for you.
>
> Since you've got a patch here, let me comment directly below.
>
>>
>> Thank you so much!
>>
>> --
>> BR,
>> Muhammad Usama Anjum
>
>> From f69069dddda206b190706eef7b2dad3a67a6df10 Mon Sep 17 00:00:00 2001
>> From: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
>> Date: Thu, 9 Feb 2023 16:13:23 +0500
>> Subject: [PATCH v3 1/2] mm/userfaultfd: Support WP on multiple VMAs
>> To: peterx@xxxxxxxxxx, david@xxxxxxxxxx
>> Cc: usama.anjum@xxxxxxxxxxxxx, kernel@xxxxxxxxxxxxx
>>
>> mwriteprotect_range() errors out if [start, end) doesn't fall in one
>> VMA. We are facing a use case where multiple VMAs are present in one
>> range of interest. For example, the following pseudocode reproduces the
>> error which we are trying to fix:
>>
>> - Allocate memory of size 16 pages with PROT_NONE with mmap
>> - Register userfaultfd
>> - Change protection of the first half (1 to 8 pages) of memory to
>> PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
>> - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
>> out.
>>
>> This is a simple use case where user may or may not know if the memory
>> area has been divided into multiple VMAs.
>>
>> Reported-by: Paul Gofman <pgofman@xxxxxxxxxxxxxxx>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
>> ---
>> Changes since v2:
>> - Correct the return error code
>>
>> Changes since v1:
>> - Correct the start and ending values passed to uffd_wp_range()
>> ---
>> mm/userfaultfd.c | 45 ++++++++++++++++++++++++++-------------------
>> 1 file changed, 26 insertions(+), 19 deletions(-)
>>
>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>> index 65ad172add27..a3b48a99b942 100644
>> --- a/mm/userfaultfd.c
>> +++ b/mm/userfaultfd.c
>> @@ -738,9 +738,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>> unsigned long len, bool enable_wp,
>> atomic_t *mmap_changing)
>> {
>> + unsigned long end = start + len;
>> + unsigned long _start, _end;
>> struct vm_area_struct *dst_vma;
>> unsigned long page_mask;
>> - int err;
>> + int err = -ENOENT;
>> + VMA_ITERATOR(vmi, dst_mm, start);
>>
>> /*
>> * Sanitize the command parameters:
>> @@ -758,30 +761,34 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>> * operation (e.g. mremap) running in parallel, bail out and
>> * request the user to retry later
>> */
>> - err = -EAGAIN;
>> - if (mmap_changing && atomic_read(mmap_changing))
>> + if (mmap_changing && atomic_read(mmap_changing)) {
>> + err = -EAGAIN;
>> goto out_unlock;
>> + }
>
> Let's keep the original code and simply set -ENOENT afterwards.
Will update in v3.

>
>>
>> - err = -ENOENT;
>> - dst_vma = find_dst_vma(dst_mm, start, len);
>> + for_each_vma_range(vmi, dst_vma, end) {
>> + err = -ENOENT;
>>
>> - if (!dst_vma)
>> - goto out_unlock;
>> - if (!userfaultfd_wp(dst_vma))
>> - goto out_unlock;
>> - if (!vma_can_userfault(dst_vma, dst_vma->vm_flags))
>> - goto out_unlock;
>> + if (!dst_vma->vm_userfaultfd_ctx.ctx)
>> + break;
>
> What is this check for?
I'd copied it from find_dst_vma(). Will update in v3.

>
>> + if (!userfaultfd_wp(dst_vma))
>> + break;
>> + if (!vma_can_userfault(dst_vma, dst_vma->vm_flags))
>> + break;
>
> I think this is not useful at all (even in the old code)? It could have
> sneaked in somehow when I took the code over from Shaohua/Andrea. Maybe we
> should clean it up since at it.
Will update in v3.

>
>>
>> - if (is_vm_hugetlb_page(dst_vma)) {
>> - err = -EINVAL;
>> - page_mask = vma_kernel_pagesize(dst_vma) - 1;
>> - if ((start & page_mask) || (len & page_mask))
>> - goto out_unlock;
>> - }
>> + if (is_vm_hugetlb_page(dst_vma)) {
>> + err = -EINVAL;
>> + page_mask = vma_kernel_pagesize(dst_vma) - 1;
>> + if ((start & page_mask) || (len & page_mask))
>> + break;
>> + }
>>
>> - uffd_wp_range(dst_mm, dst_vma, start, len, enable_wp);
>> + _start = (dst_vma->vm_start > start) ? dst_vma->vm_start : start;
>> + _end = (dst_vma->vm_end < end) ? dst_vma->vm_end : end;
>
> Maybe:
>
> _start = MAX(dst_vma->vm_start, start);
> _end = MIN(dst_vma->vm_end, end);
>
> ?
Very clever. Thanks. Will update in v3.

>
>>
>> - err = 0;
>> + uffd_wp_range(dst_mm, dst_vma, _start, _end - _start, enable_wp);
>> + err = 0;
>> + }
>> out_unlock:
>> mmap_read_unlock(dst_mm);
>> return err;
>> --
>> 2.39.1
>>
>
>> From ac119b22aa00248ed67c7ac6e285a12391943b15 Mon Sep 17 00:00:00 2001
>> From: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
>> Date: Mon, 13 Feb 2023 21:15:01 +0500
>> Subject: [PATCH v3 2/2] mm/userfaultfd: add VM_WARN_ONCE()
>> To: peterx@xxxxxxxxxx, david@xxxxxxxxxx
>> Cc: usama.anjum@xxxxxxxxxxxxx, kernel@xxxxxxxxxxxxx
>>
>> Add VM_WARN_ONCE() to uffd_wp_range() to detect range (start, len) abuse.
>>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
>> ---
>> mm/userfaultfd.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>> index a3b48a99b942..0536e23ba5f4 100644
>> --- a/mm/userfaultfd.c
>> +++ b/mm/userfaultfd.c
>> @@ -716,6 +716,8 @@ void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
>> unsigned int mm_cp_flags;
>> struct mmu_gather tlb;
>>
>> + VM_WARN_ONCE(start < dst_vma->vm_start || start + len > dst_vma->vm_end,
>> + "The address range exceeds VMA boundary.\n");
>> if (enable_wp)
>> mm_cp_flags = MM_CP_UFFD_WP;
>> else
>> --
>> 2.39.1
>>
>
>

--
BR,
Muhammad Usama Anjum