Re: [PATCH v16 0/5] Implement IOCTL to get and optionally clear info about PTEs

From: Muhammad Usama Anjum
Date: Wed Jun 07 2023 - 02:02:58 EST


On 5/30/23 7:07 PM, Muhammad Usama Anjum wrote:
> Hello Peter, David, Michal, Mike, Danylo and Andrei,
>
> I hope you are well. You (and some other) guys have been helping and
> reviewing up to this point. Thank you so much! Please review them again and
> mention if anything comes to your mind. Please send review tags or tested
> by tags in the hope that we merge these soon for next release. Current
> patches fulfill all of our requirement regarding ABI and performance. I
> guess, we should merge now (if somethings comes up, we can always fix them
> along the way).
Your ack/review/tested by would be much appreciated. You guys have
contributed while development. I believe we are very close. I really want
your comments to get this merged soon.

Latest revision (v17) for review:
https://lore.kernel.org/all/20230606060822.1065182-1-usama.anjum@xxxxxxxxxxxxx

>
> Any thoughts/comments are welcome. I'm sending this email as we have not
> had much reviews in the past 3-4 revisions.
>
> Thanks,
> Usama
>
> On 5/25/23 1:55 PM, Muhammad Usama Anjum wrote:
>> *Changes in v16*
>> - Fix a corner case
>> - Add exclusive PM_SCAN_OP_WP back
>>
>> *Changes in v15*
>> - Build fix (Add missed build fix in RESEND)
>>
>> *Changes in v14*
>> - Fix build error caused by #ifdef added at last minute in some configs
>>
>> *Changes in v13*
>> - Rebase on top of next-20230414
>> - Give-up on using uffd_wp_range() and write new helpers, flush tlb only
>> once
>>
>> *Changes in v12*
>> - Update and other memory types to UFFD_FEATURE_WP_ASYNC
>> - Rebaase on top of next-20230406
>> - Review updates
>>
>> *Changes in v11*
>> - Rebase on top of next-20230307
>> - Base patches on UFFD_FEATURE_WP_UNPOPULATED
>> - Do a lot of cosmetic changes and review updates
>> - Remove ENGAGE_WP + !GET operation as it can be performed with
>> UFFDIO_WRITEPROTECT
>>
>> *Changes in v10*
>> - Add specific condition to return error if hugetlb is used with wp
>> async
>> - Move changes in tools/include/uapi/linux/fs.h to separate patch
>> - Add documentation
>>
>> *Changes in v9:*
>> - Correct fault resolution for userfaultfd wp async
>> - Fix build warnings and errors which were happening on some configs
>> - Simplify pagemap ioctl's code
>>
>> *Changes in v8:*
>> - Update uffd async wp implementation
>> - Improve PAGEMAP_IOCTL implementation
>>
>> *Changes in v7:*
>> - Add uffd wp async
>> - Update the IOCTL to use uffd under the hood instead of soft-dirty
>> flags
>>
>> *Motivation*
>> The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows
>> GetWriteWatch() syscall [1]. The GetWriteWatch{} retrieves the addresses of
>> the pages that are written to in a region of virtual memory.
>>
>> This syscall is used in Windows applications and games etc. This syscall is
>> being emulated in pretty slow manner in userspace. Our purpose is to
>> enhance the kernel such that we translate it efficiently in a better way.
>> Currently some out of tree hack patches are being used to efficiently
>> emulate it in some kernels. We intend to replace those with these patches.
>> So the whole gaming on Linux can effectively get benefit from this. It
>> means there would be tons of users of this code.
>>
>> CRIU use case [2] was mentioned by Andrei and Danylo:
>>> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
>>> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
>>> shadow memory [4]. Being able to migrate such binaries allows to highly
>>> reduce the amount of work needed to identify and fix post-migration
>>> crashes, which happen constantly.
>>
>> Andrei's defines the following uses of this code:
>> * it is more granular and allows us to track changed pages more
>> effectively. The current interface can clear dirty bits for the entire
>> process only. In addition, reading info about pages is a separate
>> operation. It means we must freeze the process to read information
>> about all its pages, reset dirty bits, only then we can start dumping
>> pages. The information about pages becomes more and more outdated,
>> while we are processing pages. The new interface solves both these
>> downsides. First, it allows us to read pte bits and clear the
>> soft-dirty bit atomically. It means that CRIU will not need to freeze
>> processes to pre-dump their memory. Second, it clears soft-dirty bits
>> for a specified region of memory. It means CRIU will have actual info
>> about pages to the moment of dumping them.
>> * The new interface has to be much faster because basic page filtering
>> is happening in the kernel. With the old interface, we have to read
>> pagemap for each page.
>>
>> *Implementation Evolution (Short Summary)*
>> From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
>> feature can be used under the hood with some additions like:
>> * reset soft-dirty flag for only a specific region of memory instead of
>> clearing the flag for the entire process
>> * get and clear soft-dirty flag for a specific region atomically
>>
>> So we decided to use ioctl on pagemap file to read or/and reset soft-dirty
>> flag. But using soft-dirty flag, sometimes we get extra pages which weren't
>> even written. They had become soft-dirty because of VMA merging and
>> VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were
>> able to by-pass this short coming by ignoring VM_SOFTDIRTY until David
>> reported that mprotect etc messes up the soft-dirty flag while ignoring
>> VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We
>> discussed if we can revert these patches. But we could not reach to any
>> conclusion. So at this point, I made couple of tries to solve this whole
>> VM_SOFTDIRTY issue by correcting the soft-dirty implementation:
>> * [7] Correct the bug fixed wrongly back in 2014. It had potential to cause
>> regression. We left it behind.
>> * [8] Keep a list of soft-dirty part of a VMA across splits and merges. I
>> got the reply don't increase the size of the VMA by 8 bytes.
>>
>> At this point, we left soft-dirty considering it is too much delicate and
>> userfaultfd [9] seemed like the only way forward. From there onward, we
>> have been basing soft-dirty emulation on userfaultfd wp feature where
>> kernel resolves the faults itself when WP_ASYNC feature is used. It was
>> straight forward to add WP_ASYNC feature in userfautlfd. Now we get only
>> those pages dirty or written-to which are really written in reality. (PS
>> There is another WP_UNPOPULATED userfautfd feature is required which is
>> needed to avoid pre-faulting memory before write-protecting [9].)
>>
>> All the different masks were added on the request of CRIU devs to create
>> interface more generic and better.
>>
>> [1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-getwritewatch
>> [2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@xxxxxxxxxx
>> [3] https://github.com/google/sanitizers
>> [4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
>> [5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@xxxxxxxxxx
>> [6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@xxxxxxxxxx/
>> [7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@xxxxxxxxxxxxx
>> [8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@xxxxxxxxxxxxx
>> [9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@xxxxxxxxxx
>> [10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@xxxxxxxxxx
>>
>> * Original Cover letter from v8*
>> Hello,
>>
>> Note:
>> Soft-dirty pages and pages which have been written-to are synonyms. As
>> kernel already has soft-dirty feature inside which we have given up to
>> use, we are using written-to terminology while using UFFD async WP under
>> the hood.
>>
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are
>> supported in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>> (PAGE_IS_SWAPPED).
>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>> pages have been written-to.
>> - Find pages which have been written-to and write protect the pages
>> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>>
>> It is possible to find and clear soft-dirty pages entirely in userspace.
>> But it isn't efficient:
>> - The mprotect and SIGSEGV handler for bookkeeping
>> - The userfaultfd wp (synchronous) with the handler for bookkeeping
>>
>> Some benchmarks can be seen here[1]. This series adds features that weren't
>> present earlier:
>> - There is no atomic get soft-dirty/Written-to status and clear present in
>> the kernel.
>> - The pages which have been written-to can not be found in accurate way.
>> (Kernel's soft-dirty PTE bit + sof_dirty VMA bit shows more soft-dirty
>> pages than there actually are.)
>>
>> Historically, soft-dirty PTE bit tracking has been used in the CRIU
>> project. The procfs interface is enough for finding the soft-dirty bit
>> status and clearing the soft-dirty bit of all the pages of a process.
>> We have the use case where we need to track the soft-dirty PTE bit for
>> only specific pages on-demand. We need this tracking and clear mechanism
>> of a region of memory while the process is running to emulate the
>> getWriteWatch() syscall of Windows.
>>
>> *(Moved to using UFFD instead of soft-dirtyi feature to find pages which
>> have been written-to from v7 patch series)*:
>> Stop using the soft-dirty flags for finding which pages have been
>> written to. It is too delicate and wrong as it shows more soft-dirty
>> pages than the actual soft-dirty pages. There is no interest in
>> correcting it [2][3] as this is how the feature was written years ago.
>> It shouldn't be updated to changed behaviour. Peter Xu has suggested
>> using the async version of the UFFD WP [4] as it is based inherently
>> on the PTEs.
>>
>> So in this patch series, I've added a new mode to the UFFD which is
>> asynchronous version of the write protect. When this variant of the
>> UFFD WP is used, the page faults are resolved automatically by the
>> kernel. The pages which have been written-to can be found by reading
>> pagemap file (!PM_UFFD_WP). This feature can be used successfully to
>> find which pages have been written to from the time the pages were
>> write protected. This works just like the soft-dirty flag without
>> showing any extra pages which aren't soft-dirty in reality.
>>
>> The information related to pages if the page is file mapped, present and
>> swapped is required for the CRIU project [5][6]. The addition of the
>> required mask, any mask, excluded mask and return masks are also required
>> for the CRIU project [5].
>>
>> The IOCTL returns the addresses of the pages which match the specific
>> masks. The page addresses are returned in struct page_region in a compact
>> form. The max_pages is needed to support a use case where user only wants
>> to get a specific number of pages. So there is no need to find all the
>> pages of interest in the range when max_pages is specified. The IOCTL
>> returns when the maximum number of the pages are found. The max_pages is
>> optional. If max_pages is specified, it must be equal or greater than the
>> vec_size. This restriction is needed to handle worse case when one
>> page_region only contains info of one page and it cannot be compacted.
>> This is needed to emulate the Windows getWriteWatch() syscall.
>>
>> The patch series include the detailed selftest which can be used as an
>> example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
>> interface usages as well.
>>
>> [1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@xxxxxxxxxxxxx/
>> [2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@xxxxxxxxxxxxx
>> [3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@xxxxxxxxxxxxx
>> [4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
>> [5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@xxxxxxxxx/
>> [6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@xxxxxxxxxx/
>>
>> Regards,
>> Muhammad Usama Anjum
>>
>> Muhammad Usama Anjum (4):
>> fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
>> about PTEs
>> tools headers UAPI: Update linux/fs.h with the kernel sources
>> mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
>> selftests: mm: add pagemap ioctl tests
>>
>> Peter Xu (1):
>> userfaultfd: UFFD_FEATURE_WP_ASYNC
>>
>> Documentation/admin-guide/mm/pagemap.rst | 58 +
>> Documentation/admin-guide/mm/userfaultfd.rst | 35 +
>> fs/proc/task_mmu.c | 503 ++++++
>> fs/userfaultfd.c | 26 +-
>> include/linux/userfaultfd_k.h | 21 +-
>> include/uapi/linux/fs.h | 53 +
>> include/uapi/linux/userfaultfd.h | 9 +-
>> mm/hugetlb.c | 32 +-
>> mm/memory.c | 27 +-
>> tools/include/uapi/linux/fs.h | 53 +
>> tools/testing/selftests/mm/.gitignore | 1 +
>> tools/testing/selftests/mm/Makefile | 3 +-
>> tools/testing/selftests/mm/config | 1 +
>> tools/testing/selftests/mm/pagemap_ioctl.c | 1459 ++++++++++++++++++
>> tools/testing/selftests/mm/run_vmtests.sh | 4 +
>> 15 files changed, 2262 insertions(+), 23 deletions(-)
>> create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
>> mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
>>
>

--
BR,
Muhammad Usama Anjum