Re: [PATCH v28 2/6] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

From: Andrei Vagin
Date: Thu Aug 10 2023 - 15:07:30 EST


On Tue, Aug 8, 2023 at 11:16 PM Muhammad Usama Anjum
<usama.anjum@xxxxxxxxxxxxx> wrote:
>
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have Async Write-Protection enabled
> (``PAGE_IS_WPALLOWED``), have been written to (``PAGE_IS_WRITTEN``), file
> mapped (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``), swapped
> (``PAGE_IS_SWAPPED``) or page has pfn zero (``PAGE_IS_PFNZERO``).
> - Find pages which have been written to and/or write protect
> (atomic ``PM_SCAN_WP_MATCHING + PM_SCAN_CHECK_WPASYNC``) the pages
> atomically. The (``PM_SCAN_WP_MATCHING``) is used to WP the matched
> pages. The (``PM_SCAN_CHECK_WPASYNC``) aborts the operation if
> non-Async-Write-Protected pages are found. Get is automatically performed
> if output buffer is specified.
>
> This IOCTL can be extended to get information about more PTE bits. The
> entire address range passed by user [start, end) is scanned until either
> the user provided buffer is full or max_pages have been found.
>
> Reviewed-by: Andrei Vagin <avagin@xxxxxxxxx>
> Reviewed-by: Michał Mirosław <mirq-linux@xxxxxxxxxxxx>
> Signed-off-by: Michał Mirosław <mirq-linux@xxxxxxxxxxxx>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
> ---
> Changes in v28:
> - Fix walk_end one last time after doing through testing
>
> Changes in v27:
> - Add PAGE_IS_HUGE
> - Iterate until temporary buffer is full to do less iterations
> - Don't check if PAGE_IS_FILE if no mask needs it as it is very
> expensive to check per pte
> - bring is_interesting_page() outside pagemap_scan_output() to remove
> the horrible return value check
> - Replace memcpy() with direct copy
> - rename end_addr to walk_end_addr in pagemap_scan_private
> - Abort walk if fatal_signal_pending()
>
> Changes in v26:
> Changes made by Usama:
> - Fix the wrong breaking of loop if page isn't interesting, skip intsead
> - Untag the address and save them into struct
> - Round off the end address to next page
> - Correct the partial hugetlb page handling and returning the error
> - Rename PAGE_IS_WPASYNC to PAGE_IS_WPALLOWED
> - Return walk ending address in walk_end instead of returning in start
> as there is potential of replacing the memory tag
>
> Changes by Michał:
> 1. the API:
> a. return ranges as {begin, end} instead of {begin + len};
> b. rename match "flags" to 'page categories' everywhere - this makes
> it easier to differentiate the ioctl()s categorisation of pages
> from struct page flags;
> c. change {required + excluded} to {inverted + required}. This was
> rejected before, but I'd like to illustrate the difference.
> Old interface can be translated to the new by:
> categories_inverted = excluded_mask
> categories_mask = required_mask | excluded_mask
> categories_anyof_mask = anyof_mask
> The new way allows filtering by: A & (B | !C)
> categories_inverted = C
> categories_mask = A
> categories_anyof_mask = B | C
> e. allow no-op calls
> 2. the implementation:
> a. gather the page-categorising and write-protecting code in one place;
> b. optimization: add whole-vma skipping for WP usecase;
> c. extracted output limiting code to pagemap_scan_output();
> d. extracted range coalescing to pagemap_scan_push_range();
> e. extracted THP entry handling to pagemap_scan_thp_entry();
> f. added a shortcut for non-WP hugetlb scan; avoids conditional
> locking;
> g. extracted scan buffer handling code out of do_pagemap_scan();
> h. rework output code to always try to write pending ranges; if EFAULT
> is generated it always overwrites the original error code;
> (the case of SIGKILL is needlessly trying to write the output
> now, but this should be rare case and ignoring it makes the code
> not needing a goto)
> 3.Change no-GET operation condition from `arg.return_mask == 0` to
> `arg.vec == NULL`. This will allow issuing the ioctl with
> return_mask == 0 to gather matching ranges when the exact category
> is not interesting. (Anticipated for CRIU scanning a large sparse
> anonymous mapping).
>
> Changes in v25:
> - Do proper filtering on hole as well (hole got missed earlier)
>
> Changes in v24:
> - Place WP markers in case of hole as well
>
> Changes in v23:
> - Set vec_buf_index to 0 only when vec_buf_index is set
> - Return -EFAULT instead of -EINVAL if vec is NULL
> - Correctly return the walk ending address to the page granularity
>
> Changes in v22:
> - Interface change to return walk ending address to user:
> - Replace [start start + len) with [start, end)
> - Return the ending address of the address walk in start
>
> Changes in v21:
> - Abort walk instead of returning error if WP is to be performed on
> partial hugetlb
> - Changed the data types of some variables in pagemap_scan_private to
> long
>
> Changes in v20:
> - Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
>
> Changes in v19:
> - Interface changes such as renaming, return mask and WP can be used
> with any flags specified in masks
> - Internal code changes
>
> Changes in v18:
> - Rebased on top of next-20230613
> - ptep_get() updates
> - remove pmd_trans_unstable() and add ACTION_AGAIN
> - Review updates (Micheal)
>
> Changes in v17:
> - Rebased on next-20230606
> - Made make_uffd_wp_*_pte() better and minor changes
>
> Changes in v16:
> - Fixed a corner case where kernel writes beyond user buffer by one
> element
> - Bring back exclusive PM_SCAN_OP_WP
> - Cosmetic changes
>
> Changes in v15:
> - Build fix:
> - Use generic tlb flush function in pagemap_scan_pmd_entry() instead of
> using x86 specific flush function in do_pagemap_scan()
> - Remove #ifdef from pagemap_scan_hugetlb_entry()
> - Use mm instead of undefined vma->vm_mm
>
> Changes in v14:
> - Fix build error caused by #ifdef added at last minute in some configs
>
> Changes in v13:
> - Review updates
> - mmap_read_lock_killable() instead of mmap_read_lock()
> - Replace uffd_wp_range() with helpers which increases performance
> drastically for OP_WP operations by reducing the number of tlb
> flushing etc
> - Add MMU_NOTIFY_PROTECTION_VMA notification for the memory range
>
> Changes in v12:
> - Add hugetlb support to cover all memory types
> - Merge "userfaultfd: Define dummy uffd_wp_range()" with this patch
> - Review updates to the code
>
> Changes in v11:
> - Find written pages in a better way
> - Fix a corner case (thanks Paul)
> - Improve the code/comments
> - remove ENGAGE_WP + ! GET operation
> - shorten the commit message in favour of moving documentation to
> pagemap.rst
>
> Changes in v10:
> - move changes in tools/include/uapi/linux/fs.h to separate patch
> - update commit message
>
> Change in v8:
> - Correct is_pte_uffd_wp()
> - Improve readability and error checks
> - Remove some un-needed code
>
> Changes in v7:
> - Rebase on top of latest next
> - Fix some corner cases
> - Base soft-dirty on the uffd wp async
> - Update the terminologies
> - Optimize the memory usage inside the ioctl
> ---
> fs/proc/task_mmu.c | 678 ++++++++++++++++++++++++++++++++++++++++
> include/linux/hugetlb.h | 1 +
> include/uapi/linux/fs.h | 59 ++++
> mm/hugetlb.c | 2 +-
> 4 files changed, 739 insertions(+), 1 deletion(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c1e6531cb02ae..0e219a44e97cd 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -19,6 +19,8 @@
> #include <linux/shmem_fs.h>
> #include <linux/uaccess.h>
> #include <linux/pkeys.h>
> +#include <linux/minmax.h>
> +#include <linux/overflow.h>
>
> #include <asm/elf.h>
> #include <asm/tlb.h>
> @@ -1749,11 +1751,687 @@ static int pagemap_release(struct inode *inode, struct file *file)
> return 0;
> }
>
> +#define PM_SCAN_CATEGORIES (PAGE_IS_WPALLOWED | PAGE_IS_WRITTEN | \
> + PAGE_IS_FILE | PAGE_IS_PRESENT | \
> + PAGE_IS_SWAPPED | PAGE_IS_PFNZERO | \
> + PAGE_IS_HUGE)
> +#define PM_SCAN_FLAGS (PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC)
> +
> +#define MASKS_OF_INTEREST(a) (a.category_inverted | a.category_mask | \
> + a.category_anyof_mask | a.return_mask)
> +
> +struct pagemap_scan_private {
> + struct pm_scan_arg arg;
> + unsigned long masks_of_interest, cur_vma_category;
> + struct page_region *vec_buf, cur_buf;

I think we can remove cur_buf. Imho, it makes code a bit more readable.
Here is a quick poc patch:
https://gist.github.com/avagin/2e465e7c362c515ec84d72a201a28de4

> + unsigned long vec_buf_len, vec_buf_index, found_pages, walk_end_addr;
> + struct page_region __user *vec_out;
> +};

...

> +#ifdef CONFIG_HUGETLB_PAGE
> +static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
> + unsigned long start, unsigned long end,
> + struct mm_walk *walk)
> +{
> + struct pagemap_scan_private *p = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> + unsigned long categories;
> + spinlock_t *ptl;
> + int ret = 0;
> + pte_t pte;
> +
> + if (~p->arg.flags & PM_SCAN_WP_MATCHING) {
> + /* Go the short route when not write-protecting pages. */
> +
> + pte = huge_ptep_get(ptep);
> + categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
> +
> + if (!pagemap_scan_is_interesting_page(categories, p))
> + return 0;
> +
> + return pagemap_scan_output(categories, p, start, &end);
> + }
> +
> + i_mmap_lock_write(vma->vm_file->f_mapping);
> + ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
> +
> + pte = huge_ptep_get(ptep);
> + categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
> +
> + if (!pagemap_scan_is_interesting_page(categories, p))
> + goto out_unlock;
> +
> + ret = pagemap_scan_output(categories, p, start, &end);
> + if (start == end)
> + goto out_unlock;
> +
> + if (~categories & PAGE_IS_WRITTEN)
> + goto out_unlock;
> +
> + if (end != start + HPAGE_SIZE) {
> + /* Partial HugeTLB page WP isn't possible. */
> + pagemap_scan_backout_range(p, start, end, start);
> + ret = -EINVAL;

Will this error be returned from ioctl? If the answer is yet, it looks
wrong to me.

> + goto out_unlock;
> + }
> +
> + make_uffd_wp_huge_pte(vma, start, ptep, pte);
> + flush_hugetlb_tlb_range(vma, start, end);
> +
> +out_unlock:
> + spin_unlock(ptl);
> + i_mmap_unlock_write(vma->vm_file->f_mapping);
> +
> + return ret;
> +}

....

> +static int pagemap_scan_get_args(struct pm_scan_arg *arg,
> + unsigned long uarg)
> +{
> + if (copy_from_user(arg, (void __user *)uarg, sizeof(*arg)))
> + return -EFAULT;
> +
> + if (arg->size != sizeof(struct pm_scan_arg))
> + return -EINVAL;
> +
> + /* Validate requested features */
> + if (arg->flags & ~PM_SCAN_FLAGS)
> + return -EINVAL;
> + if ((arg->category_inverted | arg->category_mask |
> + arg->category_anyof_mask | arg->return_mask) & ~PM_SCAN_CATEGORIES)
> + return -EINVAL;
> +
> + arg->start = untagged_addr((unsigned long)arg->start);
> + arg->end = untagged_addr((unsigned long)arg->end);
> + arg->vec = untagged_addr((unsigned long)arg->vec);
> +
> + /* Validate memory pointers */
> + if (!IS_ALIGNED(arg->start, PAGE_SIZE))
> + return -EINVAL;
> + if (!access_ok((void __user *)arg->start, arg->end - arg->start))
> + return -EFAULT;
> + if (!arg->vec && arg->vec_len)
> + return -EFAULT;

It looks more like EINVAL.

> + if (arg->vec && !access_ok((void __user *)arg->vec,
> + arg->vec_len * sizeof(struct page_region)))
> + return -EFAULT;
> +
> + /* Fixup default values */
> + arg->end = ALIGN(arg->end, PAGE_SIZE);
> + if (!arg->max_pages)
> + arg->max_pages = ULONG_MAX;
> +
> + return 0;
> +}
> +

Thanks,
Andrei