Re: [PATCH v28 2/6] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

From: Michał Mirosław
Date: Fri Aug 11 2023 - 09:32:42 EST


On Fri, Aug 11, 2023 at 05:02:44PM +0500, Muhammad Usama Anjum wrote:
> On 8/10/23 10:32 PM, Michał Mirosław wrote:
> > On Wed, 9 Aug 2023 at 08:16, Muhammad Usama Anjum
> > <usama.anjum@xxxxxxxxxxxxx> wrote:
[...]
> >> --- a/fs/proc/task_mmu.c
> >> +++ b/fs/proc/task_mmu.c
> > [...]
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> +static unsigned long pagemap_thp_category(pmd_t pmd)
> >> +{
> >> + unsigned long categories = PAGE_IS_HUGE;
> >> +
> >> + if (pmd_present(pmd)) {
> >> + categories |= PAGE_IS_PRESENT;
> >> + if (!pmd_uffd_wp(pmd))
> >> + categories |= PAGE_IS_WRITTEN;
> >> + if (is_zero_pfn(pmd_pfn(pmd)))
> >> + categories |= PAGE_IS_PFNZERO;
> >> + } else if (is_swap_pmd(pmd)) {
> >> + categories |= PAGE_IS_SWAPPED;
> >> + if (!pmd_swp_uffd_wp(pmd))
> >> + categories |= PAGE_IS_WRITTEN;
> >> + }
> >> +
> >> + return categories;
> >> +}
> > I guess THPs can't be file-backed currently, but can we somehow mark
> > this assumption so it can be easily found if the capability arrives?
> Yeah, THPs cannot be file backed. Lets not care for some feature which may
> not arrive in several years or eternity.

Yes, it might not arrive. But please add at least a comment, so that it
is clearly visible that lack if PAGE_IS_FILE here is intentional.

> >> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >> +
> >> +#ifdef CONFIG_HUGETLB_PAGE
> >> +static unsigned long pagemap_hugetlb_category(pte_t pte)
> >> +{
> >> + unsigned long categories = PAGE_IS_HUGE;
> >> +
> >> + if (pte_present(pte)) {
> >> + categories |= PAGE_IS_PRESENT;
> >> + if (!huge_pte_uffd_wp(pte))
> >> + categories |= PAGE_IS_WRITTEN;
> >> + if (!PageAnon(pte_page(pte)))
> >> + categories |= PAGE_IS_FILE;
> >> + if (is_zero_pfn(pte_pfn(pte)))
> >> + categories |= PAGE_IS_PFNZERO;
> >> + } else if (is_swap_pte(pte)) {
> >> + categories |= PAGE_IS_SWAPPED;
> >> + if (!pte_swp_uffd_wp_any(pte))
> >> + categories |= PAGE_IS_WRITTEN;
> >> + }
> >
> > BTW, can a HugeTLB page be file-backed and swapped out?
> Accourding to pagemap_hugetlb_range(), file-backed HugeTLB page cannot be
> swapped.

Here too a comment that leaving out this case is intentional would be useful.

> > [...]
> >> + walk_start = p.arg.start;
> >> + for (; walk_start < p.arg.end; walk_start = p.arg.walk_end) {
[...[
> >> + ret = mmap_read_lock_killable(mm);
> >> + if (ret)
> >> + break;
> >> + ret = walk_page_range(mm, walk_start, p.arg.end,
> >> + &pagemap_scan_ops, &p);
> >> + mmap_read_unlock(mm);
[...]
> >> + if (ret != -ENOSPC || p.arg.vec_len - 1 == 0 ||
> >> + p.found_pages == p.arg.max_pages)
> >> + break;
> >
> > The second condition is equivalent to `p.arg.vec_len == 1`, but why is
> > that an ending condition? Isn't the last entry enough to gather one
> > more range? (The walk could have returned -ENOSPC due to buffer full
> > and after flushing it could continue with the last free entry.)
> Now we are walking the entire range walk_page_range(). We don't break loop
> when we get -ENOSPC as this error may only mean that the temporary buffer
> is full. So we need check if max pages have been found or output buffer is
> full or ret is 0 or any other error. When p.arg.vec_len = 1 is end
> condition as the last entry is in cur. As we have walked over the entire
> range, cur must be full after which the walk returned.
>
> So current condition is necessary. I've double checked it. I'll change it
> to `p.arg.vec_len == 1`.

If we have walked the whole range, then the loop will end anyway due to
`walk_start < walk_end` not held in the `for()`'s condition.

[...]
> >> +/*
> >> + * struct pm_scan_arg - Pagemap ioctl argument
> >> + * @size: Size of the structure
> >> + * @flags: Flags for the IOCTL
> >> + * @start: Starting address of the region
> >> + * @end: Ending address of the region
> >> + * @walk_end Address where the scan stopped (written by kernel).
> >> + * walk_end == end informs that the scan completed on entire range.
> >
> > Can we ensure this holds also for the tagged pointers?
> No, we cannot.

So this need explanation in the comment here. (Though I'd still like to
know how the address tags are supposed to be used from someone that
knows them.)

Best Regards
Michał Mirosław