Re: [PATCH v19 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

From: Michał Mirosław
Date: Fri Jun 23 2023 - 05:44:36 EST


On Thu, 22 Jun 2023 at 12:21, Muhammad Usama Anjum
<usama.anjum@xxxxxxxxxxxxx> wrote:
> On 6/22/23 12:45 AM, Andrei Vagin wrote:
> > On Wed, Jun 21, 2023 at 11:34:54AM +0500, Muhammad Usama Anjum wrote:
> >> On 6/20/23 11:03 PM, Andrei Vagin wrote:
[...]
> >>> should it be PM_SCAN_FOUND_MAX_PAGES? Otherwise, it fails the ioctl even
> >>> if it has handled some pages already.
> >> It is a double edge sword. If we don't return error, user will never know
> >> that he needs to specify more max_pages or his output buffer is small and
> >> not coverig the entire range. The real purpose here that user gets aware
> >> that he needs to specify full hugetlb range and found pages should cover
> >> the entire range as well.
> >
> > but if PM_SCAN_OP_WP is set, this error will be fatal, because some
> > pages can be marked write-protected, but they are not be reported to
> > user-space.
> >
> > I think the ioctl has to report back the end address of the handled
> > region. It is like read and write syscalls that can return less data
> > than requested.
> This is good point. I'll abort the walk here instead of retuning the error
> to user.

It would be great if the ioctl returned the address the walk finished
at. This would remove the special case for "buffer full after full
page was added" and if it happens that despite `max_pages` limit was
reached but no more pages would need to be added the caller would not
have to call the ioctl again on the remaining range.

[...]
> >>> How long can it take to run this loop? Should we interrupt it if a
> >>> signal has been queued?
> >> I'm following mincore and pagemap_read here. There is no such thing there.
> >> IOCTL is performing what user has requested.
> >
> > In case of pagemap, its buffer is limited by MAX_RW_COUNT (0x7ffff000),
> > so it can handle maximum 0xffffe00 pages in one iteration. Do you have any
> > limits for input parameters?
> >
> >> If the execution time is a
> >> concern, user should have called the IOCTL on shorter address range.
> >
> > It doesn't work this way. There can be many users and signals has to be
> > delivered in a reasonable time. One of the obvious examples when a signal
> > has to be delivered asap is OOM.
> This IOCTL is just like mincore, but with other flags and functionalities.
> Mincore doesn't put any limit like this. I don't think we should put any
> limit here as well.

I don't think we should treat mincore() as a good API example. Its
interface has similar performance problems to what select() or poll()
does. In this ioctl's case, we can limit the output at this end (large
anyway) as it won't affect the performance if for x TiB of memory the
call is made twice instead of once. (Returning the end of the walk
reached would be much help here.)

Best Regards
Michał Mirosław