Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise

From: Charan Teja Kalla
Date: Wed Mar 16 2022 - 10:24:13 EST


Thanks Andrew and Minchan.

On 3/16/2022 7:13 AM, Minchan Kim wrote:
> On Tue, Mar 15, 2022 at 04:48:07PM -0700, Andrew Morton wrote:
>> On Tue, 15 Mar 2022 15:58:28 -0700 Minchan Kim <minchan@xxxxxxxxxx> wrote:
>>
>>> On Fri, Mar 11, 2022 at 08:59:06PM +0530, Charan Teja Kalla wrote:
>>>> The process_madvise() system call is expected to skip holes in vma
>>>> passed through 'struct iovec' vector list. But do_madvise, which
>>>> process_madvise() calls for each vma, returns ENOMEM in case of unmapped
>>>> holes, despite the VMA is processed.
>>>> Thus process_madvise() should treat ENOMEM as expected and consider the
>>>> VMA passed to as processed and continue processing other vma's in the
>>>> vector list. Returning -ENOMEM to user, despite the VMA is processed,
>>>> will be unable to figure out where to start the next madvise.
>>>> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
>>>> Cc: <stable@xxxxxxxxxxxxxxx> # 5.10+
>>>
>>> Hmm, not sure whether it's stable material since it changes semantic of
>>> API. It would be better to change the semantic from 5.19 with man page
>>> update to specify the change.
>>
>> It's a very desirable change and it makes the code match the manpage
>> and it's cc:stable. I think we should just absorb any transitory
>> damage which this causes people. I doubt if there will be much - if
>> anyone was affected by this they would have already told us that it's
>> broken?
>
>
> process_madvise fails to return exact processed bytes at several cases
> if it encounters the error, such as, -EINVAL, -EINTR, -ENOMEM in the
> middle of processing vmas. And now we are trying to make exception for
> change for only hole?
I think EINTR will never be returned in the middle of processing VMAs for
the behaviours supported by process_madvise().

It can return EINTR when:
-------------------------
1) PTRACE_MODE_READ is being checked in mm_access(), where it waits
on task->signal->exec_update_lock. EINTR returned from here guarantees
that process_madvise() didn't even start processing.
https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1264 -->
https://elixir.bootlin.com/linux/v5.16.14/source/kernel/fork.c#L1318

2) process_madvise() has started processing VMAs, but the required
behaviour on a VMA needs mmap_write_lock_killable(), from which EINTR is
returned. The behaviours currently supported by process_madvise(),
MADV_COLD, MADV_PAGEOUT and MADV_WILLNEED, need only the read lock here.
https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1164
**Thus I think there is no way EINTR can be returned by process_madvise()
in the middle of processing.** No?
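
The reasoning in case 2) can be written down as a small userspace model.
This is only a sketch, not kernel code: the enum, its values and the
model_* names are made up for illustration, and the list of behaviours
that need only the read lock is an assumption from my reading of v5.16
mm/madvise.c that should be checked against the tree.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the MADV_* constants; values are arbitrary. */
enum model_behavior {
	MODEL_WILLNEED,
	MODEL_COLD,
	MODEL_PAGEOUT,
	MODEL_DONTNEED,
	MODEL_MERGEABLE,
};

/* Behaviours accepted by process_madvise(), as listed in this mail. */
static bool model_process_madvise_valid(enum model_behavior b)
{
	switch (b) {
	case MODEL_WILLNEED:
	case MODEL_COLD:
	case MODEL_PAGEOUT:
		return true;
	default:
		return false;
	}
}

/*
 * Assumption: these behaviours need only mmap_read_lock(). Everything
 * else takes mmap_write_lock_killable(), which is where EINTR comes from.
 */
static bool model_needs_write_lock(enum model_behavior b)
{
	switch (b) {
	case MODEL_WILLNEED:
	case MODEL_COLD:
	case MODEL_PAGEOUT:
	case MODEL_DONTNEED:
		return false;
	default:
		return true;
	}
}

/* EINTR in the middle of the walk needs both conditions to hold. */
static bool model_eintr_possible_mid_walk(enum model_behavior b)
{
	return model_process_madvise_valid(b) && model_needs_write_lock(b);
}

/* Check the claim for the three behaviours process_madvise() supports. */
static void model_check_claim(void)
{
	assert(!model_eintr_possible_mid_walk(MODEL_WILLNEED));
	assert(!model_eintr_possible_mid_walk(MODEL_COLD));
	assert(!model_eintr_possible_mid_walk(MODEL_PAGEOUT));
}
```

If a behaviour that needs the write lock is ever added to the supported
list, the last check is the one that would start failing in this model.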

For EINVAL:
-----------
The only case I can think of where EINVAL can be returned in the
middle of processing is when the given range contains VMAs with a hole
in between, and one of the VMAs contains pages that fail the
can_madv_lru_vma() condition.
So, is it a limitation that this returns -EINVAL even though some bytes
were processed?
OR
Since some of the bytes passed in are still invalid, is it valid to
return -EINVAL here, and the user has to check the address range sent?

For ENOMEM:
-----------
Though the complete range is processed, it still returns ENOMEM. IMO, this
shouldn't be treated as an error, which is what the patch targets. Then
there is the limitation you mentioned below, where it returns a positive
processed-bytes count even though it didn't process anything, if it
couldn't find any VMA in the first iteration of madvise_walk_vmas.
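
The loop behaviour the patch is after can be modelled in userspace as
below. This is a rough sketch under stated assumptions, not the kernel
implementation: advise_fn stands in for do_madvise(), and
model_process_vec() and stub_advise() are hypothetical names used only
here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical callback standing in for do_madvise(): 0 or -errno. */
typedef int (*advise_fn)(void *start, size_t len);

/*
 * Walk the vector the way the patch proposes: -ENOMEM (a hole in the
 * range) counts as processed, any other error stops the walk.
 * Returns the bytes consumed, or the last error if nothing was consumed.
 */
static ssize_t model_process_vec(const struct iovec *vec, size_t n,
				 advise_fn advise)
{
	ssize_t done = 0;
	int ret = 0;
	size_t i;

	assert(vec != NULL && advise != NULL);

	for (i = 0; i < n; i++) {
		ret = advise(vec[i].iov_base, vec[i].iov_len);
		if (ret < 0 && ret != -ENOMEM)
			break;
		done += (ssize_t)vec[i].iov_len;
	}

	return done ? done : ret;
}

/* Demo stub: each iov_base points at the int result to hand back. */
static int stub_advise(void *start, size_t len)
{
	(void)len;
	return *(const int *)start;
}
```

Note that in this model a vector whose entries all hit -ENOMEM still
yields a positive byte count, which is the limitation mentioned above.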

I think the above limitations with EINVAL and ENOMEM arise because we
rely on do_madvise(), which the madvise() call uses to process a single
VMA. When a 'struct iovec' vector processing interface is given in a
system call, the caller expects the system call to return the correct
number of bytes processed, so that the user can make the correct
decisions. Please correct me if I am wrong here.
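
To illustrate why the byte count matters to the caller, here is a small
sketch of how userspace might map the value returned by
process_madvise() back to a position in its vector. The struct and
function names are hypothetical, and the sketch assumes the returned
count is exact, which is precisely what is in question here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical result type: where the caller should resume. */
struct resume_point {
	size_t index;   /* first iovec entry not fully processed */
	size_t offset;  /* bytes already processed inside that entry */
};

/*
 * Map a processed-bytes count back to a position in the vector.
 * index == n means the whole vector was consumed.
 */
static struct resume_point model_resume_point(const struct iovec *vec,
					      size_t n, size_t done)
{
	struct resume_point rp = { 0, 0 };

	assert(vec != NULL);

	/* Skip every entry that the byte count covers completely. */
	while (rp.index < n && done >= vec[rp.index].iov_len) {
		done -= vec[rp.index].iov_len;
		rp.index++;
	}
	rp.offset = done;
	return rp;
}
```

If the kernel returns an error instead of a count after some entries
were processed, the caller has no input for this calculation and cannot
tell where to start the next madvise.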

So, should we add a new function, say do_process_madvise(), which takes
care of the above limitations? Or are there any alternative suggestions?

> IMO, it's worth to note in man page.
>

Or is the current patch for just ENOMEM sufficient here, so that we just
have to update the man page?

> In addition, this change returns positive processes bytes even though
> it didn't process anything if it couldn't find any vma for the first
> iteration in madvise_walk_vmas.

Thanks,
Charan