Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

From: Suren Baghdasaryan
Date: Mon Nov 30 2020 - 14:02:27 EST


On Wed, Nov 25, 2020 at 3:43 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> On Wed, Nov 25, 2020 at 03:23:40PM -0800, Suren Baghdasaryan wrote:
> > On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > > process_madvise requires a vector of address ranges to be provided for
> > > > its operations. When advice should be applied to the entire process,
> > > > the caller has to obtain the list of VMAs of the target process
> > > > by reading /proc/pid/maps or in some other way. The cost of this
> > > > operation grows linearly with the number of VMAs in the target
> > > > process. Even constructing the input vector can be non-trivial when
> > > > the target process has several thousand VMAs and the syscall is being
> > > > issued during a period of high memory pressure, when new allocations
> > > > for such a vector would only worsen the situation.
> > > > In the case when advice is being applied to the entire memory space of
> > > > the target process, this creates an extra overhead.
> > > > Add a PMADV_FLAG_RANGE flag for process_madvise, enabling the caller to
> > > > advise a memory range of the target process. For now, to keep it simple,
> > > > only the entire process memory range is supported; the vec and vlen
> > > > inputs are ignored in this mode and can be NULL and 0.
> > > > Instead of returning the number of bytes that advice was successfully
> > > > applied to, the syscall in this mode returns 0 on success. This is due
> > > > to the fact that the number of bytes would not be useful for the caller
> > > > that does not know the amount of memory the call is supposed to affect.
> > > > Besides, the ssize_t return type can be too small to hold the number of
> > > > bytes affected when the operation is applied to a large memory range.
> > >
> > > Can we just use one element in the iovec to indicate the entire address
> > > range rather than using up the reserved flags?
> > >
> > > struct iovec vec = {
> > >         .iov_base = NULL,
> > >         .iov_len = ~(size_t)0,
> > > };
> > >
> > > Furthermore, if we agree on it, the same convention could be applied to
> > > other syscalls that support iovec.
> > >
> >
> > The flag also changes the return value semantics. If we follow your
> > suggestion we should also agree that in this mode the return value
> > will be 0 on success and negative otherwise instead of the number of
> > bytes madvise was applied to.
>
> Well, the return value will depend on each API. If the operation is
> destructive, it should return the size actually affected by the API;
> otherwise, returning 0 or an error would be okay.

I'm fine with dropping the flag. I just thought that with the flag it
would be more explicit that this is a special mode operating on ranges.
Without the flag the patch also becomes simpler.
Andrew, Michal, Christian, what do you think about such an API? Should I
change the API this way, keep the flag, or change it in some other way?
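
For concreteness, the single-element convention Minchan suggested could be
detected with a check along these lines. This is only a minimal sketch:
pmadv_is_whole_range() is a hypothetical name used for illustration and is
not part of the posted patch, and it is written as plain userspace C rather
than kernel code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (not from the patch): returns true when the caller
 * passed exactly one iovec element of the form { NULL, ~(size_t)0 },
 * i.e. the proposed "entire address range" sentinel.
 */
static bool pmadv_is_whole_range(const struct iovec *vec, size_t vlen)
{
	return vec != NULL && vlen == 1 &&
	       vec[0].iov_base == NULL &&
	       vec[0].iov_len == ~(size_t)0;
}
```

With this approach no new flag is consumed, but the return value semantics
for the whole-range case would still need to be agreed on separately.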