Re: [RFC]: userspace memory reaping

From: Suren Baghdasaryan
Date: Thu Oct 15 2020 - 15:26:33 EST


On Thu, Oct 15, 2020 at 2:20 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Wed 14-10-20 09:57:20, Suren Baghdasaryan wrote:
> > On Wed, Oct 14, 2020 at 5:09 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> [...]
> > > > > The need is similar to why oom-reaper was introduced - when a process
> > > > > is being killed to free memory we want to make sure memory is freed
> > > > > even if the victim is in uninterruptible sleep or is busy and reaction
> > > > > to SIGKILL is delayed by an unpredictable amount of time. I
> > > > > experimented with enabling process_madvise(MADV_DONTNEED) operation
> > > > > and using it to force memory reclaim of the target process after
> > > > > sending SIGKILL. Unfortunately this approach requires the caller to
> > > > > read proc/pid/maps to extract the list of VMAs to pass as an input to
> > > > > process_madvise().
> > >
> > > Well I would argue that this is not really necessary. You can simply
> > > call process_madvise with the full address range and let the kernel
> > > operated only on ranges which are safe to tear down asynchronously.
> > > Sure that would require some changes to the existing code to not fail
> > > on those ranges if they contain incompatible vmas but that should be
> > > possible. If we are worried about backward compatibility then a
> > > dedicated flag could override.
> > >
> >
> > IIUC this is very similar to the last option I proposed. I think this
> > is doable if we treat it as a special case. process_madvise() return
> > value not being able to handle a large range would still be a problem.
> > Maybe we can return MAX_INT in those cases?
>
> madvise is documented to return
> On success, madvise() returns zero. On error, it returns -1 and
> errno is set appropriately.
> [...]
> NOTES
> Linux notes
> The Linux implementation requires that the address addr be
> page-aligned, and allows length to be zero. If there are some
> parts of the specified address range that are not mapped, the
> Linux version of madvise() ignores them and applies the call to
> the rest (but returns ENOMEM from the system call, as it should).
>
> I have learned about ENOMEM case only now. And it seems this is indeed
> what we are implementing. So if we want to add a new mode to
> opportunistically attempt madvise on the whole given range without a
> failure then we need a specific flag for that. Advice is a number rather
> than a bitmask but (ab)using the top bit or use negative number space
> (e.g. -MADV_DONTNEED) for that sounds possible albeit bit hackish.

process_madvise() has an additional flag parameter. Why not have a
separate flag to denote that we want to just skip VMA gaps and proceed
without error? Something like MADVF_SKIP_GAPS?

>
> [...]
> > > I do have a vague recollection that we have discussed a kill(2) based
> > > approach as well in the past. Essentially SIG_KILL_SYNC which would
> > > not only send the signal but it would start a teardown of resources
> > > owned by the task - at least those we can remove safely. The interface
> > > would be much more simple and less tricky to use. You just make your
> > > userspace oom killer or potentially other users call SIG_KILL_SYNC which
> > > will be more expensive but you would at least know that as many
> > > resources have been freed as the kernel can afford at the moment.
> >
> > Correct, my early RFC here
> > https://patchwork.kernel.org/project/linux-mm/patch/20190411014353.113252-3-surenb@xxxxxxxxxx
> > was using a new flag for pidfd_send_signal() to request mm reaping by
> > oom-reaper kthread. IIUC you propose to have a new SIG_KILL_SYNC
> > signal instead of a new pidfd_send_signal() flag and otherwise a very
> > similar solution. Is my understanding correct?
>
> Well, I think you shouldn't focus too much on the oom-reaper aspect
> of it. Sure it can be used for that but I believe that a new signal
> should provide a sync behavior. People more familiar with the process
> management would be better off defining what is possible for a new sync
> signal. Ideally not only pro-active process destruction but also sync
> waiting until the target process is released so that you know that once
> kill syscall returns the process is gone.

If your suggestion is for SIG_KILL_SYNC to perform victim's resource
cleanup in the context of the caller while the victim is in
uninterruptible sleep that would definitely be useful. I assume there
are some resources which can't be reclaimed until the process itself
wakes up and handles the SIGKILL. If so, I hope kill(SIG_KILL_SYNC)
would not have to wait for the victim to wake up and handle the
signal. This would really complicate the userspace in cases when we
just want to reclaim whatever we can without victim's involvement and
continue. For cases when waiting is required waitid() with P_PIDFD can
be used.
Would this semantic work?

>
> --
> Michal Hocko
> SUSE Labs