Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

From: Joel Fernandes
Date: Thu Mar 14 2019 - 22:56:08 EST


On Thu, Mar 14, 2019 at 01:49:11PM -0700, Sultan Alsawaf wrote:
> On Thu, Mar 14, 2019 at 10:47:17AM -0700, Joel Fernandes wrote:
> > About the 100ms latency, I wonder whether it is that high because of
> > the way Android's lmkd is observing that a process has died. There is
> > a gap between when a process memory is freed and when it disappears
> > from the process-table. Once a process is SIGKILLed, it becomes a
> > zombie. Its memory is freed instantly during the SIGKILL delivery (I
> > traced this so that's how I know), but until it is reaped by its
> > parent thread, it will still exist in /proc/<pid> . So if testing the
> > existence of /proc/<pid> is how Android is observing that the process
> > died, then there can be a large latency where it takes a very long
> > time for the parent to actually reap the child way after its memory
> > was long freed. A quicker way to know if a process's memory is freed
> > before it is reaped could be to read back /proc/<pid>/maps in
> > userspace of the victim <pid>, and that file will be empty for zombie
> > processes. So then one does not need to wait for the parent to reap it. I
> > wonder how much of that 100ms you mentioned is actually "waiting for the
> > parent to reap the child" rather than "memory freeing time". So
> > yeah for this second problem, the procfds work will help.
> >
> > By the way another approach that can provide a quick and asynchronous
> > notification of when the process memory is freed, is to monitor
> > sched_process_exit trace event using eBPF. You can tell eBPF the PID
> > that you want to monitor before the SIGKILL. As soon as the process
> > dies and its memory is freed, the eBPF program can send a notification
> > to user space (using the perf_events polling infra). The
> > sched_process_exit fires just after the mmput() happens so it is quite
> > close to when the memory is reclaimed. This also doesn't need any
> > kernel changes. I could come up with a prototype for this and
> > benchmark it on Android, if you want. Just let me know.
>
> Perhaps I'm missing something, but if you want to know when a process has died
> after sending a SIGKILL to it, then why not just make the SIGKILL optionally
> block until the process has died completely? It'd be rather trivial to just
> store a pointer to an onstack completion inside the victim process' task_struct,
> and then complete it in free_task().

I'm not sure that makes much semantic sense given how signal handling is
supposed to work. Imagine a parent sends SIGKILL to its child and then does
a wait(2). Because the SIGKILL blocks in your idea, the wait cannot execute,
and because the wait cannot execute, the zombie task never gets reaped, so
the SIGKILL sender never gets unblocked and the whole thing just locks up.
No? I don't know, it just feels incorrect.
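
To make the ordering concrete, here is a minimal userspace sketch of the
usual kill-then-wait sequence, written in Python for brevity. The function
name kill_then_reap is made up for illustration. It only shows today's
behaviour, where kill() returns before the child is reaped; it does not
model the proposed blocking variant.

```python
import os
import signal
import time


def kill_then_reap():
    """Fork a child, SIGKILL it, then reap it; return (pid, wait status)."""
    pid = os.fork()
    if pid == 0:
        # Child: sleep until killed; never returns to the caller.
        while True:
            time.sleep(1)
    # Parent: kill() returns without waiting for the child to be freed.
    os.kill(pid, signal.SIGKILL)
    # Only now can the parent reap the zombie. If the kill() above blocked
    # until the task was completely freed, this line would never run.
    _, status = os.waitpid(pid, 0)
    return pid, status
```

The waitpid() call is what releases the zombie, so a sender that has to
wait for the final free before returning from kill() would be waiting on
its own next statement.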

Further, adding stuff to task_struct as in your idea will simply bloat it,
when this can easily be handled using eBPF without making any kernel changes,
by probing either the sched_process_free or the sched_process_exit
tracepoint. Scheduler maintainers generally frown on adding stuff to
task_struct pointlessly, and for good reason, since bloating it affects
performance etc., and something like this would probably never be ifdef'd
out behind a CONFIG.
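
For reference, the /proc/<pid>/maps check quoted above also needs no kernel
changes. A rough sketch of it follows, in Python for brevity. The function
name memory_freed and the proc_root parameter are assumptions made for
illustration (proc_root only exists so the helper can be pointed at a fake
directory when testing), and this is not what lmkd does today.

```python
import os


def memory_freed(pid, proc_root="/proc"):
    """Return True if <proc_root>/<pid>/maps reads back empty or is gone."""
    path = os.path.join(proc_root, str(pid), "maps")
    try:
        with open(path, "rb") as f:
            # An empty maps file means the task no longer has an address
            # space, which is the case for a zombie.
            return f.read(1) == b""
    except (FileNotFoundError, ProcessLookupError):
        # The entry vanished: the task was already reaped.
        return True
```

A caller would send the SIGKILL and then poll memory_freed(pid) instead of
waiting for /proc/<pid> to disappear. Kernel threads also have an empty maps
file, so this is only meaningful for a userspace victim.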

thanks,

- Joel