Re: [RFC v1 2/4] kernel/fork.c: implement new process_mmput_async syscall

From: Claudio Imbrenda
Date: Fri Nov 12 2021 - 04:35:34 EST


On Thu, 11 Nov 2021 13:20:11 -0600
ebiederm@xxxxxxxxxxxx (Eric W. Biederman) wrote:

> Claudio Imbrenda <imbrenda@xxxxxxxxxxxxx> writes:
>
> > The goal of this new syscall is to be able to asynchronously free the
> > mm of a dying process. This is especially useful for processes that use
> > huge amounts of memory (e.g. databases or KVM guests). The process is
> > allowed to terminate immediately, while its mm is cleaned/reclaimed
> > asynchronously.
> >
> > A separate process needs to use the process_mmput_async syscall to attach
> > itself to the mm of a running target process. The process will then
> > sleep until the last user of the target mm has gone.
> >
> > When the last user of the mm has gone, instead of synchronously freeing
> > the mm, the attached process is awoken. The syscall will then continue
> > and clean up the target mm.
> >
> > This solution has the advantage that the cleanup of the target mm can
> > be both asynchronous and properly accounted for (e.g. cgroups).
> >
> > Tested on s390x.
> >
> > A separate patch will actually wire up the syscall.
>
> I am a bit confused.
>
> You want the process to report that it has finished immediately,
> and you want the cleanup work to continue on in the background.
>
> Why do you need a separate process?
>
> Why not just modify the process cleanup code to keep the task_struct
> running while allowing waitpid to reap the process (aka allowing
> release_task to run)? All tasks can already be reaped after
> exit_notify in do_exit.
>
> I can see some reasons for wanting an opt-in. It is nice to know all of
> a process's resources have been freed when waitpid succeeds.
>
> Still I don't see why this whole thing isn't exit_mm returning
> the mm_struct when a flag is set, and then having an exit_mm_late
> being called and passed the returned mm after exit_notify.

Never mind: exit_notify is done after cgroup_exit, so a teardown that runs
after exit_notify would no longer be accounted for properly.
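
To illustrate the ordering problem, here is a rough user-space model. It is
not kernel code: struct toy_task, toy_teardown_mm and toy_do_exit are
hypothetical stand-ins, and only the order of the three steps (exit_mm,
then cgroup_exit, then exit_notify) is taken from do_exit. The late_mm flag
models the suggested exit_mm_late variant.

```c
#include <stdbool.h>

/* Toy stand-in for a task; none of this is the kernel's real layout. */
struct toy_task {
	bool in_cgroup;       /* cleared by the cgroup_exit() step */
	bool reapable;        /* set by the exit_notify() step */
	long mm_work;         /* amount of mm teardown work still pending */
	long charged_work;    /* teardown work accounted to the task's cgroup */
	long uncharged_work;  /* teardown work done after leaving the cgroup */
};

/* Tear down the mm; the work is accounted only while still in the cgroup. */
static void toy_teardown_mm(struct toy_task *t)
{
	if (t->in_cgroup)
		t->charged_work += t->mm_work;
	else
		t->uncharged_work += t->mm_work;
	t->mm_work = 0;
}

/*
 * Same step order as do_exit(): exit_mm, cgroup_exit, exit_notify.
 * With late_mm set, the teardown is deferred until after the
 * exit_notify step (the hypothetical exit_mm_late).
 */
static void toy_do_exit(struct toy_task *t, bool late_mm)
{
	if (!late_mm)
		toy_teardown_mm(t);   /* exit_mm() */

	t->in_cgroup = false;         /* cgroup_exit() */
	t->reapable = true;           /* exit_notify() */

	if (late_mm)
		toy_teardown_mm(t);   /* hypothetical exit_mm_late() */
}
```

In this model, calling toy_do_exit() with late_mm set leaves all the
teardown work in uncharged_work, because the task has already left its
cgroup by the time it becomes reapable.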

>
> Or maybe something with schedule_work or task_work, instead of an
> exit_mm_late. I don't see any practical difference.
>
> I really don't see why this needs a whole other process to connect to
> the process you care about asynchronously.
>
> This whole thing seems an exercise in spending lots of resources to free
> resources much later.
>
> Eric