Re: [External] Re: [PATCH v2] mm: add new syscall pidfd_set_mempolicy().

From: Michal Hocko
Date: Mon Nov 14 2022 - 06:46:14 EST


On Mon 14-11-22 00:41:21, Zhongkun He wrote:
> Hi Andrew, thanks for your replay.
>
> > This sounds a bit suspicious. Please share much more detail about
> > these races. If we proced with this design then mpol_put_async()
> > shouild have comments which fully describe the need for the async free.
> >
> > How do we *know* that these races are fully prevented with this
> > approach? How do we know that mpol_put_async() won't free the data
> > until the race window has fully passed?
>
> A mempolicy can be either associated with a process or with a VMA.
> All vma manipulation is somewhat protected by a down_read on
> mmap_lock.In process context there is no locking because only
> the process accesses its own state before.

We shouldn't really rely on mmap_sem for this IMO. There is alloc_lock
(aka task lock) that makes sure the policy is stable so that caller can
atomically take a reference and hold on the policy. And we do not do
that consistently and this should be fixed. E.g. just looking at some
random places like allowed_mems_nr (relying on get_task_policy) is
completely lockless and some paths (like fadvise) do not use any of the
explicit (alloc_lock) or implicit (mmap_lock) locking. That means that
the task_work based approach cannot really work in this case, right?

Playing more tricks will not really help long term. So while your patch
tries to workaround the current state of the art I do not think we
really want that. As stated previously, I would much rather see proper
reference counting instead. I thought we have agreed this would be the
first approach unless the resulting performance is really bad. Have you
concluded that to be the case? I do not see any numbers or notes in the
changelog.
--
Michal Hocko
SUSE Labs