Re: [PATCH v3] mm/oom: do not oom reap task with an unresolved robust futex

From: Nico Pache
Date: Mon Feb 14 2022 - 15:45:03 EST




On 1/18/22 03:51, Michal Hocko wrote:
> On Mon 17-01-22 17:56:28, Nico Pache wrote:
>> On 1/17/22 11:05, Waiman Long wrote:
>>> On 1/17/22 03:52, Michal Hocko wrote:
> [...]
>>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>>>> index 1ddabefcfb5a..3cdaac9c7de5 100644
>>>>> --- a/mm/oom_kill.c
>>>>> +++ b/mm/oom_kill.c
>>>>> @@ -667,6 +667,21 @@ static void wake_oom_reaper(struct task_struct *tsk)
>>>>>       if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>>>>>           return;
>>>>>   +#ifdef CONFIG_FUTEX
>>>>> +    /*
>>>>> +     * If the ooming task's SIGKILL has not finished handling the
>>>>> +     * robust futex it is not correct to reap the mm concurrently.
>>>>> +     * Do not wake the oom reaper when the task still contains a
>>>>> +     * robust list.
>>>>> +     */
>>>>> +    if (tsk->robust_list)
>>>>> +        return;
>>>>> +#ifdef CONFIG_COMPAT
>>>>> +    if (tsk->compat_robust_list)
>>>>> +        return;
>>>>> +#endif
>>>>> +#endif
>>>> If this turns out to be really needed, which I do not really see at the
>>>> moment, then this is not the right way to handle this situation. The oom
>>>> victim could get stuck and the oom killer wouldn't be able to move
>>>> forward. If anything the victim would need to get MMF_OOM_SKIP set.
>>
>> I will try this, but I don't immediately see any difference between this return
>> case and setting the bit, passing the oom_reaper_list, then skipping it based on
>> the flag. Do you mind explaining how this could lead to the oom killer getting
>> stuck?
>
> The primary purpose of the oom_reaper is to guarantee a forward
> progress. If a task gets stuck in the kernel - e.g. on locks then it
> won't bail out and won't handle signals (i.e. SIGKILL from the
> userspace). The oom killer prevents new oom victims selection in a
> presence of an existing oom victim (see oom_evaluate_task). That means
> that we not only send a SIGKILL to the victim, we also wake up the oom
> reaper which then asynchronously tears down the private memory of the
> task (thus release at least some of its memory) and once it is done it
> will set MMF_OOM_SKIP flag which will tell the oom killer
> (oom_evaluate_task) that this victim is no longer interesting and a new
> victim can be selected.
>
> Makes sense?
Thank does, Thank you for clearing that up!
>
> Part of the async tear down is also MMF_UNSTABLE handling (see
> __oom_reap_task_mm) which tells #PF handling code
> (check_stable_address_space) that the underlying memory could have been
> tempered with and thus it should return SIGBUS. The underlying
> assumption is that the process (and all tasks which share its mm) has
> been killed and it will never return to the userspace so the de-facto
> memory corruption doesn't matter.
>
> One thing that is still unclear to me is why this leads to any locked up
> tasks. Looking at exit_robust_list I can see that it is accessing the
> userspace memory but this should return EFAULT in this situation. My
> assumption (which might be really wrong) is that futex shared among
> processes which are not sharing mm nor signal handling will be sitting
> in a shared memory.
>
> Now to the actual fix. I do not think we want to hide the task from the
> oom reaper as you are suggesting. Futexes are very likely to be used for
> many processes and that would make the whole async scheme useless. We
> need something like the below.
>
> futex_exit_release is not directly usable as it implicitly depends
> on memory allocation (#PF) and that is not acceptable. So instead we
> need something futex_exit_try_release or similar which would fail the
> operation in case get_user (pagefault_disable) needs to really handle
> the #PF or if the futex_exit_mutex is locked. In other words this would
> have to be a completely non-blocking operation. The oom reaper would
> then bail out.
Yeah Joel came up with a similar idea once we realized his v2 had the issue with
sleeping.

We've recently been discussing the following if statement in __oom_reap_task_mm:
if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))

Given the comment above it, and some of the upstream discussion the original
RFC, we are struggling to see why this should be a `||` and not an `&&`. If we
only want to reap anon memory and reaping shared memory can be dangerous is this
statement incorrect?

We have a patch queued up to make this change, but wanted to get your opinion on
why this was originally designed this way in case we are missing something.

-- Nico

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ef5860fc7d22..57660d3d1b79 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -613,6 +613,9 @@ static void oom_reap_task(struct task_struct *tsk)
> int attempts = 0;
> struct mm_struct *mm = tsk->signal->oom_mm;
>
> + if (futex_exit_try_release(tsk))
> + goto fail;
> +
> /* Retry the mmap_read_trylock(mm) a few times */
> while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
> schedule_timeout_idle(HZ/10);
> @@ -621,6 +624,7 @@ static void oom_reap_task(struct task_struct *tsk)
> test_bit(MMF_OOM_SKIP, &mm->flags))
> goto done;
>
> +fail:
> pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> task_pid_nr(tsk), tsk->comm);
> sched_show_task(tsk);
> @@ -1184,6 +1188,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> if (!reap)
> goto drop_mm;
>
> + if (futex_exit_try_release(tsk)) {
> + ret = -EAGAIN;
> + goto drop_mm;
> + }
> +
> if (mmap_read_lock_killable(mm)) {
> ret = -EINTR;
> goto drop_mm;