Re: [RFC PATCH] mm, oom_reaper: skip mm structs with mmu notifiers

From: Michal Hocko
Date: Wed Aug 30 2017 - 05:09:26 EST


On Wed 30-08-17 10:46:00, Michal Hocko wrote:
> From: Michal Hocko <mhocko@xxxxxxxx>
>
> Andrea has noticed that the oom_reaper doesn't invalidate the range
> via mmu notifiers (mmu_notifier_invalidate_range_start,
> mmu_notifier_invalidate_range_end) and that can corrupt the memory
> of the kvm guest for example.

Forgot to mention that tlb_flush_mmu_tlbonly already invokes mmu
notifiers but that is not sufficient as per Andrea:
: mmu_notifier_invalidate_range cannot be used in replacement of
: mmu_notifier_invalidate_range_start/end. For KVM
: mmu_notifier_invalidate_range is a noop and rightfully so. A MMU
: notifier implementation has to implement either
: ->invalidate_range method or the invalidate_range_start/end
: methods, not both. And if you implement invalidate_range_start/end
: like KVM is forced to do, calling mmu_notifier_invalidate_range in
: common code is a noop for KVM.
:
: For those MMU notifiers that can get away only implementing
: ->invalidate_range, the ->invalidate_range is implicitly called by
: mmu_notifier_invalidate_range_end(). And only those secondary MMUs
: that share the same pagetable with the primary MMU (like AMD
: iommuv2) can get away only implementing ->invalidate_range.

> As the callback is allowed to sleep
> and the implementation is out of hand of the MM it is safer to simply
> bail out if there is an mmu notifier registered. In order to not
> fail too early make the mm_has_notifiers check under the oom_lock
> and have a little nap before failing to give the current oom victim some
> more time to exit.
>
> Fixes: aac453635549 ("mm, oom: introduce oom reaper")
> Noticed-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
> ---
>
> Hi,
> Andrea has pointed this out [1] while a different (but similar) bug has been
> discussed. This is an ugly hack to plug the potential memory corruption but
> we definitely want a better fix longterm.

> Does this sound like a viable option for now?
>
> [1] http://lkml.kernel.org/r/20170829140924.GB21615@xxxxxxxxxx
>
> include/linux/mmu_notifier.h | 5 +++++
> mm/oom_kill.c | 15 +++++++++++++++
> 2 files changed, 20 insertions(+)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index c91b3bcd158f..947f21b451d2 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -420,6 +420,11 @@ extern void mmu_notifier_synchronize(void);
>
> #else /* CONFIG_MMU_NOTIFIER */
>
> +static inline int mm_has_notifiers(struct mm_struct *mm)
> +{
> + return 0;
> +}
> +
> static inline void mmu_notifier_release(struct mm_struct *mm)
> {
> }
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 99736e026712..45f1a0c3dd90 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -40,6 +40,7 @@
> #include <linux/ratelimit.h>
> #include <linux/kthread.h>
> #include <linux/init.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/tlb.h>
> #include "internal.h"
> @@ -488,6 +489,20 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> */
> mutex_lock(&oom_lock);
>
> + /*
> + * If the mm has notifiers then we would need to invalidate them around
> + * unmap_page_range and that is risky because notifiers can sleep and
> + * what they do is basically undeterministic. So let's have a short sleep
> + * to give the oom victim some more time.
> + * TODO: we really want to get rid of this ugly hack and make sure that
> + * notifiers cannot block for unbounded amount of time and add
> + * mmu_notifier_invalidate_range_{start,end} around unmap_page_range
> + */
> + if (mm_has_notifiers(mm)) {
> + schedule_timeout_idle(HZ);
> + goto unlock_oom;
> + }
> +
> if (!down_read_trylock(&mm->mmap_sem)) {
> ret = false;
> trace_skip_task_reaping(tsk->pid);
> --
> 2.13.2
>

--
Michal Hocko
SUSE Labs