Re: [PATCH v2] mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs()

From: Naoya Horiguchi
Date: Fri Aug 25 2023 - 02:03:14 EST

Next message: Alistair Popple: "Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface"
Previous message: Shrikanth Hegde: "Re: [RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance"
Next in thread: Tong Tiangen: "Re: [PATCH v2] mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Aug 22, 2023 at 01:08:52PM +0100, Matthew Wilcox wrote:
> On Tue, Aug 22, 2023 at 11:41:41AM +0800, Tong Tiangen wrote:
> > 在 2023/8/22 2:33, Matthew Wilcox 写道:
> > > On Mon, Aug 21, 2023 at 05:13:12PM +0800, Tong Tiangen wrote:
> > > > We can see that CPU1 waiting for CPU0 respond IPI，CPU0 waiting for CPU2
> > > > unlock tasklist_lock, CPU2 waiting for CPU1 unlock page->ptl. As a result,
> > > > softlockup is triggered.
> > > >
> > > > For collect_procs_anon(), we will not modify the tasklist, but only perform
> > > > read traversal. Therefore, we can use rcu lock instead of spin lock
> > > > tasklist_lock, from this, we can break the softlock chain above.
> > >
> > > The only thing that's giving me pause is that there's no discussion
> > > about why this is safe. "We're not modifying it" isn't really enough
> > > to justify going from read_lock() to rcu_read_lock(). When you take a
> > > normal read_lock(), writers are not permitted and so you see an atomic
> > > snapshot of the list. With rcu_read_lock() you can see inconsistencies.
> >
> > Hi Matthew:
> >
> > When rcu_read_lock() is used, the task list can be modified during the
> > iteration, but cannot be seen during iteration. After the iteration is
> > complete, the task list can be updated in the RCU mechanism. Therefore, the
> > task list used by iteration can also be considered as a snapshot.
>
> No, that's not true! You are not iterating a snapshot of the list,
> you're iterating the live list. It will change under you. RCU provides
> you with some guarantees about that list. See Documentation/RCU/listRCU.rst
>
> > > For example, if new tasks can be added to the tasklist, they may not
> > > be seen by an iteration. Is this OK?
> >
> > The newly added tasks does not access the HWPoison page, because the
> > HWPoison page has been isolated from the
> > buddy(memory_failure()->take_page_off_buddy()). Therefore, it is safe to see
> > the newly added task during the iteration and not be seen by iteration.
> >
> > Tasks may be removed from the
> > > tasklist after they have been seen by the iteration. Is this OK?
> >
> > Task be seen during iteration are deleted from the task list after
> > iteration, it's task_struct is not released because reference counting is
> > added in __add_to_kill(). Therefore, the subsequent processing of
> > kill_procs() is not affected (sending signals to the task deleted from task
> > list). so i think it's safe too.
>
> I don't know this code, but it seems unsafe to me. Look:
>
> collect_procs_anon:
> for_each_process(tsk) {
> struct task_struct *t = task_early_kill(tsk, force_early);
> add_to_kill_anon_file(t, page, vma, to_kill);
>
> add_to_kill_anon_file:
> __add_to_kill(tsk, p, vma, to_kill, 0, FSDAX_INVALID_PGOFF);
>
> __add_to_kill:
> get_task_struct(tsk);
>
> static inline struct task_struct *get_task_struct(struct task_struct *t)
> {
> refcount_inc(&t->usage);
> return t;
> }
>
> /**
> * refcount_inc - increment a refcount
> * @r: the refcount to increment
> *
> * Similar to atomic_inc(), but will saturate at REFCOUNT_SATURATED and WARN.
> *
> * Provides no memory ordering, it is assumed the caller already has a
> * reference on the object.
> *
> * Will WARN if the refcount is 0, as this represents a possible use-after-free
> * condition.
> */
>
> I don't see anything that prevents that refcount_inc from seeing a zero
> refcount. Usually that would be prevented by tasklist_lock, right?

This "calling get_task_struct in for_each_process loop with read_rcu_lock"
pattern seems to be used also in mm/oom_kill.c (for example in select_bad_process()),
so this might be more generic problem.
I tried to see how OOM code prevents the issue, but there seems nothing to do that.
oom_kill_process(), which is responsible for terminating the victim process, directly
tries to acquire task_lock(victim), despite *victim could be freed at this point.
If someone knows OOM code is safe for some reason, hwpoison could potentially follow
the approach.

Thanks,
Naoya Horiguchi

>
> Andrew, I think this patch is bad and needs to be dropped.
>
>

Next message: Alistair Popple: "Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface"
Previous message: Shrikanth Hegde: "Re: [RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance"
Next in thread: Tong Tiangen: "Re: [PATCH v2] mm: memory-failure: use rcu lock instead of tasklist_lock when collect_procs()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]