Re: [PATCH] sched/numa: fix unsafe get_task_struct() in task_numa_assign()

From: Kirill Tkhai
Date: Sat Oct 18 2014 - 19:13:44 EST


19.10.2014, 00:59, "Oleg Nesterov" <oleg@xxxxxxxxxx>:
>  On 10/18, Kirill Tkhai wrote:
>>   18.10.2014, 01:40, "Oleg Nesterov" <oleg@xxxxxxxxxx>:
>>>   ...
>>>   The
>>>   task_struct itself can't go away,
>>>   ...
>>>   --- a/kernel/sched/fair.c
>>>   +++ b/kernel/sched/fair.c
>>>   @@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa_env *env,
>>>
>>>            rcu_read_lock();
>>>            cur = ACCESS_ONCE(dst_rq->curr);
>>>   - if (cur->pid == 0) /* idle */
>>>   + /*
>>>   + * No need to move the exiting task, and this ensures that ->curr
>>>   + * wasn't reaped and thus get_task_struct() in task_numa_assign()
>>>   + * is safe; note that rcu_read_lock() can't protect from the final
>>>   + * put_task_struct() after the last schedule().
>>>   + */
>>>   + if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>>>                    cur = NULL;
>>>
>>>            /*
>>   Oleg, I've looked once again, and now it's not good for me.
>  Ah. Thanks a lot Kirill for correcting me!
>
>  I was looking at this rcu_read_lock() and I didn't even try to think
>  what it can actually protect. Nothing.

<snip>

>  No, I don't think this can work. Let's look at the current code:
>
>          rcu_read_lock();
>          cur = ACCESS_ONCE(dst_rq->curr);
>          if (cur->pid == 0) /* idle */
>
>  And any dereference, even reading ->pid is not safe. This memory can be
>  freed, unmapped, reused, etc.
>
>  Looks like, task_numa_compare() needs to take dst_rq->lock and get the
>  refernce first.

Yeah, detection of idle is not save. If we reorder the checks almost all
problems will be gone. All except unmapping. JFI, is it possible with
such kernel structures as task_struct? I.e. do mem caches use high memory
in their bosoms?
Thanks!

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b78280c..114ec33 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1165,7 +1165,30 @@ static void task_numa_compare(struct task_numa_env *env,

rcu_read_lock();
cur = ACCESS_ONCE(dst_rq->curr);
- if (cur->pid == 0) /* idle */
+ /*
+ * No need to move the exiting task, and this ensures that ->curr
+ * wasn't reaped and thus get_task_struct() in task_numa_assign()
+ * is safe; note that rcu_read_lock() can't protect from the final
+ * put_task_struct() after the last schedule().
+ */
+ if (cur->flags & PF_EXITING)
+ cur = NULL;
+ smp_rmb(); /* Pairs with dst_rq->lock unlocking which implies smp_wmb */
+ /*
+ * Check once again to be sure curr is still on dst_rq. Three situations
+ * are possible here:
+ * 1)cur has gone and been freed, and dst_rq->curr is pointing on other
+ * memory. In this case the check will fail;
+ * 2)cur is pointing to a new task, which is using the memory of just
+ * freed cur (and it is new dst_rq->curr). It's OK, because we've
+ * locked RCU before the new task has been even created
+ * (so delayed_put_task_struct() hasn't been called);
+ * 3)we've taken not exiting task (likely case). No need to worry.
+ */
+ if (cur != ACCESS_ONCE(dst_rq->curr))
+ cur = NULL;
+
+ if (is_idle_task(cur))
cur = NULL;

/*


>  Or, perhaps, we need to change the rules to ensure that any "task_struct *"
>  pointer is rcu-safe. Perhaps we have more similar problems... I'd like to
>  avoid this if possible.

RT tree has:

https://git.kernel.org/cgit/linux/kernel/git/paulg/3.10-rt-patches.git/tree/patches/sched-delay-put-task.patch

But other problem was decided there..

>  Hmm. I'll try to think more.
>
>  Thanks!

Kirill
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/