Re: while_each_thread() under rcu_read_lock() is broken?

From: Oleg Nesterov
Date: Mon Jun 21 2010 - 13:12:12 EST


On 06/18, Paul E. McKenney wrote:
>
> On Fri, Jun 18, 2010 at 09:34:03PM +0200, Oleg Nesterov wrote:
> >
> > #define XXX(t) ({
> > struct task_struct *__prev = t;
> > t = next_thread(t);
> > t != g && t != __prev;
> > })
> >
> > #define while_each_thread(g, t) \
> > while (XXX(t))
>
> Isn't the above vulnerable to a pthread_create() immediately following
> the offending exec()? Especially if the task doing the traversal is
> preempted?

Yes, thanks!

> here are some techniques that might (or might not) help:

To simplify, let's consider the concrete example,

rcu_read_lock();

g = t = returns_the_rcu_safe_task_struct_ptr();

do {
printk("%d\n", t->pid);
} while_each_thread(g, t);

rcu_read_unlock();

Whatever we do, without tasklist/siglock this can obviously race
with fork/exit/exec. It is OK to miss a thread, or print the pid
of the already exited/released task.

But it should not loop forever (the subject), and it should not
print the same pid twice (ignoring pid reuse, of course).

And, afaics, there are no problems with rcu magic per se, next_thread()
always returns the task_struct we can safely dereference. The only
problem is that while_each_thread() assumes that sooner or later
next_thread() must reach the starting point, g.

(zap_threads() is different, it must not miss a thread with ->mm
we are going to dump, but it holds mmap_sem).

> o Check ACCESS_ONCE(p->group_leader == g),

but this assumeas that while_each_thread(g, t) always
uses g == group_leader. zap_other_threads(), for example,
starts with g == current and not necessarily the leader.

> if false, restart
> the traversal.

I am not sure we should restart (don't pid the same pid twice),
probably we should break the loop.

So, I am thinking about the first attempt

#define while_each_thread(g, t) \
while ((t = next_thread(t)) != g && pid_alive(g))

again. But this means while_each_thread() can miss more threads
than it currently can under the same conditions. Correct, but
not good.

And,

> This of course might
> require careful attention of the order of updates to ->group_leader
> and the list pointers.

Yes. At least the code above needs barriers, and I am not sure
it can be really correct without more complications.

> o Maintain an ->oldnext field that tracks the old pointer to
> the next task for one RCU grace period after a de_thread()
> operation. When the grace period expires (presumably via
> call_rcu()), the ->oldnext field is set to NULL.

Well, another field in task_struct...

> o Do the de_thread() incrementally. So if the list is tasks A,
> B, and C, in that order, and if we are de-thread()ing B,
> then make A's pointer refer to C,

This breaks while_each_thread() under tasklist/siglock. It must
see all unhashed tasks.

Oh. I'll try to think more, but I also hope someone else can suggest
a simple solution.

> > Please tell me I am wrong!
>
> It would take a braver man than me to say that Oleg Nesterov is wrong!

Hmm. Given that you proved I was wrong in the first lines of your
email, I do not know what to think ;)

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/