Re: [regression] cpuset,mm: update tasks' mems_allowed in time(58568d2)

From: David Rientjes
Date: Fri Feb 19 2010 - 05:07:05 EST


On Fri, 19 Feb 2010, Nick Piggin wrote:

> > guarantee_online_cpus() truly does require callback_mutex, the
> > cgroup_scan_tasks() iterator locking can protect changes in the cgroup
> > hierarchy but it doesn't protect a store to cs->cpus_allowed or for
> > hotplug.
>
> Right, but the callback_mutex was being removed by this patch.
>

I was making the case for it to be readded :)

> > top_cpuset.cpus_allowed will always need to track cpu_active_map since
> > those are the schedulable cpus, it looks like that's initialized for SMP
> > and the cpu hotplug notifier does that correctly.
> >
> > I'm not sure what the logic is doing in cpuset_attach() where cs is the
> > cpuset to attach to:
> >
> > if (cs == &top_cpuset) {
> > cpumask_copy(cpus_attach, cpu_possible_mask);
> > to = node_possible_map;
> > }
> >
> > cpus_attach is properly protected by cgroup_lock, but using
> > node_possible_map here will set task->mems_allowed to node_possible_map
> > when the cpuset does not have memory_migrate enabled. This is the source
> > of your oops, I think.
>
> Could be, yes.
>

I'd be interested to see if you still get the same oops with the patch at
the end of this email that fixes this logic.

> But it doesn't matter if stores are done under lock, if the loads are
> not. masks can be multiple words, so there isn't any ordering between
> reading half and old mask and half a new one that results in an invalid
> state. AFAIKS.
>

It doesn't matter for MAX_NUMNODES > BITS_PER_LONG because
task->mems_alllowed only gets updated via cpuset_change_task_nodemask()
where the added nodes are set and then the removed nodes are cleared. The
side effect of this lockless access to task->mems_allowed means we may
have a small race between

nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);

and

tsk->mems_allowed = *newmems;

but the penalty is that we get an allocation on a removed node, which
isn't a big deal, especially since it was previously allowed.

> Well it is exported as cpuset_lock(). And the scheduler has it covered
> in all cases by the looks except for select_task_rq, which is called
> by wakeup code. We should stick WARN_ONs through the cpuset code for
> mutexes not held when they should be.
>

A lot of the reliance on callback_mutex was removed because the strict
hierarchy walking and task membership is now guarded by cgroup_mutex
instead. Some of the comments in kernel/cpuset.c weren't updated so they
still say callback_mutex when in reality they mean cgroup_mutex.
---
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1319,7 +1319,7 @@ static int fmeter_getrate(struct fmeter *fmp)
return val;
}

-/* Protected by cgroup_lock */
+/* Protected by cgroup_mutex held on cpuset_attach() */
static cpumask_var_t cpus_attach;

/* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
@@ -1390,8 +1390,12 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
struct cpuset *oldcs = cgroup_cs(oldcont);

if (cs == &top_cpuset) {
- cpumask_copy(cpus_attach, cpu_possible_mask);
- to = node_possible_map;
+ /*
+ * top_cpuset.cpus_allowed and top_cpuset.mems_allowed are
+ * protected by cgroup_lock which is already held here.
+ */
+ cpumask_copy(cpus_attach, top_cpuset.cpus_allowed);
+ to = top_cpuset.mems_allowed;
} else {
guarantee_online_cpus(cs, cpus_attach);
guarantee_online_mems(cs, &to);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/