Re: [PATCH 2.6.12-rc4] cpuset exit NULL dereference fix

From: Paul Jackson
Date: Thu May 26 2005 - 15:29:04 EST


Simon wrote:
> I feel like this 'scaling problem' still exists even with:

Can you back this feeling with numbers, with real measurements?

I'll bet it's not actually a problem for any size system or load you
care about.

I have not yet seen any measurements showing that taking a briefly
held semaphore in the exit path is a scaling problem.

Until I see such a problem, I am inclined to take only the simplest
and most basic precautions to avoid it.


> Maybe adding more per-cpuset data such as a per-cpuset removal_sem

Robin suggests more possibilities along this same line, in his reply.

I don't see where any further locks are needed.  And for some of my
customers' configurations, running with the majority of the system in
a single cpuset, per-cpuset locks aren't a big gain over a system-wide
semaphore anyway.

We have two issues here:

1) After the exiting task releases its cpuset (decrements the cpuset
   reference count), we may need the cpuset's 'notify_on_release' flag
   and its path in the cpuset filesystem, if the count went to zero,
   in order to perform notify_on_release processing.

   For the benefit of those who haven't memorized the kernel/cpuset.c
   code: notify_on_release processing means calling a usermodehelper
   to invoke the /sbin/cpuset_release_agent command, with the path of
   the just-released cpuset as a command line argument.  This /sbin
   script usually just does "rmdir /dev/cpuset/$1", removing the
   abandoned cpuset.  (There's a code sketch of this below, with the
   proposed fix for issue (1).)

2) We have two distinct ways of marking a cpuset 'in use'.  Tasks
   that point to a cpuset increment its reference 'count'.  Child
   cpusets of that cpuset put themselves on that cpuset's 'children'
   list.  (Both markers appear in the struct sketched below.)

   The problem is that, with the mechanisms now in the code, we cannot
   examine both markers atomically, except by holding the global
   cpuset_sem semaphore.

So far as we know, both of these issues cause us grief only in the
exit code (cpuset_exit).  Everywhere else that it might matter, we
are far enough off the beaten code path that grabbing the global
cpuset_sem is no problem.

Perhaps we could solve issue (1) by having the cpuset_exit code
bundle up, in local variables, whatever information it might need
from the cpuset for possible notify_on_release processing, before
decrementing the count.  Then, if the atomic_dec_and_test hits a zero
count, the cpuset_exit code can perform the notify_on_release
processing without further referencing the cpuset.
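
To make (1) concrete, here is the shape of what I have in mind.  This
is a sketch, not the patch: cpuset_path_alloc() is a made-up helper
that would kmalloc a buffer and fill in the cpuset's filesystem path,
and error handling is omitted.

    void cpuset_exit(struct task_struct *tsk)
    {
            struct cpuset *cs;
            char *pathbuf = NULL;

            task_lock(tsk);
            cs = tsk->cpuset;
            tsk->cpuset = NULL;
            task_unlock(tsk);

            /* Bundle up what release processing needs *before* the dec. */
            if (notify_on_release(cs))
                    pathbuf = cpuset_path_alloc(cs);   /* hypothetical helper */

            if (atomic_dec_and_test(&cs->count) && pathbuf) {
                    char *argv[3] = { "/sbin/cpuset_release_agent",
                                      pathbuf, NULL };
                    char *envp[3] = { "HOME=/",
                                      "PATH=/sbin:/bin:/usr/sbin:/usr/bin",
                                      NULL };

                    /* The cpuset itself is never touched past this point. */
                    call_usermodehelper(argv[0], argv, envp, 0);
            }
            kfree(pathbuf);
    }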

Perhaps we could solve issue (2) by having child cpusets _also_ bump
the atomic reference count.  Then when the count hits zero, it will
mean both that there are no more referencing tasks and that there are
no child cpusets.  With this, we can atomically identify when a
cpuset is released, without holding the global cpuset_sem semaphore.
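
Roughly like this; the function names here are invented for
illustration, and the real hooks would be the mkdir/rmdir code in
kernel/cpuset.c, which already runs under cpuset_sem, keeping the
list manipulation safe:

    /* On mkdir: a child holds a reference on its parent. */
    static void cpuset_add_child(struct cpuset *parent, struct cpuset *child)
    {
            atomic_inc(&parent->count);
            list_add(&child->sibling, &parent->children);
    }

    /* On rmdir: dropping the child's reference may release the parent. */
    static void cpuset_remove_child(struct cpuset *parent, struct cpuset *child)
    {
            list_del(&child->sibling);
            if (atomic_dec_and_test(&parent->count))
                    check_for_release(parent);   /* notify_on_release, as above */
    }

With that, the atomic_dec_and_test in cpuset_exit becomes a
sufficient release test on its own: the count can only reach zero
once the last using task and the last child cpuset are both gone.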


So ... I'm inclined to think that there is really no scaling problem
with this patch as proposed.  But if there really is one, I'd next
pursue solution steps (1) and (2) above, before attempting per-cpuset
locks.


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxxxxxxx> 1.650.933.1373, 1.925.600.0401