Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus

From: Eric W. Biederman
Date: Wed Jun 11 2014 - 19:13:34 EST


"Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> writes:

> On Wed, Jun 11, 2014 at 01:46:08PM -0700, Eric W. Biederman wrote:
>> On the chance that it is dropping the old nsproxy, which calls
>> synchronize_rcu in switch_task_namespaces, that is causing you problems,
>> I have attached a patch that changes from rcu_read_lock to task_lock
>> for code that calls task_nsproxy from a different task. The code should
>> be safe and it should be an unquestionable performance improvement, but
>> I have only compile tested it.
>>
>> If you can try the patch it will tell us whether the problem is the rcu
>> access in switch_task_namespaces (the only one I am aware of in network
>> namespace creation) or whether the problem rcu case is somewhere else.
>>
>> If nothing else, knowing which rcu accesses are causing the slowdown
>> seems important at the end of the day.
>>
>> Eric
>>
>
> If this is the culprit, another approach would be to use workqueues from
> RCU callbacks. The following (untested, probably does not even build)
> patch illustrates one such approach.

For reference, the only reason we are using rcu_read_lock today for
nsproxy is an old lock ordering problem that does not exist anymore.

I can say that in some workloads setns is a bit heavy today because of
the synchronize_rcu, and setns is more important than I had previously
thought because pthreads break the classic unix ability to do things in
your process after fork() (sigh).

Today daemonize is gone, and notifying the parent process with a signal
relies on task_active_pid_ns, which does not use nsproxy. So the old
lock ordering problem/race is gone.
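
To make the locking idea concrete, here is a rough userspace analogy of
protecting the nsproxy pointer with a per-task lock instead of rcu. It is
only a sketch and not the attached patch: struct task_sketch, struct
nsproxy_sketch and every function name below are hypothetical stand-ins,
and a pthread mutex plays the role of task_lock. The point it illustrates
is that the writer can drop the old proxy immediately after swapping the
pointer, with no grace period to wait for, because a reader can only pin
the proxy while holding the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; these are not the kernel's types. */
struct nsproxy_sketch {
	atomic_int refcount;
	int net_ns_id;			/* placeholder for the real namespaces */
};

struct task_sketch {
	pthread_mutex_t lock;		/* plays the role of task_lock() */
	struct nsproxy_sketch *nsproxy;	/* NULL once the task is almost dead */
};

static struct nsproxy_sketch *nsproxy_new(int net_ns_id)
{
	struct nsproxy_sketch *ns = malloc(sizeof(*ns));

	assert(ns != NULL);
	atomic_init(&ns->refcount, 1);
	ns->net_ns_id = net_ns_id;
	return ns;
}

/* Drop one reference and free the proxy when the last one goes away. */
static void nsproxy_put(struct nsproxy_sketch *ns)
{
	if (ns && atomic_fetch_sub(&ns->refcount, 1) == 1)
		free(ns);
}

/* Reader on a different task: pin the proxy while holding the lock. */
static struct nsproxy_sketch *task_get_nsproxy(struct task_sketch *tsk)
{
	struct nsproxy_sketch *ns;

	pthread_mutex_lock(&tsk->lock);
	ns = tsk->nsproxy;
	if (ns)
		atomic_fetch_add(&ns->refcount, 1);
	pthread_mutex_unlock(&tsk->lock);
	return ns;
}

/* Writer: swap under the lock, then drop the old reference right away. */
static void task_switch_nsproxy(struct task_sketch *tsk,
				struct nsproxy_sketch *new_ns)
{
	struct nsproxy_sketch *old;

	pthread_mutex_lock(&tsk->lock);
	old = tsk->nsproxy;
	tsk->nsproxy = new_ns;
	pthread_mutex_unlock(&tsk->lock);
	nsproxy_put(old);		/* no grace period wait needed here */
}

/* Small walk-through: a pinned proxy stays valid across a switch. */
static int demo_switch(void)
{
	struct task_sketch tsk;
	struct nsproxy_sketch *pinned, *cur;
	int old_id, new_id;

	pthread_mutex_init(&tsk.lock, NULL);
	tsk.nsproxy = nsproxy_new(1);

	pinned = task_get_nsproxy(&tsk);	/* reader holds a reference */
	task_switch_nsproxy(&tsk, nsproxy_new(2));
	old_id = pinned->net_ns_id;		/* still safe to look at */
	nsproxy_put(pinned);			/* last reference, frees it */

	cur = task_get_nsproxy(&tsk);
	new_id = cur->net_ns_id;
	nsproxy_put(cur);

	task_switch_nsproxy(&tsk, NULL);	/* the task is going away */
	pthread_mutex_destroy(&tsk.lock);
	return old_id * 10 + new_id;
}
```

The trade-off in this sketch is that readers now take a lock where the
rcu version took none, in exchange for a switch path that never blocks
waiting for readers.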

Here is the description of what was happening when the code switched
from task_lock to rcu_read_lock to protect nsproxy:

commit cf7b708c8d1d7a27736771bcf4c457b332b0f818
Author: Pavel Emelyanov <xemul@xxxxxxxxxx>
Date: Thu Oct 18 23:39:54 2007 -0700

Make access to task's nsproxy lighter

When someone wants to deal with some other task's namespaces it has to lock
the task and then to get the desired namespace if the one exists. This is
slow on read-only paths and may be impossible in some cases.

E.g. Oleg recently noticed a race between unshare() and the (sent for
review in cgroups) pid namespaces - when the task notifies the parent it
has to know the parent's namespace, but taking the task_lock() is
impossible there - the code is under write locked tasklist lock.

On the other hand switching the namespace on task (daemonize) and releasing
the namespace (after the last task exit) is rather rare operation and we
can sacrifice its speed to solve the issues above.

The access to other task namespaces is proposed to be performed
like this:

    rcu_read_lock();
    nsproxy = task_nsproxy(tsk);
    if (nsproxy != NULL) {
            /*
             * work with the namespaces here
             * e.g. get the reference on one of them
             */
    } /*
       * NULL task_nsproxy() means that this task is
       * almost dead (zombie)
       */
    rcu_read_unlock();

This patch has passed the review by Eric and Oleg :) and,
of course, tested.

[clg@xxxxxxxxxx: fix unshare()]
[ebiederm@xxxxxxxxxxxx: Update get_net_ns_by_pid]
Signed-off-by: Pavel Emelyanov <xemul@xxxxxxxxxx>
Signed-off-by: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
Cc: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
Cc: Serge Hallyn <serue@xxxxxxxxxx>
Signed-off-by: Cedric Le Goater <clg@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>

Eric