Re: [PATCH] netfilter: use per-cpu spinlock and RCU (v5)

From: Paul E. McKenney
Date: Thu Apr 16 2009 - 10:48:16 EST

Next message: Steven Rostedt: "Re: [PATCH 2/2] utrace-based ftrace "process" engine, v3"
Previous message: Alexander Beregalov: "[PATCH] parisc/superio: fix build breakage"
In reply to: Patrick McHardy: "Re: [PATCH] netfilter: use per-cpu spinlock and RCU (v5)"
Next in thread: Eric Dumazet: "[PATCH] netfilter: use per-cpu recursive spinlock (v6)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Apr 16, 2009 at 03:53:15PM +0200, Patrick McHardy wrote:
> Eric Dumazet wrote:
>> Stephen Hemminger a écrit :
>>> This is an alternative version of ip/ip6/arp tables locking using
>>> per-cpu locks. This avoids the overhead of synchronize_net() during
>>> update but still removes the expensive rwlock in earlier versions.
>>>
>>> The idea for this came from an earlier version done by Eric Dumazet.
>>> Locking is done per-cpu, the fast path locks on the current cpu
>>> and updates counters. The slow case involves acquiring the locks on
>>> all cpu's. This version uses RCU for the table->base reference
>>> but per-cpu-lock for counters.
>
>> This version is a regression over 2.6.2[0-9], because of two points
>> 1) Much more atomic ops :
>> Because of additional
>>> + spin_lock(&__get_cpu_var(ip_tables_lock));
>>> ADD_COUNTER(e->counters, ntohs(ip->tot_len), 1);
>>> + spin_unlock(&__get_cpu_var(ip_tables_lock));
>> added on each counter updates.
>> On many setups, each packet coming in or out of the machine has
>> to update between 2 to 20 rule counters. So to avoid *one* atomic ops
>> of read_unlock(), this v4 version adds 2 to 20 atomic ops...
>
> I agree, this seems to be a step backwards.
>
>> I still not see the problem between the previous version (2.6.2[0-8]) that
>> had a central
>> rwlock, that hurted performance on SMP because of cache line ping pong,
>> and the solution
>> having one rwlock per cpu.
>> We wanted to reduce the cache line ping pong first. This *is* the hurting
>> point,
>> by an order of magnitude.
>
> Dave doesn't seem to like the rwlock approach.

Well, we don't really need an rwlock, especially given that we really
don't want two "readers" incrementing the same counter concurrently.

A safer approach would be to maintain a flag in the task structure
tracking which (if any) of the per-CPU locks you hold. Also maintain
a recursion-depth counter. If the flag says you don't already hold
the lock, set it and acquire the lock. Either way, increment the
recursion-depth counter:

if (current->netfilter_lock_held != cur_cpu) {
BUG_ON(current->netfilter_lock_held != CPU_NONE);
spin_lock(per_cpu(..., cur_cpu));
current->netfilter_lock_held = cur_cpu;
}
current->netfilter_lock_nesting++;

And reverse the process to unlock:

if (--current->netfilter_lock_nesting == 0) {
spin_unlock(per_cpu(..., cur_cpu));
current->netfilter_lock_held = CPU_NONE;
}

> I don't see a way to
> do anything asynchronously like call_rcu() to improve this, so to
> bring up one of Stephens suggestions again:
>
>>> > * use on_each_cpu() somehow to do grace periood?
>
> We could use this to replace the counters, presuming it is
> indeed faster than waiting for a RCU grace period.

One way to accomplish this is to take Mathieu Desnoyers's user-level
RCU implementation and drop it into the kernel, replacing the POSIX
signal handling with on_each_cpu(), smp_call_function(), or whatever.

>> 2) Second problem : Potential OOM
>> About freeing old rules with call_rcu() and/or schedule_work(), this is
>> going
>> to OOM pretty fast on small appliances with basic firewall setups loading
>> rules one by one, as done by original topic reporter.
>> We had reports from guys using linux with 4MB of available ram (French
>> provider free.fr on
>> their applicance box), and we had to use SLAB_DESTROY_BY_RCU thing on
>> conntrack
>> to avoid OOM for their setups. We dont want to use call_rcu() and queue
>> 100 or 200 vfree().
>
> Agreed.

This is not a real problem be handled by doing a synchronize_rcu() every
so often as noted in a prior email elsewhere in this thread:

call_rcu(...);
if (++count > 50) {
synchronize_rcu();
count = 0;
}

This choice of constant would reduce the grace-period pain to 2% of the
full effect, which should be acceptable, at least if I remember the
original problem report of 0.2 seconds growing to 6.0 seconds -- this
would give you:

(6.0-0.2)/50+0.2 = .316

I would argue that 100 milliseconds is an OK penalty for a deprecated
feature. But of course the per-CPU lock approach should avoid even that
penalty, albeit at some per-packet penalty. However, my guess is that
this per-packet penalty is not measureable at the system level.

And if the penalty of a single uncontended lock -is- measureable, I will
be very quick to offer my congratulations, at least once I get my jaw
off my keyboard. ;-)

Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Steven Rostedt: "Re: [PATCH 2/2] utrace-based ftrace "process" engine, v3"
Previous message: Alexander Beregalov: "[PATCH] parisc/superio: fix build breakage"
In reply to: Patrick McHardy: "Re: [PATCH] netfilter: use per-cpu spinlock and RCU (v5)"
Next in thread: Eric Dumazet: "[PATCH] netfilter: use per-cpu recursive spinlock (v6)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]