[RFC 0/4] net/fib: Speed up trie rebalancing for full view

From: Dmitry Safonov
Date: Tue Mar 26 2019 - 11:30:32 EST


During moving from 3.18 stable kernel to 4.9 on the switches, rebasing
local -specific patches and stuff, it was found that BGP benchmarks for
full view have started to hit the soft lockup detector by holding rtnl
mutex for a couple of seconds on routes changes.

I've found that the hard-coded MAX_WORK doesn't limit the amount of
pending rebalancing work anymore. So that any route adding/removal may
cause massive rebalancing of the trie. And making the hit even worse,
tnodes are freed by RCU with a call to synchronise_rcu() every 512Kb.
That is way too small and even on 2-cores switches is painfully
noticable.

To address those problems, I've introduced sysctl knob to limit the
amount of rebalancing work (by default unlimited as is de facto now).
I've moved synchronise_rcu() into a new shrinker to actually release
memory in OOM.. I believe non-visible to userspace shrinker is better
than a new sysctl knob for the limit (or any hard-coded value).
Though, not sure how sane the result is.
So, I send it as RFC, having qualms that it's not ready for inclusion
as-is yet.

I've looked further into the origin of problems here and was thinking if
it make sense to do the following (rough plan):
1. Introduce a new flag RTNL_FLAG_DUMP_UNLOCKED to dump fib trie without
rtnl lock. I'm a bit out of context, so probably I miss some obvious
reasons why lock needs to be held at this point.
2. Add a new fib_lock mutex for updating a trie. I'm not really sure
that we can always release rtnl for updates, so probably there should
be a strict locking order: rtnl_lock (when needed) => fib_lock.
3. Correct current documentation, that mentions fib_lock as rwsem.
4. ??

I did some experiments on the plan above, but decided to send this RFC
to get opinions of people who understand more. Maybe, my plan is
nonsense and it's not worth invest amount of time it requires.

I've also looked into changes between v3.18...v4.9, and found the
following patches set: https://www.spinics.net/lists/netdev/msg309586.html
While it has impressive results on lookups, it seems to be the reason of
the regression on the big scale: one of the patches has 5 times penalty
to remove a route on a big scale, another adds 10 times penalty by
calling synchronise_rcu() more frequently (I've marked them by Fixes tag
in the patches).

[I was very lavish on Cc list, please ping me in private if you don't
want to be copied on the next version]

Cc: Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx>
Cc: Alexey Kuznetsov <kuznet@xxxxxxxxxxxxx>
Cc: David Ahern <dsahern@xxxxxxxxx>
Cc: "David S. Miller" <davem@xxxxxxxxxxxxx>
Cc: Eric Dumazet <edumazet@xxxxxxxxxx>
Cc: Hideaki YOSHIFUJI <yoshfuji@xxxxxxxxxxxxxx>
Cc: Ido Schimmel <idosch@xxxxxxxxxxxx>
Cc: netdev@xxxxxxxxxxxxxxx

Thanks,
Dmitry

Dmitry Safonov (4):
net/ipv4/fib: Remove run-time check in tnode_alloc()
net/fib: Provide fib_balance_budget sysctl
net/fib: Check budget before should_{inflate,halve}()
net/ipv4/fib: Don't synchronise_rcu() every 512Kb

Documentation/networking/ip-sysctl.txt | 6 ++
include/net/ip.h | 1 +
net/ipv4/fib_trie.c | 121 +++++++++++++++----------
net/ipv4/sysctl_net_ipv4.c | 7 ++
4 files changed, 87 insertions(+), 48 deletions(-)

--
2.21.0