hard-coded limit on unresolved multicast route cache in ipv4/ipmr.c causes slow, unreliable creation of multicast routes on busy networks

From: Phil Karn
Date: Sat Jul 21 2018 - 21:31:41 EST


I'm running pimd (protocol independent multicast routing) and found that
on busy networks with lots of unresolved multicast routing entries, the
creation of new multicast group routes can be extremely slow and
unreliable, especially when the group in question has little traffic.

A Google search turned up the following discussion of the problem,
which began in the fall of 2015:

https://github.com/troglobit/pimd/issues/58

Note especially the comment by kopren on Sep 13, 2016.

The writer traced the problem to function ipmr_cache_unresolved() in
file net/ipv4/ipmr.c, in the following block of code:

	/* Create a new entry if allowable */
	if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
	    (c = ipmr_cache_alloc_unres()) == NULL) {
		spin_unlock_bh(&mfc_unres_lock);

		kfree_skb(skb);
		return -ENOBUFS;
	}

This imposes a hard-wired limit of 10 multicast route entries with
unresolved source addresses and upstream interfaces. My problem system
sits on a busy subnet at UC San Diego, and when I run the command 'ip
mroute show' there are almost always exactly 10 unresolved multicast
routes. The authors reported that removing this limit solved their
problem, but I still see the test in the just-released kernel version
4.17.8.
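
To make the effect of that test concrete, here is a minimal user-space
sketch of the admission check. It is not kernel code: struct
unres_queue, unres_admit() and count_dropped() are hypothetical names
of my own, and the limit is a parameter here only so the hard-coded 10
can be compared with a larger value.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical user-space model of the unresolved-route queue. */
struct unres_queue {
	int len;	/* models mrt->cache_resolve_queue_len */
	int limit;	/* the kernel hard-codes 10; a parameter here */
};

/* Mirrors the quoted test: refuse a new entry once the queue is full. */
int unres_admit(struct unres_queue *q)
{
	if (q->len >= q->limit)
		return -ENOBUFS;	/* the kernel also frees the skb here */
	q->len++;
	return 0;
}

/*
 * Count how many of 'arrivals' distinct unresolved groups are refused
 * when none of them resolves or expires in the meantime (an assumption
 * of this model).
 */
int count_dropped(int limit, int arrivals)
{
	struct unres_queue q = { .len = 0, .limit = limit };
	int dropped = 0;

	assert(limit >= 0 && arrivals >= 0);
	for (int i = 0; i < arrivals; i++)
		if (unres_admit(&q) == -ENOBUFS)
			dropped++;
	return dropped;
}
```

In this model, 25 unresolved groups against a limit of 10 leave 15
refused (count_dropped(10, 25) returns 15), while a limit of 64 refuses
none of them.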

I don't have this problem on my home network or on a small network at a
local high school, both networks having fewer active multicast groups.
The problem only shows up at UCSD.

The problem is most acute with a multicast group that generates only one
packet (decoded ham radio location tracking packets) every 10 seconds or
so. The multicast route *never* resolves and traffic never gets through.

The problem is also severe with a multicast group generating
intermittent bursts of traffic with seconds of idle time between bursts
(audio PCM from a software defined FM receiver with a squelch that stops
the traffic when there's no signal). However, when the receiver is tuned
to NOAA Weather Radio (which generates a continuous stream of traffic)
multicast routing generally works.

The *only* difference between these three cases was the intensity of
traffic in the multicast groups.
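
One way to see why traffic intensity would matter is a rough model, not
a measurement: assume the queue is full, and that whenever a slot frees
up it goes to whichever group's packet arrives next. With randomly
timed arrivals, a group's chance of getting the slot is then about its
share of the total packet rate. The function name slot_share() and the
rates in the note that follows are hypothetical.

```c
#include <assert.h>

/*
 * Approximate chance that group 'idx' claims a freed slot, assuming the
 * slot goes to the next packet to arrive and that arrivals are random
 * with the given average rates (packets per second).
 */
double slot_share(const double *rates, int n, int idx)
{
	double total = 0.0;

	assert(n > 0 && idx >= 0 && idx < n);
	for (int i = 0; i < n; i++) {
		assert(rates[i] >= 0.0);
		total += rates[i];
	}
	return total > 0.0 ? rates[idx] / total : 0.0;
}
```

For example, a group sending one packet every 10 seconds (0.1 pkt/s)
that competes with two groups at an assumed 50 pkt/s each would win a
freed slot only about one time in a thousand under this model.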

Does this hard-coded limit serve any purpose? Can it be safely increased
to a much larger value, or better yet, removed altogether? If it can't
be removed, can it at least be made configurable through a /proc entry?

Thanks,

Phil Karn, KA9Q