Strange Problems with ARP and Linux

From: Richard Mueller
Date: Tue Jan 17 2006 - 07:20:58 EST


Hi, (this was also posted to linux-net yesterday)

I have run into some strange behaviour with Linux and the ARP protocol.

1.) Kernel-Version: 2.6.11.7 plus grsec-patches

2.) Setup:

        +--------+
        | Router |
        +---+----+
            |
            |
     +------+-------+
     |              |
     |              |   Transitnet for
     |              |   Cluster/Router
+----+-----+  +-----+------+
| Primary  |  | Secondary  |
+----+-----+  +-----+------+
     |              |
     |              |   LAN
     +--------------+

Router: C2600 router from ISP

Primary: First (active) Linux router
Secondary: Second (standby) Linux router

Primary and Secondary are configured as a cluster
with the heartbeat package.

The cluster shares an IP alias in the transit net and
many IPs in the LAN segments. Each IP alias is bound
to exactly one node at any given time.

The following IPs and MACs are used in this example:

transit-net:
Router:    10.0.0.1/24  | 00:10:F3:09:10:70
Primary:   10.0.0.10/24 | 00:10:F3:09:11:71
Secondary: 10.0.0.11/24 | 00:10:F3:09:12:72
IP alias:  10.0.0.20/24 | depends on where it is bound

lan:
Primary:   10.1.0.10/24 | 00:10:F3:10:11:71
Secondary: 10.1.0.11/24 | 00:10:F3:10:12:72
IP alias:  10.1.0.20/24 | depends on where it is bound

3.) The Problem

At first everything works fine. If I fail the primary node,
the secondary takes over. The ARP entries change to the MAC
of the secondary, and everything is fine.

Now, if you ping/ssh/whatever to the shared IP alias in the LAN
from the networks behind the C2600, the trouble begins:

I. The C2600 is able to deliver the IP packet to the node because
it has a valid ARP entry.

II. The Linux machine (secondary) does not have any ARP entries
(because it was inactive for a while), so it has to initiate
ARP before it can deliver the reply IP packet.

Then IT HAPPENS:

The Linux Box asks in the transit net:

0.000000 10.1.0.20 -> Broadcast ARP Who has 10.0.0.1? Tell 10.1.0.20

Why does Linux make ARP-requests with SRC-IPs from a different subnet?
This can't be the expected behaviour... :(
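
To make the captured request concrete, here is a minimal sketch that packs the 28-byte ARP payload in the RFC 826 field order and checks the sender IP against the transit net. The function names and the constant TRANSIT_NET are made up for illustration, and the sender MAC is assumed to be the secondary's transit-net MAC from the table above.

```python
import ipaddress
import struct

# Transit net from the address table in this mail.
TRANSIT_NET = ipaddress.ip_network("10.0.0.0/24")


def build_arp_request(sender_mac: str, sender_ip: str, target_ip: str) -> bytes:
    """Pack the 28-byte ARP payload of a who-has request (Ethernet/IPv4)."""
    mac = bytes.fromhex(sender_mac.replace(":", ""))
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,                                         # hardware type: Ethernet
        0x0800,                                    # protocol type: IPv4
        6,                                         # hardware address length
        4,                                         # protocol address length
        1,                                         # opcode: request
        mac,                                       # sender hardware address
        ipaddress.IPv4Address(sender_ip).packed,   # sender protocol address
        b"\x00" * 6,                               # target hardware address (unknown)
        ipaddress.IPv4Address(target_ip).packed,   # target protocol address
    )


def sender_ip_of(arp_payload: bytes) -> ipaddress.IPv4Address:
    """Extract the sender protocol address (bytes 14..17 of the payload)."""
    return ipaddress.IPv4Address(arp_payload[14:18])


# "Who has 10.0.0.1? Tell 10.1.0.20", sent on the transit interface.
request = build_arp_request("00:10:F3:09:12:72", "10.1.0.20", "10.0.0.1")
sender = sender_ip_of(request)
print(sender, "is on the transit net:", sender in TRANSIT_NET)
```

The sender protocol address is the field a receiver may use to update its own ARP table, which is why a sender IP from a foreign subnet matters here.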

BTW:
The C2600 is "smart" enough to put an entry with
"10.1.0.20 -> 00:10:F3:09:12:72"
in its ARP cache, based on this single ARP broadcast
from 10.1.0.20, and after a failback to the primary nobody can reach
10.1.0.20... :-)
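
The effect on the router can be sketched with a toy ARP cache. This is only a model under two assumptions: that the router learns the sender IP/MAC pair from any ARP request it sees, and that nothing refreshes the entry for 10.1.0.20 on the transit net after the failback. The class ArpCache is hypothetical and does not describe how IOS is actually implemented.

```python
PRIMARY_TRANSIT_MAC = "00:10:F3:09:11:71"
SECONDARY_TRANSIT_MAC = "00:10:F3:09:12:72"


class ArpCache:
    """Toy cache that learns from the sender fields of ARP requests."""

    def __init__(self):
        self.entries = {}

    def on_arp_request(self, sender_ip: str, sender_mac: str) -> None:
        # Assumption: the sender pair is cached unconditionally,
        # even if the IP is not in the subnet of the receiving interface.
        self.entries[sender_ip] = sender_mac

    def lookup(self, ip: str):
        return self.entries.get(ip)


router = ArpCache()

# Secondary is active and asks for the router with the LAN alias as sender IP.
router.on_arp_request("10.1.0.20", SECONDARY_TRANSIT_MAC)

# Failback: the primary announces the transit alias again, but in this model
# nobody ever announces 10.1.0.20 on the transit net, so that entry stays.
router.on_arp_request("10.0.0.20", PRIMARY_TRANSIT_MAC)

print("10.0.0.20 ->", router.lookup("10.0.0.20"))
print("10.1.0.20 ->", router.lookup("10.1.0.20"))
```

In this model the entry for 10.1.0.20 still points at the secondary's MAC after the failback, which matches the unreachable alias described above.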


4.) Solution: Dirty Userspace Fix
Ping the C2600 from the primary/secondary continuously.
A ping group in heartbeat does the same.
This can't be the real answer... ;-)
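
A minimal sketch of such a keepalive, assuming the iputils ping options -c (count) and -W (timeout in seconds) and an arbitrary interval. The function names are made up. The idea is only that a locally generated ping is expected to resolve 10.0.0.1 with a transit-net source address and keep that neighbour entry populated, so a reply from 10.1.0.20 does not have to trigger ARP itself.

```python
import subprocess
import time


def ping_once(target: str, run=subprocess.run) -> bool:
    """Send a single ICMP echo request; True if ping exits with status 0."""
    result = run(
        ["ping", "-c", "1", "-W", "1", target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def keep_neighbour_warm(target, rounds, interval=10.0,
                        run=subprocess.run, sleep=time.sleep) -> int:
    """Ping `target` `rounds` times, pausing `interval` seconds in between.

    Returns the number of successful pings. `run` and `sleep` are
    injectable so the loop can be exercised without touching the network.
    """
    ok = 0
    for _ in range(rounds):
        if ping_once(target, run):
            ok += 1
        sleep(interval)
    return ok


# Example call on a cluster node (not executed here):
#   keep_neighbour_warm("10.0.0.1", rounds=360, interval=10.0)
```

The interval would have to stay below the neighbour cache timeout of the node, which is not stated in this mail.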

5.) Solution: Dirty Kernel-Patch
With my skillful hands I wrote a dirty hack:
<patch>
--- arp.c Fri Jan 13 16:44:06 2006
+++ arp.c.new Fri Jan 13 16:43:52 2006
@@ -342,9 +342,9 @@
 	switch (IN_DEV_ARP_ANNOUNCE(in_dev)) {
 	default:
 	case 0:		/* By default announce any local IP */
-		if (skb && inet_addr_type(skb->nh.iph->saddr) == RTN_LOCAL)
+		/* if (skb && inet_addr_type(skb->nh.iph->saddr) == RTN_LOCAL)
 			saddr = skb->nh.iph->saddr;
-		break;
+		break; */
 	case 1:		/* Restrict announcements of saddr in same subnet */
 		if (!skb)
 			break;
</patch>
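
With the body of case 0 commented out, the switch falls through into case 1, so the hack seems to make the default behave like the existing arp_announce level 1 (settable per interface via the net.ipv4.conf.*.arp_announce sysctl). The following is a simplified userspace model of the sender-IP choice, based on the documented arp_announce levels rather than on the full kernel source. The function pick_arp_source, its parameters and the fallback rule are illustrative assumptions.

```python
from ipaddress import ip_address, ip_interface


def pick_arp_source(mode, packet_src, all_local, out_iface, target):
    """Model of the sender IP chosen for an ARP request.

    mode       -- arp_announce level (0, 1 or 2; anything else acts like 0)
    packet_src -- source IP of the packet that triggered the ARP request
    all_local  -- set of all addresses configured on the host
    out_iface  -- addresses on the outgoing interface, primary first
    target     -- IP being resolved
    """
    src = ip_address(packet_src)
    tgt = ip_address(target)
    chosen = None
    if mode == 1:
        # Reuse the packet source only if it shares a subnet with the target.
        on_link = any(tgt in i.network and src in i.network for i in out_iface)
        if src in all_local and on_link:
            chosen = src
    elif mode == 2:
        pass  # never reuse the packet source
    else:
        # Level 0 (default): any local address is acceptable.
        if src in all_local:
            chosen = src
    if chosen is None:
        # Assumed fallback: first interface address in the target's subnet.
        for i in out_iface:
            if tgt in i.network:
                return i.ip
        return out_iface[0].ip
    return chosen


# Secondary node while it holds the aliases (addresses from the table above).
TRANSIT_IFACE = [ip_interface("10.0.0.11/24"), ip_interface("10.0.0.20/24")]
ALL_LOCAL = {ip_address(a) for a in
             ("10.0.0.11", "10.0.0.20", "10.1.0.11", "10.1.0.20")}

for level in (0, 1, 2):
    src = pick_arp_source(level, "10.1.0.20", ALL_LOCAL, TRANSIT_IFACE, "10.0.0.1")
    print("arp_announce =", level, "-> tell", src)
```

In this model only level 0 produces the "Tell 10.1.0.20" request seen in the capture, while levels 1 and 2 fall back to an address on the transit net.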

6.) Solution: Clean Kernel-Patch
Can anybody improve the patch above into a clean one so that it finds
its way into the vanilla kernel?


bye
richard

--
Richard Müller
Geschäftsführer Technik

team(ix) GmbH
Powering Enterprise Linux Networks
Südwestpark 35
90449 Nürnberg

fon: +49 (911) 30999- 0
fax: +49 (911) 30999-99
mail: rm@xxxxxxxxx
web: http://www.teamix.de
vcf: http://www.teamix.de/vcf/rm.vcf
gpg: 296C 0BAF 8FC8 DCE2 99BD
5777 FA73 ECDC F9F1 8FF7
