Re: P-MTU discovery

From: Theodore Y. Ts'o (tytso@MIT.EDU)
Date: Fri Apr 21 2000 - 14:35:58 EST


   Date: Fri, 21 Apr 2000 11:55:04 -0400 (EDT)
   From: jamal <hadi@cyberus.ca>

   I don't know whether telcos are already doing this, but we certainly are in
   Linux. I point the finger at Marc Boucher. He did it!

Ah, I hadn't realized someone had done it already. Is it in ipchains?

   The reason is very simple: NAT, that good old friend of IPSEC.
   When you have lotsa boxes that you are masquerading for, it is hell to go
   around and start changing their MTU values or doing any sort of per-box
   changes.

Actually, the hack is useful even if you're not doing NAT. It helps any
time you have a gateway box which is doing some kind of tunnelling
(PPPOE, IP-IP, or something else) and lots of client machines behind the
tunnel endpoint, where making lots of per-box changes is a pain.
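
For concreteness, here is a minimal Python sketch of what an MSS adjuster
on the gateway does, assuming the hack in question is rewriting the MSS
option in forwarded TCP SYN packets. The function name clamp_mss and the
clamp value 1412 are made up for illustration; this is not the actual
Linux code, and a real implementation would also have to fix up the TCP
checksum after changing the option.

```python
import struct

TCP_OPT_EOL = 0  # end of option list
TCP_OPT_NOP = 1  # one-byte padding option
TCP_OPT_MSS = 2  # maximum segment size, length 4


def clamp_mss(options: bytes, max_mss: int) -> bytes:
    """Return a copy of a TCP options field with any MSS option lowered
    to at most max_mss. Hypothetical helper, for illustration only."""
    out = bytearray(options)
    i = 0
    while i < len(out):
        kind = out[i]
        if kind == TCP_OPT_EOL:
            break
        if kind == TCP_OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(out):
            break  # truncated option, leave the rest untouched
        length = out[i + 1]
        if length < 2 or i + length > len(out):
            break  # malformed option, leave the rest untouched
        if kind == TCP_OPT_MSS and length == 4:
            (mss,) = struct.unpack_from("!H", out, i + 2)
            if mss > max_mss:
                # Only ever lower the advertised MSS, never raise it.
                struct.pack_into("!H", out, i + 2, max_mss)
        i += length
    return bytes(out)


# Options from a typical SYN: MSS 1460, NOP, NOP, SACK-permitted.
syn_options = bytes([2, 4, 0x05, 0xB4, 1, 1, 4, 2])
clamped = clamp_mss(syn_options, 1412)
print(struct.unpack_from("!H", clamped, 2)[0])  # 1412
```

The clients never see a change to their own configuration; the gateway
just makes both ends of each TCP connection agree on smaller segments.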

If you're using dhcp, something you can do to avoid having to change all
of the boxes one at a time is to set the interface-mtu using dhcp to
1400 or 1450. The disadvantage of doing this is that *all* packets get
sent with the restricted MTU, not just ones going out through the
tunnel/gateway. (You'd really like to be able to set a per-route MSS,
but dhcp doesn't appear to have a way of doing that right now.)
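
To put rough numbers on this, the following Python sketch works out the
TCP MSS for a few cases. It assumes IPv4 and TCP headers without options
(20 bytes each), the 8 bytes of PPPoE encapsulation (6-byte PPPoE header
plus 2-byte PPP protocol ID), and a 20-byte outer header for IP-IP. The
helper name tcp_mss is made up for illustration.

```python
ETHERNET_MTU = 1500
PPPOE_OVERHEAD = 8   # 6-byte PPPoE header + 2-byte PPP protocol ID
IPIP_OVERHEAD = 20   # outer IPv4 header, no options
IPV4_HEADER = 20     # no IP options
TCP_HEADER = 20      # no TCP options


def tcp_mss(link_mtu: int, encap_overhead: int = 0) -> int:
    """Largest TCP payload that fits in one packet on this link."""
    return link_mtu - encap_overhead - IPV4_HEADER - TCP_HEADER


print(tcp_mss(ETHERNET_MTU))                  # 1460, plain Ethernet
print(tcp_mss(ETHERNET_MTU, PPPOE_OVERHEAD))  # 1452, through PPPoE
print(tcp_mss(ETHERNET_MTU, IPIP_OVERHEAD))   # 1440, through IP-IP
print(tcp_mss(1400))                          # 1360, interface-mtu 1400
```

So an interface-mtu of 1400 or 1450 leaves headroom for either kind of
tunnel, at the cost of smaller packets on purely local traffic too.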

   Disabling PMTU at the masquerading box also doesn't help because
   PPPOE adds an extra shim header to the packet. It will break IPSEC in
   most cases (maybe not in the case where your masquerading box is also your
   IPSEC gateway).

Right; that's the problem. PPPOE, because it adds a shim header,
constricts the link MTU, and so you need to do PMTU discovery at the
endpoints. And in either case, doing PMTU doesn't help if you have
something in the path which is filtering the ICMP messages.
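
A toy Python model of the failure mode, as a sketch only: the sender
sets DF, a router with a smaller MTU drops the packet and reports its
MTU in an ICMP "fragmentation needed" message, and the sender shrinks
its packets. The function discover_pmtu, its parameters, and the retry
limit are all invented for illustration; real stacks are considerably
more involved.

```python
def discover_pmtu(hop_mtus, first_size, icmp_filtered=False, max_tries=10):
    """Simulate a sender probing a path with the DF bit set.

    hop_mtus is the MTU of each link along the path, in order.
    Returns the packet size that finally gets through, or None if the
    sender never finds one (a PMTU black hole).
    """
    size = first_size
    for _ in range(max_tries):
        # First link along the path that is too small for this packet.
        bottleneck = next((mtu for mtu in hop_mtus if mtu < size), None)
        if bottleneck is None:
            return size  # packet fits on every link
        # The router in front of that link drops the packet and sends
        # an ICMP "fragmentation needed" carrying the link's MTU.
        if icmp_filtered:
            # The ICMP never reaches the sender, so it just retransmits
            # the same oversized packet until it gives up.
            continue
        size = bottleneck
    return None


path = [1500, 1492, 1500]  # a PPPoE link in the middle of the path
print(discover_pmtu(path, 1500))                      # 1492
print(discover_pmtu(path, 1500, icmp_filtered=True))  # None
```

With the ICMP filtered, nothing the sender does on its own converges,
which is why a hack in the middle of the network ends up looking
attractive.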

   From a philosophical angle:
   there is no panacea for these kinds of problems. I wonder how long you've
   been chasing them. You will continuously chase people to try and fix things
   for IPSEC's sake ;-> I wonder how you plan to deal with all those "content
   switching" startups (since that is the greatest thing since sliced bread
   these days). Is the end2end argument really a dead horse? (I am ducking
   ahead of time). Maybe what the IETF needs is to take all chairs into some
   end2end non-breakage indoctrination and give them a qualifying test first.

Here's the problem. End2end is a great design principle, but it
fundamentally assumes that the intelligence is at the endpoints, and the
middle of the network isn't supposed to do anything special/magical.
But as the internet gets bigger and bigger, trying to change all of the
endpoints to add security, or to handle paths with long latencies
efficiently, gets harder and harder. And so, it gets easier to make
changes in the middle of the network. And most of the (to use Rusty's
phrase) "packet fucking" techniques come from this dilemma: NAT's
(easier than IPV6), firewalls (easier than doing real end-point
security), tcp ack spoofing (easier than upgrading Windows TCP stacks to
make them work correctly over satellite links), etc.

One could argue that by violating the IP architecture, they're engaging
in hill-climbing optimizations that in the long run will cause someone a
lot of pain. Some things simply won't work if you play such games, but
as long as you acknowledge that fact, use them in good health.

So I've used NAT's before, even though I think that fundamentally
they're evil, because it solved the limited problem I needed to solve at
the time. But I didn't consider them first class objects, but treated
them rather as kludges. So if things broke because of the NAT, I knew
it was coming to me, and I would deal. One of the ways I dealt was to
get myself a /27 at home, but I realize not everyone can get that.

The problem is that more and more users are using things like NAT's and
MSS adjusters, etc., and they don't understand that they're kludges. So
when other protocols start breaking, they blame those other protocols
instead of correctly placing the blame where it belongs.

   Having said that, there could be an alternative solution in Linux. The
   PPPOE code could be made, after dropping the packet, to generate ICMP "too
   big" messages back to the masqueraded boxes instead (when packet-size >
   PMTU - shim_header). Hopefully, the win* boxes know what to do with these
   messages. And this will also work for UDP. Marc?

That doesn't help. We're doing this today already; it's required by the
RFC's, after all. The problem is that the sender of the big packet has
to receive the ICMP, and if there's something filtering the ICMP
message, you're stuck.

                                                        - Ted



