Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator

From: Jeff Garzik
Date: Fri Aug 08 2008 - 18:16:17 EST

Next message: Arnd Bergmann: "Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure"
Previous message: Yinghai Lu: "Re: [PATCH 00/42] dyn_array/nr_irqs/sparse_irq support v5"
In reply to: Steve Wise: "Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator"
Next in thread: Jeff Garzik: "Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Steve Wise wrote:

Hi Jeff,

Mike Christie will not merge this code until he has an explicit acknowledgement from netdev.

As you mentioned, the port stealing approach we've taken has its issues.
We consequently analyzed your suggestion to use a different IP/MAC address for iSCSI and it raises other tough issues (separate ARP and DHCP management, unavailability of common networking tools).
On these grounds, we believe our current approach is the most tolerable.
If the stack provided a TCP port allocation service, we'd be glad to use it to address the current concerns.
The cxgb3i driver is up and running here; its merge is pending our decision.

Cheers,
Divy

Hey Dave/Jeff,

I think we need some guidance here on how to proceed. Is the approach currently being reviewed ACKable? Or is it DOA? If it's DOA, then what approach do you recommend? I believe Jeff's opinion is a separate ipaddr. But Dave, what do you think? Let's get some agreement on a high level design here.
Possible solutions seen to date include:

1) reserving a socket to allocate the port. This has been NAK'd in the past and I assume is still a no go.

2) creating a 4-tuple allocation service so the host stack, the rdma stack, and the iscsi stack can share the same TCP 4-tuple space. This also has been NAK'd in the past and I assume is still a no go.

3) the iscsi device allocates its own local ephemeral ports (port stealing) and uses the host's ip address for the iscsi offload device. This is the current proposal and you can review the thread for the pros and cons. IMO it is the least objectionable (and I think we really should be doing #2).

4) the iscsi device will manage its own ip address thus ensuring 4-tuple uniqueness.
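
To make options 2 and 4 concrete, here is a minimal userland sketch of what "sharing the 4-tuple space" means. It is a toy model only, not kernel code and not anything from the patch: the names FourTuple and FourTupleRegistry, the owner strings, and the addresses used with them are hypothetical.

```python
from typing import Dict, NamedTuple, Optional


class FourTuple(NamedTuple):
    """A TCP connection identity: (local ip, local port, remote ip, remote port)."""
    local_ip: str
    local_port: int
    remote_ip: str
    remote_port: int


class FourTupleRegistry:
    """Toy shared registry: each 4-tuple has at most one owner at a time."""

    def __init__(self) -> None:
        self._owners: Dict[FourTuple, str] = {}

    def claim(self, owner: str, conn: FourTuple) -> bool:
        # setdefault records the first claimant and returns the current owner,
        # so a second stack asking for the same 4-tuple is refused.
        return self._owners.setdefault(conn, owner) == owner

    def owner_of(self, conn: FourTuple) -> Optional[str]:
        return self._owners.get(conn)

    def release(self, owner: str, conn: FourTuple) -> bool:
        # Only the recorded owner may release a 4-tuple.
        if self._owners.get(conn) == owner:
            del self._owners[conn]
            return True
        return False
```

With one registry shared by every stack (option 2), a claim by the host stack for a 4-tuple already held by the iscsi stack returns False, so the collision is caught up front. With a separate ip address for the offload device (option 4), the local_ip field differs, so the two stacks can never produce the same 4-tuple and no shared registry is needed.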

Conceptually, it is a nasty business for the OS kernel to be forced to co-manage an IP address in conjunction with a remote, independent entity.

Hardware designers make the mistake of assuming that firmware management of a TCP port ("port stealing") successfully provides the illusion to the OS that that port is simply inactive, and the OS happily continues internetworking its merry way through life.

This is certainly not true, because of current netfilter and userland application behavior, which often depends on being able to allocate (bind) to random TCP ports. Successfully allocating a TCP port within the OS that then behaves differently from all other TCP ports (because it is the magic iSCSI port) creates a cascading functional disconnect. On that magic iSCSI port, strange errors will be returned instead of proper behavior. Which, in turn, cascades through new (and inevitably under-utilized) error handling paths in the app.
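
For reference, this is roughly what "allocate (bind) to random TCP ports" looks like from userland, as a minimal sketch over loopback. The helper names bind_ephemeral and can_bind are made up for illustration. It also shows the check the OS normally performs: a port the stack knows is taken cannot be bound a second time. A port held by firmware without the stack's knowledge would not get that check, which is the disconnect described above.

```python
import socket


def bind_ephemeral(host: str = "127.0.0.1"):
    """Ask the OS for any free TCP port by binding to port 0."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    # getsockname() reports the port the stack actually picked.
    return sock, sock.getsockname()[1]


def can_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a fresh TCP socket can bind to host:port right now."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
        return True
    except OSError:
        # Typically EADDRINUSE when the stack knows the port is taken.
        return False
    finally:
        probe.close()
```

While the socket returned by bind_ephemeral() stays open, can_bind() on the same port reports False, because the stack tracks that allocation itself.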

So, of course, one must work around problems like this, which leads to one of two broad choices:

1) implement co-management (sharing) of IP address/port space, between the OS kernel and a remote entity.

2) come up with a solution in hardware that does not require the OS to co-manage the data it has so far been managing exclusively in software.

It should be obvious that we prefer path #2.

For, trudging down path #1 means

* one must give the user the ability to manage shared IP addresses IN A NON-HARDWARE-SPECIFIC manner. Currently most vendors of "TCP port stealing" solutions seem to expect each user to learn a vendor-specific method of identifying and managing the "magic port".

Excuse my language, but, what a fucking security and management nightmare in a cross-vendor environment. It is already a pain, with some [unnamed system/chipset vendors] management stealing TCP ports -- and admins only discover this fact when applications behave strangely on new hardware.

But... it's tough to notice because stumbling upon the magic TCP port won't happen often unless the server is heavily loaded. Thus you have a security/application problem once in a blue moon, due to this magic TCP port mentioned in some obscure online documentation nobody has read.

* however, giving the user the ability to co-manage IP addresses means hacking up the kernel TCP code and userland tools for this new concept, something that I think DaveM would rightly be a bit reluctant to do? You are essentially adding a bunch of special case code whenever TCP ports are used:

	if (port in list of "magic" TCP ports with special,
	    hardware-specific behavior)
		...
	else
		do what we've been doing for decades
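
Spelled out as a toy userland model (a sketch only, not kernel code; allocate_port and its parameters are hypothetical names), the special case above would look something like this in a port allocator:

```python
from typing import Iterable, Optional, Set


def allocate_port(candidates: Iterable[int],
                  in_use: Set[int],
                  magic_ports: Set[int]) -> Optional[int]:
    """Pick the first usable port from candidates, or None if there is none."""
    for port in candidates:
        if port in magic_ports:
            # Special case: the port is owned by an offload device, so the
            # software stack must step around it.
            continue
        if port not in in_use:
            # Ordinary path: first free port wins and is marked as taken.
            in_use.add(port)
            return port
    return None
```

Every path that hands out or looks up a TCP port would need the same extra membership test, and the contents of magic_ports would have to come from each vendor's device.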

ISTR Roland(?) pointing out code that already does a bit of this in the IB space... but the point is

Finally, this shared IP address/port co-management thing has several problems listed on the TOE page: http://www.linuxfoundation.org/en/Net:TOE

such as,

* security updates for TCP problems mean that a single IP address can be PARTIALLY SECURE, because the kernel TCP stack and the h/w's firmware are inevitably updated separately (even if distributed and compiled together). Yay, we are introducing a wonderful new security problem here.

* from a security, network scanner and packet classifier point of view, a single IP address no longer behaves like Linux. It behaves like Linux... sometimes. Depending on whether it is a magic TCP port or not.

Talk about security audit hell.

This should be plenty, so I'm stopping now. But looking down the TOE wiki page I could easily come up with more reasons why "IP address remote co-management" is more complicated and costly than you think.

Jeff

