Re: More performance for the TCP stack by using additional hardware chip on NIC

From: Avi Kivity
Date: Sun Apr 17 2005 - 07:17:09 EST


On Sun, 2005-04-17 at 14:30, Willy Tarreau wrote:

> > TOEs can remove the data copy on receive. In some applications (notably
> > storage), where the application does not touch most of the data, this is
> > a significant advantage that cannot be achieved in a software-only
> > solution.
>
> Well, if the application does not touch most of the data, either it
> is playing as a relay, and the data will at least have to be copied,

it might use copyless send. indeed, copyless send is much easier than
copyless receive.
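
for illustration, a minimal sketch of copyless send via sendfile(2),
which moves pages from the page cache to the NIC without the CPU
copying them through user space (sockfd and path are placeholders;
error handling is trimmed):

    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* send a whole file down a connected socket without the CPU
     * touching the payload: the kernel hands page-cache pages
     * straight to the NIC. */
    ssize_t send_file_copyless(int sockfd, const char *path)
    {
            int fd = open(path, O_RDONLY);
            struct stat st;
            off_t off = 0;
            ssize_t sent;

            if (fd < 0)
                    return -1;
            if (fstat(fd, &st) < 0) {
                    close(fd);
                    return -1;
            }
            sent = sendfile(sockfd, fd, &off, st.st_size);
            close(fd);
            return sent;
    }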

> or it will play as a client or server which reads from/writes to disk,
> and in this case, I wonder how the NIC will send its writes directly
> to the disk controller without some help.

the TOE DMAs the data into the application's buffer, and the disk
controller DMAs the same data out to disk.

but the processor never touches the data.
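
a sketch of that path with today's interfaces, assuming an
O_DIRECT-capable filesystem and transfer sizes that satisfy its
alignment rules (a trailing partial chunk would need a buffered
fallback, omitted here); with a TOE, the recv() below would be a DMA
into buf:

    #define _GNU_SOURCE        /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define CHUNK (64 * 1024)

    /* data lands in an aligned buffer (with a TOE, by DMA), and
     * O_DIRECT lets the disk controller DMA the same buffer out,
     * so the CPU never reads the payload. */
    int relay_to_disk(int sockfd, const char *path)
    {
            void *buf;
            ssize_t n;
            int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);

            if (fd < 0 || posix_memalign(&buf, 4096, CHUNK) != 0)
                    return -1;

            /* only full, aligned chunks, per O_DIRECT's rules */
            while ((n = recv(sockfd, buf, CHUNK, MSG_WAITALL)) == CHUNK)
                    if (write(fd, buf, CHUNK) != CHUNK)
                            break;

            free(buf);
            close(fd);
            return 0;
    }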

>
> What worries me with those NICs is that you have no control over
> the TCP stack. You often have to disable the acceleration when you
> want to insert even one firewall rule, use policy routing, or even
> do a simple anti-spoofing check. It is exactly like the routers
> which do many things in hardware at wire speed, but drop to snail
> speed when you enable any advanced feature.

this is a very valid concern, one I hadn't thought of. I guess it's a
disadvantage of the solution that we'll have to live with.

maybe one day you'll be able to offload your firewall and policy
router too :)

>
> > > Also these types of solution always add quite a bit of overhead to
> > > connection setup/teardown making it actually a *loss* for the "many
> > > short connections" types of workloads. Now guess which things certain
> > > benchmarks use, and guess what real world servers do :)
> > >
> >
> > again, this depends on the application.
>
> The speed itself depends on the application. An application whose
> goal is to achieve 10 Gbps needs to be written with this goal in
> mind from the start, needs fine-grained use of kernel internals,
> and sometimes even good knowledge of the hardware itself. At the
> moment, a non-blocking application needs one copy because the final
> position of the data in memory is unknown. Probably soon we'll see
> new prefetch syscalls (like in CPUs) which will allow the
> application to tell the system that it expects to fetch some data
> to a particular place.

aio does this very nicely. in io_submit() you tell the system where you
want your data; in io_getevents() the system tells you it has arrived.
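
a minimal sketch of that pattern with libaio (build with -laio); fd is
assumed to be opened with O_DIRECT and buf suitably aligned:

    #include <libaio.h>
    #include <sys/types.h>

    /* submit one read naming its destination buffer, then wait
     * for the kernel to report that the data has landed there. */
    int read_with_aio(int fd, void *buf, size_t len, long long off)
    {
            io_context_t ctx = 0;
            struct iocb cb, *cbs[1] = { &cb };
            struct io_event ev;
            int ret = -1;

            if (io_setup(1, &ctx) < 0)
                    return -1;

            io_prep_pread(&cb, fd, buf, len, off);       /* "put it here" */
            if (io_submit(ctx, 1, cbs) == 1 &&
                io_getevents(ctx, 1, 1, &ev, NULL) == 1) /* "it's there" */
                    ret = (int)ev.res;

            io_destroy(ctx);
            return ret;
    }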

> Then a very simple TOE card would be able to wake the system up to
> send only TCP headers first, and the system will say "send the
> data there", then wake the application once the data has been copied
> and checksummed. This stays compatible with firewalls and other
> mechanisms.
>

neat. this would work very well with aio. it's a pity aio development
appears to have stagnated.

> > a copyless solution is probably necessary to achieve 10Gb/s speeds.
>
> That was said for 100 Mbps then Gbps years ago, and the fact is that
> software has improved a lot (zero-copy, epoll, etc...) and at the
> moment, it's relatively easy to drain multi-gigabit from cheap
> hardware. For example, I could fetch 3.2 Gbps of HTTP traffic on
> a $3000 2 GHz Opteron with a 4-port Intel gigabit NIC and a non-
> optimized HTTP client which still uses select().
>
> Memory and I/O busses are becoming very large, eg: 8 Gbps for the
> PCI-X 133, multi-gigabytes/s between memory and the CPU, so the
> hardware bottleneck for the 10 Gbps is already at the NIC side
> and not between the CPU and the memory. When you reach this
> limit, you'll notice that the application needs very large buffers
> (eg: 12.5 MB to support a 10ms scheduling latency on 10 Gbps) and
> good general design (10 Gbps is 125000 open/read/send/close of
> 10 kB files every second).

the aio api is remarkably well suited to such applications, allowing
batching of requests and responses. add that to a
one-process-per-processor design (to avoid scheduling latencies) and you
have most of the solution.
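
a sketch of that combination, with hypothetical names throughout: each
worker pins itself to one CPU with sched_setaffinity() and pushes a
whole batch of iocbs through a single io_submit():

    #define _GNU_SOURCE        /* for sched_setaffinity() */
    #include <libaio.h>
    #include <sched.h>

    #define NR_BATCH 64        /* illustrative batch size */

    /* one worker per processor: pin to a CPU to avoid migrations,
     * then amortize syscall overhead by batching submissions. */
    void worker_submit(int cpu, io_context_t ctx, struct iocb **batch)
    {
            cpu_set_t mask;

            CPU_ZERO(&mask);
            CPU_SET(cpu, &mask);
            sched_setaffinity(0, sizeof(mask), &mask);

            io_submit(ctx, NR_BATCH, batch);
    }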

Avi
