Re: Fire Engine??

From: Andi Kleen
Date: Wed Nov 26 2003 - 17:40:52 EST


On Wed, 26 Nov 2003 11:30:40 -0800
"David S. Miller" <davem@xxxxxxxxxx> wrote:

>
> > - On TX we are inefficient for the same reason. TCP builds one packet
> > at a time, then goes down through all the layers taking all the locks (queue,
> > device driver, etc.) and submits the single packet. Then it repeats that for
> > lots of packets, because many TCP writes are > MTU. Batching that would
> > likely help a lot, as was done in the 2.6 VFS. I think it could
> > also make hard_start_xmit in many drivers significantly faster.
>
> This is tricky, because of getting all of the queueing stuff right.
> All of the packet scheduler APIs would need to change, as would
> the classification stuff, not to mention netfilter et al.

You only need to do a fast path for the default scheduler at the beginning.
Every complicated "slow" API, like advanced queueing or netfilter, can still fall back to
one packet at a time until it is cleaned up (a similar strategy to the one used for the
non-linear skbs).
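
A minimal sketch of that split. dev_queue_xmit() is the real single-skb entry
point; the batch-capable driver hook is purely hypothetical, passed in here as a
function pointer since no such hook exists:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch only: submit a whole list of skbs when a batch-capable hook
 * is available, otherwise fall back to the existing path. */
static int xmit_skb_list(struct net_device *dev, struct sk_buff **list, int n,
                         int (*xmit_list)(struct sk_buff **, int, struct net_device *))
{
        int i;

        /* Fast path: hypothetical batch-capable driver hook. */
        if (xmit_list)
                return xmit_list(list, n, dev);

        /* Slow path: advanced qdiscs, netfilter etc. keep going
         * through the unmodified one-skb-at-a-time entry point. */
        for (i = 0; i < n; i++)
                dev_queue_xmit(list[i]);
        return 0;
}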

> You're talking about basically redoing the whole TX path if you
> want to really support this.
>
> I'm not saying "don't do this", just that we should be sure we know
> what we're getting if we invest the time into this.

In some profiling I did some time ago, queue locks and device-driver
locks were the biggest offenders on TX after the copy.

The only tricky part is getting right the state machine in tcp_do_sendmsg()
that decides when to flush.
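
To make the flush decision concrete, a rough sketch; MAX_TX_BATCH and the batch
bookkeeping are made up, and xmit_skb_list() is the hypothetical batched submit
sketched above:

#define MAX_TX_BATCH 8  /* made-up batch size */

/* Sketch: tcp_do_sendmsg() would feed each skb it builds through here
 * and flush the whole list at once, so the queue and driver locks are
 * taken once per batch instead of once per packet. */
static void tcp_queue_or_flush(struct net_device *dev, struct sk_buff *skb,
                               struct sk_buff **batch, int *n, int more_to_send)
{
        batch[(*n)++] = skb;

        /* Flush when the batch is full, or when this write has no
         * more user data to turn into packets. */
        if (*n == MAX_TX_BATCH || !more_to_send) {
                xmit_skb_list(dev, batch, *n, NULL);
                *n = 0;
        }
}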

> > - user copy and checksum could probably also be done faster if they were
> > batched for multiple packets. It is hard to optimize properly for
> > <= 1.5K copies.
> > This is especially true for 4/4 split kernels, which will eat a
> > page-table lookup + lock for each individual copy, but also for others.
>
> I partially disagree, especially in the presence of a chip that provides
> proper implementations of software-initiated prefetching.

Having a list of packets helps especially for prefetching, because you
can prefetch the next packet while you're working on the current one. The CPU's
hardware prefetcher cannot do that for you.
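
Roughly like this; prefetch() is the real primitive, process_one_skb() just
stands in for whatever per-packet work is being done:

#include <linux/prefetch.h>
#include <linux/skbuff.h>

static void process_one_skb(struct sk_buff *skb);  /* stand-in for real work */

/* Walk a list of packets, prefetching the next packet's data while the
 * current one is still being worked on.  A hardware prefetcher cannot
 * follow this pointer chain for us. */
static void process_skb_list(struct sk_buff **list, int n)
{
        int i;

        for (i = 0; i < n; i++) {
                if (i + 1 < n)
                        prefetch(list[i + 1]->data);
                process_one_skb(list[i]);
        }
}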

I did look seriously at a faster csum-copy/copy-to-user for K8, but the conclusion
was that all the tricks are only worth it when you can work with larger amounts of data;
1.5K at a time is just too small.
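
For what the batched variant might look like: csum_and_copy_from_user() is the
real per-chunk primitive, the batched entry point around it is invented:

#include <linux/skbuff.h>
#include <net/checksum.h>

/* Sketch: checksum-copy n MTU-sized chunks out of one user buffer in a
 * single call, so the per-call setup (and on a 4/4 split kernel the
 * page-table work) is paid once per batch instead of once per 1.5K. */
static int csum_copy_batch(const char *src, struct sk_buff **skbs,
                           int n, int len)
{
        int i, err = 0;

        for (i = 0; i < n && !err; i++, src += len)
                skbs[i]->csum = csum_and_copy_from_user(src,
                                (char *)skb_put(skbs[i], len), len, 0, &err);
        return err;
}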

Ah yes, one more item could be added to the list:

- Investigate getting more performance through explicit prefetching
(e.g. in the device drivers, avoid eth_type_trans() having to touch the packet
early when you can classify it just by looking at the RX ring state; instead
issue a prefetch on the packet data and hope it is already in cache by the time
the IP stack gets around to looking at it)
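
A rough sketch of that idea; the descriptor layout, ring helpers and the
RX_DESC_IPV4 status bit are all invented, the point is only to classify from
NIC-provided state and prefetch instead of reading the header right away:

#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/if_ether.h>
#include <linux/prefetch.h>

static void my_rx_poll(struct net_device *dev, struct my_rx_ring *ring)
{
        while (rx_ring_has_work(ring)) {                      /* invented */
                struct my_rx_desc *desc = rx_ring_next(ring); /* invented */
                struct sk_buff *skb = desc->skb;

                prefetch(skb->data);  /* warm the cache for the IP stack */

                if (desc->status & RX_DESC_IPV4) {  /* invented NIC status bit */
                        skb->protocol = htons(ETH_P_IP);
                        skb_pull(skb, ETH_HLEN); /* part of what eth_type_trans() does */
                } else {
                        /* Slow path: read the header the usual way. */
                        skb->protocol = eth_type_trans(skb, dev);
                }
                netif_rx(skb);
        }
}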

-Andi (who shuts up now because I don't have any time to code on any of this :-( )