2.5.50 + e100 benchmarking

From: Anton Blanchard (anton@samba.org)
Date: Sun Dec 08 2002 - 07:44:44 EST


I've got the benchmarking itch and am still waiting for the mail to
deliver me some e1000-size Christmas presents, so I've started playing
with some e100s that were lying around.


2.5.50-BK, 2 ppc64 partitions, one e100 card in each, 1500-byte MTU.
In all the runs we were pumping 11.76MB/sec down the socket.

We are sending bytes down a TCP socket (using tridge's socklib); the send
side looks like:


So we are pushing 64kB into the networking layer at a time. And the
read side looks like this:


So we are getting about 8kB per read. (I'm guessing that's due to rx interrupt
mitigation on the card.)

First let me explain the patches I have attached.

1. e100_nodisable
e100_intr was the worst function in a profile. PCI reads are very costly
(and PCI reads that flush posted writes are even worse), and we were
disabling and enabling the on-chip interrupt bit for each interrupt
(both operations needed a PCI read to flush the write).

The question is: why do we need to disable and reenable interrupts via
the on-chip status register? At least on ppc64 we can't take the same
interrupt recursively; isn't this also the case on x86?

Andrew Morton's cyclesoak to the rescue. Before, with the stock driver:

System load: 6.4% || Free: 74.4%(0) 100.1%(1) 100.1%(2) 99.7%(3)

And after, with e100_nodisable applied:

System load: 5.1% || Free: 79.6%(0) 100.0%(1) 100.1%(2) 99.7%(3)

(Ignore the three other CPUs; I have locked both the irq and the process to CPU 0.)

74.4% -> 79.6% idle, so e100_nodisable is worth about 5 percentage points of CPU on my machine. Not bad.

2. e100_txchecksum
In recent 2.5 I found almost every tx packet had an invalid pseudo
header checksum. We didn't catch this in 2.4 because we would only use tx
checksumming for zero copy. In 2.5 we use it whenever we can (and that's
good, our copy_to/from_user has been optimised to within an inch of its
life thanks to paulus).

Anyway, I know zip about this stuff, but it seems (from a quick look
at the acenic and tg3 drivers) that Linux always computes this
checksum. Bottom line: the patch fixes the problems I was seeing.

OK now to get a feel for what is going on:

sending (roughly):
9k irqs/second
900 context switches/second
20.5% CPU

receiving (roughly):
2.3k irqs/second
3k context switches/second
13.5% CPU

Most of the extra cost on send appears to be the higher interrupt rate.
So this raises the question: can we be more aggressive with the tx
interrupt mitigation? I had a quick play with some of the e100 options
and it gave some short-term relief (4k/sec) but then jumped back up to
9k/sec again.

For those who have made it this far down, here are some profiles :)


Keep in mind no idle time shows up here because I was running akpm(TM)
cyclesoak to soak it all up.

As you can see, on the receiving side (sock_sink) copy_tofrom_user is
the worst offender. Very nice. Ignore plpar_hcall_norets, it's some magic
we do for dynamic PCI mapping. Also note profile hits get attributed to
the following instruction, e.g. in e100_intr we see a bunch of time just
after the first lhbrx (the number on the left is % time of the entire
function). lhbrx is a byte-reversed load - in this case it happens to be
a PCI memory read.

On the send side (sock_source) the higher interrupt rate shows up.
(Hmm, I wonder how we got idle time here; cyclesoak should have sucked
it all up.)



This archive was generated by hypermail 2b29 : Sun Dec 15 2002 - 22:00:13 EST