Re: [PATCH][CFT] dcache-ac6-D - dcache threading

From: Andi Kleen (ak@suse.de)
Date: Sun Jun 04 2000 - 06:35:46 EST

Next message: willy@thepuffingroup.com: "Re: [PATCH] ac7 Athlon-SMP"
Previous message: Olaf Titz: "Re: White paper on UNIX timers/timing...."
In reply to: kumon@flab.fujitsu.co.jp: "Re: [PATCH][CFT] dcache-ac6-D - dcache threading"
Next in thread: kumon@flab.fujitsu.co.jp: "Re: [PATCH][CFT] dcache-ac6-D - dcache threading"
Reply: kumon@flab.fujitsu.co.jp: "Re: [PATCH][CFT] dcache-ac6-D - dcache threading"
Reply: kuznet@ms2.inr.ac.ru: "Re: [PATCH][CFT] dcache-ac6-D - dcache threading"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sun, Jun 04, 2000 at 05:24:11AM +0900, kumon@flab.fujitsu.co.jp wrote:
> Andi Kleen writes:
> > Some comments from a networking perspective.
>
> Thank you for your interest in my measurements.
>
> > It may also be worth to try the e100 driver from the Intel website.
>
> The instruction-level profile shows that an "inw()" statement in
> speedo_interrupt():eepro100.c accounts for more than half of the overhead
> of speedo_interrupt(). Actually, this statement is not compiled into
> an inw instruction but into a movzwl instruction.
>
> near line 1520 in eepro100.c:
>     do {
> HERE>>> status = inw(ioaddr + SCBStatus);
>         /* Acknowledge all of the current interrupt sources ASAP. */
>         /* Will change from 0xfc00 to 0xff00 when we start handling
>            FCP and ER interrupts --Dragan */
>         outw(status & 0xfc00, ioaddr + SCBStatus);
>
>
> > > 22.9 22.2 kmalloc
> > > 22.2 20.5 kfree
> >
> > That requires per CPU slabs to fix. Normally the new per CPU skb cache in 2.4
> > should help a bit already, maybe you need to increase
> > /proc/sys/net/core/hot_list_len
>
> Some part of the kmalloc/kfree overhead comes from do_select, and it is
> easily eliminated by using a small array on the stack, which I've already
> posted. IMHO per CPU skb will not reduce the kmalloc overhead,

Funny, Linux before 2.2 did that, but it was changed to allow fd set size
extensions. There were actually patches that used a stack allocation for
small sets and a dynamic one for larger sets, but they were rejected because
of the complexity.
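
The stack-or-dynamic idea can be sketched in userspace C as follows. This is
only an illustration, not the rejected patches or the do_select() code: the
SMALL_SET_BYTES threshold and the count_ready() helper are hypothetical, and
malloc() stands in for kmalloc().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical threshold: sets up to this many bytes live on the stack. */
#define SMALL_SET_BYTES 256

/*
 * Count the set bits in the first nbytes of an fd-set-like bitmap, working
 * on a private copy. Returns -1 if the fallback allocation fails.
 */
static long count_ready(const unsigned char *set, size_t nbytes)
{
    unsigned char small[SMALL_SET_BYTES];  /* fast path: no allocator call */
    unsigned char *copy = small;
    long ready = 0;

    assert(set != NULL);

    if (nbytes > sizeof(small)) {
        copy = malloc(nbytes);             /* slow path: large fd sets */
        if (copy == NULL)
            return -1;
    }

    memcpy(copy, set, nbytes);
    for (size_t i = 0; i < nbytes; i++)
        /* b &= b - 1 clears the lowest set bit on each iteration */
        for (unsigned char b = copy[i]; b != 0; b &= (unsigned char)(b - 1))
            ready++;

    if (copy != small)
        free(copy);
    return ready;
}
```

The common case never touches the allocator; the price is the second code
path, which is the complexity mentioned above.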

> skb_buf_head uses kmem_cache_allocate directly.

With its own per-CPU hot list cache.
The problem with kmem_cache_alloc is that it is not per CPU but global,
which causes the cache line holding the list head of the cache structure
to bounce between CPUs.

Caching more skbs by increasing the hot list length may speed up the
skb header allocation in your benchmark. For the data portions it does
not help that much, because the payload size variations tend to distribute
the cache line bouncing over multiple fixed-size kmalloc caches.

[All of that is a big hack -- the real solution would be a per-CPU slab
allocator like the one Solaris has]
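
A minimal userspace sketch of a per-CPU hot list in front of a shared
allocator may make the idea concrete. It is not the kernel's skb code: NCPUS,
HOT_LIST_LEN, struct obj and the obj_alloc()/obj_free() helpers are all
hypothetical, malloc() stands in for the global cache, and the caller is
assumed to pass the CPU it is running on and not to migrate in between.

```c
#include <assert.h>
#include <stdlib.h>

#define NCPUS        4   /* hypothetical CPU count */
#define HOT_LIST_LEN 8   /* hypothetical cap, in the spirit of hot_list_len */

struct obj { struct obj *next; char payload[120]; };

/* One free list per CPU; a real kernel would pad these to cache lines. */
static struct { struct obj *head; int len; } hot[NCPUS];

static struct obj *obj_alloc(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    struct obj *o = hot[cpu].head;
    if (o != NULL) {                 /* hit: touches only this CPU's list */
        hot[cpu].head = o->next;
        hot[cpu].len--;
        return o;
    }
    return malloc(sizeof(*o));       /* miss: fall back to the shared allocator */
}

static void obj_free(int cpu, struct obj *o)
{
    assert(cpu >= 0 && cpu < NCPUS);
    if (hot[cpu].len < HOT_LIST_LEN) {   /* keep it hot for the next alloc */
        o->next = hot[cpu].head;
        hot[cpu].head = o;
        hot[cpu].len++;
    } else {
        free(o);                         /* list full: return to shared pool */
    }
}
```

A larger HOT_LIST_LEN means fewer trips to the shared allocator, at the cost
of more memory parked on each CPU.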

>
> The statistics shows that the skb buffer (not skb buffer head)
> allocation is the most frequent kmalloc/kfree client (after do_select
> optimization).
>
> Using per-cpu cache mechanism and auto-array in do_select() can
> curtail kmalloc() overhead to 1/3. And the statistics shows the most
> frequently requested sizes to kmalloc are 2kb and 128b, of course it
> is application dependent.

2KB is a full-sized MTU packet (1.5KB rounded up to a power of two).
128 bytes is probably the ACK.
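
The rounding can be written out as a small helper. This is only a sketch of
power-of-two size classes: size_class(), the 32-byte minimum and the 1MB cap
are assumptions for illustration, not the kernel's actual kmalloc size table.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two >= n, starting from a hypothetical 32-byte minimum. */
static size_t size_class(size_t n)
{
    size_t c = 32;

    /* hypothetical upper bound, keeps the shift below from overflowing */
    assert(n > 0 && n <= ((size_t)1 << 20));
    while (c < n)
        c <<= 1;
    return c;
}
```

With this helper a 1500-byte Ethernet frame lands in the 2048-byte class, and
any request between 65 and 128 bytes lands in the 128-byte class.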

>
> > > 9.0 8.2 ip_route_input
> >
> > Interesting. Looks like the routing cache hash isn't as good as we thought.
> > Could you add some statistics to ipv4/route.c:ip_route_input to check
> > the average hash chain length or where exactly the cycles go there?
>
> Two lock instructions took more than half of the ticks. But we should
> be careful when interpreting this. Superscalar execution may distort the
> results, so some additional experiments are needed for confirmation.
> I suspect instruction serialization may be the main reason.
>
> The instructions are shown below:
>
> int ip_route_input(struct sk_buff *skb, u32 daddr, u32 saddr,
>                    u8 tos, struct net_device *dev)
> {
>         struct rtable * rth;
>         unsigned hash;
>         int iif = dev->ifindex;
>
>         tos &= IPTOS_TOS_MASK;
>         hash = rt_hash_code(daddr, saddr^(iif<<5), tos);
>
> HERE>>  read_lock(&rt_hash_table[hash].lock);

The per-bucket locks apparently perform well for outgoing traffic,
but they are very bad for incoming traffic, because it usually tends
to hit the same lock. Hmm....
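
To see why one flow always lands on one lock, here is a toy bucket function.
It is a stand-in, not the real rt_hash_code(): bucket_of(), NBUCKETS and the
mixing steps are assumptions; only the saddr^(iif<<5) input follows the quoted
code.

```c
#include <assert.h>
#include <stdint.h>

#define NBUCKETS 256   /* hypothetical table size, power of two */

/* Toy stand-in for rt_hash_code(); the real mixing function differs. */
static unsigned bucket_of(uint32_t daddr, uint32_t saddr, int iif, uint8_t tos)
{
    assert(iif >= 0);
    uint32_t h = daddr ^ saddr ^ ((uint32_t)iif << 5) ^ tos;
    h ^= h >> 16;                 /* fold the high bits into the low ones */
    h ^= h >> 8;
    return h & (NBUCKETS - 1);    /* index of the bucket, and of its lock */
}
```

Every packet of a flow carries the same (daddr, saddr, iif, tos), so every
CPU that handles that flow takes the lock of the same bucket, no matter how
many buckets the table has.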

>
>
> the other is around line 1560
> rth->u.dst.lastuse = jiffies;
> HERE>> dst_hold(&rth->u.dst);

Again the same problem. All incoming traffic hits the same routing
cache entry for your local address. dst_hold increments the use count
of the rtcache entry, which is shared between all CPUs.

Looks like the locking strategy does not work very well
for this case.
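
The contrast between one shared use count and a per-CPU split can be sketched
with C11 atomics. This is an illustration only, not a proposed fix and not
the kernel's dst code: NCPUS, CACHELINE and all function names below are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define NCPUS     4    /* hypothetical CPU count */
#define CACHELINE 64   /* hypothetical cache line size in bytes */

/* Shared design: every hold from every CPU writes the same cache line. */
static atomic_int shared_refcnt;

static void hold_shared(void)
{
    atomic_fetch_add(&shared_refcnt, 1);
}

/* Split design: each CPU increments a counter on its own cache line. */
static struct { _Alignas(CACHELINE) atomic_int cnt; } percpu_refcnt[NCPUS];

static void hold_percpu(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    atomic_fetch_add(&percpu_refcnt[cpu].cnt, 1);
}

/* Reading the total is now the slow operation: it visits every CPU's slot. */
static int total_percpu(void)
{
    int sum = 0;
    for (int i = 0; i < NCPUS; i++)
        sum += atomic_load(&percpu_refcnt[i].cnt);
    return sum;
}
```

The split version makes the hot path cheap but makes "is the count zero?"
expensive, which is one reason it is not a drop-in replacement for a plain
reference count.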

-Andi

This archive was generated by hypermail 2b29 : Wed Jun 07 2000 - 21:00:18 EST