lockless poll() (was Re: namei() query)

From: kumon@flab.fujitsu.co.jp
Date: Tue Apr 25 2000 - 02:16:32 EST


kumon@flab.fujitsu.co.jp writes:
> optimization. We are arranging the global figure now.

The following tables show user/os/idle statistics from the current
experiments.
# Kernel is 2.3.40 (we are moving to 2.3.99-preX).

Note: between 2.3.40 and the latest kernel, interrupt bottom-half
      handling has changed considerably, so the following values may
      not apply to the current kernel.

Either a light or a heavy load is applied: 8 clients or 23 clients.
Note that when the kernel runs on only 1 CPU, 8 clients already eat
all of the server's CPU, so that case is not really "light".

8 clients can request about 1500 trans/s, so the heavy load (23
clients) should be able to request up to a maximum of about 4400
trans/s.

Looking at the individual numbers, the poll optimization is rather
effective in the "light" case; the effect shrinks in the "heavy"
case.

light load (8 clients)
#cpu  kernel     tran/s    user    os     idle
-----------------------------------------------
  1   orig      1092.40   28.4%  71.6%   0.0%
  1   poll-opt  1081.75   28.3%  70.3%   1.4%
  2   orig      1518.40   20.1%  62.9%  17.1%
  2   poll-opt  1519.80   20.1%  60.2%  19.7%
  4   orig      1524.44   10.2%  29.2%  60.5%
  4   poll-opt  1524.88   10.3%  27.6%  62.1%

heavy load (23 clients)
#cpu  kernel     tran/s    user    os     idle
-----------------------------------------------
  1   orig      1083.05   27.9%  72.1%   0.0%
  1   poll-opt  1083.99   27.6%  72.4%   0.0%
  2   orig      1721.44   22.4%  77.6%   0.0%
  2   poll-opt  1731.44   22.2%  77.8%   0.0%
  4   orig      2411.93   17.2%  77.2%   5.6%
  4   poll-opt  2403.23   17.3%  74.5%   8.2%

We now look into the per-transaction statistics.
The values are calculated by:
        One-Trans-Time = (# of cpu) / (throughput)
and the user/os/idle ratios are then multiplied by the One-Trans-Time.
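
As a quick check of that arithmetic, here is a small sketch in plain C
that reproduces the per-transaction breakdown; the sample numbers are
simply the 4-CPU, heavy-load, poll-opt row from the tables above:

#include <stdio.h>

/*
 * Per-transaction breakdown as described above:
 *     one_trans_time = (# of cpu) / (throughput)
 * and each of the user/os/idle ratios is multiplied by it.
 * Sample values: 4-CPU, heavy-load, poll-opt row.
 */
int main(void)
{
        int ncpu = 4;
        double tps = 2403.23;                   /* transactions per second */
        double user = 0.173, os = 0.745, idle = 0.082;

        double one_trans_us = (double)ncpu / tps * 1e6;   /* usec/tran */

        printf("total: %4.0f usec/tran\n", one_trans_us);
        printf("user : %4.0f usec/tran\n", one_trans_us * user);
        printf("os   : %4.0f usec/tran\n", one_trans_us * os);
        printf("idle : %4.0f usec/tran\n", one_trans_us * idle);
        return 0;
}

This gives roughly 288/1240/136 usec/tran, which matches the heavy-load
table below up to rounding.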

Light Load (original -> poll optimized)
#cpu   user (usec/tran)   os (usec/tran)   idle (usec/tran)
------------------------------------------------------------
  1       260 ->  262       656 ->  650        0 ->   13
  2       264 ->  265       828 ->  792      225 ->  259
  4       269 ->  270       767 ->  724     1589 -> 1630

Heavy Load (original -> poll optimized)
#cpu   user (usec/tran)   os (usec/tran)   idle (usec/tran)
------------------------------------------------------------
  1       258 ->  256       668 ->  670        0 ->    0
  2       260 ->  256       903 ->  900        0 ->    0
  4       285 ->  288      1281 -> 1241       92 ->  136

User time is stable across the experiments, but OS time increases
as the request rate increases.

Perhaps the above results are already widely known facts.

Next, we break down the OS time into individual functions.

To save space, I show only the heavy-load case with the poll
optimization. Before the poll optimization, stext_lock was the top
consumer at 94.5 us.

The functions are ordered by the time consumed in the 4-CPU
execution.

The OS overhead (incl. idle) goes from 668 us to 1377 us as we go from
1 to 4 CPUs; it nearly doubles. Some functions grow much faster than
the average.

Hmm, __global_cli is the growth winner: it gained 62 us (64.7 - 2.5).

The memory allocators may be a problem, as may schedule() and
csum_partial_copy_generic().
These may be related to cache block transfers between CPUs.
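
For what it's worth, the __global_cli growth has the classic shape of
contention on a single globally shared lock (in these kernels,
__global_cli spins waiting for the global IRQ lock): the cost of each
acquisition grows as more CPUs fight over the same cache line. Below
is a minimal user-space sketch of that effect; it is nothing
kernel-specific, just plain pthreads with made-up names, for
illustration only:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/*
 * Illustration only: N threads all serialize on one global lock,
 * roughly analogous to all CPUs serializing on the global IRQ lock.
 * The measured time per lock/unlock pair grows as threads are added.
 */
#define ITERATIONS 1000000L

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile long shared_counter;

static void *worker(void *arg)
{
        (void)arg;
        for (long i = 0; i < ITERATIONS; i++) {
                pthread_mutex_lock(&global_lock);
                shared_counter++;               /* tiny critical section */
                pthread_mutex_unlock(&global_lock);
        }
        return NULL;
}

int main(int argc, char **argv)
{
        int nthreads = argc > 1 ? atoi(argv[1]) : 4;
        pthread_t tid[64];
        struct timespec t0, t1;

        if (nthreads < 1 || nthreads > 64)
                return 1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < nthreads; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < nthreads; i++)
                pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) +
                      (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d threads: %.0f ns per lock/unlock pair\n", nthreads,
               secs * 1e9 / ((double)nthreads * ITERATIONS));
        return 0;
}

Built with "gcc -O2 -pthread", the per-operation time reported here
should grow noticeably from 1 to 4 threads on an SMP box, which is the
same qualitative shape as the 2.5 -> 64.7 us growth above.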

 1cpu   2cpu    4cpu  func (usec/tran)
---------------------------------------
  668    899  1376.7  Total OS time (incl. idle)
---------------------------------------
 55.3   66.5    90.6  csum_partial_copy_generic
 69.6   77.7    90.4  vortex_interrupt
  2.5   15.6    64.7  __global_cli
    -   27.8    62.5  stext_lock
    -      -    57.2  default_idle
 42.6   45.2    54.5  boomerang_start_xmit
 11.1   26.3    47.5  kmem_cache_alloc
 12.8     26    41.7  kmem_cache_free
  7.4   10.2    37.8  schedule
   15   20.9    36.8  __wake_up
 20.1   26.1    34.8  boomerang_rx
 11.1   21.9    34.2  kmalloc
 13.6   20.7    30.8  kfree
  4.2    9.6    29.7  do_bottom_half
  9.2   20.1    26.2  do_IRQ
 22.7     23    24.3  mask_IO_APIC_irq
 12.5   16.2    23.4  nf_hook_slow
  5.3   12.5    19.2  handle_IRQ_event
  7.9   13.7    17.8  __kfree_skb
  5.9   10.5    16.9  alloc_skb
  7.3   11.4    14.3  net_bh

Experiments were run on 4 x PII Xeon, 450 MHz, 2 MB cache.

--
Computer Systems Laboratory, Fujitsu Labs.
kumon@flab.fujitsu.co.jp
