Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

From: Hiro Yoshioka
Date: Mon Aug 15 2005 - 01:43:55 EST

Next message: Hareesh Nagarajan: "relayfs back ported to 2.4"
Previous message: Ingo Molnar: "Re: [patch] Real-Time Preemption, -RT-2.6.13-rc4-V0.7.53-01, High Resolution Timers & RCU-tasklist features"
In reply to: Christoph Hellwig: "Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()"
Next in thread: Arjan van de Ven: "Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

From: Arjan van de Ven <arjan@xxxxxxxxxxxxx>
Subject: Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Date: Sun, 14 Aug 2005 12:35:43 +0200
Message-ID: <1124015743.3222.17.camel@xxxxxxxxxxxxxxxxxxxxx>

> On Sun, 2005-08-14 at 19:22 +0900, Hiro Yoshioka wrote:
> > Thanks for your comments.
> >
> > On 8/14/05, Arjan van de Ven <arjan@xxxxxxxxxxxxx> wrote:
> > > On Sun, 2005-08-14 at 18:16 +0900, Hiro Yoshioka wrote:
> > > > Hi,
> > > >
> > > > The following is a patch to reduce a cache pollution
> > > > of __copy_from_user_ll().
> > > >
> > > > When I run simple iozone benchmark to find a performance bottleneck of
> > > > the linux kernel, I found that __copy_from_user_ll() spent CPU cycle
> > > > most and it did many cache misses.
> > >
> > >
> > > however... you copy something from userspace... aren't you going to USE
> > > it? The non-temporal versions actually throw the data out of the
> > > cache... so while this part might be nice, you pay BIG elsewhere....
> >
> > The oprofile data does not give an evidence that we pay BIG elsewhere.
>
>
> the problem is that the pay elsewhere is far more spread out, but not
> less. At least generally....
>
> I can see the point of a copy_from_user_nocache() or something, for
> those cases where we *know* we are not going to use the copied data in
> the cpu (but say, only do DMA).
> But that should be explicit, not implicit, since the general case will
> be that the kernel WILL use the data. And if that's the case your change
> is a loss.... (just harder to see because the cost is spread out)

I understand that iozone is not a good benchmark and does not represent
any useful application, so I did a kernel build as a simple benchmark.

What I did was:
cd /test/f1
tar xjf ${baseDir}/src/linux-2.6.12.4.tar.bz2
cd linux-2.6.12.4
cp -p ${baseDir}/src/config .config
make oldconfig
time make -j $CPUS
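
For anyone who wants to repeat the measurement, here is a minimal sketch
of a timing harness. It is not the script used for the numbers in this
mail; the helper names time_command and time_runs are hypothetical, and
it assumes a Unix system where Python's resource module is available.
It reports real, user and system seconds the way time(1) does.

```python
import resource
import subprocess
import time


def time_command(argv, cwd=None):
    """Run argv once and return (real, user, system) seconds, like time(1)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, cwd=cwd, check=True)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # RUSAGE_CHILDREN accumulates CPU time of waited-for children,
    # so the difference is the cost of this one command.
    return (real,
            after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)


def time_runs(argv, runs=5, cwd=None):
    """Time `runs` repetitions of argv and return one tuple per run."""
    return [time_command(argv, cwd=cwd) for _ in range(runs)]
```

A call such as time_runs(["make", "-j", "4"], cwd="linux-2.6.12.4") would
time five builds; restoring a clean tree between runs is left to the
caller.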

The following is the top 5 by CPU cycles:
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name         symbol name
7347544  72.8296  cc1              (no symbols)
532307   5.2763   libbz2.so.1.0.2  (no symbols)
241853   2.3973   vmlinux          buffered_rmqueue
128552   1.2742   libc-2.3.4.so    _int_malloc
107784   1.0684   vmlinux          page_fault
...
10749    0.1065   vmlinux          __copy_from_user_ll
pattern12-0-cpu4-0-08150920/summary.out

Since __copy_from_user_ll is not a hot spot here, we did not see any big
performance difference. (The numbers are times in seconds for 5 runs.)

original 2.6.12.4            real    user     system
No profiling                 532.27  1797.02  194.9
BSQ 0x200+0x3f               620.15  2094.21  212.38
GLOBAL_POWER_EVENTS:100000:  586.01  1984.92  215.97

cache aware 2.6.12.4         real    user     system
No profiling                 526.65  1792.22  190.05
BSQ 0x200+0x3f               615.51  2090.74  206.58
GLOBAL_POWER_EVENTS:100000:  587.69  1978.66  209.18
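
To make "no big difference" concrete, the relative change between the
two "No profiling" rows can be worked out as below. This is only
arithmetic on the numbers already given; the helper percent_change is a
name made up for this sketch.

```python
# Timings (seconds) copied from the "No profiling" rows above.
original = {"real": 532.27, "user": 1797.02, "system": 194.9}
cache_aware = {"real": 526.65, "user": 1792.22, "system": 190.05}


def percent_change(before, after):
    """Relative change from before to after, in percent (negative = faster)."""
    return (after - before) / before * 100.0


changes = {k: percent_change(original[k], cache_aware[k]) for k in original}
for name, pct in changes.items():
    print(f"{name:6s} {pct:+.2f}%")
```

That works out to roughly 1.1% less real time, 0.3% less user time and
2.5% less system time for the cache aware kernel. No run-to-run variance
was recorded, so differences this small may simply be noise.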

Now the top 5 by memory access (original 2.6.12.4):
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples   %        samples  %        app name         symbol name
11439689  82.2135  33906    27.9328  cc1              (no symbols)
277177    1.9920   347      0.2859   libc-2.3.4.so    _int_malloc
229593    1.6500   12946    10.6653  libbz2.so.1.0.2  (no symbols)
84348     0.6062   116      0.0956   libc-2.3.4.so    _int_free
83653     0.6012   438      0.3608   libc-2.3.4.so    calloc
...
8527      0.0613   1648     1.3577   vmlinux          __copy_from_user_ll

Top 5 by cache misses:
33906  27.9328  cc1              (no symbols)
30849  25.4144  vmlinux          buffered_rmqueue
12946  10.6653  libbz2.so.1.0.2  (no symbols)
9178   7.5611   vmlinux          __copy_to_user_ll
2934   2.4171   oprofiled        (no symbols)
...
1648   1.3577   vmlinux          __copy_from_user_ll
pattern12-0-cpu4-0-08150917

Cache aware 2.6.12.4, top 5 by memory access:
samples   %        samples  %        app name         symbol name
11448487  82.8100  32786    28.1051  cc1              (no symbols)
276812    2.0023   256      0.2195   libc-2.3.4.so    _int_malloc
230177    1.6649   12371    10.6048  libbz2.so.1.0.2  (no symbols)
84485     0.6111   120      0.1029   libc-2.3.4.so    _int_free
84043     0.6079   473      0.4055   libc-2.3.4.so    calloc
...
18282     0.1322   9060     7.7665   vmlinux          __copy_from_user_ll

Top 5 by cache misses:
32786  28.1051  cc1              (no symbols)
31175  26.7241  vmlinux          buffered_rmqueue
12371  10.6048  libbz2.so.1.0.2  (no symbols)
9060   7.7665   vmlinux          __copy_from_user_ll
2801   2.4011   oprofiled        (no symbols)
...
0      0        vmlinux          __copy_to_user_ll
pattern12-0-cpu4-0-08151048

Cache misses in __copy_from_user_ll have increased, but those in
__copy_to_user_ll have dropped to 0 (oprofile could not get a sample).

I don't know why the __copy_to_user_ll misses decreased.
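
To look at the two copy routines together, their miss samples from the
profiles above can be added up. This is a small sketch using only the
sample counts already listed; the variable names are made up here.

```python
# 3rd level cache miss samples (BSQ_CACHE_REFERENCE, unit mask 0x200)
# copied from the two profiles above.
misses = {
    "original":    {"__copy_from_user_ll": 1648, "__copy_to_user_ll": 9178},
    "cache aware": {"__copy_from_user_ll": 9060, "__copy_to_user_ll": 0},
}

# Combined miss samples of both copy routines per kernel.
totals = {kernel: sum(per_symbol.values())
          for kernel, per_symbol in misses.items()}
delta = totals["cache aware"] - totals["original"]
delta_pct = delta / totals["original"] * 100.0

for kernel, total in totals.items():
    print(f"{kernel:12s} {total} samples")
print(f"change: {delta:+d} samples ({delta_pct:+.1f}%)")
```

The combined count goes from 10826 to 9060 samples. The two profiles
come from separate runs, so this comparison is only a rough indication.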

Anyway, so far we have not found a big regression with the cache aware
version of __copy_from_user_ll.

What do you think?
Hiro
