[RFC] [PATCH] cache pollution aware __copy_from_user_ll()

From: Hiro Yoshioka
Date: Sun Aug 14 2005 - 04:16:23 EST


Hi,

The following is a patch to reduce the cache pollution caused by
__copy_from_user_ll().

When I ran a simple iozone benchmark to look for performance bottlenecks
in the Linux kernel, I found that __copy_from_user_ll() consumed the most
CPU cycles and caused many cache misses.

The following profiles were collected with oprofile.

Top 5 CPU cycle
CPU: P4 / Xeon, speed 2200.91 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % app name symbol name
281538 15.2083 vmlinux __copy_from_user_ll
81069 4.3792 vmlinux _spin_lock
75523 4.0796 vmlinux journal_add_journal_head
63674 3.4396 vmlinux do_get_write_access
52634 2.8432 vmlinux journal_put_journal_head
(pattern9-0-cpu4-0-08141700/summary.out)

Top 5 Memory Access and Cache miss
CPU: P4 / Xeon, speed 2200.91 MHz (estimated)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples % samples % app name symbol name
120801 7.4379 37017 63.4603 vmlinux __copy_from_user_ll
84139 5.1806 885 1.5172 vmlinux _spin_lock
66027 4.0654 656 1.1246 vmlinux journal_add_journal_head
60400 3.7189 250 0.4286 vmlinux __find_get_block
60032 3.6963 120 0.2057 vmlinux journal_dirty_metadata

__copy_from_user_ll accounted for 63.4603% of the L3 cache misses even
though it accounted for only 7.4379% of the memory accesses.
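
(For reference, profiles like the above can be collected with oprofile.
The command lines below are only an illustration of the setup, not a
transcript; the event spec syntax is event:count:unit-mask:kernel:user,
the vmlinux path depends on the build, and whether all three events fit
in one run depends on counter allocation.)

  opcontrol --vmlinux=/path/to/vmlinux
  opcontrol --setup --event=GLOBAL_POWER_EVENTS:100000:0x01:1:1 \
            --event=BSQ_CACHE_REFERENCE:3000:0x3f:1:1 \
            --event=BSQ_CACHE_REFERENCE:3000:0x200:1:1
  opcontrol --start
  (run the iozone workload)
  opcontrol --stop
  opcontrol --dump
  opreport -l > summary.out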

To reduce the cache misses in __copy_from_user_ll, I made the patch below
and confirmed the reduction; the post-patch profiles follow.
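
The new routine, _mmx_memcpy_nt, is basically a non-temporal-hint variant
of mmx_memcpy() from arch/i386/lib/mmx.c: the source is prefetched with
prefetchnta and the destination is written with movntq, so the copied data
bypasses the cache instead of evicting other lines. As a rough user-space
sketch of the same idea (only an illustration, not the patch itself; it
assumes SSE, 8-byte-aligned buffers and a length that is a multiple of 64,
and it omits the exception-table fixups the kernel version needs for
faulting user accesses):

/* Illustration only: copy 64-byte blocks with non-temporal stores.
 * Build with something like gcc -O2 -msse. */
#include <stddef.h>
#include <xmmintrin.h>  /* __m64, _mm_prefetch, _mm_stream_pi, _mm_sfence */

static void copy_nt(void *to, const void *from, size_t len)
{
        __m64 *d = to;
        const __m64 *s = from;
        size_t i;

        for (i = 0; i < len / 64; i++) {
                /* prefetch well ahead of the read position (the patch
                 * prefetches 320 bytes ahead); prefetchnta does not fault */
                _mm_prefetch((const char *)s + 320, _MM_HINT_NTA);
                _mm_stream_pi(d + 0, s[0]);  /* movntq: store around the cache */
                _mm_stream_pi(d + 1, s[1]);
                _mm_stream_pi(d + 2, s[2]);
                _mm_stream_pi(d + 3, s[3]);
                _mm_stream_pi(d + 4, s[4]);
                _mm_stream_pi(d + 5, s[5]);
                _mm_stream_pi(d + 6, s[6]);
                _mm_stream_pi(d + 7, s[7]);
                s += 8;
                d += 8;
        }
        _mm_sfence();  /* make the weakly ordered NT stores globally visible */
        _mm_empty();   /* emms: leave MMX state so the FPU is usable again */
}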

Top 5 CPU cycle
CPU: P4 / Xeon, speed 2200.93 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % app name symbol name
120717 8.3454 vmlinux _mmx_memcpy_nt
65955 4.5596 vmlinux do_get_write_access
56088 3.8775 vmlinux journal_put_journal_head
52550 3.6329 vmlinux journal_dirty_metadata
38886 2.6883 vmlinux journal_add_journal_head
(pattern9-0-cpu4-0-08141627/summary.out)

_mmx_memcpy_nt is the new function called from __copy_from_user_ll; it
consumed only 42.88% of the cycles of the original implementation
(120717/281538 = 42.88%).

Top 5 Memory Access and Cache miss
CPU: P4 / Xeon, speed 2200.93 MHz (estimated)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples % samples % app name symbol name
90918 6.3079 89 0.5673 vmlinux _mmx_memcpy_nt
83654 5.8039 177 1.1283 vmlinux journal_dirty_metadata
57836 4.0127 348 2.2183 vmlinux journal_put_journal_head
48236 3.3466 165 1.0518 vmlinux do_get_write_access
44546 3.0906 21 0.1339 vmlinux __getblk

The L3 cache misses were reduced from 37017 samples (63.4603%) to 89
(0.5673%), i.e. to about 0.24% of the original count.

The actual elapsed time for five runs was 229.76 sec with the original
kernel and 222.94 sec with the patch, a 3.06% gain
(229.76/222.94 = 1.0306).

The benchmark command was:

iozone -CMR -i 0 -+n -+u -s 8000MB -t 4
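
Each iozone writer simply write(2)s large user buffers, and on i386 a
write() copies the user buffer with copy_from_user(), which ends up in
__copy_from_user_ll(). So any program that writes big chunks exercises the
same path; a trivial stand-in (file name and sizes are arbitrary) would be:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        size_t chunk = 1 << 20;  /* 1 MB per write, far above the 512-byte cutoff */
        char *buf = malloc(chunk);
        int fd = open("/tmp/nt-copy-test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        int i;

        if (!buf || fd < 0)
                return 1;
        memset(buf, 0x5a, chunk);
        for (i = 0; i < 1024; i++)  /* 1 GB total */
                if (write(fd, buf, chunk) != (ssize_t)chunk)
                        return 1;
        close(fd);
        free(buf);
        return 0;
}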

What do you think?

--- linux-2.6.12.4.orig/arch/i386/lib/usercopy.c 2005-08-05 16:04:37.000000000 +0900
+++ linux-2.6.12.4/arch/i386/lib/usercopy.c 2005-08-12 13:18:14.106916200 +0900
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
 #include <linux/module.h>
+#include <asm/i387.h>
 #include <asm/uaccess.h>
 #include <asm/mmx.h>

@@ -511,6 +512,108 @@
 		: "memory"); \
 } while (0)

+/* Non Temporal Hint version of mmx_memcpy */
+/* It is cache aware */
+/* hyoshiok@xxxxxxxxxxxxxxxx */
+static unsigned long _mmx_memcpy_nt(void *to, const void *from, size_t len)
+{
+        /* Note! gcc doesn't seem to align stack variables properly, so we
+         * need to make use of unaligned loads and stores.
+         */
+        void *p;
+        int i;
+
+        if (unlikely(in_interrupt())){
+                __copy_user_zeroing(to, from, len);
+                return len;
+        }
+
+        p = to;
+        i = len >> 6; /* len/64 */
+
+        kernel_fpu_begin();
+
+        __asm__ __volatile__ (
+                "1: prefetchnta (%0)\n" /* This set is 28 bytes */
+                " prefetchnta 64(%0)\n"
+                " prefetchnta 128(%0)\n"
+                " prefetchnta 192(%0)\n"
+                " prefetchnta 256(%0)\n"
+                "2: \n"
+                ".section .fixup, \"ax\"\n"
+                "3: movw $0x1AEB, 1b\n" /* jmp on 26 bytes */
+                " jmp 2b\n"
+                ".previous\n"
+                ".section __ex_table,\"a\"\n"
+                " .align 4\n"
+                " .long 1b, 3b\n"
+                ".previous"
+                : : "r" (from) );
+
+        for(; i>5; i--)
+        {
+                __asm__ __volatile__ (
+                        "1: prefetchnta 320(%0)\n"
+                        "2: movq (%0), %%mm0\n"
+                        " movq 8(%0), %%mm1\n"
+                        " movq 16(%0), %%mm2\n"
+                        " movq 24(%0), %%mm3\n"
+                        " movntq %%mm0, (%1)\n"
+                        " movntq %%mm1, 8(%1)\n"
+                        " movntq %%mm2, 16(%1)\n"
+                        " movntq %%mm3, 24(%1)\n"
+                        " movq 32(%0), %%mm0\n"
+                        " movq 40(%0), %%mm1\n"
+                        " movq 48(%0), %%mm2\n"
+                        " movq 56(%0), %%mm3\n"
+                        " movntq %%mm0, 32(%1)\n"
+                        " movntq %%mm1, 40(%1)\n"
+                        " movntq %%mm2, 48(%1)\n"
+                        " movntq %%mm3, 56(%1)\n"
+                        ".section .fixup, \"ax\"\n"
+                        "3: movw $0x05EB, 1b\n" /* jmp on 5 bytes */
+                        " jmp 2b\n"
+                        ".previous\n"
+                        ".section __ex_table,\"a\"\n"
+                        " .align 4\n"
+                        " .long 1b, 3b\n"
+                        ".previous"
+                        : : "r" (from), "r" (to) : "memory");
+                from+=64;
+                to+=64;
+        }
+
+        for(; i>0; i--)
+        {
+                __asm__ __volatile__ (
+                        " movq (%0), %%mm0\n"
+                        " movq 8(%0), %%mm1\n"
+                        " movq 16(%0), %%mm2\n"
+                        " movq 24(%0), %%mm3\n"
+                        " movntq %%mm0, (%1)\n"
+                        " movntq %%mm1, 8(%1)\n"
+                        " movntq %%mm2, 16(%1)\n"
+                        " movntq %%mm3, 24(%1)\n"
+                        " movq 32(%0), %%mm0\n"
+                        " movq 40(%0), %%mm1\n"
+                        " movq 48(%0), %%mm2\n"
+                        " movq 56(%0), %%mm3\n"
+                        " movntq %%mm0, 32(%1)\n"
+                        " movntq %%mm1, 40(%1)\n"
+                        " movntq %%mm2, 48(%1)\n"
+                        " movntq %%mm3, 56(%1)\n"
+                        : : "r" (from), "r" (to) : "memory");
+                from+=64;
+                to+=64;
+        }
+        /*
+         * Now do the tail of the block
+         */
+        kernel_fpu_end();
+        if(i=(len&63))
+                __copy_user_zeroing(to, from, i);
+        return i;
+}

 unsigned long __copy_to_user_ll(void __user *to, const void *from,
                                 unsigned long n)
 {
@@ -575,10 +678,14 @@
 __copy_from_user_ll(void *to, const void __user *from, unsigned long n)
 {
         BUG_ON((long)n < 0);
-        if (movsl_is_ok(to, from, n))
+        if (n < 512) {
+                if (movsl_is_ok(to, from, n))
         __copy_user_zeroing(to, from, n);
-        else
+                else
         n = __copy_user_zeroing_intel(to, from, n);
+        }
+        else
+                n = _mmx_memcpy_nt(to, from, n);
         return n;
 }

Thanks in advance,
Hiro
--
hyoshiok at miraclelinux.com