Re: [PATCH] x86/uaccess: use unrolled string copy for short strings

From: Ingo Molnar
Date: Thu Jun 22 2017 - 04:47:42 EST



* Paolo Abeni <pabeni@xxxxxxxxxx> wrote:

> The 'rep' prefix has a significant setup cost; as a result, for
> short strings, copies with unrolled loops are faster than even
> the optimized 'rep'-based string copy variants.
>
> This change updates __copy_user_generic() to use the unrolled
> version for short strings. The threshold length for a short
> string - 64 bytes - was selected through empirical measurement as
> the largest value that still ensures a measurable gain.
>
> A micro-benchmark of __copy_from_user() with different lengths shows
> the following:
>
> len (bytes) | vanilla (ticks) | patched (ticks) | delta in ticks (%)
>
> 0 58 26 32(55%)
> 1 49 29 20(40%)
> 2 49 31 18(36%)
> 3 49 32 17(34%)
> 4 50 34 16(32%)
> 5 49 35 14(28%)
> 6 49 36 13(26%)
> 7 49 38 11(22%)
> 8 50 31 19(38%)
> 9 51 33 18(35%)
> 10 52 36 16(30%)
> 11 52 37 15(28%)
> 12 52 38 14(26%)
> 13 52 40 12(23%)
> 14 52 41 11(21%)
> 15 52 42 10(19%)
> 16 51 34 17(33%)
> 17 51 35 16(31%)
> 18 52 37 15(28%)
> 19 51 38 13(25%)
> 20 52 39 13(25%)
> 21 52 40 12(23%)
> 22 51 42 9(17%)
> 23 51 46 5(9%)
> 24 52 35 17(32%)
> 25 52 37 15(28%)
> 26 52 38 14(26%)
> 27 52 39 13(25%)
> 28 52 40 12(23%)
> 29 53 42 11(20%)
> 30 52 43 9(17%)
> 31 52 44 8(15%)
> 32 51 36 15(29%)
> 33 51 38 13(25%)
> 34 51 39 12(23%)
> 35 51 41 10(19%)
> 36 52 41 11(21%)
> 37 52 43 9(17%)
> 38 51 44 7(13%)
> 39 52 46 6(11%)
> 40 51 37 14(27%)
> 41 50 38 12(24%)
> 42 50 39 11(22%)
> 43 50 40 10(20%)
> 44 50 42 8(16%)
> 45 50 43 7(14%)
> 46 50 43 7(14%)
> 47 50 45 5(10%)
> 48 50 37 13(26%)
> 49 49 38 11(22%)
> 50 50 40 10(20%)
> 51 50 42 8(16%)
> 52 50 42 8(16%)
> 53 49 46 3(6%)
> 54 50 46 4(8%)
> 55 49 48 1(2%)
> 56 50 39 11(22%)
> 57 50 40 10(20%)
> 58 49 42 7(14%)
> 59 50 42 8(16%)
> 60 50 46 4(8%)
> 61 50 47 3(6%)
> 62 50 48 2(4%)
> 63 50 48 2(4%)
> 64 51 38 13(25%)
>
> Above 64 bytes the gain fades away.
>
> Very similar values are collected for __copy_to_user().
> UDP receive performance under flood with small packets using recvfrom()
> increases by ~5%.

What CPU model(s) were used for the performance testing, and was it tested on
several different types of CPUs?
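
For a quick cross-check on other machines, a userspace approximation of a
per-length measurement could look roughly like the sketch below. It is only an
illustration: bench_copy_ns() is a made-up helper, it times plain memcpy()
rather than __copy_from_user(), and it reports nanoseconds of CPU time via
clock() instead of TSC ticks, so the absolute numbers are not comparable to the
table above.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: average cost in nanoseconds of one memcpy() of
 * 'len' bytes, measured over 'iters' iterations using clock().
 * Returns a negative value on invalid arguments.
 * This is a userspace stand-in, not the kernel micro-benchmark.
 */
static double bench_copy_ns(size_t len, unsigned long iters)
{
	static unsigned char src[256], dst[256];
	volatile unsigned char sink = 0;
	clock_t c0, c1;
	unsigned long i;

	if (len > sizeof(src) || iters == 0)
		return -1.0;

	c0 = clock();
	for (i = 0; i < iters; i++) {
		/* Vary the input so the copy is less likely to be hoisted. */
		src[0] = (unsigned char)i;
		memcpy(dst, src, len);
		/* Consume the result so the copy is not optimized away. */
		sink ^= dst[0];
	}
	c1 = clock();
	(void)sink;

	return (double)(c1 - c0) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}
```

Calling it for each length from 0 to 64 with a large iteration count, on each CPU
model of interest, would give one column per machine to compare.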

Please add a comment here:

+ if (len <= 64)
+ return copy_user_generic_unrolled(to, from, len);
+

... because it's not obvious at all that this is a performance optimization, not a
correctness issue. Also explain that '64' is a number that we got from performance
measurements.
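
To illustrate the kind of comment meant here, a userspace mock of the dispatch
might read as follows. This is a sketch under assumptions, not the kernel code:
the *_mock functions and the SHORT_COPY_THRESHOLD name are invented for the
example (the patch uses a bare 64), and memcpy() stands in for the 'rep'-based
variants.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical name for the cutoff. The value was chosen from
 * micro-benchmark results, not from any correctness requirement.
 */
#define SHORT_COPY_THRESHOLD 64

/*
 * Userspace stand-in for copy_user_generic_unrolled(): copies 8 bytes
 * per step, then a byte-wise tail. Returns the number of bytes NOT
 * copied, which is always 0 here because nothing can fault in this mock.
 */
static unsigned long copy_unrolled_mock(void *to, const void *from,
					unsigned long len)
{
	unsigned char *d = to;
	const unsigned char *s = from;

	while (len >= 8) {
		memcpy(d, s, 8);
		d += 8;
		s += 8;
		len -= 8;
	}
	while (len--)
		*d++ = *s++;
	return 0;
}

/* Userspace stand-in for the 'rep'-based copy variants. */
static unsigned long copy_rep_mock(void *to, const void *from,
				   unsigned long len)
{
	memcpy(to, from, len);
	return 0;
}

static unsigned long copy_user_generic_mock(void *to, const void *from,
					    unsigned long len)
{
	/*
	 * Performance optimization only, not a correctness fix:
	 * the 'rep' prefix has a setup cost that dominates for short
	 * copies, so use the unrolled loop up to the threshold. The
	 * cutoff of 64 bytes is the largest length that still showed a
	 * measurable gain in micro-benchmarks.
	 */
	if (len <= SHORT_COPY_THRESHOLD)
		return copy_unrolled_mock(to, from, len);

	return copy_rep_mock(to, from, len);
}
```

Both paths produce the same result for any length; only the cost differs, which
is the point the comment has to make.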

But in general I like the change - as long as it was measured on reasonably modern
x86 CPUs. I.e. it should not regress on modern Intel or AMD CPUs.

Thanks,

Ingo