Re: [tip: x86/misc] x86/csum: Improve performance of `csum_partial`

From: Geert Uytterhoeven
Date: Sun Jul 02 2023 - 11:19:58 EST


Hi Noah,

On Thu, May 25, 2023 at 8:04 PM tip-bot2 for Noah Goldstein
<tip-bot2@xxxxxxxxxxxxx> wrote:
> The following commit has been merged into the x86/misc branch of tip:
>
> Commit-ID: 688eb8191b475db5acfd48634600b04fd3dda9ad
> Gitweb: https://git.kernel.org/tip/688eb8191b475db5acfd48634600b04fd3dda9ad
> Author: Noah Goldstein <goldstein.w.n@xxxxxxxxx>
> AuthorDate: Wed, 10 May 2023 20:10:02 -05:00
> Committer: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> CommitterDate: Thu, 25 May 2023 10:55:18 -07:00
>
> x86/csum: Improve performance of `csum_partial`
>
> 1) Add special case for len == 40 as that is the hottest value. The
> nets a ~8-9% latency improvement and a ~30% throughput improvement
> in the len == 40 case.
>
> 2) Use multiple accumulators in the 64-byte loop. This dramatically
> improves ILP and results in up to a 40% latency/throughput
> improvement (better for more iterations).
>
> Results from benchmarking on Icelake. Times measured with rdtsc()
> len lat_new lat_old r tput_new tput_old r
> 8 3.58 3.47 1.032 3.58 3.51 1.021
> 16 4.14 4.02 1.028 3.96 3.78 1.046
> 24 4.99 5.03 0.992 4.23 4.03 1.050
> 32 5.09 5.08 1.001 4.68 4.47 1.048
> 40 5.57 6.08 0.916 3.05 4.43 0.690
> 48 6.65 6.63 1.003 4.97 4.69 1.059
> 56 7.74 7.72 1.003 5.22 4.95 1.055
> 64 6.65 7.22 0.921 6.38 6.42 0.994
> 96 9.43 9.96 0.946 7.46 7.54 0.990
> 128 9.39 12.15 0.773 8.90 8.79 1.012
> 200 12.65 18.08 0.699 11.63 11.60 1.002
> 272 15.82 23.37 0.677 14.43 14.35 1.005
> 440 24.12 36.43 0.662 21.57 22.69 0.951
> 952 46.20 74.01 0.624 42.98 53.12 0.809
> 1024 47.12 78.24 0.602 46.36 58.83 0.788
> 1552 72.01 117.30 0.614 71.92 96.78 0.743
> 2048 93.07 153.25 0.607 93.28 137.20 0.680
> 2600 114.73 194.30 0.590 114.28 179.32 0.637
> 3608 156.34 268.41 0.582 154.97 254.02 0.610
> 4096 175.01 304.03 0.576 175.89 292.08 0.602
>
> There is no such thing as a free lunch, however, and the special case
> for len == 40 does add overhead to the len != 40 cases. This seems to
> amount to be ~5% throughput and slightly less in terms of latency.
>
> Testing:
> Part of this change is a new kunit test. The tests check all
> alignment X length pairs in [0, 64) X [0, 512).
> There are three cases.
> 1) Precomputed random inputs/seed. The expected results where
> generated use the generic implementation (which is assumed to be
> non-buggy).
> 2) An input of all 1s. The goal of this test is to catch any case
> a carry is missing.
> 3) An input that never carries. The goal of this test si to catch
> any case of incorrectly carrying.
>
> More exhaustive tests that test all alignment X length pairs in
> [0, 8192) X [0, 8192] on random data are also available here:
> https://github.com/goldsteinn/csum-reproduction
>
> The reposity also has the code for reproducing the above benchmark
> numbers.
>
> Signed-off-by: Noah Goldstein <goldstein.w.n@xxxxxxxxx>
> Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>

Thanks for your patch, which is now commit 688eb8191b475db5 ("x86/csum:
Improve performance of `csum_partial`") in linus/master stable/master

> Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.com

This does not seem to be a message sent to a public mailing list
archived at lore (yet).

On m68k (ARAnyM):

KTAP version 1
# Subtest: checksum
1..3
# test_csum_fixed_random_inputs: ASSERTION FAILED at
lib/checksum_kunit.c:243
Expected result == expec, but
result == 54991 (0xd6cf)
expec == 33316 (0x8224)
not ok 1 test_csum_fixed_random_inputs
# test_csum_all_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:267
Expected result == expec, but
result == 255 (0xff)
expec == 65280 (0xff00)

Endianness issue in the test?

not ok 2 test_csum_all_carry_inputs
# test_csum_no_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:306
Expected result == expec, but
result == 64515 (0xfc03)
expec == 0 (0x0)
not ok 3 test_csum_no_carry_inputs
# checksum: pass:0 fail:3 skip:0 total:3
# Totals: pass:0 fail:3 skip:0 total:3
not ok 1 checksum

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@xxxxxxxxxxxxxx

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds