Re: [PATCH] parisc: Fix csum_ipv6_magic on 64-bit systems

From: Charlie Jenkins
Date: Fri Feb 16 2024 - 22:00:28 EST


On Fri, Feb 16, 2024 at 01:38:51PM +0100, Helge Deller wrote:
> * Guenter Roeck <linux@xxxxxxxxxxxx>:
> > hppa 64-bit systems calculate the IPv6 checksum using 64-bit add
> > operations. The last add folds protocol and length fields into the 64-bit
> > result. While unlikely, this operation can overflow. The overflow can be
> > triggered with a code sequence such as the following.
> >
> > /* try to trigger massive overflows */
> > memset(tmp_buf, 0xff, sizeof(struct in6_addr));
> > csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
> > (struct in6_addr *)tmp_buf,
> > 0xffff, 0xff, 0xffffffff);
> >
> > Fix the problem by adding any overflows from the final add operation into
> > the calculated checksum. Fortunately, we can do this without additional
> > cost by replacing the add operation used to fold the checksum into 32 bits
> > with "add,dc" to add in the missing carry.
>
>
> Guenter,
>
> Thanks for your patch for 32- and 64-bit systems.
> But I think it's time to sunset the parisc inline assembly for ipv4/ipv6
> checksumming.
> The patch below converts the code to use standard C-coding (taken from the
> s390 header file) and it survives the v8 lib/checksum patch.
>
> Opinions?
>
> Helge
>
> Subject: [PATCH] parisc: Fix csum_ipv6_magic on 32- and 64-bit machines
>
> Guenter noticed that the 32- and 64-bit checksum functions may calculate
> wrong values under certain circumstances. He fixed it by using
> corrected carry-flag handling, but overall I think it's better to convert
> away from inline assembly to generic C-coding. That way the code is
> much cleaner and the compiler can do much better optimizations based on
> the target architecture (32- vs. 64-bit). Furthermore, compilers
> nowadays generate much better code, so inline assembly usually no
> longer gives any benefit.
>
> Signed-off-by: Helge Deller <deller@xxxxxx>
> Noticed-by: Guenter Roeck <linux@xxxxxxxxxxxx>
>
> diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h
> index 3c43baca7b39..c72f14536353 100644
> --- a/arch/parisc/include/asm/checksum.h
> +++ b/arch/parisc/include/asm/checksum.h
> @@ -18,160 +18,93 @@
> */
> extern __wsum csum_partial(const void *, int, __wsum);
>
> +
> /*
> - * Optimized for IP headers, which always checksum on 4 octet boundaries.
> - *
> - * Written by Randolph Chung <tausq@xxxxxxxxxx>, and then mucked with by
> - * LaMont Jones <lamont@xxxxxxxxxx>
> + * Fold a partial checksum without adding pseudo headers.
> */
> -static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
> +static inline __sum16 csum_fold(__wsum sum)
> {
> - unsigned int sum;
> - unsigned long t0, t1, t2;
> + u32 csum = (__force u32) sum;
>
> - __asm__ __volatile__ (
> -" ldws,ma 4(%1), %0\n"
> -" addib,<= -4, %2, 2f\n"
> -"\n"
> -" ldws 4(%1), %4\n"
> -" ldws 8(%1), %5\n"
> -" add %0, %4, %0\n"
> -" ldws,ma 12(%1), %3\n"
> -" addc %0, %5, %0\n"
> -" addc %0, %3, %0\n"
> -"1: ldws,ma 4(%1), %3\n"
> -" addib,< 0, %2, 1b\n"
> -" addc %0, %3, %0\n"
> -"\n"
> -" extru %0, 31, 16, %4\n"
> -" extru %0, 15, 16, %5\n"
> -" addc %4, %5, %0\n"
> -" extru %0, 15, 16, %5\n"
> -" add %0, %5, %0\n"
> -" subi -1, %0, %0\n"
> -"2:\n"
> - : "=r" (sum), "=r" (iph), "=r" (ihl), "=r" (t0), "=r" (t1), "=r" (t2)
> - : "1" (iph), "2" (ihl)
> - : "memory");
> -
> - return (__force __sum16)sum;
> + csum += (csum >> 16) | (csum << 16);
> + csum >>= 16;
> + return (__force __sum16) ~csum;
> }

For all of my analysis I am using gcc 12.3.0.

We can do better than this! By inspection this looks like a performance
regression. The generic version of csum_fold in
include/asm-generic/checksum.h is better than this, so it should be
used instead.
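
For reference, if I remember right, the generic version boils down to
roughly this, which lets the compiler turn the end-around carry into a
subtraction:

static inline __sum16 csum_fold(__wsum csum)
{
	u32 sum = (__force u32)csum;

	/* ~sum - ror32(sum, 16) folds the end-around carry in one go */
	return (__force __sum16)((~sum - ror32(sum, 16)) >> 16);
}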

I am not familiar with hppa assembly, but here is the assembly for the
original csum_fold:

0000000000000000 <csum_fold>:
0: 08 03 02 41 copy r3,r1
4: 08 1e 02 43 copy sp,r3
8: 73 c1 00 88 std,ma r1,40(sp)
c: d3 5a 09 fc shrpw r26,r26,16,ret0
10: 0b 9a 0a 1c add,l r26,ret0,ret0
14: 0b 80 09 9c uaddcm r0,ret0,ret0
18: db 9c 09 f0 extrd,u,* ret0,47,16,ret0
1c: 34 7e 00 80 ldo 40(r3),sp
20: e8 40 d0 00 bve (rp)
24: 53 c3 3f 8d ldd,mb -40(sp),r3

This is the assembly for the generic version:
0000000000000000 <csum_fold_generic>:
0: 08 03 02 41 copy r3,r1
4: 08 1e 02 43 copy sp,r3
8: 73 c1 00 88 std,ma r1,40(sp)
c: 0b 40 09 9c uaddcm r0,r26,ret0
10: d3 5a 09 fa shrpw r26,r26,16,r26
14: 0b 5c 04 1c sub ret0,r26,ret0
18: db 9c 09 f0 extrd,u,* ret0,47,16,ret0
1c: 34 7e 00 80 ldo 40(r3),sp
20: e8 40 d0 00 bve (rp)
24: 53 c3 3f 8d ldd,mb -40(sp),r3

This is the assembly for yours:
0000000000000028 <csum_fold_new>:
28: 08 03 02 41 copy r3,r1
2c: 08 1e 02 43 copy sp,r3
30: 73 c1 00 88 std,ma r1,40(sp)
34: d3 5a 09 fc shrpw r26,r26,16,ret0
38: 0b 9a 0a 1c add,l r26,ret0,ret0
3c: d3 9c 19 f0 extrw,u ret0,15,16,ret0
40: 0b 80 09 9c uaddcm r0,ret0,ret0
44: db 9c 0b f0 extrd,u,* ret0,63,16,ret0
48: 34 7e 00 80 ldo 40(r3),sp
4c: e8 40 d0 00 bve (rp)
50: 53 c3 3f 8d ldd,mb -40(sp),r3
54: 00 00 00 00 break 0,0

The assembly is almost the same for the generic and the original code,
except that rearranging the shift and add allows the addition to become
a subtraction, which shortens the dependency chain on some
architectures (it doesn't look like it changes much here). However,
the new code introduces an additional extrw instruction.

>
> /*
> - * Fold a partial checksum
> + * This is a version of ip_compute_csum() optimized for IP headers,
> + * which always checksums on 4 octet boundaries.
> */
> -static inline __sum16 csum_fold(__wsum csum)
> +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
> {
> - u32 sum = (__force u32)csum;
> - /* add the swapped two 16-bit halves of sum,
> - a possible carry from adding the two 16-bit halves,
> - will carry from the lower half into the upper half,
> - giving us the correct sum in the upper half. */
> - sum += (sum << 16) + (sum >> 16);
> - return (__force __sum16)(~sum >> 16);
> + __u64 csum = 0;
> + __u32 *ptr = (u32 *)iph;
> +
> + csum += *ptr++;
> + csum += *ptr++;
> + csum += *ptr++;
> + csum += *ptr++;
> + ihl -= 4;
> + while (ihl--)
> + csum += *ptr++;
> + csum += (csum >> 32) | (csum << 32);
> + return csum_fold((__force __wsum)(csum >> 32));
> }

This doesn't leverage add with carry well, which causes the code size
to be dramatically larger than the original assembly and, I assume,
correlates with an increased execution time.

This is the original assembly in 64-bit (almost the same in 32-bit):
0000000000000028 <ip_fast_csum>:
28: 08 03 02 41 copy r3,r1
2c: 08 1e 02 43 copy sp,r3
30: 73 c1 00 88 std,ma r1,40(sp)
34: 0f 48 10 bc ldw,ma 4(r26),ret0
38: a7 39 60 70 addib,<= -4,r25,78 <ip_fast_csum+0x50>
3c: 0f 48 10 93 ldw 4(r26),r19
40: 0f 50 10 94 ldw 8(r26),r20
44: 0a 7c 06 1c add ret0,r19,ret0
48: 0f 58 10 bf ldw,ma c(r26),r31
4c: 0a 9c 07 1c add,c ret0,r20,ret0
50: 0b fc 07 1c add,c ret0,r31,ret0
54: 0f 48 10 bf ldw,ma 4(r26),r31
58: a7 20 5f ed addib,< 0,r25,54 <ip_fast_csum+0x2c>
5c: 0b fc 07 1c add,c ret0,r31,ret0
60: d3 93 1b f0 extrw,u ret0,31,16,r19
64: d3 94 19 f0 extrw,u ret0,15,16,r20
68: 0a 93 07 1c add,c r19,r20,ret0
6c: d3 94 19 f0 extrw,u ret0,15,16,r20
70: 0a 9c 06 1c add ret0,r20,ret0
74: 97 9c 07 ff subi -1,ret0,ret0
78: db 9c 0b f0 extrd,u,* ret0,63,16,ret0
7c: 34 7e 00 80 ldo 40(r3),sp
80: e8 40 d0 00 bve (rp)
84: 53 c3 3f 8d ldd,mb -40(sp),r3

This is the 64-bit assembly of your proposal:

0000000000000058 <ip_fast_csum_new>:
58: 08 03 02 41 copy r3,r1
5c: 0f c2 12 c1 std rp,-10(sp)
60: 08 1e 02 43 copy sp,r3
64: 73 c1 01 08 std,ma r1,80(sp)
68: 37 39 3f f9 ldo -4(r25),r25
6c: 0c 64 12 d0 std r4,8(r3)
70: 0f 48 10 9f ldw 4(r26),r31
74: db 39 0b e0 extrd,u,* r25,63,32,r25
78: 37 5a 00 20 ldo 10(r26),r26
7c: 0f 41 10 9c ldw -10(r26),ret0
80: 0b fc 0a 1c add,l ret0,r31,ret0
84: 0f 51 10 9f ldw -8(r26),r31
88: 0b 9f 0a 1f add,l r31,ret0,r31
8c: 0f 59 10 9c ldw -4(r26),ret0
90: 0b fc 0a 1c add,l ret0,r31,ret0
94: 37 3f 3f ff ldo -1(r25),r31
98: 8f ff 20 68 cmpib,<> -1,r31,d4 <ip_fast_csum_new+0x7c>
9c: db f9 0b e0 extrd,u,* r31,63,32,r25
a0: d3 9c 07 fa shrpd,* ret0,ret0,32,r26
a4: 37 dd 3f a1 ldo -30(sp),ret1
a8: 0b 9a 0a 1a add,l r26,ret0,r26
ac: 00 00 14 a1 mfia r1
b0: 28 20 00 00 addil L%0,r1,r1
b4: 34 21 00 00 ldo 0(r1),r1
b8: e8 20 f0 00 bve,l (r1),rp
bc: db 5a 03 e0 extrd,u,* r26,31,32,r26
c0: 0c 70 10 c4 ldd 8(r3),r4
c4: 0c 61 10 c2 ldd -10(r3),rp
c8: 34 7e 00 80 ldo 40(r3),sp
cc: e8 40 d0 00 bve (rp)
d0: 53 c3 3f 8d ldd,mb -40(sp),r3
d4: 0f 40 10 9f ldw 0(r26),r31
d8: 37 5a 00 08 ldo 4(r26),r26
dc: e8 1f 1f 65 b,l 94 <ip_fast_csum_new+0x3c>,r0
e0: 0b fc 0a 1c add,l ret0,r31,ret0
e4: 00 00 00 00 break 0,0

The 32-bit assembly is even longer.

Maybe you can get some inspiration from my riscv implementation in
arch/riscv/include/asm/checksum.h. You can swap out the line:
csum += csum < ((const unsigned int *)iph)[pos];
with the addc macro defined in arch/parisc/lib/checksum.c.

I would guess that this will improve the code, but I haven't checked.
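
Untested, but a rough sketch of what I mean (the carry-compare on
32-bit could just as well be the addc macro):

static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
{
	const unsigned int *ptr = iph;
	unsigned long csum = 0;
	unsigned int pos = 0;

	do {
		csum += ptr[pos];
#ifndef CONFIG_64BIT
		/* fold the carry from each add straight back in */
		csum += csum < ptr[pos];
#endif
	} while (++pos < ihl);

#ifdef CONFIG_64BIT
	/* fold the upper 32 bits back into the lower 32 bits */
	csum += (csum >> 32) | (csum << 32);
	csum >>= 32;
#endif
	return csum_fold((__force __wsum)csum);
}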

> -
> -static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
> - __u32 len, __u8 proto,
> - __wsum sum)
> +
> +/*
> + * Computes the checksum of the TCP/UDP pseudo-header.
> + * Returns a 32-bit checksum.
> + */
> +static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, __u32 len,
> + __u8 proto, __wsum sum)
> {
> - __asm__(
> - " add %1, %0, %0\n"
> - " addc %2, %0, %0\n"
> - " addc %3, %0, %0\n"
> - " addc %%r0, %0, %0\n"
> - : "=r" (sum)
> - : "r" (daddr), "r"(saddr), "r"(proto+len), "0"(sum));
> - return sum;
> + __u64 csum = (__force __u64)sum;
> +
> + csum += (__force __u32)saddr;
> + csum += (__force __u32)daddr;
> + csum += len;
> + csum += proto;
> + csum += (csum >> 32) | (csum << 32);
> + return (__force __wsum)(csum >> 32);
> }

The assembly from the original:

0000000000000088 <csum_tcpudp_nofold>:
88: 08 03 02 41 copy r3,r1
8c: 08 1e 02 43 copy sp,r3
90: 73 c1 00 88 std,ma r1,40(sp)
94: da f7 0b f8 extrd,u,* r23,63,8,r23
98: 0a f8 0a 17 add,l r24,r23,r23
9c: 0a d9 06 16 add r25,r22,r22
a0: 0a da 07 16 add,c r26,r22,r22
a4: 0a d7 07 16 add,c r23,r22,r22
a8: 0a c0 07 16 add,c r0,r22,r22
ac: da dc 0b e0 extrd,u,* r22,63,32,ret0
b0: 34 7e 00 80 ldo 40(r3),sp
b4: e8 40 d0 00 bve (rp)
b8: 53 c3 3f 8d ldd,mb -40(sp),r3
bc: 00 00 00 00 break 0,0

The 64-bit assembly from your proposal:

0000000000000140 <csum_tcpudp_nofold_new>:
140: 08 03 02 41 copy r3,r1
144: 08 1e 02 43 copy sp,r3
148: 73 c1 00 88 std,ma r1,40(sp)
14c: db 5a 0b e0 extrd,u,* r26,63,32,r26
150: db 39 0b e0 extrd,u,* r25,63,32,r25
154: da f7 0b f8 extrd,u,* r23,63,8,r23
158: db 18 0b e0 extrd,u,* r24,63,32,r24
15c: da d6 0b e0 extrd,u,* r22,63,32,r22
160: 0a f8 0a 18 add,l r24,r23,r24
164: 0b 38 0a 18 add,l r24,r25,r24
168: 0b 58 0a 18 add,l r24,r26,r24
16c: 0a d8 0a 16 add,l r24,r22,r22
170: d2 d6 07 fc shrpd,* r22,r22,32,ret0
174: 0a dc 0a 1c add,l ret0,r22,ret0
178: db 9c 03 e0 extrd,u,* ret0,31,32,ret0
17c: 34 7e 00 80 ldo 40(r3),sp
180: e8 40 d0 00 bve (rp)
184: 53 c3 3f 8d ldd,mb -40(sp),r3

There are a lot of extra extrd instructions, and again the result is
really bad on 32-bit parisc.

Maybe there is a good way to represent this in C, but the existing
assembly for this function is already so simple and clean.
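
If you did want a C version here, one option (untested) is to stay in
32 bits and fold the carry after each add instead of going through a
__u64:

static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
					__u32 len, __u8 proto, __wsum sum)
{
	u32 csum = (__force u32)sum;
	u32 tmp;

	tmp = (__force u32)saddr;
	csum += tmp;
	csum += csum < tmp;	/* end-around carry */
	tmp = (__force u32)daddr;
	csum += tmp;
	csum += csum < tmp;
	tmp = len + proto;
	csum += tmp;
	csum += csum < tmp;
	return (__force __wsum)csum;
}

No idea whether gcc 12.3 turns that into the add/addc sequence, though.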

>
> /*
> - * computes the checksum of the TCP/UDP pseudo-header
> - * returns a 16-bit checksum, already complemented
> + * Computes the checksum of the TCP/UDP pseudo-header.
> + * Returns a 16-bit checksum, already complemented.
> */
> -static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
> - __u32 len, __u8 proto,
> - __wsum sum)
> +static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
> + __u8 proto, __wsum sum)
> {
> - return csum_fold(csum_tcpudp_nofold(saddr,daddr,len,proto,sum));
> + return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
> }
>
The implementation of csum_tcpudp_magic is the same as the generic, so
the generic version should be used instead.

> /*
> - * this routine is used for miscellaneous IP-like checksums, mainly
> - * in icmp.c
> + * Used for miscellaneous IP-like checksums, mainly icmp.
> */
> -static inline __sum16 ip_compute_csum(const void *buf, int len)
> +static inline __sum16 ip_compute_csum(const void *buff, int len)
> {
> - return csum_fold (csum_partial(buf, len, 0));
> + return csum_fold(csum_partial(buff, len, 0));
> }

The generic version is better than this, so it should be used instead.

>
> -
> #define _HAVE_ARCH_IPV6_CSUM
> -static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> - const struct in6_addr *daddr,
> - __u32 len, __u8 proto,
> - __wsum sum)
> +static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> + const struct in6_addr *daddr,
> + __u32 len, __u8 proto, __wsum csum)
> {
> - unsigned long t0, t1, t2, t3;
> -
> - len += proto; /* add 16-bit proto + len */
> -
> - __asm__ __volatile__ (
> -
> -#if BITS_PER_LONG > 32
> -
> - /*
> - ** We can execute two loads and two adds per cycle on PA 8000.
> - ** But add insn's get serialized waiting for the carry bit.
> - ** Try to keep 4 registers with "live" values ahead of the ALU.
> - */
> -
> -" ldd,ma 8(%1), %4\n" /* get 1st saddr word */
> -" ldd,ma 8(%2), %5\n" /* get 1st daddr word */
> -" add %4, %0, %0\n"
> -" ldd,ma 8(%1), %6\n" /* 2nd saddr */
> -" ldd,ma 8(%2), %7\n" /* 2nd daddr */
> -" add,dc %5, %0, %0\n"
> -" add,dc %6, %0, %0\n"
> -" add,dc %7, %0, %0\n"
> -" add,dc %3, %0, %0\n" /* fold in proto+len | carry bit */
> -" extrd,u %0, 31, 32, %4\n"/* copy upper half down */
> -" depdi 0, 31, 32, %0\n"/* clear upper half */
> -" add %4, %0, %0\n" /* fold into 32-bits */
> -" addc 0, %0, %0\n" /* add carry */
> -
> -#else
> -
> - /*
> - ** For PA 1.x, the insn order doesn't matter as much.
> - ** Insn stream is serialized on the carry bit here too.
> - ** result from the previous operation (eg r0 + x)
> - */
> -" ldw,ma 4(%1), %4\n" /* get 1st saddr word */
> -" ldw,ma 4(%2), %5\n" /* get 1st daddr word */
> -" add %4, %0, %0\n"
> -" ldw,ma 4(%1), %6\n" /* 2nd saddr */
> -" addc %5, %0, %0\n"
> -" ldw,ma 4(%2), %7\n" /* 2nd daddr */
> -" addc %6, %0, %0\n"
> -" ldw,ma 4(%1), %4\n" /* 3rd saddr */
> -" addc %7, %0, %0\n"
> -" ldw,ma 4(%2), %5\n" /* 3rd daddr */
> -" addc %4, %0, %0\n"
> -" ldw,ma 4(%1), %6\n" /* 4th saddr */
> -" addc %5, %0, %0\n"
> -" ldw,ma 4(%2), %7\n" /* 4th daddr */
> -" addc %6, %0, %0\n"
> -" addc %7, %0, %0\n"
> -" addc %3, %0, %0\n" /* fold in proto+len, catch carry */
> -
> -#endif
> - : "=r" (sum), "=r" (saddr), "=r" (daddr), "=r" (len),
> - "=r" (t0), "=r" (t1), "=r" (t2), "=r" (t3)
> - : "0" (sum), "1" (saddr), "2" (daddr), "3" (len)
> - : "memory");
> - return csum_fold(sum);
> + __u64 sum = (__force __u64)csum;
> +
> + sum += (__force __u32)saddr->s6_addr32[0];
> + sum += (__force __u32)saddr->s6_addr32[1];
> + sum += (__force __u32)saddr->s6_addr32[2];
> + sum += (__force __u32)saddr->s6_addr32[3];
> + sum += (__force __u32)daddr->s6_addr32[0];
> + sum += (__force __u32)daddr->s6_addr32[1];
> + sum += (__force __u32)daddr->s6_addr32[2];
> + sum += (__force __u32)daddr->s6_addr32[3];
> + sum += len;
> + sum += proto;
> + sum += (sum >> 32) | (sum << 32);
> + return csum_fold((__force __wsum)(sum >> 32));
> }
>
> #endif
> -

It's a similar story here: the add with carry is not translated well
into C, resulting in significantly worse assembly. Using __u64 seems to
be a big contributing factor in why the 32-bit assembly is worse.

I am not sure of the best way to represent this cleanly in C.
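
One (untested) idea in the same spirit as the riscv code: accumulate in
an unsigned long and only fold carries where the word size requires it,
using something like this hypothetical helper:

/* hypothetical helper, only for this sketch */
#ifdef CONFIG_64BIT
# define csum_add_word(s, v)	((s) += (v))
#else
# define csum_add_word(s, v)	do { (s) += (v); (s) += (s) < (v); } while (0)
#endif

static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
				      const struct in6_addr *daddr,
				      __u32 len, __u8 proto, __wsum csum)
{
	unsigned long sum = (__force u32)csum;
	int i;

	for (i = 0; i < 4; i++) {
		csum_add_word(sum, (__force u32)saddr->s6_addr32[i]);
		csum_add_word(sum, (__force u32)daddr->s6_addr32[i]);
	}
	csum_add_word(sum, len);
	csum_add_word(sum, proto);

#ifdef CONFIG_64BIT
	/* fold the upper 32 bits back into the lower 32 bits */
	sum += (sum >> 32) | (sum << 32);
	sum >>= 32;
#endif
	return csum_fold((__force __wsum)sum);
}

Whether that actually generates something close to the old asm I don't
know.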

It seems add with carry is not well understood by GCC 12.3. These
functions are generally heavily optimized on every architecture, so I
think it is worth defaulting to assembly if you aren't able to achieve
comparable performance in C.

- Charlie