Re: [PATCH v4 2/2] gro: optimise redundant parsing of packets

From: Richard Gobert
Date: Thu Apr 20 2023 - 13:23:44 EST


> On Wed, 2023-03-22 at 20:33 +0100, Richard Gobert wrote:
> > > On Wed, Mar 22, 2023 at 2:59 AM Paolo Abeni <pabeni@xxxxxxxxxx> wrote:
> > > >
> > > > On Mon, 2023-03-20 at 18:00 +0100, Richard Gobert wrote:
> > > > > Currently the IPv6 extension headers are parsed twice: first in
> > > > > ipv6_gro_receive, and then again in ipv6_gro_complete.
> > > > >
> > > > > By using the new ->transport_proto field, and also storing the size
> > > > > of the network header, we can avoid parsing extension headers a
> > > > > second time in ipv6_gro_complete. This saves multiple memory
> > > > > dereferences and conditional checks inside ipv6_exthdrs_len for a
> > > > > varying number of extension headers in IPv6 packets.
> > > > >
> > > > > The implementation had to handle both inner and outer layers in case
> > > > > of encapsulation (as they can't use the same field). I've applied a
> > > > > similar optimisation to Ethernet.
> > > > >
> > > > > Performance tests for a TCP stream over IPv6 with a varying number
> > > > > of extension headers demonstrate a throughput improvement of ~0.7%.
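
In other words: the patch caches what ipv6_gro_receive() has already
parsed so that ipv6_gro_complete() can reuse it. A minimal sketch,
assuming CB fields along these lines (transport_proto is named above;
network_len is an illustrative stand-in for the stored network header
size, not necessarily the patch's final layout):

    /* receive side: runs once per wire packet, records the parse */
    NAPI_GRO_CB(skb)->transport_proto = proto;
    NAPI_GRO_CB(skb)->network_len = nhlen;

    /* complete side: no second extension-header walk needed */
    proto = NAPI_GRO_CB(skb)->transport_proto;
    nhoff += NAPI_GRO_CB(skb)->network_len;
    ops = rcu_dereference(inet6_offloads[proto]);
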
> > > >
> > > > I'm surprised that the improvement is measurable: for large aggregate
> > > > packets a single ipv6_exthdrs_len() call is avoided out of tens of
> > > > calls for the individual packets. Additionally, such a figure is
> > > > comparable to the noise level in my tests.
> >
> > It's not simple, but I made an effort to create a quiet environment.
> > With the correct configuration this kind of measurement can be made,
> > as the test is CPU-bound and noise is variance that can be reduced
> > with enough samples.
> >
> > Environment example (100Gbit NIC (mlx5), physical machine, i9 12th gen):
> >
> > # power-management and hyperthreading disabled in BIOS
> > # sysctl preallocate net mem
> > echo 0 > /sys/devices/system/cpu/cpufreq/boost    # disable turboboost
> > ethtool -A enp1s0f0np0 rx off tx off autoneg off  # no PAUSE frames
> >
> > # Single core performance
> > for x in /sys/devices/system/cpu/cpu[1-9]*/online; do echo 0 > "$x"; done
> >
> > # unload netfilter
> > ./network-testing-master/bin/netfilter_unload_modules.sh 2>/dev/null
> > tuned-adm profile latency-performance
> > cpupower frequency-set -f 2200MHz    # set the core to a fixed frequency
> > systemctl isolate rescue-ssh.target  # and kill all processes besides init
> >
> > > > This adds a couple of additional branches for the common (no
> > > > extension headers) case.
> >
> > The additional branch in ipv6_gro_receive should be negligible, or
> > even free thanks to the branch predictor, in the common case
> > (non-encapsulated packets). I could wrap it in a likely() macro if
> > you wish.
> > Inside ipv6_gro_complete a couple of branches are saved for the
> > common case, as demonstrated below.
> >
> > original code ipv6_gro_complete (ipv6_exthdrs_len is inlined):
> >
> > // if (skb->encapsulation)
> >
> > ffffffff81c4962b: f6 87 81 00 00 00 20  testb  $0x20,0x81(%rdi)
> > ffffffff81c49632: 74 2a                 je     ffffffff81c4965e <ipv6_gro_complete+0x3e>
> >
> > ...
> >
> > // nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);
> >
> > ffffffff81c4968e: b8 28 00 00 00        mov    $0x28,%eax
> > ffffffff81c49693: 31 f6                 xor    %esi,%esi
> > ffffffff81c49695: 48 c7 c7 c0 28 aa 82  mov    $0xffffffff82aa28c0,%rdi
> > ffffffff81c4969c: eb 1b                 jmp    ffffffff81c496b9 <ipv6_gro_complete+0x99>  <-- jump to beginning of for loop
> > ffffffff81c4969e: f6 41 18 01           testb  $0x1,0x18(%rcx)
> > ffffffff81c496a2: 74 34                 je     ffffffff81c496d8 <ipv6_gro_complete+0xb8>  <-- 3rd conditional check: !((*opps)->flags & INET6_PROTO_GSO_EXTHDR)
> > ffffffff81c496a4: 48 98                 cltq
> > ffffffff81c496a6: 48 01 c2              add    %rax,%rdx
> > ffffffff81c496a9: 0f b6 42 01           movzbl 0x1(%rdx),%eax
> > ffffffff81c496ad: 0f b6 0a              movzbl (%rdx),%ecx
> > ffffffff81c496b0: 8d 04 c5 08 00 00 00  lea    0x8(,%rax,8),%eax
> > ffffffff81c496b7: 01 c6                 add    %eax,%esi
> > ffffffff81c496b9: 85 c9                 test   %ecx,%ecx  <-- for loop starts here
> > ffffffff81c496bb: 74 e7                 je     ffffffff81c496a4 <ipv6_gro_complete+0x84>  <-- 1st conditional check: proto != NEXTHDR_HOP
> > ffffffff81c496bd: 48 8b 0c cf           mov    (%rdi,%rcx,8),%rcx
> > ffffffff81c496c1: 48 85 c9              test   %rcx,%rcx
> > ffffffff81c496c4: 75 d8                 jne    ffffffff81c4969e <ipv6_gro_complete+0x7e>  <-- 2nd conditional check: unlikely(!(*opps))
> >
> > ... (indirect call ops->callbacks.gro_complete)
> >
> > ipv6_exthdrs_len contains a loop with 3 conditional checks. For the
> > common (no extension headers) case, in the new code, *all 3 branches
> > are completely avoided*.
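
For reference, ipv6_exthdrs_len() upstream looks roughly like this
(net/ipv6/ip6_offload.c); the three checks called out above are marked:

    static int ipv6_exthdrs_len(struct ipv6hdr *iph,
                                const struct net_offload **opps)
    {
        struct ipv6_opt_hdr *opth = (struct ipv6_opt_hdr *)iph;
        int len = 0, proto, optlen = sizeof(*iph);

        proto = iph->nexthdr;
        for (;;) {
            if (proto != NEXTHDR_HOP) {                         /* 1st check */
                *opps = rcu_dereference(inet6_offloads[proto]);
                if (unlikely(!(*opps)))                         /* 2nd check */
                    break;
                if (!((*opps)->flags & INET6_PROTO_GSO_EXTHDR)) /* 3rd check */
                    break;
            }
            opth = (void *)opth + optlen;
            optlen = ipv6_optlen(opth);
            len += optlen;
            proto = opth->nexthdr;
        }
        return len;
    }
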
> >
> > patched code ipv6_gro_complete:
> >
> > // if (skb->encapsulation)
> > ffffffff81befe58: f6 83 81 00 00 00 20     testb  $0x20,0x81(%rbx)
> > ffffffff81befe5f: 74 78                    je     ffffffff81befed9 <ipv6_gro_complete+0xb9>
> >
> > ...
> >
> > // else
> > ffffffff81befed9: 0f b6 43 50              movzbl 0x50(%rbx),%eax
> > ffffffff81befedd: 0f b7 73 4c              movzwl 0x4c(%rbx),%esi
> > ffffffff81befee1: 48 8b 0c c5 c0 3f a9 82  mov    -0x7d56c040(,%rax,8),%rcx
> >
> > ... (indirect call ops->callbacks.gro_complete)
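
The three loads above map to something like this in C (illustrative
field names as in the earlier sketch):

    proto = NAPI_GRO_CB(skb)->transport_proto;    /* movzbl 0x50(%rbx),%eax */
    nhoff += NAPI_GRO_CB(skb)->network_len;       /* movzwl 0x4c(%rbx),%esi */
    ops = rcu_dereference(inet6_offloads[proto]); /* mov    ...(,%rax,8),%rcx */
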
> >
> > Thus, the patch is beneficial for both the common case and the
> > extension header case. I would appreciate a second consideration :)
>
> A problem with the above analysis is that it does not take into
> consideration the places where the new branches are added:
> eth_gro_receive() and ipv6_gro_receive().
>
> Note that such functions are called for each packet on the wire:
> multiple times for each aggregate packet (with a 1500-byte MTU and
> ~64KB aggregates, roughly 40 times per ipv6_gro_complete() call).
>
> The above is likely not measurable in terms of pps delta, but the
> added CPU cycles spent in the common case are definitely there. In my
> opinion that outweighs the benefit for the extension header case.
>
> Cheers,
>
> Paolo
>
> p.s. please refrain from off-list pings. They are ignored by most and
> considered rude by some.

Thanks,
I will re-post the first patch as a new one.
As for the second patch, I get your point; you are correct. I didn't pay
enough attention to the accumulated overhead during the receive phase,
as it wasn't showing up in my measurements. I'll look further into it
and check whether I can come up with a better solution.
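
For reference, the receive-side cost in question is an unconditional
store that runs once per wire segment, e.g. (illustrative field name
again, not the exact patch code):

    /* in ipv6_gro_receive(): executed for every packet on the wire,
     * even though the aggregate is flushed by a single
     * ipv6_gro_complete() call.
     */
    NAPI_GRO_CB(skb)->transport_proto = proto;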

Sorry for the off-list ping. Is it ok to send a ping via the mailing list?