Re: Re: [PATCH net-next 0/3] make skip_sw actually skip software

From: Marcelo Ricardo Leitner
Date: Fri Feb 16 2024 - 09:46:56 EST


On Fri, Feb 16, 2024 at 12:17:28PM +0000, Asbjørn Sloth Tønnesen wrote:
> Hi Marcelo,
>
> On 2/15/24 18:00, Marcelo Ricardo Leitner wrote:
> > On Thu, Feb 15, 2024 at 04:04:41PM +0000, Asbjørn Sloth Tønnesen wrote:
> > ...
> > > We use TC flower offload for the hottest
> > > prefixes, and leave the long tail to Linux / the CPU.
> > > We therefore need both the hardware and software
> > > datapath to perform well.
> > >
> > > I found that skip_sw rules are quite expensive
> > > in the kernel datapath, since they must be evaluated
> > > and matched upon, before the kernel checks the
> > > skip_sw flag.
> > >
> > > This patchset optimizes the case where all rules
> > > are skip_sw.
> >
> > The talk is interesting. Yet, I don't get how it is set up.
> > How do you use a dedicated block for skip_sw, and then have a
> > catch-all on sw again please?
>
> Bird installs the DFZ Internet routing table into the main kernel table
> for the software datapath.
>
> Bird also installs a subset of the routing table into an aux. kernel table.
>
> flower-route then picks up the routes from the aux. kernel table, and
> installs them as TC skip_sw filters.
>
> On these machines we don't have any non-skip_sw TC filters.
>
> Since 2021, we have statically offloaded all inbound traffic, since
> nexthop for our IP space is always the switch next to it, which does
> interior L3 routing. Thereby we could offload ~50% of the packets.
>
> I have put an example of the static script here:
> https://files.fiberby.net/ast/2024/tc_skip_sw/mlx5_static_offload.sh
>
> And `tc filter show dev enp5s0f0np0 ingress` after running the script:
> https://files.fiberby.net/ast/2024/tc_skip_sw/mlx_offload_demo_tc_dump.txt

Ahh ok. So from tc/flower perspective, you actually offload
everything. :-)

The part that was confusing to me is that what you need done in sw
is not done in tc sw, but rather by the IP stack itself. So you
actually offload flower filters with these, let's say, exceptions.

It seems to me a better fix for this is to have action trap "resume
to sw" at the trap action itself. Then, for traffic that triggers a
miss in hw, you could add a catch-all filter to trigger the trap.
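
Roughly, and untested, such a catch-all could look like the sketch
below. The device name is the one from your dump; the prio value and
the dry-run wrapper are only assumptions for illustration, and I have
not checked whether the driver accepts trap on a wildcard match.

```shell
#!/bin/sh
# Hypothetical sketch: last-resort catch-all that traps hw misses to the CPU.
DEV="${DEV:-enp5s0f0np0}"   # device name taken from the tc dump
PRIO="${PRIO:-65535}"       # assumed: highest prio value, so it is evaluated last

# Print the command by default; set DRY_RUN=0 to really apply it (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

add_trap_catch_all() {
    # flower with no match keys matches every IPv4 packet reaching this prio.
    run tc filter add dev "$DEV" ingress protocol ip prio "$PRIO" \
        flower skip_sw action trap
}

add_trap_catch_all
```

Since tc evaluates lower prio values first, the existing skip_sw
filters would still win, and only the misses would hit the trap.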

With the catch-all idea, instead of using trap directly, you could
also use a goto chain X. I just don't remember if you need to have a
flow in chain X that is not offloaded, or if a nonexistent chain is
enough.
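
An untested sketch of the goto variant, under the same caveats: the
chain number, the prio values and the dry-run wrapper are assumptions
of mine, and the second filter is only there in case chain X does need
a non-offloaded flow.

```shell
#!/bin/sh
# Hypothetical sketch: catch-all that jumps to a software-only chain.
DEV="${DEV:-enp5s0f0np0}"   # device name taken from the tc dump
CHAIN="${CHAIN:-1}"         # assumed chain number, stands in for "chain X"
PRIO="${PRIO:-65535}"       # assumed: highest prio value, so it is evaluated last

# Print commands by default; set DRY_RUN=0 to really apply them (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

add_goto_catch_all() {
    # Offloaded catch-all in chain 0: everything not matched earlier
    # jumps to chain X.
    run tc filter add dev "$DEV" ingress protocol ip prio "$PRIO" \
        flower skip_sw action goto chain "$CHAIN"
}

add_sw_only_flow() {
    # Possibly unnecessary: a non-offloaded flow so that chain X exists in sw.
    run tc filter add dev "$DEV" ingress chain "$CHAIN" protocol ip prio 1 \
        flower skip_hw action pass
}

add_goto_catch_all
add_sw_only_flow
```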

These ideas are rooted in the fact that offloading can now resume
processing at a given chain, or even at the action that triggered
the miss. With this, it should skip all the filtering that is
unnecessary in your case. IOW, instead of trying to make the filtering
smarter, which in the current proposal would pretty much be limited to
this use case (as opposed to using a dedicated list for skip_sw), it
resumes the processing at a better spot, using what we already have.

One caveat with this approach is that it will cause an skb_extension
to be allocated for all the traffic that is handled in sw. There's a
small performance penalty for that.

WDYT? Or maybe I missed something?

>
>
> > I'm missing which traffic is being matched against the sw datapath. In
> > theory, you have all the heavy duty filters offloaded, so the sw
> > datapath should be seeing only a few packets, right?
>
> We are a residential ISP, so our traffic is residential Internet
> traffic. We run the BGP routers as a router on a stick, so the filters
> see both inbound and outbound traffic.
>
> ~50% of packets are inbound traffic, our own prefixes are therefore the
> hottest prefixes. Most streaming traffic is handled internally, and is
> therefore not seen on our core routers. We regularly have 5%-10% of all
> outbound traffic going towards the same prefix, and have 50% of outbound
> traffic distributed across just a few prefixes.
>
> We currently only offload our own prefixes, and a select few other known
> high-traffic prefixes.
>
> The goal is to offload the majority of the traffic, but it is still early
> days for flower-route, and I need to implement some smarter chain layout
> first and dynamic filter placement based on hardware counters.

Cool. Btw, be aware that after a few chain jumps, performance may drop
considerably even if offloaded.

>
> Even when I get flower-route to offload almost all traffic, there will still
> be a long tail of prefixes not in hardware, so the kernel still needs
> to not be pulled down by the offloaded filters.
>
> --
> Best regards
> Asbjørn Sloth Tønnesen
> Network Engineer
> Fiberby - AS42541
>