Re: Re: [RFC v3 Optimizing veth xsk performance 0/9]

From: 黄杰
Date: Wed Aug 09 2023 - 03:14:27 EST

Next message: Sven Schnelle: "[PATCH v2] tracing/synthetic: use union instead of casts"
Previous message: Tony Lindgren: "Re: [PATCH 1/4] arch/arm/configs/omap2plus_defconfig: drop removed options"
In reply to: Toke Høiland-Jørgensen: "Re: [RFC v3 Optimizing veth xsk performance 0/9]"
Next in thread: Toke Høiland-Jørgensen: "Re: Re: [RFC v3 Optimizing veth xsk performance 0/9]"

On Tue, Aug 8, 2023 at 20:01, Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> Albert Huang <huangjie.albert@xxxxxxxxxxxxx> writes:
>
> > AF_XDP is a kernel bypass technology that can greatly improve performance.
> > However, for virtual devices like veth, even with the use of AF_XDP sockets,
> > there are still many additional software paths that consume CPU resources.
> > This patch series focuses on optimizing the performance of AF_XDP sockets
> > for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
> > Patch 5 introduces tx queue and tx napi for packet transmission, while
> > patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9
> > adds support for the AF_XDP tx need_wakeup feature. These optimizations significantly
> > reduce the software path and support checksum offload.
> >
> > I tested those features with
> > A typical topology is shown below:
> > client(send):                                              server:(recv)
> > veth<-->veth-peer                                          veth1-peer<--->veth1
> >      1      |                                                   |      7
> >             |2                                                 6|
> >             |                                                   |
> >          bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
> >                    3                    4                 5
> >               (machine1)                            (machine2)
>
> I definitely applaud the effort to improve the performance of af_xdp
> over veth, this is something we have flagged as in need of improvement
> as well.
>
> However, looking through your patch series, I am less sure that the
> approach you're taking here is the right one.
>
> AFAIU (speaking about the TX side here), the main difference between
> AF_XDP ZC and the regular transmit mode is that in the regular TX mode
> the stack will allocate an skb to hold the frame and push that down the
> stack. Whereas in ZC mode, there's a driver NDO that gets called
> directly, bypassing the skb allocation entirely.
>
> In this series, you're implementing the ZC mode for veth, but the driver
> code ends up allocating an skb anyway. Which seems to be a bit of a
> weird midpoint between the two modes, and adds a lot of complexity to
> the driver that (at least conceptually) is mostly just a
> reimplementation of what the stack does in non-ZC mode (allocate an skb
> and push it through the stack).
>
> So my question is, why not optimise the non-zc path in the stack instead
> of implementing the zc logic for veth? It seems to me that it would be
> quite feasible to apply the same optimisations (bulking, and even GRO)
> to that path and achieve the same benefits, without having to add all
> this complexity to the veth driver?
>
> -Toke
>
Thanks!
That is a really good idea, and it is something I had overlooked. I will
look into implementing the approach you proposed and test the resulting
performance improvement.

Albert.
