Re: [RFC v3 Optimizing veth xsk performance 0/9]

From: Jesper Dangaard Brouer
Date: Wed Aug 09 2023 - 07:10:01 EST



On 09/08/2023 11.06, Toke Høiland-Jørgensen wrote:
黄杰 <huangjie.albert@xxxxxxxxxxxxx> writes:

Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote on Tue, Aug 8, 2023 at 20:01:

Albert Huang <huangjie.albert@xxxxxxxxxxxxx> writes:

AF_XDP is a kernel-bypass technology that can greatly improve performance.
However, for virtual devices like veth, even with the use of AF_XDP sockets,
there are still many additional software paths that consume CPU resources.
This patch series focuses on optimizing the performance of AF_XDP sockets
for veth virtual devices. Patches 1 to 4 are mainly preparatory work.
Patch 5 introduces a tx queue and tx napi for packet transmission, patch 8
implements batch sending for IPv4 UDP packets, and patch 9 adds support for
the AF_XDP tx need_wakeup feature. These optimizations significantly shorten
the software path and add support for checksum offload.

I tested these features with a typical topology, shown below:

client(send):                                  server(recv):
veth<-->veth-peer                  veth1-peer<--->veth1
  1        |                            |          7
          2|                           6|
           |                            |
        bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
                  3             4              5
        (machine1)                      (machine2)

I definitely applaud the effort to improve the performance of AF_XDP
over veth; this is something we have flagged as in need of improvement
as well.

However, looking through your patch series, I am less sure that the
approach you're taking here is the right one.

AFAIU (speaking about the TX side here), the main difference between
AF_XDP ZC and the regular transmit mode is that in the regular TX mode
the stack will allocate an skb to hold the frame and push that down the
stack. Whereas in ZC mode, there's a driver NDO that gets called
directly, bypassing the skb allocation entirely.
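
For illustration, the ZC TX side of a driver usually looks something
like the sketch below. The xsk_tx_peek_desc() / xsk_buff_raw_get_dma() /
xsk_tx_release() helpers from <net/xdp_sock_drv.h> are the real
driver-side API; the mydrv_* names are hypothetical:

static int mydrv_xsk_xmit(struct mydrv_ring *ring, int budget)
{
	struct xsk_buff_pool *pool = ring->xsk_pool;
	struct xdp_desc desc;
	int sent = 0;

	while (sent < budget && xsk_tx_peek_desc(pool, &desc)) {
		dma_addr_t dma = xsk_buff_raw_get_dma(pool, desc.addr);

		/* The frame goes straight from the UMEM to the HW ring:
		 * no SKB allocation, no socket memory accounting. */
		mydrv_post_tx_frame(ring, dma, desc.len);
		sent++;
	}
	if (sent) {
		mydrv_kick_hw(ring);
		/* Let userspace reuse the consumed TX ring slots; the
		 * driver reports completions later via xsk_tx_completed(). */
		xsk_tx_release(pool);
	}
	return sent;
}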

In this series, you're implementing the ZC mode for veth, but the driver
code ends up allocating an skb anyway, which seems to be a bit of a
weird midpoint between the two modes, and it adds a lot of complexity to
the driver that (at least conceptually) is mostly just a
reimplementation of what the stack does in non-ZC mode (allocate an skb
and push it through the stack).

So my question is, why not optimise the non-zc path in the stack instead
of implementing the zc logic for veth? It seems to me that it would be
quite feasible to apply the same optimisations (bulking, and even GRO)
to that path and achieve the same benefits, without having to add all
this complexity to the veth driver?
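
Just to sketch what I mean: a bulked version of the generic TX loop
could take roughly this shape (purely illustrative, not a concrete
proposal; xskq_cons_peek_desc(), xsk_build_skb() and __dev_direct_xmit()
are the existing internal helpers around net/xdp/xsk.c, while
XSK_TX_BULK is made up and the CQ reservation / error handling is
omitted):

static int xsk_generic_xmit_bulk(struct xdp_sock *xs)
{
	struct sk_buff *skbs[XSK_TX_BULK];
	struct xdp_desc desc;
	int i, n = 0;

	/* Phase 1: build a batch of SKBs from the TX ring */
	while (n < XSK_TX_BULK &&
	       xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
		skbs[n] = xsk_build_skb(xs, &desc);
		if (IS_ERR(skbs[n]))
			break;
		xskq_cons_release(xs->tx);
		n++;
	}

	/* Phase 2: hand the whole batch to the device in one go,
	 * amortizing the per-packet cost of the xmit path */
	for (i = 0; i < n; i++)
		__dev_direct_xmit(skbs[i], xs->queue_id);

	return n;
}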

-Toke

thanks!
This is a really good idea indeed. You've pointed out something that I
overlooked. I will look into implementing the solution you've proposed
and test the performance improvement.

Sounds good, thanks! :)

Good to hear that you want to optimize the non-zc TX path of AF_XDP, as
Toke suggests.

There are a number of performance issues for AF_XDP non-zc TX that I've
talked/complained to Magnus and Björn about over the years.
I've recently started to work on fixing these myself, in collaboration
with Maryam (cc).

The most obvious is that non-zc TX uses socket memory accounting for the
SKBs that get allocated. (ZC TX obviously doesn't.) IMHO this doesn't
make sense, as the AF_XDP concept is to pre-allocate memory, and thus
the AF_XDP memory limits are already bounded at setup time. Furthermore,
__xsk_generic_xmit() already has a backpressure mechanism based on the
available room in the CQ (Completion Queue). Hint: the call to
sock_alloc_send_skb() is what does the socket memory accounting.
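
To show where this bites, the shape of the loop in __xsk_generic_xmit()
is roughly the following (heavily paraphrased from net/xdp/xsk.c, with
locking and error handling dropped):

	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
		/* Backpressure already exists here: a slot in the
		 * Completion Queue is reserved up front, and TX stops
		 * when the CQ is full. */
		if (xskq_prod_reserve(xs->pool->cq))
			break;

		/* ...yet xsk_build_skb() -> sock_alloc_send_skb() still
		 * does socket memory accounting on top of that. */
		skb = xsk_build_skb(xs, &desc);

		__dev_direct_xmit(skb, xs->queue_id);
		xskq_cons_release(xs->tx);
	}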

When AF_XDP gets combined with veth (or other layered software devices),
the problem gets worse, because:

(1) the SKB that gets allocated by xsk_build_skb() doesn't have enough
headroom to satisfy the XDP headroom requirement (XDP_PACKET_HEADROOM).

(2) the backing memory type from sock_alloc_send_skb() is not
compatible with generic/veth XDP.

Both of these issues mean that when the peer veth device receives the
(AF_XDP) TX packet, it has to reallocate memory+SKB and copy the data
*again*.
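
Condensed, the conditions that force this copy on the veth RX side look
like the helper below (veth_needs_xdp_copy() is a made-up name that
summarizes the checks done in veth_xdp_rcv_skb() in drivers/net/veth.c;
the exact conditions vary between kernel versions):

static bool veth_needs_xdp_copy(const struct sk_buff *skb)
{
	/* SKBs from sock_alloc_send_skb() typically fail the headroom
	 * check, cf. (1), and shared/non-writable heads trigger (2). */
	return skb_shared(skb) || skb_head_is_locked(skb) ||
	       skb_headroom(skb) < XDP_PACKET_HEADROOM;
}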

I'm currently[1] looking into how to fix this and have some PoC patches
to estimate the performance benefit of avoiding the realloc when
entering veth. With packet size 512, the numbers start at 828 Kpps and
increase to 1002 Kpps afterwards (at 828 Kpps each packet takes ~1208 ns
and at 1002 Kpps ~998 ns, i.e. roughly 208 nanosec saved per packet, an
increase of about 21%).

[1] https://github.com/xdp-project/xdp-project/blob/veth-benchmark01/areas/core/veth_benchmark03.org

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer