Re: [PATCH bpf-next v5 2/4] xsk: Support UMEM chunk_size > PAGE_SIZE

From: Magnus Karlsson
Date: Wed Apr 12 2023 - 09:39:33 EST


On Mon, 10 Apr 2023 at 14:08, Kal Conley <kal.conley@xxxxxxxxxxx> wrote:
>
> Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
> enables sending/receiving jumbo ethernet frames up to the theoretical
> maximum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
> to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
> SKB mode is usable pending future driver work.
>
> For consistency, check for HugeTLB pages during UMEM registration. This
> implies that hugepages are required for XDP_COPY mode despite DMA not
> being used. This restriction is desirable since it ensures user software
> can take advantage of future driver support.
>
> Despite this change, always store order-0 pages in the umem->pgs array
> since this is what is returned by pin_user_pages(). Conversely, XSK
> pools bound to HugeTLB UMEMs do DMA page accounting at hugepage
> granularity (HPAGE_SIZE).
>
> No significant change in RX/TX performance was observed with this patch.
> A few data points are reproduced below:
>
> Machine : Dell PowerEdge R940
> CPU : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
> NIC : MT27700 Family [ConnectX-4]
>
> +-----+------+------+-------+--------+--------+--------+
> | | | | chunk | packet | rxdrop | rxdrop |
> | | mode | mtu | size | size | (Mpps) | (Gbps) |
> +-----+------+------+-------+--------+--------+--------+
> | old | -z | 3498 | 4000 | 320 | 15.9 | 40.8 |
> | new | -z | 3498 | 4000 | 320 | 15.9 | 40.8 |
> +-----+------+------+-------+--------+--------+--------+
> | old | -z | 3498 | 4096 | 320 | 16.5 | 42.2 |
> | new | -z | 3498 | 4096 | 320 | 16.5 | 42.3 |
> +-----+------+------+-------+--------+--------+--------+
> | new | -c | 3498 | 10240 | 320 | 6.1 | 15.7 |
> +-----+------+------+-------+--------+--------+--------+
> | new | -S | 9000 | 10240 | 9000 | 0.37 | 26.4 |
> +-----+------+------+-------+--------+--------+--------+
>
> Signed-off-by: Kal Conley <kal.conley@xxxxxxxxxxx>
> ---
> Documentation/networking/af_xdp.rst | 36 +++++++++++--------
> include/net/xdp_sock.h | 2 ++
> include/net/xdp_sock_drv.h | 12 +++++++
> include/net/xsk_buff_pool.h | 10 +++---
> net/xdp/xdp_umem.c | 55 +++++++++++++++++++++++------
> net/xdp/xsk_buff_pool.c | 36 +++++++++++--------
> 6 files changed, 109 insertions(+), 42 deletions(-)
>
> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> index 247c6c4127e9..ea65cd882af6 100644
> --- a/Documentation/networking/af_xdp.rst
> +++ b/Documentation/networking/af_xdp.rst
> @@ -105,12 +105,13 @@ with AF_XDP". It can be found at https://lwn.net/Articles/750845/.
> UMEM
> ----
>
> -UMEM is a region of virtual contiguous memory, divided into
> -equal-sized frames. An UMEM is associated to a netdev and a specific
> -queue id of that netdev. It is created and configured (chunk size,
> -headroom, start address and size) by using the XDP_UMEM_REG setsockopt
> -system call. A UMEM is bound to a netdev and queue id, via the bind()
> -system call.
> +UMEM is a region of virtual contiguous memory divided into equal-sized
> +frames. This is the area that contains all the buffers that packets can
> +reside in. A UMEM is associated with a netdev and a specific queue id of
> +that netdev. It is created and configured (start address, size,
> +chunk size, and headroom) by using the XDP_UMEM_REG setsockopt system
> +call. A UMEM is bound to a netdev and queue id via the bind() system
> +call.
>
> An AF_XDP is socket linked to a single UMEM, but one UMEM can have
> multiple AF_XDP sockets. To share an UMEM created via one socket A,
> @@ -418,14 +419,21 @@ negatively impact performance.
> XDP_UMEM_REG setsockopt
> -----------------------
>
> -This setsockopt registers a UMEM to a socket. This is the area that
> -contain all the buffers that packet can reside in. The call takes a
> -pointer to the beginning of this area and the size of it. Moreover, it
> -also has parameter called chunk_size that is the size that the UMEM is
> -divided into. It can only be 2K or 4K at the moment. If you have an
> -UMEM area that is 128K and a chunk size of 2K, this means that you
> -will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM
> -area and that your largest packet size can be 2K.
> +This setsockopt registers a UMEM to a socket. The call takes a pointer
> +to the beginning of this area and the size of it. Moreover, there is a
> +parameter called chunk_size that is the size that the UMEM is divided
> +into. The chunk size limits the maximum packet size that can be sent or
> +received. For example, if you have a UMEM area that is 128K and a chunk
> +size of 2K, then you will be able to hold a maximum of 128K / 2K = 64
> +packets in your UMEM. In this case, the maximum packet size will be 2K.
> +
> +Valid chunk sizes range from 2K to 64K. However, in aligned mode, the
> +chunk size must also be a power of two. Additionally, the chunk size
> +must not exceed the size of a page (usually 4K). This limitation is
> +relaxed for UMEM areas allocated with HugeTLB pages, in which case
> +chunk sizes up to 64K are allowed. Note, this only works with hugepages
> +allocated from the kernel's persistent pool. Using Transparent Huge
> +Pages (THP) has no effect on the maximum chunk size.
>
> There is also an option to set the headroom of each single buffer in
> the UMEM. If you set this to N bytes, it means that the packet will
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index e96a1151ec75..a71589539c38 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -25,6 +25,8 @@ struct xdp_umem {
> u32 chunk_size;
> u32 chunks;
> u32 npgs;
> + u32 page_shift;
> + u32 page_size;
> struct user_struct *user;
> refcount_t users;
> u8 flags;
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 9c0d860609ba..83fba3060c9a 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -12,6 +12,18 @@
> #define XDP_UMEM_MIN_CHUNK_SHIFT 11
> #define XDP_UMEM_MIN_CHUNK_SIZE (1 << XDP_UMEM_MIN_CHUNK_SHIFT)
>
> +static_assert(XDP_UMEM_MIN_CHUNK_SIZE <= PAGE_SIZE);
> +
> +/* Allow chunk sizes up to the maximum size of an ethernet frame (64 KiB).
> + * Larger chunks are not guaranteed to fit in a single SKB.
> + */
> +#ifdef CONFIG_HUGETLB_PAGE
> +#define XDP_UMEM_MAX_CHUNK_SHIFT min(16, HPAGE_SHIFT)
> +#else
> +#define XDP_UMEM_MAX_CHUNK_SHIFT min(16, PAGE_SHIFT)
> +#endif
> +#define XDP_UMEM_MAX_CHUNK_SIZE (1 << XDP_UMEM_MAX_CHUNK_SHIFT)
> +
> #ifdef CONFIG_XDP_SOCKETS
>
> void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
> diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> index a8d7b8a3688a..af822b322d89 100644
> --- a/include/net/xsk_buff_pool.h
> +++ b/include/net/xsk_buff_pool.h
> @@ -68,6 +68,8 @@ struct xsk_buff_pool {
> struct xdp_desc *tx_descs;
> u64 chunk_mask;
> u64 addrs_cnt;
> + u32 page_shift;
> + u32 page_size;
> u32 free_list_cnt;
> u32 dma_pages_cnt;
> u32 free_heads_cnt;
> @@ -123,8 +125,8 @@ static inline void xp_init_xskb_addr(struct xdp_buff_xsk *xskb, struct xsk_buff_
> static inline void xp_init_xskb_dma(struct xdp_buff_xsk *xskb, struct xsk_buff_pool *pool,
> dma_addr_t *dma_pages, u64 addr)
> {
> - xskb->frame_dma = (dma_pages[addr >> PAGE_SHIFT] & ~XSK_NEXT_PG_CONTIG_MASK) +
> - (addr & ~PAGE_MASK);
> + xskb->frame_dma = (dma_pages[addr >> pool->page_shift] & ~XSK_NEXT_PG_CONTIG_MASK) +
> + (addr & (pool->page_size - 1));
> xskb->dma = xskb->frame_dma + pool->headroom + XDP_PACKET_HEADROOM;
> }
>
> @@ -175,13 +177,13 @@ static inline void xp_dma_sync_for_device(struct xsk_buff_pool *pool,
> static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
> u64 addr, u32 len)
> {
> - bool cross_pg = (addr & (PAGE_SIZE - 1)) + len > PAGE_SIZE;
> + bool cross_pg = (addr & (pool->page_size - 1)) + len > pool->page_size;
>
> if (likely(!cross_pg))
> return false;
>
> return pool->dma_pages &&
> - !(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);
> + !(pool->dma_pages[addr >> pool->page_shift] & XSK_NEXT_PG_CONTIG_MASK);
> }
>
> static inline u64 xp_aligned_extract_addr(struct xsk_buff_pool *pool, u64 addr)
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 4681e8e8ad94..6fb984be8f40 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -10,6 +10,8 @@
> #include <linux/uaccess.h>
> #include <linux/slab.h>
> #include <linux/bpf.h>
> +#include <linux/hugetlb.h>
> +#include <linux/hugetlb_inline.h>
> #include <linux/mm.h>
> #include <linux/netdevice.h>
> #include <linux/rtnetlink.h>
> @@ -91,9 +93,39 @@ void xdp_put_umem(struct xdp_umem *umem, bool defer_cleanup)
> }
> }
>
> +/* NOTE: The mmap_lock must be held by the caller. */
> +static void xdp_umem_init_page_size(struct xdp_umem *umem, unsigned long address)
> +{
> +#ifdef CONFIG_HUGETLB_PAGE
> + struct vm_area_struct *vma;
> + struct vma_iterator vmi;
> + unsigned long end;
> +
> + if (!IS_ALIGNED(address, HPAGE_SIZE))
> + goto no_hugetlb;
> +
> + vma_iter_init(&vmi, current->mm, address);
> + end = address + umem->size;
> +
> + for_each_vma_range(vmi, vma, end) {
> + if (!is_vm_hugetlb_page(vma))
> + goto no_hugetlb;
> + /* Hugepage sizes smaller than the default are not supported. */
> + if (huge_page_size(hstate_vma(vma)) < HPAGE_SIZE)
> + goto no_hugetlb;
> + }
> +
> + umem->page_shift = HPAGE_SHIFT;
> + umem->page_size = HPAGE_SIZE;
> + return;
> +no_hugetlb:
> +#endif
> + umem->page_shift = PAGE_SHIFT;
> + umem->page_size = PAGE_SIZE;
> +}
> +
> static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
> {
> - unsigned int gup_flags = FOLL_WRITE;
> long npgs;
> int err;
>
> @@ -102,8 +134,18 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
> return -ENOMEM;
>
> mmap_read_lock(current->mm);
> +
> + xdp_umem_init_page_size(umem, address);
> +
> + if (umem->chunk_size > umem->page_size) {
> + mmap_read_unlock(current->mm);
> + err = -EINVAL;
> + goto out_pgs;
> + }
> +
> npgs = pin_user_pages(address, umem->npgs,
> - gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
> + FOLL_WRITE | FOLL_LONGTERM, &umem->pgs[0], NULL);
> +
> mmap_read_unlock(current->mm);
>
> if (npgs != umem->npgs) {
> @@ -156,15 +198,8 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> unsigned int chunks, chunks_rem;
> int err;
>
> - if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > PAGE_SIZE) {
> - /* Strictly speaking we could support this, if:
> - * - huge pages, or*
> - * - using an IOMMU, or
> - * - making sure the memory area is consecutive
> - * but for now, we simply say "computer says no".
> - */
> + if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > XDP_UMEM_MAX_CHUNK_SIZE)
> return -EINVAL;
> - }
>
> if (mr->flags & ~XDP_UMEM_UNALIGNED_CHUNK_FLAG)
> return -EINVAL;
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 26f6d304451e..85b36c31b505 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -75,14 +75,16 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
>
> pool->chunk_mask = ~((u64)umem->chunk_size - 1);
> pool->addrs_cnt = umem->size;
> + pool->page_shift = umem->page_shift;
> + pool->page_size = umem->page_size;
> pool->heads_cnt = umem->chunks;
> pool->free_heads_cnt = umem->chunks;
> pool->headroom = umem->headroom;
> pool->chunk_size = umem->chunk_size;
> pool->chunk_shift = ffs(umem->chunk_size) - 1;
> - pool->unaligned = unaligned;
> pool->frame_len = umem->chunk_size - umem->headroom -
> XDP_PACKET_HEADROOM;
> + pool->unaligned = unaligned;

nit: Moving the pool->unaligned assignment is not necessary; please
drop this unrelated churn.

> pool->umem = umem;
> pool->addrs = umem->addrs;
> INIT_LIST_HEAD(&pool->free_list);
> @@ -328,7 +330,8 @@ static void xp_destroy_dma_map(struct xsk_dma_map *dma_map)
> kfree(dma_map);
> }
>
> -static void __xp_dma_unmap(struct xsk_dma_map *dma_map, unsigned long attrs)
> +static void __xp_dma_unmap(struct xsk_buff_pool *pool, struct xsk_dma_map *dma_map,
> + unsigned long attrs)

Instead of passing down the whole buffer pool, it would be better to
pass down just the page_size here: __xp_dma_unmap(dma_map, attrs,
page_size).

That would also make it consistent with xp_check_dma_contiguity()
below.

> {
> dma_addr_t *dma;
> u32 i;
> @@ -337,7 +340,7 @@ static void __xp_dma_unmap(struct xsk_dma_map *dma_map, unsigned long attrs)
> dma = &dma_map->dma_pages[i];
> if (*dma) {
> *dma &= ~XSK_NEXT_PG_CONTIG_MASK;
> - dma_unmap_page_attrs(dma_map->dev, *dma, PAGE_SIZE,
> + dma_unmap_page_attrs(dma_map->dev, *dma, pool->page_size,
> DMA_BIDIRECTIONAL, attrs);
> *dma = 0;
> }
> @@ -362,7 +365,7 @@ void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs)
> if (!refcount_dec_and_test(&dma_map->users))
> return;
>
> - __xp_dma_unmap(dma_map, attrs);
> + __xp_dma_unmap(pool, dma_map, attrs);
> kvfree(pool->dma_pages);
> pool->dma_pages = NULL;
> pool->dma_pages_cnt = 0;
> @@ -370,16 +373,17 @@ void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs)
> }
> EXPORT_SYMBOL(xp_dma_unmap);
>
> -static void xp_check_dma_contiguity(struct xsk_dma_map *dma_map)
> +static void xp_check_dma_contiguity(struct xsk_dma_map *dma_map, u32 page_size)
> {
> u32 i;
>
> - for (i = 0; i < dma_map->dma_pages_cnt - 1; i++) {
> - if (dma_map->dma_pages[i] + PAGE_SIZE == dma_map->dma_pages[i + 1])
> + for (i = 0; i + 1 < dma_map->dma_pages_cnt; i++) {

I think the previous version (i < dma_map->dma_pages_cnt - 1) is
clearer than this new one.

> + if (dma_map->dma_pages[i] + page_size == dma_map->dma_pages[i + 1])
> dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK;
> else
> dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
> }
> + dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
> }
>
> static int xp_init_dma_info(struct xsk_buff_pool *pool, struct xsk_dma_map *dma_map)
> @@ -412,6 +416,7 @@ int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
> {
> struct xsk_dma_map *dma_map;
> dma_addr_t dma;
> + u32 stride;
> int err;
> u32 i;
>
> @@ -425,15 +430,19 @@ int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
> return 0;
> }
>
> + /* dma_pages use pool->page_size whereas `pages` are always order-0. */
> + stride = pool->page_size >> PAGE_SHIFT; /* in order-0 pages */
> + nr_pages = (nr_pages + stride - 1) >> (pool->page_shift - PAGE_SHIFT);
> +
> dma_map = xp_create_dma_map(dev, pool->netdev, nr_pages, pool->umem);
> if (!dma_map)
> return -ENOMEM;
>
> for (i = 0; i < dma_map->dma_pages_cnt; i++) {
> - dma = dma_map_page_attrs(dev, pages[i], 0, PAGE_SIZE,
> + dma = dma_map_page_attrs(dev, pages[i * stride], 0, pool->page_size,
> DMA_BIDIRECTIONAL, attrs);
> if (dma_mapping_error(dev, dma)) {
> - __xp_dma_unmap(dma_map, attrs);
> + __xp_dma_unmap(pool, dma_map, attrs);
> return -ENOMEM;
> }
> if (dma_need_sync(dev, dma))
> @@ -442,11 +451,11 @@ int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
> }
>
> if (pool->unaligned)
> - xp_check_dma_contiguity(dma_map);
> + xp_check_dma_contiguity(dma_map, pool->page_size);
>
> err = xp_init_dma_info(pool, dma_map);
> if (err) {
> - __xp_dma_unmap(dma_map, attrs);
> + __xp_dma_unmap(pool, dma_map, attrs);
> return err;
> }
>
> @@ -663,9 +672,8 @@ EXPORT_SYMBOL(xp_raw_get_data);
> dma_addr_t xp_raw_get_dma(struct xsk_buff_pool *pool, u64 addr)
> {
> addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;
> - return (pool->dma_pages[addr >> PAGE_SHIFT] &
> - ~XSK_NEXT_PG_CONTIG_MASK) +
> - (addr & ~PAGE_MASK);
> + return (pool->dma_pages[addr >> pool->page_shift] & ~XSK_NEXT_PG_CONTIG_MASK) +
> + (addr & (pool->page_size - 1));
> }
> EXPORT_SYMBOL(xp_raw_get_dma);
>
> --
> 2.39.2
>