Re: [RFC 00/12] io_uring zerocopy send

From: Pavel Begunkov
Date: Thu Dec 02 2021 - 11:26:19 EST


On 12/2/21 00:36, Willem de Bruijn wrote:
>>>> 1) we pass a bvec, so no page table walks.
>>>> 2) zerocopy_sg_from_iter() is just slow; adding a bvec-optimised version
>>>> that still does page get/put (see 4/12) cut 4-5%.
>>>> 3) avoiding get_page/put_page in 5/12
>>>> 4) completion events are posted into io_uring's CQ, so no
>>>> extra recvmsg for getting events
>>>> 5) no poll(2) in the code because of io_uring
>>>> 6) a lot of time is spent in sock_omalloc()/free allocating ubuf_info.
>>>> io_uring caches the structures, reducing it to nearly zero overhead.
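
For context, the userspace flow that points 4) and 5) refer to looks
roughly like the sketch below (following Documentation/networking/msg_zerocopy.rst;
simplified, error handling omitted):

/* Rough sketch of today's MSG_ZEROCOPY send + completion reaping. */
#include <poll.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

static void send_one_zerocopy(int fd, const void *buf, size_t len)
{
	int one = 1;
	char control[100];
	struct pollfd pfd = { .fd = fd };	/* POLLERR comes unasked */
	struct msghdr msg = {
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};

	setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
	send(fd, buf, len, MSG_ZEROCOPY);  /* pages pinned until completion */

	/* 5) poll for the completion notification ... */
	poll(&pfd, 1, -1);

	/* 4) ... then an extra recvmsg() on the error queue to reap it;
	 * the cmsg carries a struct sock_extended_err with ee_origin ==
	 * SO_EE_ORIGIN_ZEROCOPY and the completed [ee_info, ee_data] range */
	recvmsg(fd, &msg, MSG_ERRQUEUE);
}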

>>> Nice set of complementary optimizations.
>>>
>>> We have looked at adding some of those as independent additions to
>>> msg_zerocopy before, such as long-term pinned regions. One issue with
>>> that is that the pages must remain until the request completes,
>>> regardless of whether the calling process is alive. So it cannot rely
>>> on a pinned range held by a process only.
>>>
>>> If feasible, it would be preferable if the optimizations can be added
>>> to msg_zerocopy directly, rather than adding a dependency on io_uring
>>> to make use of them. But not sure how feasible that is. For some, like
>>> 4 and 5, the answer is clearly it isn't. 6, it probably is?

>> Forgot about 6): io_uring uses the fact that submissions are
>> done under a per-ring mutex, and completions are under a per-ring
>> spinlock, so there are two lists for them and no extra
>> locking. Lists are spliced in a batched manner, so it's
>> one spinlock acquisition per N (e.g. 32) cached ubuf_info allocations.
>>
>> Any similar guarantees for sockets?
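
To make that concrete, a hypothetical sketch of such a two-list cache
(all names invented for illustration, not the actual io_uring code):

#include <linux/list.h>
#include <linux/spinlock.h>

struct io_ubuf {			/* would wrap a ubuf_info */
	struct list_head node;
};

struct ubuf_cache {
	struct list_head free_list;	/* touched only under the ring mutex */
	struct list_head done_list;	/* touched only under completion lock */
};

/* submission side: called with the per-ring mutex held */
static struct io_ubuf *cache_get(struct ubuf_cache *c,
				 spinlock_t *completion_lock)
{
	struct io_ubuf *u;

	if (list_empty(&c->free_list)) {
		/* one lock/unlock refills a whole batch of entries */
		spin_lock(completion_lock);
		list_splice_init(&c->done_list, &c->free_list);
		spin_unlock(completion_lock);
	}
	if (list_empty(&c->free_list))
		return NULL;		/* cache dry, fall back to kmalloc() */
	u = list_first_entry(&c->free_list, struct io_ubuf, node);
	list_del(&u->node);
	return u;
}

/* completion side: the caller already holds the per-ring completion
 * spinlock, so recycling an entry adds no extra locking or atomics */
static void cache_put(struct ubuf_cache *c, struct io_ubuf *u)
{
	list_add(&u->node, &c->done_list);
}

The submission side pays one spinlock acquisition per refilled batch,
and the completion side recycles entries under a lock it already holds.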

> For datagrams it might matter, not sure if it would show up in a
> profile. The current notification mechanism is quite a bit more
> heavyweight than any form of fixed ubuf pool.

Just to give an idea of what I'm seeing in profiles: while testing

3 | io_uring (@flush=false, nr_reqs=1) | 96534 | 2.03

I found that removing one extra smp_mb() per request in io_uring
gave around +0.65% throughput (quick testing). In profiles, the
function it was removed from dropped from 0.93% to 0.09%.

From what I see, alloc+free takes 6-10% for 64KB UDP. It would be
great to have something similar for MSG_ZEROCOPY, but if that adds
additional locking/atomics, honestly I'd prefer to keep it separate
from io_uring's caching.

I also hope we can optimise the generic paths at some point, and the
faster they get, the more such additional locking will hurt, pretty
much as it was with the block layer.

> For TCP this matters less, as multiple sends are not needed and
> completions are coalesced, because they arrive in order.


--
Pavel Begunkov