Re: [RFC PATCH v4 00/17] vsock: MSG_ZEROCOPY flag support

From: Arseniy Krasnov
Date: Tue Jun 27 2023 - 01:01:06 EST




On 26.06.2023 19:15, Stefano Garzarella wrote:
> On Sat, Jun 03, 2023 at 11:49:22PM +0300, Arseniy Krasnov wrote:
>> Hello,
>>
>>                           DESCRIPTION
>>
>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>> current implementation for TCP as much as possible:
>>
>> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
>>   flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
>>   flag will be ignored (e.g. without completion).
>>
>> 2) Kernel uses completions from socket's error queue. Single completion
>>   for single tx syscall (or it can merge several completions to single
>>   one). I used already implemented logic for MSG_ZEROCOPY support:
>>   'msg_zerocopy_realloc()' etc.
>>
>> Difference with copy way is not significant. During packet allocation,
>> non-linear skb is created and filled with pinned user pages.
>> There are also some updates for vhost and guest parts of transport - in
>> both cases i've added handling of non-linear skb for virtio part. vhost
>> copies data from such skb to the guest's rx virtio buffers. In the guest,
>> virtio transport fills tx virtio queue with pages from skb.
>>
>> Head of this patchset is:
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b
>>
>>
>> This version has several limits/problems:
>>
>> 1) As this feature totally depends on transport, there is no way (or it
>>   is difficult) to check whether transport is able to handle it or not
>>   during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
>>   setsockopt callback from setsockopt callback for SOL_SOCKET, but this
>>   leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
>>   are not considered to be called from each other. So in current version
>>   SO_ZEROCOPY is set successfully to any type (e.g. transport) of
>>   AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
>>   tx routine will fail with EOPNOTSUPP.
>>
>>   ^^^
>>   This is still no resolved :(
>>
>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
>>   one completion. In each completion there is flag which shows how tx
>>   was performed: zerocopy or copy. This leads that whole message must
>>   be send in zerocopy or copy way - we can't send part of message with
>>   copying and rest of message with zerocopy mode (or vice versa). Now,
>>   we need to account vsock credit logic, e.g. we can't send whole data
>>   once - only allowed number of bytes could sent at any moment. In case
>>   of copying way there is no problem as in worst case we can send single
>>   bytes, but zerocopy is more complex because smallest transmission
>>   unit is single page. So if there is not enough space at peer's side
>>   to send integer number of pages (at least one) - we will wait, thus
>>   stalling tx side. To overcome this problem i've added simple rule -
>>   zerocopy is possible only when there is enough space at another side
>>   for whole message (to check, that current 'msghdr' was already used
>>   in previous tx iterations i use 'iov_offset' field of it's iov iter).
>>
>>   ^^^
>>   Discussed as ok during v2. Link:
>>   https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/
>>
>> 3) loopback transport is not supported, because it requires to implement
>>   non-linear skb handling in dequeue logic (as we "send" fragged skb
>>   and "receive" it from the same queue). I'm going to implement it in
>>   next versions.
>>
>>   ^^^ fixed in v2
>>
>> 4) Current implementation sets max length of packet to 64KB. IIUC this
>>   is due to 'kmalloc()' allocated data buffers. I think, in case of
>>   MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
>>   not touched for data - user space pages are used as buffers. Also
>>   this limit trims every message which is > 64KB, thus such messages
>>   will be send in copy mode due to 'iov_offset' check in 2).
>>
>>   ^^^ fixed in v2
>>
>>                         PATCHSET STRUCTURE
>>
>> Patchset has the following structure:
>> 1) Handle non-linear skbuff on receive in virtio/vhost.
>> 2) Handle non-linear skbuff on send in virtio/vhost.
>> 3) Updates for AF_VSOCK.
>> 4) Enable MSG_ZEROCOPY support on transports.
>> 5) Tests/tools/docs updates.
>>
>>                            PERFORMANCE
>>
>> Performance: it is a little bit tricky to compare performance between
>> copy and zerocopy transmissions. In zerocopy way we need to wait when
>> user buffers will be released by kernel, so it is like synchronous
>> path (wait until device driver will process it), while in copy way we
>> can feed data to kernel as many as we want, don't care about device
>> driver. So I compared only time which we spend in the 'send()' syscall.
>> Then if this value will be combined with total number of transmitted
>> bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
>> enough credit, receiver allocates same amount of space as sender needs.
>>
>> Sender:
>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>>
>> Receiver:
>> ./vsock_perf --vsk-size 256M
>>
>> I run tests on two setups: desktop with Core i7 - I use this PC for
>> development and in this case guest is nested guest, and host is normal
>> guest. Another hardware is some embedded board with Atom - here I don't
>> have nested virtualization - host runs on hw, and guest is normal guest.
>>
>> G2H transmission (values are Gbit/s):
>>
>>   Core i7 with nested guest.            Atom with normal guest.
>>
>> *-------------------------------*   *-------------------------------*
>> |          |         |          |   |          |         |          |
>> | buf size |   copy  | zerocopy |   | buf size |   copy  | zerocopy |
>> |          |         |          |   |          |         |          |
>> *-------------------------------*   *-------------------------------*
>> |   4KB    |    3    |    10    |   |   4KB    |   0.8   |   1.9    |
>> *-------------------------------*   *-------------------------------*
>> |   32KB   |   20    |    61    |   |   32KB   |   6.8   |   20.2   |
>> *-------------------------------*   *-------------------------------*
>> |   256KB  |   33    |   244    |   |   256KB  |   7.8   |   55     |
>> *-------------------------------*   *-------------------------------*
>> |    1M    |   30    |   373    |   |    1M    |   7     |   95     |
>> *-------------------------------*   *-------------------------------*
>> |    8M    |   22    |   475    |   |    8M    |   7     |   114    |
>> *-------------------------------*   *-------------------------------*
>>
>> H2G:
>>
>>   Core i7 with nested guest.            Atom with normal guest.
>>
>> *-------------------------------*   *-------------------------------*
>> |          |         |          |   |          |         |          |
>> | buf size |   copy  | zerocopy |   | buf size |   copy  | zerocopy |
>> |          |         |          |   |          |         |          |
>> *-------------------------------*   *-------------------------------*
>> |   4KB    |   20    |    10    |   |   4KB    |   4.37  |    3     |
>> *-------------------------------*   *-------------------------------*
>> |   32KB   |   37    |    75    |   |   32KB   |   11    |   18     |
>> *-------------------------------*   *-------------------------------*
>> |   256KB  |   44    |   299    |   |   256KB  |   11    |   62     |
>> *-------------------------------*   *-------------------------------*
>> |    1M    |   28    |   335    |   |    1M    |   9     |   77     |
>> *-------------------------------*   *-------------------------------*
>> |    8M    |   27    |   417    |   |    8M    |  9.35   |  115     |
>> *-------------------------------*   *-------------------------------*
>>
>> * Let's look to the first line of both tables - where copy is better
>>   than zerocopy. I analyzed this case more deeply and found that
>>   bottleneck is function 'vhost_work_queue()'. With 4K buffer size,
>>   caller spends too much time in it with zerocopy mode (comparing to
>>   copy mode). This happens only with 4K buffer size. This function just
>>   calls 'wake_up_process()' and its internal logic does not depends on
>>   skb, so i think potential reason (may be) is interval between two
>>   calls of this function (e.g. how often it is called). Note, that
>>   'vhost_work_queue()' differs from the same function at guest's side of
>>   transport: 'virtio_transport_send_pkt()' uses 'queue_work()' which
>>   i think is more optimized for worker purposes, than direct call to
>>   'wake_up_process()'. But again - this is just my assumption.
>
> Thanks for the analysis, however for small payloads it makes sense that
> the cost might be too high that optimization does not bring benefits.
>
>>
>> Loopback:
>>
>>   Core i7 with nested guest.            Atom with normal guest.
>>
>> *-------------------------------*   *-------------------------------*
>> |          |         |          |   |          |         |          |
>> | buf size |   copy  | zerocopy |   | buf size |   copy  | zerocopy |
>> |          |         |          |   |          |         |          |
>> *-------------------------------*   *-------------------------------*
>> |   4KB    |    8    |     7    |   |   4KB    |   1.8   |   1.3    |
>> *-------------------------------*   *-------------------------------*
>> |   32KB   |   38    |    44    |   |   32KB   |   10    |   10     |
>> *-------------------------------*   *-------------------------------*
>> |   256KB  |   55    |   168    |   |   256KB  |   15    |   36     |
>> *-------------------------------*   *-------------------------------*
>> |    1M    |   53    |   250    |   |    1M    |   12    |   45     |
>> *-------------------------------*   *-------------------------------*
>> |    8M    |   40    |   344    |   |    8M    |   11    |   74     |
>> *-------------------------------*   *-------------------------------*
>>
>> I analyzed performace difference more deeply for the following setup:
>> server: ./vsock_perf --vsk-size 16M
>> client: ./vsock_perf --sender 2 --bytes 16M --buf-size 16K/4K [--zc]
>>
>> In other words I send 16M of data from guest to host in copy/zerocopy
>> modes and with two different sizes of buffer - 4K and 64K. Let's see
>> to tx path for both modes - it consists of two steps:
>>
>> copy:
>> 1) Allocate skb of buffer's length.
>> 2) Copy data to skb from buffer.
>>
>> zerocopy:
>> 1) Allocate skb with header space only.
>> 2) Pin pages of the buffer and insert them to skb.
>>
>> I measured average number of ns (returned by 'ktime_get()') for each
>> step above:
>> 1) Skb allocation (for both copy and zerocopy modes).
>> 2) For copy mode in 'memcpy_to_msg()' - copying.
>> 3) For zerocopy mode in '__zerocopy_sg_from_iter()' - pinning.
>>
>> Here are results for copy mode:
>> *-------------------------------------*
>> | buf | skb alloc | 'memcpy_to_msg()' |
>> *-------------------------------------*
>> |     |           |                   |
>> | 64K |  5000ns   |      25000ns      |
>> |     |           |                   |
>> *-------------------------------------*
>> |     |           |                   |
>> | 4K  |  800ns    |      2200ns       |
>> |     |           |                   |
>> *-------------------------------------*
>>
>> Here are results for zerocopy mode:
>> *-----------------------------------------------*
>> | buf | skb alloc | '__zerocopy_sg_from_iter()' |
>> *-----------------------------------------------*
>> |     |           |                             |
>> | 64K |  250ns    |          3500ns             |
>> |     |           |                             |
>> *-----------------------------------------------*
>> |     |           |                             |
>> | 4K  |  250ns    |          3000ns             |
>> |     |           |                             |
>> *-----------------------------------------------*
>>
>> I guess that reason of zerocopy performance is low overhead for page
>> pinning: there is big difference between 4K and 64K in case of copying
>> (25000 vs 2200), but in pinning case - just 3000 vs 3500.
>>
>> So, zerocopy is faster than classic copy mode, but of course it requires
>> specific architecture of application due to user pages pinning, buffer
>> size and alignment.
>
> Makes sense!
>
>>
>>                             NOTES
>>
>> If host fails to send data with "Cannot allocate memory", check value
>> /proc/sys/net/core/optmem_max - it is accounted during completion skb
>> allocation. Try to update it to for example 1M and try send again:
>> "echo 1048576 > /proc/sys/net/core/optmem_max" (as root).
>>
>>                            TESTING
>>
>> This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
>> cover new code as much as possible so there are different cases for
>> MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>> vector types (different sizes, alignments, with unmapped pages). I also
>> run tests with loopback transport and run vsockmon. In v3 i've added
>> io_uring test as separated application.
>>
>>           LET'S SPLIT PATCHSET TO MAKE REVIEW EASIER
>>
>> In v3 Stefano Garzarella <sgarzare@xxxxxxxxxx> asked to split this patchset
>> for several parts, because it looks too big for review. I think in this
>> version (v4) we can do it in the following way:
>>
>> [0001 - 0005] - this is preparation for virtio/vhost part.
>> [0006 - 0009] - this is preparation for AF_VSOCK part.
>> [0010 - 0013] - these patches allows to trigger logic from the previous
>>                two parts.
>> [0014 - rest] - updates for doc, tests, utils. This part doesn't touch
>>                kernel code and looks not critical.
>
> Yeah, I like this split, but I'd include 14 in the (10, 13) group.
>
> I have reviewed most of them and I think we are well on our way :-)
> I've already seen that Bobby suggested changes for v5, so I'll review
> that version better.
>
> Great work so far!

Hello Stefano!

Thanks for review! I left some questions, but most of comments are clear
for me. So I guess that idea of split is that I still keep all patches in
a big single patchset, but preserve structure described above and we will
do review process step by step according split?

Or I should split this patchset for 3 separated sets? I guess this will be
more complex to review...

Thanks, Arseniy

>
> Thanks,
> Stefano
>