Re: [RFC PATCH net-next v2 0/5] net/smc: Introduce SMC-D based loopback acceleration

From: Alexandra Winter
Date: Wed Jan 04 2023 - 11:10:55 EST




On 21.12.22 14:14, Wen Gu wrote:
>
>
> On 2022/12/20 22:02, Niklas Schnelle wrote:
>
>> On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
>>> Hi, all
>>>
>>> # Background
>>>
>>> As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
>>> to accelerate TCP applications in cloud environments, improving inter-host
>>> and inter-VM communication.
>>>
>>> In addition to these, we also found value in SMC-D for local inter-process
>>> communication, such as accelerating communication between containers within
>>> the same host. So this RFC tries to provide an SMC-D loopback solution for
>>> such scenarios, bringing a significant improvement in latency and throughput
>>> compared to TCP loopback.
>>>
>>> # Design
>>>
>>> This patch set provides a kind of SMC-D loopback solution.
>>>
>>> Patches #1/5 and #2/5 provide an SMC-D based dummy device, preparing for
>>> inter-process communication acceleration. Besides loopback acceleration,
>>> the dummy device also meets the requirement mentioned in [2]: providing a
>>> way for the broader community to test SMC-D logic without an ISM device.
>>>
>>>   +------------------------------------------+
>>>   |  +-----------+           +-----------+   |
>>>   |  | process A |           | process B |   |
>>>   |  +-----------+           +-----------+   |
>>>   |       ^                        ^         |
>>>   |       |    +---------------+   |         |
>>>   |       |    |   SMC stack   |   |         |
>>>   |       +--->| +-----------+ |<--|         |
>>>   |            | |   dummy   | |             |
>>>   |            | |   device  | |             |
>>>   |            +-+-----------+-+             |
>>>   |                   VM                     |
>>>   +------------------------------------------+
>>>
>>> Patches #3/5, #4/5 and #5/5 provide a way to avoid the data copy from sndbuf
>>> to RMB and improve SMC-D loopback performance. By extending smcd_ops with two
>>> new operations, attach_dmb and detach_dmb, the sender's sndbuf shares the same
>>> physical memory region with the receiver's RMB. Data copied from userspace to
>>> the sender's sndbuf directly reaches the receiver's RMB without an extra memory
>>> copy within the same kernel (see the sketch after the diagram below).
>>>
>>>   +----------+                     +----------+
>>>   | socket A |                     | socket B |
>>>   +----------+                     +----------+
>>>         |                               ^
>>>         |         +---------+           |
>>>    regard as      |         | ----------|
>>>    local sndbuf   |  B's    |     regard as
>>>         |         |  RMB    |     local RMB
>>>         |-------> |         |
>>>                   +---------+
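>>>
>>> A rough sketch of the smcd_ops extension (the prototypes here are
>>> illustrative only; see patch #3/5 for the actual definitions):
>>>
>>>   struct smcd_ops {
>>>           /* ... existing operations, unchanged ... */
>>>
>>>           /* optional: map a peer's DMB into the local connection's
>>>            * sndbuf descriptor so that both sides use the same
>>>            * physical pages
>>>            */
>>>           int (*attach_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
>>>
>>>           /* optional: release a mapping established by attach_dmb */
>>>           int (*detach_dmb)(struct smcd_dev *dev, u64 token);
>>>   };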
>>
>> Hi Wen Gu,
>>
>> I maintain the s390-specific PCI support in Linux and would like to
>> provide a bit of background on this. You're surely wondering why we
>> even have a copy in there for our ISM virtual PCI device. To understand
>> why this copy operation exists and why we need to keep it working, one
>> needs a bit of s390, aka mainframe, background.
>>
>> On s390 all (currently supported) native machines have a mandatory
>> machine-level hypervisor. All OSs, whether z/OS or Linux, run either on
>> this machine-level hypervisor as so-called Logical Partitions (LPARs)
>> or as second/third/… level guests on e.g. a KVM or z/VM hypervisor that
>> in turn runs in an LPAR. Now, in terms of memory, this machine-level
>> hypervisor, sometimes called PR/SM, is, unlike KVM, z/VM, or VMware, a
>> partitioning hypervisor without paging. This is one of the main reasons
>> for the very-near-native performance of the machine hypervisor, as the
>> memory of its guests acts just like native RAM on other systems. It is
>> never paged out and is always accessible to IOMMU-translated DMA from
>> devices without the need for pinning pages, and, besides a trivial
>> offset/limit adjustment, an LPAR's MMU does the same amount of work as
>> an MMU on a bare-metal x86_64/ARM64 box.
>>
>> It also means, however, that when SMC-D is used to communicate between
>> LPARs via an ISM device, there is no way of mapping the DMBs to the
>> same physical memory, as there exists no MMU-like layer spanning
>> partitions that could do such a mapping. Meanwhile, for machine-level
>> firmware, including the ISM virtual PCI device, it is still possible to
>> _copy_ memory between different memory partitions. So yeah, while I do
>> see the appeal of skipping the memcpy() for loopback or even between
>> guests of a paging hypervisor such as KVM, which can map the DMBs onto
>> the same physical memory, we must keep in mind this original use case
>> requiring a copy operation.
>>
>> Thanks,
>> Niklas
>>
>
> Hi Niklas,
>
> Thank you so much for the complete and detailed explanation! It gives me a
> brand new perspective on s390 devices, which we hadn't explored before.
> Now I understand why shared memory is unavailable between different LPARs.
>
> Our original intention in proposing the loopback device and the upcoming
> inter-VM device (virtio-ism) is to use SMC-D to accelerate communication in
> cases where no s390 ISM device exists. In our conception, the s390 ISM device,
> the loopback device and the virtio-ism device are parallel and are all
> abstracted by smcd_ops.
>
>  +------------------------+
>  |          SMC-D         |
>  +------------------------+
>  -------- smcd_ops ---------
>  +------+ +------+ +------+
>  | s390 | | loop | |virtio|
>  | ISM  | | back | | -ism |
>  | dev  | | dev  | | dev  |
>  +------+ +------+ +------+
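>
> For illustration only (the callback and constant names below are hypothetical,
> not taken from the patches), each backend would supply its own smcd_ops and
> register through the existing smcd_alloc_dev()/smcd_register_dev() helpers:
>
>   /* hypothetical callbacks of the loopback backend */
>   static const struct smcd_ops smc_lo_ops = {
>           .register_dmb   = smc_lo_register_dmb,
>           .unregister_dmb = smc_lo_unregister_dmb,
>           /* ... remaining mandatory callbacks ... */
>   };
>
>   static int smc_lo_init(struct device *parent)
>   {
>           struct smcd_dev *smcd;
>
>           /* SMC_LO_MAX_DMBS is a made-up constant for this sketch */
>           smcd = smcd_alloc_dev(parent, "lo", &smc_lo_ops, SMC_LO_MAX_DMBS);
>           if (!smcd)
>                   return -ENOMEM;
>           return smcd_register_dev(smcd);
>   }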
>
> We also believe that the existing design and behavior of the s390 ISM device
> should remain untouched. What we want to get support for is an smcd_ops
> extension for devices with optional, beneficial capabilities, such as nocopy
> here (let's call it this for now), which is really helpful for us in
> inter-process and inter-VM scenarios.
>
> And since this coincides with IBM's intention to add an API between SMC-D and
> the devices, to support various devices for SMC-D as mentioned in [2], we are
> sending out this RFC and the upcoming virtio-ism RFC to provide some examples.
>
>>>
>>> # Benchmark Test
>>>
>>>   * Test environments:
>>>        - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>>        - SMC sndbuf/RMB size 1MB.
>>>
>>>   * Test object:
>>>        - TCP: run on TCP loopback.
>>>        - domain: run on UNIX domain sockets.
>>>        - SMC lo: run on SMC loopback device with patch #1/5 ~ #2/5.
>>>        - SMC lo-nocpy: run on SMC loopback device with patch #1/5 ~ #5/5.
>>>
>>> 1. ipc-benchmark (see [3])
>>>
>>>   - ./<foo> -c 1000000 -s 100
>>>
>>>                        TCP             domain             SMC-lo       SMC-lo-nocpy
>>> Message rate (msg/s)  75140    129548(+72.41%)   152266(+102.64%)   151914(+102.17%)
>>
>> Interesting that it does beat UNIX domain sockets. Also, see my comment
>> below on nginx/wrk, as this seems very similar.
>>
>>>
>>> 2. sockperf
>>>
>>>   - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>>   - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>>
>>>                         TCP                  SMC-lo             SMC-lo-nocpy
>>> Bandwidth(MBps)   4943.359        4936.096(-0.15%)        8239.624(+66.68%)
>>> Latency(us)          6.372          3.359(-47.28%)            3.25(-49.00%)
>>>
>>> 3. iperf3
>>>
>>>   - serv: <smc_run> taskset -c <cpu> iperf3 -s
>>>   - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>>>
>>>                         TCP                  SMC-lo             SMC-lo-nocpy
>>> Bitrate(Gb/s)         40.5            41.4(+2.22%)            76.4(+88.64%)
>>>
>>> 4. nginx/wrk
>>>
>>>   - serv: <smc_run> nginx
>>>   - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>>>
>>>                         TCP                  SMC-lo             SMC-lo-nocpy
>>> Requests/s       154643.22      220894.03(+42.84%)        226754.3(+46.63%)
>>
>>
>> This result is very interesting indeed. So with the much more realistic
>> nginx/wrk workload, it seems the copy hurts much less than iperf3/sockperf
>> would suggest, while SMC-D itself seems to help more.
>> I'd hope that this translates to actual applications as well. Maybe
>> this makes SMC-D based loopback interesting even while keeping the
>> copy, at least until we can come up with a sane way to work a no-copy
>> variant into SMC-D?
>>
>
> I agree, the nginx/wrk workload is much more realistic for many applications.
>
> But in the cloud we also encounter many other cases similar to sockperf that
> require high throughput, such as AI training and big data.
>
> So avoiding the copy between DMBs can help these cases a lot :)
>
>>>
>>>
>>> # Discussion
>>>
>>> 1. API between SMC-D and ISM device
>>>
>>> As Jan mentioned in [2], IBM is working on placing an API between SMC-D
>>> and the ISM device for easier use of different "devices" for SMC-D.
>>>
>>> So, considering that the introduction of attach_dmb or detach_dmb can
>>> effectively avoid data copying from sndbuf to RMB and brings obvious
>>> throughput advantages in inter-VM or inter-process scenarios, can the
>>> attach/detach semantics be taken into consideration when designing the
>>> API to make it a standard ISM device behavior?
>           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> Due to the reasons explained above, this behavior can't be emulated by
>> ISM devices, at least not when crossing partitions. I'm not sure if we can
>> still incorporate it in the API and allow for both copying and
>> remapping SMC-D-like devices; it definitely needs careful consideration
>> and, I think, also a better understanding of the benefit for real-world
>> workloads.
>>
>
> Here I was not being rigorous.
>
> Indeed, nocopy shouldn't be a standard ISM device behavior. Actually, we hope
> it can be a standard but optional _SMC-D_ device behavior, defined by smcd_ops.
>
> For devices that don't support these options, like the ISM device on the s390
> architecture, .attach_dmb/.detach_dmb and other reasonable extensions (which will
> be proposed for discussion in the upcoming virtio-ism RFC) can be set to NULL or
> return an error. And for devices that do support them, they may be used to
> improve performance in some cases, as sketched below.
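>
> A minimal sketch of these "optional" semantics from the SMC core's point of
> view (the helper name is hypothetical and not part of this RFC):
>
>   /* use the nocopy extension only if the device implements it,
>    * otherwise keep the existing sndbuf -> RMB memcpy path
>    */
>   static bool smcd_supports_nocopy(struct smcd_dev *dev)
>   {
>           return dev->ops->attach_dmb && dev->ops->detach_dmb;
>   }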
>
> In addition, may I know the latest news about the API design? :) For example,
> its scale: will it be an almost complete refactoring of the existing interface,
> or incremental patching? And its goal: will it be tailored to exact ISM behavior,
> or will it reserve some options for other devices, like nocopy here? From my
> understanding of [2], it might be the latter?
>
>>>
>>> Maybe our RFCs for SMC-D based inter-process acceleration (this one) and
>>> inter-VM acceleration (coming soon, as an update of [1]) can provide some
>>> examples for the new API design. And we are very glad to discuss this on
>>> the mailing list.
>>>
>>> 2. Way to select different ISM-like devices
>>>
>>> With the proposal of the SMC-D loopback 'device' (this RFC) and the upcoming
>>> device used for inter-VM acceleration as an update of [1], SMC-D has more
>>> options to choose from. So we need to consider how to indicate supported
>>> devices, how to determine which one to use, and their priority...
>>
>> Agreed on this part, though it is for the SMC maintainers to decide. I
>> think we would definitely want to be able to use any upcoming inter-VM
>> devices on s390, possibly also in conjunction with ISM devices, for
>> communication across partitions.
>>
>
> Yes, this part needs to be discussed with the SMC maintainers. And thank you, we
> would be very glad if our devices could be used on s390 through these efforts.
>
>
> Best Regards,
> Wen Gu
>
>>>
>>> IMHO, this may require an update of the CLC messages and the negotiation
>>> mechanism. Again, we are very glad to discuss this with you on the mailing list.

As described in the
SMC protocol (including SMC-D): https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202_2.pdf
the CLC messages provide a list of up to 8 ISM devices to choose from.
So I would hope that we can use the existing protocol.

The challenge will be to define GID (Global Interface ID) and CHID (a fabric ID) in
a meaningful way for the new devices.
There is always smcd_ops->query_remote_gid() as a safety net. But the idea is that
a CHID mismatch is a fast way to tell that these 2 interfaces do not match.
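
To illustrate the idea (pseudo-code; the field and variable names are made up,
only smcd_ops->query_remote_gid() is an existing callback):

  /* fast path: a CHID mismatch immediately rules this device out,
   * interfaces on different fabrics cannot reach each other
   */
  if (local_chid != peer_chid)
          continue;       /* try the next device from the CLC list */

  /* safety net: ask the device whether the peer's GID is actually
   * reachable before settling on this interface pair
   */
  if (smcd->ops->query_remote_gid(smcd, peer_gid, vlan_valid, vlan_id))
          continue;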


>>>
>>> [1] https://lore.kernel.org/netdev/20220720170048.20806-1-tonylu@xxxxxxxxxxxxxxxxx/
>>> [2] https://lore.kernel.org/netdev/35d14144-28f7-6129-d6d3-ba16dae7a646@xxxxxxxxxxxxx/
>>> [3] https://github.com/goldsborough/ipc-bench
>>>
>>> v1->v2
>>>   1. Fix some build WARNINGs reported by the kernel test robot.
>>>      Reported-by: kernel test robot <lkp@xxxxxxxxx>
>>>   2. Add iperf3 test data.
>>>
>>> Wen Gu (5):
>>>    net/smc: introduce SMC-D loopback device
>>>    net/smc: choose loopback device in SMC-D communication
>>>    net/smc: add dmb attach and detach interface
>>>    net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback
>>>    net/smc: logic of cursors update in SMC-D loopback connections
>>>
>>>   include/net/smc.h      |   3 +
>>>   net/smc/Makefile       |   2 +-
>>>   net/smc/af_smc.c       |  88 +++++++++++-
>>>   net/smc/smc_cdc.c      |  59 ++++++--
>>>   net/smc/smc_cdc.h      |   1 +
>>>   net/smc/smc_clc.c      |   4 +-
>>>   net/smc/smc_core.c     |  62 +++++++++
>>>   net/smc/smc_core.h     |   2 +
>>>   net/smc/smc_ism.c      |  39 +++++-
>>>   net/smc/smc_ism.h      |   2 +
>>>   net/smc/smc_loopback.c | 358 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>   net/smc/smc_loopback.h |  63 +++++++++
>>>   12 files changed, 662 insertions(+), 21 deletions(-)
>>>   create mode 100644 net/smc/smc_loopback.c
>>>   create mode 100644 net/smc/smc_loopback.h
>>>