Re: [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration

From: Niklas Schnelle
Date: Wed Apr 05 2023 - 13:05:36 EST


On Mon, 2023-03-27 at 11:28 +0800, Wen Gu wrote:
> Hi, all
>
> # Background
>
> The background and previous discussion can be referred from [1],[6].
>
> We found SMC-D can be used to accelerate OS internal communication, such as
> loopback or between two containers within the same OS instance. So this patch
> set provides a kind of SMC-D dummy device (we call it the SMC-D loopback device)
> to emulate an ISM device, so that SMC-D can also be used on architectures
> other than s390. The SMC-D loopback device are designed as a system global
> device, visible to all containers.
>
> This version is implemented based on the generalized interface provided by [2].
> And there is an open issue, which will be mentioned later.
>
> # Design
>
> This patch set basically follows the design of the previous version.
>
> Patch #1/9 ~ #3/9 attempt to decouple ISM-related structures from the SMC-D
> generalized code and extract some helpers to make SMC-D protocol compatible
> with devices other than s390 ISM device.
>
> Patch #4/9 introduces a kind of loopback device, which is defined as SMC-D v2
> device and designed to provide communication between SMC sockets in the same OS
> instance.
>
> +-------------------------------------------+
> | +--------------+ +--------------+ |
> | | SMC socket A | | SMC socket B | |
> | +--------------+ +--------------+ |
> | ^ ^ |
> | | +----------------+ | |
> | | | SMC stack | | |
> | +--->| +------------+ |<--| |
> | | | dummy | | |
> | | | device | | |
> | +-+------------+-+ |
> | OS |
> +-------------------------------------------+
>
> Patch #5/9 ~ #8/9 expand SMC-D protocol interface (smcd_ops) for scenarios where
> SMC-D is used to communicate within VM (loopback here) or between VMs on the same
> host (based on virtio-ism device, see [3]). What these scenarios have in common
> is that the local sndbuf and peer RMB can be mapped to same physical memory region,
> so the data copy between the local sndbuf and peer RMB can be omitted. Performance
> improvement brought by this extension can be found in # Benchmark Test.
>
> +----------+ +----------+
> | socket A | | socket B |
> +----------+ +----------+
> | ^
> | +---------+ |
> regard as | | ----------|
> local sndbuf | B's | regard as
> | | RMB | local RMB
> |-------> | |
> +---------+
>
> Patch #9/9 realizes the support of loopback device for the above-mentioned expanded
> SMC-D protocol interface.
>
> # Benchmark Test
>
> * Test environments:
> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
> - SMC sndbuf/RMB size 1MB.
>
> * Test object:
> - TCP lo: run on TCP loopback.
> - domain: run on UNIX domain.
> - SMC lo: run on SMC loopback device with patch #1/9 ~ #4/9.
> - SMC lo-nocpy: run on SMC loopback device with patch #1/9 ~ #9/9.
>
> 1. ipc-benchmark (see [4])
>
> - ./<foo> -c 1000000 -s 100
>
> TCP-lo domain SMC-lo SMC-lo-nocpy
> Message
> rate (msg/s) 79025 115736(+46.45%) 146760(+85.71%) 149800(+89.56%)
>
> 2. sockperf
>
> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>
> TCP-lo SMC-lo SMC-lo-nocpy
> Bandwidth(MBps) 4822.388 4940.918(+2.56%) 8086.67(+67.69%)
> Latency(us) 6.298 3.352(-46.78%) 3.35(-46.81%)
>
> 3. iperf3
>
> - serv: <smc_run> taskset -c <cpu> iperf3 -s
> - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>
> TCP-lo SMC-lo SMC-lo-nocpy
> Bitrate(Gb/s) 40.7 40.5(-0.49%) 72.4(+77.89%)
>
> 4. nginx/wrk
>
> - serv: <smc_run> nginx
> - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>
> TCP-lo SMC-lo SMC-lo-nocpy
> Requests/s 155994.57 214544.79(+37.53%) 215538.55(+38.17%)
>
>
> # Open issue
>
> The open issue is about how to detect that the source and target of CLC proposal
> are within the same OS instance and can communicate through the SMC-D loopback device.
> Similar issue also exists when using virtio-ism devices (the background and details
> of virtio-ism device can be referred from [3]). In previous discussions, multiple
> options were proposed (see [5]). Thanks again for the help of the community. :)
>
> But as we discussed, these solutions have some imperfection. So this version of RFC
> continues to use previous workaround, that is, a 64-bit random GID is generated for
> SMC-D loopback device. If the GIDs of the devices found by two peers are the same,
> then they are considered to be in the same OS instance and can communicate with each
> other by the loopback device.
>
> This approach needs that the loopback device GID is globally unique. But theoretically
> there is a possibility of a collision. Assume the following situations:
>
> (1) Assume that the SMC-D loopback devices of the two different OS instances happen
> to generate the same 64-bit GID.
>
> For the convenience of description, we refer to the sockets on these two
> different OS instance as server A and client B.
>
> A will misjudge that the two are on the same OS instance because the same GID
> in CLC proposal message. Then A creates its RMB and sends 64-bit token-A to B
> in CLC accept message.
>
> B receives the CLC accept message. And according to patch #7/9, B tries to
> attach its sndbuf to A's RMB by token-A.
>
> (2) And assume that the OS instance where B is located happens to have an unattached
> RMB whose 64-bit token is same as token-A.
>
> Then B successfully attaches its sndbuf to the wrong RMB, and creates its RMB,
> sends token-B to A in CLC confirm message.
>
> Similarly, A receives the message and tries to attach its sndbuf to B's RMB by
> token-B.
>
> (3) Similar to (2), assume that the OS instance where A is located happens to have
> an unattached RMB whose 64-bit token is same as token-B.
>
> Then A successfully attach its sndbuf to the wrong RMB. Both sides mistakenly
> believe that an SMC-D connection based on the loopback device is established
> between them.
>
> If the above 3 coincidences all happen, that is, 64-bit random number conflicts occur
> 3 times, then an unreachable SMC-D connection will be established, which is nasty.
> But if one of above is not satisfied, it will safely fallback to TCP.
>
> Since the chances of these happening are very small, I wonder if this risk of 1/2^(64*3)
> probability is acceptable? Can we just use 64-bits random generated number as GID in
> loopback device?

Let me just spell out some details here to make sure we're all on the
same page.

You're assuming that GIDs are generated randomly at cryptographic
quality. In the code I can see that you use get_random_bytes() which as
its comment explains supplies the same quality randomness as
/dev/urandom so on modern kernels that should provide cryptographic
quality randomness and be fine. Might be something to keep in mind for
backports though.

The fixed CHID of 0xFFFF makes sure this system identity confusion can
only occur between SMC-D loopback (and possibly virtio-ism?) never with
ISM based SMC-D or SMC-R as these never use this CHID value. Correct?

Now for the collision scenario above. As I understand it the
probability of the case where fallback does *not* occur is equivalent
to a 128 bit hash collision. Basically the random 64 bit GID_A
concatenated with the 64 bit DMB Token_A needs to just happen to match
the concatenation of the random 64 bit GID_B with DMB Token_B. With
that interpretation we can consult Wikipedia[0] for a nice table of how
many random GID+DMB Token choices are needed for a certain collision
probability. For 128 bits at least 8.2×10^11 tries would be needed just
to reach a 10^-15 collision probability. Considering the collision does
not only need to exist between two systems but these also need to try
to communicate with each other and happen to use the colliding DMBs for
things to get into the broken fallback case I think from a theoretical
point of view this sounds like neglible risk to me.

That said I'm more worried about the fallback to TCP being broken due
to a code bug once the GIDs do match which is already extremely
unlikely and thus not naturally tested in the wild. Do we have a plan
how to keep testing that fallback scenario somehow. Maybe with a
selftest or something? 

If we can solve the testing part then I'm personally in favor of this
approach of going with cryptograhically random GID and DMB token. It's
simple and doesn't depend on external factors and doesn't need a
protocol extension except for possibly reserving CHID 0xFFFF.

One more question though, what about the SEID why does that have to be
fixed and at least partially match what ISM devices use? I think I'm
missing some SMC protocol/design detail here. I'm guessing this would
require a protocol change?

Thanks,
Niklas

[0] https://en.wikipedia.org/wiki/Birthday_attack