Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE_FRAG flag)

From: Jesper Dangaard Brouer
Date: Wed Jul 26 2023 - 13:37:18 EST

Next message: Catalin Marinas: "Re: [PATCH] arm64/sme: Set new vector length before reallocating"
Previous message: Alex Williamson: "Re: [PATCH v8 4/4] vfio: Support IO page table replacement"
In reply to: Mina Almasry: "Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE_FRAG flag)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 25/07/2023 06.04, Mina Almasry wrote:

On Mon, Jul 24, 2023 at 7:56 AM Jesper Dangaard Brouer
<jbrouer@xxxxxxxxxx> wrote:

On 17/07/2023 03.53, Mina Almasry wrote:

On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@xxxxxxxx> wrote:

On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote:

Once the skb frags with struct new_abstraction are in the TCP stack,
they will need some special handling in code accessing the frags. But
my RFC already addressed that somewhat because the frags were
inaccessible in that case. In this case the frags will be both
inaccessible and will not be struct pages at all (things like
get_page() will not work), so more special handling will be required,
maybe.

It seems sort of reasonable, though there will be interesting concerns
about coherence and synchronization with generial purpose DMABUFs that
will need tackling.

Still it is such a lot of churn and weridness in the netdev side, I
think you'd do well to present an actual full application as
justification.

Yes, you showed you can stick unordered TCP data frags into GPU memory
sort of quickly, but have you gone further with this to actually show
it is useful for a real world GPU centric application?

BTW your cover letter said 96% utilization, the usual server
configuation is one NIC per GPU, so you were able to hit 1500Gb/sec of
TCP BW with this?

I do notice that the number of NICs is missing from our public
documentation so far, so I will refrain from specifying how many NICs
are on those A3 VMs until the information is public. But I think I can
confirm that your general thinking is correct, the perf that we're
getting is 96.6% line rate of each GPU/NIC pair,

What do you mean by 96.6% "line rate".
Is is the Ethernet line-rate?

Yes I believe this is the ethernet line-rate. I.e. the 200 Gbits/sec
that my NICs run.

Is the measured throughput the measured TCP data "goodput"?

Yes, it is goodput. Roughly I believe we add up the return values of
recvmsg() and divide that number by time (very roughly, I think).

Assuming
- MTU 1500 bytes (1514 on wire).
- Ethernet header 14 bytes
- IP header 20 bytes
- TCP header 20 bytes

Due to header overhead the goodput will be approx 96.4%.
- (1514-(14+20+20))/1514 = 0.9643
- (Not taking Ethernet interframe gap into account).

Thus, maybe you have hit Ethernet wire line-rate already?

My MTU is 8244 actually, which gives me 8192 mss/payload for my
connections. By my math the theoretical max would be 1 - 52/8244 =
~99.3%. So it looks like I'm dropping ~3% line rate somewhere in the
implementation.

Close enough, my math I would have added the 4 byte FCS checksum
(1-56/8244), but it makes no real difference at MTU 8244.

and scales linearly
for each NIC/GPU pair we've tested with so far. Line rate of each
NIC/GPU pair is 200 Gb/sec.

So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec.

Lets keep our units straight.
Here you mean 1545 Gbit/sec, which is 193 GBytes/s

Yes! Sorry! I definitely meant 1545 Gbits/sec, sorry!

If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec

Here you mean 384 Gbit/sec, which is 48 GBytes/sec.

Correct again!

...
etc.

These massive throughput numbers are important, because they *exceed*
the physical host RAM/DIMM memory speeds.

This is the *real argument* why software cannot afford to do a single
copy of the data from host-RAM into GPU-memory, because the CPU memory
throughput to DRAM/DIMM are insufficient.

My testlab CPU E5-1650 have 4 DIMM slots DDR4
- Data Width: 64 bits (= 8 bytes)
- Configured Memory Speed: 2400 MT/s
- Theoretical maximum memory bandwidth: 76.8 GBytes/s (2400*8*4)

Even the theoretical max 76.8 GBytes/s (614 Gbit/s) is not enough for
the 193 GBytes/s or 1545 Gbit/s (8 NIC/GPU pairs).

When testing this with lmbench tool bw_mem, the results (below
signature) are in the area 14.8 GBytes/sec (118 Gbit/s), as soon as
exceeding L3 cache size. In practice it looks like main memory is
limited to reading 118 Gbit/s *once*. (Mina's NICs run at 200 Gbit/s)

Some more insights. I couldn't believe this (single CPU) test was so
far from the theoretical max (76.8 vs. 14.8 GBytes/s).
This smells like a per CPU core limitation. The lmbench tool bw_mem have
an option for parallelism (-P) for testing this.
My testlab CPU only have 6 cores (as I have disabled HT).

Testing on more CPU cores show an increase in scaling mem bandwidth:

Cores 1 = 15.0 GB/s - scale: 1.00 (one core as scale point)
Cores 2 = 26.9 GB/s - scale: 1.79
Cores 3 = 36.3 GB/s - scale: 2.42
Cores 4 = 44.0 GB/s - scale: 2.93
Cores 5 = 48.9 GB/s - scale: 3.26
Cores 6 = 49.4 GB/s - scale: 3.29

Thus, the practical test show CPU memory DIMM read bandwidth scales to
49.4 GB/s (395.2 Gbit/s), so there is still hope of 400Gbit/s devices,
when utilizing more CPU cores.

I don't have a clear explanation why there is a limit per core.

The [toplev] tool with (bw_mem -P2) says:
Backend_Bound = 90.7% of the time.
Backend_Bound.Memory_Bound = 71.8% of these 90.7%
Backend_Bound.Core_Bound = remaining 18.9%

Backend_Bound.Memory_Bound is split into two main "stalls":
Backend_Bound.Memory_Bound.L3_Bound = 7.9%
Backend_Bound.Memory_Bound.DRAM_Bound = 58.5%

[toplev] https://github.com/andikleen/pmu-tools

Given DDIO can deliver network packets into L3, I also tried to figure
out what the L3 read bandwidth, which I measured to be 42.4 GBits/sec
(339 Gbit/s), in hopes that it would be enough, but it was not.

The memory bandwidth to L3 cache scales up per CPU core:

Cores 1 = 42.35 GB/s = scale: 1.00 (one core as scale point)
Cores 2 = 86.38 GB/s = scale: 2.04
Cores 3 = 126.96 GB/s = scale: 3.00
Cores 4 = 168.48 GB/s = scale: 3.98
Cores 5 = 211.77 GB/s = scale: 5.00
Cores 6 = 244.95 GB/s = scale: 5.78

Nice to see how well this scales up per core.
Fairly impressive total max L3 bandwidth of 244.95 GB/s (1959.6 Gbit/s).

Yes, avoiding any memory speed bottleneck as you note is important,
but the second point mentioned in my cover letter is also impactful:

" Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
of the PCIe tree, compared to traditional path which sends data through the
root complex."

This is a good and important point.

Depending on the hardware, this is a bottleneck that we avoid with
device memory TCP. NIC/GPU copies occupy the PCIe link bandwidth. In a
hierarchy like this:

root complex
| (uplink)
PCIe switch
/ \
NIC GPU

I believe the uplink from the PCIe switch to the root complex is used
up 2 times for TX and 2 times for RX if the data needs to go through
host memory:

RX: NIC -> root complex -> GPU
TX: GPU -> root complex -> NIC

With device memory TCP, and enabling PCI P2P communication between the
devices under the same PCIe switch, the payload flows directly from/to
the NIC/GPU through the PCIe switch, and the payload never goes to the
root complex, alleviating pressure/bottleneck on that link between the
PCIe switch/root complex. I believe this is a core reason we're able
to scale throughput linearly with NIC/GPU pairs, because we don't
stress share uplink connections and all the payload data transfer
happens beneath the PCIe switch.

Good points, and I guess this is what Jason was hinting to.
And illustrated in this picture[1] (I googled):

[1] https://www.servethehome.com/wp-content/uploads/2022/08/HC34-NVIDIA-DGX-H100-Data-network-configuration.jpg

The GPUs "internally" have switched nvlink connections. As Jason said,
these nvlinks have an impressive bandwidth[2] of 900GB/s (7200 Gbit/s).

[2] https://www.nvidia.com/en-us/data-center/nvlink/

--Jesper
(data below signature)

Added raw commands and data below.

CPU under test:

$ cat /proc/cpuinfo | egrep -e 'model name|cache size' | head -2
model name : Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
cache size : 15360 KB

Providing some cmdline outputs from lmbench "bw_mem" tool.
(Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second)

Running bw_mem with parallelism utilizing more cores:

$ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P1 256m rd
268.44 15015.69

$ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P2 256m rd
268.44 26896.42

$ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P3 256m rd
268.44 36347.36

$ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P4 256m rd
268.44 44073.72

$ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P5 256m rd
268.44 48872.02

$ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P6 256m rd
268.44 49426.76

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rd
256.00 14924.50

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M wr
256.00 9895.25

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rdwr
256.00 9737.54

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bcopy
256.00 12462.88

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bzero
256.00 14869.89

Next output shows reducing size below L3 cache size, which shows an
increase in speed, likely the L3 bandwidth.

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 64M rd
64.00 14840.58

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 32M rd
32.00 14823.97

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 16M rd
16.00 24743.86

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 8M rd
8.00 40852.26

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 4M rd
4.00 42545.65

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 2M rd
2.00 42447.82

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 1M rd
1.00 42447.82

Tests for testing L3 per core scaling.

$ taskset -c 0 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P1 512K rd
0.512000 42357.43

$ taskset -c 0-1 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P2 512K rd
0.512000 86380.09

$ taskset -c 0-2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P3 512K rd
0.512000 126960.94

$ taskset -c 0-3 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P4 512K rd
0.512000 168485.49

$ taskset -c 0-4 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P5 512K rd
0.512000 211770.67

$ taskset -c 0-5 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P6 512K rd
0.512000 244959.35

Next message: Catalin Marinas: "Re: [PATCH] arm64/sme: Set new vector length before reallocating"
Previous message: Alex Williamson: "Re: [PATCH v8 4/4] vfio: Support IO page table replacement"
In reply to: Mina Almasry: "Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE_FRAG flag)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]