Re: OT: Processor recommendation for RAID6

From: Paul Menzel
Date: Wed Apr 07 2021 - 09:47:06 EST


Dear Roger,


Thank you for your response.


On 02.04.21 at 16:45, Roger Heflin wrote:
On Fri, Apr 2, 2021 at 4:13 AM Paul Menzel wrote:

Are these values a good benchmark for comparing processors?

After two years, yes, they are. I created 16 files of 10 GB each in `/dev/shm`,
set them up as loop devices, and created a RAID6. For the resync speed it
makes a difference:

2 x AMD EPYC 7601 32-Core Processor: 34671K/sec
2 x Intel Xeon Gold 6248 CPU @ 2.50GHz: 87533K/sec

So, the current state of affairs seems to be that AVX512 instructions
do help software RAIDs, if you want fast rebuild/resync times.
Getting, for example, a four-core/eight-thread Intel Xeon Gold 5222
might be useful.

Now the question remains whether AMD processors could make up for it with
higher performance or better optimized code, or whether AVX512 instructions
are a must.

[…]

PS: Here are the commands on the AMD EPYC system:

```
$ for i in $(seq 1 16); do truncate -s 10G /dev/shm/vdisk$i.img; done
$ for i in /dev/shm/v*.img; do sudo losetup --find --show $i; done
[…]
$ sudo mdadm --create /dev/md1 --level=6 --raid-devices=16 /dev/loop{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
$ more /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12] loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5] loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[>....................] resync = 3.9% (416880/10476544) finish=5.6min speed=29777K/sec

unused devices: <none>
$ more /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12] loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5] loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[>....................] resync = 4.1% (439872/10476544) finish=5.3min speed=31419K/sec
$ sudo mdadm -S /dev/md1
mdadm: stopped /dev/md1
$ sudo losetup -D
$ sudo rm /dev/shm/vdisk*.img
```

I think you are testing something else. Your speeds are way below
what the raw processor can do. You are probably testing memory
speed/NUMA architecture differences between the two.

On the Intel architecture there are 2 NUMA nodes in total with 4 channels
each, so the system has 8 usable channels of bandwidth, but an allocation
on a single NUMA node will only have 4 channels usable (DDR4-2933).

On the EPYC there are 8 NUMA nodes with 2 channels each (DDR4-2666),
so any single memory allocation will have only 2 channels available,
and accesses that go across the NUMA interconnect will be slower.

So the expected ratio is (4 × 2933) / (2 × 2666) ≈ 2.20, and
2.20 × 34671 ≈ 76286 (fairly close to your results).

How the memory allocation works depends a lot on how much RAM you
actually have per NUMA node and how much the whole machine has. But
any single block for any single device should be on a single NUMA node
almost all of the time.

You might want to drop the caches before the test, run `numactl --hardware`
to see how much memory is free per NUMA node, then rerun the test, and at
the end of the test, before stopping the array, run `numactl --hardware`
again to see how the memory was spread across the NUMA nodes. Even if it
is spread across multiple nodes, that may well mean that in the EPYC case
the main RAID processes are running against several remote NUMA nodes,
whereas the Intel system only has two nodes, so there is a decent chance
it is running on one node most of the time (so no remote memory).

I have also seen in benchmarks I have run on 2P and 4P Intel machines
that, for a single-threaded job on a 2P machine, interleaving is faster
than running on a single NUMA node's memory (with the process pinned to
a single CPU on one node and the memory interleaved over both), but on a
4P/4-NUMA-node machine interleaving slows it down significantly. And in
the default case any single read/write of a block is likely on only a
single NUMA node, so that specific read/write is constrained by the
bandwidth of a single NUMA node, giving an advantage to fewer,
faster/bigger NUMA nodes and less remote memory.

Outside of rebooting and forcing the entire machine to interleave, I am
not sure how to get shm to interleave. It might be a good enough test to
just force the EPYC to interleave and see if the benchmark result changes
in any way. If the result does change, repeat it on the Intel. Overall,
the RAID would not be able to use very many CPUs anyway, so a bigger
machine with more NUMA nodes may slow down the overall rate.

Thank you for the analysis. If I find the time, I am going to try your suggestions. In the meantime I won’t test in `/dev/shm`. Our servers with 256+ GB of RAM are only two-socket systems with a lot of cores/threads, but I did not have controllers and disks handy for testing.
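Roughly what I plan to try, if I understood your suggestion correctly; pre-filling the tmpfs files under `numactl --interleave=all` is my own assumption for how to get the `/dev/shm` pages interleaved without rebooting:

```
# Drop the caches so `numactl --hardware` shows a clean baseline
$ echo 3 | sudo tee /proc/sys/vm/drop_caches

# Free memory per NUMA node before the test
$ numactl --hardware

# Pre-fill the backing files with an interleaved memory policy, so the
# tmpfs pages are spread over all NUMA nodes (truncate would leave the
# files sparse, and the pages would then be allocated during the resync)
$ for i in $(seq 1 16); do sudo numactl --interleave=all dd if=/dev/zero of=/dev/shm/vdisk$i.img bs=1M count=10240; done

# Set up the loop devices and the array as before, and, while the resync
# is still running, check how the memory ended up distributed
$ numactl --hardware
```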

I quickly ran the `/dev/shm` benchmark on two desktop machines.

A Dell OptiPlex 5055 with an AMD Ryzen 5 PRO 1500 (max 3.5 GHz), 16 GB of memory, and 16 loop-mounted 512 MB files in `/dev/shm`, running Linux 5.12.0-rc6, reports 60000K/sec.
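For reference, the setup follows the quoted P.S. above, only with 512 MB files:

```
$ for i in $(seq 1 16); do truncate -s 512M /dev/shm/vdisk$i.img; done
$ for i in /dev/shm/v*.img; do sudo losetup --find --show $i; done
$ sudo mdadm --create /dev/md1 --level=6 --raid-devices=16 /dev/loop{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
```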

```
$ more /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12] loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5] loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
7311360 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[===================>.] resync = 95.6% (500704/522240) finish=0.0min speed=62588K/sec

unused devices: <none>
```

A Dell Precision 3620 with an Intel i7-7700 @ 3.6 GHz and 32 GB of memory, running Linux 5.12.0-rc3, reports 110279K/sec.

```
$ more /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12] loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5] loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
7311360 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[================>....] resync = 84.3% (441116/522240) finish=0.0min speed=110279K/sec

unused devices: <none>
```

I have no idea if it’s related to the smaller files or to the processor/system (single-thread performance?).
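For what it is worth, one way to compare the raw per-core RAID6 capability of the processors, independently of the memory layout and the file size, might be the benchmark the kernel runs when the raid6 module is loaded:

```
# The gen() lines show the syndrome computation speed measured for each
# implementation (SSE2, AVX2, AVX512, …), and which one was selected.
$ dmesg | grep raid6
```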

On a Dell T440/021KCD (firmware 2.9.3) with two Intel Xeon Gold 5222 CPUs @ 3.80 GHz (AVX512), 128 GB of memory, an Adaptec Smart Storage PQI 12G SAS/PCIe 3 (HBA1100), and 16 × 8 TB Seagate ST8000NM001A drives, Linux 5.4.97 reports over 130000K/sec.

```
$ more /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md0 : active raid6 sdr[15] sdq[14] sdp[13] sdo[12] sdn[11] sdm[10] sdl[9] sdk[8] sdj[7] sdi[6] sdh[5] sdg[4] sdf[3] sde[2] sdd[1] sdc[0]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[=>...................] resync = 5.7% (452697636/7813894144) finish=938.1min speed=130767K/sec
bitmap: 56/59 pages [224KB], 65536KB chunk

unused devices: <none>
$ sudo perf top
[…]
15.97% [kernel] [k] xor_avx_5
12.78% [kernel] [k] analyse_stripe
11.90% [kernel] [k] memcmp
7.71% [kernel] [k] ops_run_io
4.75% [kernel] [k] blk_rq_map_sg
4.41% [kernel] [k] raid6_avx5124_gen_syndrome
3.36% [kernel] [k] bio_advance
3.03% [kernel] [k] raid5_get_active_stripe
3.00% [kernel] [k] raid5_end_read_request
2.85% [kernel] [k] xor_avx_3
1.72% [kernel] [k] blk_update_request
[…]
```
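Just as a sanity check, to rule out that the md throttle rather than the processor caps the rate, the resync speed limits can be inspected and raised; the values in the comment below are only the usual defaults, not something I verified on this machine:

```
# System-wide resync throttle in KB/s (defaults are usually 1000 and 200000)
$ cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max

# Temporarily raise the ceiling to see whether the resync goes any faster
$ echo 600000 | sudo tee /proc/sys/dev/raid/speed_limit_max
```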

This is also much faster than the results in `/dev/shm` on the Dell PowerEdge T640 with two Intel Xeon Gold 6248 @ 2.50 GHz.

So, for the purpose of this thread, tests need to be done on real disks and not on loop-mounted files in memory.


Kind regards,

Paul