Re: [RFC PATCH 00/11] Rust null block driver

From: Andreas Hindborg (Samsung)
Date: Mon Jul 31 2023 - 11:10:19 EST



Hi,

"Andreas Hindborg (Samsung)" <nmi@xxxxxxxxxxxx> writes:

> Hi,
>
> Yexuan Yang <1182282462@xxxxxxxxxxx> writes:
>
>>> Over the 432 benchmark configurations, the relative performance of the Rust
>>> driver to the C driver (in terms of IOPS) is between 6.8 and -11.8 percent with
>>> an average of 0.2 percent better for reads. For writes the span is 16.8 to -4.5
>>> percent with an average of 0.9 percent better.
>>>
>>> For each measurement the drivers are loaded, a drive is configured with memory
>>> backing and a size of 4 GiB. C null_blk is configured to match the implemented
>>> modes of the Rust driver: `blocksize` is set to 4 KiB, `completion_nsec` to 0,
>>> `irqmode` to 0 (IRQ_NONE), `queue_mode` to 2 (MQ), `hw_queue_depth` to 256 and
>>> `memory_backed` to 1. For both the drivers, the queue scheduler is set to
>>> `none`. These measurements are made using 30 second runs of `fio` with the
>>> `PSYNC` IO engine with workers pinned to separate CPU cores. The measurements
>>> are done inside a virtual machine (qemu/kvm) on an Intel Alder Lake workstation
>>> (i5-12600).
>>
>> Hi All!
>> I have some problems about your benchmark test.
>> In Ubuntu 22.02, I compiled an RFL kernel with both C null_blk driver and Rust
>> null_blk_driver as modules. For the C null_blk driver, I used the `modprobe`
>> command to set relevant parameters, while for the Rust null_blk_driver, I simply
>> started it. I used the following two commands to start the drivers:
>>
>> ```bash
>> sudo modprobe null_blk \
>> queue_mode=2 irqmode=0 hw_queue_depth=256 \
>> memory_backed=1 bs=4096 completion_nsec=0 \
>> no_sched=1 gb=4;
>> sudo modprobe rnull_mod
>> ```
>>
>> After that, I tested their performance in `randread` with the fio command, specifying the first parameter as 4 and the second parameter as 1:
>>
>> ```bash
>> fio --filename=/dev/nullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_c_randread.log
>> fio --filename=/dev/rnullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_rust_randread.log
>> ```
>>
>> The test results showed a significant performance difference between the two
>> drivers, which was drastically different from the data you tested in the
>> community. Specifically, for `randread`, the C driver had a bandwidth of
>> 487MiB/s and IOPS of 124.7k, while the Rust driver had a bandwidth of 264MiB/s
>> and IOPS of 67.7k. However, for other I/O types, the performance of the C and
>> Rust drivers was more similar. Could you please provide more information about
>> the actual bandwidth and IOPS data from your tests, rather than just the
>> performance difference between the C and Rust drivers? Additionally, if you
>> could offer possible reasons for this abnormality, I would greatly appreciate
>> it!
>
> Thanks for trying out the code! I am not sure why you get these numbers.
> I am currently out of office, but I will rerun the benchmarks next week
> when I get back in. Maybe I can provide you with some scripts and my
> kernel configuration. Hopefully we can figure out the difference in our
> setups.
>

I just ran the benchmarks for that configuration again. My setup is an
Intel Alder Lake CPU (i5-12600) with Debian Bullseye user space running
a 6.2.12 host kernel with a kernel config based on the stock Debian
Bullseye (debian config + make olddefconfig). 32 GB of DDR5 4800MHz
memory. The benchmarks (and patched kernel) run inside QEMU KVM (from
Debian repos Debian 1:5.2+dfsg-11+deb11u2). I use the following qemu
command to boot (edited a bit for clarity):

"qemu-system-x86_64" "-nographic" "-enable-kvm" "-m" "16G" "-cpu" "host"
"-M" "q35" "-smp" "6" "-kernel" "vmlinux" "-append" "console=ttyS0
nokaslr rdinit=/sbin/init root=/dev/vda1 null_blk.nr_devices=0"

I use a Debian Bullseye user space inside the virtual machine.

I think I used the stock Bullseye kernel for the host when I did the
numbers for the cover letter, so that is different for this run.

Here are the results:

+---------+----------+------+---------------------+
| jobs/bs | workload | prep | 4 |
+---------+----------+------+---------------------+
| 4 | randread | 1 | 0.28 0.00 (1.7,0.0) |
| 4 | randread | 0 | 5.75 0.00 (6.8,0.0) |
+---------+----------+------+---------------------+

I used the following parameters for both tests:

Block size: 4k
Run time: 120s
Workload: randread
Jobs: 4

This is the `fio` command I used:

['fio', '--group_reporting', '--name=default', '--filename=/dev/<snip>', '--time_based=1', '--runtime=120', '--rw=randread', '--output=<snip>', '--output-format=json', '--numjobs=4', '--cpus_allowed=0-3', '--cpus_allowed_policy=split', '--ioengine=psync', '--bs=4k', '--direct=1', '--norandommap', '--random_generator=lfsr']

For the line in the table where `prep` is 1 I filled up the target
device with data before running the benchmark. I use the following `fio`
job to do that:

['fio', '--name=prep', '--rw=write', '--filename=/dev/None', '--bs=4k', '--direct=1']

Did you prepare the volumes with data in your experiment?

How to read the data in the table: For the benchmark where the drive is
prepared, Rust null_blk perform 0.28 percent better than C null_block.
For the case where the drive is empty Rust null_block performs 5.75
percent better than C null_block. I calculated the relative performance
as "(r_iops - c_iops) / c_iops * 100".

The IOPS info you request is present in the tables from the cover
letter. As mentioned in the cover letter, the numbers in parenthesis in
each cell is throughput in 1,000,000 IOPS for read and
write for the C null_blk driver. For the table above, C null_blk did
1,700,000 read IOPS in the case when the drive is prepared and 5,750,000
IOPS in the case where the drive is not prepared.

You can obtain bandwidth by multiplying IOPS by block size. I can also
provide the raw json output of `fio` if you want to have a look.

One last note is that I unload the module between each run and I do not
have both modules loaded at the same time. If you are exhausting you
memory pool this could maybe impact performance?

So things to check:
- Do you prepare the drive?
- Do you release drive memory after test (you can unload the module)
- Use the lfsr random number generator
- Do not use the randommap

I am not sure what hardware you are running on, but the throughput
numbers you obtain seem _really_ low.

Best regards
Andreas Hindborg