Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI

From: Jens Axboe
Date: Mon Feb 12 2024 - 13:37:00 EST


On 2/12/24 11:27 AM, Jacob Pan wrote:
> Hi Jens,
>
> On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:
>
>> On 2/9/24 10:43 AM, Jacob Pan wrote:
>>> Hi Jens,
>>>
>>> On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>
>>>> Hi Jacob,
>>>>
>>>> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
>>>> IOPS bound on the drive, and using 1 thread per drive for IO. Random
>>>> reads, using io_uring.
>>>>
>>>> For reference, using polled IO:
>>>>
>>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
>>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
>>>> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
>>>>
>>>> which is about 5.1M/drive, which is what they can deliver.
>>>>
>>>> Before your patches, I see:
>>>>
>>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
>>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
>>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
>>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
>>>>
>>>> at 2.82M ints/sec. With the patches, I see:
>>>>
>>>> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
>>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
>>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
>>>>
>>>> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
>>>> quite to the extent I expected. Booted with 'posted_msi' and I do see
>>>> posted interrupts increasing in the PMN in /proc/interrupts.
>>>>
>>> The ints/sec reduction is not as high as I expected either, especially
>>> at this high rate. Which means not enough coalescing going on to get the
>>> performance benefits.
>>
>> Right, it means that we're getting pretty decent commands-per-int
>> coalescing already. I added another drive and repeated, here's that one:
>>
>> IOPS w/polled: 25.7M IOPS
>>
>> Stock kernel:
>>
>> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
>> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
>> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
>>
>> at ~3.7M ints/sec, or about 5.8 IOPS / int on average.
>>
>> Patched kernel:
>>
>> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
>> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
>> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
>>
>> at the same interrupt rate. So not a reduction, but slightly higher
>> perf. Maybe we're reaping more commands on average per interrupt.
>>
>> Anyway, not a lot of interesting data there, just figured I'd re-run it
>> with the added drive.
>>
>>> The opportunity of IRQ coalescing is also dependent on how long the
>>> driver's hardirq handler executes. In the posted MSI demux loop, it does
>>> not wait for more MSIs to come before exiting the pending IRQ polling
>>> loop. So if the hardirq handler finishes very quickly, it may not
>>> coalesce as much. Perhaps, we need to find more "useful" work to do to
>>> maximize the window for coalescing.
>>>
>>> I am not familiar with the optane driver; I need to look into how its
>>> hardirq handler works. I have only tested NVMe gen5 in terms of storage
>>> IO, where I saw a 30-50% ints/sec reduction at an even lower IRQ rate
>>> (200k/sec).
>>
>> It's just an nvme device, so it's the nvme driver. The IRQ side is very
>> cheap - for as long as there are CQEs in the completion ring, it'll reap
>> them and complete them. That does mean that if we get an IRQ and there's
>> more than one entry to complete, we will do all of them. No IRQ
>> coalescing is configured (nvme kind of sucks for that...), but optane
>> media is much faster than flash, so that may be a difference.
>>
> Yeah, I also checked the driver code; it seems to just wake up the
> threaded handler.

That only happens if you're using threaded interrupts, which is not the
default as it's much slower. What happens for the normal case is that we
init a batch, and then poll the CQ ring for completions, and then add
them to the completion batch. Once no more are found, we complete the
batch.
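
To make that flow concrete, here is a toy model of the reap loop. It is a
Python sketch, not the nvme driver code: the deque stands in for the
completion ring, and reap_batch is a made-up name for illustration only.

```python
from collections import deque


def reap_batch(cq):
    """Drain every completion currently queued into a single batch."""
    batch = []                      # "init a batch"
    while cq:                       # poll the CQ ring for completions
        batch.append(cq.popleft())  # add each one found to the batch
    return batch                    # caller completes the whole batch at once


# Toy stand-in for a completion ring: three CQEs arrived before the IRQ ran.
cq = deque(["cqe0", "cqe1", "cqe2"])
completed = reap_batch(cq)
print(len(completed), "completions reaped by one interrupt")
```

The point of the model is that one interrupt completes however many
entries have landed by the time the handler runs, so a fast device gives
you some commands-per-int coalescing even with nothing configured.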

You're not using threaded interrupts, are you?

> For the record, here is my set up and performance data for 4 Samsung disks.
> IOPS increased from 1.6M per disk to 2.1M. One difference I noticed is that
> the IRQ rate went up instead of down with this patch on my setup.
> e.g. BEFORE: 185545/sec/vector
> AFTER: 220128

I'm surprised at the rates being that low, and if so, why would posted
MSI make a difference? Usually what I've seen for IRQ being slower than
poll is that interrupt delivery is unreasonably slow on that architecture
or machine. But ~200k/sec isn't that high at all.

> [global]
> bs=4k
> direct=1
> norandommap
> ioengine=libaio
> randrepeat=0
> readwrite=randread
> group_reporting
> time_based
> iodepth=64
> exitall
> random_generator=tausworthe64
> runtime=30
> ramp_time=3
> numjobs=8
> group_reporting=1
>
> #cpus_allowed_policy=shared
> cpus_allowed_policy=split
> [disk_nvme6n1_thread_1]
> filename=/dev/nvme6n1
> cpus_allowed=0-7
> [disk_nvme6n1_thread_1]
> filename=/dev/nvme5n1
> cpus_allowed=8-15
> [disk_nvme5n1_thread_2]
> filename=/dev/nvme4n1
> cpus_allowed=16-23
> [disk_nvme5n1_thread_3]
> filename=/dev/nvme3n1
> cpus_allowed=24-31

For better performance, I'd change that ioengine=libaio to:

ioengine=io_uring
fixedbufs=1
registerfiles=1

Particularly fixedbufs makes a big difference, as a big cycle consumer
is mapping/unmapping pages from the application space into the kernel
for O_DIRECT. With fixedbufs=1, this is done once and we just reuse the
buffers. At least for my runs, this is ~15% of the systime for doing IO.
It also removes the page referencing, which isn't as big a consumer, but
still noticeable.

Anyway, side quest, but I think you'll find this considerably reduces
overhead / improves performance. Also makes it so that you can compare
with polled IO on nvme, which aio can't do. You'd just add hipri=1 as an
option for that (with a side note that you need to configure nvme poll
queues, see the poll_queues parameter).
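
A quick way to check whether poll queues are set up before trying hipri=1
is to read the module parameter back. This is a minimal sketch: the sysfs
path is my assumption of where the nvme module exposes poll_queues, and
poll_queues_configured is a made-up helper name.

```python
from pathlib import Path

# Assumed sysfs location of the nvme module parameter; adjust if your
# kernel exposes it elsewhere.
POLL_QUEUES = Path("/sys/module/nvme/parameters/poll_queues")


def poll_queues_configured(path=POLL_QUEUES):
    """Return the configured poll queue count, or 0 if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return 0


if __name__ == "__main__":
    n = poll_queues_configured()
    if n > 0:
        print(f"nvme poll_queues={n}, hipri=1 should have polled queues to use")
    else:
        print("nvme poll_queues is 0 or unreadable, set it before using hipri=1")
```

If it reports 0, polled IO has no dedicated queues to run on, and the
comparison with IRQ driven IO won't tell you much.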

On my box, all the NVMe devices seem to be on node1, not node0 which
looks like it's the CPUs you are using. Might be worth checking and
adjusting your CPU domains for each drive? I also tend to get better
performance by taking the CPU scheduler out of the picture, e.g. just pin
each job to a single CPU rather than many. It's just one process/thread
anyway, so
really no point in giving it options here. It'll help reduce variability
too, which can be a pain in the butt to deal with.
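
For checking which node each drive sits on, something like the sketch
below works. It assumes the usual sysfs layout (numa_node under the
controller's PCI device, cpulist under the node directory) and that
controller nvmeN backs namespace nvmeNn1, which may not hold with
multipath enabled. The helper names are made up.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def local_cpus(ctrl, sysfs=Path("/sys")):
    """Return (node, cpus) for an NVMe controller name such as 'nvme6'."""
    sysfs = Path(sysfs)
    node = int((sysfs / "class/nvme" / ctrl / "device/numa_node").read_text())
    if node < 0:  # -1 means the kernel has no NUMA info for the device
        return node, []
    cpulist = sysfs / "devices/system/node" / f"node{node}" / "cpulist"
    return node, parse_cpulist(cpulist.read_text())


if __name__ == "__main__":
    for ctrl in ("nvme3", "nvme4", "nvme5", "nvme6"):
        try:
            node, cpus = local_cpus(ctrl)
        except OSError as err:
            print(f"{ctrl}: cannot read sysfs ({err})")
            continue
        print(f"{ctrl}: node{node}, local CPUs {cpus}")
```

Then pick one distinct CPU from the local list for each job's
cpus_allowed= line instead of handing it a range of eight.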

--
Jens Axboe