Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI

From: Jens Axboe
Date: Fri Feb 09 2024 - 15:31:35 EST


On 2/9/24 10:43 AM, Jacob Pan wrote:
> Hi Jens,
>
> On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@xxxxxxxxx> wrote:
>
>> Hi Jacob,
>>
>> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
>> IOPS bound on the drive, and using 1 thread per drive for IO. Random
>> reads, using io_uring.
>>
>> For reference, using polled IO:
>>
>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
>> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
>>
>> which is about 5.1M/drive, which is what they can deliver.
>>
>> Before your patches, I see:
>>
>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
>>
>> at 2.82M ints/sec. With the patches, I see:
>>
>> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
>>
>> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
>> quite to the extent I expected. Booted with 'posted_msi' and I do see
>> posted interrupts increasing in the PMN in /proc/interrupts.
>>
> The ints/sec reduction is not as high as I expected either, especially
> at this high rate, which means there is not enough coalescing going on
> to get the performance benefits.

Right, it means that we're getting pretty decent commands-per-int
coalescing already. I added another drive and repeated, here's that one:

IOPS w/polled: 25.7M IOPS

Stock kernel:

IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32

at ~3.7M ints/sec, or about 5.8 IOs / int on average.

Patched kernel:

IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32

at the same interrupt rate. So not a reduction, but slightly higher
perf. Maybe we're reaping more commands on average per interrupt.
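
For reference, the completions-per-interrupt figures can be worked out
from the numbers quoted in this thread. The snippet below is only a
back-of-the-envelope sketch: the function and dictionary names are made
up for illustration, and the patched 5-drive interrupt rate is assumed
to be the same ~3.7M ints/sec as stock, as stated above.

```python
# Rough IOs-per-interrupt arithmetic from the figures in this thread.
# Names are hypothetical; rates are in millions per second.

def ios_per_interrupt(iops_millions, ints_millions):
    """Average number of completed IOs per interrupt."""
    return iops_millions / ints_millions

runs = {
    "4 drives, stock":   (14.37, 2.82),
    "4 drives, patched": (14.90, 2.34),
    "5 drives, stock":   (21.41, 3.7),
    # Assumption: "same interrupt rate" taken as ~3.7M ints/sec.
    "5 drives, patched": (21.89, 3.7),
}

ratios = {name: ios_per_interrupt(iops, ints)
          for name, (iops, ints) in runs.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} IOs/int")
```

Under those assumptions the 4-drive run moves from roughly 5.1 to
roughly 6.4 IOs per interrupt, while the 5-drive run barely changes.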

Anyway, not a lot of interesting data there, just figured I'd re-run it
with the added drive.

> The opportunity of IRQ coalescing is also dependent on how long the
> driver's hardirq handler executes. In the posted MSI demux loop, it does
> not wait for more MSIs to come before exiting the pending IRQ polling
> loop. So if the hardirq handler finishes very quickly, it may not coalesce
> as much. Perhaps, we need to find more "useful" work to do to maximize the
> window for coalescing.
>
> I am not familiar with the optane driver, so I need to look into how
> its hardirq handler works. I have only tested NVMe gen5 in terms of
> storage IO; I saw 30-50% ints/sec reduction at an even lower IRQ rate
> (200k/sec).
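
To make the point about handler duration concrete, here is a toy model
of the coalescing window described in the quoted text. It is not the
code from the patch series: the function name, the time units, and the
assumption that a notification is only raised while the CPU is outside
the demux loop are all simplifications made up for this sketch.

```python
# Toy model: how hardirq handler cost affects posted MSI coalescing.
# Hypothetical names and units; not taken from the actual patches.

def count_notifications(arrivals, cost):
    """Count notifications needed to service MSIs arriving at the given
    (sorted) times, when handling one MSI takes `cost` time units.

    An MSI that arrives while the demux loop is still running is picked
    up by that loop and does not raise a new notification.
    """
    notifications = 0
    i = 0
    now = 0
    while i < len(arrivals):
        # CPU is outside the loop: the next MSI raises a notification.
        now = max(now, arrivals[i])
        notifications += 1
        # Demux loop: handle everything pending, then re-check once per
        # handled MSI. The loop exits as soon as nothing is pending.
        while i < len(arrivals) and arrivals[i] <= now:
            i += 1
            now += cost
    return notifications

arrivals = [t * 10 for t in range(100)]  # one MSI every 10 time units
fast = count_notifications(arrivals, cost=1)
slow = count_notifications(arrivals, cost=10)
print(fast, slow)
```

In this model a handler that finishes well before the next MSI arrives
gets no coalescing at all, while one that runs as long as the
inter-arrival gap folds every later MSI into the first notification.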

It's just an nvme device, so it's the nvme driver. The IRQ side is very
cheap - for as long as there are CQEs in the completion ring, it'll reap
them and complete them. That does mean that if we get an IRQ and there's
more than one entry to complete, we will do all of them. No IRQ
coalescing is configured (nvme kind of sucks for that...), but optane
media is much faster than flash, so that may be a difference.
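
As a rough sketch of what "reap for as long as there are CQEs" means,
here is a toy completion ring using an NVMe-style phase bit. This is
not the nvme driver code; the class and method names are invented for
illustration, and the model assumes the device never posts more
entries than the ring has free slots.

```python
# Toy model of reaping a completion queue with a phase bit.
# Hypothetical names; a simplification, not the kernel's nvme driver.

class CompletionQueue:
    def __init__(self, depth):
        self.depth = depth
        # Each slot holds (command_id, phase); slots start at phase 0.
        self.entries = [(None, 0)] * depth
        self.head = 0        # next slot the host will look at
        self.phase = 1       # phase value the host expects for new CQEs
        self.tail = 0        # next slot the device will write
        self.dev_phase = 1   # phase value the device writes

    def device_post(self, command_id):
        """Device side: post one completion entry."""
        self.entries[self.tail] = (command_id, self.dev_phase)
        self.tail += 1
        if self.tail == self.depth:
            self.tail = 0
            self.dev_phase ^= 1  # phase flips on every wrap

    def reap(self):
        """IRQ side: complete every entry that is currently valid."""
        done = []
        while self.entries[self.head][1] == self.phase:
            done.append(self.entries[self.head][0])
            self.head += 1
            if self.head == self.depth:
                self.head = 0
                self.phase ^= 1
        return done

cq = CompletionQueue(depth=4)
for cid in (1, 2, 3):
    cq.device_post(cid)
first = cq.reap()    # one "interrupt" reaps all three entries
cq.device_post(4)
cq.device_post(5)    # this post wraps around the ring
second = cq.reap()
print(first, second)
```

A single call to reap() stands in for one interrupt here, so any
entries that pile up before the handler runs are completed together.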

--
Jens Axboe