Re: [PATCH RFC 1/1] genirq: Make threaded handler use irq affinity for managed interrupt

From: John Garry
Date: Mon Dec 16 2019 - 13:51:04 EST


Hi Marc,



I'm just wondering if non-managed interrupts should be included in
the load balancing calculation? Couldn't irqbalance (if active) start
moving non-managed interrupts around anyway?
But they are, aren't they? See what we do in irq_set_affinity:
+        atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu));
+        atomic_dec(per_cpu_ptr(&cpu_lpi_count,
+                               its_dev->event_map.col_map[id]));
We don't try to "rebalance" anything based on that though, not that
I think we should.

Ah sorry, I meant whether they should not be included. In
its_irq_domain_activate(), we increment the per-cpu lpi count and also
use its_pick_target_cpu() to find the least loaded cpu. I am asking
whether we should just stick with the old policy for non-managed
interrupts here.
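
(For reference, a rough sketch of the bookkeeping being discussed here - a per-CPU LPI count that is transferred when an interrupt is retargeted, plus a helper that picks the least-loaded CPU. The names cpu_lpi_count and its_pick_target_cpu come from the thread, but the bodies below are illustrative only and are not taken from the actual RFC patch:)

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(atomic_t, cpu_lpi_count);

/* Pick the CPU in @mask currently servicing the fewest LPIs. */
static int its_pick_target_cpu(const struct cpumask *mask)
{
        unsigned int cpu, best_count = UINT_MAX;
        int best_cpu = cpumask_first(mask);

        for_each_cpu(cpu, mask) {
                unsigned int count = atomic_read(per_cpu_ptr(&cpu_lpi_count, cpu));

                if (count < best_count) {
                        best_count = count;
                        best_cpu = cpu;
                }
        }
        return best_cpu;
}

/* On retarget, move the accounting from the old CPU to the new one. */
static void its_move_lpi_count(int old_cpu, int new_cpu)
{
        atomic_inc(per_cpu_ptr(&cpu_lpi_count, new_cpu));
        atomic_dec(per_cpu_ptr(&cpu_lpi_count, old_cpu));
}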

After checking D05, I see a very significant performance hit for the SAS
controller - ~40% throughput reduction.

-ETOOMANYMOVINGPARTS.

Understood.


With this patch, the effective affinity now targets seemingly "random"
CPUs, as opposed to everything just using CPU0. This affects
performance.

And piling all interrupts on the same CPU does help?

Apparently... I need to check this more.


The difference is that when we use managed interrupts - like for NVMe
or the D06 SAS controller - the irq CPU affinity mask matches the CPUs
which enqueue the requests to the queue associated with the interrupt.
So there is an efficiency in enqueuing and dequeuing on the same CPU
group - all related to blk multi-queue. And this is not the case for
non-managed interrupts.
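
(As a side note, a rough sketch of how a PCI multi-queue driver typically ends up with such managed interrupts - not taken from the NVMe or hisi_sas drivers, and the my_setup_queue_irqs() name is made up:)

#include <linux/interrupt.h>
#include <linux/pci.h>

static int my_setup_queue_irqs(struct pci_dev *pdev, int nr_queues,
                               irq_handler_t handler, void *data)
{
        struct irq_affinity affd = { };
        int i, ret, nvec;

        /*
         * PCI_IRQ_AFFINITY makes these "managed" vectors: the core spreads
         * them evenly across the online CPUs, and each vector's affinity
         * mask becomes the per-queue CPU group that blk-mq then maps
         * submissions to.
         */
        nvec = pci_alloc_irq_vectors_affinity(pdev, 1, nr_queues,
                                              PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
                                              &affd);
        if (nvec < 0)
                return nvec;

        for (i = 0; i < nvec; i++) {
                /* Error unwinding omitted for brevity. */
                ret = request_irq(pci_irq_vector(pdev, i), handler, 0,
                                  "my-queue", data);
                if (ret)
                        return ret;
        }
        return nvec;
}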

So you enqueue requests from CPU0 only? It seems a bit odd...

No, but maybe I wasn't clear enough. I'll give an overview:

For the D06 SAS controller - which is a multi-queue PCI device - we use managed interrupts. The HW has 16 submission/completion queues, so for 96 cores we have an even spread of 6 CPUs assigned per queue; and this per-queue CPU mask is the interrupt affinity mask. So CPU0-5 would submit any IO on queue0, CPU6-11 on queue1, and so on. PCI NVMe is essentially the same.
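
(Purely as an illustration of that spread - the real mapping comes from the kernel's affinity spreading and blk-mq, not from this arithmetic:)

/* 96 CPUs over 16 queues gives 6 CPUs per queue mask. */
static unsigned int example_cpu_to_queue(unsigned int cpu)
{
        const unsigned int nr_cpus = 96, nr_queues = 16;
        const unsigned int cpus_per_queue = nr_cpus / nr_queues;  /* 6 */

        /* CPU0-5 -> queue 0, CPU6-11 -> queue 1, and so on. */
        return cpu / cpus_per_queue;
}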

These are the environments in which we're trying to promote performance.

Then for the D05 SAS controller - which is a multi-queue platform device (mbigen) - we don't use managed interrupts. We still submit IO from any CPU, but we choose the submission queue on a round-robin basis to promote some isolation, i.e. to reduce inter-queue lock contention, so the queue chosen has nothing to do with the CPU.
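
(For clarity, a sketch of that round-robin selection - the names are illustrative and not taken from the hisi_sas driver:)

#include <linux/atomic.h>

static atomic_t next_queue;

/* Round-robin: nothing ties the chosen queue to the submitting CPU. */
static unsigned int pick_submit_queue(unsigned int nr_queues)
{
        return (unsigned int)atomic_inc_return(&next_queue) % nr_queues;
}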

And with your change we may submit on CPU4 but service the interrupt on CPU30, as an example, while previously we would always service on CPU0. The old way still isn't ideal, I'll admit.

For this env, we would just like to maintain the same performance. And it's here that we see the performance drop.


Please give this new patch a shot on your system (my D05 doesn't have
any managed devices):

We could consider supporting platform MSI managed interrupts, but I
doubt the value.
It shouldn't be hard to do, and most of the existing code could be
moved to the generic level. As for the value, I'm not convinced
either. For example, D05 uses the MBIGEN as an intermediate interrupt
controller, so the MSIs are from the PoV of the MBIGEN, and not the
SAS device attached to it. Not the best design...

JFYI, I did raise this topic before, but that's as far as I got:

https://marc.info/?l=linux-block&m=150722088314310&w=2

Yes. And that's probably not very hard, but the problem in your case is
that the D05 HW is not using MSIs...

Right

You'd have to provide an abstraction
for wired interrupts (please don't).

You'd be better off directly setting the affinity of the interrupts from
the driver, but I somehow can't believe that you're only submitting requests
from the same CPU,

Maybe...

always. There must be something I'm missing.
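
(A minimal sketch of what "directly setting the affinity of the interrupts from the driver" could look like for the non-managed D05 case - the queue/irq layout here is hypothetical:)

#include <linux/cpumask.h>
#include <linux/interrupt.h>

static void spread_queue_irqs(const int *irqs, int nr_queues)
{
        int i;

        for (i = 0; i < nr_queues; i++) {
                /* A real driver would pick CPUs from cpu_online_mask. */
                const struct cpumask *mask = cpumask_of(i % nr_cpu_ids);

                /*
                 * irq_set_affinity_hint() records the hint and (at the time
                 * of this thread) also applies the affinity, so each queue
                 * interrupt lands on a distinct CPU rather than all on CPU0.
                 */
                irq_set_affinity_hint(irqs[i], mask);
        }
}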


Thanks,
John