Re: Virtio-scsi multiqueue irq affinity

From: Peter Xu
Date: Mon Mar 25 2019 - 01:02:28 EST


On Sat, Mar 23, 2019 at 06:15:59PM +0100, Thomas Gleixner wrote:
> Peter,

Hi, Thomas,

>
> On Mon, 18 Mar 2019, Peter Xu wrote:
> > I noticed that starting from commit 0d9f0a52c8b9 ("virtio_scsi: use
> > virtio IRQ affinity", 2017-02-27) the virtio scsi driver is using a
> > new way (via irq_create_affinity_masks()) to automatically initialize
> > IRQ affinities for the multi-queues, which is different comparing to
> > all the other virtio devices (like virtio-net, which still uses
> > virtqueue_set_affinity(), which is actually, irq_set_affinity_hint()).
> >
> > Firstly, it will definitely broke some of the userspace programs with
> > that when the scripts wanted to do the bindings explicitly like before
> > and they could simply fail with -EIO now every time when echoing to
> > /proc/irq/N/smp_affinity of any of the multi-queues (see
> > write_irq_affinity()).
>
> Did it break anything? I did not see a report so far. Assumptions about
> potential breakage are not really useful.

It broke some automation scripts e.g. where they tried to bind CPUs to
IRQs before staring IO but these scripts failed early during setup
when trying to echo into the affinity procfs file. Actually I started
to look into this because of such script breakage reported by QEs.
Iinitially it was thought as a kernel bug but later we noticed that
it's a change in policy.

>
> > Is there any specific reason to do it with the new way? Since AFAIU
> > we should still allow the system admins to decide what to do for such
> > configurations, .e.g., what if we only want to provision half of the
> > CPU resources to handle IRQs for a specific virtio-scsi controller?
> > We won't be able to achieve that with current policy. Or, could this
> > be a question for the IRQ system (irq_create_affinity_masks()) in
> > general? Any special considerations behind the big picture?
>
> That has nothing to do with the irq subsystem. That merily provides the
> mechanisms.
>
> The reason behind this is that multi-queue devices set up queues per cpu or
> if not enough queues are available queues per cpu groups. So it does not
> make sense to move the interrupt away from the CPU or the CPU group.
>
> Aside of that in the CPU hotunplug case, interrupts used to be moved to the
> online CPUs which resulted in problems for e.g. hibernation because on
> large systems moving all interrupts to the boot CPU does not work due to
> vector space exhaustion. Also CPU hotunplug is used for power management
> purposes and there it does not make sense either to have the per cpu queues
> of the offlined CPUs moved to the still online CPUs which then end up with
> several queues.
>
> The new way to deal with this is to strictly bind per CPU (per CPU group)
> queues. If the CPU or the last CPU in the group goes offline the following
> happens:
>
> 1) The queue is disabled, i.e. no new requests can be queued
>
> 2) Wait for the outstanding requests to complete
>
> 3) Shut down the interrupt
>
> This avoids having multiple queues moved to the still online CPUs and also
> prevents vector space exhaustion because the shut down interrupt does not
> have to be migrated.
>
> When the CPU (or the first in the group) comes online again:
>
> 1) Reenable the interrupt
>
> 2) Reenable the queue
>
> Hope that helps.

Thanks for explaining everything! It helps a lot, and yes it makes
perfect sense to me.

If no one reported any issue I think either the scripts are not
checking the return code so they might fail silently but it might not
matter much (e.g., if the only thing that a script wants to do is to
spread the CPUs upon the IRQs then the script can simply cancel the
setup procedure of this, and even failing of those echos won't affect
much too), or they're just simpled fixed up later on. Now the only
thing I am unsure about is whether there could be scenarios that we
may not want to use the default policy to spread the cores.

One thing I can think of is the real-time scenario where "isolcpus="
is provided, then logically we should not allow any isolated CPUs to
be bound to any of the multi-queue IRQs. Though Ming Lei and I had a
discussion offlist before and Ming explained to me that as long as the
isolated CPUs do not generate any IO then there will be no IRQ on
those isolated (real-time) CPUs at all. Can we guarantee that? Now
I'm thinking whether the ideal way should be that, when multi-queue is
used with "isolcpus=" then we only spread the queues upon housekeeping
CPUs somehow? Because AFAIU general real-time applications should not
use block IOs at all (and if not those hardware multi-queues running
upon isolated CPUs would probably be a pure waste too because they
could be always idle on the isolated cores where the real-time
application runs).

CCing Ming too.

Thanks,

--
Peter Xu