Re: [PATCH v12 3/7] genirq: Add mechanism to multiplex a single HW IPI

From: Marc Zyngier
Date: Mon Nov 28 2022 - 06:30:34 EST


On Mon, 28 Nov 2022 11:13:30 +0000,
Anup Patel <apatel@xxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Nov 28, 2022 at 4:04 PM Marc Zyngier <maz@xxxxxxxxxx> wrote:
> >
> > On Sat, 26 Nov 2022 17:34:49 +0000,
> > Anup Patel <apatel@xxxxxxxxxxxxxxxx> wrote:
> > >
> > > +static void ipi_mux_send_mask(struct irq_data *d, const struct cpumask *mask)
> > > +{
> > > +	u32 ibit = BIT(irqd_to_hwirq(d));
> > > +	struct ipi_mux_cpu *icpu = this_cpu_ptr(ipi_mux_pcpu);
> > > +	struct cpumask *send_mask = &icpu->send_mask;
> > > +	unsigned long flags;
> > > +	int cpu;
> > > +
> > > +	/*
> > > +	 * We use send_mask as a per-CPU variable so disable local
> > > +	 * interrupts to avoid being preempted.
> > > +	 */
> > > +	local_irq_save(flags);
> >
> > The correct way to avoid preemption is to use preempt_disable(), which
> > is a lot cheaper than disabling interrupt on most architectures.
>
> Okay, I will update.
>
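To be clear, all I mean is the usual pattern for touching a per-CPU
variable, i.e. something like this (a sketch only - and see below, I
question whether send_mask should exist at all):

	struct ipi_mux_cpu *icpu;

	preempt_disable();
	icpu = this_cpu_ptr(ipi_mux_pcpu);
	/* ... build icpu->send_mask and issue the send ... */
	preempt_enable();
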
> >
> > > +
> > > +	cpumask_clear(send_mask);
> >
> > This thing is likely to be unnecessarily expensive on very large
> > systems, as it is proportional to the number of CPUs.
> >
> > > +
> > > +	for_each_cpu(cpu, mask) {
> > > +		icpu = per_cpu_ptr(ipi_mux_pcpu, cpu);
> > > +		atomic_or(ibit, &icpu->bits);
> >
> > The original code had an atomic_fetch_or_release() to allow eliding
> > the IPI if the target interrupt was already pending. Why is that code
> > gone? This is a pretty cheap and efficient optimisation.
>
> That optimization is causing RCU stalls on the QEMU RISC-V virt
> machine with a large number of CPUs.

Then there is a bug somewhere, either in the implementation of the
atomic operations or in QEMU. Or maybe even in the original code
(though this looks unlikely given how heavily this is used on actual
HW - I'm typing this email from one of these machines, and I'd be
pretty annoyed if I was missing IPIs).

In any case, please don't paper over this.
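
For the record, the sort of elision I'm after looks roughly like this
(a sketch of the idea, not necessarily the exact code that was
dropped):

	u32 pending;

	pending = atomic_fetch_or_release(ibit, &icpu->bits);

	/*
	 * Order the bit set above against the enable check below
	 * (pairs with ipi_mux_unmask()).
	 */
	smp_mb__after_atomic();

	/* Only mark the CPU if the vIPI wasn't already pending */
	if (!(pending & ibit) && (atomic_read(&icpu->enable) & ibit))
		cpumask_set_cpu(cpu, send_mask);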

>
> >
> > > +
> > > +		/*
> > > +		 * The atomic_or() above must complete before
> > > +		 * the atomic_read() below to avoid racing with
> > > +		 * ipi_mux_unmask().
> > > +		 */
> > > +		smp_mb__after_atomic();
> > > +
> > > +		if (atomic_read(&icpu->enable) & ibit)
> > > +			cpumask_set_cpu(cpu, send_mask);
> > > +	}
> > > +
> > > +	/* Trigger the parent IPI */
> > > +	ipi_mux_send(send_mask);
> >
> > IPIs are very rarely made pending on more than a single CPU at a
> > time. The overwhelming majority of them target a single CPU, so
> > accumulating bits to avoid issuing two or more "send" actions only
> > penalises the common case.
> >
> > My conclusion is that this "send_mask" can probably be removed,
> > together with the preemption fiddling.
>
> So, we should call ipi_mux_send() for one target CPU at a time?

I think so, as it matches my measurements from a few years ago. It
also simplifies things significantly, leading to better performance
for the common case. Add some instrumentation and see whether this is
still the case though.
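
Concretely, with the send_mask and the interrupt masking gone (and the
pending elision restored), I'd expect the send path to look roughly
like this - a sketch only, assuming ipi_mux_send() keeps taking a
cpumask:

static void ipi_mux_send_mask(struct irq_data *d, const struct cpumask *mask)
{
	u32 ibit = BIT(irqd_to_hwirq(d));
	struct ipi_mux_cpu *icpu;
	u32 pending;
	int cpu;

	for_each_cpu(cpu, mask) {
		icpu = per_cpu_ptr(ipi_mux_pcpu, cpu);

		pending = atomic_fetch_or_release(ibit, &icpu->bits);

		/*
		 * The atomic above must complete before the
		 * atomic_read() below to avoid racing with
		 * ipi_mux_unmask().
		 */
		smp_mb__after_atomic();

		/* Kick the CPU only if the vIPI is enabled and wasn't pending */
		if (!(pending & ibit) && (atomic_read(&icpu->enable) & ibit))
			ipi_mux_send(cpumask_of(cpu));
	}
}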

>
> >
> > > +
> > > +	local_irq_restore(flags);
> > > +}
> > > +
> > > +static const struct irq_chip ipi_mux_chip = {
> > > +	.name = "IPI Mux",
> > > +	.irq_mask = ipi_mux_mask,
> > > +	.irq_unmask = ipi_mux_unmask,
> > > +	.ipi_send_mask = ipi_mux_send_mask,
> > > +};
> >
> > OK, you have now dropped the superfluous pre/post handlers. But the
> > need still exists. Case in point, the aic_handle_ipi() prologue and
> > epilogue to the interrupt handling. I suggested last time that the
> > driver could provide the actual struct irq_chip so that it can
> > supply the callbacks it requires.
>
> aic_handle_ipi() can simply call ipi_mux_process() between the
> prologue and epilogue.

Hmm. OK. That's not what I had in mind, but fair enough.
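If that's the plan, I read it as something along these lines on the
AIC side (a sketch only - the ack/unmask helpers are placeholder names
for whatever HW-specific prologue/epilogue the driver actually needs):

static void aic_handle_ipi(struct pt_regs *regs)
{
	/* prologue: ack the HW IPI (placeholder helper) */
	aic_ipi_ack_hw();

	/* demux and handle all pending muxed IPIs */
	ipi_mux_process();

	/* epilogue: re-arm the HW IPI (placeholder helper) */
	aic_ipi_unmask_hw();
}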

M.

--
Without deviation from the norm, progress is not possible.