Re: [PATCH v12 3/7] genirq: Add mechanism to multiplex a single HW IPI

From: Anup Patel
Date: Tue Nov 29 2022 - 09:18:24 EST


On Mon, Nov 28, 2022 at 5:00 PM Marc Zyngier <maz@xxxxxxxxxx> wrote:
>
> On Mon, 28 Nov 2022 11:13:30 +0000,
> Anup Patel <apatel@xxxxxxxxxxxxxxxx> wrote:
> >
> > On Mon, Nov 28, 2022 at 4:04 PM Marc Zyngier <maz@xxxxxxxxxx> wrote:
> > >
> > > On Sat, 26 Nov 2022 17:34:49 +0000,
> > > Anup Patel <apatel@xxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > +static void ipi_mux_send_mask(struct irq_data *d, const struct cpumask *mask)
> > > > +{
> > > > + u32 ibit = BIT(irqd_to_hwirq(d));
> > > > + struct ipi_mux_cpu *icpu = this_cpu_ptr(ipi_mux_pcpu);
> > > > + struct cpumask *send_mask = &icpu->send_mask;
> > > > + unsigned long flags;
> > > > + int cpu;
> > > > +
> > > > + /*
> > > > + * We use send_mask as a per-CPU variable so disable local
> > > > + * interrupts to avoid being preempted.
> > > > + */
> > > > + local_irq_save(flags);
> > >
> > > The correct way to avoid preemption is to use preempt_disable(), which
> > > is a lot cheaper than disabling interrupt on most architectures.
> >
> > Okay, I will update.
> >
> > >
> > > > +
> > > > + cpumask_clear(send_mask);
> > >
> > > This thing is likely to be unnecessarily expensive on very large
> > > systems, as it is proportional to the number of CPUs.
> > >
> > > > +
> > > > + for_each_cpu(cpu, mask) {
> > > > + icpu = per_cpu_ptr(ipi_mux_pcpu, cpu);
> > > > + atomic_or(ibit, &icpu->bits);
> > >
> > > The original code had an atomic_fetch_or_release() to allow eliding
> > > the IPI if the target interrupt was already pending. Why is that code
> > > gone? This is a pretty cheap and efficient optimisation.
> >
> > That optimization is causing RCU stalls on QEMU RISC-V virt
> > machine with large number of CPUs.
>
> Then there is a bug somewhere, either in the implementation of the
> atomic operations or in QEMU. Or maybe even in the original code
> (though this looks unlikely given how heavily this is used on actual
> HW - I'm typing this email from one of these machines, and I'd be
> pretty annoyed if I was missing IPIs).
>
> In any case, please don't paper over this.

I was trying to defer the optimization to a later stage until this
issue was fixed for RISC-V.

Anyways, I found the root cause. This turned out to be missing
broadcast timer initialization in time_init() for RISC-V. Removing
the optimization over here was simply hiding the issue.

I will bring back the optimization in the next patch revision.

>
> >
> > >
> > > > +
> > > > + /*
> > > > + * The atomic_or() above must complete before
> > > > + * the atomic_read() below to avoid racing with
> > > > + * ipi_mux_unmask().
> > > > + */
> > > > + smp_mb__after_atomic();
> > > > +
> > > > + if (atomic_read(&icpu->enable) & ibit)
> > > > + cpumask_set_cpu(cpu, send_mask);
> > > > + }
> > > > +
> > > > + /* Trigger the parent IPI */
> > > > + ipi_mux_send(send_mask);
> > >
> > > IPIs are very rarely made pending on more than a single CPU at a
> > > time. The overwhelming majority of them are targeting a single CPU. So
> > > accumulating bits to avoid doing two or more "send" actions only
> > > penalises the generic case.
> > >
> > > My conclusion is that this "send_mask" can probably be removed,
> > > together with the preemption fiddling.
> >
> > So, we should call ipi_mux_send() for one target CPU at a time ?
>
> I think so, as it matches my measurements from a few years ago. It
> also simplifies things significantly, leading to better performance
> for the common case. Add some instrumentation and see whether this is
> still the case though.

I did not see any difference in the hackbench running on QEMU RISC-V.
I will simplify ipi_mux_send() like you suggested.

>
> >
> > >
> > > > +
> > > > + local_irq_restore(flags);
> > > > +}
> > > > +
> > > > +static const struct irq_chip ipi_mux_chip = {
> > > > + .name = "IPI Mux",
> > > > + .irq_mask = ipi_mux_mask,
> > > > + .irq_unmask = ipi_mux_unmask,
> > > > + .ipi_send_mask = ipi_mux_send_mask,
> > > > +};
> > >
> > > OK, you have now dropped the superfluous pre/post handlers. But the
> > > need still exists. Case in point, the aic_handle_ipi() prologue and
> > > epilogue to the interrupt handling. I have suggested last time that
> > > the driver could provide the actual struct irq_chip in order to
> > > provide the callbacks it requires.
> >
> > The aic_handle_ipi() can simply call ipi_mux_process() between
> > the prologue and epilogue.
>
> Hmm. OK. That's not what I had in mind, but fair enough.
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.

Regards,
Anup