Re: [PATCH v3] random: use expired per-cpu timer rather than wq for mixing fast pool

From: Sebastian Andrzej Siewior
Date: Thu Sep 29 2022 - 10:19:04 EST

In reply to: Jason A. Donenfeld: "Re: [PATCH v3] random: use expired per-cpu timer rather than wq for mixing fast pool"

On 2022-09-28 18:15:46 [+0200], Jason A. Donenfeld wrote:
> Hi Sebastian,
Hi Jason,

> On Wed, Sep 28, 2022 at 02:06:45PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2022-09-27 12:42:33 [+0200], Jason A. Donenfeld wrote:
> > …
> > > This is an ordinary pattern done all over the kernel. However, Sherry
> > > noticed a 10% performance regression in qperf TCP over a 40gbps
> > > InfiniBand card. Quoting her message:
> > >
> > > > MT27500 Family [ConnectX-3] cards:
> > > > Infiniband device 'mlx4_0' port 1 status:
> > …
> >
> > While looking at the mlx4 driver, it looks like they don't use any NAPI
> > handling in their interrupt handler, which _might_ mean that they
> > handle more than 1k interrupts a second. I'm still curious to get that
> > ACKed from Sherry's side.
>
> Are you sure about that? So far as I can tell drivers/net/ethernet/
> mellanox/mlx4 has plenty of napi_schedule/napi_enable and such. Or are
> you looking at the infiniband driver instead? I don't really know how
> these interact.

I've been looking at mlx4_msi_x_interrupt() and it appears that it
iterates over a ring buffer. I guess that mlx4_cq_completion() will
invoke mlx4_en_rx_irq() which schedules NAPI.

> But yea, if we've got a driver not using NAPI at 40gbps that's obviously
> going to be a problem.

So I'm wondering if we get 1 worker a second which kills the performance
or if we get more than 1k interrupts in less than a second resulting in
more wakeups within a second.
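
A minimal user-space sketch of those two cases, not the kernel code. It
assumes the deferred mix is triggered once 1024 interrupts have been
counted or one second has passed since the last mix, whichever comes
first. The names fast_pool_sketch, should_mix(), wakeups_per_second(),
HZ_SKETCH and IRQ_THRESHOLD are made up for illustration, and interrupts
are assumed to arrive evenly spaced:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants for illustration only. */
#define HZ_SKETCH     1000u  /* assumed timer ticks per second */
#define IRQ_THRESHOLD 1024u  /* the "1k interrupts" from the discussion */

struct fast_pool_sketch {
	uint32_t count;  /* interrupts seen since the last mix */
	uint64_t last;   /* tick at which the last mix happened */
};

/* "||" policy: mix as soon as either condition is met. */
static bool should_mix(const struct fast_pool_sketch *p, uint64_t now)
{
	return p->count >= IRQ_THRESHOLD || now - p->last >= HZ_SKETCH;
}

/* Count deferred mixes during one second of evenly spaced interrupts. */
static unsigned wakeups_per_second(unsigned irqs_per_second)
{
	struct fast_pool_sketch p = { 0, 0 };
	unsigned wakeups = 0;

	for (unsigned i = 1; i <= irqs_per_second; i++) {
		/* Tick at which the i-th interrupt arrives. */
		uint64_t now = (uint64_t)i * HZ_SKETCH / irqs_per_second;

		p.count++;
		if (should_mix(&p, now)) {
			wakeups++;
			p.count = 0;
			p.last = now;
		}
	}
	return wakeups;
}
```

Under these assumptions, wakeups_per_second(100) gives a single wakeup
driven by the one-second timeout, while wakeups_per_second(40000) gives
39 wakeups driven by the interrupt count alone.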

> > Jason, from random's point of view: deferring until 1k interrupts + 1sec
> > delay is not desired due to low entropy, right?
>
> Definitely || is preferable to &&.
>
> >
> > > Rather than incur the scheduling latency from queue_work_on, we can
> > > instead switch to running on the next timer tick, on the same core. This
> > > also batches things a bit more -- once per jiffy -- which is okay now
> > > that mix_interrupt_randomness() can credit multiple bits at once.
> >
> > Hmmm. Do you see higher contention on input_pool.lock? Just asking
> > because if more than one CPU invokes this timer callback aligned, then
> > they block on the same lock.
>
> I've been doing various experiments, sending mini patches to Oracle and
> having them test this in their rig. So far, it looks like the cost of
> the body of the worker itself doesn't matter much, but rather the cost
> of the enqueueing function is key. Still investigating though.
>
> It's a bit frustrating, as all I have to work with are results from the
> tests, and no perf analysis. It'd be great if an engineer at Oracle was
> capable of tackling this interactively, but at the moment it's just me
> sending them patches. So we'll see. Getting closer though, albeit very
> slowly.

Oh boy. Okay.

> Jason

Sebastian
