Re: [PATCH net v1 1/2] lan743x: improve performance: fix rx_napi_poll/interrupt ping-pong

From: Eric Dumazet
Date: Tue Dec 08 2020 - 18:51:24 EST


On Wed, Dec 9, 2020 at 12:29 AM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
>
> On Tue, 8 Dec 2020 17:23:08 -0500 Sven Van Asbroeck wrote:
> > On Tue, Dec 8, 2020 at 2:50 PM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
> > >
> > > >
> > > > +done:
> > > > /* update RX_TAIL */
> > > > lan743x_csr_write(adapter, RX_TAIL(rx->channel_number),
> > > > rx_tail_flags | rx->last_tail);
> > > > -done:
> > > > +
> > >
> > > I assume this rings the doorbell to let the device know that more
> > > buffers are available? If so it's a little unusual to do this at the
> > > end of NAPI poll. The more usual place would be to do this every n
> > > times a new buffer is allocated (in lan743x_rx_init_ring_element()?)
> > > That's to say for example ring the doorbell every time a buffer is put
> > > at an index divisible by 16.
> >
> > Yes, I believe it tells the device that new buffers have become available.
> >
> > I wonder why it's so unusual to do this at the end of a napi poll? Omitting
> > this could result in sub-optimal use of buffers, right?
> >
> > Example:
> > - tail is at position 0
> > - core calls napi_poll(weight=64)
> > - napi poll consumes 15 buffers and pushes 15 skbs, then ring empty
> > - index not divisible by 16, so tail is not updated
> > - weight not reached, so napi poll re-enables interrupts and bails out
> >
> > Result: now there are 15 buffers which the device could potentially use, but
> > because the tail wasn't updated, it doesn't know about them.
>
> Perhaps 16 for a device with 64 descriptors is rather high indeed.
> Let's say 8. If the device misses 8 packet buffers on the ring,
> that should be negligible.
>

mlx4 uses 8 as the threshold (see mlx4_en_refill_rx_buffers()).
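
To make the batching idea concrete, here is a minimal userspace sketch of a threshold-based refill. It is not the mlx4 code, only the general shape. The struct, the helper names and the REFILL_THRESHOLD constant are all hypothetical, and the doorbell is just a counter standing in for the tail/CSR write.

```c
#include <stdint.h>

/* Hypothetical threshold: skip the refill until this many buffers are missing. */
#define REFILL_THRESHOLD 8u

/* Toy model of an RX ring; prod and cons are free-running counters. */
struct rx_ring_sim {
	uint32_t size;      /* number of descriptors in the ring */
	uint32_t prod;      /* buffers handed to the device so far */
	uint32_t cons;      /* buffers consumed by the driver so far */
	uint32_t doorbells; /* simulated tail (CSR) writes */
};

/* Driver consumes up to n filled buffers; returns how many it took. */
static uint32_t sim_consume(struct rx_ring_sim *r, uint32_t n)
{
	uint32_t avail = r->prod - r->cons;

	if (n > avail)
		n = avail;
	r->cons += n;
	return n;
}

/*
 * Refill every missing buffer, but only once at least REFILL_THRESHOLD
 * are missing, and ring the doorbell once for the whole batch.
 * Returns the number of buffers refilled.
 */
static uint32_t sim_refill(struct rx_ring_sim *r)
{
	uint32_t missing = r->size - (r->prod - r->cons);

	if (missing < REFILL_THRESHOLD)
		return 0;

	r->prod += missing; /* stand-in for allocating 'missing' buffers */
	r->doorbells++;     /* stand-in for the single tail write */
	return missing;
}
```

In Sven's example (15 buffers consumed, then the ring runs empty), calling sim_refill() at the end of the poll sees 15 missing buffers, refills all of them and writes the tail once, so at most REFILL_THRESHOLD - 1 buffers are ever withheld from the device.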

> Depends on the cost of the CSR write, usually packet processing is
> putting a lot of pressure on the memory subsystem of the CPU, hence
> amortizing the write over multiple descriptors helps. The other thing
> is that you could delay the descriptor writes to write full cache lines,
> but I don't think that will help on IMX6.
>
> > It does make sense to update the tail more frequently than only at the end
> > of the napi poll, though?
> >
> > I'm new to napi polling, so I'm quite interested to learn about this.
>
> There is a tracepoint which records how many packets NAPI has polled:
> napi:napi_poll, you can see easily what your system is doing.
>
> What you want to avoid is the situation where the device already used
> up all the descriptors by the time driver finishes the Rx processing.
> That'd result in drops. So the driver should push the buffers back to
> the device reasonably early.
>
> With a ring of 64 descriptors and NAPI budget of 64 it's not unlikely
> that the ring will be completely used when processing runs.
>
> > > > + /* up to half of elements in a full rx ring are
> > > > + * extension frames. these do not generate skbs.
> > > > + * to prevent napi/interrupt ping-pong, limit default
> > > > + * weight to the smallest no. of skbs that can be
> > > > + * generated by a full rx ring.
> > > > + */
> > > > netif_napi_add(adapter->netdev,
> > > > &rx->napi, lan743x_rx_napi_poll,
> > > > - rx->ring_size - 1);
> > > > + (rx->ring_size - 1) / 2);
> > >
> > > This is rather unusual, drivers should generally pass NAPI_POLL_WEIGHT
> > > here.
> >
> > I agree. The problem is that a full ring buffer of 64 buffers will only
> > contain 32 buffers with network data - the others are timestamps.
> >
> > So napi_poll(weight=64) can never reach its full weight. Even with a full
> > buffer, it always assumes that it has to stop polling, and re-enable
> > interrupts, which results in a ping-pong.
>
> Interesting I don't think we ever had this problem before. Let me CC
> Eric to see if he has any thoughts on the case. AFAIU you should think
> of the weight as way of arbitrating between devices (if there is more
> than one).

The driver could be called with an arbitrary budget (e.g. 64). If its ring
buffer has been depleted, it should return @budget instead of the skb count,
and not rearm the interrupt:

	if (count < budget && !rx_ring_fully_processed) {
		if (napi_complete_done(napi, count))
			rearm_irqs();
		return count;
	}
	return budget;
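
The same decision can be exercised outside the kernel. The sketch below is a userspace model only: struct poll_result, sim_poll_decision() and the ring_depleted flag are hypothetical names, and napi_complete_ok stands in for the return value of napi_complete_done().

```c
#include <stdbool.h>

/* What the poll function hands back to the core, plus the IRQ decision. */
struct poll_result {
	int  ret;       /* value returned from the poll function */
	bool rearm_irq; /* whether the RX interrupt gets re-enabled */
};

/*
 * count:            skbs actually pushed up the stack in this poll
 * budget:           budget passed in by the core
 * ring_depleted:    true if the whole ring was walked, i.e. the device
 *                   had used every descriptor (assumed meaning)
 * napi_complete_ok: simulated result of napi_complete_done()
 */
static struct poll_result sim_poll_decision(int count, int budget,
					    bool ring_depleted,
					    bool napi_complete_ok)
{
	struct poll_result res;

	if (count < budget && !ring_depleted) {
		/* Genuinely out of work: complete, rearm only if allowed. */
		res.ret = count;
		res.rearm_irq = napi_complete_ok;
		return res;
	}

	/* Claim the full budget so the core polls again; IRQ stays off. */
	res.ret = budget;
	res.rearm_irq = false;
	return res;
}
```

With a full ring that yields only 31 skbs, sim_poll_decision(31, 64, true, true) returns the budget (64) and leaves the interrupt masked, so the poll stays scheduled instead of bouncing back to interrupt mode.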


>
> NAPI does not do any deferral (in wall clock time terms) of processing,
> so the only difference you may get for lower weight is that softirq
> kthread will get a chance to kick in earlier.
>
> > Would it be better to fix the weight counting? Increase the count
> > for every buffer consumed, instead of for every skb pushed?
>