RE: >10% performance degradation since 2.6.18

From: Chetan . Loke
Date: Wed Jul 08 2009 - 11:04:24 EST


> -----Original Message-----
> From: Daniel J Blueman [mailto:daniel.blueman@xxxxxxxxx]
> Sent: Tuesday, July 07, 2009 6:06 PM
> To: Loke,Chetan; matthew@xxxxxx; andi@xxxxxxxxxxxxxx;
> jens.axboe@xxxxxxxxxx; Arjan van de Ven
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: >10% performance degradation since 2.6.18
>
> On Mon, Jul 6, 2009 at 10:58 PM, <Chetan.Loke@xxxxxxxxxx> wrote:
> >> -----Original Message-----
> >> From: linux-kernel-owner@xxxxxxxxxxxxxxx
> >> [mailto:linux-kernel-owner@xxxxxxxxxxxxxxx] On Behalf Of Daniel J
> >> Blueman
> >> Sent: Sunday, July 05, 2009 7:01 AM
> >> To: Matthew Wilcox; Andi Kleen
> >> Cc: Linux Kernel; Jens Axboe; Arjan van de Ven
> >> Subject: Re: >10% performance degradation since 2.6.18
> >>
> >> On Jul 3, 9:10 pm, Arjan van de Ven <ar...@xxxxxxxxxxxxx> wrote:
> >> > On Fri, 3 Jul 2009 21:54:58 +0200
> >> >
> >> > Andi Kleen <a...@xxxxxxxxxxxxxx> wrote:
> >> > > > That would seem to be a fruitful avenue of investigation --
> >> > > > whether limiting the cards to a single RX/TX interrupt would be
> >> > > > advantageous, or whether spreading the eight interrupts out over
> >> > > > the CPUs would be advantageous.
> >> >
> >> > > The kernel should really do the per cpu binding of MSIs by default.
> >> >
> >> > ... so that you can't do power management on a per socket basis?
> >> > hardly a good idea.
> >> >
> >> > just need to use a new enough irqbalance and it will spread out the
> >> > interrupts unless your load is low enough to go into low power mode.
> >>
> >> I was finding newer kernels (>~2.6.24) would set the Redirection Hint
> >> bit in the MSI address vector, allowing the processors to deliver the
> >> interrupt to the lowest interrupt priority (eg idle, no powersave)
> >> core (http://www.intel.com/Assets/PDF/manual/253668.pdf pp10-66) and
> >> older irqbalance daemons would periodically naively rewrite the
> >> bitmask of cores, delivering the interrupt to a static one.
> >>
> >> Thus, it may be worth checking if disabling any older irqbalance
> >> daemon gives any win.
> >>
> >> Perhaps there is value in writing different subsets of cores to the
> >> MSI address vector core bitmask (with the redirection hint enabled)
> >> for different I/O queues on heavy interrupt sources? By default, it's
> >> all cores.
> >>
> >
> > Possible enhancement -
> >
> > 1) Drain the responses in the xmit_frame() path. That is, post the
> >   TX-request() and just before returning see if there are
> >   any more responses in the RX-queue. This will minimize (only if
> >   the NIC f/w coalesces) interrupt load.
> >   The n/w core should drain the responses rather than calling the
> >   drain-routine from the adapter's xmit_frame() handler. This way
> >   there won't be any need to modify individual xmit_frame handlers.
>
> The problem of additional checking on such a hot path, is each
> (synchronous) read over the PCIe bus takes ~1us, which is the
> same order of cost of executing 1000 instructions (and
> getting greater with faster processors and deeper serial
> buses).

Totally agree. Non-posted transactions are expensive.


> Perhaps it's sufficiently low cost if the NIC's RX
> queue status/structure was in main memory (vs registers over PCI).
>

OK, I was under the impression that the RX pointers live in host memory, so
checking them would be a local memory read as opposed to a PCI transaction.
Where they live is obviously a NIC-design (ASIC/fw) decision. Nothing much we
can do about it.

> If latency is not favoured over throughput, increasing the
> packet coalescing watermarks may reduce interrupt rate and
> thus some performance loss?
>
> Daniel

Well, coalescing goes hand-in-hand with this. We can't optimize just the 'driver/OS' stack and not optimize the adapter at all.
Let's assume (and this was my assumption) that the (partial) RX-structs are in the host's memory.

So,
1) The RX-put ptr is in host memory (the NIC updates it with a posted transaction).
2) The RX-get ptr is on the NIC (the host updates it with a posted transaction; the PCIe bridge will coalesce the posted writes).

3) Under heavy load the TX path is getting hit really hard.
3.1) The modified algo will read (now a local read) the RX-put ptr and drain the RX-queue after posting the TX-job.
3.2) The host then updates the RX-get ptr on the NIC via a posted write (don't flush the write, let the bridge coalesce).
Coalescing has worked for us so far, so it should work for others. But we can't speak for the chipsets.
3.3) This is where the coalescing algo kicks in on the NIC (adapter) side.
3.3.1) The NIC f/w reads (local read) the RX-get ptr and does one of two things -
3.3.1.1) If the host is making good progress then it should reset the interrupt-timer.
3.3.1.2) If the host is not making enough progress (latency-algorithm) then it should fire the interrupt.
This will ensure timely response and lower latency under bursty loads.
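
A rough sketch of steps 3.1 to 3.3 as a plain C model (user-space, no real hardware). Every name here (struct rx_ring, rx_drain_after_tx, nic_coalesce_tick, RING_SIZE) is made up for illustration and is not taken from any driver. Using "backlog above a threshold" as the measure of host progress is also only an assumption; a real latency algorithm on the adapter could look quite different.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u  /* hypothetical ring size */

/* Hypothetical model of the shared state, not a real driver layout.
 * Pointers are free-running counters; slot index = counter % RING_SIZE. */
struct rx_ring {
    volatile uint32_t put_host; /* RX-put ptr: host memory, written by the NIC (step 1) */
    uint32_t get_shadow;        /* host's private copy of the RX-get ptr */
    uint32_t get_nic;           /* stands in for the RX-get register on the NIC (step 2) */
    uint32_t mmio_writes;       /* counts simulated posted writes to the NIC */
};

/* Steps 3.1/3.2: call after the TX job has been posted.
 * Returns the number of RX entries drained. */
static uint32_t rx_drain_after_tx(struct rx_ring *r)
{
    uint32_t put = r->put_host;        /* local memory read, no PCI read */
    uint32_t n = put - r->get_shadow;  /* unsigned math handles wrap-around */

    assert(n <= RING_SIZE);            /* the NIC must not overrun the ring */
    if (n == 0)
        return 0;                      /* nothing pending: no MMIO at all */

    /* ... process n entries starting at get_shadow % RING_SIZE ... */

    r->get_shadow = put;
    r->get_nic = put;                  /* models a posted write, not flushed */
    r->mmio_writes++;
    return n;
}

enum nic_action { NIC_RESET_TIMER, NIC_FIRE_IRQ };

/* Step 3.3: adapter-side decision. nic_put is the NIC's own put counter;
 * max_backlog is an assumed tuning knob standing in for the latency algo. */
static enum nic_action nic_coalesce_tick(const struct rx_ring *r,
                                         uint32_t nic_put,
                                         uint32_t max_backlog)
{
    uint32_t backlog = nic_put - r->get_nic; /* local read on the NIC side */

    return backlog <= max_backlog ? NIC_RESET_TIMER : NIC_FIRE_IRQ;
}
```

The point of the model is that the host side never issues a non-posted read: it only reads its own memory and does at most one posted write per drain, and none when the queue is empty.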

I think some chipsets support a relaxed version of PCI-read. So drivers could use that unless you really want to flush your posted-write.



Chetan Loke