Re: [PATCH 1/3] ptr_ring: batch ring zeroing

From: Jason Wang
Date: Mon Apr 17 2017 - 22:19:16 EST

On 2017-04-15 06:50, Michael S. Tsirkin wrote:
On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:

On 2017-04-12 16:03, Jason Wang wrote:

On 2017-04-07 13:49, Michael S. Tsirkin wrote:
A known weakness in the ptr_ring design is that it does not handle well the
situation when the ring is almost full: as entries are consumed they are
immediately used again by the producer, so consumer and producer end up
writing to a shared cache line.
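For illustration, the pre-patch consume step looks roughly like this (a
simplified sketch; field names mirror struct ptr_ring, but details are
elided):

/* Simplified sketch of the unbatched consume step. Zeroing the
 * slot immediately is what puts the consumer's store into the same
 * cache line that a nearly-caught-up producer is about to write.
 */
static inline void *unbatched_consume(struct ptr_ring *r)
{
	void *ptr = r->queue[r->consumer];

	if (ptr) {
		r->queue[r->consumer] = NULL;	/* shared-line write */
		if (++r->consumer >= r->size)
			r->consumer = 0;
	}
	return ptr;
}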

To fix this, add batching to consume calls: as entries are
consumed, do not write NULL into the ring until we get
a multiple (in the current implementation, 2x) of cache lines
away from the producer. At that point, write them all out.

We do the write-out in reverse order to keep the
producer from sharing a cache line with the consumer for as long
as possible.

Write-out also triggers when the ring wraps around - there's
no special reason to do this, but it helps keep the code
a bit simpler.
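In code, the batched scheme looks roughly like the following (a sketch
modelled on the patch: consumer_head marks the next entry to consume,
consumer_tail the first not-yet-zeroed entry, and r->batch the write-out
threshold):

/* Sketch of the batched discard step: entries between consumer_tail
 * and consumer_head have been consumed but not yet zeroed out.
 */
static inline void batched_discard_one(struct ptr_ring *r)
{
	int head = r->consumer_head++;

	/* Write out a batch once we are r->batch entries away from
	 * the producer, or when the ring wraps around.
	 */
	if (r->consumer_head - r->consumer_tail >= r->batch ||
	    r->consumer_head >= r->size) {
		/* Reverse order: the cache line closest to the producer
		 * is written last, so the producer stalls on the first
		 * line only until the whole batch is out.
		 */
		while (head >= r->consumer_tail)
			r->queue[head--] = NULL;
		r->consumer_tail = r->consumer_head;
	}
	if (r->consumer_head >= r->size) {
		r->consumer_head = 0;
		r->consumer_tail = 0;
	}
}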

What should we do if getting away from the producer by 2 cache lines
would mean keeping the ring more than half empty?
Maybe we should scale the batching down in that case;
the current patch simply reduces the batching.
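For reference, the sizing logic at allocation time is along these lines
(sketch; SMP_CACHE_BYTES is the cache-line size):

/* Aim for two cache lines worth of entries per batch, but fall
 * back to a batch of 1 (no batching) when that would cover more
 * than half of a small ring.
 */
r->batch = SMP_CACHE_BYTES * 2 / sizeof(*(r->queue));
if (r->batch > r->size / 2 || !r->batch)
	r->batch = 1;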

Notes:
- it is no longer true that a call to consume guarantees
that the following call to produce will succeed.
No users seem to assume that.
- batching can also in theory reduce the signalling rate:
users that would previously send interrupts to the producer
to wake it up after consuming each entry would now only
need to do this once per batch.
Doing this would be easy by returning a flag to the caller
(a sketch of the idea follows below).
No users seem to do signalling on consume yet, so this has not
been implemented.
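A hypothetical sketch of that flag idea (not part of the patch; the
wrapper name and bool parameter are made up for illustration, while
__ptr_ring_consume() is the existing unlocked helper):

/* Hypothetical consume variant that reports whether this call
 * flushed a batch, so the caller signals the producer only once
 * per batch instead of once per entry.
 */
static inline void *consume_and_flag(struct ptr_ring *r, bool *flushed)
{
	int tail = r->consumer_tail;
	void *ptr = __ptr_ring_consume(r);

	/* consumer_tail advances only when a batch was zeroed out */
	*flushed = r->consumer_tail != tail;
	return ptr;
}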

Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
---

Jason, I am curious whether the following gives you some of
the performance boost that you see with vhost batching
patches. Is vhost batching on top still helpful?
The patch looks good to me; I will run a test with the vhost batching patches.

Thanks
Still helpful:

before this patch: 1.84Mpps
with this patch: 2.00Mpps
with batch dequeuing: 2.30Mpps
Just a thought: could you test dropping the consumer spinlock
completely? Just around the peek?

Dropping the spinlock around peeking gives a 2% improvement, and dropping it for consuming as well gives another 2%.
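For context, what was tested amounts to calling the existing unlocked
helpers directly, on the assumption that the caller (here vhost) already
serializes the consumer side (sketch; __ptr_ring_peek() and
__ptr_ring_consume() are the lock-free variants from
include/linux/ptr_ring.h):

/* Sketch: skip r->consumer_lock entirely. Only safe when the
 * caller guarantees a single consumer at a time.
 */
static inline bool ring_empty_nolock(struct ptr_ring *r)
{
	return !__ptr_ring_peek(r);	/* peek without the lock */
}

static inline void *ring_consume_nolock(struct ptr_ring *r)
{
	return __ptr_ring_consume(r);	/* consume without the lock */
}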


As I said previously, the perf c2c tool should be helpful
for locating sources of cache-related latency.


perf c2c indeed shows that some false sharing is reduced by this patch. But it does not show an obvious difference with batch dequeuing on top.

Thanks

Acked-by: Jason Wang <jasowang@xxxxxxxxxx>

Thanks