Re: [PATCH] Move kfree_call_rcu() to slab_common.c

From: Paul E. McKenney
Date: Thu Dec 21 2017 - 20:27:37 EST


On Thu, Dec 21, 2017 at 09:06:28AM -0800, Matthew Wilcox wrote:
> On Thu, Dec 21, 2017 at 07:54:34AM -0800, Paul E. McKenney wrote:
> > > +/* Queue an RCU callback for lazy invocation after a grace period.
> > > + * Currently there is no way of tagging the lazy RCU callbacks in the
> > > + * list of pending callbacks. Until then, this function may only be
> > > + * called from kfree_call_rcu().
> >
> > But now we might have a way.
> >
> > If the value in ->func is too small to be a valid function, RCU invokes
> > a fixed function name. This function can then look at ->func and do
> > whatever it wants, for example, maintaining an array indexed by the
> > ->func value that says what function to call and what else to pass it,
> > including for example the slab pointer and offset.
> >
> > Thoughts?
>
> Thought 1 is that we can force functions to be quad-byte aligned on all
> architectures (gcc option -falign-functions=...), so we can have more
> than the 4096 different values we currently use. We can get 63.5 bits of
> information into that ->func argument if we align functions to at least
> 4 bytes, or 63 if we only force alignment to a 2-byte boundary. I'm not
> sure if we support any architecture other than x86 with byte-aligned
> instructions. (I'm assuming that function descriptors as used on POWER
> and ia64 will also be sensibly aligned).

I do like this approach, especially should some additional subsystems
need this sort of special handling from RCU. It is also much faster
to demultiplex than alternative schemes based on address ranges and
the like.
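
For concreteness, here is a minimal user-space sketch of that demultiplexing.
Everything in it is a stand-in and an assumption: struct cb_head mimics
the shape of struct rcu_head, SPECIAL_LIMIT is an assumed bound below which
no valid function can live, and special_table[] is the hypothetical array
indexed by the ->func value. It is not how RCU is implemented today.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct rcu_head. */
struct cb_head {
    struct cb_head *next;
    void (*func)(struct cb_head *head);
};

/* Assumption: no valid function lives below this address. */
#define SPECIAL_LIMIT 4096UL

typedef void (*special_fn)(struct cb_head *head, unsigned long code);

/* Hypothetical table indexed by the small ->func value. */
static special_fn special_table[SPECIAL_LIMIT];

/* Counters so the demo handlers have an observable effect. */
static int special_calls;
static int normal_calls;

static void demo_special(struct cb_head *head, unsigned long code)
{
    (void)head;
    (void)code;
    special_calls++;
}

static void demo_normal(struct cb_head *head)
{
    (void)head;
    normal_calls++;
}

/* One comparison decides between a normal callback and a table lookup. */
static void invoke_callback(struct cb_head *head)
{
    uintptr_t v = (uintptr_t)head->func;

    if (v < SPECIAL_LIMIT) {
        if (special_table[v])
            special_table[v](head, v);
        return;
    }
    head->func(head);
}
```

The point of the sketch is only that the check is a single compare plus an
array index, with no address-range search.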

How many bits are required by slab? Would ~56 bits (less the bottom
bit pattern reserved for function pointers) suffice on 64-bit systems
and ~24 bits on 32-bit systems? That would allow up to 256 specially
handled situations, which should be enough. (Famous last words!)
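
One possible layout for that split on a 64-bit system, purely as a sketch:
the top 8 bits select one of the 256 special situations, the low bit marks
the value as "not a function pointer" (this assumes functions are aligned
to at least 2 bytes), and the 55 bits in between carry the payload. The
macro and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define TAG_SHIFT    56                  /* top 8 bits: which situation */
#define MARK_BIT     UINT64_C(1)         /* low bit set: not a function */
#define PAYLOAD_BITS 55                  /* bits 1..55 carry the payload */
#define PAYLOAD_MASK ((UINT64_C(1) << PAYLOAD_BITS) - 1)

/* Pack a situation tag and a payload into one ->func-sized value. */
static uint64_t encode_special(uint8_t tag, uint64_t payload)
{
    return ((uint64_t)tag << TAG_SHIFT) |
           ((payload & PAYLOAD_MASK) << 1) |
           MARK_BIT;
}

/* An aligned function address always has the low bit clear. */
static bool is_special(uint64_t v)
{
    return (v & MARK_BIT) != 0;
}

static uint8_t special_tag(uint64_t v)
{
    return (uint8_t)(v >> TAG_SHIFT);
}

static uint64_t special_payload(uint64_t v)
{
    return (v >> 1) & PAYLOAD_MASK;
}
```

A 32-bit variant would shrink the payload accordingly (roughly 23 bits
under the same assumptions).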

> Thought 2 is that the slab is quite capable of getting the slab pointer
> from the address of the object -- virt_to_head_page(p)->slab_cache
> So sorting objects by address is as good as storing their slab caches
> and offsets.

Different slab caches can in some cases interleave their slabs of objects,
right? It might well be that grouping together different slabs from
the same slab cache doesn't help, but it seems worth asking the question.
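
To make the question concrete, here is a user-space sketch of the
sort-by-address idea. It groups objects by an assumed 4 KiB page as a crude
stand-in for virt_to_head_page(p)->slab_cache, which is not available
outside the kernel. The names page_of() and sort_and_count_pages() are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 4 KiB pages; a page stands in for one slab here. */
#define PAGE_SHIFT_ASSUMED 12

static uintptr_t page_of(const void *p)
{
    return (uintptr_t)p >> PAGE_SHIFT_ASSUMED;
}

/* qsort() comparator ordering object pointers by address. */
static int cmp_addr(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)*(void * const *)a;
    uintptr_t y = (uintptr_t)*(void * const *)b;

    return (x > y) - (x < y);
}

/*
 * Sort the objects by address and return the number of distinct pages,
 * that is, the number of groups a per-slab free pass would see.
 */
static size_t sort_and_count_pages(void **objs, size_t n)
{
    size_t runs = 0;

    qsort(objs, n, sizeof(*objs), cmp_addr);
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || page_of(objs[i]) != page_of(objs[i - 1]))
            runs++;
    }
    return runs;
}
```

If neighboring pages belong to different slab caches, adjacent groups in
the sorted array still end up in different caches, which is the
interleaving concern.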

> Thought 3 is that we probably don't want to overengineer this.
> Just allocating a 14-entry buffer (along with an RCU head) is probably
> enough to give us at least 90% of the wins that a more complex solution
> would give.

Can we benchmark this? After all, memory allocation can sometimes
counter one's intuition.
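
As a starting point for such a benchmark, the 14-entry buffer might look
something like the user-space sketch below. The assumptions: struct
batch_head stands in for struct rcu_head, free() stands in for the slab
free path, and the buffer tracks occupancy by NULL slots so that head plus
14 pointers stays at 16 pointer-sized words. None of these names exist in
the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define BATCH_ENTRIES 14

/* Hypothetical stand-in for struct rcu_head (two pointers). */
struct batch_head {
    struct batch_head *next;
    void (*func)(struct batch_head *head);
};

/* RCU head plus 14 object pointers: 16 pointer-sized words. */
struct free_batch {
    struct batch_head head;
    void *objs[BATCH_ENTRIES];
};

/* Store obj in the first empty slot; return false if the batch is full. */
static bool batch_add(struct free_batch *b, void *obj)
{
    for (size_t i = 0; i < BATCH_ENTRIES; i++) {
        if (!b->objs[i]) {
            b->objs[i] = obj;
            return true;
        }
    }
    return false;
}

/*
 * Stand-in for the work done after a grace period: free every queued
 * object, empty the batch, and return how many objects were freed.
 */
static size_t batch_release(struct free_batch *b)
{
    size_t n = 0;

    for (size_t i = 0; i < BATCH_ENTRIES && b->objs[i]; i++) {
        free(b->objs[i]);
        b->objs[i] = NULL;
        n++;
    }
    return n;
}
```

When batch_add() reports a full batch, the caller would hand the batch to
RCU and start a new one; that part is left out of the sketch.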

One alternative approach would be to allocate such a buffer per
slab cache, and run each slab cache through RCU independently.
Seems like this should allow some savings. Might not be worthwhile,
but again it seemed worth asking the question.

Thanx, Paul