Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

From: Uladzislau Rezki
Date: Tue Sep 29 2020 - 18:07:52 EST


On Tue, Sep 29, 2020 at 12:15:34PM +0200, Vlastimil Babka wrote:
> On 9/18/20 9:48 PM, Uladzislau Rezki (Sony) wrote:
> > Some background and kfree_rcu()
> > ===============================
> > The pointers to be freed are stored in the per-cpu array to improve
> > performance, to enable an easier-to-use API, to accommodate vmalloc
> > memmory and to support a single argument of the kfree_rcu() when only
> > a pointer is passed. More details are below.
> >
> > In order to maintain such per-CPU arrays there is a need in dynamic
> > allocation when a current array is fully populated and a new block is
> > required. See below the example:
> >
> > 0 1 2 3 0 1 2 3
> > |p|p|p|p| -> |p|p|p|p| -> NULL
> >
> > there are two pointer-blocks, each one can store 4 addresses
> > which will be freed after a grace period is passed. In reality
> > we store PAGE_SIZE / sizeof(void *). So to maintain such blocks
> > a single page is obtain via the page allocator:
> >
> > bnode = (struct kvfree_rcu_bulk_data *)
> > __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> >
> > after that it is attached to the "head" and its "next" pointer is
> > set to previous "head", so the list of blocks can be maintained and
> > grow dynamically until it gets drained by the reclaiming thread.
> >
> > Please note. There is always a fallback if an allocation fails. In the
> > single argument, this is a call to synchronize_rcu() and for the two
> > arguments case this is to use rcu_head structure embedded in the object
> > being free, and then paying cache-miss penalty, also invoke the kfree()
> > per object instead of kfree_bulk() for groups of objects.
> >
> > Why we maintain arrays/blocks instead of linking objects by the regular
> > "struct rcu_head" technique. See below a few but main reasons:
> >
> > a) A memory can be reclaimed by invoking of the kfree_bulk()
> > interface that requires passing an array and number of
> > entries in it. That reduces the per-object overhead caused
> > by calling kfree() per-object. This reduces the reclamation
> > time.
> >
> > b) Improves locality and reduces the number of cache-misses, due to
> > "pointer chasing" between objects, which can be far spread between
> > each other.
> >
> > c) Support a "single argument" in the kvfree_rcu()
> > void *ptr = kvmalloc(some_bytes, GFP_KERNEL);
> > if (ptr)
> > kvfree_rcu(ptr);
> >
> > We need it when an "rcu_head" is not embed into a stucture but an
> > object must be freed after a grace period. Therefore for the single
> > argument, such objects cannot be queued on a linked list.
> >
> > So nowadays, since we do not have a single argument but we see the
> > demand in it, to workaround it people just do a simple not efficient
> > sequence:
> > <snip>
> > synchronize_rcu(); /* Can be long and blocks a current context */
> > kfree(p);
> > <snip>
> >
> > More details is here: https://lkml.org/lkml/2020/4/28/1626
> >
> > d) To distinguish vmalloc pointers between SLAB ones. It becomes possible
> > to invoke the right freeing API for the right kind of pointer, kfree_bulk()
> > or TBD: vmalloc_bulk().
> >
> > e) Speeding up the post-grace-period freeing reduces the chance of a flood
> > of callback's OOMing the system.
> >
> > Also, please have a look here: https://lkml.org/lkml/2020/7/30/1166
> >
> > Proposal
> > ========
> > Introduce a lock-free function that obtain a page from the per-cpu-lists
> > on current CPU. It returns NULL rather than acquiring any non-raw spinlock.
> >
> > Description
> > ===========
> > The page allocator has two phases, fast path and slow one. We are interested
> > in fast path and order-0 allocations. In its turn it is divided also into two
> > phases: lock-less and not:
> >
> > 1) As a first step the page allocator tries to obtain a page from the
> > per-cpu-list, so each CPU has its own one. That is why this step is
> > lock-less and fast. Basically it disables irqs on current CPU in order
> > to access to per-cpu data and remove a first element from the pcp-list.
> > An element/page is returned to an user.
> >
> > 2) If there is no any available page in per-cpu-list, the second step is
> > involved. It removes a specified number of elements from the buddy allocator
> > transferring them to the "supplied-list/per-cpu-list" described in [1].
> >
> > Summarizing. The __rcu_alloc_page_lockless() covers only [1] and can not
> > do step [2], due to the fact that [2] requires an access to zone->lock.
> > It implies that it is super fast, but a higher rate of fails is also
> > expected.
> >
> > Usage: __rcu_alloc_page_lockless();
> >
> > Link: https://lore.kernel.org/lkml/20200814215206.GL3982@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> > Not-signed-off-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
>
> After reading all the threads and mulling over this, I am going to deflect from
> Mel and Michal and not oppose the idea of lockless allocation. I would even
> prefer to do it via the gfp flag and not a completely separate path. Not using
> the exact code from v1, I think it could be done in a way that we don't actually
> look at the new flag until we find that pcplist is empty - which should not
> introduce overhead to the fast-fast path when pcpclist is not empty. It's more
> maintainable that adding new entry points, IMHO.
>
Thanks for reading all that from the beginning! It must be tough due to the
fact there were lot of messages in the threads, so at least i was lost.

I agree that adding a new entry or separate lock-less function can be considered
as something that is hard to maintain. I have a question here. I mean about your
different look at it:

<snip>
bool is_pcp_cache_empty(gfp_t gfp)
{
struct per_cpu_pages *pcp;
struct zoneref *ref;
unsigned long flags;
bool empty;

ref = first_zones_zonelist(node_zonelist(
numa_node_id(), gfp), gfp_zone(gfp), NULL);
if (!ref->zone)
return true;

local_irq_save(flags);
pcp = &this_cpu_ptr(ref->zone->pageset)->pcp;
empty = list_empty(&pcp->lists[gfp_migratetype(gfp)]);
local_irq_restore(flags);

return empty;
}

disable_irq();
if (!is_pcp_cache_empty(GFP_NOWAIT))
__get_free_page(GFP_NOWAIT);
enable_irq();
<snip>

Do you mean to have something like above? I mean some extra API
function that returns true or false if fast-fast allocation can
either occur or not. Above code works just fine and never touches
main zone->lock.

i.e. Instead of introducing an extra GFP_LOCKLESS flag or any new
extra lock-less function. We could have something that checks a
pcp page cache list, thus it can guarantee that a request would
be accomplish using fast-fast path.

>
> But there's still the condition that it's sufficiently shown that the allocation
> is useful for RCU. In that case I prefer that the page allocator (or MM in
> general) can give its users what they need without having to work around it.
>
I see your point, if there is something that can be generic and not specif - makes
sense to me also. I do share your view.

>
> Seems like GFP_ATOMIC is insufficient today so if that means we need a new flag
> for the raw spinlock context, so be it. But if your usage of __GFP_NO_LOCKS
> depends on the result of preempt_count() or similar check whether this is a
> context that needs it, I'd prefer to keep this in the caller.
>
By having the preemptabele() thing, we can just workaround GFP_ATOMIC or GFP_NOWAIT
flag issues. Because both are insufficient. I mean when we see that a context is
atomic(can not sleep) we can bypass any usage of the page allocator because it will
hit BUG() with flags which are in question. And yes, the best is when we do not
think much about context, so there is something reliable from point of using the
page allocator.

Thanks Vlastimil!

--
Vlad Rezki