idea for scalable and almost waste-free percpu counter alloc (was: Re: [PATCH 0/2] execve scalability issues, part 1)

From: Mateusz Guzik
Date: Tue Aug 22 2023 - 20:31:25 EST


On 8/21/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> On Mon, Aug 21, 2023 at 02:07:28PM -0700, Dennis Zhou wrote:
>> On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
>> > With this out of the way I'll be looking at some form of caching to
>> > eliminate these allocs as a problem.
>> >
>>
>> I'm not against caching, this is just my first thought. Caching will
>> have an impact on the backing pages of percpu. All it takes is 1
>> allocation on a page for the current allocator to pin n pages of memory.
>> A few years ago percpu depopulation was implemented so that limits the
>> amount of resident backing pages.
>>
>
> I'm painfully aware.
>
>> Maybe the right thing to do is preallocate pools of common sized
>> allocations so that way they can be recycled such that we don't have to
>> think too hard about fragmentation that can occur if we populate these
>> pools over time?
>>
>
> This is what I was going to suggest :)
>
> FreeBSD has a per-cpu allocator which pretends to be the same as the
> slab allocator, except handing out per-cpu bufs. So far it has sizes 4,
> 8, 16, 32 and 64 and you can act as if you are mallocing in that size.
>
> Scales perfectly fine of course since it caches objs per-CPU, but there
> is some waste and I have 0 idea how it compares to what Linux is doing
> on that front.
>
> I stress though that even if you were to carve out certain sizes, a
> global lock to handle ops will still kill scalability.
>
> Perhaps granularity better than global, but less than per-CPU would be a
> sweet spot for scalabability vs memory waste.
>
> That said...
>
>> Also as you've pointed out, it wasn't just the percpu allocation being
>> the bottleneck, but percpu_counter's global lock too for hotplug
>> support. I'm hazarding a guess most use cases of percpu might have
>> additional locking requirements too such as percpu_counter.
>>
>
> True Fix(tm) is a longer story.
>

So I had a meh sleep and I think I figured it out. Key insight is that
with some hackery we can move counters *in active use* to different
pagesets. More, I found the crux of the hackery is already used by
percpu refcounting so I don't think it would be seen as too ugly
(keyword: RCU). :)

The word "allocate" is way overloaded for this discussion, so I'm
going to use different terms for each use case.

These would be:
backing percpu machinery -- it *assigns* memory for use by caches
caching layer -- it *hands out* memory to final consumers, memory is
*cached* if not handed out

Also the collection of per-cpu pages would be referred to as "pageset"
(already used above).

Say caches are per-cpu and are able to satisfy all requests as is.
This means there is no contention on the backing percpu store. Mass
import/export to/from caches can also handle the hotplug list --
everything cached or handed out can reside on it. Then handing
out/returning a counter does not have to synchronize anything with
other CPUs.

While this obviously scales, there is the problem of memory waste you mentioned.

For cached objects things are easy -- they can be unassigned *or*
moved elsewhere, for example trade 2 counters holding a pageset
hostage for 2 counters in a pageset which is already in heavy use.

Nothing new in the above and for the sake of argument let's say this
is not enough when it comes to fighting memory waste.

Say we got really unlucky and there is a bunch of processes where each
one happens to have counters in an otherwise empty pageset. So that's
cpucount * processes pages spent when they all perhaps could fit in
one * cpucount and that's not acceptable.

The idea boils down to temporarily disabling per-cpu operation,
waiting for everyone to observe the new state, changing the pointer to
percpu stuff and re-enabling. A simplified remark is that this can be
done with RCU, wrapping the regular consumers with rcu_read_lock and
invoking synchronize_rcu to wait for them (there is more to it though,
some fences may be needed).

To sum up, it is very much possible to avoid any memory waste *and*
have perfectly scalable percpu counter init/destroy, while adding next
to no overhead to consumers of said counters.

That said, I don't know if the complexity is worth it assuming the
"only use percpu counters if going multithreaded" idea gets
implemented.

--
Mateusz Guzik <mjguzik gmail.com>