Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call

From: Linus Torvalds
Date: Fri Oct 13 2017 - 14:30:37 EST


On Fri, Oct 13, 2017 at 2:35 AM, Ben Maurer <bmaurer@xxxxxx> wrote:
>
> I'm really excited to hear that you're open to this patch set and totally understand the desire for some more numbers.

So the patch-set actually looks very reasonable today. I looked
through it (ok, I wasn't cc'd on the ppc-only patches so I didn't look
at those, but I don't think they are likely objectionable either), and
everything looked fine from a patch standpoint.

But it's not _just_ numbers for real loads I'm looking for, it's
actually an _existence proof_ for a real load too. I'd like to know
that the suggested interface _really_ works in practice too for all
the expected users.

In particular, it's easy to make test-cases to show basic
functionality, but that does not necessarily show that the interface
then works in "real life".

For example, if this is supposed to work for a malloc library, it's
important that people show that yes, this can really work in a
*LIBRARY*.

That sounds so obvious and stupid that you might go "What do you
mean?", but for things to work for libraries, they have to work
together with *other* users, and with *independent* users.

For example, say that you're some runtime that wants to use the percpu
thing for percpu counters - because you want to avoid cache ping-pong,
and you want to avoid per-thread allocation overhead (or per-thread
scaling for just summing up the counters) when you have potentially
tens of thousands of threads.

Now, how does this runtime work *together* with

- CPU hotplug adding new cpu's while you are running (and after you
allocated your percpu areas)

- libraries and system admins that limit - or extend - you to a
certain set of CPUs

- another library (like the malloc library) that wants to use the
same interface for its percpu allocation queues.

maybe all of this "just works", but I really want to see an existence
proof. Not just a "dedicated use of the interface for one benchmark".

So yes, I want to see numbers, but I really want to see something much
more fundamental. I want to feel like there is a good reason to
believe that the interface really is sufficient and that it really
does work, even when a single thread may have multiple *different*
uses for this. Statistics, memory allocation queues, RCU, per-cpu
locking, yadda yadda. All these things may want to use this, but they
want to use it *together*, and without you having to write special
code where every user needs to know about every other user statically.

Can you load two different *dynamic* libraries that each independently
uses this thing for their own use, without having to be built together
for each other?

>> A "increment percpu value" simply isn't relevant.
>
> While I understand it seems trivial, my experience has been that this type of operation can actually be important in many server workloads.

Oh, I'm not saying that it's not relevant to have high-performance
statistics gathering using percpu data structures. Of _course_ that is
important, we do that very much in the kernel itself.

But a benchmark that does nothing else really isn't relevant. If the
*only* thing somebody uses this for is statistics, it's simply not
good enough.


>> Because without real-world uses, it's not obvious that there won't be
>> somebody who goes "oh, this isn't quite enough for us, the semantics
>> are subtly incompatible with our real-world use case".
>
> Is your concern mainly this question (is this patchset a good way to
> bring per-cpu algorithms to userspace)? I'm hoping that given the
> variety of ways that per-cpu data structures are used in the kernel
> the concerns around this patch set are mainly around what approach we
> should take rather than if per-cpu algorithms are a good idea at all.
> If this is your main concern perhaps our focus should be around
> demonstrating that a number of useful per-cpu algorithms can be
> implemented using restartable sequences.

The important thing for me is that it should demonstrate that you can
have users co-exists, and that the interface is sufficient for that.

So I do want to see "just numbers" in the sense that I would want to
see that people have actually written code that takes advantage of the
percpu nature to do real things (like an allocator). But more than
that, I want to see *use*.

> Ultimately I'm worried there's a chicken and egg problem here.

This patch-set has been around for *years* in some form. It's improved
over the years, but the basic approaches are not new.

Honestly, if people still don't have any actual user-level code that
really _uses_ this, I'm not interested in merging it.

There's no chicken-and-egg here. Anybody who wants to push this
patch-set needs to write the user level code to validate that the
patch-set makes sense. That's not chicken-and-egg, that's just
"without the user-space code, the kernel code has never been tested,
validated or used".

And if nobody can be bothered to write the user-level code and test
this patch-series, then obviously it's not important enough for the
kernel to merge it.

Linus