Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

From: Thomas Gleixner
Date: Mon Nov 20 2017 - 12:48:45 EST


On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
> ----- On Nov 16, 2017, at 6:26 PM, Thomas Gleixner tglx@xxxxxxxxxxxxx wrote:
> >> +#define NR_PINNED_PAGES_ON_STACK 8
> >
> > 8 pinned pages on stack? Which stack?
>
> The common cases only need to touch a few pages, and we can keep the
> pointers in an array on the kernel stack within the cpu_opv system
> call.
>
> Updating to:
>
> /*
> * Typical invocations of cpu_opv need only a few pages. Keep struct page
> * pointers in an array on the stack of the cpu_opv system call up to
> * this limit, beyond which the array is dynamically allocated.
> */
> #define NR_PIN_PAGES_ON_STACK 8

That name still sucks. NR_PAGE_PTRS_ON_STACK would be immediately obvious.
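
For reference, what I'd expect here is the usual on-stack array with a
kmalloc() fallback, along the lines of (sketch only, not the patch code,
the helper name is made up):

#include <linux/mm.h>
#include <linux/slab.h>

#define NR_PAGE_PTRS_ON_STACK	8

/* Sketch: use the on-stack array for small vectors, kmalloc() otherwise */
static int pin_opv_pages(unsigned long nr_pages)
{
	struct page *stack_pages[NR_PAGE_PTRS_ON_STACK];
	struct page **pages = stack_pages;
	int ret = 0;

	if (nr_pages > NR_PAGE_PTRS_ON_STACK) {
		pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
		if (!pages)
			return -ENOMEM;
	}

	/* ... get_user_pages_fast() and the actual work go here ... */

	if (pages != stack_pages)
		kfree(pages);
	return ret;
}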

> >> + * The operations available are: comparison, memcpy, add, or, and, xor,
> >> + * left shift, and right shift. The system call receives a CPU number
> >> + * from user-space as argument, which is the CPU on which those
> >> + * operations need to be performed. All preparation steps such as
> >> + * loading pointers, and applying offsets to arrays, need to be
> >> + * performed by user-space before invoking the system call. The
> >
> > loading pointers and applying offsets? That makes no sense.
>
> Updating to:
>
> * All preparation steps such as
> * loading base pointers, and adding offsets derived from the current
> * CPU number, need to be performed by user-space before invoking the
> * system call.

This still does not explain anything, really.

Which base pointer is loaded? I nowhere see a reference to a base
pointer.

And what are the offsets about?

derived from current cpu number? What is current CPU number? The one on
which the task executes now or the one which it should execute on?

I assume what you want to say is:

All pointers in the ops must have been set up to point to the per CPU
memory of the CPU on which the operations should be executed.

At least that's what I oracle into that.
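
IOW, in user space the preparation boils down to something like this
(purely illustrative, the data structure and names are made up, the op
vector filling and the actual syscall invocation are elided):

#define _GNU_SOURCE
#include <sched.h>

/* One slot per possible CPU, padded to avoid false sharing */
struct percpu_slot {
	long val;
} __attribute__((aligned(128)));

static struct percpu_slot slots[256];

static void prepare_ops(void)
{
	/*
	 * All preparation happens in user space: pick the target CPU and
	 * derive the pointer into that CPU's slot.  Every pointer stored
	 * into the op vector must already point into slots[cpu]; the
	 * kernel only gets the finished vector plus the cpu number.
	 */
	int cpu = sched_getcpu();
	struct percpu_slot *p = &slots[cpu];

	/* fill the op vector with pointers derived from p, then invoke
	 * the cpu_opv syscall with 'cpu' as its cpu argument */
	(void)p;
}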

> >> + * "comparison" operation can be used to check that the data used in the
> >> + * preparation step did not change between preparation of system call
> >> + * inputs and operation execution within the preempt-off critical
> >> + * section.
> >> + *
> >> + * The reason why we require all pointer offsets to be calculated by
> >> + * user-space beforehand is because we need to use get_user_pages_fast()
> >> + * to first pin all pages touched by each operation. This takes care of
> >
> > That doesn't explain it either.
>
> What kind of explanation are you looking for here? Perhaps being too close
> to the implementation prevents me from understanding what is unclear from
> your perspective.

What the heck are pointer offsets?

The ops have one or two pointer(s) to a lump of memory. So if a pointer
points to the wrong lump of memory then you're screwed, but that's true for
all pointers handed to the kernel.
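
And pinning works on whatever pointer values user space handed in anyway,
roughly (sketch, assuming the 4.15-era get_user_pages_fast() signature,
nothing from the actual patch):

#include <linux/mm.h>

/* Sketch: pin the pages backing one operand of 'len' bytes (len > 0) */
static int pin_operand(unsigned long addr, unsigned long len, int write,
		       struct page **pages)
{
	int nr_pages = ((addr + len - 1) >> PAGE_SHIFT) -
		       (addr >> PAGE_SHIFT) + 1;
	int ret;

	ret = get_user_pages_fast(addr, nr_pages, write, pages);
	if (ret < 0)
		return ret;
	if (ret != nr_pages) {
		/* partial pin: drop what we got and report a fault */
		while (ret--)
			put_page(pages[ret]);
		return -EFAULT;
	}
	return 0;
}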

> Sorry, that paragraph was unclear. Updated:
>
> * An overall maximum of 4216 bytes is enforced on the sum of operation
> * lengths within an operation vector, so user-space cannot generate a
> * too long preempt-off critical section (cache cold critical section
> * duration measured as 4.7 µs on x86-64). Each operation is also limited
> * to a length of PAGE_SIZE bytes,

Again PAGE_SIZE is the wrong unit here. PAGE_SIZE can vary. What you want
is a hard limit of 4K. And because there is no alignment requirement the
rest of the sentence is stating the obvious.

> * meaning that an operation can touch a
> * maximum of 4 pages (memcpy: 2 pages for source, 2 pages for
> * destination if addresses are not aligned on page boundaries).
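
For the record, the 4 pages figure itself is plain arithmetic: with 4K
pages, a 4096 byte source starting at page_base + 16 covers
[page_base + 16, page_base + 4112), i.e. 2 pages, and the same holds for
the destination, so 2 + 2 = 4 in the worst case.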

I still have to understand why the 4K copy is necessary in the first place.

> > What's the critical section duration for operations which go to the limits
> > of this on a average x86 64 machine?
>
> When cache-cold, I measure 4.7 µs per critical section doing a
> 4k memcpy and 15 * 8 bytes memcpy on an E5-2630 v3 @2.4GHz. Is it an
> acceptable preempt-off latency for RT ?

Depends on the use case as always ....

Thanks,

tglx