Re: [PATCH -tip 3/3] x86/percpu: *NOT FOR MERGE* Implement arch_raw_cpu_ptr() with RDGSBASE

From: Linus Torvalds
Date: Mon Oct 16 2023 - 15:55:01 EST


On Mon, 16 Oct 2023 at 12:29, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> > Are we certain that ucode on modern x86 CPUs check CR4 for every affected
> > instruction?
>
> Not certain at all. I agree the CR4.FSGSBASE thing could be a complete non-issue
> and was just me speculating.

Note that my timings on two fairly different arches do put the cost of
'rdgsbase' at 2 cycles, so it's not microcoded in the sense of jumping
off to some microcode sequence that has a noticeable overhead.

So it's almost certainly what Intel calls a "complex decoder" case
that generates up to 4 uops inline and only decodes in the first
decode slot.

One of the uops could easily be a CR4 check; that's not an uncommon
thing for those kinds of instructions.
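
(For reference, the pattern under discussion in the patch is basically
"get the percpu base out of GSBASE instead of loading it from memory" -
something along these lines, as a rough illustrative sketch rather than
the real kernel code:

/*
 * Sketch of the idea only - the real arch_raw_cpu_ptr() has more
 * plumbing around it.  The GS base holds the per-cpu offset, so one
 * rdgsbase replaces the memory load of this_cpu_off.
 */
static inline void *rdgsbase_cpu_ptr(const void *ptr)
{
        unsigned long base;

        asm ("rdgsbase %0" : "=r" (base));
        return (void *)(base + (unsigned long)ptr);
}

so the interesting question is purely how expensive that one rdgsbase
is compared to the load it replaces.)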

If somebody wants to try my truly atrocious test program on other
machines, go right ahead. It's attached. I'm not proud of it. It's a
hack.
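
One caveat if you do: user-space rdgsbase only works if the kernel has
enabled CR4.FSGSBASE (kernels from around 5.9 with FSGSBASE support),
otherwise the last test just dies with SIGILL. If you want to check
first, something like this should do it (a stand-alone sketch, assuming
x86-64 Linux and glibc's getauxval()):

#include <stdio.h>
#include <sys/auxv.h>                   /* getauxval(), AT_HWCAP2 */

#ifndef HWCAP2_FSGSBASE
#define HWCAP2_FSGSBASE (1UL << 1)      /* bit from <asm/hwcap2.h> */
#endif

int main(void)
{
        /* The kernel sets this bit only when it runs with
         * CR4.FSGSBASE=1, i.e. ring 3 may use rd/wrgsbase directly. */
        if (getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE)
                printf("rdgsbase is usable from user space\n");
        else
                printf("no FSGSBASE - the rdgsbase test will fault\n");
        return 0;
}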

Do something like this:

$ gcc -O2 t.c
$ ./a.out
"nop"=0l: 0.380925
"nop"=0l: 0.380640
"nop"=0l: 0.380373
"mov %1,%0":"=r"(base):"m"(zero)=0l: 0.787984
"rdgsbase %0":"=r"(base)=0l: 2.626625

and you'll see that a no-op takes about a third of a cycle on my Zen 2
core (according to this truly stupid benchmark), plus some small
overhead.

And a "mov memory to register" shows up as ~3/4 cycle, but it's really
probably that the core can do two of them per cycle, and then the
chain of adds (see how that benchmark makes sure the result is "used")
adds some more overhead etc.

And the 'rdgsbase' is about two cycles, and is presumably fully
serialized, so all the loop overhead and the adding of results then
shows up as that extra 0.6 of a cycle on average.

But doing cycle estimations on OoO machines is "guess rough patterns"
territory, so take all the above with a big pinch of salt. And feel
free to test it on other cores than the ones I did (Intel Skylake and
AMD Zen 2). You might want to put your machine into "performance" mode
or similar to actually make it run at the highest frequency, to get
more repeatable numbers.
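
Something like this is usually enough on Linux (one way of several,
depending on the distro and cpufreq driver):

$ sudo cpupower frequency-set -g performance

or, without cpupower, poke the governor directly:

$ echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

and possibly disable turbo/boost too if you want really stable numbers.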

The Skylake core does better on the nops (I think Intel gets rid of
them earlier in the decode stages and they basically disappear in the
uop cache), and can do three loads per cycle. So rdgsbase looks
relatively slower on my Skylake at about 3 cycles per op, but looking
at an individual instruction like that is a fairly artificial exercise
anyway - in real life you don't execute these things back-to-back out
of the uop cache.
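
(If the numbers on your machine come out too noisy to be useful, one
optional tweak - not something the attached hack bothers with - is to
fence the timestamp reads so the TSC read can't get reordered into the
middle of the measured instructions, something like:

/* Variant of the attached rdtsc(): the lfence keeps the TSC read from
 * floating around in the out-of-order window, and the full 64-bit
 * value avoids any wraparound worries. */
static inline unsigned long rdtsc_fenced(void)
{
        unsigned int lo, hi;

        asm volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi) :: "memory");
        return ((unsigned long)hi << 32) | lo;
}

but for averages over 100M iterations it mostly washes out anyway.)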

Linus
#include <stdio.h>

#define NR 100000000

/*
 * Sixteen copies of the tested instruction per loop iteration, so the
 * loop overhead gets spread thin.  "asm volatile" keeps the instruction
 * from being optimized away, and "sum += base" consumes the output
 * register at the cost of one dependent add per instruction.
 */
#define LOOP(x) for (int i = 0; i < NR/16; i++) do { \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
        asm volatile(x); sum += base; \
} while (0)

/* Low 32 bits of the TSC - plenty for deltas this short. */
static inline unsigned int rdtsc(void)
{
        unsigned int a, d;
        asm volatile("rdtsc" : "=a" (a), "=d" (d) :: "memory");
        return a;
}

/* Time NR back-to-back executions of 'x' and print the average TSC
 * cycles per execution (loop and add overhead included). */
#define TEST(x) do { \
        unsigned int s = rdtsc(); \
        LOOP(x); \
        s = rdtsc() - s; \
        fprintf(stderr, " %s=%lu: %f\n", #x, sum, s / (double)NR); \
} while (0)

int main(int argc, char **argv)
{
        unsigned long base = 0, sum = 0;
        unsigned long zero = 0;

        TEST("nop");
        TEST("nop");
        TEST("nop");
        TEST("mov %1,%0":"=r"(base):"m"(zero));
        TEST("rdgsbase %0":"=r"(base));
        return 0;
}