Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

From: Linus Torvalds
Date: Tue Apr 28 2015 - 12:29:00 EST


On Tue, Apr 28, 2015 at 8:55 AM, Borislav Petkov <bp@xxxxxxxxx> wrote:
>
> Provided it is correct, it shows that the 0x66-prefixed 3-byte NOPs are
> better than the 0F 1F 00 suggested by the manual (Haha!):

That's which AMD CPU?

On my Intel i7-4770S, they are the same cost (I cut down your loop
numbers by an order of magnitude each because I couldn't be arsed to
wait for it, so it might be off by a cycle or two):

Running 60 times, 1000000 loops per run.
nop_0x90 average: 81.065681
nop_3_byte average: 80.230101

That said, I think your benchmark tests the speed of "rdtsc" rather
than the no-ops. Putting the read_tsc inside the inner loop basically
makes it swamp everything else.

> $ taskset -c 3 ./nops
> Running 600 times, 10000000 loops per run.
> nop_0x90 average: 439.805220
> nop_3_byte average: 442.412915

I think that's in the noise, and could be explained by random
alignment of the loop too, or even random factors like "the CPU heated
up, so the later run was slightly slower". The difference between 439
and 442 doesn't strike me as all that significant.

It might be better to *not* inline, and instead make a real function
call to something that has a lot of no-ops (do some preprocessor magic
to make more no-ops in one go). At least that way the alignment is
likely the same for the two cases.

Or if not that, then I think you're better off with something like

	p1 = read_tsc();
	for (i = 0; i < LOOPS; i++) {
		nop_0x90();
	}
	p2 = read_tsc();
	r = (p2 - p1);

because while you're now measuring the loop overhead too, that's
*much* smaller than the rdtsc overhead. So I get something like

Running 600 times, 1000000 loops per run.
nop_0x90 average: 3.786935
nop_3_byte average: 3.677228

and notice the difference between "~80 cycles" and "~3.7 cycles".
Yeah, that's rdtsc. I bet your 440 is about the same thing too.
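
The gap between those two figures can be illustrated with a toy cost
model. This is only a sketch, not a measurement: per_iter_cost() and its
parameters are hypothetical names, and the roughly 77-cycle timer
overhead used in the note below is just the approximate difference
between the two numbers quoted above, taken as an assumption.

```c
#include <assert.h>

/*
 * Toy model (an assumption, not a measurement): each pair of timestamp
 * reads costs a fixed timer_overhead in cycles, and that cost is spread
 * over the iters_per_sample loop iterations executed between the reads.
 */
double per_iter_cost(double work_cycles, double timer_overhead,
		     unsigned long iters_per_sample)
{
	assert(iters_per_sample > 0);
	return work_cycles + timer_overhead / (double)iters_per_sample;
}
```

With 3.7 cycles of work and 77 cycles of timer overhead,
per_iter_cost(3.7, 77.0, 1) comes out near 80.7, while
per_iter_cost(3.7, 77.0, 1000000) is within a rounding error of 3.7: the
timer cost all but disappears once it is paid once per million
iterations.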

Btw, averaging the cycle counts is not the right thing to do either.
You should probably take the *minimum* cycle count, not the average,
because anything non-minimal means "some perturbation" (i.e. an
interrupt etc).
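
To make the minimum-versus-average point concrete, here is a small
sketch. The helper names min_cycles() and mean_cycles() are hypothetical,
and the sample values in the note below are invented for illustration,
not taken from any real run.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long u64;

/* Smallest sample: the run that was disturbed the least. */
u64 min_cycles(const u64 *samples, size_t n)
{
	assert(n > 0);
	u64 min = samples[0];
	for (size_t i = 1; i < n; i++)
		if (samples[i] < min)
			min = samples[i];
	return min;
}

/* Arithmetic mean, for comparison: one outlier drags it upward. */
double mean_cycles(const u64 *samples, size_t n)
{
	assert(n > 0);
	double sum = 0.0;
	for (size_t i = 0; i < n; i++)
		sum += (double)samples[i];
	return sum / (double)n;
}
```

For the made-up samples {370, 371, 370, 9000, 372}, where one run was
hit by something like an interrupt, min_cycles() returns 370 while
mean_cycles() returns 2096.6, which says more about the outlier than
about the code being timed.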

So I think something like the attached would be better. It gives an
approximate "cycles per one three-byte nop" (the labels still say
"average", but the figures are derived from the minimum), and I get

[torvalds@i7 ~]$ taskset -c 3 ./a.out
Running 60 times, 1000000 loops per run.
nop_0x90 average: 0.200479
nop_3_byte average: 0.199694

which sounds suspiciously good to me (5 nops per cycle? uop cache and
nop compression, I guess).
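
The arithmetic behind that figure can be sketched as follows. The
function names are hypothetical; the divisors mirror the attached
program (LOOPS calls per run, 256 nops per call), and the total cycle
count in the note below is an illustrative round number, not a measured
one.

```c
#include <assert.h>

typedef unsigned long long u64;

/* Cycles per nop: best total cycle count spread over every nop executed. */
double cycles_per_nop(u64 min_total_cycles, u64 loops, unsigned nops_per_call)
{
	assert(loops > 0 && nops_per_call > 0);
	return (double)min_total_cycles / (double)loops / (double)nops_per_call;
}

/* Throughput is the reciprocal of the per-nop cost. */
double nops_per_cycle(double cycles_per_one_nop)
{
	assert(cycles_per_one_nop > 0.0);
	return 1.0 / cycles_per_one_nop;
}
```

A best run of 51200000 cycles over 1000000 loops of 256 nops works out
to about 0.2 cycles per nop, i.e. about 5 nops per cycle.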

Linus
/*
 * $ taskset -c 3 ./nops
 * Running 600 times, 10000000 loops per run.
 * nop_0x90 average: 439.805220
 * nop_3_byte average: 442.412915
 *
 * How to run:
 *
 * taskset -c <cpunum> argv0
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <stdlib.h>
#include <unistd.h>

typedef unsigned long long u64;

#define TWO(a) a; a;
#define FOUR(a) TWO(TWO(a))
#define SIXTEEN(a) FOUR(FOUR(a))
#define TWOFIVESIX(a) SIXTEEN(SIXTEEN(a))

#define DECLARE_ARGS(val, low, high) unsigned low, high
#define EAX_EDX_VAL(val, low, high) ((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high) "a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high) "=a" (low), "=d" (high)

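/*
 * Read the time-stamp counter: low half in EAX, high half in EDX.
 * The attribute is spelled out rather than relying on a libc-provided
 * __always_inline macro.
 */
static inline __attribute__((always_inline)) u64 rdtsc(void)
{
	DECLARE_ARGS(val, low, high);

	asm volatile("rdtsc" : EAX_EDX_RET(val, low, high));

	return EAX_EDX_VAL(val, low, high);
}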

/* Timestamp read with a fence before and after the RDTSC. */
static inline u64 read_tsc(void)
{
	u64 ret;

	asm volatile("mfence");
	ret = rdtsc();
	asm volatile("mfence");

	return ret;
}

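/* 0x66-prefixed NOP, three bytes (66 66 90), repeated 256 times. */
static void nop_0x90(void)
{
	TWOFIVESIX(asm volatile(".byte 0x66, 0x66, 0x90"))
}

/* NOP encoding 0F 1F 00, also three bytes, repeated 256 times. */
static void nop_3_byte(void)
{
	TWOFIVESIX(asm volatile(".byte 0x0f, 0x1f, 0x00"))
}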

int main(void)
{
	int i, j;
	u64 p1, p2;
	u64 r, min;

#define TIMES 60
#define LOOPS 1000000ULL

	/* LOOPS is unsigned long long, so print it with %llu. */
	printf("Running %d times, %llu loops per run.\n", TIMES, LOOPS);

	/* Start from the largest u64 so any real measurement replaces it. */
	min = ~0ULL;

	for (r = 0, j = 0; j < TIMES; j++) {
		/* Time the whole inner loop, not each call. */
		p1 = read_tsc();
		for (i = 0; i < LOOPS; i++) {
			nop_0x90();
		}
		p2 = read_tsc();
		r = (p2 - p1);

		/* Keep the fastest run. */
		if (r < min)
			min = r;
	}

	/* Labelled "average", but this is minimum cycles per nop (256 nops per call). */
	printf("nop_0x90 average: %f\n", min / (double) LOOPS / 256);

	min = ~0ULL;

	for (r = 0, j = 0; j < TIMES; j++) {
		p1 = read_tsc();
		for (i = 0; i < LOOPS; i++) {
			nop_3_byte();
		}
		p2 = read_tsc();

		r = (p2 - p1);
		if (r < min)
			min = r;
	}

	printf("nop_3_byte average: %f\n", min / (double) LOOPS / 256);

	return 0;
}