Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

From: Borislav Petkov
Date: Tue Apr 28 2015 - 12:58:32 EST

Next message: Chris Metcalf: "Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles"
Previous message: Linus Torvalds: "Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?"
In reply to: Linus Torvalds: "Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue"
Next in thread: Linus Torvalds: "Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue"

On Tue, Apr 28, 2015 at 09:28:52AM -0700, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 8:55 AM, Borislav Petkov <bp@xxxxxxxxx> wrote:
> >
> > Provided it is correct, it shows that the 0x66-prefixed 3-byte NOPs are
> > better than the 0F 1F 00 suggested by the manual (Haha!):
>
> That's which AMD CPU?

F16h.

> On my intel i7-4770S, they are the same cost (I cut down your loop
> numbers by an order of magnitude each because I couldn't be arsed to
> wait for it, so it might be off by a cycle or two):
>
> Running 60 times, 1000000 loops per run.
> nop_0x90 average: 81.065681
> nop_3_byte average: 80.230101
>
> That said, I think your benchmark tests the speed of "rdtsc" rather
> than the no-ops. Putting the read_tsc inside the inner loop basically
> makes it swamp everything else.

Whoops, now that you mention it... of course, that RDTSC *along* with
the barriers around it is much much more expensive than the NOPs.

> > $ taskset -c 3 ./nops
> > Running 600 times, 10000000 loops per run.
> > nop_0x90 average: 439.805220
> > nop_3_byte average: 442.412915
>
> I think that's in the noise, and could be explained by random
> alignment of the loop too, or even random factors like "the CPU heated
> up, so the later run was slightly slower". The difference between 439
> and 442 doesn't strike me as all that significant.
>
> It might be better to *not* inline, and instead make a real function
> call to something that has a lot of no-ops (do some preprocessor magic
> to make more no-ops in one go). At least that way the alignment is
> likely the same for the two cases.

malloc a page, populate it with NOPs, slap a RET at the end and jump to
it? Maybe even more than 1 page?
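
A minimal sketch of that idea, assuming Linux on x86-64. The names
fill_nops() and run_nop_page() are hypothetical, and the byte patterns
(0F 1F 00 for the manual's 3-byte NOP, 66 66 90 for the 0x66-prefixed
one) are the encodings I assume are meant above. The buffer is mapped
writable first and flipped to executable afterwards, so it also works
where W^X is enforced:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Fill buf with whole copies of one NOP encoding and terminate it with
 * RET (0xC3). Tail bytes too short for a whole pattern become 1-byte
 * NOPs (0x90). Returns the number of whole patterns written.
 */
size_t fill_nops(unsigned char *buf, size_t len,
                 const unsigned char *pat, size_t patlen)
{
    size_t n = 0, off = 0;

    assert(len >= 1 && patlen >= 1);
    while (off + patlen <= len - 1) {
        memcpy(buf + off, pat, patlen);
        off += patlen;
        n++;
    }
    memset(buf + off, 0x90, len - 1 - off);
    buf[len - 1] = 0xC3; /* RET */
    return n;
}

#if defined(__linux__) && defined(__x86_64__)
#include <sys/mman.h>

/*
 * Map len bytes, fill them with NOPs plus RET, make the mapping
 * executable and call it once. Returns 0 on success, -1 if mmap() or
 * mprotect() failed.
 */
int run_nop_page(size_t len, const unsigned char *pat, size_t patlen)
{
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED)
        return -1;
    fill_nops(buf, len, pat, patlen);
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, len);
        return -1;
    }
    ((void (*)(void))buf)(); /* runs through the NOPs, then hits RET */
    munmap(buf, len);
    return 0;
}
#endif
```

Calling run_nop_page(4096, pat, 3) once per pattern, with the timestamps
taken around the call, would give both NOP flavours the same alignment,
since the mapping is page-aligned in both cases.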

> Or if not that, then I think you're better off with something like
>
>         p1 = read_tsc();
>         for (i = 0; i < LOOPS; i++) {
>                 nop_0x90();
>         }
>         p2 = read_tsc();
>         r = (p2 - p1);
>
> because while you're now measuring the loop overhead too, that's
> *much* smaller than the rdtsc overhead. So I get something like

Yap, that looks better.
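
Spelled out as a self-contained sketch: the two timestamps bracket the
whole loop and the delta is divided by the loop count, so the timer cost
is paid once per run instead of once per iteration. To keep it portable,
this uses clock_gettime() as a stand-in for read_tsc(), which is an
assumption and not what the benchmark in this thread does. read_ns(),
time_per_call() and count_call() are hypothetical names:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for read_tsc(): monotonic time in nanoseconds. */
uint64_t read_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical body under test: bumps a volatile counter so the call
 * cannot be optimized away. */
volatile unsigned long sink_calls;

void count_call(void)
{
    sink_calls++;
}

/*
 * Time `loops` back-to-back calls of fn with ONE timestamp pair around
 * the whole loop, and return the amortized cost per call in nanoseconds.
 * Loop and call overhead are included in the result.
 */
double time_per_call(void (*fn)(void), unsigned long loops)
{
    uint64_t p1, p2;
    unsigned long i;

    assert(loops > 0);
    p1 = read_ns();
    for (i = 0; i < loops; i++)
        fn();
    p2 = read_ns();
    return (double)(p2 - p1) / (double)loops;
}
```

The indirect call adds its own overhead to every iteration, so this only
makes sense for comparing two bodies measured the same way, not for
absolute numbers.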

> Running 600 times, 1000000 loops per run.
> nop_0x90 average: 3.786935
> nop_3_byte average: 3.677228
>
> and notice the difference between "~80 cycles" and "~3.7 cycles".
> Yeah, that's rdtsc. I bet your 440 is about the same thing too.
>
> Btw, the whole thing about "averaging cycles" is not the right thing
> to do either. You should probably take the *minimum* cycles count, not
> the average, because anything non-minimal means "some perturbation"
> (ie interrupt etc).

My train of thought was: if you do a *lot* of runs, perturbations would
average out. But ok, noted.
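
To see why the minimum is the more robust statistic, here is a small
sketch with made-up sample values (not measurements from any machine):
a single perturbed run drags the mean far away from the typical cost,
while the minimum does not move. min_cycles() and mean_cycles() are
hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest sample: the run with the least perturbation. */
uint64_t min_cycles(const uint64_t *s, size_t n)
{
    uint64_t m;
    size_t i;

    assert(n > 0);
    m = s[0];
    for (i = 1; i < n; i++)
        if (s[i] < m)
            m = s[i];
    return m;
}

/* Arithmetic mean: every perturbed run contributes to it. */
double mean_cycles(const uint64_t *s, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        sum += (double)s[i];
    return sum / (double)n;
}
```

With the illustrative samples { 100, 101, 100, 5000, 100 }, where the
5000 stands for a run that took an interrupt, min_cycles() gives 100 and
mean_cycles() gives 1080.2. Perturbations only ever add cycles, so they
do not average out around the true cost; they bias the mean upwards.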

> So I think something like the attached would be better. It gives an
> approximate "cycles per one four-byte nop", and I get
>
> [torvalds@i7 ~]$ taskset -c 3 ./a.out
> Running 60 times, 1000000 loops per run.
> nop_0x90 average: 0.200479
> nop_3_byte average: 0.199694
>
> which sounds suspiciously good to me (5 nops per cycle? uop cache and
> nop compression, I guess).

Well, AFAIK, NOPs do require resources for tracking in the machine. I
was hoping that the hw would be smarter and discard them at decode time,
but there probably are reasons why that can't be done (...yet).

So they most likely get discarded at retire time, and I can't imagine
what an otherwise relatively idle core's ROB would look like with a
gazillion NOPs in it. Those things need hw traces. Maybe in another
life. :-)

$ taskset -c 3 ./t
Running 60 times, 1000000 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625

and those exact numbers are actually reproducible pretty reliably.

$ taskset -c 3 ./t
Running 60 times, 1000000 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625
$ taskset -c 3 ./t
Running 60 times, 1000000 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625
$ taskset -c 3 ./t
Running 60 times, 1000000 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625

Hmm, so what are we saying? That modern CPUs should all use one set of
NOPs and that's it...

Maybe we need to do more measurements...

Hmmm.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.