Re: random: Benchmarking fast_mix2

From: George Spelvin
Date: Thu Jun 12 2014 - 20:23:14 EST


> So I just tried your modified 32-bit mixing function where you moved the
> rotation to the middle step instead of the last step. With the
> usleep(), it doesn't make any difference:
>
> # schedtool -R -p 1 -e /tmp/fast_mix2_48
> fast_mix: 212 fast_mix2: 400 fast_mix3: 400
> fast_mix: 208 fast_mix2: 408 fast_mix3: 388
> fast_mix: 208 fast_mix2: 396 fast_mix3: 404
> fast_mix: 224 fast_mix2: 408 fast_mix3: 392
> fast_mix: 200 fast_mix2: 404 fast_mix3: 404
> fast_mix: 208 fast_mix2: 412 fast_mix3: 396
> fast_mix: 208 fast_mix2: 392 fast_mix3: 392
> fast_mix: 212 fast_mix2: 408 fast_mix3: 388
> fast_mix: 200 fast_mix2: 716 fast_mix3: 773
> fast_mix: 426 fast_mix2: 717 fast_mix3: 728

> And here is my testing using your 64-bit variant:
>
> # schedtool -R -p 1 -e /tmp/fast_mix2_49
> fast_mix: 294 fast_mix2: 476 fast_mix4: 442
> fast_mix: 286 fast_mix2: 1058 fast_mix4: 448
> fast_mix: 958 fast_mix2: 460 fast_mix4: 1002
> fast_mix: 940 fast_mix2: 1176 fast_mix4: 826
> fast_mix: 476 fast_mix2: 840 fast_mix4: 826
> fast_mix: 462 fast_mix2: 840 fast_mix4: 826
> fast_mix: 462 fast_mix2: 826 fast_mix4: 826
> fast_mix: 462 fast_mix2: 826 fast_mix4: 826
> fast_mix: 462 fast_mix2: 826 fast_mix4: 826
> fast_mix: 462 fast_mix2: 840 fast_mix4: 826

> The bottom line is that what we are primarily measuring here is all
> different cache effects. And these are going to be quite different on
> different microarchitectures.

So adding fast_mix4 doubled the time taken by fast_mix.
Yeah, that's trustworthy timing! :-)

Still, you do seem to observe a pretty consistent factor of about 2x
difference, which confuses me because I can't reproduce it.

But it's hard to reach definite conclusions with this much measurement noise.
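
One common way to cut down on this kind of noise is to time the same call many times and keep the minimum, on the theory that interrupts and cache misses only ever add time. Here is a minimal userspace sketch of that idea. It is not the harness used for the numbers in this thread; now_ns(), min_time_ns() and spin() are hypothetical names, and it uses clock_gettime() rather than a cycle counter for portability.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() under strict -std= modes */
#endif
#include <stdint.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Call fn(arg) `reps` times and return the smallest elapsed time seen.
 * The minimum is less sensitive to interrupts and cold caches than the
 * mean, but it hides exactly the cold-cache cost, so report both if
 * that cost is what you care about.
 */
static uint64_t min_time_ns(void (*fn)(void *), void *arg, int reps)
{
	uint64_t best = UINT64_MAX;
	int i;

	for (i = 0; i < reps; i++) {
		uint64_t t0 = now_ns();
		uint64_t dt;

		fn(arg);
		dt = now_ns() - t0;
		if (dt < best)
			best = dt;
	}
	return best;
}

/* Hypothetical workload: bump a counter 1000 times through a volatile pointer. */
static void spin(void *arg)
{
	volatile unsigned long *p = arg;
	int i;

	for (i = 0; i < 1000; i++)
		(*p)++;
}
```

Calling min_time_ns(spin, &counter, 100) would give the best of 100 runs. Whether the minimum is the right statistic here is debatable, since the interrupt path normally runs with cold caches.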

Another cache we might be hitting is the branch predictor. Could you try
unrolling fast_mix2 and fast_mix4 and see what difference that makes?
(I'd send you a patch but you could probably do it by hand faster than
applying one.)

It only makes a slight difference on my high-end Intel box, but almost
doubles the speed on the Phenom:

Rolled (64-bit core, 2 rounds):
fast_mix: 293 fast_mix2: 205
fast_mix: 257 fast_mix2: 162
fast_mix: 170 fast_mix2: 137
fast_mix: 283 fast_mix2: 218
fast_mix: 270 fast_mix2: 185
fast_mix: 288 fast_mix2: 199
fast_mix: 423 fast_mix2: 131
fast_mix: 286 fast_mix2: 218
fast_mix: 681 fast_mix2: 165
fast_mix: 268 fast_mix2: 190

Unrolled (64-bit core, 2 rounds):
fast_mix: 394 fast_mix2: 108
fast_mix: 145 fast_mix2: 80
fast_mix: 270 fast_mix2: 112
fast_mix: 145 fast_mix2: 81
fast_mix: 145 fast_mix2: 79
fast_mix: 662 fast_mix2: 107
fast_mix: 145 fast_mix2: 78
fast_mix: 140 fast_mix2: 127
fast_mix: 164 fast_mix2: 182
fast_mix: 205 fast_mix2: 79

Since the original fast_mix is already unrolled, a branch-prediction
penalty wouldn't hit it.
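
To be concrete about what "rolled" and "unrolled" mean here, the sketch below shows a two-round add-rotate-xor mixer written both ways. It is only an illustration: mix_rolled(), mix_unrolled() and the rotation constants are made up for this example and are not the fast_mix2 code under test. The rolled form has a loop branch and a table lookup per round; the unrolled form has neither.

```c
#include <stdint.h>

/* Rotate left; r must be in 1..31 so neither shift is by 32. */
static inline uint32_t rol32(uint32_t x, unsigned int r)
{
	return (x << r) | (x >> (32 - r));
}

/* Per-round rotation amounts (arbitrary values, for illustration only). */
static const unsigned char rot[2][2] = { { 7, 13 }, { 17, 22 } };

/* Rolled: one loop body, constants fetched from the table each round. */
static void mix_rolled(uint32_t pool[4], const uint32_t in[4])
{
	uint32_t a = pool[0] ^ in[0], b = pool[1] ^ in[1];
	uint32_t c = pool[2] ^ in[2], d = pool[3] ^ in[3];
	int i;

	for (i = 0; i < 2; i++) {
		a += b;			c += d;
		b = rol32(b, rot[i][0]);	d = rol32(d, rot[i][1]);
		d ^= a;			b ^= c;
	}
	pool[0] = a;  pool[1] = b;  pool[2] = c;  pool[3] = d;
}

/* Unrolled: same arithmetic, rounds written out with literal constants. */
static void mix_unrolled(uint32_t pool[4], const uint32_t in[4])
{
	uint32_t a = pool[0] ^ in[0], b = pool[1] ^ in[1];
	uint32_t c = pool[2] ^ in[2], d = pool[3] ^ in[3];

	/* round 0 */
	a += b;			c += d;
	b = rol32(b, 7);	d = rol32(d, 13);
	d ^= a;			b ^= c;

	/* round 1 */
	a += b;			c += d;
	b = rol32(b, 17);	d = rol32(d, 22);
	d ^= a;			b ^= c;

	pool[0] = a;  pool[1] = b;  pool[2] = c;  pool[3] = d;
}
```

Both functions compute the same result for the same input. Depending on optimization flags the compiler may unroll the first form on its own, so it is worth checking the generated assembly before attributing a timing difference to this.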

> That being said, I wouldn't be at all surprised if there are some
> CPUs where the extra memory dereference to the twist_table[] would
> definitely hurt, since Intel's amazing cache architecture(tm) is no
> doubt covering a lot of sins. I wouldn't be at all surprised if some
> of these new mixing functions would fare much better if we tried
> benchmarking them on an 32-bit ARM processor, for example....

Yes, Intel's D-caches are quite impressive.