IIRC the x86 performance counters aren't dependent on anything
so they tend to execute much earlier than you want.
OTOH rdtsc is likely to be synchronising and affect what follows.
ISTR using rdtsc to wait for instructions to complete and then
the performance clock counter to see how long it took.