Re: -mmX 4G patches feedback [numbers: how much performance impact]

From: Ingo Molnar
Date: Tue Apr 06 2004 - 14:26:40 EST



* Andrea Arcangeli <andrea@xxxxxxx> wrote:

> I will use the HINT to measure the slowdown on HZ=1000. It's an
> optimal benchmark simulating userspace load at various cache sizes and
> it's somewhat realistic.

here are the INT results from the HINT benchmark (best of 3 runs):

1000Hz, 3:1, PAE: 25513978.295333 net QUIPs
1000Hz, 4:4, PAE: 25515998.582834 net QUIPs

i.e. the two kernels are equal in performance. (the noise of the
benchmark was around ~0.5%, so this 0.01% win for 4:4 is a draw.) This
is not unexpected: the benchmark is too noisy to show the 0.22% maximum
possible 4:4 hit.

> Also note that the slowdown for app calling heavily syscalls is 30%
> not 5-10%, [...]

you are right that it's not 5-10%, it's more like 5-15%. It's not 30%,
except in the mentioned case of the heavily threaded MySQL benchmark,
and in microbenchmarks. (the microbenchmark case is understandable: 4:4
adds +3 usecs on PAE and +1 usec on non-PAE.)

i've just re-measured a couple of workloads that are very kernel and
syscall intensive, to get a feel for the worst-case:

apache tested via 'ab': 5% slowdown
dbench: 10% slowdown
tbench: 16% slowdown

these would be the ones where i'd expect to see the biggest slowdown:
they are dominated by kernel overhead and do a lot of small syscalls.
(all these tests fully saturated the CPU.)

you should also consider that while 4:4 does introduce extra TLB
flushes, it also removes the TLB flush at context-switch. So for
context-switch intensive workloads the 4:4 overhead will be smaller. (in
some rare and atypical cases it might even be a speedup - e.g. NFS
servers driven by knfsd.) This is why e.g. lat_ctx is 4.15 usecs with
3:1 and 4.85 usecs with 4:4, only a 16% slowdown - in contrast to
lat_syscall null, which is 0.7 usecs in the 3:1 case vs. 3.9 usecs in
the 4:4 case.

But judging by your present attitude i'm sure you'll be able to find
worse performing testcases and will use them as the typical slowdown
number to quote from that point on ;) Good luck in your search.

here's the 4:4 overhead for some other workloads:

kernel compilation (30% kernel overhead): 2% slowdown
pure userspace code: 0% slowdown

anyway, i can only repeat what i said last year in the announcement
email of the 4:4 feature:

the typical cost of 4G/4G on typical x86 servers is +3 usecs of
syscall latency (this is in addition to the ~1 usec null syscall
latency). Depending on the workload this can cause a typical
measurable wall-clock overhead from 0% to 30%, for typical
application workloads (DB workload, networking workload, etc.).
Isolated microbenchmarks can show a bigger slowdown as well - due to
the syscall latency increase.

so it's not like anyone is buying a cat in the bag here.

the cost of 4:4, just like the cost of any other kernel feature that
impacts performance (like e.g. PAE, highmem or swapping) should be
considered in light of the actual workload. 4:4 is definitely not an
'always good' feature - i never claimed it was. It is an enabler feature
for very large RAM systems, and it gives 3.98 GB of VM to userspace. It
is a slowdown for anything that doesn't need these features.

But for pure userspace code (which started this discussion), where
userspace overhead dominates by far, the cost is negligible even with
1000Hz.

Ingo