RE: Cache Sloshing Slowing Apache in NT Benchmarks?

Johnson, Peter (Peter.Johnson@voicestream.com)
Thu, 1 Jul 1999 09:57:19 -0700


The Linux Scalability Project did a study on the performance of malloc. The
results looked quite good. You can check them here:
http://www.citi.umich.edu/projects/citi-netscape/reports/malloc.html

--PJ

Ewige Blumenkraft! HAIL ERIS.

-----Original Message-----
From: Peter William Lount [mailto:peter@smalltalk.org]
Sent: Wednesday, June 30, 1999 7:52 PM
To: linux-kernel@vger.rutgers.edu
Cc: Jay Thorne
Subject: Cache Sloshing Slowing Apache in NT Benchmarks?

Hi,

I think that this paper (see abstract below) might have some bearing on the
recent NT Internet Server vs. Linux Apache Web Server benchmark illusion.
The NT Internet Server software has been optimized for SMP in a very
specific way, and as a result gains a significant advantage over Apache.

I attended the International Symposium on Memory Management (ISMM '98) last
October, where someone from Microsoft presented a paper about some important
findings from the building of Microsoft's Internet server. They found that
their "memory allocator" didn't scale on SMP machines - it actually got
worse! That's right: the standard NT memory allocator doesn't scale. They
found that the reason their standard malloc doesn't scale is "cache
sloshing" between the SMP CPUs' caches when objects are born, live, and die
on different CPUs. So they set out to find or develop one that did.
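To make the effect concrete, here is a rough sketch (my own illustration,
not code from the paper) of the allocation pattern that triggers the
sloshing: blocks are malloc'd and written by one thread on one CPU and freed
by another thread on a second CPU, so both the blocks and the allocator's
free-list metadata keep migrating between the two CPUs' caches:

    /*
     * Sketch only: a producer/consumer pair sharing a standard malloc.
     * The synchronisation is deliberately crude; the point is the cache
     * traffic, not the hand-off mechanism.
     */
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define ITEMS 100000

    static char *slots[ITEMS];          /* hand-off buffer */
    static volatile int produced = 0;   /* illustration only, not a real fence */

    static void *producer(void *arg)    /* ends up on CPU A */
    {
        int i;
        for (i = 0; i < ITEMS; i++) {
            char *p = malloc(64);       /* block is now hot in CPU A's cache */
            memset(p, 0, 64);           /* object is "born" and "lives" here */
            slots[i] = p;
            produced = i + 1;
        }
        return NULL;
    }

    static void *consumer(void *arg)    /* ends up on CPU B */
    {
        int i;
        for (i = 0; i < ITEMS; i++) {
            while (produced <= i)       /* spin until the block is handed over */
                ;
            free(slots[i]);             /* block "dies" on CPU B: its cache line
                                           and the free-list metadata must
                                           migrate there */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

Every block, and the free list it returns to, ends up being touched by both
CPUs, which is exactly the bus traffic the paper is talking about.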

In the end they went with a memory allocator design that scales from 1 CPU
at a factor of 1 (of course) to 8 CPUs at a scaling factor of 7.5.
Unfortunately, it requires 60% more RAM to achieve this speedup, just
because of the memory allocator's design. 7.5 times faster with 8 CPUs
while using 60% more RAM - that's the trade-off of this allocator. This is
the allocator that they have put into their NT Internet Web Server product!

Recently it was reported that the reason Linux-Apache is slower than NT is
that some portion of Linux is not yet fully multi-threaded, or that it must
wait for a "lock". I suspect that the "cache sloshing" discussed in this
paper may be another significant, as-yet-undiscussed reason.

The paper is available from ACM's Digital Library if you search
(http://www.acm.org/dl/Search.html) for one of the authors, "Murali
Krishnan". I can send you a copy (in PDF format) for your reference.
Please don't publish it, so as to protect ACM's copyright.

All the best,

Peter William Lount
peter@smalltalk.org
http://www.smalltalk.org

Memory Allocation for Long-Running Server Applications
Per-Ake Larson and Murali Krishnan
Microsoft
palarson@microsoft.com, muralik@microsoft.com
ABSTRACT
Prior work on dynamic memory allocation has largely neglected long-running
server applications, for example, web servers and mail servers. Their
requirements differ from those of one-shot applications like compilers or
text editors. We investigated how to build an allocator that is not only
fast and memory efficient but also scales well on SMP machines. We found
that it is not sufficient to focus on reducing lock contention - higher
speedups require a reduction in cache misses and bus traffic.

We then designed and prototyped a new allocator, called LKmalloc, targeted
for both traditional applications and server applications. LKmalloc uses
several subheaps, each one with a separate set of free lists and memory
arena. A thread always allocates from the same subheap but can free a block
belonging to any subheap. A thread is assigned to a subheap by hashing on
its thread ID. We compared its performance with several other allocators on
a server-like, simulated workload and found that it indeed scales well, is
quite fast, but also uses memory more efficiently.

Server applications have different allocation patterns and different
requirements than traditional one-shot applications like compilers or text
editors. They are usually multithreaded and frequently run on large SMP
systems, which implies that allocators targeted for this class of
applications must be able to handle high levels of concurrency.

This paper describes our progress in developing a dynamic memory allocator
targeted both for traditional applications and server applications. In
addition to the traditional objectives of speed and efficient memory usage,
our design emphasizes scalability on SMP systems.

The rest of the paper is organized as follows. Section 3 sets the stage by
describing typical server applications, their workload, and how they are
architected. Section 4 summarizes our view of the requirements on dynamic
memory allocators for server applications. Section 5 provides a brief
summary of prior work in this area. The current design of our allocator is
described in section 6. Experimental results, using a simulated workload,
are reported in section 7. Section 8 summarizes our findings and offers
some conclusions.
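
To illustrate the subheap scheme the abstract describes, here is a minimal
sketch of my own - not LKmalloc's actual code. The subheap count, the
thread-ID hash, and the single fixed block size are all simplifications I
made up for brevity (the real allocator keeps a set of free lists per
subheap):

    #include <pthread.h>
    #include <stdlib.h>

    #define N_SUBHEAPS  16
    #define BLOCK_SIZE  64      /* one fixed size here; per-size-class free
                                   lists in a real allocator */

    struct block {
        struct block *next;
        int owner;              /* index of the subheap this block belongs to */
    };

    struct subheap {
        pthread_mutex_t lock;
        struct block *free_list;
    };

    static struct subheap subheaps[N_SUBHEAPS];

    void lk_init(void)
    {
        int i;
        for (i = 0; i < N_SUBHEAPS; i++)
            pthread_mutex_init(&subheaps[i].lock, NULL);
    }

    /* A thread is assigned to a subheap by hashing on its thread ID. */
    static int my_subheap(void)
    {
        return (int)((unsigned long)pthread_self() % N_SUBHEAPS);
    }

    void *lk_alloc(void)
    {
        int h = my_subheap();
        struct subheap *sh = &subheaps[h];
        struct block *b;

        pthread_mutex_lock(&sh->lock);
        b = sh->free_list;          /* a thread always allocates from its own subheap */
        if (b != NULL)
            sh->free_list = b->next;
        pthread_mutex_unlock(&sh->lock);

        if (b == NULL) {            /* subheap empty: get memory from the system */
            b = malloc(sizeof(*b) + BLOCK_SIZE);
            if (b == NULL)
                return NULL;
            b->owner = h;
        }
        return b + 1;               /* user data starts after the header */
    }

    void lk_free(void *p)
    {
        struct block *b = (struct block *)p - 1;
        struct subheap *sh = &subheaps[b->owner];  /* may be another thread's subheap */

        pthread_mutex_lock(&sh->lock);
        b->next = sh->free_list;    /* ...but the block always returns to its owner */
        sh->free_list = b;
        pthread_mutex_unlock(&sh->lock);
    }

Because each thread allocates only from its own subheap, the hot free-list
cache lines tend to stay on one CPU, which is how the design attacks the
sloshing problem rather than just the lock contention.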

p.s. How do I join the "linux-kernel" discussion email list?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
