WCHAN is wait_on_pag

Chuck Lever (cel@monkey.org)
Wed, 24 Feb 1999 11:08:48 -0500 (EST)


i've been running SPEC S-DET on our lab's 4-CPU Dell PE 6300/450 to see
how far i can push linux on a server like this. i've found some behavior
that i'd like some help pursuing. this happens to be on 2.2.2-ac1, but
i've seen this behavior since late 2.1.x.

SPEC S-DET consists of a script that emulates a software development
workload (contains lots of "cc", "ed" and "nroff"). the benchmark load is
created by concurrently running multiple copies of this script. the
output of each instance of the script is automatically checked against a
generic copy of the output when the benchmark is done so you can catch
problems (like "can't load shared library: out of file descriptors"). a
throughput value is calculated by taking the number of scripts that were
running, and dividing by the elapsed wall-clock time for all scripts to
complete.
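
to make the arithmetic concrete, here is a small sketch that reports both the elapsed time per script and its inverse as scripts per hour. the function name and the numbers are made up for illustration; they are not taken from the SPEC tools or from an actual run.

```python
def sdet_throughput(num_scripts, elapsed_seconds):
    """Return (seconds per script, scripts per hour) for one benchmark run."""
    if num_scripts <= 0 or elapsed_seconds <= 0:
        raise ValueError("need a positive script count and elapsed time")
    seconds_per_script = elapsed_seconds / num_scripts
    scripts_per_hour = num_scripts * 3600.0 / elapsed_seconds
    return seconds_per_script, scripts_per_hour


# hypothetical run: 64 concurrent scripts, all finished after 480 seconds
per_script, per_hour = sdet_throughput(64, 480.0)
print("seconds per script:", per_script)
print("scripts per hour:  ", per_hour)
```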

running this benchmark with up to 32 concurrent scripts works fine.
with 64 or more scripts, however, it works very well only for the first few
minutes after a reboot. then, once the buffer cache has filled all of free
memory, the benchmark slows down by almost a factor of three or
four. i'm not sure why the buffer cache grows so large, because the
benchmark uses only 6k of file system space for each script. it grows
more each time the benchmark runs -- the final state is 450M buffer, 35M
page cache, 3M free, on a 512M machine.
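
one way to track those numbers across runs is to sample /proc/meminfo. the sketch below assumes the "Name:  value kB" line format and the MemFree, Buffers and Cached field names; the sample text is built from the figures above for illustration and is not captured output.

```python
import re


def parse_meminfo(text):
    """Collect 'Name:  <number> kB' lines into a dict of kB values."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"^(\w+):\s+(\d+)\s*kB\s*$", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def summarize_mb(fields, names=("MemFree", "Buffers", "Cached")):
    """Return the selected fields converted from kB to whole megabytes."""
    return {name: fields[name] // 1024 for name in names if name in fields}


# illustrative sample only; on a live system use:
#   parse_meminfo(open("/proc/meminfo").read())
SAMPLE_MEMINFO = """\
MemFree:      3072 kB
Buffers:    460800 kB
Cached:      35840 kB
"""

print(summarize_mb(parse_meminfo(SAMPLE_MEMINFO)))
```

logging this once before and once after each benchmark run would show how much the buffer figure grows per run.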

watching "vmstat", i noticed that for long periods during the benchmark,
there are few or no runnable processes, because all of the processes
that should be runnable are blocked. the blocked count would remain very
high for minutes at a time. normally when running the benchmark, most of
the processes are runnable, and up to half of them can be blocked, but
usually no more than 10 to 15, and only for a few seconds.
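
a rough sketch for picking the stalled intervals out of saved vmstat output follows. it assumes the header line starts with the "r" and "b" columns (runnable and blocked counts); the sample rows and the thresholds are invented for illustration.

```python
def parse_vmstat(text):
    """Return a list of (runnable, blocked) pairs from vmstat output."""
    cols = None
    samples = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # the column-name header starts with "r" and contains "b"
        if tokens[0] == "r" and "b" in tokens:
            cols = tokens
            continue
        # data rows are all numeric and match the header width
        if cols and len(tokens) == len(cols) and all(t.isdigit() for t in tokens):
            row = dict(zip(cols, map(int, tokens)))
            samples.append((row["r"], row["b"]))
    return samples


def stalled(samples, min_blocked=20, max_runnable=4):
    """Keep samples where many processes are blocked and few are runnable."""
    return [(r, b) for r, b in samples if b >= min_blocked and r <= max_runnable]


# invented sample: one healthy interval, one stalled interval
SAMPLE_VMSTAT = """\
   procs                      memory    swap        io    system         cpu
 r  b  w  swpd  free  buff cache  si  so   bi   bo   in   cs  us  sy  id
40  9  0     0 3100 460800 35840   0   0   12  340  410  900  70  30   0
 1 58  0     0 3050 460900 35800   0   0  150 2100  300  450   5  10  85
"""

print(stalled(parse_vmstat(SAMPLE_VMSTAT)))
```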

"ps alx" showed that almost all of the blocked processes had a WCHAN of
wait_on_pag, and that the ones waiting on wait_on_pag were running cpio. i
straced all the cpio processes with the "-T" option to see which system call
might be holding them up. i found a few 22-second lstat() calls, but most of
the longest-running system calls were mkdir() and open(), taking from 0.25 to
1 second to complete.
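
for sifting through a pile of "strace -T" logs, something like the sketch below could tally time per system call. it relies on "-T" appending the elapsed time in angle brackets at the end of each line; the regular expression, the function name and the sample lines (paths included) are illustrative assumptions, not real trace output.

```python
import re
from collections import defaultdict

# optional pid prefix, syscall name, then the trailing <seconds> from -T
LINE_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*<(\d+\.\d+)>\s*$")


def syscall_times(text):
    """Map syscall name -> [call count, total seconds, longest call]."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip signals, unfinished calls, and other noise
        name, secs = m.group(1), float(m.group(2))
        entry = stats[name]
        entry[0] += 1
        entry[1] += secs
        entry[2] = max(entry[2], secs)
    return dict(stats)


# invented sample lines in the shape strace -T prints
SAMPLE_TRACE = """\
lstat("src/util", {st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0 <22.013400>
mkdir("work/obj", 0777) = 0 <0.250000>
open("work/obj/a.c", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4 <1.000000>
open("work/obj/b.c", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5 <0.500000>
write(4, "int main"..., 512) = 512 <0.000100>
"""

stats = syscall_times(SAMPLE_TRACE)
# list syscalls by their single longest call, slowest first
for name in sorted(stats, key=lambda n: stats[n][2], reverse=True):
    count, total, longest = stats[name]
    print(name, count, total, longest)
```

sorting by the longest single call surfaces outliers like the lstat() above, while the totals show where the time goes overall.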

now i'm stuck. how do i find out what page these cpio processes are
waiting on? does it even matter -- is this a red herring? and is there a
way i can determine what's in the buffer cache and why it grows so large?

- Chuck Lever

--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>
