Re: long sleep_on_page delays writing to slow storage

From: Mel Gorman
Date: Fri Nov 18 2011 - 06:11:47 EST


On Thu, Nov 17, 2011 at 02:47:20PM -0500, Dave Jones wrote:
> On Tue, Nov 15, 2011 at 10:13:13AM +0000, Mel Gorman wrote:
>
> > If they are still experiencing major stalls, I have an experimental
> > script that may be able to capture stack traces of processes stalled
> > for more than 1 second. I've had some success with it locally so
> > maybe they could try it out to identify if it's THP or something else.
>
> I'm not sure if it's the same problem, but I'd be interested in trying
> that script.
>

Monitor script is attached as watch-dstate.pl. Run it as

watch-dstate.pl -o logfile

I'm also attaching a post-processing script stap-dstate-frequency.

cat logfile | stap-dstate-frequency

will report on unique stack traces, what got stuck in them and
for how long. Unfortunately, this does require a working systemtap
installation because it had to work on systems without ftrace. Usually
systemtap is a case of installing debugging symbols and its package
but milage varies.

I ran this for a few days on my own desktop but found that the worst
stalls for firefox and evolution were in futex_wait with the second
worst in

[<ffffffffa018e3c5>] ext4_sync_file+0x225/0x290 [ext4]
[<ffffffff81178250>] do_fsync+0x50/0x80
[<ffffffff8117852e>] sys_fdatasync+0xe/0x20
[<ffffffff81448592>] system_call_fastpath+0x16/0x1b

The stall timing is approximate at best. If you find the stall figures
are way too high or unrealistic, try running with --accurate-stall. The
stall figures will be much more accurate but depending on your
kernel version, the stack traces may be one line long ending with
kretprobe_trampoline.

> When I build a kernel on my laptop, when it gets to the final link stage,
> and there's a ton of IO, my entire X session wedges for a few seconds.
> This may be unrelated, because this is on an SSD, which shouldn't suffer
> from the slow IO of the USB devices mentioned in this thread.
>

I have a vague suspicion that there are some interactivity issues
around SSDs but I don't know why that is. I'm basing this on some
complaints of audio skipping with heavy kernel compiles on machines
very similar to my own other than mine uses rotary storage. It's on
the Christmas list to by myself a SSD to take a closer look.

> (This is even with that patch applied btw, perhaps adding further fuel to
> the idea that it's unrelated).
>

I suspect it's not compaction related in that case but the script may be
able to tell for sure. If it does not catch anything, alter this line to
have a smaller threshold

global stall_threshold = 1000

--
Mel Gorman
SUSE Labs

Attachment: watch-dstate.pl
Description: Perl program

#!/usr/bin/perl
# This script reads the output from the dstate monitor and reports how
# many unique stack traces there were and what the stall times were.
# The objective is to identify the worst sources of stalls.
#
# Copyright Mel Gorman 2011
use strict;

my $line;
my %unique_event_counts;
my %unique_event_stall;
my %unique_event_stall_details;
my $total_stalled;

my ($process, $stalltime, $function, $event);
my ($stall_details, $trace, $first_trace, $reading_trace, $skip);
while ($line = <>) {
# Watch for the beginning of a new event
if ($line =~ /^time ([0-9]*): ([0-9]*) \((.*)\) Stalled: ([-0-9]*) ms: (.*)/) {

# Skip uninteresting traces
if (!$skip) {
# Record the last event
$unique_event_counts{$trace}++;
$unique_event_stall_details{$trace} .= $event;
if ($stalltime >= 0) {
$unique_event_stall{$trace} += $stalltime;
$total_stalled += $stalltime;
}
}

# Start the next event
$event = sprintf "%-20s %-20s %6d ms\n", $3, $5, $4;
$reading_trace = 0;
$stalltime = $4;

if ($stalltime == 0) {
print "DEBUG $line";
}

$first_trace = "";
$trace = "";
}

# If we have reached a trace, blindly read it
if ($reading_trace) {
# Ignore traces that look like they are from user space
if ($line =~ /^\[<0/) {
$reading_trace = 0;
next;
}
$trace .= $line;
if ($first_trace eq "") {
$first_trace = $line;
$skip = 1;

# Skip uninteresting traces
if ($first_trace !~ / do_poll\+/ &&
$first_trace !~ / kthread\+/ &&
$first_trace !~ / khugepaged\+/ &&
$first_trace !~ / sys_epoll_wait\+/ &&
$first_trace !~ / kswapd\+/) {
$skip = 0;
}
}
next;
}

if ($line =~ /^\[<f/) {
$reading_trace = 1;
next;
}
}

print "Overall stalled time: $total_stalled ms\n\n";
foreach my $trace (sort {$unique_event_stall{$b} <=> $unique_event_stall{$a}} keys %unique_event_stall) {
#printf "Event $short_event us count %4d\n", $unique_event_counts{$event};
#print $unique_event_stall_details{$event};
printf "Time stalled in this event: %8d ms\n", $unique_event_stall{$trace};
printf "Event count: %8d\n", $unique_event_counts{$trace};
print $unique_event_stall_details{$trace};
print "$trace\n";
}

#print "\nDetails\n=======\n";
#foreach my $event (sort {$unique_event_stall{$b} <=> $unique_event_stall{$a}} keys %unique_event_stall) {
# print "Event $event us count $unique_event_counts{$event}\n";
# print "\n";
#}