Re: [patch] tracing/mm: add page frame snapshot trace

From: Wu Fengguang
Date: Sat May 09 2009 - 08:24:12 EST


On Sat, May 09, 2009 at 07:05:13PM +0800, Ingo Molnar wrote:
>
> * Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
>
> > On Sat, May 09, 2009 at 06:01:37PM +0800, Ingo Molnar wrote:
> > >
> > > * Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> > >
> > > > 2) support concurrent object iterations
> > > > For example, a huge 1TB memory space can be split up into 10
> > > > segments which can be queried concurrently (with different options).
> > >
> > > this should already be possible. If you lseek the trigger file, that
> > > will be understood as an 'offset' by the patch, and then write a
> > > (decimal) value into the file, that will be the count.
> > >
> > > So it should already be possible to fork off nr_cpus helper threads,
> > > one bound to each CPU, each triggering trace output of a separate
> > > segment of the memory map - and each reading that CPU's
> > > trace_pipe_raw file to recover the data - all in parallel.
> >
> > How will this work out in general? More examples: when walking
> > pages by file/process, is it possible to divide the
> > files/processes into N sets and dump their pages concurrently?
> > When walking the (huge) inode lists of different superblocks, is
> > it possible to fork one thread for each superblock?
> >
> > In the above situations, they would demand concurrent instances
> > with different filename/pid/superblock options.
>
> the iterators are certainly more complex, and harder to parallelise,
> in those cases, i submit.

OK. I'm pushing the parallelism idea because 4+ cores are going to be
commonplace on desktops (not to mention servers). And I have a clear
use case for the parallelism: user space directed memory shrinking
before hibernation, where the user space tool scans all/most pages in
all/most files in all superblocks and then selectively calls
fadvise(DONTNEED).
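
To be concrete, the shrinking step itself is trivial. Something like
the untested sketch below would do, with posix_fadvise() being the
user space spelling of fadvise() and all of the interesting page
selection policy left out:

/* Untested sketch: drop the page cache of (a range of) one file.
 * A real tool would first look at the dumped page flags and only
 * drop cold/clean ranges; error handling is minimal. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int drop_range(const char *path, off_t offset, off_t len)
{
	int err, fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror(path);
		return -1;
	}
	/* len == 0 means "to the end of the file" */
	err = posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
	if (err)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
	close(fd);
	return err;
}

int main(int argc, char **argv)
{
	/* e.g. ./shrink /var/cache/some-file drops that file's cache */
	return argc > 1 ? drop_range(argv[1], 0, 0) : 1;
}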

In that case we want to work as fast as possible, in order not to slow
down hibernation. Parallelism definitely helps.
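
To make that concrete, here is roughly the kind of dumper I have in
mind, based on the lseek(offset) + write(count) trigger you describe
above. It's an untested sketch: the trigger file name is made up, the
per-cpu trace_pipe_raw paths are the usual ftrace ones, and the read
loop termination is hand-waved:

/* Untested sketch: dump the page frame map in parallel, one worker per
 * CPU.  Each worker binds itself to its CPU, seeks the (made-up)
 * trigger file to its segment's start pfn, writes the segment length
 * as a decimal count, and drains that CPU's trace_pipe_raw.
 * Error handling omitted for brevity. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define TRACE_DIR "/sys/kernel/debug/tracing"

static void dump_segment(int cpu, long start, long count)
{
	char buf[4096], path[256];
	int trigger, pipe_raw, out;
	cpu_set_t set;
	ssize_t n;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);

	/* lseek position = offset, decimal write = count */
	trigger = open(TRACE_DIR "/page_frames_trigger", O_WRONLY);
	lseek(trigger, start, SEEK_SET);
	snprintf(buf, sizeof(buf), "%ld", count);
	write(trigger, buf, strlen(buf));
	close(trigger);

	snprintf(path, sizeof(path),
		 TRACE_DIR "/per_cpu/cpu%d/trace_pipe_raw", cpu);
	pipe_raw = open(path, O_RDONLY | O_NONBLOCK);

	snprintf(path, sizeof(path), "pageframes.%d", cpu);
	out = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	/* A real tool would know how many records to expect; here we
	 * just stop once the buffer runs dry (read() returns <= 0). */
	while ((n = read(pipe_raw, buf, sizeof(buf))) > 0)
		write(out, buf, n);

	close(pipe_raw);
	close(out);
}

int main(void)
{
	long total = 4 * 1024 * 1024;	/* ~4M page frames, as on your box */
	int cpu, cpus = sysconf(_SC_NPROCESSORS_ONLN);
	long chunk = total / cpus;

	for (cpu = 0; cpu < cpus; cpu++)
		if (fork() == 0) {
			dump_segment(cpu, cpu * chunk, chunk);
			exit(0);
		}

	while (wait(NULL) > 0)
		;
	return 0;
}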

> But i like the page map example because it is (by far!) the largest
> collection of objects. Four million pages on a test-box i have.

Yes!

> So if the design is right and we do dumping on that extreme-end very
> well, we might not even care that much about parallelising dumping
> in other situations, even if there are thousands of tasks - it will
> just be even faster. And then we can keep the iterators and the APIs
> as simple as possible.

That offset trick won't work well for small files. When we have lots
of small files, the parallelism granularity should be whole files
instead of page chunks inside them. Maybe I'm stressing this too much.

> ( And even for tasks, which are perhaps the hardest to iterate, we
> can still do the /proc method of iterating up to the offset by
> counting. It wastes some time for each separate thread as it has
> to count up to its offset, but it still allows the dumping itself
> to be parallelised. Or we could dump blocks of the PID hash array.
> That distributes tasks well, and can be iterated very easily with
> low/zero contention. The result will come out unordered in any
> case. )

For task/file based page walking, the best parallelism unit may well
be the task/file itself, instead of page segments inside it.
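
The distribution itself can be as dumb as the untested sketch below;
dump_one_file() is just a stub for whatever per-file trigger the
tracer ends up providing:

/* Untested sketch: parallelise per file instead of per page range.
 * Worker w handles every NR_WORKERS-th path from the command line. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_WORKERS 4

static void dump_one_file(const char *path)
{
	/* Placeholder: a real tool would write "path" to the trigger
	 * file and drain the trace output, as in the sketch above. */
	printf("[%d] dumping %s\n", getpid(), path);
}

int main(int argc, char **argv)
{
	int w, i;

	for (w = 0; w < NR_WORKERS; w++)
		if (fork() == 0) {
			for (i = 1 + w; i < argc; i += NR_WORKERS)
				dump_one_file(argv[i]);
			exit(0);
		}

	while (wait(NULL) > 0)
		;
	return 0;
}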

And there is the sparse file problem: there will be large holes in the
address space of a file or a process (and even in physical memory!).

It would be good not to output any lines for the holes. Even better,
in the file/process case lots of pages will share the same flags,
count and mapcount. If we don't print their pfn, the output can be
compressed from per-page lines

	index flags count mapcount

to per-page-range summaries:

	index len flags count mapcount

For example, here is some output from my filecache tool. This trick
can reduce the output size by 10x!

# idx len state refcnt
0 1 RAMU___ 2
1 3 ___U___ 1
4 1 RAMU___ 2
5 57 R_MU___ 2
62 2 ___U___ 1
64 60 R_MU___ 2
124 6 ___U___ 1
130 1 R_MU___ 2
131 1 ___U___ 1
132 2 R_MU___ 2
134 1 ___U___ 1
135 2 R_MU___ 2
137 1 ___U___ 1
138 5 R_MU___ 2
143 1 ___U___ 1
144 2 R_MU___ 2
146 2 ___U___ 1
148 26 R_MU___ 2
174 3 ___U___ 1
177 54 R_MU___ 2
231 1 ___U___ 1
232 16 R_MU___ 2
248 2 ___U___ 1
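
The merging logic is trivial, wherever it ends up living. For
illustration, an untested user space filter that turns per-page lines
into the per-range lines above (field formats made up) could look
like:

/* Untested sketch: read "index flags count mapcount" per-page lines on
 * stdin, merge consecutive pages with identical flags/count/mapcount,
 * and print "index len flags count mapcount" per-range lines.  Holes
 * end a range automatically because the index stops being contiguous. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char flags[64], cur_flags[64];
	unsigned long index, start = 0, len = 0;
	int count, mapcount, cur_count = 0, cur_mapcount = 0;

	while (scanf("%lu %63s %d %d", &index, flags, &count, &mapcount) == 4) {
		if (len && index == start + len && count == cur_count &&
		    mapcount == cur_mapcount && !strcmp(flags, cur_flags)) {
			len++;				/* extend current range */
			continue;
		}
		if (len)				/* flush previous range */
			printf("%lu\t%lu\t%s\t%d\t%d\n",
			       start, len, cur_flags, cur_count, cur_mapcount);
		start = index;
		len = 1;
		strcpy(cur_flags, flags);
		cur_count = count;
		cur_mapcount = mapcount;
	}
	if (len)
		printf("%lu\t%lu\t%s\t%d\t%d\n",
		       start, len, cur_flags, cur_count, cur_mapcount);
	return 0;
}

Doing the same merge at dump time is what actually shrinks the trace
buffer traffic, of course.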

Another problem is that the holes are often really huge. If user space
walks the pages like this:

n=0
while true
do
	echo $n $((n + 10000)) > range-to-dump
	cat trace >> log
	n=$((n + 10000))
done

then the holes will still consume a lot of unnecessary context
switches. It would be better to work this way:

while true
do
	echo 10000 > amount-to-dump
	cat trace >> log
done

Is this possible?

Thanks,
Fengguang
