RE: [GIT PULL] mm: frontswap (for 3.2 window)

From: Dan Magenheimer
Date: Sun Oct 30 2011 - 19:20:02 EST


> From: Johannes Weiner [mailto:jweiner@xxxxxxxxxx]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)

Hi Johannes --

Thanks for taking the time for some real technical discussion (below).

> On Fri, Oct 28, 2011 at 10:07:12AM -0700, Dan Magenheimer wrote:
> >
> > > From: Johannes Weiner [mailto:jweiner@xxxxxxxxxx]
> > > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
> > >
> > > On Fri, Oct 28, 2011 at 06:36:03PM +0300, Pekka Enberg wrote:
> > > > On Fri, Oct 28, 2011 at 6:21 PM, Dan Magenheimer
> > > > <dan.magenheimer@xxxxxxxxxx> wrote:
> > > > Looking at your patches, there's no trace that anyone outside your own
> > > > development team even looked at the patches. Why do you feel that it's
> > > > OK to ask Linus to pull them?
> > >
> > > People did look at it.
> > >
> > > In my case, the handwavy benefits did not convince me. The handwavy
> > > 'this is useful' from just more people of the same company does not
> > > help, either.
> > >
> > > I want to see a usecase that tangibly gains from this, not just more
> > > marketing material. Then we can talk about boring infrastructure and
> > > adding hooks to the VM.
> > >
> > > Convincing the development community of the problem you are trying to
> > > solve is the undocumented part of the process you fail to follow.
> >
> > Hi Johannes --
> >
> > First, there are several companies and several unaffiliated kernel
> > developers contributing here, building on top of frontswap. I happen
> > to be spearheading it, and my company is backing me up. (It
> > might be more appropriate to note that much of the resistance comes
> > from people of your company... but please let's keep our open-source
> > developer hats on and have a technical discussion rather than one
> > which pleases our respective corporate overlords.)
>
> I didn't mean to start a mud fight about this, I only mentioned the
> part about your company because I already assume it sees value in tmem
> - it probably wouldn't fund its development otherwise. I just tend to
> not care too much about Acks from the same company as the patch itself
> and I believe other people do the same.

Oops, sorry for mudslinging if none was intended.

Although I understand your position about Acks from the same company,
doesn't that challenge the integrity of the individual's ack/review,
implying that they are not really reviewing the code with the same
intensity as if it came from another company? Especially with
something like tmem, maybe the review is just as valid, and people
from the same company have simply had more incentive to truly
understand the intent and potential of the functionality, as well as
the code itself? And maybe, on some patches, reviewers who ARE from
different companies are "good buddies" who watch each other's backs,
so those reviews are not really complete?

So perhaps this default assumption about code review is flawed?

> > Second, have you read http://lwn.net/Articles/454795/ ?
> > If not, please do. If yes, please explain what you don't
> > see as convincing or tangible or documented. All of this
> > exists today as working publicly available code... it's
> > not marketing material.
>
> I remember answering this to you in private already some time ago when
> discussing frontswap.

Yes, reading ahead, all the questions sound familiar and I thought
they were all answered (albeit some offlist). I think the conversation
ended at that point, so I assumed any issues were resolved.

> You keep proposing a bridge and I keep asking for proof that this is
> not a bridge to nowhere. Unless that question is answered, I am not
> interested in discussing the bridge's design.
>
> According to the LWN article, there are the following backends:
>
> 1. Zcache: allow swapping into compressed memory
>
> This sets aside a portion of memory which the kernel will swap
> compressed pages into upon pressure. Now, obviously, reserving memory
> from the system for this increases the pressure in the first place,
> eating away on what space we have for anonymous memory and page cache.
>
> Do you auto-size that region depending on workload?

Yes. A key value of the whole transcendent memory design
is that everything is done dynamically. That's one
reason that Nitin Gupta (author of zram) supports zcache.

> If so, how? If not, is it documented how to size it manually?

See above. There are some zcache policy parameters that can be
tuned manually (currently through sysfs), so the defaults can be
adjusted as necessary over time.

> Where are the performance numbers for various workloads, including
> both those that benefit from every bit of page cache and those that
> would fit into memory without zcache occupying space?

I have already agreed that more zcache measurement is warranted
(though I maintain it will get far more measurement once merged
than it ever will unmerged). So I can only answer theoretically,
though I would appreciate your comments if you disagree.

Space used for page cache is almost always opportunistic; it is
a "guess" that the page will be needed again in the future.
Frontswap only stores pages that MUST otherwise be swapped.
Swapping occurs only if the clean list is empty (or if the
MM system is too slow to respond to changes in workload).
In fact some of the pages-to-be-swapped that end up in
frontswap can be dirty page cache pages.

All of this is handled dynamically. The kernel is still deciding
which pages to keep and which to reclaim and which to swap.
The hooks simply grab pages as they are going by. That's
why the frontswap patch can be so simple and can have many "users"
built on top of it.
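
To make the shape of the hook concrete, here is a rough sketch of
the store-side hook in the swap-out path. This is illustrative only:
the names used here (frontswap_store, swap_writepage_bio) and the
error handling are approximations, not the literal patch text.

/*
 * Rough sketch only -- approximates the frontswap hook in
 * swap_writepage(); names and details are not the literal patch.
 */
int swap_writepage(struct page *page, struct writeback_control *wbc)
{
	if (frontswap_store(page) == 0) {
		/*
		 * A backend (zcache, Xen tmem, ...) accepted the page:
		 * report writeback complete and skip block I/O entirely.
		 */
		set_page_writeback(page);
		unlock_page(page);
		end_page_writeback(page);
		return 0;
	}
	/*
	 * Backend declined (full, policy, or no backend registered):
	 * fall through to the normal bio-based swap-out path.
	 */
	return swap_writepage_bio(page, wbc);
}

If the store fails, nothing has changed from the backend's point of
view and the page simply goes to the swap device exactly as it would
have without frontswap.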

> However, looking at the zcache code, it seems it wants to allocate
> storage pages only when already trying to swap out. Are you sure this
> works in reality?

Yes. I'd encourage you to try it. Of course, I'd be a fool if I
tried to guarantee that there are no bugs.

> 2. RAMster: allow swapping between machines in a cluster
>
> Are there people using it? It, too, sounds like a good idea but I
> don't see any proof it actually works as intended.

No. I've posted the code publicly, but it's still a godawful mess
and I'd be embarrassed if anyone looked at it. The code does work,
though, and I've got some ideas on how to make it more
upstreamable. If anybody seriously wants to work on it right now,
I could accommodate that, but I'd prefer some more time alone with
it first.

Conceptually, it's just a matter of moving pages to a different
machine instead of across a hypercall interface. All the "magic"
is in the frontswap and cleancache hooks. They run on both
machines, both dynamically managing space (and compressing it
too). The code uses ocfs2 for "cluster" discovery and is built
on top of a modified zcache.
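
If it helps to picture how little is RAMster-specific, here is a
hand-wavy sketch with made-up names (this is NOT the RAMster code):
the backend "put" decides whether a page stays in the local
compressed pool or gets shipped to a cluster peer.

/*
 * Hand-wavy sketch, hypothetical helper names -- NOT the RAMster
 * code.  The point is only that the frontswap/cleancache backend
 * interface is the same whether a page lands in local compressed
 * memory, in a peer machine's memory, or behind a hypercall.
 */
static int ramster_put_page(u32 pool_id, u64 index, struct page *page)
{
	/* try the local zcache-style compressed pool first */
	if (local_pool_has_room(pool_id))
		return local_compress_and_store(pool_id, index, page);

	/* no room locally: ship the page to a cluster peer */
	return remote_put(pick_peer_node(), pool_id, index, page);
}

static int ramster_get_page(u32 pool_id, u64 index, struct page *page)
{
	/* check the local pool, then ask whichever peer holds it */
	if (local_lookup(pool_id, index, page) == 0)
		return 0;
	return remote_get(lookup_peer_for(pool_id, index), pool_id,
			  index, page);
}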

> 3. Xen: allow guests to swap into the host.
>
> The article mentions that there is code to put the guests under
> pressure and let them swap to host memory when the pressure is too
> high. This sounds useful.
>
> Where is the code that controls the amount of pressure put on the
> guests?

See drivers/xen/xen-selfballoon.c, which was just merged in 3.1,
though versions of it have been floating around for 2+ years.
Note there's a bug fix pending that makes the pressure a little less
aggressive; I think it is/was submitted for the open 3.2 window.
(Note that the same file manipulates the number of pages in frontswap.)
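
In case it helps, the feedback loop is conceptually simple. The
following is a simplified sketch with made-up helper names, not a
copy of xen-selfballoon.c: every few seconds the guest derives a
memory target from its own committed memory plus some slack, then
nudges the balloon toward that target in bounded steps.

/*
 * Simplified sketch of the selfballooning idea -- made-up helper
 * names, not a copy of drivers/xen/xen-selfballoon.c.
 */
static void selfballoon_worker(struct work_struct *work)
{
	unsigned long cur = current_guest_pages();
	unsigned long goal = committed_pages() + slack_pages();

	if (goal < cur)
		/* shrink the guest gradually, never in one big step */
		goal = max(goal, cur - max_shrink_step());
	else
		/* likewise, grow back toward the goal gradually */
		goal = min(goal, cur + max_grow_step());

	balloon_set_new_target(goal);
	schedule_delayed_work(&selfballoon_dwork, interval_jiffies());
}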

> Where are the performance numbers? Surely you can construct a case
> where the initial machine sizes are not quite right and then collect
> data that demonstrates the machines are rebalancing as expected?

Yes, I can. It just works, and with the right tools running it's
even fun to watch. Some interesting performance numbers were
published at Xen Summit 2010. See the last few pages of:

http://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryXenSummit2010.pdf

The speaker's notes (so you can follow the presentation without the
video) are in the same directory.

> 4. kvm: same as Xen
>
> Apart from the questions that already apply to Xen, I remember KVM
> people in particular complaining about the synchronous single-page
> interface that results in a hypercall per swapped page. What happened
> to this concern?

I think we (me and the KVM people) agreed that the best way to determine
if this is a concern is to just measure it. Sasha and Neo are working on
a KVM implementation which should make this possible (but neither wants
to invest a lot of time if frontswap isn't merged or has a clear path
to merging).

So, again, theoretically, and please argue if you disagree...
(and yes I know real measurements are better, but I think we all
know how easy it is to manipulate benchmarks so IMHO a
theoretical understanding is useful too).

What is the cost of a KVM hypercall (vmexit/vmenter) vs the cost of
swapping a page? Clearly, reading/writing a disk is a very slow
operation, but it has very little CPU overhead (though preparing a
page to be swapped via blkio is NOT exactly cheap either). And if
you are swapping, it is almost never the case that the CPU is busy,
especially on a multicore system.

I expect that on old, slow processors (e.g. first-generation
single-core VT-x parts) this might sometimes be measurable, but
rarely an issue. On modern processors, I don't expect it to be
significant.
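
Just to put rough numbers on that intuition -- and these are
order-of-magnitude guesses for illustration, not measurements --
the arithmetic looks something like this:

/* Back-of-envelope only: every number below is a guess. */
#include <stdio.h>

int main(void)
{
	double vmexit_us    = 2.0;    /* guessed vmexit/vmenter round trip */
	double compress_us  = 20.0;   /* guessed compress+copy of one 4K page */
	double disk_swap_us = 5000.0; /* guessed latency of one swap I/O */

	printf("hypercall share of a tmem page op: %.0f%%\n",
	       100.0 * vmexit_us / (vmexit_us + compress_us));
	printf("tmem page op vs. waiting on disk: %.0fx faster\n",
	       disk_swap_us / (vmexit_us + compress_us));
	return 0;
}

Even if those guesses are off by several-fold in either direction,
the hypercall is a small fraction of the CPU cost of each tmem
operation, and the operation as a whole is still a couple of orders
of magnitude cheaper than waiting on the disk.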

BTW, it occurs to me that this is now measurable on Xen too, since
Xen tmem now works for fully-virtualized guests. I don't have
the machines to reproduce the same experiment, but if you look at
the graphs in the Xen presentation, you can see that CPU utilization
goes up substantially, but throughput still improves. I am almost
positive that the CPU cost of compression/decompression plus the
cost of deduplication insert/fetch exceeds the cost of a vmexit/vmenter,
so the additional cost of vmexit/vmenter will at most increase
the CPU utilization. The real performance gain comes from avoiding
(waiting for) disk accesses.

> I would really appreciate if you could pick one of those backends and
> present them as a real and practical solution to real and practical
> problems. With documentation on configuration and performance data of
> real workloads. We can discuss implementation details like how memory
> is exchanged between source and destination when we come to it.
>
> I am not asking for just more code that uses your interface, I want to
> know the real value for real people of the combination of all that
> stuff. With proof, not just explanations of how it's supposed to
> work.

Well, the Xen implementation is by far the most mature and the
Xen presentation above is reasonably conclusive though, as always,
more measurements of more workloads would be good.

Not to get back into the mudslinging, but certain people from certain
companies try to ignore or minimize the value of Xen, so I've been
trying to emphasize the other (non-Xen, non-virtualization) code.
Personally, I think the Xen use case is sufficient by itself as it
solves a problem nobody else has ever solved (or, more precisely,
that VMware attempted to solve but, as real VMware customers will
attest, did so very poorly).

To be a good Linux kernel citizen, I've encouraged my company to hold
off on widespread support for Xen tmem until all the parts are upstream
in Linux, so there isn't a wide existing body of "proof" data. And
releasing customer data from my employer requires an act of God. But
private emails to Linus about cleancache seemed to convince him that
there was enough justification for it. I thought frontswap was
simpler and would be the easy part, but I was clearly mistaken :-(
We are now proceeding fully with Xen tmem with both frontswap
and cleancache in the kernel.

> Until you can accept that, please include
>
> Nacked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>
> on all further stand-alone submissions of tmem core code and/or hooks
> in the VM. Thanks.

If you are willing to accept that Xen is a valid use case, I think
I have provided the proof you asked for (although I agree that more
data would be good, and I would be happy to take suggestions for
what data to provide). If not, I would call that a form of
mudslinging, but will add your Nack. Please let me know.

Dan