Re: PG_zero

From: Andrea Arcangeli
Date: Tue Nov 02 2004 - 20:29:09 EST


On Tue, Nov 02, 2004 at 02:41:12PM -0800, Martin J. Bligh wrote:
> it's really one list. However, I'd like to stop the cold list stealing
> from the hot, so I think it's easier to *implement* it as two lists,
> because there's nothing in the standard list code to add a magic marker
> on one element in the middle (though maybe you can think of a trick way


Sure, if we want to stop cold allocations from getting hot pages, we need
two lists.
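
For illustration, here is a minimal sketch of what the two-list per-cpu
pageset could look like; the field names and watermark fields are
assumptions for the example, not the actual structures:

	#include <linux/list.h>

	/*
	 * Sketch only: one per-cpu pageset with two separate lists, so a
	 * cold allocation can never steal a page from the hot list.
	 */
	struct per_cpu_pages {
		int count;		/* pages currently on this list */
		int high;		/* drain back to the buddy above this */
		int batch;		/* chunk size for buddy transfers */
		struct list_head list;	/* the pages themselves */
	};

	struct per_cpu_pageset {
		struct per_cpu_pages hot;	/* likely cache-warm pages */
		struct per_cpu_pages cold;	/* pages freed with the cold hint */
	};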

> Yes, it's certainly faster to do that one operation - no dispute from me.
> However, the question is whether it's worth trading off against the
> cache-warmth of pages ... that's why we wrote it that way originally, to
> preserve that.

My argument is very simple: a cold page with random contents is worthless.

When somebody allocates a free page, they will either:

1) write to it with the CPU to put some useful data into it, so the page
has a value (not the case here, since the caller used __GFP_COLD and we
assume it was not using __GFP_COLD by mistake), or
2) use DMA to write something useful into the page.

Now after 2), unless it's readahead, the caller will also _read_ or modify
the contents of the page. And writing into a _cold_ page with the CPU
isn't very different from doing DMA into a hot page. At least this is my
theory; I'm not a hardware designer, but there should be some cache
snooping during the invalidates that shouldn't be very different from
the cache recycling. Note that we have no clue whether the cache is hot in
the other CPUs as well; this is all per-cpu stuff.

Probably the only way to know whether keeping the separate hot/cold lists
is worthwhile is to benchmark it.

I'm pretty sure __GFP_COLD makes a huge difference for the freeing of RAM,
but I doubt it does for the allocation of RAM. Everybody who allocates
memory is going to write to it and touch it with the CPU eventually.

Freed RAM, on the other hand, may remain cold forever, so to me __GFP_COLD
looks more like a freeing-side hint than an allocation-side one.

There's no such thing as allocating RAM and never touching it with the
CPU anytime soon.
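
Roughly, a sketch of how the hint would get consumed on both sides,
assuming the two-list pageset above (the function names are illustrative,
not the real allocator code):

	/* Allocation side: only a __GFP_COLD caller is happy with a cold
	 * page; everyone else gets a (hopefully) cache-warm one. */
	static struct page *pcp_alloc(struct per_cpu_pageset *pset,
				      unsigned int gfp_flags)
	{
		struct per_cpu_pages *pcp =
			(gfp_flags & __GFP_COLD) ? &pset->cold : &pset->hot;
		struct page *page;

		if (list_empty(&pcp->list))
			return NULL;	/* caller falls back to the buddy allocator */

		page = list_entry(pcp->list.next, struct page, lru);
		list_del(&page->lru);
		pcp->count--;
		return page;
	}

	/* Freeing side: this is where the hint matters most.  A page freed
	 * cold (e.g. by reclaim) can sit on the cold list indefinitely
	 * without wasting any cache. */
	static void pcp_free(struct per_cpu_pageset *pset,
			     struct page *page, int cold)
	{
		struct per_cpu_pages *pcp = cold ? &pset->cold : &pset->hot;

		list_add(&page->lru, &pcp->list);
		pcp->count++;
	}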

> Probably easier to make it a per-arch option to define, but yes, if one
> arch works better one way, and others the other, we can make it optional
> fairly easily in lots of ways. <insert traditional disagreement with
> "new" vs "old" connotations here ...>

yep ;).

> Well, remember that the lists only contain a very small amount of memory
> relative to the machine size anyway, so I'm not really worried about us
> triggering early OOMs or whatever.

I'm a bit worried about the many-CPU systems. Since I'm also not limiting
the per-cpu list size as a function of the total RAM of the machine, one
could have a configuration that doesn't even boot if it has too many CPUs,
probably thousands of them (I doubt any system like that exists, it's only
a theoretical issue). But the closer you get to the non-booting system,
the more likely allocations are to fail with OOM due to the reservation.

Order > 0 allocations especially will start failing much sooner, but for
those dropping the reservation won't help, so they're an unfixable
problem; it just asks that we use the per-cpu resources as efficiently as
possible.
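
To put a rough number on the worst case, a back-of-the-envelope
calculation (the CPU count, zone count and watermark below are made up for
the example, not real defaults):

	#include <stdio.h>

	/* Worst case pinned in the per-cpu lists: every list on every CPU
	 * in every zone sitting at its high watermark.  All values are
	 * assumptions for illustration. */
	int main(void)
	{
		unsigned long long ncpus = 1024;	/* assumed CPU count */
		unsigned long long zones = 3;		/* assumed zones */
		unsigned long long lists = 2;		/* hot + cold */
		unsigned long long high = 186;		/* assumed pages per list */
		unsigned long long page_size = 4096;

		unsigned long long bytes = ncpus * zones * lists * high * page_size;

		/* ~4464 MB with these numbers -- enough to matter on a
		 * machine that "only" has a few GB of RAM. */
		printf("worst-case per-cpu reservation: %llu MB\n", bytes >> 20);
		return 0;
	}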