Re: [PATCH -next v2] mm, proc: collect percpu free pages into the free pages

From: Kefeng Wang
Date: Fri Nov 24 2023 - 21:25:17 EST

On 2023/11/25 1:54, Dmytro Maluka wrote:
On Tue, Aug 23, 2022 at 03:37:52PM +0200, Michal Hocko wrote:
On Tue 23-08-22 20:46:43, Liu Shixin wrote:
On 2022/8/23 15:50, Michal Hocko wrote:
On Mon 22-08-22 14:12:07, Andrew Morton wrote:
On Mon, 22 Aug 2022 11:33:54 +0800 Liu Shixin <liushixin2@xxxxxxxxxx> wrote:

A page on a pcplist can be allocated, but it is not counted in free or
available memory, and for now the pcp count is only shown by show_mem().
Since commit d8a759b57035 ("mm, page_alloc: double zone's batchsize"),
there has been a significant decrease in the reported free memory: with a
large number of CPUs and zones, the number of pages on the percpu lists
can be very large, so it is better to let users see the pcp count.

Take a machine with 3 zones and 72 CPUs: before commit d8a759b57035, the
pcp lists could theoretically hold at most 162MB (3*72*768KB); after that
commit they can hold 324MB. In practice, 114MB has been observed in the
idle state after system startup (an increase of 80MB).
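
(For reference, show_mem() tallies these pages roughly as in the sketch
below. This is simplified from mm/page_alloc.c in recent kernels; the
exact field names vary across versions.)

/*
 * Simplified sketch of how show_mem()/show_free_areas() counts
 * pcp pages; names follow recent mm/page_alloc.c, details elided.
 */
unsigned long free_pcp = 0;
struct zone *zone;
int cpu;

for_each_populated_zone(zone)
	for_each_online_cpu(cpu)
		free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;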

Seems reasonable.
I asked about this in the previous incarnation of the patch but never
really received an answer [1]. Is this a _real_ problem? The absolute
amount of memory could be perceived as a lot, but is it really noticeable
relative to the overall memory on those systems?

Let me provide some other numbers, from the desktop side. On a low-end
chromebook with 4GB RAM and a dual-core CPU, after commit b92ca18e8ca5
("mm/page_alloc: disassociate the pcp->high from pcp->batch") the max
amount of PCP pages increased 56-fold: from 2.9MB (1.45MB per CPU) to
165MB (82.5MB per CPU).

On such a system, memory pressure conditions are not a rare occurrence,
so several dozen MB make a lot of difference.

And with the "mm: PCP high auto-tuning" series merged in v6.7, the pcp lists can grow even bigger than before.


(The reason it increased so much is that the PCP high limit now
corresponds to the low watermark, which is 165MB. And the low watermark,
in turn, is so high because of khugepaged, which bumps min_free_kbytes up
to 132MB regardless of the total amount of memory; with the default
settings the low watermark is min + min/4, i.e. about 1.25 * 132MB =
165MB.)

This may not be obvious when memory is sufficient. However, products
monitor memory in order to plan its use, and this change has caused
warnings.

Is it possible that the said monitor is oversensitive and looking at the
wrong numbers? Overall free memory doesn't really tell you much, TBH, and
MemAvailable is a very rough estimate as well.

In reality, what matters much more is whether the memory is readily
available when it is required, and neither MemFree nor MemAvailable gives
you that information in the general case.
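
(For context, si_mem_available() in mm/page_alloc.c computes MemAvailable
as a heuristic roughly along the lines of the sketch below, simplified
here: an estimate of what could be allocated without swapping, not a
guarantee.)

/*
 * Simplified sketch of si_mem_available(); variable declarations
 * and the LRU/zone walks are elided.
 */
available  = free_pages - totalreserve_pages;
/* assume about half of the file page cache can be dropped */
pagecache  = active_file_pages + inactive_file_pages;
available += pagecache - min(pagecache / 2, wmark_low);
/* likewise for reclaimable slab and similar kernel memory */
available += reclaimable - min(reclaimable / 2, wmark_low);
/* the result is clamped at 0 */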

We also considered using /proc/zoneinfo to calculate the total number of
pages on the pcplists. However, we think it is more appropriate to add
that total to the free and available pages; after all, these are free
pages too.
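
(That calculation can be done from userspace by summing the per-cpu
"count:" fields of the pageset blocks in /proc/zoneinfo; a minimal
sketch:)

#include <stdio.h>
#include <unistd.h>

/*
 * Sum the per-cpu pageset "count:" fields from /proc/zoneinfo to
 * estimate how many free pages are sitting on the pcp lists.
 */
int main(void)
{
	FILE *f = fopen("/proc/zoneinfo", "r");
	char line[256];
	long pages = 0, n;

	if (!f) {
		perror("/proc/zoneinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		const char *p = line;

		while (*p == ' ' || *p == '\t')
			p++;
		if (sscanf(p, "count: %ld", &n) == 1)
			pages += n;
	}
	fclose(f);
	printf("pcp free: %ld pages (%ld kB)\n",
	       pages, pages * (sysconf(_SC_PAGESIZE) / 1024));
	return 0;
}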

Those free pages are not generally available, as explained. They are
available to a specific CPU, and they are drained under memory pressure
and other events, but there is still no guarantee that a specific process
can harvest that memory, because the pcp caches are replenished all the
time. So in a sense it is semi-hidden memory.

I was intuitively assuming that per-CPU pages should always be available
for allocation without resorting to paging out allocated pages (and thus
that it should be uncontroversially a good idea to include per-CPU pages
in MemFree, to make it more accurate).

But looking at the code in __alloc_pages() and around it, I see you are
right: we don't try to drain other CPUs' PCP lists *before* resorting to
direct reclaim, compaction, etc.

BTW, why not? Shouldn't draining PCP lists be cheaper than pageout() in
any case?

Same question here: could we drain the pcp lists before direct reclaim?
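
(For context: in current mm/page_alloc.c the drain happens only after
direct reclaim, roughly as in this simplified sketch of
__alloc_pages_direct_reclaim(); highatomic-reserve handling and other
details are elided.)

static struct page *
__alloc_pages_direct_reclaim(/* gfp, order, alloc context, ... */)
{
	struct page *page;
	bool drained = false;

	/* Direct reclaim runs first... */
	*did_some_progress = __perform_reclaim(gfp_mask, order, ac);
retry:
	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
	/*
	 * ...and only if the retry still fails are the per-cpu
	 * lists drained before one more attempt.
	 */
	if (!page && !drained) {
		drain_all_pages(NULL);
		drained = true;
		goto retry;
	}
	return page;
}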


That being said, I am still not convinced this is actually going to help
all that much. You will see slightly different numbers which do not tell
you much one way or the other, and if the sole reason for tweaking these
numbers is that some monitor is complaining because X became X-epsilon,
then this sounds like a weak justification to me. That epsilon happens
all the time, because there are quite a few hidden caches that are
released under memory pressure. I am not sure it is maintainable to
account for each one of them and pretend that MemFree/MemAvailable is
somehow precise. It has never been and likely never will be.
--
Michal Hocko
SUSE Labs