Re: [RFC 0/4] Introduce unbalance proactive reclaim

From: Huan Yang
Date: Tue Nov 14 2023 - 21:12:18 EST



On 2023/11/14 21:03, Michal Hocko wrote:
On Tue 14-11-23 20:37:07, Huan Yang wrote:
On 2023/11/14 18:04, Michal Hocko wrote:
On Mon 13-11-23 09:54:55, Huan Yang wrote:
On 2023/11/10 20:32, Michal Hocko wrote:
On Fri 10-11-23 14:21:17, Huan Yang wrote:
[...]
BTW: how do you know the number of pages to be reclaimed proactively in
memcg proactive reclaiming based solution?
One point here is that we cannot know how long a frozen application
will stay frozen before it is opened again; it could be 10 minutes, an
hour, or even days. So we need to predict and try, gradually reclaiming
anonymous pages in proportion, preferably based on the LRU algorithm.
For example, if the application has been frozen for 10 minutes, reclaim
5% of its anonymous pages; after 30 minutes, 25%; after 1 hour, 75%;
after 1 day, 100%. It is even more complicated than that, as it requires
adding a mechanism for predicting failure penalties.
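To make the schedule above concrete, here is a minimal sketch of the frozen-time to reclaim-percentage mapping. The thresholds are the ones quoted in the example; the function and struct names are hypothetical, and treating the schedule as a step function (no interpolation between thresholds) is an assumption, not something the RFC specifies.

```c
#include <assert.h>
#include <stddef.h>

/* One step of the example schedule: once an application has been frozen
 * for at least min_frozen_secs, anon_percent of its anonymous memory is
 * the reclaim target. Names are illustrative, not from the RFC. */
struct reclaim_step {
    unsigned long min_frozen_secs;
    unsigned int anon_percent;
};

/* Thresholds taken from the example in the mail:
 * 10 min -> 5%, 30 min -> 25%, 1 hour -> 75%, 1 day -> 100%. */
static const struct reclaim_step reclaim_schedule[] = {
    { 10UL * 60,        5 },
    { 30UL * 60,       25 },
    { 60UL * 60,       75 },
    { 24UL * 60 * 60, 100 },
};

/* Return the percentage of anonymous memory to target for a given frozen
 * time. Assumption: a step function, 0% before the first threshold. */
unsigned int anon_reclaim_percent(unsigned long frozen_secs)
{
    unsigned int pct = 0;
    size_t n = sizeof(reclaim_schedule) / sizeof(reclaim_schedule[0]);

    for (size_t i = 0; i < n; i++) {
        if (frozen_secs >= reclaim_schedule[i].min_frozen_secs)
            pct = reclaim_schedule[i].anon_percent;
    }
    return pct;
}

/* Convert the percentage into a byte target for a given anon footprint.
 * The split into quotient and remainder avoids overflow on large sizes. */
unsigned long long anon_reclaim_target(unsigned long long anon_bytes,
                                       unsigned long frozen_secs)
{
    unsigned int pct = anon_reclaim_percent(frozen_secs);

    assert(pct <= 100);
    return anon_bytes / 100 * pct + anon_bytes % 100 * pct / 100;
}
```

Note that the sketch only encodes the time-based policy; it says nothing about actual memory demand.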
Why would you make your reclaiming decisions based on time rather than the
actual memory demand? I can see how a pro-active reclaim could make a
head room for an unexpected memory pressure but applying more pressure
just because of inactivity sounds rather dubious to me TBH. Why cannot
you simply wait for the external memory pressure (e.g. from kswapd) to
deal with that based on the demand?
Because the current kswapd and direct memory reclamation are passive,
watermark-driven reclamation, and when these reclamation paths are
triggered, the smoothness of the phone application cannot be guaranteed.
OK, so you are worried about latencies on spike memory usage.

(We often observe that when the above reclamation is triggered, there
is a delay in the application startup, usually accompanied by block
I/O, and some concurrency issues caused by lock design.)
Does that mean you do not have enough head room for kswapd to keep with
Yes, but if we set the high watermark a little higher, the power
consumption will be very high. We usually observe that kswapd will run
frequently. Even if we have set a low kswapd watermark, kswapd CPU
usage can still be high in some extreme scenarios (for example, when
starting a large application that needs to acquire a large amount of
memory in a short period of time). However, we will not discuss it in
detail here; the reasons are quite complex, and we do not yet have a
complete understanding of them.
This is definitely worth investigating further before resorting to
proposing a new interface. If the kswapd consumes CPU cycles
unproductively then we should look into why.
Yes, this is my current research objective.

If there is a big peak memory demand then that surely requires CPU
capacity for the memory reclaim. The work has to be done, whether that
is in kswapd or the pro-active reclaimer context. I can imagine the
latter one could be invoked with a better timing in mind but that is not
a trivial thing to do. There are examples where this could be driven by
a PSI feedback loop but from what you have mentioned earlier you are doing
an idle time based reclaim. Anyway, this is mostly a tuning related
discussion. I wanted to learn more about what you are trying to achieve
and so far it seems to me you are trying to work around some issues and
a) we would like to learn about those issues and b) a new interface is
unlikely to be a good fit to paper over a suboptimal behavior.
Our current research goal is to find a possible dynamic balance between the
time consumption of passive memory reclamation and the application death
caused by active process killing.

The current strategy is to use proactive memory reclamation to intervene in
this process. As mentioned earlier, by actively reclaiming anonymous pages
that are deemed safe to reclaim, we can increase the currently available memory,
avoid lag when starting new applications, and prevent the death of resident
applications.
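For reference, proactive reclaim on cgroup v2 is requested from user space by writing a byte count to the `memory.reclaim` file of the target cgroup. Below is a minimal sketch of such a write. The helper name is hypothetical, the caller supplies the path (no cgroup layout is assumed here), and the sketch does not model the LRU-type preference proposed in this RFC.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Ask the kernel to proactively reclaim `bytes` from the cgroup whose
 * memory.reclaim file is at `reclaim_path`.
 * Returns 0 on success or a negative errno value on failure. A failed
 * write on a real cgroup file can also mean that less than the requested
 * amount could be reclaimed. */
int request_proactive_reclaim(const char *reclaim_path,
                              unsigned long long bytes)
{
    char buf[32];
    int len, fd, rc;
    ssize_t n;

    assert(reclaim_path != NULL);

    /* The interface takes the size as plain text. */
    len = snprintf(buf, sizeof(buf), "%llu", bytes);
    if (len < 0 || (size_t)len >= sizeof(buf))
        return -EINVAL;

    fd = open(reclaim_path, O_WRONLY);
    if (fd < 0)
        return -errno;

    n = write(fd, buf, (size_t)len);
    if (n == len)
        rc = 0;
    else if (n < 0)
        rc = -errno;   /* captured before close() can clobber errno */
    else
        rc = -EIO;     /* short write */

    close(fd);
    return rc;
}
```

A caller would pass something like the `memory.reclaim` file of the frozen application's cgroup and retry or back off on error, depending on policy.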

Through the previous discussions, it seems that we have reached a consensus
that although the active memory reclamation interface can achieve this goal,
it is not the best approach. Using MADV can both reuse existing mechanisms to
achieve this goal and decide whether to reclaim based on the characteristics of
the anon vma, especially the anon_vma name that has been set.

Therefore, I will also push for internal research on this approach.

This would suggest that MADV_PAGEOUT is really what you are looking
for.
Yes, I agree, especially to avoid reclaiming shared anonymous pages.

However, I did some shallow research and found that MADV_PAGEOUT does
not reclaim pages with mapcount != 1. Our applications are usually
composed of multiple processes, and some anonymous pages are shared
among them. When the application is frozen, the memory that is only
shared among the processes within the application should be released,
but MADV_PAGEOUT seems not to be suitable for this scenario? (If I
misunderstood anything, please correct me.)
Hmm, OK it seems that we are hitting some terminology problems. The
discussion was about private memory so far (essentially MAP_PRIVATE)
now you are talking about a shared anonymous memory. That would imply
shmem and that is indeed not supported by MADV_PAGEOUT. The reason for
that is that this poses a security risk for time based attacks. I can
imagine, though, that we could extend the behavior to support shared
mappings if they do not cross a security boundary (e.g. mapped by the
same user). This would require some analysis though.
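As an illustration of the private-memory case discussed above, here is a minimal sketch that applies MADV_PAGEOUT to a MAP_PRIVATE anonymous mapping and then checks that the data is still readable. The helper names are hypothetical. MADV_PAGEOUT is only a hint: the call may fail on kernels that lack it, and pages may stay resident (for example when no swap is configured), so the sketch reports the madvise result separately rather than treating it as fatal.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21 /* Linux uapi value, for libc headers that lack it */
#endif

/* Hint the kernel to reclaim the given range.
 * Returns 0 on success or a negative errno value. */
int pageout_range(void *addr, size_t len)
{
    return madvise(addr, len, MADV_PAGEOUT) == 0 ? 0 : -errno;
}

/* Map private anonymous memory, fill it, ask for it to be paged out, and
 * verify the contents afterwards (touching the pages faults them back in
 * if they were reclaimed).
 * Returns 1 if the data is intact, 0 if not, -1 if mmap failed.
 * The madvise result is stored in *hint_rc when hint_rc is non-NULL. */
int pageout_roundtrip(size_t len, int *hint_rc)
{
    unsigned char *buf;
    int rc, intact = 1;

    assert(len > 0);

    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xA5, len);          /* make the pages dirty anon pages */

    rc = pageout_range(buf, len);
    if (hint_rc)
        *hint_rc = rc;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0xA5) {
            intact = 0;
            break;
        }
    }

    munmap(buf, len);
    return intact;
}
```

In the frozen-application scenario the reclaim would be driven from another process, which is outside the scope of this sketch.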
OK, thanks. I have communicated with our internal team and found out that
this part of the memory usage will not be particularly large.
In addition, I still have doubts that this approach will consume a lot
of strategy resources, but it is worth studying.
If you really aim at compressing a specific type of memory then
tweaking reclaim to achieve that sounds like a shortcut because
madvise based solution is more involved. But that is not a solid
justification for adding a new interface.
Yes, but this RFC is just adding an additional configuration option to
the proactive reclaim interface. And in the reclaim path, prioritize
processing these requests with reclaim tendencies. However, using
`unlikely` judgment should not have much impact.
Just adding an additional configuration option means a user interface
contract that needs to be maintained forever. Our future reclaim algorithm
might change (and in fact it has already changed quite a bit with MGLRU) and
an explicit request for LRU type specific reclaim might not even make any
sense. See that point?
Yes, I get it. This also means that if the reclaim algorithm changes, the current
implementation of tendencies will need to be modified accordingly, which carries
a maintenance cost.
If the current implementation of tendencies cannot prove its necessity, it should
be researched further first.
This solution may be simpler for me to achieve our internal goals, but it may not be
the best solution. So, MADV_PAGEOUT is worth researching.

This conversation was very beneficial for me.
Thank you all very much.

--
Thanks,
Huan Yang