Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

From: Alexander Atanasov
Date: Tue Oct 11 2022 - 05:07:43 EST


Hello,

On 10.10.22 17:47, Nadav Amit wrote:
On Oct 10, 2022, at 12:24 AM, Alexander Atanasov <alexander.atanasov@xxxxxxxxxxxxx> wrote:

Hello,

On 10.10.22 9:18, Nadav Amit wrote:
On Oct 7, 2022, at 3:58 AM, Alexander Atanasov <alexander.atanasov@xxxxxxxxxxxxx> wrote:

[snip]

Side-note: That’s not the case for VMware balloon. I actually considered
calling adjust_managed_page_count() just to conform with other balloon
drivers. But since we use totalram_pages() to communicate to the hypervisor
the total-ram, this would create an endless (and wrong) feedback loop. I am not
claiming it is not possible for the VMware balloon driver to call
adjust_managed_page_count(), but the chances are that it would create more
harm than good.

Virtio does both, depending on the deflate-on-OOM option. I already suggested unifying all drivers to inflate the used memory, as that seems more logical to me since nobody expects totalram_pages() to change. But the current state is that both ways are accepted, and changing that could break existing users.
See discussion here https://lore.kernel.org/lkml/20220809095358.2203355-1-alexander.atanasov@xxxxxxxxxxxxx/.

Thanks for the reminder. I wish you could somehow summarize all of that into the
cover-letter and/or the commit messages for these patches.


I will put excerpts and relevant links in the next versions. I see that the more I dig into it the deeper it gets, so it needs more explanation.




Back to the matter at hand. It seems that you wish that the notifiers would
be called following any changes that would be reflected in totalram_pages().
So, doesn't it make more sense to call it from adjust_managed_page_count() ?

It will hurt performance: all drivers work page by page, i.e. they update by +1/-1, and they do so under locks, which as you already noted can lead to bad things. The notifier will accumulate the change and let its users know how much changed, so they can decide if they have to recalculate. It can even do so asynchronously in order not to disturb the drivers.

So updating the counters by 1 is ok (using atomic operation, which is not
free)? And the reason it is (relatively) cheap is because nobody actually
looks at the value (i.e., nobody actually acts on the value)?

If nobody considers the value, then doesn’t it make sense just to update it
less frequently, and then call the notifiers?

That's my point too.
The drivers update managed page count by 1.
My goal is when they are done to fire the notifier.

All drivers are similar and work like this:
HV sends request inflate up/down
driver up/down
    lock
    get_page()/put_page()
    optionally - adjust_managed_page_count(... +-1);
    unlock
update_core and notify_balloon_changed
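
For illustration, the flow above can be sketched in userspace C. This is only a model under stated assumptions: managed_pages stands in for the managed page count, the spinlock built on atomic_flag stands in for the driver lock, and balloon_handle_request() and the body of notify_balloon_changed() are hypothetical names and behavior, not code from the patch set.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins; none of this is taken from the patch set. */
static atomic_long managed_pages = 1000;             /* models the managed page count */
static atomic_flag balloon_lock = ATOMIC_FLAG_INIT;  /* models the driver lock */
static long last_notified_delta;                     /* what a subscriber last saw */
static int notify_calls;

/* Called once per request, outside the lock, with the accumulated change. */
static void notify_balloon_changed(long delta_pages)
{
	last_notified_delta = delta_pages;
	notify_calls++;
}

/* Handle one hypervisor request: npages > 0 inflates, npages < 0 deflates. */
static void balloon_handle_request(long npages)
{
	long step = npages > 0 ? -1 : 1;  /* inflating takes pages away from the guest */
	long todo = npages > 0 ? npages : -npages;
	long delta = 0;

	while (atomic_flag_test_and_set(&balloon_lock))
		;  /* spin until the lock is free */
	while (todo--) {
		/* a real driver would get_page()/put_page() here */
		atomic_fetch_add(&managed_pages, step);  /* per-page +-1 update */
		delta += step;
	}
	atomic_flag_clear(&balloon_lock);

	/* One notification per request instead of one per page. */
	if (delta)
		notify_balloon_changed(delta);
}
```

In this model a request for 100 pages performs 100 counter updates under the lock but results in a single call to the notifier with a delta of -100.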

The difference is here:

mm/zswap.c: return totalram_pages() * zswap_max_pool_percent / 100 <
mm/zswap.c: return totalram_pages() * zswap_accept_thr_percent / 100
This uses percentages, so it is easy to recalculate with:

+static inline unsigned long totalram_pages_current(void)
+{
+	unsigned long inflated = 0;
+#ifdef CONFIG_MEMORY_BALLOON
+	extern atomic_long_t mem_balloon_inflated_free_kb;
+	inflated = atomic_long_read(&mem_balloon_inflated_free_kb);
+	inflated >>= (PAGE_SHIFT - 10);
+#endif
+	return (unsigned long)atomic_long_read(&_totalram_pages) - inflated;
+}
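
As a worked example of the arithmetic in the helper above, here is a small userspace sketch. It assumes 4 KiB pages (PAGE_SHIFT of 12); kb_to_pages() and pool_limit_pages() are hypothetical names used only to show the kB-to-pages shift and a percentage limit in the style of the zswap checks.

```c
#define PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Convert kB to pages, mirroring the shift by (PAGE_SHIFT - 10). */
static unsigned long kb_to_pages(unsigned long kb)
{
	return kb >> (PAGE_SHIFT - 10);
}

/* Percentage limit computed from the total minus the inflated memory. */
static unsigned long pool_limit_pages(unsigned long total_pages,
				      unsigned long inflated_kb,
				      unsigned int percent)
{
	unsigned long usable = total_pages - kb_to_pages(inflated_kb);

	return usable * percent / 100;
}
```

With 1000000 total pages, 400000 kB inflated and a 20 percent limit, the inflated part is 100000 pages, so the limit is 900000 * 20 / 100 = 180000 pages rather than 200000.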

So we have here two values and it appears there is a hidden assumption that
they are both updated atomically. Otherwise, it appears, inflated
theoretically might be greater than _totalram_pages and we get a negative value
and all hell breaks loose.

But _totalram_pages and mem_balloon_inflated_free_kb are not updated
atomically together (each one is, but not together).
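
To make the concern concrete, here is a minimal userspace sketch. It is not from the patch set: total_pages and inflated_pages are stand-ins for the two counters, and the clamped variant is just one possible guard. Whether the inflated value can really exceed the total is disputed in the reply that follows.

```c
#include <stdatomic.h>

static atomic_long total_pages;     /* stand-in for _totalram_pages */
static atomic_long inflated_pages;  /* stand-in for the balloon counter, in pages */

/* Unchecked: if inflated > total, the unsigned subtraction wraps around. */
static unsigned long total_minus_inflated_unchecked(void)
{
	unsigned long total = (unsigned long)atomic_load(&total_pages);
	unsigned long inflated = (unsigned long)atomic_load(&inflated_pages);

	return total - inflated;
}

/* Clamped: each load is atomic, but the pair is not one snapshot, so guard it. */
static unsigned long total_minus_inflated_clamped(void)
{
	long total = atomic_load(&total_pages);
	long inflated = atomic_load(&inflated_pages);

	if (inflated >= total)
		return 0;
	return (unsigned long)(total - inflated);
}
```

Because the result type is unsigned long, the "negative value" would show up as a very large positive number in the unchecked variant.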


I do not think that can happen - in that case totalram_pages() is not adjusted and you can never inflate more than total ram.

Yes, they are not set atomically, but see the use cases:

- a driver that does calculations on init.
It will use the notifier to redo the calculations.
The notifier will bring the values and the size of the change to help the driver decide if it needs to recalculate.

- a user of totalram_pages() that does calculations at run time.
I have to research whether there are any users that could be affected by the two values not being set atomically, assuming there can be a slight difference. I.e. do they need precise calculations, or are they calculating fractions?
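
The first use case can be sketched as a subscriber that recomputes a percentage-based limit only when the accumulated change is large enough. Everything here is hypothetical: struct balloon_change, struct pct_limit_user and the 1/16 threshold are assumptions made for the example, not part of the proposed notifier.

```c
/* Hypothetical payload: the new total and the size of the change. */
struct balloon_change {
	unsigned long total_pages;  /* total after the change */
	long delta_pages;           /* change since the previous notification */
};

/* Hypothetical user that sized a limit from the total at init time. */
struct pct_limit_user {
	unsigned int percent;
	unsigned long basis_pages;  /* total the limit was last computed from */
	unsigned long limit_pages;
	long pending;               /* change seen since the last recalculation */
	int recalcs;
};

static void user_init(struct pct_limit_user *u, unsigned long total,
		      unsigned int percent)
{
	u->percent = percent;
	u->basis_pages = total;
	u->limit_pages = total * percent / 100;
	u->pending = 0;
	u->recalcs = 0;
}

/* Returns 1 if the limit was recalculated, 0 if the change was ignored. */
static int user_on_change(struct pct_limit_user *u,
			  const struct balloon_change *c)
{
	unsigned long drift;

	u->pending += c->delta_pages;
	drift = u->pending < 0 ? (unsigned long)-u->pending
			       : (unsigned long)u->pending;
	if (drift <= u->basis_pages / 16)  /* assumed threshold */
		return 0;

	u->basis_pages = c->total_pages;
	u->limit_pages = c->total_pages * u->percent / 100;
	u->pending = 0;
	u->recalcs++;
	return 1;
}
```

A user initialised with 1600 pages at 50 percent keeps its limit of 800 after a change of -10 pages, and recomputes it to 700 once a further -190 pages bring the total to 1400.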


And you are fine once you switch to the _current version; si_meminfo_current is similar.

On init (probably) all use some kind of fractions to calculate, but when there is a value set via /proc/sys/net/ipv4/tcp_wmem, for example, it is just a value and you cannot recalculate it. For this case, please share your ideas on how to solve it.

I don’t get all of that. Now that you provided some more explanations, it
sounds that what you want is adjust_managed_page_count(), which we already
have and affects the output of totalram_pages(). Therefore, totalram_pages()
anyhow accounts for the balloon memory (excluding VMware’s). So why do we
need to take mem_balloon_inflated_free_kb into account?

Ok, you have this:
/ totalram
|----used----|b1|----free------|b2|

Drivers can inflate both b1 and b2: with b1, free gets smaller; with b2, totalram pages get smaller. So when you need totalram_pages() to do a calculation, you need to adjust it by the pages that are inflated in free/used (b1). VMware is not an exception; Virtio does the same.
And according to mst and davidh it is okay like this.
So I am proposing a way to handle both cases.
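
The two cases can be compared with a small model. This is a simplified sketch under assumptions: struct guest_mem and the helper names are hypothetical, and real accounting involves more than these three counters.

```c
/* Simplified model of guest memory accounting; all names are hypothetical. */
struct guest_mem {
	unsigned long total_pages;    /* what totalram_pages() would report */
	unsigned long free_pages;
	unsigned long inflated_used;  /* b1: balloon pages still counted in total */
};

/* b1 style: the total stays fixed, pages move from free into the balloon. */
static void inflate_as_used(struct guest_mem *m, unsigned long n)
{
	m->free_pages -= n;
	m->inflated_used += n;
}

/* b2 style: the total shrinks, as with adjust_managed_page_count(). */
static void inflate_shrink_total(struct guest_mem *m, unsigned long n)
{
	m->free_pages -= n;
	m->total_pages -= n;
}

/* Total usable by the guest, correcting for b1-style inflation. */
static unsigned long usable_total(const struct guest_mem *m)
{
	return m->total_pages - m->inflated_used;
}
```

Starting from 1000 total and 600 free pages, inflating 100 pages leaves the total at 1000 in the b1 case and at 900 in the b2 case, while usable_total() gives 900 for both.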

Ugh. What about BALLOON_INFLATE and BALLOON_DEFLATE vm-events? Can’t this
information be used instead of yet another counter? Unless, of course, you
get the atomicity that I mentioned before.

What do you mean by vm-events?


Sounds to me that all you want is some notifier to be called from
adjust_managed_page_count(). What am I missing?

The notifier will act as an accumulator to report the size of the change, and it will make things easier for the drivers and users wrt locking.
The notifier is similar to the memory hotplug notifier.

Overall, I am not convinced that there is any value in separating the value
and the notifier. You can batch both or not batch both. In addition, as I
mentioned, having two values seems racy.

I have identified two users so far above; maybe more to come.
One type needs the value to adjust. Also, having the value is necessary to report it to users and OOM. There are options with callbacks and so on, but they would complicate things with no real gain. You are right about the atomicity, but I guess if that's a problem for some user it could find a way to ensure it. I have yet to find such a place.

--
Regards,
Alexander Atanasov