Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP split issue

From: David Hildenbrand
Date: Tue Mar 31 2020 - 06:35:43 EST


On 26.03.20 10:49, Michael S. Tsirkin wrote:
> On Thu, Mar 26, 2020 at 08:54:04AM +0100, David Hildenbrand wrote:
>>
>>
>> On 26.03.2020 08:21, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>>
>>> On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote:
>>>> On 12.03.20 09:47, Michael S. Tsirkin wrote:
>>>>> On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote:
>>>>>> 2. You are essentially stealing THPs in the guest. So the fastest
>>>>>> mapping (THP in guest and host) is gone. The guest won't be able to make
>>>>>> use of THP where it previously was able to. I can imagine this implies a
>>>>>> performance degradation for some workloads. This needs a proper
>>>>>> performance evaluation.
>>>>>
>>>>> I think the problem is more with the alloc_pages API.
>>>>> That gives you exactly the given order, and if there's
>>>>> a larger chunk available, it will split it up.
>>>>>
>>>>> But for the balloon - and I suspect lots of other users -
>>>>> we do not want to stress the system; if a large
>>>>> chunk is available anyway, then we could handle
>>>>> that more optimally by getting it all in one go.
>>>>>
>>>>>
>>>>> So if we want to address this, IMHO this calls for a new API.
>>>>> Along the lines of
>>>>>
>>>>> struct page *alloc_page_range(gfp_t gfp, unsigned int min_order,
>>>>> unsigned int max_order, unsigned int *order)
>>>>>
>>>>> the idea would then be to return a number of pages in the given
>>>>> range.
>>>>>
>>>>> What do you think? Want to try implementing that?
>>>>
>>>> You can just start with the highest order and decrement it until the
>>>> alloc_pages() call succeeds, which would be enough for a first
>>>> version. At least I don't see an immediate need for a new kernel
>>>> API.
>>>
>>> OK I remember now. The problem is with reclaim. Unless reclaim is
>>> completely disabled, any of these calls can sleep. After it wakes up,
>>> we would like to get the larger order that has become available
>>> meanwhile.
>>>
>>
>> Yes, but that's a pure optimization IMHO.
>> So I think we should do a trivial implementation first and then see what we gain from a new allocator API. Then we might also be able to justify it using real numbers.
>>
>
> Well, how do you propose to implement the necessary semantics?
> I think we are both agreed that alloc_page_range is more or
> less what's necessary anyway - so how would you approximate it
> on top of existing APIs?

Looking at drivers/misc/vmw_balloon.c:vmballoon_inflate(), it first
tries to allocate huge pages using

alloc_pages(__GFP_HIGHMEM | __GFP_NOWARN | __GFP_NOMEMALLOC,
            VMW_BALLOON_2M_ORDER)

and then falls back to 4k allocations (balloon_page_alloc()) in case
the 2M allocation fails.
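
The gist of that logic, condensed (the real vmballoon_alloc_page_list()
does more bookkeeping; this is just the allocation pattern, with a
made-up helper name):

static struct page *balloon_alloc_2m_or_4k(bool *is_2m)
{
	struct page *page = NULL;

	/* Try a 2MB huge page first; don't warn or dip into reserves. */
	if (*is_2m)
		page = alloc_pages(__GFP_HIGHMEM | __GFP_NOWARN |
				   __GFP_NOMEMALLOC, VMW_BALLOON_2M_ORDER);

	/* Fall back to a single 4k balloon page if that failed. */
	if (!page) {
		*is_2m = false;
		page = balloon_page_alloc();
	}
	return page;
}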

I'm roughly thinking of something like the following, but with an
optimized reporting interface/bigger pfn array so we can report >
1MB at a time. Also, it might make sense to remember the order that
succeeded across some fill_balloon() calls.

Don't even expect it to compile ...
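
The core would be a helper implementing the "start big, shrink on
failure" idea from above (balloon_pages_alloc() is just an illustrative
name; reporting the individual 4k sub-pages via set_page_pfns() and the
accounting are left out):

/*
 * Try orders from *order down to 1 and report back which order we
 * actually got. Fall back to the ordinary 4k balloon allocation if
 * no higher-order page is available.
 */
static struct page *balloon_pages_alloc(unsigned int *order)
{
	struct page *page;
	unsigned int o;

	for (o = *order; o > 0; o--) {
		page = alloc_pages(__GFP_HIGHMEM | __GFP_NOWARN |
				   __GFP_NOMEMALLOC | __GFP_NORETRY, o);
		if (page) {
			*order = o;
			return page;
		}
	}

	*order = 0;
	return balloon_page_alloc();
}

fill_balloon() would then start with *order = HPAGE_PMD_ORDER (assuming
THP is configured), remember the order that last succeeded for the next
round, and fill the pfn array with all 4k sub-pages of each chunk.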