Re: ZONE_NORMAL vs. ZONE_MOVABLE

From: Joonsoo Kim
Date: Mon Mar 20 2017 - 02:33:20 EST


2017-03-17 4:01 GMT+09:00 Andrea Arcangeli <aarcange@xxxxxxxxxx>:
> Hello Joonsoo,

Hello, Andrea.

> On Thu, Mar 16, 2017 at 02:31:22PM +0900, Joonsoo Kim wrote:
>> I didn't follow the previous discussion, so please let me know if I miss
>> something. I'd just like to comment on sticky pageblocks.
>
> The interesting part of the previous discussion relevant to sticky
> movable pageblocks is this part from Vitaly:
>
> === quote ===
> Now we have
>
> [Normal][Normal][Normal][Movable][Movable][Movable]
>
> we could have
>
> [Normal][Normal][Movable][Normal][Movable][Normal]
> === quote ===
>
> Suppose you're an admin; here is what you can try to do, starting from
> hotplug memory that is entirely offline:
>
> kvm ~ # cat /sys/devices/system/memory/memory3[6-9]/online
> 0
> 0
> 0
> 0
> kvm ~ # python ~andrea/zoneinfo.py
> Zone: DMA Present: 15M Managed: 15M Start: 0M End: 16M
> Zone: DMA32 Present: 2031M Managed: 1892M Start: 16M End: 2047M
>
> All hotplug memory is offline, no Movable zone.
>
> Then you online the blocks interleaved:
>
> kvm ~ # echo online_movable > /sys/devices/system/memory/memory39/online
> kvm ~ # python ~andrea/zoneinfo.py
> Zone: DMA Present: 15M Managed: 15M Start: 0M End: 16M
> Zone: DMA32 Present: 2031M Managed: 1892M Start: 16M End: 2047M
> Zone: Movable Present: 128M Managed: 128M Start: 4.9G End: 5.0G
> kvm ~ # echo online > /sys/devices/system/memory/memory38/online
> kvm ~ # python ~andrea/zoneinfo.py
> Zone: DMA Present: 15M Managed: 15M Start: 0M End: 16M
> Zone: DMA32 Present: 2031M Managed: 1892M Start: 16M End: 2047M
> Zone: Normal Present: 128M Managed: 128M Start: 4.0G End: 4.9G
> Zone: Movable Present: 128M Managed: 128M Start: 4.9G End: 5.0G
>
> So far so good.
>
> kvm ~ # echo online_movable > /sys/devices/system/memory/memory37/online
> kvm ~ # python ~andrea/zoneinfo.py
> Zone: DMA Present: 15M Managed: 15M Start: 0M End: 16M
> Zone: DMA32 Present: 2031M Managed: 1892M Start: 16M End: 2047M
> Zone: Normal Present: 256M Managed: 256M Start: 4.0G End: 4.9G
> Zone: Movable Present: 128M Managed: 128M Start: 4.9G End: 5.0G
>
> Oops: you thought you onlined memory37 as movable, but instead it
> silently went into the Normal zone (without even erroring out), so it's
> definitely not going to be unpluggable and it's definitely not
> movable... it all falls apart here. An admin won't run my zoneinfo.py
> script, which I had to write specifically to understand what a mess was
> happening with interleaved online_movable.
>
> The admin is much better off never touching
> /sys/devices/system/memory/memory37 and just using the in-kernel
> onlining, at the very least until udev and the /sys interface are fixed
> for both movable and non-movable hotplug onlining.

Thanks for the explanation. Now I understand the issue correctly.
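
As an aside, for anyone reproducing the walkthrough above: the
zoneinfo.py script itself isn't in this thread. A minimal sketch of the
same idea, just walking /proc/zoneinfo and printing each zone's
present/managed page counts and start pfn (raw pages, not the MB/GB the
script shows), could look like this:

	#include <stdio.h>

	int main(void)
	{
		char line[256], node[16], zone[32];
		unsigned long val;
		FILE *f = fopen("/proc/zoneinfo", "r");

		if (!f)
			return 1;

		while (fgets(line, sizeof(line), f)) {
			/* "Node 0, zone   Normal" opens each zone section */
			if (sscanf(line, "Node %15[^,], zone %31s", node, zone) == 2)
				printf("Node %s zone %-8s", node, zone);
			else if (sscanf(line, " present %lu", &val) == 1)
				printf(" present %lu", val);
			else if (sscanf(line, " managed %lu", &val) == 1)
				printf(" managed %lu", val);
			else if (sscanf(line, " start_pfn: %lu", &val) == 1)
				printf(" start_pfn %lu\n", val);
		}
		fclose(f);
		return 0;
	}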

>> Before that, I'd like to say that a lot of code already deals with zone
>> overlap. Zone overlap has existed for a long time, although I don't know
>> the exact history. IIRC, Mel fixed such a case before, and the compaction
>> code has a check for it. And I added the overlap check to some pfn
>> iterators which didn't have one, in preparation for introducing a new
>> zone, ZONE_CMA, whose range overlaps other zones. See the following
>> commits.
>>
>> 'ba6b097', '9d43f5a', 'a91c43c'.
>>
>
> So you suggest creating a full overlap like:
>
> --------------- Movable --------------
> --------------- Normal --------------
>
> Then searching for pages in the Movable zone buddy, which will only
> contain those that were onlined with echo online_movable?

Yes. Full overlap would be the worst case, but it's possible and it
would work well(?) even in the current kernel.
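
For reference, the overlap handling in those pfn walkers boils down to a
guard like the sketch below (illustrative only, not the exact code from
the commits above; first_valid_page_in_zone() is a made-up helper name).
With overlapping spans, a pfn inside one zone's range can still belong
to another zone, so the walker has to check page_zone() before trusting
the page:

	static struct page *first_valid_page_in_zone(struct zone *zone,
						     unsigned long pfn)
	{
		struct page *page;

		if (!pfn_valid(pfn))
			return NULL;

		page = pfn_to_page(pfn);

		/* spans overlap: this pfn may belong to another zone */
		if (page_zone(page) != zone)
			return NULL;

		return page;
	}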

>> Coming to my main topic, I disagree that sticky pageblocks would be
>> superior to the current separate zone approach. There are some reasons
>> for the objection to sticky movable pageblocks in the following link.
>>
>> Sticky movable pageblocks are conceptually the same as MIGRATE_CMA, and
>> they will cause many subtle issues, just as MIGRATE_CMA did for CMA
>> users. MIGRATE_CMA introduces many hooks in various code paths, and, to
>> fix the remaining issues, it needs more hooks. I don't think it is a
>
> I'm not saying sticky movable pageblocks are the way to go; on the
> contrary, we're saying the Movable zone constraints can be better
> satisfied by the in-kernel onlining mechanism, and it's overall much
> simpler for the user to use the in-kernel onlining than to try to fix
> udev to be synchronous and to implement sticky movable pageblocks to
> make the /sys interface usable without unexpected side effects. And I
> would suggest looking into dropping the MOVABLE_NODE config option
> first (and turning it into a kernel parameter if anything).

Okay.

> I agree sticky movable pageblocks may slow things down and increase
> complexity, so it'd be better not to have to implement them.
>
>> maintainable approach. If you look at the following link, which
>> implements the ZONE approach, you can see that many hooks are removed
>> in the end.
>>
>> lkml.kernel.org/r/1476414196-3514-1-git-send-email-iamjoonsoo.kim@xxxxxxx
>>
>> I don't know the exact requirements of memory hotplug, so it's possible
>> that the ZONE approach is not suitable for it. But, anyway, sticky
>> pageblocks don't seem like a good solution to me.
>
> The fact that sticky movable pageblocks aren't ideal for CMA doesn't
> mean they're not ideal for memory hotunplug, though.
>
> With CMA there's no point in having the sticky movable pageblocks
> scattered around, and it's purely a misfeature to use sticky movable
> pageblocks there, because you need the whole CMA area contiguous; hence
> a ZONE_CMA is ideal.

No. CMA ranges can be registered multiple times, one per device, and they
can be scattered due to devices' H/W limitations. So, in the current
in-kernel implementation, MIGRATE_CMA pageblocks are sometimes scattered.

> With memory hotplug, by contrast, the sticky movable pageblocks would
> allow the kernel to satisfy the current /sys API, and they would have
> no downside, unlike in the CMA case where the size of the allocation is
> unknown.

No, the same downside also exists in this case. The downside is not
related to a device using that range; it is related to the VM's
management of that range, and the problems are the same. For example,
with sticky movable pageblocks, we need to subtract the number of free
pages in sticky movable pageblocks when the watermark is checked for a
non-movable allocation, and that causes some problems.
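
To make that concrete, here is a rough sketch of the kind of hook this
implies, modeled on how MIGRATE_CMA is already handled in
__zone_watermark_ok(); ALLOC_STICKY_MOVABLE and
NR_FREE_STICKY_MOVABLE_PAGES are hypothetical names, and the real
function applies further adjustments that are omitted here:

	static bool watermark_ok_sketch(struct zone *z, unsigned long mark,
					int classzone_idx,
					unsigned int alloc_flags)
	{
		long free_pages = zone_page_state(z, NR_FREE_PAGES);

		/* existing MIGRATE_CMA handling in __zone_watermark_ok() */
		if (!(alloc_flags & ALLOC_CMA))
			free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);

		/*
		 * A sticky movable implementation would need the same
		 * subtraction, and that is exactly the sort of extra hook
		 * scattered through the allocator that worries me.
		 */
		if (!(alloc_flags & ALLOC_STICKY_MOVABLE))
			free_pages -= zone_page_state(z,
						NR_FREE_STICKY_MOVABLE_PAGES);

		return free_pages > mark + z->lowmem_reserve[classzone_idx];
	}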

> If we can make zone overlap work with a 100% overlap across the whole
> node, that would be a fine alternative; the zoneinfo.py output will look
> weird, but if that's the only downside it's no big deal. With sticky
> movable pageblocks it'll all be ZONE_NORMAL; with overlap it'll all be
> both ZONE_NORMAL and ZONE_MOVABLE at the same time.

Okay.

> Again, with the in-kernel onlining none of the above is necessary, as
> nobody should then need to echo online/online_movable > memory*/online
> ever again, and it can all be obsoleted. So before dropping the only
> option we have that works flawlessly, we should fix all of the above in
> udev and /sys, and provide full zone overlap or sticky movable
> pageblocks.

Okay.

Thanks.