Re: arm64: dropping prevent_bootmem_remove_notifier

From: Anshuman Khandual
Date: Mon Oct 19 2020 - 02:31:10 EST

On 10/17/2020 03:05 PM, David Hildenbrand wrote:
> On 17.10.20 01:11, Sudarshan Rajagopalan wrote:
>>
>> Hello Anshuman,
>>
> David here,
>
> in general, if your driver offlines+removes random memory, it is doing
> something *very* wrong and dangerous. You shouldn't ever be
> offlining+removing memory unless
> a) you own that boot memory after boot. E.g., the ACPI driver owns DIMMs
> after a reboot.
> b) you added that memory via add_memory() and friends.

Right.
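
For completeness, the legitimate pattern is that a driver only ever
removes what it itself added. A minimal sketch, assuming v5.9-era
prototypes (the mydrv_* names are made up for illustration):

#include <linux/memory_hotplug.h>

/* Hand a physical range the driver owns to the page allocator. */
static int mydrv_add_region(int nid, u64 start, u64 size)
{
	return add_memory(nid, start, size);
}

/* Tear down only what was previously added via add_memory(). */
static int mydrv_remove_region(int nid, u64 start, u64 size)
{
	return offline_and_remove_memory(nid, start, size);
}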

>
> Even trusting that offline memory can be used by your driver is wrong.

Right.

>
> Just imagine racing with actual memory hot(un)plug: you'll be in
> *big* trouble. For example,
>
> 1. You offlined memory and assume you can use it. A DIMM can simply get
> unplugged. You're doomed.
> 2. You offlined+removed memory and assume you can use it. A DIMM can
> simply get unplugged and the whole machine would crash.
>
> Or imagine your driver running on a system that has virtio-mem, which
> will try to remove/offline+remove memory that was added by virtio-mem
> and is under its control.
>
> Long story short: don't do it.
>
> There is *one* instance in Linux where we currently allow it for legacy
> reasons. It is powernv/memtrace code that offlines+removes boot memory.
> But here we are guaranteed to run in an environment (HW) without any
> actual memory hot(un)plug.
>
> I guess you're going to say "but in our environment we don't have ..." -
> this is not a valid argument for changing such generic things upstream /
> introducing such hacks.

Agreed.

>
>> In the patch that enables memory hot-remove (commit bbd6ec605c0f
>> ("arm64/mm: Enable memory hot remove")) for arm64, there’s a notifier
>> put in place that prevents boot memory from being offlined and removed.
>> The commit text also mentions that boot memory on arm64 cannot be
>> removed. We wanted to understand more about the reasoning for this. x86
>> and other archs don’t seem to do this prevention. There’s also a comment
>> in the code that this notifier could be dropped in the future if and
>> when boot memory can be removed.
>
> The issue is that with *actual* memory hotunplug (which is what the whole
> machinery should be used for), that memory/DIMM will be gone. And as you
> cannot fix up the initial memmap, if you were to reboot that machine, you
> would simply crash immediately.

Right.
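
For reference, the notifier in question simply rejects MEM_GOING_OFFLINE
for any early (boot) section - paraphrasing arch/arm64/mm/mmu.c as added
by commit bbd6ec605c0f:

static int prevent_bootmem_remove_notifier(struct notifier_block *nb,
					   unsigned long action, void *data)
{
	struct memory_notify *arg = data;
	unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
	unsigned long pfn = arg->start_pfn;

	if (action != MEM_GOING_OFFLINE)
		return NOTIFY_OK;

	/* Refuse to offline any section that was present at boot. */
	for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
		if (early_section(__pfn_to_section(pfn)))
			return NOTIFY_BAD;
	}
	return NOTIFY_OK;
}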

>
> On x86, you can have that easily: hotplug DIMMs on bare metal and
> reboot. The DIMMs will be exposed via e820 during boot, so they are
> "early", although if done right (movable_node, movablecore and
> similar), they can get hotunplugged later. Important in environments
> where you want to hotunplug whole nodes. But as HW on x86 will properly
> adjust the initial memmap / e820, there is no such issue as on arm64.

That is the primary problem.
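
For the record, keeping such DIMMs hotunpluggable on x86 is a kernel
command-line matter, e.g. (the size here is hypothetical):

movable_node		# keep hotplug-capable nodes' memory in ZONE_MOVABLE
movablecore=4G		# or: reserve a fixed amount for ZONE_MOVABLE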

>
>>
>> The current logic is that only “new” memory blocks which are hot-added
>> can later be offlined and removed. The memory that the system booted up
>> with cannot be offlined and removed. But there could be many use cases,
>> such as inter-VM memory sharing, where a primary VM could offline and
>> hot-remove a block/section of memory and lend it to a secondary VM which
>> could hot-add it. And after the use case is done, the reverse happens
>
> That use case is using the wrong mechanisms. It shouldn't be
> offlining+removing memory. Read below.
>
>> where the secondary VM hot-removes it and gives it back to the primary,
>> which can hot-add it back. In such cases, the present logic for arm64
>> doesn’t allow this hot-remove in the primary to happen.
>>
>> Also, on systems with a movable zone, which sort of guarantees that
>> pages can be migrated and isolated so that blocks can be offlined, this
>> logic also defeats the purpose of having a movable zone that the system
>> can rely on for memory hot-plugging, which, say, virtio-mem also relies
>> on for fully plugged memory blocks.
>
> ZONE_MOVABLE is *not* just for better guarantees when trying to
> hotunplug memory. It also increases the number of THP/huge pages. And
> that part works just fine.

Right.

>
>>
>> So we’re trying to understand the reasoning for such a prevention put in
>> place for arm64 arch alone.
>>
>> One possible way to solve this is by marking the required sections as
>> “non-early” by removing the SECTION_IS_EARLY bit in their section_mem_map.
>> This puts these sections in the “memory hotpluggable” context, so they
>> can be offlined+removed and added+onlined even though they are part of
>> boot RAM itself and don’t need any extra blocks to be hot-added. This
>> way of marking certain sections as “non-early” could be exported so that
>> module drivers can set the required number of sections as “memory
>
> Oh please no. No driver should be doing that. That's just hacking around
> the root issue: you're not supposed to do that.
>
>> hotpluggable”. This could have certain checks put in place to restrict
>> which sections are allowed; for example, only movable-zone sections
>> could be marked as “non-early”.
>>
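
Just for context on what that proposal would touch: the "early" property
is one of the flag bits stashed in the low bits of section_mem_map -
roughly, from include/linux/mmzone.h (v5.9-era):

#define SECTION_MARKED_PRESENT	(1UL<<0)
#define SECTION_HAS_MEM_MAP	(1UL<<1)
#define SECTION_IS_ONLINE	(1UL<<2)
#define SECTION_IS_EARLY	(1UL<<3)

static inline int early_section(struct mem_section *section)
{
	return (section && (section->section_mem_map & SECTION_IS_EARLY));
}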
>
> I assume what your use case wants to achieve is starting VMs with
> large, contiguous memory backings, without wasting memory on the memmap
> in the hypervisor.
>
> The "traditional" way of doing that is using the "mem=" boot parameter,
> and starting VMs with memory within the "never exposed to Linux" part.

I suggested something similar. Memory not exposed to Linux via this method
can be hot-added or removed later on.
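
As a made-up example, on a 64G machine:

mem=32G		# Linux never maps the upper 32G; it can be handed
		# to guests by other means and hot-added later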

> While that in general works, I consider it an ugly hack. And it doesn't
> really allow the hypervisor to reuse unexposed memory.
>
> The obvious way for a driver to *allocate* memory (because that's what
> you want to do!) is using alloc_contig_range(). I know that there are
> no guarantees. So you could be using CMA, ... but then, you still have
> the memmap consuming memory in your hypervisor.
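
A minimal sketch of that allocation path (untested; v5.9-era prototypes,
alloc_contig_pages() needs CONFIG_CONTIG_ALLOC, and the mydrv_* names
are made up):

#include <linux/gfp.h>

/* Best-effort: may fail if the range cannot be isolated/migrated. */
static struct page *mydrv_alloc_backing(unsigned long nr_pages, int nid)
{
	return alloc_contig_pages(nr_pages, GFP_KERNEL, nid, NULL);
}

static void mydrv_free_backing(struct page *page, unsigned long nr_pages)
{
	free_contig_range(page_to_pfn(page), nr_pages);
}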
>
> What you could try instead is
>
> 1. Using hugetlbfs with huge (2MB) / gigantic (1GB) (on x86) pages for
> backing your guest.
> 2. To free up the memmap, you could then go in the direction proposed
> by Muchun Song [1].
>
> That's then a clean way for a driver to allocate/use memory without
> abusing memory hot(un)plug infrastructure, minimizing the memmap
> consumption.
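
To illustrate 1., mapping guest memory from userspace with gigantic
pages could look like this (hypothetical size; requires 1G pages to be
reserved, e.g. hugepagesz=1G hugepages=4 on the kernel command line):

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << 26)	/* log2(1G) << MAP_HUGE_SHIFT */
#endif

int main(void)
{
	size_t size = 4UL << 30;	/* 4G of guest RAM, made up */

	/* Anonymous hugetlb mapping backed by pre-reserved 1G pages. */
	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
			 MAP_HUGE_1GB, -1, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* ... hand mem to the VM as its RAM backing ... */
	munmap(mem, size);
	return 0;
}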
>
>
> [1]
> https://lkml.kernel.org/r/20200915125947.26204-1-songmuchun@xxxxxxxxxxxxx
>