Re: [RFC PATCH] Introduce persistent memory pool

From: Greg Kroah-Hartman
Date: Fri Aug 25 2023 - 04:27:41 EST


On Tue, Aug 22, 2023 at 11:34:34AM -0700, Stanislav Kinsburskii wrote:
> This patch addresses the need for a memory allocator dedicated to
> persistent memory within the kernel. This allocator will preserve
> kernel-specific states like DMA passthrough device states, IOMMU state, and
> more across kexec.

And CMA doesn't do this for you today?

> The proposed solution offers a foundational implementation for potential
> custom solutions that might follow. Though the implementation is
> intentionally kept concise and straightforward to foster discussion and
> feedback, it's fully functional in its current state.
>
> The persistent memory pool consists of a simple page allocator that
> operates on reserved physical memory regions. It employs bitmaps to
> allocate and free pages, with the following characteristics:
>
> 1. Memory pool regions are specified using the command line option:
>
> pmpool=<base>,<size>
>
> Where <base> represents the physical memory base address and <size>
> indicates the memory size.
>
> 2. While the pages allocation emulates the buddy allocator interface, it
> isn't confined to the MAX_ORDER.
>
> 3. The memory pool initializes during a cold boot and is retained across
> kexec.
>
> Potential applications include:
>
> 1. Allowing various in-kernel entities to allocate persistent pages from
> a singular memory pool, eliminating the need for multiple region
> reservations.
>
> 2. For in-kernel components that require the allocation address to be
> available on kernel kexec, this address can be exposed to user space and
> then passed via the command line.
>
> 3. Separate subsystems or drivers can reserve their region, sharing a
> portion of it for their persistent memory pool. This might be a file
> system, a key-value store, or other applications.
>
> Potential Enhancements for the Proposed Memory Pool:
>
> 1. Multiple Memory Regions Support: enhance the pool to accommodate and
> manage multiple memory regions simultaneously.
>
> 2. In-Kernel Memory Allocations for Storage Memory Regions:
> * Allow in-kernel memory allocations to serve as storage memory regions.
> * Implement explicit reservation of these designated regions during kexec.

As you have no in-kernel users of this, it's not something we can even
consider at the moment for obvious reasons (neither would you want us
to.)

Can you make this part of a patch series that actually adds a user,
probably more than one, so that we can see if any of this even makes
sense?


> drivers/misc/Kconfig | 7 +
> drivers/misc/Makefile | 1
> drivers/misc/pmpool.c | 270 ++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/pmpool.h | 20 ++++
> 4 files changed, 298 insertions(+)
> create mode 100644 drivers/misc/pmpool.c
> create mode 100644 include/linux/pmpool.h

misc is not for memory pools, as this is not a driver. please put this
in the properly location instead of trying to hide it from the mm
maintainers and subsystem :)

> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> index cadd4a820c03..c8ef5b37ee98 100644
> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -562,6 +562,13 @@ config TPS6594_PFSM
> This driver can also be built as a module. If so, the module
> will be called tps6594-pfsm.
>
> +config PMPOOL
> + bool "Persistent memory pool support"
> + help
> + This option adds support for a persistent memory pool feature, which
> + provides pages allocation and freeing from a set of persistent memory ranges,
> + deposited to the memory pool.

Why would this even be a user selectable option? Either the kernel
needs this or it doesn't.


> +
> source "drivers/misc/c2port/Kconfig"
> source "drivers/misc/eeprom/Kconfig"
> source "drivers/misc/cb710/Kconfig"
> diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> index f2a4d1ff65d4..31dd6553057d 100644
> --- a/drivers/misc/Makefile
> +++ b/drivers/misc/Makefile
> @@ -67,3 +67,4 @@ obj-$(CONFIG_TMR_MANAGER) += xilinx_tmr_manager.o
> obj-$(CONFIG_TMR_INJECT) += xilinx_tmr_inject.o
> obj-$(CONFIG_TPS6594_ESM) += tps6594-esm.o
> obj-$(CONFIG_TPS6594_PFSM) += tps6594-pfsm.o
> +obj-$(CONFIG_PMPOOL) += pmpool.o
> diff --git a/drivers/misc/pmpool.c b/drivers/misc/pmpool.c
> new file mode 100644
> index 000000000000..e2c923b31b36
> --- /dev/null
> +++ b/drivers/misc/pmpool.c
> @@ -0,0 +1,270 @@
> +#include <linux/io.h>

You forgot basic copyright/license stuff, did you use checkpatch.pl on
your file?

> +#include <linux/bitmap.h>
> +#include <linux/memblock.h>
> +#include <linux/spinlock.h>
> +#include <linux/types.h>
> +
> +#include <linux/pmpool.h>
> +
> +#define VERSION 1

In kernel code does not need versions.

> +#define MAX_REGIONS 14

Why 14? Why not 24? Or something else?

> +
> +#define for_each_region(pmpool, r) \
> + for (r = pmpool->hdr->regions; \
> + r - pmpool->hdr->regions < pmpool->hdr->nr_regions; \
> + r++)
> +
> +#define for_each_used_region(pmpool, r) \
> + for_each_region((pmpool), (r)) \
> + if (!(r)->size_in_pfns) { continue; } else
> +
> +struct pmpool_region {
> + uint32_t base_pfn; /* 32 bits * 4k = up it 15TB of physical address space */

Please use proper kernel types when writing kernel code (i.e. u32, u8,
and so on.) uint*_t is for userspace code only.

> --- /dev/null
> +++ b/include/linux/pmpool.h
> @@ -0,0 +1,20 @@
> +#ifndef _PMPOOL_H
> +#define _PMPOOL_H
> +
> +#include <linux/types.h>
> +#include <linux/spinlock.h>
> +
> +void *alloc_pm_pages(unsigned int order);
> +void free_pm_pages(void *addr, unsigned int order);
> +
> +struct pmpool {
> + spinlock_t lock;
> + struct pmpool_header *hdr;
> +};
> +
> +int pmpool_init(struct pmpool *pmpool, phys_addr_t base, phys_addr_t size);
> +
> +void *alloc_pmpool_pages(struct pmpool *pmpool, unsigned int order);
> +void free_pmpool_pages(struct pmpool *pmpool, void *addr, unsigned int order);

Please use "noun_verb_*" for new apis so that we have a chance at
understanding where the calls are living.

good luck!

greg k-h