[LSF/MM TOPIC] Use NVDIMM as NUMA node and NUMA API

From: Yang Shi
Date: Wed Jan 30 2019 - 12:26:55 EST


Hi folks,


I would like to attend the LSF/MM Summit 2019. I'm interested in most MM topics, particularly the NUMA API topic proposed by Jerome since it is related to my below proposal.

I would like to share some of our use cases, needs, and approaches for using NVDIMM as a NUMA node.

We would like to provide NVDIMM to our cloud customers as low-cost memory. Virtual machines could run with NVDIMM as backing memory. We would like the below needs to be met:

    * The ratio of DRAM vs NVDIMM is configurable per process, or even per VMA
    * The user VMs always get DRAM first as long as the ratio is not reached
    * Migrate cold data to NVDIMM and keep hot data in DRAM dynamically, throughout the lifetime of the VMs

To meet these needs we did an in-house implementation:
    * Provide a madvise interface to configure the ratio (see the sketch after this list)
    * Put NVDIMM into a separate zonelist so that the default allocation path can't touch it unless it is requested explicitly
    * A kernel thread scans for cold pages
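
For illustration, a minimal userspace sketch of how the madvise-based ratio configuration might look. The MADV_DRAM_RATIO_BASE value is purely hypothetical (our in-house encoding is not upstream, and madvise(2) carries no separate value argument, so the percentage has to be folded into the advice number):

#include <sys/mman.h>
#include <stdio.h>

/*
 * Hypothetical advice base; madvise(2) takes only an int advice, so the
 * desired DRAM percentage is encoded into the advice value itself.
 * This constant does not exist in the upstream kernel.
 */
#define MADV_DRAM_RATIO_BASE	0x1000

/* Ask the kernel to keep at most dram_pct% of this VMA's pages in DRAM. */
static int set_dram_ratio(void *addr, size_t len, int dram_pct)
{
	return madvise(addr, len, MADV_DRAM_RATIO_BASE + dram_pct);
}

int main(void)
{
	size_t len = 1UL << 30;	/* e.g. a 1 GB guest-backing region */
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (addr == MAP_FAILED)
		return 1;

	/* 50% DRAM / 50% NVDIMM for this VMA (illustrative only). */
	if (set_dram_ratio(addr, len, 50))
		perror("madvise");

	return 0;
}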

We tried to just use the current NUMA APIs, but we realized they can't meet our needs. For example, if we configure a VMA to use 50% DRAM and 50% NVDIMM, mbind() could set a preferred node policy (the DRAM node or the NVDIMM node) for this VMA, but it can't control how much DRAM or NVDIMM is used by this specific VMA to satisfy the ratio.
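
To make the limitation concrete, this is roughly all the existing interface lets us say (the node numbers are assumptions for illustration; here node 0 is DRAM and node 1 is the NVDIMM node):

#include <numaif.h>	/* mbind(), MPOL_PREFERRED; link with -lnuma */
#include <sys/mman.h>

static int prefer_node(void *addr, size_t len, int node)
{
	unsigned long nodemask = 1UL << node;

	/*
	 * MPOL_PREFERRED only names a node to try first.  MPOL_INTERLEAVE
	 * would spread pages round-robin over the nodemask, but neither
	 * policy can express "at most 50% of this VMA's pages from node 0,
	 * the rest from node 1, and fill DRAM first".
	 */
	return mbind(addr, len, MPOL_PREFERRED, &nodemask,
		     sizeof(nodemask) * 8, 0);
}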

So, IMHO we definitely need more fine-grained APIs to control the NUMA behavior.
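
As a purely hypothetical strawman (not a proposal for a concrete ABI, and nothing like this exists upstream; all the names below are made up), a ratio-aware variant of mbind() might look something like:

/* Hypothetical, for discussion only. */
struct mempolicy_weight {
	int node;		/* NUMA node id */
	unsigned int weight;	/* share of the VMA's pages, e.g. percent */
};

long mbind_weighted(void *addr, unsigned long len,
		    const struct mempolicy_weight *weights, int nr_weights,
		    unsigned long flags);

/* e.g. 50% of the VMA from DRAM node 0, 50% from NVDIMM node 1: */
/*	struct mempolicy_weight w[] = { { 0, 50 }, { 1, 50 } };	*/
/*	mbind_weighted(addr, len, w, 2, 0);				*/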

I'd also like to discuss this topic with:
    Dave Hansen
    Dan Williams
    Fengguang Wu

Other than the above topic, I'd also like to meet other MM developers to discuss some of our use cases for memory cgroups (hallway conversations may be good enough). I submitted some RFC patches to the mailing list and they generated some discussion, but we have not reached a solid conclusion yet.

https://lore.kernel.org/lkml/1547061285-100329-1-git-send-email-yang.shi@xxxxxxxxxxxxxxxxx/


Thanks,

Yang