Re: [PATCH 00/35] Enhance memory utilization with DMEMFS

From: Joao Martins
Date: Thu Oct 08 2020 - 15:03:01 EST


[adding a couple folks that directly or indirectly work on the subject]

On 10/8/20 8:53 AM, yulei.kernel@xxxxxxxxx wrote:
> From: Yulei Zhang <yuleixzhang@xxxxxxxxxxx>
>
> In current system each physical memory page is associated with
> a page structure which is used to track the usage of this page.
> But due to the memory usage rapidly growing in cloud environment,
> we find the resource consuming for page structure storage becomes
> highly remarkable. So is it an expense that we could spare?
>
Happy to see more folks working to solve the same problem! I hope we
can join efforts.

BTW, there is also a second benefit in removing struct page -
which is carving out memory from the direct map.

> This patchset introduces an idea about how to save the extra
> memory through a new virtual filesystem -- dmemfs.
>
> Dmemfs (Direct Memory filesystem) is device memory or reserved
> memory based filesystem. This kind of memory is special as it
> is not managed by kernel and most important it is without 'struct page'.
> Therefore we can leverage the extra memory from the host system
> to support more tenants in our cloud service.
>
This is like a walk down memory lane.

About a year ago we followed the exact same idea/motivation, to have
memory outside of the direct map (and to remove the struct page overhead),
and started with our own layer/thingie. However, we realized that DAX
is one of the subsystems which already gives you direct access to memory
for free (and is already upstream), plus a couple of things which we
found handier.

So we sent an RFC a couple months ago:

https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.martins@xxxxxxxxxx/

Since then the majority of the work has been in improving DAX[1].
Now that that is done, I am going to follow up with the above patchset.

[1]
https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

(Give me a couple of days and I will send you the link to the latest
patches on a git-tree - would love feedback!)

The struct page removal for DAX would then be small, and it ticks the
same boxes (MCE handling, reserving PAT memtypes, ptrace support) that
we both do, with a smaller diffstat, and it doesn't touch KVM (at least
not fundamentally).

15 files changed, 401 insertions(+), 38 deletions(-)

The changes needed in core-mm are for handling PMD/PUD PAGE_SPECIAL, much
like we both do. Furthermore, there wouldn't be a need for a new vm type,
for consuming an extra page bit (in addition to PAGE_SPECIAL), or for a
new filesystem.


> We use a kernel boot parameter 'dmem=' to reserve the system
> memory when the host system boots up, the details can be checked
> in /Documentation/admin-guide/kernel-parameters.txt.
>
> Theoretically for each 4k physical page it can save 64 bytes if
> we drop the 'struct page', so for guest memory with 320G it can
> save about 5G physical memory totally.
>
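For reference, the arithmetic behind the 5G figure quoted above can be
checked with a minimal sketch. It assumes 4 KiB base pages and a 64-byte
'struct page', as stated in the quoted text; the function and constant
names are made up for illustration and do not come from either patchset.

```python
# Sketch only: sizes are the ones stated in the quoted text, names are hypothetical.
PAGE_SIZE = 4 * 1024        # 4 KiB base page
STRUCT_PAGE_SIZE = 64       # bytes of 'struct page' metadata per base page
GIB = 1024 ** 3


def struct_page_overhead(mem_bytes, page_size=PAGE_SIZE, meta_size=STRUCT_PAGE_SIZE):
    """Return the bytes of 'struct page' metadata needed to track mem_bytes."""
    nr_pages = mem_bytes // page_size
    return nr_pages * meta_size


if __name__ == "__main__":
    overhead = struct_page_overhead(320 * GIB)
    print(overhead / GIB)   # prints 5.0
```

That is 64/4096, i.e. roughly 1.56% of the guest memory, which is where
the "about 5G for 320G" comes from.
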
Also worth mentioning: if you only care about the 'struct page' cost, and
not about the security boundary, there is also some work on hugetlbfs
which, when preallocating hugepages, tricks the vmemmap into reusing
tail pages.

https://lore.kernel.org/linux-mm/20200915125947.26204-1-songmuchun@xxxxxxxxxxxxx/
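
To give a rough sense of the cost that approach targets, here is a small
sketch of how many vmemmap pages back the 'struct page' array of a single
hugepage. It again assumes 4 KiB base pages and a 64-byte 'struct page';
the function name is hypothetical, and how many of these pages can
actually be freed depends on the series linked above, so that is
deliberately not modelled here.

```python
# Sketch only: assumes 4 KiB base pages and 64-byte 'struct page'.
BASE_PAGE = 4 * 1024
STRUCT_PAGE = 64
MIB = 1024 ** 2
GIB = 1024 ** 3


def vmemmap_pages_per_hugepage(huge_size, page_size=BASE_PAGE, meta_size=STRUCT_PAGE):
    """Return the number of base pages holding the struct pages of one hugepage."""
    nr_struct_pages = huge_size // page_size       # one struct page per base page
    return (nr_struct_pages * meta_size) // page_size


if __name__ == "__main__":
    print(vmemmap_pages_per_hugepage(2 * MIB))   # prints 8
    print(vmemmap_pages_per_hugepage(1 * GIB))   # prints 4096
```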

Going forward that could also make sense for device-dax, to avoid
allocating so many struct pages (which would require its transition to
compound struct pages like hugetlbfs, which we are looking at too). In
addition, an idea <handwaving> would perhaps be to have a stricter mode
in DAX where we initialize/use the metadata ('struct page') but remove
the underlying PFNs (of the 'struct page') from the direct map, bearing
the cost of mapping/unmapping on gup/pup.

Joao