Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)

From: David Hildenbrand
Date: Mon Jul 31 2023 - 13:11:42 EST


On 31.07.23 18:54, Matthew Wilcox wrote:
On Mon, Jul 31, 2023 at 06:48:47PM +0200, David Hildenbrand wrote:
On 31.07.23 18:38, Matthew Wilcox wrote:
On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
Assume we do the page table sharing at mmap time, if the flags are right.
Let's focus on the most common:

mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)

And doing the same in each and every process.
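
For concreteness, the shorthand above could be spelled out roughly as follows. This is only a sketch of what each process would do today, with no page table sharing involved. The helper name sketch_map_shared() is hypothetical, and the fd is assumed to come from something like memfd_create() or a descriptor passed between processes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: size the file behind `fd` to `len` bytes and map
 * it MAP_SHARED, read/write. Every process would repeat this call and,
 * without page table sharing, build its own page tables for the range.
 * Returns the mapping, or MAP_FAILED on error.
 */
static void *sketch_map_shared(int fd, size_t len)
{
    assert(len > 0);

    if (ftruncate(fd, (off_t)len) < 0) {
        perror("ftruncate");
        return MAP_FAILED;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        perror("mmap");
    return p;
}
```

The caller keeps ownership of the fd and unmaps with munmap(p, len) when done.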

That may be the most common in your usage, but for a database, you're
looking at two usage scenarios. Postgres calls mmap() on the database
file itself so that all processes share the kernel page cache.
Some Commercial Databases call mmap() on a hugetlbfs file so that all
processes share the same userspace buffer cache. Other Commercial
Databases call shmget() / shmat() with SHM_HUGETLB for the exact
same reason.

I remember you said that postgres might be looking into using shmem as well,
maybe I am wrong.

No, I said that postgres was also interested in sharing page tables.
I don't think they have any use for shmem.

memfd/hugetlb/shmem could all be handled alike, just "arbitrary filesystems"
would require more work.

But arbitrary filesystems was one of the original use cases; where the
database is stored on a persistent memory filesystem, and neither the
kernel nor userspace has a cache. The Postgres & Commercial Database
use-cases collapse into the same case, and we want to mmap the files
directly and share the page tables.

Yes, and transparent page table sharing can be achieved otherwise.

I guess what you imply is that they want to share page tables and have a single mprotect(PROT_READ) to modify the shared page tables.


This is why I proposed mshare(). Anyone can use it for anything.
We have such a diverse set of users who want to do stuff with shared
page tables that we should not be tying it to memfd or any other
filesystem. Not to mention that it's more flexible; you can map
individual 4kB files into it and still get page table sharing.

That's not what the current proposal does, or am I wrong?

I think you're wrong, but I haven't had time to read the latest patches.


Maybe I misunderstood what MAP_SHARED_PT actually does.

"
This patch series adds a new flag to mmap() call - MAP_SHARED_PT.
This flag can be specified along with MAP_SHARED by a process to
hint to kernel that it wishes to share page table entries for this
file mapping mmap region with other processes. Any other process
that mmaps the same file with MAP_SHARED_PT flag can then share the
same page table entries. Besides specifying MAP_SHARED_PT flag, the
processes must map the files at a PMD aligned address with a size
that is a multiple of PMD size and at the same virtual addresses.
This last requirement of same virtual addresses can possibly be
relaxed if that is the consensus.
"

Reading this, I'm confused how 4k files would interact with the PMD size requirement.
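
To make the mismatch concrete, here is a rough sketch of the alignment rule as the cover letter words it. It is not taken from the patches. SKETCH_PMD_SIZE assumes x86-64 with 4 KiB base pages, where one PMD entry covers 2 MiB, and the helper name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: x86-64 with 4 KiB base pages, so one PMD entry spans 2 MiB. */
#define SKETCH_PMD_SIZE ((uintptr_t)2 << 20)

/*
 * Hypothetical helper: true if [addr, addr + len) meets the quoted
 * requirement, i.e. a PMD-aligned start address and a length that is
 * a non-zero multiple of the PMD size.
 */
static bool sketch_pmd_shareable(uintptr_t addr, size_t len)
{
    if (len == 0)
        return false;
    return (addr % SKETCH_PMD_SIZE) == 0 && (len % SKETCH_PMD_SIZE) == 0;
}
```

Under these assumptions a 4 KiB mapping fails the length check even at a PMD-aligned address, while a 2 MiB mapping at the same address passes.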

Probably I got it all wrong.

Also, I'm curious, is that a real requirement in the database world?

I don't know. It's definitely an advantage that falls out of the design
of mshare.

Okay, just checking if there is an important use case I'm missing; I'm also not aware of any.


Anyhow, I have other work to do. Happy to continue the discussion once someone is actually working on this (again).

--
Cheers,

David / dhildenb