Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER

From: David Hildenbrand
Date: Thu Aug 19 2021 - 13:26:38 EST


On 19.08.21 16:54, Tiberiu Georgescu wrote:

On 18 Aug 2021, at 19:13, David Hildenbrand <david@xxxxxxxxxx> wrote:


I'm now wondering whether for Tiberiu's case mincore() can also be used. It
should just still be a bit slow because it'll look up the cache too, but it
should work similarly to the original proposal.

I am afraid that the information returned by mincore is a little too vague to
be of better help, compared to what the pagemap should provide in theory. I
will have a look to see whether lseek on /proc/<pid>/map_files works as a
"PM_SWAP" equivalent. However, the swap offset would still be missing.

Well, with mincore() you could at least decide "page is present" vs. "page is swapped or nonexistent". At least for making pageout decisions it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a no-op.
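
For illustration, the mincore() check described here could look roughly like the following C sketch. The helper name page_is_resident() is made up for this example. A return value of 0 only means "not resident": the page may be swapped out or may never have been populated.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: report whether the page containing `addr` is
 * resident according to mincore().
 *
 * Returns 1 if resident, 0 if not resident (swapped out or never
 * populated; mincore() cannot tell these apart), and -1 on error,
 * for example when the address is not mapped (errno == ENOMEM).
 */
static int page_is_resident(const void *addr)
{
    long psz = sysconf(_SC_PAGESIZE);
    unsigned char vec = 0;
    void *page;

    assert(psz > 0);

    /* mincore() requires a page-aligned start address. */
    page = (void *)((uintptr_t)addr & ~((uintptr_t)psz - 1));

    if (mincore(page, (size_t)psz, &vec) != 0)
        return -1;

    /* Only the least significant bit of each vector byte is defined. */
    return vec & 1;
}
```

A pageout policy could call this per page and skip the madvise(MADV_PAGEOUT) call whenever it returns 0, although issuing the call anyway should be harmless.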

I think you are right. In the optimisation we first presented, we should be
able to send the madvise(MADV_PAGEOUT) call quite safely even if the page is
none, and still get the desired behaviour. Also, the "is_present" or
"is_swap_or_none" question can be answered by the current pagemap too. Nice
catch.
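
As a rough sketch of that pagemap check, in C: the bit positions (63 for "present", 62 for "swapped") are taken from the kernel's pagemap documentation, while the helper and enum names are invented for this example. For shmem pages, the "neither bit set" result covers both "none" and "swapped out", which is exactly the ambiguity under discussion.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit positions as documented for /proc/<pid>/pagemap entries. */
#define PM_PRESENT (1ULL << 63)
#define PM_SWAP    (1ULL << 62)

/* Hypothetical result type for this sketch. */
enum pm_state {
    PM_STATE_ERROR = -1,
    PM_STATE_NONE_OR_SHMEM_SWAP = 0, /* neither bit set */
    PM_STATE_PRESENT,
    PM_STATE_SWAPPED,
};

/* Classify the page containing `addr` in the calling process. */
static enum pm_state pagemap_state(const void *addr)
{
    long psz = sysconf(_SC_PAGESIZE);
    uint64_t entry = 0;
    off_t off;
    ssize_t n;
    int fd;

    assert(psz > 0);

    /* One 64-bit entry per virtual page; assumes a 64-bit off_t. */
    off = (off_t)(((uintptr_t)addr / (uintptr_t)psz) * sizeof(entry));

    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return PM_STATE_ERROR;
    n = pread(fd, &entry, sizeof(entry), off);
    close(fd);
    if (n != (ssize_t)sizeof(entry))
        return PM_STATE_ERROR;

    if (entry & PM_PRESENT)
        return PM_STATE_PRESENT;
    if (entry & PM_SWAP)
        return PM_STATE_SWAPPED;
    return PM_STATE_NONE_OR_SHMEM_SWAP;
}
```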

However, not all use cases are the same. AFAIK, there is still no way to figure
out whether a shared page is swapped out or none unless it is directly
read/accessed after a pagemap check. Bringing a page into memory to check
if it previously was in swap does not seem ideal.

Well, you can lseek() to remove all the holes and use mincore() to remove all in-memory pages. You're left with the swapped ones. Not the most efficient interface maybe, but there is a way :)
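
A minimal C sketch of that combination, assuming `fd` refers to the shmem/tmpfs file and `map` is a MAP_SHARED mapping of it starting at file offset 0. The struct and function names are made up for this example, and it issues one lseek() per page for simplicity rather than speed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA /* Linux values, in case the libc headers hide them */
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/* Hypothetical result type for this sketch. */
struct page_counts {
    size_t holes;    /* never allocated in the file */
    size_t resident; /* allocated and in memory */
    size_t swapped;  /* allocated but not in memory */
};

/*
 * Classify each page of the first `len` bytes of `fd`, mapped at the
 * page-aligned address `map`. Moves the file offset of `fd`.
 * Returns 0 on success, -1 on error.
 */
static int classify_shmem_pages(int fd, void *map, size_t len,
                                struct page_counts *out)
{
    long psz = sysconf(_SC_PAGESIZE);
    unsigned char *vec;
    size_t npages, i;

    assert(out != NULL && psz > 0);
    out->holes = out->resident = out->swapped = 0;
    npages = (len + (size_t)psz - 1) / (size_t)psz;
    if (npages == 0)
        return 0;

    vec = malloc(npages);
    if (!vec)
        return -1;
    if (mincore(map, len, vec) != 0) {
        free(vec);
        return -1;
    }

    for (i = 0; i < npages; i++) {
        off_t off = (off_t)i * psz;
        off_t data = lseek(fd, off, SEEK_DATA);

        if (data < 0 && errno != ENXIO) {
            free(vec);
            return -1;
        }
        if (data != off)
            out->holes++;    /* data starts later, or none is left */
        else if (vec[i] & 1)
            out->resident++; /* not a hole, and in memory */
        else
            out->swapped++;  /* not a hole, not in memory */
    }
    free(vec);
    return 0;
}
```

Note that nothing here touches the mapped memory itself, so no page is faulted in just to find out where it was.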


Also, we still have no mechanism to retrieve the swap offsets of shmem pages
AFAIK. There is one more QEMU optimisation we are working on that requires
these mappings to be available outside of kernel space.

How exactly would the swap offset really help? IMHO that's a kernel internal that shouldn't be of any value to user space -- it's merely for debugging purposes. But I'd love to learn details.

[...]

If it has an fd and we can punch that into syscalls, we should much rather use that fd to look up stuff than going via process page tables -- if possible of course (to be evaluated, because I haven't looked into the CRIU details and how they use lseek with anonymous shared memory).

I found out that it is possible to retrieve the fds of shmem/tmpfs file
allocations using /proc/<pid>/map_files, which is neat. Still, CRIU does not
seem to care whether a page is swapped out or just empty, only whether it is
present in the page cache. The holes that lseek finds would not reveal this
difference, AFAIK. Will test the behaviour to make sure.

CRIU wants to migrate everything. lseek() gives you the definitive answer as to what needs migration -- whether it's swapped out or resident. Just skip the holes.
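
To sketch what "just skip the holes" could look like in C: the loop below walks the data extents of a file with SEEK_DATA/SEEK_HOLE and hands each one to a callback. The function names and the callback type are invented for this example; this is not how CRIU itself is written.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA /* Linux values, in case the libc headers hide them */
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/* Hypothetical callback type: called for each extent [start, end). */
typedef int (*extent_fn)(off_t start, off_t end, void *ctx);

/*
 * Call `fn` for every data extent of `fd`, skipping holes.
 * Moves the file offset of `fd`. Returns 0 on success, -1 on an
 * lseek() error, or the first non-zero value returned by `fn`.
 */
static int for_each_data_extent(int fd, extent_fn fn, void *ctx)
{
    off_t pos = 0;

    for (;;) {
        off_t data, hole;
        int ret;

        data = lseek(fd, pos, SEEK_DATA);
        if (data < 0)
            return errno == ENXIO ? 0 : -1; /* ENXIO: no more data */

        /* There is always an implicit hole at end of file. */
        hole = lseek(fd, data, SEEK_HOLE);
        if (hole < 0)
            return -1;
        assert(hole > data);

        ret = fn(data, hole, ctx);
        if (ret)
            return ret;
        pos = hole;
    }
}

/* Example callback: add up the number of bytes that need migrating. */
static int add_extent_bytes(off_t start, off_t end, void *ctx)
{
    *(off_t *)ctx += end - start;
    return 0;
}
```

Calling for_each_data_extent(fd, add_extent_bytes, &total) with total initialised to 0 would then give the number of bytes to migrate, whether those pages are currently resident or swapped out.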

--
Thanks,

David / dhildenb