Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER

From: Tiberiu Georgescu
Date: Fri Aug 20 2021 - 12:50:40 EST



> On 19 Aug 2021, at 18:26, David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 19.08.21 16:54, Tiberiu Georgescu wrote:
>>> On 18 Aug 2021, at 19:13, David Hildenbrand <david@xxxxxxxxxx> wrote:
>>>
>>>>>
>>>>>> I'm now wondering whether for Tiberiu's case mincore() can also be used. It
>>>>>> should just still be a bit slow because it'll look up the cache too, but it
>>>>>> should work similarly like the original proposal.
>>>> I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on
>>>> proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing.
>>>
>>> Well, with mincore() you could at least decide "page is present" vs. "page is swapped or not existent". At least for making pageout decisions it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a nop.
>> I think you are right. In the optimisation we first presented, we should be able to
>> send the madvise(MADV_PAGEOUT) call even if the page is none quite safely
>> and get the wanted behaviour. Also, the "is_present" or "is_swap_or_none"
>> question can be answered by the current pagemap too. Nice catch.
>> However, not all use cases are the same. AFAIK, there is still no way to figure
>> out whether a shared page is swapped out or none unless it is directly
>> read/accessed after a pagemap check. Bringing a page into memory to check
>> if it previously was in swap does not seem ideal.
>
> Well, you can lseek() to remove all the holes and use mincore() to remove all in-memory pages. You're left with the swapped ones. Not the most efficient interface maybe, but there is a way :)

Ok, that could work. Still, I have a couple of concerns.

Firstly, I am worried that lseek with SEEK_HOLE would page in pages from
swap, so using it would directly affect its own output. For people working
on live migration, this would not be ideal. I am not 100% sure this is how lseek
works, so please feel free to contradict me, but I think it would swap in some
of the pages that it seeks through, if not all, to figure out where to stop,
unless it leverages the page cache somehow, or an internal bitmap.
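
For reference, a minimal sketch of the hole-skipping walk being discussed,
assuming Linux and Python 3. The helper name data_extents is made up for
illustration. It only shows how the data ranges are enumerated; it does not
answer whether the kernel swaps anything in while doing so.

```python
import errno
import os


def data_extents(fd):
    """Return [(start, end), ...] byte ranges that lseek() reports as data."""
    extents = []
    offset = 0
    size = os.fstat(fd).st_size
    while offset < size:
        try:
            # Jump to the next byte that is not part of a hole.
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:  # no data at or after offset
                break
            raise
        # The end of the file counts as an implicit hole, so this succeeds.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        extents.append((start, end))
        offset = end
    return extents
```

The fd could be one opened from /proc/<pid>/map_files, as mentioned further
down in this thread.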

Secondly, mincore() could return some "false positives" for this particular use
case. That is because it also reports pages which are still in the swap cache
as resident (flag=1), so the output becomes ambiguous.

I am not saying this would never be needed. Some people could actually be
looking for exactly this scenario, and lseeking during the check could be an
advantage. It just does not look very flexible. That is why the pagemap would
have been ideal for us.

Alternatively, to get all logically swapped out pages, combining lseek with the
pagemap should do the trick. As you said, we remove holes with lseek, but we
remove in-memory pages with is_present(pte) instead. This solution would still
suffer from my first concern, but it should do the job.
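
A rough sketch of that combination, assuming Linux and Python 3. The names
read_pagemap and classify_page are hypothetical. The only kernel detail relied
on is the documented pagemap layout: one native-endian 64-bit entry per virtual
page, with bit 63 meaning "page present". Note that "not-present" here only
means the page is not present in this process's page tables; treating it as
swapped out follows the reasoning above and is an assumption.

```python
import mmap
import struct

PM_PRESENT = 1 << 63  # pagemap bit 63: page present in RAM


def read_pagemap(pid, vaddr, npages, page_size=mmap.PAGESIZE):
    """Read npages raw pagemap entries starting at virtual address vaddr."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * 8)  # 8 bytes per virtual page
        raw = f.read(npages * 8)
    return list(struct.unpack(f"={len(raw) // 8}Q", raw))


def classify_page(in_data_extent, pagemap_entry):
    """Classify one page from the lseek result and its pagemap entry."""
    if not in_data_extent:
        return "hole"          # lseek says nothing was ever stored here
    if pagemap_entry & PM_PRESENT:
        return "present"       # in memory and mapped by this process
    return "not-present"       # candidate for "logically swapped out"
```

For example, classify_page(True, 0) gives "not-present", while any page
outside a data extent is reported as "hole" regardless of its pagemap entry.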

>
>> Also, we still have no mechanism to retrieve the swap offsets of shmem pages
>> AFAIK. There is one more QEMU optimisation we are working on that requires
>> these mappings available outside of kernel space.
>
> How exactly would the swap offset really help? IMHO that's a kernel internal that shouldn't be of any value to user space -- it's merely for debugging purposes. But I'd love to learn details.

It is possible for the swap device to be network attached and shared, so multiple
hosts would need to understand its content. Then it is no longer internal to one
kernel only.

By being swap-aware, we can skip swapped-out pages during migration (to prevent
IO and potential thrashing), and transfer those pages in another way that is
zero-copy.
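
To illustrate what user space would read if the swap entry were exposed for
shmem pages, here is a small sketch in Python 3 that decodes the swap type and
offset from a raw pagemap entry. It follows the documented pagemap layout for
swapped pages (bit 62 set, bits 0-4 swap type, bits 5-54 swap offset). The
function name decode_swap_entry is made up for illustration.

```python
PM_SWAP = 1 << 62                      # pagemap bit 62: page is swapped
SWAP_TYPE_MASK = (1 << 5) - 1          # bits 0-4: swap type
SWAP_OFFSET_MASK = (1 << 50) - 1       # bits 5-54: swap offset


def decode_swap_entry(entry):
    """Return (swap_type, swap_offset), or None if the swap bit is clear."""
    if not entry & PM_SWAP:
        return None
    swap_type = entry & SWAP_TYPE_MASK
    swap_offset = (entry >> 5) & SWAP_OFFSET_MASK
    return swap_type, swap_offset
```

With these definitions, an entry built as PM_SWAP | (1234 << 5) | 3 decodes to
swap type 3 and swap offset 1234.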
>
> [...]
>
>>> If it has an fd and we can punch that into syscalls, we should much rather use that fd to lookup stuff then going via process page tables -- if possible of course (to be evaluated, because I haven't looked into the CRIU details and how they use lseek with anonymous shared memory).
>> I found out that it is possible to retrieve the fds of shmem/tmpfs file allocations
>> using proc/pid/map_files, which is neat. Still, CRIU does not seem to care
>> whether a page is swapped out or just empty, only if it is present on page cache.
>> The holes that lseek finds would not be able to infer this difference, AFAIK. Will
>> test the behaviour to make sure.
>
> CRIU wants to migrate everything. lseek() gives you the definitive answer what needs migration -- if it's swapped out or resident. Just skip the holes.

Thank you for the summary. I see why the use case is sufficient for CRIU then.
In our case, the optimisations aim to make QEMU migration swap-aware.

--
Kind regards,
Tibi Georgescu