RE: find_get_page() VS pin_user_pages()

From: Teterevkov, Ivan
Date: Wed Apr 12 2023 - 05:04:46 EST


From: Alistair Popple <apopple@xxxxxxxxxx>

> "Teterevkov, Ivan" <Ivan.Teterevkov@xxxxxxx> writes:
>
> > Hello folks,
> >
> > I work with an application which aims to share memory in userspace and
> > interact with the NIC DMA. The memory allocation workflow begins in
> > userspace, which creates a new file backed by 2MiB hugepages with
> > memfd_create(MFD_HUGETLB | MFD_HUGE_2MB) and fallocate(). Then the
> > userspace makes an ioctl() to the kernel module with the file descriptor
> > and size so that the kernel module can get the struct page with
> > find_get_page(). Then the kernel module calls
> > dma_map_single(page_address(page)) for the NIC, which concludes the
> > datapath. The allocated memory may (significantly) outlive the originating
> > userspace application. The hugepages stay mapped for the NIC, and the
> > kernel module wants to continue using them and map them into other
> > applications that come and go with vm_mmap().
> >
> > I am studying the pin_user_pages*() family of functions, and I wonder if
> > the outlined workflow requires them. The hugepages do not page out, but
> > they can move as they may be allocated with GFP_HIGHUSER_MOVABLE. However,
> > find_get_page() should increment the page reference count even without a
> > mapping and thereby prevent the page from moving. In particular,
> > https://docs.kernel.org/mm/page_migration.html:
>
> I'm not super familiar with the memfd_create()/find_get_page() workflow
> but is there some reason you're not using pin_user_pages*(FOLL_LONGTERM)
> to get the struct page initially? Your description above sounds exactly
> like the use case pin_user_pages() was designed for because it marks
> the page as being written to by DMA, makes sure it's not in a movable
> zone, etc.
>

The biggest obstacle with the application workflow is that the memory
allocation is mostly kernel-driven. The kernel module may want to set up DMA
for the hugepages before the userspace application maps them into its address
space, so the kernel module does not have a starting user address at hand.
I believe one kernel-side workaround would be to vm_mmap() the memfd,
pin_user_pages(FOLL_LONGTERM), and possibly vm_munmap() shortly afterwards if
we do not want to keep the pages mapped in the originating application. The
temporary mapping in the calling process is a side effect, but the pinning
would stay in place until the kernel module unpins the pages with
unpin_user_page().
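
Roughly, I imagine the kernel-side workaround looking like the sketch below.
The names are made up, error unwinding is trimmed, and I am assuming that the
non-_fast pin_user_pages() still wants the caller to hold the mmap_lock (older
kernels also take a trailing vmas argument, where NULL can be passed):

#include <linux/err.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/sched.h>

/*
 * Sketch only: pin the memfd-backed hugepages by temporarily mapping them
 * into the calling task.  'file' is the memfd's struct file taken from the
 * fd passed in the ioctl, 'size' is the fallocate()d length, and 'pages' is
 * a caller-allocated array with size >> PAGE_SHIFT entries.
 */
static long pin_memfd_pages(struct file *file, size_t size,
			    struct page **pages)
{
	unsigned long nr_pages = size >> PAGE_SHIFT;
	unsigned long uaddr;
	long pinned;

	/* Side effect: the memfd briefly shows up in this task's mm. */
	uaddr = vm_mmap(file, 0, size, PROT_READ | PROT_WRITE, MAP_SHARED, 0);
	if (IS_ERR_VALUE(uaddr))
		return (long)uaddr;

	mmap_read_lock(current->mm);
	pinned = pin_user_pages(uaddr, nr_pages,
				FOLL_WRITE | FOLL_LONGTERM, pages);
	mmap_read_unlock(current->mm);

	/*
	 * The mapping was only a vehicle for the pin: the FOLL_LONGTERM pins
	 * survive vm_munmap() until unpin_user_pages() is called.
	 */
	vm_munmap(uaddr, size);

	return pinned;
}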

The fact that pin_user_pages*() operates on behalf of the userspace
application made me think that the pinning was not designed to outlive the
application, but perhaps that is exactly what FOLL_LONGTERM is for, in
contrast with a plain FOLL_PIN?

> >> How migrate_pages() works
> >> ...
> >> Steps:
> >> ...
> >> 4. All the page table references to the page are converted to migration
> >> entries. This decreases the mapcount of a page. If the resulting mapcount
> >> is not zero then we do not migrate the page.
> >
> > Does find_get_page() achieve that condition or does the outlined workflow
> > still require pin_user_pages*() for safe DMA?
>
> Yes. The extra page reference will prevent the migration regardless of
> mapcount being zero or not. See folio_expected_refs() for how the extra
> reference is detected.
>

Thank you for pointing out folio_expected_refs(). I see that as soon as the
reference count exceeds the number returned by folio_expected_refs(), the page
is effectively pinned and cannot be migrated. However, relying on that plain
extra reference reduces the mobility of pages coming from ZONE_MOVABLE, which
makes pin_user_pages*() preferable.
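
For completeness, my (possibly naive) reading of the migration path is that
the gist of the check boils down to something like the following paraphrase,
not verbatim kernel code:

	/*
	 * Paraphrase of the refcount check done while migrating a folio:
	 * migration only proceeds when nobody holds references beyond the
	 * ones the mapping itself accounts for, so an extra reference taken
	 * via find_get_page() makes the migration attempt bail out.
	 */
	int expected = folio_expected_refs(mapping, folio);

	if (folio_ref_count(folio) != expected)
		return -EAGAIN;		/* someone else holds a reference */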

Thanks,
Ivan