Re: [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE size hugetlb page

From: David Hildenbrand
Date: Wed Aug 24 2022 - 10:33:37 EST


On 24.08.22 16:30, Baolin Wang wrote:
>
>
> On 8/24/2022 7:55 PM, David Hildenbrand wrote:
>> On 24.08.22 11:41, Baolin Wang wrote:
>>>
>>>
>>> On 8/24/2022 3:31 PM, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> IMHO, these follow_huge_xxx() functions are arch-specified at first and
>>>>>>>> were moved into the common hugetlb.c by commit 9e5fc74c3025 ("mm:
>>>>>>>> hugetlb: Copy general hugetlb code from x86 to mm"), and now there are
>>>>>>>> still some arch-specified follow_huge_xxx() definition, for example:
>>>>>>>> ia64: follow_huge_addr
>>>>>>>> powerpc: follow_huge_pd
>>>>>>>> s390: follow_huge_pud
>>>>>>>>
>>>>>>>> What I mean is that follow_hugetlb_page() is a common and
>>>>>>>> not-arch-specified function, is it suitable to change it to be
>>>>>>>> arch-specified?
>>>>>>>> And thinking more, can we rename follow_hugetlb_page() as
>>>>>>>> hugetlb_page_faultin() and simplify it to only handle the page faults of
>>>>>>>> hugetlb like the faultin_page() for normal page? That means we can make
>>>>>>>> sure only follow_page_mask() can handle hugetlb.
>>>>>>>>
>>>>>>
>>>>>> Something like that might work, but you still have two page table walkers
>>>>>> for hugetlb. I like David's idea (if I understand it correctly) of
>>>>>
>>>>> What I mean is we may change the hugetlb handling like normal page:
>>>>> 1) use follow_page_mask() to look up a hugetlb firstly.
>>>>> 2) if can not get the hugetlb, then try to page fault by
>>>>> hugetlb_page_faultin().
>>>>> 3) if page fault successed, then retry to find hugetlb by
>>>>> follow_page_mask().
>>>>
>>>> That implies putting more hugetlbfs special code into generic GUP,
>>>> turning it even more complicated. But of course, it depends on how the
>>>> end result looks like. My gut feeling was that hugetlb is better handled
>>>> in follow_hugetlb_page() separately (just like we do with a lot of other
>>>> page table walkers).
>>>
>>> OK, fair enough.
>>>
>>>>>
>>>>> Just a rough thought, and I need more investigation for my idea and
>>>>> David's idea.
>>>>>
>>>>>> using follow_hugetlb_page for both cases. As noted, it will need to be
>>>>>> taught how to not trigger faults in the follow_page_mask case.
>>>>>
>>>>> Anyway, I also agree we need some cleanup, and firstly I think we should
>>>>> cleanup these arch-specified follow_huge_xxx() on some architectures
>>>>> which are similar with the common ones. I will look into these.
>>>>
>>>> There was a recent discussion on that, e.g.:
>>>>
>>>> https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpad
>>>
>>> Thanks.
>>>
>>>>
>>>>>
>>>>> However, considering cleanup may need more investigation and
>>>>> refactoring, now I prefer to make these bug-fix patches of this patchset
>>>>> into mainline firstly, which are suitable to backport to old version to
>>>>> fix potential race issues. Mike and David, how do you think? Could you
>>>>> help to review these patches? Thanks.
>>>>
>>>> Patch #1 certainly add more special code just to handle another hugetlb
>>>> corner case (CONT pages), and maybe just making it all use
>>>> follow_hugetlb_page() would be even cleaner and less error prone.
>>>>
>>>> I agree that locking is shaky, but I'm not sure if we really want to
>>>> backport this to stable trees:
>>>>
>>>> https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
>>>>
>>>> "It must fix a real bug that bothers people (not a, “This could be a
>>>> problem...” type thing)."
>>>>
>>>>
>>>> Do we actually have any instance of this being a real (and not a
>>>> theoretical) problem? If not, I'd rather clean it all up right away.
>>>
>>> I think this is a real problem (not theoretical), and easy to write some
>>> code to show the issue. For example, suppose thread A is trying to look
>>> up a CONT-PTE size hugetlb page under the lock, however antoher thread B
>>> can migrate the CONT-PTE hugetlb page at the same time, which will cause
>>> thread A to get an incorrect page, if thread A want to do something for
>>> this incorrect page, error occurs.
>>>
>>> Actually we also want to backport these fixes to the distro with old
>>> kernel versions to make the hugetlb more stable. Otherwise we must hit
>>> these issues sooner or later if the customers use CONT-PTE/PMD hugetlb.
>>>
>>> Anyway, if you and Mike still think these issues are not important
>>> enough to be fixed in the old versions, I can do the cleanup firstly.
>>>
>>
>> [asking myself which follow_page() users actually care about hugetlb,
>> and why we need this handling in follow_page at all]
>>
>> Which follow_page() user do we care about here? Primarily mm/migrate.c
>> only I assume?
>
> Right, mainly affects the move_pages() syscall I think. Yes, I can not
> know all of the users of the move_pages() syscall now or in the future
> in our data center, but like I said the move_pages() syscall + hugetlb
> can be a real potential stability issue.
>

I wonder if we can get rid of follow_page() completely, there are not
too many users. Or alternatively simply make it use general GUP
infrastructure more clearly. We'd need something like FOLL_NOFAULT that
also covers "absolutely no faults".

--
Thanks,

David / dhildenb