Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

From: John Hubbard
Date: Wed Jan 02 2019 - 22:27:24 EST


On 1/2/19 5:55 PM, Jerome Glisse wrote:
> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
>>>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
>>>> *only* the tracking pinned pages aspect), given that it is the lightest weight
>>>> solution for that.
>>>>
>>>> So as I understand it, this would use page->_mapcount to store both the real
>>>> mapcount, and the dma pinned count (simply added together), but only do so for
>>>> file-backed (non-anonymous) pages:
>>>>
>>>>
>>>> __get_user_pages()
>>>> {
>>>> ...
>>>> get_page(page);
>>>>
>>>> if (!PageAnon)
>>>> atomic_inc(page->_mapcount);
>>>> ...
>>>> }
>>>>
>>>> put_user_page(struct page *page)
>>>> {
>>>> ...
>>>> if (!PageAnon)
>>>> atomic_dec(&page->_mapcount);
>>>>
>>>> put_page(page);
>>>> ...
>>>> }
>>>>
>>>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
>>>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you
>>>> had in mind?
>>>
>>> Mostly, with the extra two observations:
>>> [1] We only need to know the pin count when a write back kicks in
>>> [2] We need to protect GUP code with wait_for_write_back() in case
>>> GUP is racing with a write back that might not the see the
>>> elevated mapcount in time.
>>>
>>> So for [2]
>>>
>>> __get_user_pages()
>>> {
>>> get_page(page);
>>>
>>> if (!PageAnon) {
>>> atomic_inc(page->_mapcount);
>>> + if (PageWriteback(page)) {
>>> + // Assume we are racing and curent write back will not see
>>> + // the elevated mapcount so wait for current write back and
>>> + // force page fault
>>> + wait_on_page_writeback(page);
>>> + // force slow path that will fault again
>>> + }
>>> }
>>> }
>>
>> This is not needed AFAICT. __get_user_pages() gets page reference (and it
>> should also increment page->_mapcount) under PTE lock. So at that point we
>> are sure we have writeable PTE nobody can change. So page_mkclean() has to
>> block on PTE lock to make PTE read-only and only after going through all
>> PTEs like this, it can check page->_mapcount. So the PTE lock provides
>> enough synchronization.
>>
>>> For [1] only needing pin count during write back turns page_mkclean into
>>> the perfect spot to check for that so:
>>>
>>> int page_mkclean(struct page *page)
>>> {
>>> int cleaned = 0;
>>> + int real_mapcount = 0;
>>> struct address_space *mapping;
>>> struct rmap_walk_control rwc = {
>>> .arg = (void *)&cleaned,
>>> .rmap_one = page_mkclean_one,
>>> .invalid_vma = invalid_mkclean_vma,
>>> + .mapcount = &real_mapcount,
>>> };
>>>
>>> BUG_ON(!PageLocked(page));
>>>
>>> if (!page_mapped(page))
>>> return 0;
>>>
>>> mapping = page_mapping(page);
>>> if (!mapping)
>>> return 0;
>>>
>>> // rmap_walk need to change to count mapping and return value
>>> // in .mapcount easy one
>>> rmap_walk(page, &rwc);
>>>
>>> // Big fat comment to explain what is going on
>>> + if ((page_mapcount(page) - real_mapcount) > 0) {
>>> + SetPageDMAPined(page);
>>> + } else {
>>> + ClearPageDMAPined(page);
>>> + }
>>
>> This is the detail I'm not sure about: Why cannot rmap_walk_file() race
>> with e.g. zap_pte_range() which decrements page->_mapcount and thus the
>> check we do in page_mkclean() is wrong?
>>
>
> Ok so i found a solution for that. First GUP must wait for racing
> write back. If GUP see a valid write-able PTE and the page has
> write back flag set then it must back of as if the PTE was not
> valid to force fault. It is just a race with page_mkclean and we
> want ordering between the two. Note this is not strictly needed
> so we can relax that but i believe this ordering is better to do
> in GUP rather then having each single user of GUP test for this
> to avoid the race.
>
> GUP increase mapcount only after checking that it is not racing
> with writeback it also set a page flag (SetPageDMAPined(page)).
>
> When clearing a write-able pte we set a special entry inside the
> page table (might need a new special swap type for this) and change
> page_mkclean_one() to clear to 0 those special entry.
>
>
> Now page_mkclean:
>
> int page_mkclean(struct page *page)
> {
> int cleaned = 0;
> + int real_mapcount = 0;
> struct address_space *mapping;
> struct rmap_walk_control rwc = {
> .arg = (void *)&cleaned,
> .rmap_one = page_mkclean_one,
> .invalid_vma = invalid_mkclean_vma,
> + .mapcount = &real_mapcount,
> };
> + int mapcount1, mapcount2;
>
> BUG_ON(!PageLocked(page));
>
> if (!page_mapped(page))
> return 0;
>
> mapping = page_mapping(page);
> if (!mapping)
> return 0;
>
> + mapcount1 = page_mapcount(page);
>
> // rmap_walk need to change to count mapping and return value
> // in .mapcount easy one
> rmap_walk(page, &rwc);
>
> + if (PageDMAPined(page)) {
> + int rc2;
> +
> + if (mapcount1 == real_count) {
> + /* Page is no longer pin, no zap pte race */
> + ClearPageDMAPined(page);
> + goto out;
> + }
> + /* No new mapping of the page so mp1 < rc is illegal. */
> + VM_BUG_ON(mapcount1 < real_count);
> + /* Page might be pin. */
> + mapcount2 = page_mapcount(page);
> + if (mapcount2 > real_count) {
> + /* Page is pin for sure. */
> + goto out;
> + }
> + /* We had a race with zap pte we need to rewalk again. */
> + rc2 = real_mapcount;
> + real_mapcount = 0;
> + rwc.rmap_one = page_pin_one;
> + rmap_walk(page, &rwc);
> + if (mapcount2 <= (real_count + rc2)) {
> + /* Page is no longer pin */
> + ClearPageDMAPined(page);
> + }
> + /* At this point the page pin flag reflect pin status of the page */

Until...what? In other words, what is providing synchronization here?

thanks,
--
John Hubbard
NVIDIA

> + }
> +
> +out:
> ...
> }
>
> The page_pin_one() function count the number of special PTE entry so
> which match the count of pte that have been zapped since the first
> reverse map walk.
>
> So worst case a page that was pin by a GUP would need 2 reverse map
> walk during page_mkclean(). Moreover this is only needed if we race
> with something that clear pte. I believe this is an acceptable worst
> case. I will work on some RFC patchset next week (once i am down with
> email catch up).
>
>
> I do not think i made mistake here, i have been torturing my mind
> trying to think of any race scenario and i believe it holds to any
> racing zap and page_mkclean()
>
> Cheers,
> JÃrÃme
>