Re: [PATCH v3 3/7] mm: Add write-protect and clean utilities for address space ranges

From: Thomas Hellstrom
Date: Thu Oct 03 2019 - 03:56:57 EST


On 10/2/19 10:28 PM, Linus Torvalds wrote:
> On Wed, Oct 2, 2019 at 12:09 PM Thomas Hellström (VMware)
> <thomas_os@xxxxxxxxxxxx> wrote:
>> Yes I typically tend towards using a "namespace_object_operation" naming
>> scheme, with "as_dirty" being the namespace here,
> We discourage that kind of mindless namespacing for core stuff.
>
> It makes sense in a driver or a filesystem: when there are 20+
> different filesystems all implementing the same operation, you can't
> have descriptive and natural names that are unique. So having a
> namespace prefix for the filesystem or driver makes sense. But even
> then it tends ot be just a simple name, not the op it does.

Understood.

>
>> Looking at Matthew's suggestion but lining up with
>> "unmap_mapping_range()", perhaps we could use "clean_mapping_range" and
>> "wp_mapping_range"?
> Yes, I agree with Willy that "dirty" is kind of implicit when the
> operation is to clean something, so the above sounds sane to me.
>
> The wp part I'm not entirely sure about: you're not actually
> write-protecting the range. You're doing something different. You're
> only doing it for shared writable mappings.

Both the cleaning operation and the wp operation operate on shared
writable mappings, and since they are also both restricted to entries
that may take part in dirty-tracking (they're ignoring pmds and puds),
so perhaps a "dirty" may make sense anyway, and to point out the similarity:

clean_mapping_dirty_range() and wp_mapping_dirty_range()"?

> And I'd rather see
> separate 'struct mm_walk_ops' than shared ones that then have a flag
> in a differfent structure to change behavior.
>
> Yes, yes, some of the levels share the same logic, but that just means
> that you can use the same function pointer for that level in the
> different "clean" vs "wp" mm_walk_op.

I think that this comment and this last one:

> Also, why do you loop inside the pmd_entry function, instead of just
> having a pte_entry function?"

are tied together. The pagewalk code is kind of awkward since if you
provide a pte_entry function,
then huge pmds are unconditionally split, even if handled in pmd_entry,
forcing pmd-aware users to handle also ptes in pmd_entry(). I'm not
entirely sure why, but it looks like it's to avoid a race where huge
pmds could otherwise be unintentionally split if appearing *after* the
pmd_entry() call. Other pagewalk users do the same here, see for
example clear_refs_pte_range();

https://elixir.bootlin.com/linux/latest/source/fs/proc/task_mmu.c#L1040

Also the pagewalk walk_pte_range() for some reason doesn't take the page
table lock, which means that a pte may well be cleared under us while we
have it cached for modification.

For these reasons we can't use the pte_entry, even internally and this
means we have three choices:

a) Selecting the pte function using a bool. Saves code and avoids extra
indirect function call.
b) Duplicating the pmd_entry with a different pte function, also
duplicating the ops. Avoids extra indirect function call but some extra
code.
c) Instead of the bool, use another function pointer in yet another ops
pointed to by the private structure. Saves some code.

I opted for a) here, but I can of course change that if needed?

>
> Also, looking closer at this patch, this makes me go "Whaa?":
>
> + pte = (mm == &init_mm) ?
> + pte_offset_kernel(pmd, addr) :
> + pte_offset_map_lock(mm, pmd, addr, &ptl);
>
> because I don't think that's sensible. When do you have a vma in kernel space?
>
> Also, why do you loop inside the pmd_entry function, instead of just
> having a pte_entry function?

Yes, that was a blind copy from v1 of the code that used
"apply_to_page_range()". I'll fix that up.

Thanks,

/Thomas

>
> Linus
>