Re: [PATCH v3 1/7] mm: allow drivers to prevent new writable mappings

From: David Herrmann
Date: Sat Jul 19 2014 - 12:12:23 EST


Hi Hugh

On Wed, Jul 9, 2014 at 10:55 AM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> On Fri, 13 Jun 2014, David Herrmann wrote:
>
>> The i_mmap_writable field counts existing writable mappings of an
>> address_space. To allow drivers to prevent new writable mappings, make
>> this counter signed and prevent new writable mappings if it is negative.
>> This is modelled after i_writecount and DENYWRITE.
>>
>> This will be required by the shmem-sealing infrastructure to prevent any
>> new writable mappings after the WRITE seal has been set. In case there
>> exists a writable mapping, this operation will fail with EBUSY.
>>
>> Note that we rely on the fact that iff you already own a writable mapping,
>> you can increase the counter without using the helpers. This is the same
>> that we do for i_writecount.
>>
>> Signed-off-by: David Herrmann <dh.herrmann@xxxxxxxxx>
>
> I'm very pleased with the way you chose to do this in the end, following
> the way VM_DENYWRITE does it: it works out nicer than I had imagined.
>
> I do have some comments below, but the only one where I end up asking
> for a change is to remove the void "return "; but please read through
> in case any of my wonderings lead you to change your mind on something.
>
> I am very tempted to suggest that you add VM_WRITE into those VM_SHARED
> tests, and put the then necessary additional tests into mprotect.c: so
> that you would no longer have the odd case that a VM_SHARED !VM_WRITE
> area has to be unmapped before sealing the fd.
>
> But that would involve an audit of flush_dcache_page() architectures,
> and perhaps some mods to keep them safe across the mprotect(): let's
> put it down as a desirable future enhancement, just to remove an odd
> restriction.
>
> With the "return " gone,
> Acked-by: Hugh Dickins <hughd@xxxxxxxxxx>

Thanks for the review! Yes, allowing VM_SHARED without VM_WRITE would
be nice, but I think it would make this patchset more complex than it
already is. I agree that it's better done separately.
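
For context, the sealing side later in the series is expected to do
something along these lines when F_SEAL_WRITE is requested (a simplified
sketch only; locking and error unwinding omitted):

	if (seals & F_SEAL_WRITE) {
		/* Fails with -EBUSY if a writable mapping exists;
		 * otherwise pushes i_mmap_writable negative so that
		 * mapping_map_writable() refuses new VM_SHARED
		 * mappings with -EPERM from now on. */
		error = mapping_deny_writable(file->f_mapping);
		if (error)
			return error;
	}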

>> ---
>> fs/inode.c | 1 +
>> include/linux/fs.h | 29 +++++++++++++++++++++++++++--
>> kernel/fork.c | 2 +-
>> mm/mmap.c | 24 ++++++++++++++++++------
>> mm/swap_state.c | 1 +
>> 5 files changed, 48 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index 6eecb7f..9945b11 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -165,6 +165,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>> mapping->a_ops = &empty_aops;
>> mapping->host = inode;
>> mapping->flags = 0;
>> + atomic_set(&mapping->i_mmap_writable, 0);
>> mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
>> mapping->private_data = NULL;
>> mapping->backing_dev_info = &default_backing_dev_info;
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 338e6f7..71d17c9 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -387,7 +387,7 @@ struct address_space {
>> struct inode *host; /* owner: inode, block_device */
>> struct radix_tree_root page_tree; /* radix tree of all pages */
>> spinlock_t tree_lock; /* and lock protecting it */
>> - unsigned int i_mmap_writable;/* count VM_SHARED mappings */
>> + atomic_t i_mmap_writable;/* count VM_SHARED mappings */
>> struct rb_root i_mmap; /* tree of private and shared mappings */
>> struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
>> struct mutex i_mmap_mutex; /* protect tree, count, list */
>> @@ -470,10 +470,35 @@ static inline int mapping_mapped(struct address_space *mapping)
>> * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap_pgoff
>> * marks vma as VM_SHARED if it is shared, and the file was opened for
>> * writing i.e. vma may be mprotected writable even if now readonly.
>> + *
>> + * If i_mmap_writable is negative, no new writable mappings are allowed. You
>> + * can only deny writable mappings, if none exists right now.
>> */
>> static inline int mapping_writably_mapped(struct address_space *mapping)
>> {
>> - return mapping->i_mmap_writable != 0;
>> + return atomic_read(&mapping->i_mmap_writable) > 0;
>
> I have a slight anxiety that you're halving the writable range here:
> but I think that if we can overflow with this change, we could overflow
> before; and there are int counts elsewhere which would overflow easier.

Currently you need a new vm_area_struct to increase that counter. A
vm_area_struct is about 144 bytes, and 144 bytes times INT_MAX comes to
roughly 300GB. That does indeed look like a limit common machines will
be able to hit in the not too distant future. I somehow doubt it's the
only atomic that might overflow soon, though. And there's always
RLIMIT_AS, which can be set to prevent such attacks.
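
Back-of-the-envelope, assuming sizeof(struct vm_area_struct) == 144 on
x86-64 (the exact size varies by config): overflowing the counter takes
INT_MAX VMAs, i.e.

	144 bytes * INT_MAX = 144 * ~2^31 bytes ~= 288 GiB (~309 GB)

of kernel memory spent on vm_area_structs alone.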

I wouldn't oppose using atomic_long_t, though. Many VM counters
already use atomic_long_t instead of atomic_t.

>> +}
>> +
>> +static inline int mapping_map_writable(struct address_space *mapping)
>> +{
>> + return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
>> + 0 : -EPERM;
>> +}
>> +
>> +static inline void mapping_unmap_writable(struct address_space *mapping)
>> +{
>> + return atomic_dec(&mapping->i_mmap_writable);
>
> Better just
>
> atomic_dec(&mapping->i_mmap_writable);
>
> I didn't realize before that the compiler allows you to return a void
> function from a void function without warning, but better not rely on that.

Fixed, thanks.

>> +}
>> +
>> +static inline int mapping_deny_writable(struct address_space *mapping)
>> +{
>> + return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
>> + 0 : -EBUSY;
>> +}
>> +
>> +static inline void mapping_allow_writable(struct address_space *mapping)
>> +{
>> + atomic_inc(&mapping->i_mmap_writable);
>> }
>>
>> /*
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index d2799d1..f1f127e 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -421,7 +421,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>> atomic_dec(&inode->i_writecount);
>> mutex_lock(&mapping->i_mmap_mutex);
>> if (tmp->vm_flags & VM_SHARED)
>> - mapping->i_mmap_writable++;
>> + atomic_inc(&mapping->i_mmap_writable);
>> flush_dcache_mmap_lock(mapping);
>> /* insert tmp into the share list, just after mpnt */
>> if (unlikely(tmp->vm_flags & VM_NONLINEAR))
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 129b847..19b6562 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -216,7 +216,7 @@ static void __remove_shared_vm_struct(struct vm_area_struct *vma,
>> if (vma->vm_flags & VM_DENYWRITE)
>> atomic_inc(&file_inode(file)->i_writecount);
>> if (vma->vm_flags & VM_SHARED)
>> - mapping->i_mmap_writable--;
>> + mapping_unmap_writable(mapping);
>
> Okay. I waste ridiculous time trying to decide whether it's better to say
>
> mapping_unmap_writable(mapping)
> or
> atomic_dec(&mapping->i_mmap_writable);
>
> here: there are consistency arguments both ways. I usually solve this
> problem by not giving wrapper names to such operations, but in this case
> I think the answer is just to accept whatever you decided was best.
>
>>
>> flush_dcache_mmap_lock(mapping);
>> if (unlikely(vma->vm_flags & VM_NONLINEAR))
>> @@ -617,7 +617,7 @@ static void __vma_link_file(struct vm_area_struct *vma)
>> if (vma->vm_flags & VM_DENYWRITE)
>> atomic_dec(&file_inode(file)->i_writecount);
>> if (vma->vm_flags & VM_SHARED)
>> - mapping->i_mmap_writable++;
>> + atomic_inc(&mapping->i_mmap_writable);
>>
>> flush_dcache_mmap_lock(mapping);
>> if (unlikely(vma->vm_flags & VM_NONLINEAR))
>> @@ -1572,6 +1572,11 @@ munmap_back:
>> if (error)
>> goto free_vma;
>> }
>> + if (vm_flags & VM_SHARED) {
>> + error = mapping_map_writable(file->f_mapping);
>> + if (error)
>> + goto allow_write_and_free_vma;
>> + }
>> vma->vm_file = get_file(file);
>> error = file->f_op->mmap(file, vma);
>> if (error)
>> @@ -1611,8 +1616,12 @@ munmap_back:
>>
>> vma_link(mm, vma, prev, rb_link, rb_parent);
>> /* Once vma denies write, undo our temporary denial count */
>> - if (vm_flags & VM_DENYWRITE)
>> - allow_write_access(file);
>> + if (file) {
>> + if (vm_flags & VM_SHARED)
>> + mapping_unmap_writable(file->f_mapping);
>> + if (vm_flags & VM_DENYWRITE)
>> + allow_write_access(file);
>> + }
>> file = vma->vm_file;
>
> Very good. A bit subtle, and for a while I was preparing to warn you
> of the odd shmem_zero_setup() case (which materializes a file when
> none came in), and the danger of the count going wrong on that.
>
> But you have it handled, with the "if (file)" block immediately before
> the new assignment of file, which should force anyone to think about it.
> And the ordering here, with respect to file and error exits, has always
> been tricky - I don't think you are making it any worse.

I tried to keep it as similar to the VM_DENYWRITE handling as
possible. It's not really obvious and I had to read it several times
myself, but it all looks fine to me now.

> Oh, more subtle than I realized above: I've just gone through a last
> minute "hey, it's wrong; oh, no, it's okay" waver here, remembering
> how mmap /dev/zero (and some others, I suppose) gives you a new file.
>
> That is okay: we don't even have to assume that sealing is limited
> to your memfd_create(): it's safe to assume that any device which
> forks you a new file, is safe to assert mapping_map_writable() upon
> temporarily; and the new file given could never come already sealed.
>
> But maybe a brief comment to reassure us in future?

I added a comment explaining that ->mmap() must make sure vma_link()
can still raise these counters if it changes vma->vm_file. All users I
found return files that haven't been exposed to user-space at that
point, so they're safe. They also don't apply initial seals or
VM_DENYWRITE, so they're safe in both regards.
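
The comment sits right above the get_file()/->mmap() calls; roughly
(exact wording may differ in v4):

	/*
	 * ->mmap() may change vma->vm_file, but it must guarantee that
	 * vma_link() below can deny write access if VM_DENYWRITE is
	 * set and map the file writably if VM_SHARED is set. In
	 * practice this means the new file must not have been exposed
	 * to user-space yet.
	 */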

>> out:
>> perf_event_mmap(vma);
>> @@ -1641,14 +1650,17 @@ out:
>> return addr;
>>
>> unmap_and_free_vma:
>> - if (vm_flags & VM_DENYWRITE)
>> - allow_write_access(file);
>> vma->vm_file = NULL;
>> fput(file);
>>
>> /* Undo any partial mapping done by a device driver. */
>> unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
>> charged = 0;
>> + if (vm_flags & VM_SHARED)
>> + mapping_unmap_writable(file->f_mapping);
>> +allow_write_and_free_vma:
>> + if (vm_flags & VM_DENYWRITE)
>> + allow_write_access(file);
>
> Okay. I don't enjoy that rearrangement, such that those two uses of
> file come after the fput(file); but I believe there are good reasons
> why it can never be the final fput(file), so I think you're okay.

Actually, the fput(file) here should conceptually be an
"fput(vma->vm_file)": the reference we drop is not the reference passed
in from the syscall context, but the one we took for vma->vm_file.
Therefore, I think it's fine to access "file" afterwards, as the
caller's reference is independent. The vma->vm_file reference is also
taken _after_ the VM_DENYWRITE and VM_SHARED accounting, so I put the
cleanup of vma->vm_file before reverting the counters, for symmetry's
sake.

Note that we cannot turn fput(file) into fput(vma->vm_file), though.
The vma->vm_file field might be left set to something else after
->mmap() failed.
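
In other words, the error path unwinds in reverse order of the setup;
the intended pairing (a sketch, not literal code):

	setup (mmap_region)                  unwind on error
	deny_write_access(file)          <-> allow_write_access(file)
	mapping_map_writable(f_mapping)  <-> mapping_unmap_writable(f_mapping)
	vma->vm_file = get_file(file)    <-> vma->vm_file = NULL; fput(file)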

Thanks
David

>> free_vma:
>> kmem_cache_free(vm_area_cachep, vma);
>> unacct_error:
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 2972eee..31321fa 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -39,6 +39,7 @@ static struct backing_dev_info swap_backing_dev_info = {
>> struct address_space swapper_spaces[MAX_SWAPFILES] = {
>> [0 ... MAX_SWAPFILES - 1] = {
>> .page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
>> + .i_mmap_writable = ATOMIC_INIT(0),
>> .a_ops = &swap_aops,
>> .backing_dev_info = &swap_backing_dev_info,
>> }
>> --
>> 2.0.0