Re: [RFC][PATCH] x86, pkeys: scalable pkey_set()/pkey_get()

From: Dave Hansen
Date: Fri Jul 08 2016 - 10:54:50 EST


On 07/08/2016 03:35 AM, Mel Gorman wrote:
> On Thu, Jul 07, 2016 at 04:09:22PM -0700, Dave Hansen wrote:
>> b/arch/x86/include/asm/pkeys.h | 39 ++++++++++++++++++++++++++++++++++-----
>> b/mm/mprotect.c | 4 ----
>> 2 files changed, 34 insertions(+), 9 deletions(-)
>>
>> diff -puN mm/mprotect.c~pkeys-119-fast-set-get mm/mprotect.c
>> --- a/mm/mprotect.c~pkeys-119-fast-set-get 2016-07-07 12:25:49.582075153 -0700
>> +++ b/mm/mprotect.c 2016-07-07 12:42:50.516384977 -0700
>> @@ -542,10 +542,8 @@ SYSCALL_DEFINE2(pkey_get, int, pkey, uns
>> if (flags)
>> return -EINVAL;
>>
>> - down_write(&current->mm->mmap_sem);
>> if (!mm_pkey_is_allocated(current->mm, pkey))
>> ret = -EBADF;
>> - up_write(&current->mm->mmap_sem);
>>
>> if (ret)
>> return ret;
>
> This does allow the possibility of
>
> thread a thread b
> pkey_get enter
> pkey_free
> pkey_alloc
> pkey_get leave
>
> The kernel can tell if the key is allocated but not if it is the same
> allocation userspace expected or not. That's why I thought this may need
> to be a sequence counter. Unfortunately, now I realise that even that is
> insufficient because the seqcounter would only detect that something
> changed, it would have no idea if the pkey of interest was affected or
> not. It gets rapidly messy after that.
>
> Userspace may have no choice other than to serialise itself but the
> documentation needs to be clear that the above race is possible.

Yeah, I'll clarify the documentation. But, I do think this is one of
those races like an stat(). A stat() tells you that a file was once
there with so and so properties, but it does not mean that it is there
any more or that what _is_ there is the same thing you stat()'d.

>> diff -puN arch/x86/include/asm/pkeys.h~pkeys-119-fast-set-get arch/x86/include/asm/pkeys.h
>> --- a/arch/x86/include/asm/pkeys.h~pkeys-119-fast-set-get 2016-07-07 12:26:19.265421712 -0700
>> +++ b/arch/x86/include/asm/pkeys.h 2016-07-07 15:18:15.391642423 -0700
>> @@ -35,18 +35,47 @@ extern int __arch_set_user_pkey_access(s
>>
>> #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3)
>>
>> +#define PKEY_MAP_SET 1
>> +#define PKEY_MAP_CLEAR 2
>> #define mm_pkey_allocation_map(mm) (mm->context.pkey_allocation_map)
>> -#define mm_set_pkey_allocated(mm, pkey) do { \
>> - mm_pkey_allocation_map(mm) |= (1U << pkey); \
>> +static inline
>> +void mm_modify_pkey_alloc_map(struct mm_struct *mm, int pkey, int setclear)
>> +{
>> + u16 new_map = mm_pkey_allocation_map(mm);
>> + if (setclear == PKEY_MAP_SET)
>> + new_map |= (1U << pkey);
>> + else if (setclear == PKEY_MAP_CLEAR)
>> + new_map &= ~(1U << pkey);
>> + else
>> + BUILD_BUG_ON(1);
>> + /*
>> + * Make sure that mm_pkey_is_allocated() callers never
>> + * see intermediate states by using WRITE_ONCE().
>> + * Concurrent calls to this function are excluded by
>> + * down_write(mm->mmap_sem) so we only need to protect
>> + * against readers.
>> + */
>> + WRITE_ONCE(mm_pkey_allocation_map(mm), new_map);
>> +}
>
> What prevents two pkey_set operations overwriting each others change with
> WRITE_ONCE? Does this not need to be a cmpxchg read-modify-write loops?

pkey_set() only reads the allocation map and only writes to PKRU which
is thread-local.

The writers to the allocation map are pkey_alloc()/free() and those are
still mmap_sem-protected.