Re: [PATCH v1 2/3] perf auxtrace: Add compat_auxtrace_mmap__{read_head|write_tail}

From: James Clark
Date: Mon Aug 23 2021 - 06:58:02 EST




On 23/08/2021 10:51, Leo Yan wrote:
> Hi James,
>
> On Fri, Aug 13, 2021 at 05:22:31PM +0100, James Clark wrote:
>> On 09/08/2021 12:27, Leo Yan wrote:
>>> +/*
>>> + * In the compat mode kernel runs in 64-bit and perf tool runs in 32-bit mode,
>>> + * 32-bit perf tool cannot access 64-bit value atomically, which might lead to
>>> + * the issues caused by the below sequence on multiple CPUs: when perf tool
>>> + * accesses either the load operation or the store operation for 64-bit value,
>>> + * on some architectures the operation is divided into two instructions, one
>>> + * is for accessing the low 32-bit value and another is for the high 32-bit;
>>> + * thus these two user operations can give the kernel chances to access the
>>> + * 64-bit value, and thus leads to the unexpected load values.
>>> + *
>>> + * kernel (64-bit) user (32-bit)
>>> + *
>>> + * if (LOAD ->aux_tail) { --, LOAD ->aux_head_lo
>>> + * STORE $aux_data | ,--->
>>> + * FLUSH $aux_data | | LOAD ->aux_head_hi
>>> + * STORE ->aux_head --|-------` smp_rmb()
>>> + * } | LOAD $data
>>> + * | smp_mb()
>>> + * | STORE ->aux_tail_lo
>>> + * `----------->
>>> + * STORE ->aux_tail_hi
>>> + *
>>> + * For this reason, it's impossible for the perf tool to work correctly when
>>> + * the AUX head or tail is bigger than 4GB (more than 32 bits length); and we
>>> + * can not simply limit the AUX ring buffer to less than 4GB, the reason is
>>> + * the pointers can be increased monotonically, whatever the buffer size it is,
>>> + * at the end the head and tail can be bigger than 4GB and carry out to the
>>> + * high 32-bit.
>>> + *
>>> + * To mitigate the issues and improve the user experience, we can allow the
>>> + * perf tool working in certain conditions and bail out with error if detect
>>> + * any overflow cannot be handled.
>>> + *
>>> + * For reading the AUX head, it reads out the values for three times, and
>>> + * compares the high 4 bytes of the values between the first time and the last
>>> + * time, if there has no change for high 4 bytes injected by the kernel during
>>> + * the user reading sequence, it's safe for use the second value.
>>> + *
>>> + * When update the AUX tail and detects any carrying in the high 32 bits, it
>>> + * means there have two store operations in user space and it cannot promise
>>> + * the atomicity for 64-bit write, so return '-1' in this case to tell the
>>> + * caller an overflow error has happened.
>>> + */
>>> +u64 __weak compat_auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
>>> +{
>>> + struct perf_event_mmap_page *pc = mm->userpg;
>>> + u64 first, second, last;
>>> + u64 mask = (u64)(UINT32_MAX) << 32;
>>> +
>>> + do {
>>> + first = READ_ONCE(pc->aux_head);
>>> + /* Ensure all reads are done after we read the head */
>>> + smp_rmb();
>>> + second = READ_ONCE(pc->aux_head);
>>> + /* Ensure all reads are done after we read the head */
>>> + smp_rmb();
>>> + last = READ_ONCE(pc->aux_head);
>>> + } while ((first & mask) != (last & mask));
>>> +
>>> + return second;
>>> +}
>>> +
>>
>> Hi Leo,
>>
>> I had a couple of questions about this bit. If we're assuming that the
>> high bytes of 'first' and 'last' are equal, then 'second' is supposed
>> to be somewhere in between or equal to 'first' and 'last'.
>>
>> If that's the case, wouldn't it be better to return 'last', because it's
>> closer to the value at the time of reading?
>
>> And then in that case, if last is returned, then why do a read for
>> 'second' at all? Can 'second' be skipped and just read first and last?
>
> Simply to say, the logic can be depicted as:
>
> step 1: read 'first'
> step 2: read 'second' -> There have no any atomicity risk if 'first'
> is same with 'last'
> step 3: read 'last'
>
> The key point is if the 'first' and 'last' have the same value in the
> high word, there have no any increment for high word in the middle of
> 'first' and 'last', so we don't worry about the atomicity for 'second'.
>
> But we cannot promise the atomicity for reading 'last', let's see
> below sequence:
>
> CPU(a) CPU(b)
> step 1: read 'first' (high word)
> read 'first' (low word)
> step 2: read 'second' (high word)
> read 'second' (low word)
> step 3: read 'last' (high word)
> --> write 'last' (high word)
> --> write 'last' (low word)
> read 'last' (low word)
>
>
> Even 'first' and 'last' have the same high word, but the 'last' cannot
> be trusted.
>
>> Also maybe it won't make a difference, but is there a missing smp_rmb()
>> between the read of 'last' and 'first'?
>
> Good question, from my understanding, we only need to promise the flow
> from step 1 to step 3, it's not necessary to add barrier in the middle
> of the two continuous loops.
>
> Thanks for reviewing!
>

Ok thanks for the explanation, that makes sense now. I do have one other
point about the documentation for the function:

> + * When update the AUX tail and detects any carrying in the high 32 bits, it
> + * means there have two store operations in user space and it cannot promise
> + * the atomicity for 64-bit write, so return '-1' in this case to tell the
> + * caller an overflow error has happened.
> + */

I couldn't see how it can ever return -1, it seems like it would loop forever
until it reads the correct value.


> Leo
>