Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

From: Valentin Schneider
Date: Thu Jul 06 2023 - 07:32:28 EST


On 06/07/23 00:39, Peter Zijlstra wrote:
> On Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider wrote:
>
>> Note: A previous approach by PeterZ [1] used an extra bit in
>> context_tracking.state to flag the presence of deferred callbacks to
>> execute, and the actual callbacks were stored in a separate atomic
>> variable.
>>
>> This meant that the atomic read of context_tracking.state was sufficient to
>> determine whether there are any deferred callbacks to execute.
>> Unfortunately, it presents a race window. Consider the work setting
>> function as:
>>
>> preempt_disable();
>> seq = atomic_read(&ct->seq);
>> if (__context_tracking_seq_in_user(seq)) {
>> /* ctrl-dep */
>> atomic_or(work, &ct->work);
>> ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
>> }
>> preempt_enable();
>>
>> return ret;
>>
>> Then the following can happen:
>>
>> CPUx CPUy
>> CT_SEQ_WORK \in context_tracking.state
>> atomic_or(WORK_N, &ct->work);
>> ct_kernel_enter()
>> ct_state_inc();
>> atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
>>
>> The cmpxchg() would fail, ultimately causing an IPI for WORK_N to be
>> sent. Unfortunately, the work bit would remain set, and it can't be sanely
>> cleared in case another CPU set it concurrently - this would ultimately
>> lead to a double execution of the callback, one as a deferred callback and
>> one in the IPI. As not all IPI callbacks are idempotent, this is
>> undesirable.
>
> So adding another atomic is arguably worse.
>
> The thing is, if the NOHZ_FULL CPU is actually doing context transitions
> (SYSCALLs etc..) then everything is fundamentally racy, there is no
> winning that game, we could find the remote CPU is in-kernel, send an
> IPI, the remote CPU does return-to-user and receives the IPI.
>
> And then the USER is upset... because he got an IPI.
>

Yeah, that part is inevitably racy.

The thing I was especially worried about was the potential double
executions (once in IPI, again in deferred work). It's not /too/ bad as the
only two deferred callbacks I'm introducing here are costly-but-stateless,
but IMO is a bad foundation.

But it seems like we can reuse the existing atomic and squeeze some bits in
there, so let's see how that goes :-)