Re: [PATCH 3/4 v0.4] sched/umcg: add Documentation/userspace-api/umcg.rst

From: Peter Oskolkov
Date: Wed Aug 04 2021 - 17:48:22 EST


On Wed, Aug 4, 2021 at 12:13 PM Thierry Delisle <tdelisle@xxxxxxxxxxxx> wrote:
>
> These state transition descriptions are very helpful, but what is not
> clear is the details of these transitions when there are concurrent
> wake/waits. I do not know enough about the kernel code to be able to
> read the implementation and answer my own questions.
>
> For example, imagine two worker threads W1 and W2. W1 adds itself to a
> concurrent list and calls umcg_wait(next_tid = 0). W2 pops from the list
> and calls umcg_wait(UMCG_WAIT_WAKE_ONLY | UMCG_WAIT_WF_CURRENT_CPU) on the
> popped worker, W1 in this example.

All _umcg_ state changes here happen in userspace before
sys_umcg_wait() is called. So:

W1: cmpxchg W1:RUNNING => W1:IDLE
- if OK, call sys_umcg_wait()
- if failed, do something else (notes below)

W2: cmpxchg W1:IDLE => W1:RUNNING
- if OK, lock itself, set W2:next_tid to W1, call sys_umcg_wait()
(which will neither block nor spin), then restore next_tid and
state/unlock upon syscall return
- if failed, do something else

So assuming the cmpxchg() calls succeeded, sys_umcg_wait() will be called.

W1 in sys_umcg_wait() (going to sleep):
- (1) mark itself as TASK_INTERRUPTIBLE (sleeping)
- (2) check _umcg_ state
- (a) if UMCG_RUNNING, mark itself as TASK_RUNNING, return to userspace
- (b) if still UMCG_IDLE, sleep
- (3) upon wakeup, go to step (1)

W2 in sys_umcg_wait() (waking W1):
- call try_to_wake_up(W1): if W1 is INTERRUPTIBLE, change it to
TASK_RUNNING, wake
- return

Note the ordering and interplay of UMCG state changes
(UMCG_IDLE/UMCG_RUNNING) and TASK state changes
(TASK_INTERRUPTIBLE/TASK_RUNNING).

As you can see, W2 neither blocks nor spins. W1 will either catch the
_umcg_ state change to UMCG_RUNNING and abort its sleep, or ttwu() will
wake it _after_ it has been marked UMCG_RUNNING.

Now what happens if the cmpxchg() calls above fail? That means W1 is
still RUNNING (it has not yet completed its RUNNING => IDLE transition)
when W2 tries to mark it RUNNING. This is a race in the userspace, and
two options are available:
- the userspace spins waiting for W1 to become IDLE (note that the
userspace spins, not the kernel)
- the userspace "queues the wakeup" and returns; the sleeper (W1) sees
wakeup_queued and does not go to sleep; this is the solution I
have/use. See the previous version here:
https://lore.kernel.org/patchwork/patch/1433971/, search for
UMCG_TF_WAKEUP_QUEUED.

In the current version (v0.4), WAKEUP_QUEUED is a purely userspace
flag, so it is not documented in the _kernel API_ doc. I'll post the
userspace parts (libumcg, selftests) a bit later. In short, wait/wake
races do not result in spinning, and sometimes even elide syscalls
entirely via the WAKEUP_QUEUED userspace state flag.

>
> If W1 calls umcg_wait first, W2 context-switches to W1 and W2's state
> changes to IDLE. My understanding is that wake detection/block does not
> apply to this case.
>
> If W2 calls umcg_wait first, what happens? I can imagine two different
> behaviour in this case:
>
> 1. W2 waits for W1 to call umcg_wait, by spinning or blocking, and then
> execution proceed like the first ordering.
>
> 2. W2 sets W1's state to RUNNING. When W1 eventually calls umcg_wait, it
> simply notices the state change and returns without context-switching.
> In this case, W1 is not migrated to W2's CPU.
>
> Behaviour 1 makes me uncomfortable since it means umcg_wait must wait for
> cooperation that potentially never comes.
>
> But in Behaviour 2, the state of W2 after both calls to umcg_wait is not
> clear to me, either. I could imagine that W2 is set to IDLE, but since W1
> is not migrated, W2 could also simply be left RUNNING.
>
> Which behaviour is correct and in what state does W2 end up?

W2 will always end up RUNNING, as everything here is about W1. W2 will
never sleep nor spin: just a couple of atomic ops, and maybe a syscall
that likewise does nothing more than a couple of atomic ops.

>
> Thierry
>