Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model

From: Anna-Maria Behnsen
Date: Fri Dec 08 2023 - 05:32:16 EST


Sebastian Siewior <bigeasy@xxxxxxxxxxxxx> writes:

> On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
>> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
>> new file mode 100644
>> index 000000000000..05cd8f1bc45d
>> --- /dev/null
>> +++ b/kernel/time/timer_migration.c
>> @@ -0,0 +1,1636 @@
>
>> + * Protection of the tmigr group state information:
>> + * ------------------------------------------------
>> + *
>> + * The state information with the list of active children and migrator needs to
>> + * be protected by a sequence counter. It prevents a race when updates in a
>
> s/a$//
>
>> + * child groups are propagated in changed order. The following scenario
>> + * describes what happens without updating the sequence counter:
>> + *
>> + * Therefore, let's take three groups and four CPUs (CPU2 and CPU3 as well
>> + * as GRP0:1 will not change during the scenario):
>> + *
>> + * LVL 1 [GRP1:0]
>> + * migrator = GRP0:1
>> + * active = GRP0:0, GRP0:1
>> + * / \
>> + * LVL 0 [GRP0:0] [GRP0:1]
>> + * migrator = CPU0 migrator = CPU2
>> + * active = CPU0 active = CPU2
>> + * / \ / \
>> + * CPUs 0 1 2 3
>> + * active idle active idle
>> + *
>> + *
>> + * 1. CPU0 goes idle (changes are updated in GRP0:0; afterwards the current
>> + * states of GRP0:0 and GRP1:0 are stored in the data for walking the
>> + * hierarchy):
>
> CPU0 goes idle. The state update is performed lock less and group
> wise. In the first step only GRP0:0 has been updated. The update of
> GRP1:0 is pending, the CPU walks through the hierarchy.
>
>> + *
>> + * LVL 1 [GRP1:0]
>> + * migrator = GRP0:1
>> + * active = GRP0:0, GRP0:1
>> + * / \
>> + * LVL 0 [GRP0:0] [GRP0:1]
>> + * --> migrator = TMIGR_NONE migrator = CPU2
>> + * --> active = active = CPU2
>> + * / \ / \
>> + * CPUs 0 1 2 3
>> + * --> idle idle active idle
>
>> + * 2. CPU1 comes out of idle (changes are update in GRP0:0; afterwards the
>> + * current states of GRP0:0 and GRP1:0 are stored in the data for walking the
>> + * hierarchy):
>
> While CPU0 goes idle and continues to update the state, CPU1 comes
> out of idle. CPU1 updates GRP0:0. The update for GRP1:0 is pending,
> tge CPU walks through the hierarchy. Both CPUs now walk the hierarchy
> to perform the needed update from their point of view.
> The currently visible state:
>
>> + *
>> + * LVL 1 [GRP1:0]
>> + * migrator = GRP0:1
>> + * active = GRP0:0, GRP0:1
>> + * / \
>> + * LVL 0 [GRP0:0] [GRP0:1]
>> + * --> migrator = CPU1 migrator = CPU2
>> + * --> active = CPU1 active = CPU2
>> + * / \ / \
>> + * CPUs 0 1 2 3
>> + * idle --> active active idle
>> + *
>> + * 3. Here comes the change of the order: Propagating the changes of step 2
>> + * through the hierarchy to GRP1:0 - nothing to be done, because GRP0:0
>> + * is already up to date.
>
> Here is the race condition: CPU1 managed to propagate its changes
> through the hierarchy to GRP1:0 before CPU0 did. The active members
> of GRP1:0 remain unchanged after the update since it is still valid
> from CPU1 current point of view:
>
> LVL 1 [GRP1:0]
> --> migrator = GRP0:1
> --> active = GRP0:0, GRP0:1
> / \
> LVL 0 [GRP0:0] [GRP0:1]
> migrator = CPU1 migrator = CPU2
> active = CPU1 active = CPU2
> / \ / \
> CPUs 0 1 2 3
> idle active active idle
>
> [ I take it as the migrator remains set to GRP0:1 by CPU1 but it could
> be changed to GRP0:0. I assume that both fields (migrator+active) are
> changed there via the propagation and the arrow in both fields denotes
> this. ]
>
>> + * 4. Propagating the changes of step 1 through the hierarchy to GRP1:0
>
> Now CPU0 finally propagates its changes to GRP1:0.
>
>> + *
>> + * LVL 1 [GRP1:0]
>> + * --> migrator = GRP0:1
>> + * --> active = GRP0:1
>> + * / \
>> + * LVL 0 [GRP0:0] [GRP0:1]
>> + * migrator = CPU1 migrator = CPU2
>> + * active = CPU1 active = CPU2
>> + * / \ / \
>> + * CPUs 0 1 2 3
>> + * idle active active idle
>> + *
>> + * Now there is a inconsistent overall state because GRP0:0 is active, but
>> + * it is marked as idle in the GRP1:0. This is prevented by incrementing
>> + * sequence counter whenever changing the state.
>
> The race of CPU0 vs CPU1 led to an inconsistent state in GRP1:0.
> CPU1 is active and is correctly listed as active in GRP0:0. However
> GRP1:0 does not have GRP0:0 listed as active which is wrong.
> The sequence counter has been added to avoid inconsistent states
> during updates. The state is updated atomically only if all members,
> including the sequence counter, match the expected value
> (compare-and-exchange).
> Looking back at the previous example with the addition of the
> sequence number: The update as performed by CPU0 in step 4 will fail.
> CPU1 changed the sequence number during the update in step 3 so the
> expected old value (as seen by CPU0 before starting the walk) does
> not match.
>

Thanks a lot for rephrasing the documentation to make it clearer for the
reader! I use your proposal with some minor changes.

Thanks,

Anna-Maria