Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model

From: Sebastian Siewior
Date: Fri Dec 08 2023 - 13:18:14 EST


On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> new file mode 100644
> index 000000000000..05cd8f1bc45d
> --- /dev/null
> +++ b/kernel/time/timer_migration.c
> @@ -0,0 +1,1636 @@

> + * Required event and timerqueue update after a remote expiry:
> + * -----------------------------------------------------------
> + *
> + * After a remote expiry of a CPU, a walk through the hierarchy updating the
> + * events and timerqueues has to be done when there is a 'new' global timer of
> + * the remote CPU (which is obvious) but also if there is no new global timer,
> + * but the remote CPU is still idle:

After expiring timers of a remote CPU, a walk through the hierarchy,
updating the events and timerqueues, is required. It is obviously
needed if there is a 'new' global timer, but also if there is no new
global timer while the remote CPU is still idle.

> + * 1. CPU2 is the migrator and does the remote expiry in GRP1:0; expiry of
> + * evt-CPU0 and evt-CPU1 are equal:

CPU0 and CPU1 both have a timer expiring at the same time, so both
have an event enqueued in the timerqueue of GRP0:0. CPU2 and CPU3
have no global timer pending and CPU2 is the only active CPU and
also the migrator.

> + *
> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:1
> + *                 --> timerqueue = evt-GRP0:0
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *           migrator = TMIGR_NONE     migrator = CPU2
> + *           active   =                active   = CPU2
> + *           groupevt.ignore = false   groupevt.ignore = true
> + *           groupevt.cpu = CPU0       groupevt.cpu =
> + *           timerqueue = evt-CPU0,    timerqueue =
> + *                        evt-CPU1
> + *                  /         \                    /         \
> + *    CPUs         0           1                  2           3
> + *               idle         idle               active      idle
> + *
> + * 2. Remove the first event of the timerqueue in GRP1:0 and expire the timers
> + * of CPU0 (see evt-GRP0:0->cpu value):

CPU2 begins to expire remote timers. It starts with its own group
GRP0:1. GRP0:1 has nothing in its timerqueue, so CPU2 continues with
its parent, GRP1:0. In GRP1:0 it dequeues the first event, looks at
the CPU member and expires the pending timer of CPU0.
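
To make sure I read it right, the per-group part of this is roughly
the following (only a sketch of my understanding; the locking and the
exact signature of tmigr_handle_remote_cpu() are left out/assumed):

        /*
         * Dequeue the already expired events of this group and expire
         * the global timers of the CPU recorded in each event, until
         * the first not-yet-expired event is reached.
         */
        while ((evt = tmigr_next_expired_groupevt(group, now))) {
                unsigned int remote_cpu = evt->cpu;

                /* expire the pending global timers of that remote CPU */
                tmigr_handle_remote_cpu(remote_cpu, now, jif);
        }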

> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:1
> + *                 --> timerqueue =
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *           migrator = TMIGR_NONE     migrator = CPU2
> + *           active   =                active   = CPU2
> + *           groupevt.ignore = false   groupevt.ignore = true
> + *       --> groupevt.cpu = CPU0       groupevt.cpu =
> + *           timerqueue = evt-CPU0,    timerqueue =
> + *                        evt-CPU1
> + *                  /         \                    /         \
> + *    CPUs         0           1                  2           3
> + *               idle         idle               active      idle
> + *
> + * 3. After the remote expiry CPU0 has no global timer that needs to be
> + * enqueued. When skipping the walk, the global timer of CPU1 is not handled,
> + * as the group event of GRP0:0 is not updated and not enqueued into GRP1:0. The
> + * walk has to be done to update the group events and timerqueues:

The work isn't over after expiring the timers of CPU0. If we stop
here, then CPU1's timer has not been expired and the timerqueue of
GRP0:0 still has an event for CPU0 enqueued which has just been
processed. So it is required to walk the hierarchy from CPU0's
point of view and update it accordingly.
CPU0 will be removed from the timerqueue because it has no pending
timer. If CPU0 had a timer pending, then it would have to expire
after CPU1's first timer because all timers from this period have
just been expired.
Either way CPU1 will be first in GRP0:0's timerqueue and therefore
set in the CPU field of the group event which is enqueued in
GRP1:0's timerqueue.
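
Per level that walk then has to do roughly the following (just a
sketch using the timerqueue API to illustrate; the groupevt/ignore
handling and the locking of the real update function are left out):

        /* the child's/CPU's old event was just processed, remove it */
        if (timerqueue_node_queued(&evt->nextevt))
                timerqueue_del(&group->events, &evt->nextevt);

        /* re-enqueue it only if a new global timer is pending */
        if (nextexp != KTIME_MAX) {
                evt->nextevt.expires = nextexp;
                timerqueue_add(&group->events, &evt->nextevt);
        }

The first event remaining in GRP0:0's timerqueue (here evt-CPU1) is
then what ends up in the group event enqueued in GRP1:0.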

> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:1
> + *                 --> timerqueue = evt-GRP0:0
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *           migrator = TMIGR_NONE     migrator = CPU2
> + *           active   =                active   = CPU2
> + *           groupevt.ignore = false   groupevt.ignore = true
> + *       --> groupevt.cpu = CPU1       groupevt.cpu =
> + *       --> timerqueue = evt-CPU1     timerqueue =
> + *                  /         \                    /         \
> + *    CPUs         0           1                  2           3
> + *               idle         idle               active      idle
> + *
> + * Now CPU2 (migrator) is able to handle the timer of CPU1 as CPU2 only scans
> + * the timerqueues of GRP0:1 and GRP1:0.

Now CPU2 continues step 2 at GRP1:0 and will expire the timer of
CPU1.

> + * The update of step 3 is valid to be skipped, when the remote CPU went offline
> + * in the meantime because an update was already done during inactive path. When
> + * CPU became active in the meantime, update isn't required as well, because
> + * GRP0:0 is now longer idle.

The hierarchy walk in step 3 can be skipped if the migrator notices
that a CPU of GRP0:0 is active. The CPU will mark GRP0:0 active and
take care of the group and any needed updates within the hierarchy.

I skipped the "offline" part because it is not needed. Before a CPU
can go offline it first has to come out of idle. While going offline
it won't (probably) participate here and its remaining timers will be
migrated to another CPU.

> + */

> +
> +typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, void *);
> +
> +static void __walk_groups(up_f up, void *data,
> +                          struct tmigr_cpu *tmc)
> +{
> +        struct tmigr_group *child = NULL, *group = tmc->tmgroup;
> +
> +        do {
> +                WARN_ON_ONCE(group->level >= tmigr_hierarchy_levels);
> +
> +                if (up(group, child, data))
> +                        break;
> +
> +                child = group;
> +                group = group->parent;
> +        } while (group);
> +}
> +
> +static void walk_groups(up_f up, void *data, struct tmigr_cpu *tmc)
> +{
> +        lockdep_assert_held(&tmc->lock);
> +
> +        __walk_groups(up, data, tmc);
> +}

So these two: all users of walk_groups() have tmigr_cpu::lock acquired
while the users of __walk_groups() don't. Also the `up' function passed
to walk_groups() always gets the same data type, and the data argument
passed to __walk_groups() is also always of one type, but a different
one than the former.

Given the locking situation and the type of the data argument, it looks
like walk_groups() is used for thing#1 and __walk_groups() for thing#2.
Therefore it could make sense to have two separate functions (instead
of walk_groups() and __walk_groups()) to distinguish this. Now it is
too late, but maybe later I will figure out why the one type requires
locking and the other doesn't.
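
Something like the below is what I mean, just to make the rule visible
at the call site (function names made up, only a sketch):

/* all callers pass the same data type and hold tmc->lock */
static void walk_groups_locked(up_f up, void *data, struct tmigr_cpu *tmc)
{
        lockdep_assert_held(&tmc->lock);
        __walk_groups(up, data, tmc);
}

/* remote walk with the other data type, no tmc->lock required */
static void walk_groups_remote(up_f up, void *data, struct tmigr_cpu *tmc)
{
        __walk_groups(up, data, tmc);
}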


> +/*
> + * Return the next event which is already expired of the group timerqueue
> + *
> + * Event, which is returned, is also removed from the queue.
> + */
> +static struct tmigr_event *tmigr_next_expired_groupevt(struct tmigr_group *group,
> +                                                        u64 now)
> +{
> +        struct tmigr_event *evt = tmigr_next_groupevt(group);
> +
> +        if (!evt || now < evt->nextevt.expires)
> +                return NULL;
> +
> +        /*
> +         * The event is already expired. Remove it. If it's not the last event,
> +         * then update all group event related information.
> +         */

The event expired, remove it. Update the group's next expiry time.

> +        if (timerqueue_del(&group->events, &evt->nextevt))
> +                tmigr_next_groupevt(group);
> +        else
> +                WRITE_ONCE(group->next_expiry, KTIME_MAX);

And then you can invoke tmigr_next_groupevt() unconditionally.
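
i.e. something like this (a sketch, assuming tmigr_next_groupevt()
itself resets group->next_expiry to KTIME_MAX when the queue is empty):

        /* The event expired, remove it. Update the group's next expiry. */
        timerqueue_del(&group->events, &evt->nextevt);
        tmigr_next_groupevt(group);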

> +        return evt;
> +}
> +

Sebastian