Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model

From: Sebastian Siewior
Date: Tue Dec 12 2023 - 10:59:48 EST


On 2023-12-12 12:31:19 [+0100], Anna-Maria Behnsen wrote:
> Sebastian Siewior <bigeasy@xxxxxxxxxxxxx> writes:
>
> > On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
> >> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> >> new file mode 100644
> >> index 000000000000..05cd8f1bc45d
> >> --- /dev/null
> >> +++ b/kernel/time/timer_migration.c
> >> @@ -0,0 +1,1636 @@
> > …
> >> +static bool tmigr_active_up(struct tmigr_group *group,
> >> + struct tmigr_group *child,
> >> + void *ptr)
> >> +{
> >> + union tmigr_state curstate, newstate;
> >> + struct tmigr_walk *data = ptr;
> >> + bool walk_done;
> >> + u8 childmask;
> >> +
> >> + childmask = data->childmask;
> >> + newstate = curstate = data->groupstate;
> >> +
> >> +retry:
> >> + walk_done = true;
> >> +
> >> + if (newstate.migrator == TMIGR_NONE) {
> >> + newstate.migrator = childmask;
> >> +
> >> + /* Changes need to be propagated */
> >> + walk_done = false;
> >> + }
> >> +
> >> + newstate.active |= childmask;
> >> +
> >> + newstate.seq++;
> >> +
> >> + if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
> >> + newstate.state = curstate.state;
> >
> > This does not look right. If
> > group->migr_state != curstate.state
> > then
> > curstate.state = newstate.state
> >
> > does not make a difference since curstate is on stack. So this should
> > become
> >
> > | if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
> > | curstate.state = newstate.state = atomic_read(&group->parent->migr_state);
> >
> > and now I question the existence of tmigr_walk::groupstate. It does not
> > match the comment for the struct saying it will be re-read if the
> > cmpxchg() fails because it does not happen (at least here). Also why do
> > we have it? Is it avoid atomic_read() and have it "cached"?
>
> atomic_try_cmpxchg() updates curstate.state with the actual
> group->migr_state when those values do not match. So it is reread by
> atomic_try_cmpxchg() and still matches the description. (At least this
> is what the function description of atomic_try_cmpxchg() says.)

Ach. Indeed. That part slipped my mind. Could still replace it with:
newstate = curstate

to match the assignment at the top of the function? Or do something
like:

| childmask = data->childmask;
| curstate = data->groupstate;
| retry:
| newstate = curstate;
|
| walk_done = true;

| if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state))
| goto retry;

So gcc can save a branch and recycle the upper part of the code.
gcc-13 does not recognise this, clang-16 does.
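
For anyone else skimming the thread, the failure path of
atomic_try_cmpxchg() behaves roughly like the sketch below (not the
actual arch implementation). The expected value is updated in place,
which is the implicit re-read mentioned above:

| /* needs <linux/atomic.h>; same shape as the real helper */
| static inline bool try_cmpxchg_sketch(atomic_t *v, int *old, int new)
| {
|	int cur = atomic_cmpxchg(v, *old, new);
|
|	if (cur == *old)
|		return true;	/* swap happened */
|	*old = cur;		/* failed: caller's expected value is updated */
|	return false;
| }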

> But beside of this, why do I want to update curstate.state with
> group->parent->migr_state when cmpxchg of this group wasn't successful
> yet or was it a copy paste error?

It was an error.

> >> + data->groupstate.state = atomic_read(&group->parent->migr_state);
> >> + data->childmask = group->childmask;
> >
> > We don't re-read in case the cmpxchg failed assuming someone else is
> > updating the state. Wouldn't it make sense to read the childmask at top
> > of the function from the child pointer. There is no need to keep around
> > in the data pointer, right?
>
> When we are in lvl0, then @child is NULL as the child is a tmigr_cpu and
> not a tmigr_group. This is the reason why I decided to store it inside
> the tmigr_walk struct.

But it is supposed to be group->migr_state for the cmpxchg. So considering
the previous bit, why not:

| childmask = data->childmask;
| curstate.state = atomic_read(&group->migr_state);
|
| do {
| newstate = curstate;
| walk_done = true;
|
| if (newstate.migrator == TMIGR_NONE) {
| newstate.migrator = childmask;
|
| /* Changes need to be propagated */
| walk_done = false;
| }
|
| newstate.active |= childmask;
|
| newstate.seq++;
|
| } while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));

this seems nice.

> >> + }
> >> +
> >> + /*
> >> + * The group is active and the event will be ignored - the ignore flag is
> >> + * updated without holding the lock. In case the bit is set while
> >> + * another CPU already handles remote events, nothing happens, because
> >> + * it is clear that the CPU became active just in this moment, or in
> >> + * worst case the event is handled remote. Nothing to worry about.
> >> + */
> >
> > The CPU is becoming active, and so is the group. The ignore flag
> > for the group is updated locklessly to reflect this. Another CPU
> > might also set it to true while becoming active. In the worst case
> > the migrator observes it too late and remotely expires a timer
> > belonging to this group. The lock is held while going idle (and
> > setting ignore to false) so the state is not lost.
> >
>
> This is what I wanted to explain:
>
> /*
> * The group is active (again). The group event might be still queued
> * into the parent group's timerqueue but can now be handled by the the

s@the$@@

> * migrator of this group. Therefore the ignore flag for the group event
> * is updated to reflect this.
> *
> * The update of the ignore flag in the active path is done lock
lockless

> * less. In worst case the migrator of the parent group observes the
> * change too late and expires remotely timer belonging to this
a timer?

> * group. The lock is held while updating the ignore flag in idle
> * path. So this state change will not be lost.
> */
>
> >> + group->groupevt.ignore = true;
> >> +
> >> + return walk_done;
> >> +}
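
To summarise the resulting protocol as I understand it (a sketch only,
with the field names from the patch; the idle path is paraphrased):

| /* Active path: lockless, only ever sets the flag. */
| group->groupevt.ignore = true;
|
| /* Idle path: under the group lock, so clearing cannot be lost. */
| raw_spin_lock(&group->lock);
| group->groupevt.ignore = false;
| raw_spin_unlock(&group->lock);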

> >> + if (!tmc->online || tmc->remote || tmc->cpuevt.ignore ||
> >> + now < tmc->cpuevt.nextevt.expires) {
> >> + raw_spin_unlock_irq(&tmc->lock);
> >> + return next;
> >
> > Looking at the last condition where the timerqueue has been forwarded by
> > a jiffy+, shouldn't we return _that_ values next instead of KTIME_MAX?
>
> No. Because the event is already queued into the hierarchy and the
> migrator takes care. If the hierarchy is completely idle, the CPU which
> updated the event takes care. I'll add this to the comment above.

So another CPU took care of it and we set tmc->wakeup to KTIME_MAX…

One confusing part is that this return value (if not aborted early but
after completing this function) is used to set tmc->wakeup based on the
next pending timer of the CPU whose timers were just expired remotely.
We could have expired timers on four CPUs and the next timer of the last
CPU may not be the earliest timer for the first CPU in the group.
And this is fine because it can be stale and is only valid if the CPU
goes idle?

> >> + /*
> >> + * When the CPU is idle, check whether the recalculation of @tmc->wakeup
> >> + * is required. @tmc->wakeup_recalc is set by a remote CPU which is
> >> + * about to go offline, was the last active CPU in the whole timer
> >> + * migration hierarchy and now delegates handling of the hierarchy to
> >> + * this CPU.
> >
> > I'm failing here…
>
> If the CPU is idle, check whether the recalculation of @tmc->wakeup
> is required. @tmc->wakeup_recalc is set when the last active CPU
> went offline. The last active CPU delegated the handling of the timer
> migration hierarchy to another (this) CPU by updating this flag and
> sending a reschedule.
>
> Better?

So the last CPU going offline had to be the migrator because otherwise
it wouldn't matter?


> >> +
> >> + if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
> >
> > This one appears wrong, too. The curstate is not getting updated during
> > retry.
>
> See the answer above.

Yes, and I think the do { } while should work here, too.
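
Something like this skeleton, with the inactive-path logic in the body
(untested, just to illustrate the shape):

| curstate.state = atomic_read(&group->migr_state);
|
| do {
|	newstate = curstate;
|
|	/* inactive-path updates of newstate go here */
|
| } while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));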

> >> + * data->nextexp was set by tmigr_update_events() and contains the
> >> + * expiry of the first global event which needs to be handled
> >> + */
> >> + if (data->nextexp != KTIME_MAX) {
> >> + WARN_ON_ONCE(group->parent);
> >> + /*
> >> + * Top level path - If this CPU is about going offline, wake
> >> + * up some random other CPU so it will take over the
> >> + * migrator duty and program its timer properly. Ideally
> >> + * wake the CPU with the closest expiry time, but that's
> >> + * overkill to figure out.
> >> + *
> >> + * Set wakeup_recalc of remote CPU, to make sure the complete
> >> + * idle hierarchy with enqueued timers is reevaluated.
> >> + */
> >> + if (!(this_cpu_ptr(&tmigr_cpu)->online)) {
> >> + struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> >> + unsigned int cpu = smp_processor_id();
> >> + struct tmigr_cpu *tmc_resched;
> >> +
> >> + cpu = cpumask_any_but(cpu_online_mask, cpu);
> >> + tmc_resched = per_cpu_ptr(&tmigr_cpu, cpu);
> >> +
> >> + raw_spin_unlock(&tmc->lock);
> >> +
> >> + raw_spin_lock(&tmc_resched->lock);
> >> + tmc_resched->wakeup_recalc = true;
> >> + raw_spin_unlock(&tmc_resched->lock);
> >> +
> >> + raw_spin_lock(&tmc->lock);
> >> + smp_send_reschedule(cpu);
> >
> > This whole thing confuses me.
> > If the CPU goes offline, it needs to get removed from the migration
> > hierarchy and this is it. Everything else is handled by the migrator. If
> > the migrator is going offline then it needs to wake a random CPU and make
> > sure it takes the migrator role. I am confused by the whole ::wakeup and
> > ::wakeup_recalc thingy.
> >
>
> wakeup_recalc is required to indicate that the CPU was chosen as the
> new migrator CPU when the last active CPU in the timer migration
> hierarchy went offline.

Aha! I suspected this. So this is more like need_new_migrator.
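
Maybe worth spelling that out at the struct member, something like
(illustrative comment only, not a complete struct):

|	/*
|	 * Set by the last active CPU on its way offline to delegate the
|	 * migrator role, i.e. "need_new_migrator".
|	 */
|	bool	wakeup_recalc;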


> Hopefully this is now clear?

yes.

> Thanks,
>
> Anna-Maria

Sebastian