Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model

From: Sebastian Siewior
Date: Mon Dec 11 2023 - 13:04:27 EST


On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> new file mode 100644
> index 000000000000..05cd8f1bc45d
> --- /dev/null
> +++ b/kernel/time/timer_migration.c
> @@ -0,0 +1,1636 @@

> +static bool tmigr_active_up(struct tmigr_group *group,
> + struct tmigr_group *child,
> + void *ptr)
> +{
> + union tmigr_state curstate, newstate;
> + struct tmigr_walk *data = ptr;
> + bool walk_done;
> + u8 childmask;
> +
> + childmask = data->childmask;
> + newstate = curstate = data->groupstate;
> +
> +retry:
> + walk_done = true;
> +
> + if (newstate.migrator == TMIGR_NONE) {
> + newstate.migrator = childmask;
> +
> + /* Changes need to be propagated */
> + walk_done = false;
> + }
> +
> + newstate.active |= childmask;
> +
> + newstate.seq++;
> +
> + if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
> + newstate.state = curstate.state;

This does not look right. If

        group->migr_state != curstate.state

then

        curstate.state = newstate.state

does not make a difference since curstate is on the stack. So this
should become

| if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
| curstate.state = newstate.state = atomic_read(&group->parent->migr_state);

and now I question the existence of tmigr_walk::groupstate. It does
not match the comment for the struct, which says it will be re-read
if the cmpxchg() fails, because that re-read does not happen (at
least not here). Also, why do we have it at all? Is it to avoid the
atomic_read() and keep the value "cached"?
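
If it is only there as a cache, then (completely untested) a plain
try_cmpxchg() loop would drop the member for this group's update and
make the re-read on failure explicit, since atomic_try_cmpxchg()
already updates curstate with the current value when it fails:

|	curstate.state = atomic_read(&group->migr_state);
|
|	do {
|		newstate = curstate;
|		walk_done = true;
|
|		if (newstate.migrator == TMIGR_NONE) {
|			newstate.migrator = childmask;
|
|			/* Changes need to be propagated */
|			walk_done = false;
|		}
|
|		newstate.active |= childmask;
|		newstate.seq++;
|	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state,
|				     newstate.state));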

> + goto retry;
> + }
> +
> + if (group->parent && (walk_done == false)) {

The group's parent doesn't change, so it can be accessed locklessly.
It is just that the topmost group has no parent, which is why this
check is needed. I would move the walk_done check to the left so it
is evaluated first.
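
Something like (untested):

|	if (!walk_done && group->parent) {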

> + data->groupstate.state = atomic_read(&group->parent->migr_state);
> + data->childmask = group->childmask;

We don't re-read this in case the cmpxchg failed, assuming someone
else is updating the state. Wouldn't it make more sense to read the
childmask at the top of the function from the child pointer? Then
there is no need to keep it around in the data pointer, right?
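
Untested, and assuming the child pointer is NULL only on the first
level (where the mask would still have to come from the tmigr_cpu),
something like:

|	childmask = child ? child->childmask : data->childmask;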

> + }
> +
> + /*
> + * The group is active and the event will be ignored - the ignore flag is
> + * updated without holding the lock. In case the bit is set while
> + * another CPU already handles remote events, nothing happens, because
> + * it is clear that the CPU became active just in this moment, or in
> + * worst case the event is handled remote. Nothing to worry about.
> + */

The CPU is becoming active, and so is the group. The ignore flag for
the group is updated locklessly to reflect this. Another CPU might
also set it to true while becoming active. In the worst case the
migrator observes it too late and remotely expires a timer belonging
to this group. The lock is held while going idle (and while setting
ignore to false) so that this state is not lost.

> + group->groupevt.ignore = true;
> +
> + return walk_done;
> +}

> +/*
> + * Returns the expiry of the next timer that needs to be handled. KTIME_MAX is
> + * returned, when an active CPU will handle all the timer migration hierarchy

s/when/if ?

> + * timers.
> + */
> +static u64 tmigr_new_timer(struct tmigr_cpu *tmc, u64 nextexp)
> +{
> + struct tmigr_walk data = { .evt = &tmc->cpuevt,
> + .nextexp = nextexp };
> +
> + lockdep_assert_held(&tmc->lock);
> +
> + if (tmc->remote)
> + return KTIME_MAX;
> +
> + tmc->cpuevt.ignore = false;
> + data.remote = false;
> +
> + walk_groups(&tmigr_new_timer_up, &data, tmc);
> +
> + /* If there is a new first global event, make sure it is handled */
> + return data.nextexp;
> +}
> +
> +static u64 tmigr_handle_remote_cpu(unsigned int cpu, u64 now,
> + unsigned long jif)
> +{
> + struct timer_events tevt;
> + struct tmigr_walk data;
> + struct tmigr_cpu *tmc;
> + u64 next = KTIME_MAX;
> +
> + tmc = per_cpu_ptr(&tmigr_cpu, cpu);
> +
> + raw_spin_lock_irq(&tmc->lock);
> +
> + /*
> + * The remote CPU is offline or the CPU event does not has to be handled
> + * (the CPU is active or there is no longer an event to expire) or
> + * another CPU handles the CPU timers already or the next event was
> + * already expired - return!
> + */

The comment is the English version of the C code below. The *why* is
usually the thing we care about. This basically sums up to:

        If the remote CPU is offline then the timer events have been
        migrated away.
        If tmigr_cpu::remote is set then another CPU takes care of this.
        If tmigr_event::ignore is set then the CPU is returning from
        idle and takes care of its timers itself.
        If the next event expires in the future then the event node has
        been updated and there is no timer to expire right now.

> + if (!tmc->online || tmc->remote || tmc->cpuevt.ignore ||
> + now < tmc->cpuevt.nextevt.expires) {
> + raw_spin_unlock_irq(&tmc->lock);
> + return next;

Looking at the last condition, where the timerqueue has been forwarded
by a jiffy or more: shouldn't we return _that_ value as next instead
of KTIME_MAX?
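
Untested, but something along these lines, so that the caller gets the
actual expiry back:

|	if (!tmc->online || tmc->remote || tmc->cpuevt.ignore) {
|		raw_spin_unlock_irq(&tmc->lock);
|		return next;
|	}
|
|	if (now < tmc->cpuevt.nextevt.expires) {
|		next = tmc->cpuevt.nextevt.expires;
|		raw_spin_unlock_irq(&tmc->lock);
|		return next;
|	}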

> + }
> +
> + tmc->remote = true;
> + WRITE_ONCE(tmc->wakeup, KTIME_MAX);
> +
> + /* Drop the lock to allow the remote CPU to exit idle */
> + raw_spin_unlock_irq(&tmc->lock);
> +
> + if (cpu != smp_processor_id())
> + timer_expire_remote(cpu);
> +
> + /*
> + * Lock ordering needs to be preserved - timer_base locks before tmigr
> + * related locks (see section "Locking rules" in the documentation at
> + * the top). During fetching the next timer interrupt, also tmc->lock
> + * needs to be held. Otherwise there is a possible race window against
> + * the CPU itself when it comes out of idle, updates the first timer in
> + * the hierarchy and goes back to idle.
> + *
> + * timer base locks are dropped as fast as possible: After checking
> + * whether the remote CPU went offline in the meantime and after
> + * fetching the next remote timer interrupt. Dropping the locks as fast
> + * as possible keeps the locking region small and prevents holding
> + * several (unnecessary) locks during walking the hierarchy for updating
> + * the timerqueue and group events.
> + */
> + local_irq_disable();
> + timer_lock_remote_bases(cpu);
> + raw_spin_lock(&tmc->lock);
> +
> + /*
> + * When the CPU went offline in the meantime, no hierarchy walk has to
> + * be done for updating the queued events, because the walk was
> + * already done during marking the CPU offline in the hierarchy.
> + *
> + * When the CPU is no longer idle, the CPU takes care of the timers and

If the CPU is no longer idle then it takes care of its timers and
also of the timers in its hierarchy.

> + * also of the timers in the path to the top.
> + *
> + * (See also section "Required event and timerqueue update after
> + * remote expiry" in the documentation at the top)

Perfect. The reasoning why it is safe to skip the walk and the remote
expiry of timers has been explained.

> + */
> + if (!tmc->online || !tmc->idle) {
> + timer_unlock_remote_bases(cpu);
> + goto unlock;
> + } else {
> + /* next event of CPU */
> + fetch_next_timer_interrupt_remote(jif, now, &tevt, cpu);
> + }
> +
> + timer_unlock_remote_bases(cpu);
> +
> + data.evt = &tmc->cpuevt;
> + data.nextexp = tevt.global;
> + data.remote = true;
> +
> + /*
> + * The update is done even when there is no 'new' global timer pending
> + * on the remote CPU (see section "Required event and timerqueue update
> + * after remote expiry" in the documentation at the top)
> + */
> + walk_groups(&tmigr_new_timer_up, &data, tmc);
> +
> + next = data.nextexp;
> +
> +unlock:
> + tmc->remote = false;
> + raw_spin_unlock_irq(&tmc->lock);
> +
> + return next;
> +}
> +

> +/**
> + * tmigr_handle_remote() - Handle global timers of remote idle CPUs
> + *
> + * Called from the timer soft interrupt with interrupts enabled.
> + */
> +void tmigr_handle_remote(void)
> +{
> + struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> + struct tmigr_remote_data data;
> +
> + if (tmigr_is_not_available(tmc))
> + return;
> +
> + data.childmask = tmc->childmask;
> + data.nextexp = KTIME_MAX;
> +
> + /*
> + * NOTE: This is a doubled check because the migrator test will be done
> + * in tmigr_handle_remote_up() anyway. Keep this check to fasten the

s/fasten/speed up/

> + * return when nothing has to be done.
> + */
> + if (!tmigr_check_migrator(tmc->tmgroup, tmc->childmask))
> + return;
> +
> + data.now = get_jiffies_update(&data.basej);
> +
> + /*
> + * Update @tmc->wakeup only at the end and do not reset @tmc->wakeup to
> + * KTIME_MAX. Even if tmc->lock is not held during the whole remote
> + * handling, tmc->wakeup is fine to be stale as it is called in
> + * interrupt context and tick_nohz_next_event() is executed in interrupt
> + * exit path only after processing the last pending interrupt.
> + */
> +
> + __walk_groups(&tmigr_handle_remote_up, &data, tmc);
> +
> + raw_spin_lock_irq(&tmc->lock);
> + WRITE_ONCE(tmc->wakeup, data.nextexp);
> + raw_spin_unlock_irq(&tmc->lock);
> +}
> +
> +static bool tmigr_requires_handle_remote_up(struct tmigr_group *group,
> + struct tmigr_group *child,
> + void *ptr)
> +{
> + struct tmigr_remote_data *data = ptr;
> + u8 childmask;
> +
> + childmask = data->childmask;
> +
> + /*
> + * Handle the group only if the child is the migrator or if the group
> + * has no migrator. Otherwise the group is active and is handled by its
> + * own migrator.
> + */
> + if (!tmigr_check_migrator(group, childmask))
> + return true;
> +
> + /*
> + * When there is a parent group and the CPU which triggered the
> + * hierarchy walk is not active, proceed the walk to reach the top level
> + * group before reading the next_expiry value.
> + */
> + if (group->parent && !data->tmc_active)
> + goto out;
> +
> + /*
> + * On 32 bit systems the racy lockless check for next_expiry will
> + * turn into a random number generator. Therefore do the lockless
> + * check only on 64 bit systems.
> + */

The lock is required on 32bit architectures to read the
variable consistently in the presence of a concurrent writer.
On 64bit the lock is not required because the read operation is
not split and is therefore always consistent.

> + if (IS_ENABLED(CONFIG_64BIT)) {
> + data->nextexp = READ_ONCE(group->next_expiry);
> + if (data->now >= data->nextexp) {
> + data->check = true;
> + return true;
> + }
> + } else {
> + raw_spin_lock(&group->lock);
> + data->nextexp = group->next_expiry;
> + if (data->now >= group->next_expiry) {
> + data->check = true;
> + raw_spin_unlock(&group->lock);
> + return true;
> + }
> + raw_spin_unlock(&group->lock);
> + }
> +
> +out:
> + /* Update of childmask for the next level */
> + data->childmask = group->childmask;
> + return false;
> +}
> +
> +/**
> + * tmigr_requires_handle_remote() - Check the need of remote timer handling
> + *
> + * Must be called with interrupts disabled.
> + */
> +int tmigr_requires_handle_remote(void)
> +{
> + struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> + struct tmigr_remote_data data;
> + unsigned int ret = 0;
> + unsigned long jif;
> +
> + if (tmigr_is_not_available(tmc))
> + return ret;
> +
> + data.now = get_jiffies_update(&jif);
> + data.childmask = tmc->childmask;
> + data.nextexp = KTIME_MAX;
> + data.tmc_active = !tmc->idle;
> + data.check = false;
> +
> + /*
> + * When the CPU is active, walking the hierarchy to check whether a
> + * remote expiry is required.

s/When/If
s/walking/walk

> + *
> + * Check is done lockless as interrupts are disabled and @tmc->idle is
> + * set only by the local CPU.
> + */
> + if (!tmc->idle) {
> + __walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
> +
> + if (data.nextexp != KTIME_MAX)
> + ret = 1;
> +
> + return ret;
> + }
> +
> + /*
> + * When the CPU is idle, check whether the recalculation of @tmc->wakeup
> + * is required. @tmc->wakeup_recalc is set by a remote CPU which is
> + * about to go offline, was the last active CPU in the whole timer
> + * migration hierarchy and now delegates handling of the hierarchy to
> + * this CPU.

I'm failing here…

> + * Racy lockless check is valid:
> + * - @tmc->wakeup_recalc is set by the remote CPU before it issues
> + * reschedule IPI.
> + * - As interrupts are disabled here this CPU will either observe
> + * @tmc->wakeup_recalc set before the reschedule IPI can be handled or
> + * it will observe it when this function is called again on return
> + * from handling the reschedule IPI.
> + */
> + if (tmc->wakeup_recalc) {
> + raw_spin_lock(&tmc->lock);
> +
> + __walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
> +
> + if (data.nextexp != KTIME_MAX)
> + ret = 1;
> +
> + WRITE_ONCE(tmc->wakeup, data.nextexp);
> + tmc->wakeup_recalc = false;
> + raw_spin_unlock(&tmc->lock);
> +
> + return ret;
> + }
> +
> + /*
> + * When the CPU is idle and @tmc->wakeup is reliable, compare it with
> + * @data.now. On 64 bit it is valid to do this lockless. On 32 bit
> + * systems, holding the lock is required to get valid data on concurrent
> + * writers.

The wakeup value is reliable when the CPU is idle?

See the comments regarding 64bit.

> + */
> + if (IS_ENABLED(CONFIG_64BIT)) {
> + if (data.now >= READ_ONCE(tmc->wakeup))
> + ret = 1;
> + } else {
> + raw_spin_lock(&tmc->lock);
> + if (data.now >= tmc->wakeup)
> + ret = 1;
> + raw_spin_unlock(&tmc->lock);
> + }
> +
> + return ret;
> +}



> +static bool tmigr_inactive_up(struct tmigr_group *group,
> + struct tmigr_group *child,
> + void *ptr)
> +{
> + union tmigr_state curstate, newstate;
> + struct tmigr_walk *data = ptr;
> + bool walk_done;
> + u8 childmask;
> +
> + childmask = data->childmask;
> + newstate = curstate = data->groupstate;
> +
> +retry:
> + walk_done = true;
> +
> + /* Reset active bit when the child is no longer active */
> + if (!data->childstate.active)
> + newstate.active &= ~childmask;
> +
> + if (newstate.migrator == childmask) {
> + /*
> + * Find a new migrator for the group, because the child group is
> + * idle!
> + */
> + if (!data->childstate.active) {
> + unsigned long new_migr_bit, active = newstate.active;
> +
> + new_migr_bit = find_first_bit(&active, BIT_CNT);
> +
> + if (new_migr_bit != BIT_CNT) {
> + newstate.migrator = BIT(new_migr_bit);
> + } else {
> + newstate.migrator = TMIGR_NONE;
> +
> + /* Changes need to be propagated */
> + walk_done = false;
> + }
> + }
> + }
> +
> + newstate.seq++;
> +
> + WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
> +
> + if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {

This one appears wrong, too. The curstate is not getting updated during
retry.
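
Same remark as for tmigr_active_up(): a plain try_cmpxchg() loop would
make the re-read explicit. Roughly (untested, recompute of newstate
elided):

|	curstate.state = atomic_read(&group->migr_state);
|
|	do {
|		if (child)
|			data->childstate.state = atomic_read(&child->migr_state);
|
|		newstate = curstate;
|		walk_done = true;
|
|		/* ... recompute migrator/active/seq as above ... */
|	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state,
|				     newstate.state));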

> + newstate.state = curstate.state;
> +
> + /*
> + * Something changed in the child/parent group in the meantime,
> + * reread the state of the child and parent; Update of
> + * data->childstate is required for event handling;
> + */
> + if (child)
> + data->childstate.state = atomic_read(&child->migr_state);
> +
> + goto retry;
> + }
> +
> + data->groupstate = newstate;
> + data->remote = false;
> +
> + /* Event Handling */
> + tmigr_update_events(group, child, data);
> +
> + if (group->parent && (walk_done == false)) {
> + data->childmask = group->childmask;
> + data->childstate = newstate;
> + data->groupstate.state = atomic_read(&group->parent->migr_state);
> + }
> +
> + /*
> + * data->nextexp was set by tmigr_update_events() and contains the
> + * expiry of the first global event which needs to be handled
> + */
> + if (data->nextexp != KTIME_MAX) {
> + WARN_ON_ONCE(group->parent);
> + /*
> + * Top level path - If this CPU is about going offline, wake
> + * up some random other CPU so it will take over the
> + * migrator duty and program its timer properly. Ideally
> + * wake the CPU with the closest expiry time, but that's
> + * overkill to figure out.
> + *
> + * Set wakeup_recalc of remote CPU, to make sure the complete
> + * idle hierarchy with enqueued timers is reevaluated.
> + */
> + if (!(this_cpu_ptr(&tmigr_cpu)->online)) {
> + struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> + unsigned int cpu = smp_processor_id();
> + struct tmigr_cpu *tmc_resched;
> +
> + cpu = cpumask_any_but(cpu_online_mask, cpu);
> + tmc_resched = per_cpu_ptr(&tmigr_cpu, cpu);
> +
> + raw_spin_unlock(&tmc->lock);
> +
> + raw_spin_lock(&tmc_resched->lock);
> + tmc_resched->wakeup_recalc = true;
> + raw_spin_unlock(&tmc_resched->lock);
> +
> + raw_spin_lock(&tmc->lock);
> + smp_send_reschedule(cpu);

This whole thing confuses me.
If the CPU goes offline, it needs to be removed from the migration
hierarchy and that is it. Everything else is handled by the migrator.
If the migrator is going offline then it needs to wake a random CPU
and make sure that it takes over the migrator role. I am confused by
the whole ::wakeup and ::wakeup_recalc thingy.

> + }
> + }
> +
> + return walk_done;
> +}

Sebastian