Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model

From: Frederic Weisbecker
Date: Sun Jan 28 2024 - 20:04:39 EST


Le Mon, Jan 15, 2024 at 03:37:41PM +0100, Anna-Maria Behnsen a écrit :
> +int tmigr_requires_handle_remote(void)
> +{
> + struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> + struct tmigr_remote_data data;
> + unsigned int ret = 0;
> + unsigned long jif;
> +
> + if (tmigr_is_not_available(tmc))
> + return ret;
> +
> + data.now = get_jiffies_update(&jif);
> + data.childmask = tmc->childmask;
> + data.firstexp = KTIME_MAX;
> + data.tmc_active = !tmc->idle;
> + data.check = false;
> +
> + /*
> + * If the CPU is active, walk the hierarchy to check whether a remote
> + * expiry is required.
> + *
> + * Check is done lockless as interrupts are disabled and @tmc->idle is
> + * set only by the local CPU.
> + */
> + if (!tmc->idle) {
> + __walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
> +
> + if (data.firstexp != KTIME_MAX)
> + ret = 1;
> +
> + return ret;
> + }
> +
> + /*
> + * If the CPU is idle, check whether the recalculation of @tmc->wakeup
> + * is required. @tmc->wakeup_recalc is set, when the last active CPU
> + * went offline. The last active CPU delegated the handling of the timer
> + * migration hierarchy to another (this) CPU by updating this flag and
> + * sending a reschedule.
> + *
> + * Racy lockless check is valid:
> + * - @tmc->wakeup_recalc is set by the remote CPU before it issues
> + * reschedule IPI.
> + * - As interrupts are disabled here this CPU will either observe
> + * @tmc->wakeup_recalc set before the reschedule IPI can be handled or
> + * it will observe it when this function is called again on return
> + * from handling the reschedule IPI.
> + */
> + if (tmc->wakeup_recalc) {
> + __walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
> +
> + if (data.firstexp != KTIME_MAX)
> + ret = 1;
> +
> + raw_spin_lock(&tmc->lock);
> + WRITE_ONCE(tmc->wakeup, data.firstexp);
> + tmc->wakeup_recalc = false;
> + raw_spin_unlock(&tmc->lock);

Suppose we have:

[GRP1:0]
migrator = GRP0:1
active = GRP0:0, GRP0:1
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 1 migrator = CPU 3
active = CPU 1 active = CPU 3
/ \ / \
CPUs 0 1 2 3
idle active idle active

CPU 0 and CPU 2 have no timer.
CPU 1 has a timer in a few millisecs.

[GRP1:0]
migrator = GRP0:1
active = GRP0:1
/ \
[GRP0:0] [GRP0:1]
migrator = NONE migrator = CPU 3
active = NONE active = CPU 3
/ \ / \
CPUs 0 1 2 3
idle idle idle active


CPU 1 went idle, CPU 3 will take care of CPU 1's timer. Then come two
things happening at the same time: CPU 0 has a timer interrupt, due to
RCU callbacks handling for example, and CPU 3 goes offline:

CPU 0 CPU 3
----- -----
// On top level [GRP1:0], just set migrator = TMIGR_NONE
tmigr_inactive_up() {
cpu = cpumask_any_but(cpu_online_mask, cpu);
//cpu == 0
tmc_resched = per_cpu_ptr(&tmigr_cpu, CPU 0);
raw_spin_lock(&tmc_resched->lock);
tmc_resched->wakeup_recalc = true;
raw_spin_unlock(&tmc_resched->lock);
// timer interrupt
run_local_timers() {
tmigr_requires_handle_remote() {
data.firstexp = KTIME_MAX;
// CPU 0 sees the tmc_resched->wakeup_recalc
// latest update
if (tmc->wakeup_recalc) {
tmigr_requires_handle_remote_up() {
// CPU 0 doesn't see GRP0:0
// latest update from CPU 1,
// because it has no locking
// and does a racy check.
if (!tmigr_check_migrator(group, childmask))
return true;
}
raw_spin_lock(&tmc->lock);
WRITE_ONCE(tmc->wakeup, data.firstexp);
tmc->wakeup_recalc = false;
raw_spin_unlock(&tmc->lock)
return 0;
}
// IPI is sent only now
smp_send_reschedule(cpu);
}


There is nothing that prevents CPU 0 from not seeing the hierarchy updates from
other CPUs since it checks the migrators in a racy way. As a result the timer of
CPU 1 may be ignored by CPU 0.

You'd need to lock the tmc while calling tmigr_requires_handle_remote_up(), so
that CPU 0 "inherits" everything that CPU 3 has seen, and that includes changes
from CPU 1.


But I see that tmigr_cpu_new_timer() does it right. Wouldn't it be possible to
exlusively let tmigr_cpu_new_timer() handle the wakeup_recalc thing? This is
going to be called after the end of the IRQ (whether timer interrupt or sched
IPI) in any case.

Thanks.