Re: [PATCH] random: set fast pool count to zero in cpuhp teardown

From: Sebastian Andrzej Siewior
Date: Mon Feb 14 2022 - 06:28:54 EST


On 2022-02-13 22:53:43 [+0100], Jason A. Donenfeld wrote:
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 805a924b1f5f..e177d806db1d 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -1216,6 +1216,19 @@ static void fast_mix(u32 pool[4])
>
> static DEFINE_PER_CPU(struct fast_pool, irq_randomness);
>
> +int random_offline_cpu(unsigned int cpu)
> +{
> + /*
> + * Set the count to zero when offlining the CPU for two
> + * reasons: 1) so that all new accumulated irqs are fresh
> + * when it comes back online, and 2) so that its worker is
> + * permitted to schedule again when it comes back online,
> + * since the MIX_INFLIGHT flag will be cleared.
> + */
> + per_cpu_ptr(&irq_randomness, cpu)->count = 0;

You registered that handler between
CPUHP_AP_ONLINE_IDLE … CPUHP_AP_ACTIVE

That means:
- interrupts are still active, so a worker could be scheduled after the
  handler has run (it takes 64 interrupts).
- that handler will be invoked on the CPU that goes offline.
- it will be invoked before the worker is unbound from the offlining
  CPU. Once unbound at CPUHP_AP_WORKQUEUE_ONLINE, the worker may still
  remain on that CPU until the scheduler pushes it away.
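
For reference, a registration in that window would look roughly like
the sketch below. This is an assumption on my side since the quoted
hunk does not show the actual registration; CPUHP_AP_ONLINE_DYN is
just the generic dynamic slot between CPUHP_AP_ONLINE_IDLE and
CPUHP_AP_ACTIVE, and the state name is made up:

/*
 * Sketch only: a teardown-only state in the dynamic AP range. The
 * teardown callback runs on the CPU going down, while that CPU is
 * still serving interrupts.
 */
static int __init random_cpuhp_init(void)
{
	int ret;

	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "random:offline",
				NULL, random_offline_cpu);
	return ret < 0 ? ret : 0;
}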

You could:
- keep it at this position, but
  - this_cpu_ptr() could be used (the handler runs on that CPU anyway),
    and
  - the reset could be moved to the online phase so the pool is reset
    for the CPU coming up (at the current teardown point, you could
    still get >= 64 interrupts before the CPU stops serving them).

- move it to the CPUHP_OFFLINE … CPUHP_BRINGUP_CPU range. This is
  invoked on another CPU once the target CPU is dead / before it comes
  up.
  - in that case the function can remain as-is. But we have "rollback".
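
A sketch of that variant, assuming the generic CPUHP_BP_PREPARE_DYN
slot (any state in the CPUHP_OFFLINE … CPUHP_BRINGUP_CPU range would
do) and a made-up state name:

/*
 * Sketch only: in the BP prepare range the teardown callback is
 * invoked on a control CPU after the offlined CPU is dead, so
 * random_offline_cpu() keeps using per_cpu_ptr() and can stay as-is.
 */
static int __init random_cpuhp_init(void)
{
	int ret;

	ret = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "random:prepare",
				NULL, random_offline_cpu);
	return ret < 0 ? ret : 0;
}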

One slight complication:
After the CPUHP_AP_WORKQUEUE_ONLINE state (going down), the worker is
tainted: it may be moved to another CPU, but it may also remain on the
(correct) CPU and run. Moving the HP handler to an earlier point has
the advantage that by the time it runs, this is already settled.

We also have this "rollback": the CPU starts going down, one step
fails, and it rolls back online. If your handler sits at an earlier
point, you don't notice the rollback and the worker may have skipped
its turn. (Also, I didn't check, but I *assume* that after
CPUHP_AP_WORKQUEUE_ONLINE there is no worker pool bound to the CPU and
any new worker will be moved to an unbound pool.)

The more I think about it, moving the state left and right, I think
that having a CPU up handler after CPUHP_AP_WORKQUEUE_ONLINE (like the
one you have now, but up instead of down) doing:

	struct fast_pool *fast_pool = this_cpu_ptr(&irq_randomness);

	local_irq_disable();
	if (fast_pool->count & FAST_POOL_MIX_INFLIGHT) {
		fast_pool->count = 0;
		fast_pool->last = jiffies;
	}
	local_irq_enable();

should work just fine. This covers:
- should a worker be scheduled while the CPU is going down and do its
  job, we are good.

- should a worker be scheduled while the CPU is going down, be moved to
  another CPU and skip its work, then FAST_POOL_MIX_INFLIGHT is set and
  nothing is pending. This gets reset once the CPU comes back online.
  This also covers the possible rollback scenarios.

- should a CPU already have contributed entropy (FAST_POOL_MIX_INFLIGHT
  is not set), then the HP handler is not going to reset it.

- should a CPU already have contributed entropy and scheduled a worker,
  then we reset the FAST_POOL_MIX_INFLIGHT bit. However, ->last is also
  reset, so no new worker is scheduled right away, due to the time
  check and because count == 0.
  So either the worker runs (and does the same reset) _or_ we wait
  another 64 interrupts + a jiffy (in case the worker was scheduled but
  did nothing because it ran on another CPU).
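
For completeness, here is the above as a complete cpuhp startup
callback; the function name is a placeholder of mine and the body is
just the snippet from further up:

/*
 * Sketch only: runs on the CPU coming online, at a state after
 * CPUHP_AP_WORKQUEUE_ONLINE, i.e. once workers are bound to the CPU
 * again.
 */
static int random_online_cpu(unsigned int cpu)
{
	struct fast_pool *fast_pool = this_cpu_ptr(&irq_randomness);

	local_irq_disable();
	if (fast_pool->count & FAST_POOL_MIX_INFLIGHT) {
		/* permit a new worker; pending bits count as fresh */
		fast_pool->count = 0;
		fast_pool->last = jiffies;
	}
	local_irq_enable();
	return 0;
}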

> + return 0;
> +}
> +

Sebastian