Re: [PATCH] random: ensure mix_interrupt_randomness() is consistent

From: Sebastian Andrzej Siewior
Date: Fri Feb 11 2022 - 09:51:29 EST


On 2022-02-11 11:48:15 [+0100], Jason A. Donenfeld wrote:
> Hi Sebastian,
Hi,

> On Fri, Feb 11, 2022 at 9:16 AM Sebastian Andrzej Siewior
> <bigeasy@xxxxxxxxxxxxx> wrote:
> > But I'm trying to avoid the migrate_disable(), so:
> > To close the race with losing the workqueue bit, wouldn't it be
> > sufficient to set it to zero via atomic_cmpxchg()? Also if the counter
> > before the memcpy() and after (at cmpxchg time) didn't change then the
> > pool wasn't modified. So basically
> >
> > do {
> >         counter = atomic_read(&fast_pool->count); // no need to cast
> >         /* memcpy() takes a byte count, so sizeof(), not ARRAY_SIZE() */
> >         memcpy(pool, fast_pool->pool_long, sizeof(pool));
> > } while (atomic_cmpxchg(&fast_pool->count, counter, 0) != counter);
> >
> >
> > then it also shouldn't matter if we are _accidentally_ on the wrong CPU.
>
> This won't work. If we're executing on a different CPU, the CPU
> mutating the pool won't necessarily update the count at the right
> time. This isn't actually a seqlock or something like that. Rather, it

But it is atomic, isn't it?

> depends on running on the same CPU, where the interrupting irq handler
> runs in full before giving control back, so that count and pool are
> either both updated or not at all. Making this work across CPUs makes
> things a lot more complicated and I'd rather not do that.

But this isn't the rule, is it? It runs on the same CPU so we should
observe the update in IRQ context and the worker should observe the
counter _and_ pool update.

And cross CPU isn't the rule. We only re-do the loop if
- an interrupt came in on the local CPU between atomic_read() and
  atomic_cmpxchg().

- the worker was migrated due to CPU hotplug and we managed to properly
  reset the counter back to 0.
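
To make the retry condition concrete, here is a rough userspace sketch of
that loop. It is only an illustration under assumptions: C11 atomics stand in
for atomic_read()/atomic_cmpxchg(), and the struct and function names
(fast_pool_sketch, sketch_mix, sketch_drain) are made up, not the random.c
code. It also ignores the flag bit kept in the real counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical stand-in for the per-CPU fast pool; not the kernel struct. */
struct fast_pool_sketch {
	unsigned long pool_long[2];
	atomic_int count;
};

/* "Interrupt" side: update the pool first, then bump the counter. */
static void sketch_mix(struct fast_pool_sketch *fp, unsigned long v)
{
	fp->pool_long[0] ^= v;
	fp->pool_long[1] += v;
	atomic_fetch_add(&fp->count, 1);
}

/*
 * "Worker" side: snapshot the pool, then reset the counter to 0 only if it
 * still holds the value read before the copy. If the counter changed in
 * between, the snapshot may be stale, so the loop copies again.
 * Returns the number of events covered by the snapshot.
 */
static int sketch_drain(struct fast_pool_sketch *fp, unsigned long out[2])
{
	int counter;

	assert(fp != NULL && out != NULL);
	do {
		counter = atomic_load(&fp->count);
		memcpy(out, fp->pool_long, sizeof(fp->pool_long));
	} while (!atomic_compare_exchange_strong(&fp->count, &counter, 0));

	return counter;
}
```

The sketch only shows the control flow. The plain memcpy() is fine when the
writer is an interrupt on the same CPU that runs to completion; with a truly
concurrent writer on another CPU the copy itself would need more care.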

> Actually, though, a nicer fix would be to just disable local
> interrupts for that *2 word copy*. That's a tiny period of time. If
> you permit me, that seems nicer. But if you don't like that, I'll keep
> that loop.

Here, I don't mind but I don't think it is needed.

> Unfortunately, though, I think disabling migration is required. Sultan
> (CC'd) found that these workqueues can migrate even midway through
> running. And generally the whole idea is to keep this on the *same*
> CPU so that we don't have to introduce locks and synchronization.

They can't. Your workqueue is not unbound _and_ you specify a specific
CPU instead of WORK_CPU_UNBOUND (or an offlined CPU).
The only way it can migrate is if the CPU goes down while the worker is
running (or before it had a chance I think) which forces the scheduler
to break its (worker's) CPU affinity and move it to another CPU.

> I'll add comments around the acquire/release. The remaining question
> I believe is: would you prefer disabling irqs during the 2 word memcpy,
> or this counter double read loop?

I would prefer the cmpxchg for the highly unlikely case that the worker
gets moved to another CPU and we might lose that SCHED bit. That is why we
switched to atomics, I think. Otherwise, if the updates are only local, we
can disable interrupts during the update.
But I don't mind disabling interrupts for that copy.

> Jason

Sebastian