Re: [PATCH RFC v4 1/1] random: WARN on large getrandom() waits and introduce getrandom2()

From: Andy Lutomirski
Date: Fri Sep 20 2019 - 15:52:39 EST




> On Sep 20, 2019, at 12:37 PM, Willy Tarreau <w@xxxxxx> wrote:
>
> On Fri, Sep 20, 2019 at 12:22:17PM -0700, Andy Lutomirski wrote:
>> Perhaps userland could register a helper that takes over and does
>> something better?
>
> If userland sees the failure it can do whatever the developer/distro
> packager thought suitable for the system facing this condition.
>
>> But I think the kernel really should do something
>> vaguely reasonable all by itself.
>
> Definitely, that's what Linus' proposal was doing. Sleeping for some time
> is what I call "vaguely reasonable".

I don't buy it. We have existing programs that can deadlock on boot. Just throwing -EAGAIN at them in a syscall that didn't previously block does not strike me as reasonable.

>
>> If nothing else, we want the ext4
>> patch that provoked this whole discussion to be applied,
>
> Oh absolutely!
>
>> which means
>> that we need to unbreak userspace somehow, and returning garbage to it
>> is not a good choice.
>
> It depends how it's used. I'd claim that we certainly use randoms for
> other things (such as ASLR/hashtables) *before* using them to generate
> long lived keys thus we can have a bit more time to get some more
> entropy before reaching the point of producing these keys.

The problem is that we don't know what userspace is doing with the output from getrandom(..., 0), so I think we have to be conservative. New kernels need to work with old user code. It's okay if they're slower to boot than they could be.
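
For illustration only, here is a minimal sketch of the kind of existing caller at stake. fill_random() is a hypothetical helper, not taken from any real program; it assumes glibc's <sys/random.h> wrapper. It calls getrandom() with flags == 0 and has no handling for EAGAIN, because that error is only documented for GRND_NONBLOCK:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/*
 * Hypothetical helper, shown only as an example of "old user code".
 * With flags == 0, getrandom() blocks until the crng is initialized,
 * so the only conditions handled are EINTR and short reads.
 */
static int fill_random(unsigned char *buf, size_t len)
{
	size_t done = 0;

	assert(buf != NULL);
	while (done < len) {
		ssize_t n = getrandom(buf + done, len - done, 0);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal: retry */
			return -1;		/* anything else, EAGAIN included, is fatal */
		}
		done += (size_t)n;
	}
	return 0;
}
```

A caller written like this would treat a new -EAGAIN from getrandom(..., 0) as a hard failure, and a caller that ignores the return value would go on to use an unfilled buffer.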

>
>> Here are some possible approaches that come to mind:
>>
>> int count;
>> while (crng isn't inited) {
>>         msleep(1);
>> }
>>
>> and modify add_timer_randomness() to at least credit a tiny bit to
>> crng_init_cnt.
>
> Without a timeout it's sure we'll still face some situations where
> it blocks forever, which is the current problem.

The point is that we keep the timer running by looping like this, which should cause add_timer_randomness() to get called continuously, which should prevent the deadlock. I assume the deadlock is because we go into nohz-idle and we sit there with nothing happening at all.
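
A userspace model of that idea, as a rough sketch and not kernel code: struct crng_model, model_timer_credit() and wait_for_crng_model() are hypothetical names, nanosleep() stands in for msleep(1), and the one-unit credit per wakeup is an assumed stand-in for add_timer_randomness() crediting "a tiny bit" to crng_init_cnt.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical userspace stand-in for the kernel's crng init state. */
struct crng_model {
	unsigned int init_cnt;		/* plays the role of crng_init_cnt */
	unsigned int init_thresh;	/* count at which the crng is "inited" */
};

static bool crng_model_ready(const struct crng_model *m)
{
	return m->init_cnt >= m->init_thresh;
}

/* Assumed behaviour: every timer event credits one small unit. */
static void model_timer_credit(struct crng_model *m)
{
	m->init_cnt += 1;
}

/* Sleep in 1 ms steps until the model is ready; returns the number of sleeps. */
static unsigned int wait_for_crng_model(struct crng_model *m)
{
	const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000 };
	unsigned int sleeps = 0;

	assert(m->init_thresh > 0);
	while (!crng_model_ready(m)) {
		nanosleep(&one_ms, NULL);	/* stands in for msleep(1) */
		model_timer_credit(m);		/* each wakeup is a timer event */
		sleeps++;
	}
	return sleeps;
}
```

In this model the loop always terminates, because each iteration produces the event that advances the counter. Whether the same holds in the kernel depends on the timer events actually being credited.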

>
>> Or we do something like intentionally triggering readahead on some
>> offset on the root block device.
>
> You don't necessarily have such a device, especially when you're
> in an initramfs. It's precisely where userland can be smarter. When
> the caller is sfdisk for example, it does have more chances to try
> to perform I/O than when it's a tiny http server starting to present
> a configuration page.

What I mean is: allow user code to register a usermode helper that helps get entropy. Or just convince distros to bundle some useful daemon that starts at early boot and lives in the initramfs.