Re: [PATCH] kexec: Fix reboot race during device_shutdown()

From: Joel Fernandes
Date: Mon Oct 02 2023 - 14:19:18 EST


Hi Eric,

On Fri, Sep 29, 2023 at 12:01 PM Eric W. Biederman
<ebiederm@xxxxxxxxxxxx> wrote:
>
> "Joel Fernandes (Google)" <joel@xxxxxxxxxxxxxxxxx> writes:
>
> > During kexec reboot, it is possible for a race to occur between
> > device_shutdown() and userspace. This causes accesses to GPU after pm_runtime
> > suspend has already happened. Fix this by calling freeze_processes() before
> > device_shutdown().
>
> Is there any reason why this same race with between sys_kexec and the
> adreno_ioctl can not happen during a normal reboot?

Thanks for the response. It can happen during a normal reboot. I think
the reason it does not show up in the wild is because the "reboot"
command implementation typically sends one of SIGSTOP or SIGKILL to
all processes which effectively prevents the race.

In any case, there is also a school of thought that says the kernel
should be resilient to crashes and a userspace workaround involving
sending signals could be looked at as papering over the real issue. I
do sympathize/agree with that school of thought as well.

> Is there any reason why there is not a .shutdown method to prevent the
> race?
> I would think the thing to do is to prevent this race in
> kernel_restart_prepare or in the GPUs .shutdown method. As I don't see
> anything that would prevent this during a normal reboot.

What you're saying is essentially what I remember trying, the issue is
not in the GPU driver but rather there the interconnect in the SoC
shutdown and causes an "SError" exception if the CPU tries to access
the memory locations, as also seen in the stack. I was not able to
trace exactly when the interconnect becomes unavailable and perhaps
there is a possibility of a more intricate fix where we can signal the
GPU driver to not access the bus anymore, but my suspicion is that
will add a lot of complexity and perhaps leave the door open to
similar issues.

> > Such freezing is already being done if kernel supports KEXEC_JUMP and
> > kexec_image->preserve_context is true. However, doing it if either of these are
> > not true prevents crashes/races.
>
> The KEXEC_JUMP case is something else entirely. It is supposed to work
> like suspend to RAM. Maybe reboot should as well, but I am
> uncomfortable making a generic device fix kexec specific.

I see your point of view. I think regular reboot should also be fixed
to avoid similar crash possibilities. I am happy to make a change for
that similar to this patch if we want to proceed that way.

Thoughts?

thanks,

- Joel