Re: [PATCH 1/2] devcoredump: Remove devcoredump device if failing device is gone

From: Mukesh Ojha
Date: Wed Jan 31 2024 - 12:23:40 EST




On 1/29/2024 11:18 PM, Johannes Berg wrote:
On Fri, 2024-01-26 at 10:11 -0500, Rodrigo Vivi wrote:
Make dev_coredumpm a real device managed helper, that not only
frees the device after a scheduled delay (DEVCD_TIMEOUT), but
also when the failing/crashed device is gone.

The module remove for the drivers using devcoredump are currently
broken if attempted between the crash and the DEVCD_TIMEOUT, since
the symbolic sysfs link won't be deleted.

Hmm, is it a problem to remove a whole dev when it still has some link
here? Maybe we could just make the link be managed/auto-removed?

Probably regardless of that you should change the comment in
devcd_dev_release() since it's no longer a concern?

On top of that, for PCI devices, the unbind of the device will
call the pci .remove void function, that cannot fail. At that
time, our device is pretty much gone, but the read and free
functions are alive trough the devcoredump device and they
^ through, I guess

can get some NULL dereferences or use after free.

Not sure I understand this part, how's this related to PCI's .remove?

So, if the failing-device is gone let's also request for the
devcoredump-device removal using the same mod_delayed_work
as when writing anything through data. The flush cannot be
used since it is synchronous and the devcd would be surely
gone right before the mutex_unlock on the next line.

Can we just decouple it instead and remove the symlink? Which is kind of
what the comment in devcd_dev_release() says but at the time I wasn't
aware of all the devm mechanics etc.

Are we going to do this ?

-Mukesh


I'm thinking this might be annoying in certain recovery cases, e.g.
iwlwifi uses this but may sometimes unbind/rebind itself to recover from
certain errors, and that'd make the FW dumps disappear.

johannes