Re: RFC design of device coredump collection on panic in Pstore

From: Pavan Kondeti
Date: Mon May 15 2023 - 02:55:49 EST


On Mon, May 08, 2023 at 09:21:00PM +0530, Mukesh Ojha wrote:
> Hi All,
>
> We are exploring a way where device driver(registered/interested and optional could be
> callback) can dump their data(consist of list of buffers or hardware registers) for post
> mortam debugging of a crash via dumping the content to pstore region. Some of the example
> data could be like clock dump/regulators/ etc.
>
> These stuff should be already part of entire RAM dump but in some cases it's however not
> feasible to capture the entire content of RAM, so was thinking if pstore region can be
> used to capture above information either in the form of elf or binary and how pstore
> can get this in human readable format can be discussed further.
>
> Also, existing devcoredump driver does not cover panic usecases so we thought of something
> like below RFC design where,
>
> 1. Device_coredump allocates some configurable contigous memory that can be controlled
> via CONFIG or bootargs and later registers for panic notifiers.
> 2. Notifier gets added.
> 3. Pstore adds device_coredump as its front-end via dumper registration similar
> to kmsg being dump today.
> 4. Successful registration of dumper.
> 5. A device driver(A-Z) can register their buffer to be dumped as part of panic.
> 6. buffer gets added to the dump list.
> 7. Panic occurs.
> 8. iterate over registered drivers and copy their dump list to its own memory and if
> it crosses device core dump memory log an error stop iterating.
> 9. Similar to devcore_dump() inline with kmsg_dump()
> 10.Copy the content to pstore region and this could be elf or raw binary that can be
> discussed.
>
>
> Device coredump memory(1) size could be passed from pstore and should be same as size
> of devcoredump frontend size given in DT or some other way.
>
> Let me know your concern and view on this.
>
>
> pstore device_coredump deviceA..Z panic
> ┼ │
> │ │ panic_notifier(1) │
> │ ├───────────────────┬─────────────────►│
> │ │ notifier added │ (2) │
> │ │◄──────────────────┼──────────────────┤
> pstore can give │ (3) │ │ │
> its region for │ dumper_registration │ │ │
> dump ├─────────────────────►│ │ │
> │ (4) │ │ │
> │◄─────────────────────┤ │ │
> │ register_dumper │ │ │
> │ │ │ │
> │ │ │ │
> │ │ │ init │
> │ │ ───┼── │
> │ │ (5) │ │
> │ │devcoredump_register(dev,buf)); │
> │ │◄──────────────────┐ │
> │ │ (6) │ │
> │ ├──────────────────►│ │
> │ │device/buf gets add│d │
> │ │ to the list │ │panic
> │ │ │ ──┼───
> │ │ │ │
> │ │(7) panic_notifier │ call │
> │ │◄──────────────────┼──────────────────┤
> │ │ prepare coredump │for the deviceA-Z │
> │ ├──────────────────►├────┐ │
> │ │ (8) │ │ │
> │ │ │ │ │
> │ │ │ │ │
> │ │ │◄───┘ │
> dumper will │ │ (9) │dev_coredump()
> write all the device dump to │◄──────────────────┼──────────────────┤
> pstore ┌◄───┬─────────────────┤ │ │
> region │ │ (10) │ │ │
> │ │ │ │ │
> │ │ │ │ │
> │◄───┘ │ │ │
> │ │ │ │
> │ │
>
>

I really like the idea of providing an interface for device drivers to
dump specific data for offline use. Currently in MSM/QCOM downstream
kernel, we do register for panic notifiers and dump relavant data for
later parsing/analysis. In some cases, this message gets printed to console
so that it would also available in pstore / extracted dmesg. However
having separate record/buffer for individual drivers would help.

I believe dev_coredump() is just an analogy here as it is meant for
taking a dump of the device when the driver thinks its device/firmware
is hung/crashed etc, AFAICT. IIUC, What we are looking for here a different
interface/semantics i.e querying devices to dump any information when
the *system* is crashed/paniced. May be something like dev_panicdump()
and a separate method in device_driver.

Also, are there any use cases for this dev_panicdump() outside the
pstore()? For ex: it would still be helpful to gather all the relavant
data of a device and put it a separate buffer. The list of such buffers
can be parsed offline via standard ramdump analysis.

Thanks,
Pavan