Re: Inquiring about Debugging Platform Drivers using Crash Utility for Kernel Coredump

From: Stephen Brennan
Date: Tue Jun 20 2023 - 12:42:03 EST


Hi Talel,

Thanks for the message. This is definitely the right place to discuss
these sorts of questions.

"Shenhar, Talel" <talel@xxxxxxxxxx> writes:
> Dear Linux Kernel Community,
>
> I hope this message finds you well.
>
> I'd like to use crash utility for postmortem of my kernel coredump
> analysis.
>
> I was able to collect coredump and able to use various operation from
> within the crash utility such as irq -s,  log, files and others.
>
> I am using: crash-arm64 version: 7.3.0, gdb version: 7.6, kernel version
> 4.19.

You've definitely got the hard part done if you've got the core dump and
crash all working.

> My specific interest lies in debugging drivers internal state, e.g.
> platform drivers.

Please excuse my ignorance of your particular use case: I haven't done
much work with device drivers in general, or ARM-specific ones in
particular!

> For some hands-on experience with crash utility I'd like to start by
> iterating over all the platform drivers and print their names,
>
> However, I am finding it challenging to get started with this process
> and I am uncertain of the best approach to achieve this. I have scoured
> various resources for insights, but the information related to this
> specific usage seems to be scattered and not exhaustive.

Crash has some excellent helpers, as you've seen (irq, log, files, kmem,
etc...). If you're lucky enough to have a crash command that deals with
the particular area you're debugging, then that can go a long way.
Unfortunately not every subsystem has such a helper command, and this is
especially true for device drivers.

So no matter what tool you use for this -- crash, drgn, or others -- you
will not be relying on a nice "list-all-platform-devices --name"
command. Instead, you'll simply need to use your knowledge of the code
for the subsystem to help you navigate it.

As I said, I don't know much about device drivers, but a pattern I've
frequently seen in subsystems is a struct holding function pointers and
maybe a name, plus a "register_xxx()" function. Calling it registers a
driver or backend by placing that struct on a linked list of all the
drivers or backends.
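
To make that pattern concrete, here is a minimal userspace C sketch.
Everything in it (struct demo_driver, register_demo_driver(), the
demo_drivers list head) is a hypothetical stand-in rather than real
kernel code, and the platform driver code may well be organised
differently.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal doubly linked list, modelled on the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical driver descriptor: a name, one callback, and a list link. */
struct demo_driver {
    const char *name;
    int (*probe)(void);
    struct list_head node;
};

/* Global list head: the kind of symbol you would look up in a debugger. */
static struct list_head demo_drivers = { &demo_drivers, &demo_drivers };

/* Hypothetical register_xxx(): append the driver to the global list. */
static void register_demo_driver(struct demo_driver *drv)
{
    assert(drv != NULL);
    drv->node.prev = demo_drivers.prev;
    drv->node.next = &demo_drivers;
    demo_drivers.prev->next = &drv->node;
    demo_drivers.prev = &drv->node;
}

/* Count registered drivers by walking from the head back to the head. */
static size_t count_demo_drivers(void)
{
    size_t n = 0;

    for (struct list_head *p = demo_drivers.next; p != &demo_drivers; p = p->next)
        n++;
    return n;
}
```

In a vmcore, the equivalent of demo_drivers is the symbol to hunt for:
once you know its name and which struct member links the entries, the
rest is list traversal.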

So a good place to start for this particular question would be to find
the global variable that serves as the list head for your drivers. Then
use the crash "list" command (it takes a good few minutes to get your
head around all the options, but it's powerful) to enumerate them and
print relevant fields, such as the name.

As an example which isn't driver-specific, you might want to look at all
of the slab caches (struct kmem_cache) and print their names. They have
a field "name", and a field "list" which is a list_head. There is an
external global variable named "slab_caches" which is the list head of
the list of all caches. You could iterate over all of them with:

list -s kmem_cache.name -o kmem_cache.list -H slab_caches

The "-s kmem_cache.name" tells it what to print, the "-o
kmem_cache.list" tells it to use that as the struct list_head linking
the list, and "-H slab_caches" tells it that this is an external head of
the list.
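
For intuition about what those three options correspond to, here is a
rough userspace C sketch of the same walk. It is not how crash is
implemented; struct demo_cache, walk_caches() and add_cache() are
made-up names, and the printed format is only illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Stand-in for struct kmem_cache: only the two fields the example uses. */
struct demo_cache {
    const char *name;
    struct list_head list;
};

/*
 * Walk an externally headed list, in the spirit of "-H head": start at
 * head->next and stop on returning to head. For each node, subtract the
 * offset of the embedded list_head (the "-o" argument) to recover the
 * containing struct, then print the requested member (the "-s" argument).
 * Returns the number of entries visited.
 */
static size_t walk_caches(const struct list_head *head, FILE *out)
{
    size_t n = 0;
    const size_t off = offsetof(struct demo_cache, list);

    assert(head != NULL && out != NULL);
    for (const struct list_head *p = head->next; p != head; p = p->next) {
        const struct demo_cache *c =
            (const struct demo_cache *)((const char *)p - off);
        fprintf(out, "%p  name = \"%s\"\n", (const void *)c, c->name);
        n++;
    }
    return n;
}

/* Link an entry at the tail of the list so there is something to walk. */
static void add_cache(struct list_head *head, struct demo_cache *c)
{
    c->list.prev = head->prev;
    c->list.next = head;
    head->prev->next = &c->list;
    head->prev = &c->list;
}
```

The head itself is not embedded in any demo_cache, which is why the walk
skips it. That is the distinction "-H" expresses.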

I assume a similar method could be used for your particular situation.
Then the "struct" and "p" commands can be used to interpret data
structures you find.

> Given the collective expertise on this mailing list, I thought it would
> be the best place to seek guidance. Specifically, I would appreciate it
> if you could provide:
>
> Any relevant documentation, guides, or tutorials to debug platform
> drivers using the crash utility for kernel coredump analysis.
> Some simple examples of using the crash utility to debug platform
> drivers, if possible.

Unfortunately, debugging resources and guides are rather thin on the
ground, and there usually isn't one tailored to your particular
subsystem. I don't know of a resource specific to platform devices, so
you'll need to combine guides from other areas with your own knowledge
of the subsystem. Also, rely heavily on crash's built-in "help"
command.

> Any important points or common pitfalls to keep in mind while performing
> this kind of analysis.
> Any other tips, best practices, or recommendations to effectively debug
> platform drivers using the crash utility would also be greatly appreciated.

One more point: when crash has a helper tailored to your use case, it's
a superpower and makes debugging tasks a breeze. When there's no helper
for your subsystem, the work is a lot more frustrating, since you're
mostly poring over struct listings. Unfortunately, it's also a bit
difficult to write new crash helpers.

If you're comfortable with Python, then I'd recommend drgn [1] to you.
It's a Python library which allows very natural access to the vmcore's
variables and data structures, so you can write your own helpers in
Python to explore the subsystem you care about. You'll find that many
of the people on this mailing list are quite familiar with drgn as
well :)

Good luck in debugging!
Stephen

[1]: https://github.com/osandov/drgn

> Thank you for your time and assistance. I look forward to hearing from you.
>
> Best regards,
> Talel, Shenhar.