Re: [PATCH] ACPI: PHAT: Add Platform Health Assessment Table support

From: Rafael J. Wysocki
Date: Mon Aug 21 2023 - 14:01:27 EST


On Mon, Aug 21, 2023 at 7:52 PM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
>
> On Mon, Aug 21, 2023 at 7:35 PM Limonciello, Mario
> <mario.limonciello@xxxxxxx> wrote:
> >
> >
> >
> > On 8/21/2023 12:29 PM, Rafael J. Wysocki wrote:
> > > On Mon, Aug 21, 2023 at 7:17 PM Limonciello, Mario
> > > <mario.limonciello@xxxxxxx> wrote:
> > >>
> > >> On 8/21/2023 12:12 PM, Rafael J. Wysocki wrote:
> > >> <snip>
> > >>>> I was just talking to some colleagues about PHAT recently as well.
> > >>>>
> > >>>> The use case that jumps out is "system randomly rebooted while I was
> > >>>> doing XYZ". You don't know what happened, but you keep using your
> > >>>> system. Then it happens again.
> > >>>>
> > >>>> If the reason for the random reboot is captured to dmesg you can cross
> > >>>> reference your journal from the next boot after any random reboot and
> > >>>> get the reason for it. If a user reports this to a Gitlab issue tracker
> > >>>> or Bugzilla it can be helpful in establishing a pattern.
> > >>>>
> > >>>>>> The below location may be appropriate in that case:
> > >>>>>> /sys/firmware/acpi/
> > >>>>>
> > > >>>>> Yes, it may.
> > > >>>>>
> > >>>>>> We already have FPDT and BGRT being exported from there.
> > >>>>>
> > >>>>> In fact, all of the ACPI tables can be retrieved verbatim from
> > >>>>> /sys/firmware/acpi/tables/ already, so why exactly do you want the
> > >>>>> kernel to parse PHAT in particular?
> > >>>>>
> > >>>>
> > >>>> It's not to say that /sys/firmware/acpi/PHAT isn't useful, but having
> > >>>> something internal to the kernel "automatically" parsing it and saving
> > >>>> information to a place like the kernel log that is already captured by
> > >>>> existing userspace tools I think is "more" useful.
> > >>>
> > >>> What existing user space tools do you mean? Is there anything already
> > >>> making use of the kernel's PHAT output?
> > >>>
> > >>
> > > >> I meant that things like systemd already capture the kernel log
> > >> ringbuffer. If you save stuff like this into the kernel log, it's going
> > >> to be indexed and easier to grep for boots that had it.
> > >>
> > >>> And why can't user space simply parse PHAT by itself?
> > > >>>
> > > >>> There are multiple ACPI tables that could be dumped into the kernel
> > >>> log, but they aren't. Guess why.
> > >>
> > > >> Right; there's no reason it can't be done by userspace directly.
> > >>
> > >> Another way to approach this problem could be to modify tools that
> > >> excavate records from a reboot to also get PHAT. For example
> > >> systemd-pstore will get any kernel panics from the previous boot from
> > >> the EFI pstore and put them into /var/lib/systemd/pstore.
> > >>
> > >> No reason that couldn't be done automatically for PHAT too.
> > >
> > > I'm not sure about the connection between the PHAT dump in the kernel
> > > log and pstore.
> > >
> > > The PHAT dump would be from the time before the failure, so it is
> > > unclear to me how useful it can be for diagnosing it. However, after
> > > a reboot one should be able to retrieve PHAT data from the table
> > > directly and that may include some information regarding the failure.
> >
> > Right so the thought is that at bootup you get the last entry from PHAT
> > and save that into the log.
> >
> > Let's say you have 3 boots:
> > X - Triggered a random reboot
> > Y - Cleanly shut down
> > Z - Boot after a clean shut down
> >
> > So on boot Y you would have in your logs the reason that boot X rebooted.
>
> Yes, and the same can be retrieved from the PHAT directly from user
> space at that time, can't it?
>
> > On boot Z you would see something about boot Y's reason.
> >
> > >
> > > With pstore, the assumption is that there will be some information
> > > relevant for diagnosing the failure in the kernel buffer, but I'm not
> > > sure how the PHAT dump from before the failure can help here?
> >
> > Alone it's not useful.
> > I had figured if you can put it together with other data it's useful.
> > For example if you had some thermal data in the logs showing which
> > component overheated or if you looked at pstore and found a NULL pointer
> > dereference.
>
> IIUC, the current PHAT content can be useful. The PHAT content from
> boot X (before the failure), which is what will be there in pstore
> after the random reboot, is of limited value AFAICS.

To be more precise, I don't see why the kernel needs to be made a
man-in-the-middle between the firmware which is the source of the
information and user space that consumes it.
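
As a rough illustration of the user-space route, below is a minimal
sketch that walks the records of a raw PHAT image without any kernel
parsing. It assumes the standard 36-byte ACPI table header and a
5-byte record header (16-bit type, 16-bit length, 8-bit revision) as
I understand the ACPI specification to describe it; that layout
should be checked against the spec revision in use. The names
parse_phat() and read_phat() are hypothetical, and the record
payloads are returned undecoded.

```python
import struct

# Standard ACPI table header (36 bytes, little-endian, no padding):
# signature, length, revision, checksum, OEM ID, OEM table ID,
# OEM revision, creator ID, creator revision.
ACPI_HEADER = struct.Struct("<4sIBB6s8sI4sI")

# Assumed PHAT record header: type (u16), length (u16), revision (u8).
RECORD_HEADER = struct.Struct("<HHB")


def parse_phat(raw):
    """Return a list of (type, revision, payload) tuples from a raw PHAT image."""
    if len(raw) < ACPI_HEADER.size:
        raise ValueError("table is shorter than an ACPI header")
    signature, length = ACPI_HEADER.unpack_from(raw)[:2]
    if signature != b"PHAT":
        raise ValueError("not a PHAT table: %r" % signature)

    # Never read past the buffer, even if the header claims a larger length.
    end = min(length, len(raw))
    offset = ACPI_HEADER.size
    records = []
    while offset + RECORD_HEADER.size <= end:
        rtype, rlen, rev = RECORD_HEADER.unpack_from(raw, offset)
        if rlen < RECORD_HEADER.size or offset + rlen > end:
            break  # malformed record length; stop instead of guessing
        payload = raw[offset + RECORD_HEADER.size:offset + rlen]
        records.append((rtype, rev, payload))
        offset += rlen
    return records


def read_phat(path="/sys/firmware/acpi/tables/PHAT"):
    """Read the raw table from sysfs (typically readable by root only)."""
    with open(path, "rb") as f:
        return f.read()
```

On a system that exposes the table, parse_phat(read_phat()) returns
the record list, which a user space tool could then decode and log
after each boot.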