Re: [PATCH] ACPI: PHAT: Add Platform Health Assessment Table support

From: Yazen Ghannam
Date: Mon Aug 21 2023 - 14:56:35 EST


On 8/21/23 2:01 PM, Rafael J. Wysocki wrote:
> On Mon, Aug 21, 2023 at 7:52 PM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
>>
>> On Mon, Aug 21, 2023 at 7:35 PM Limonciello, Mario
>> <mario.limonciello@xxxxxxx> wrote:
>>>
>>>
>>>
>>> On 8/21/2023 12:29 PM, Rafael J. Wysocki wrote:
>>>> On Mon, Aug 21, 2023 at 7:17 PM Limonciello, Mario
>>>> <mario.limonciello@xxxxxxx> wrote:
>>>>>
>>>>> On 8/21/2023 12:12 PM, Rafael J. Wysocki wrote:
>>>>> <snip>
>>>>>>> I was just talking to some colleagues about PHAT recently as well.
>>>>>>>
>>>>>>> The use case that jumps out is "system randomly rebooted while I was
>>>>>>> doing XYZ". You don't know what happened, but you keep using your
>>>>>>> system. Then it happens again.
>>>>>>>
>>>>>>> If the reason for the random reboot is captured to dmesg you can cross
>>>>>>> reference your journal from the next boot after any random reboot and
>>>>>>> get the reason for it. If a user reports this to a Gitlab issue tracker
>>>>>>> or Bugzilla it can be helpful in establishing a pattern.
>>>>>>>
>>>>>>>>> The below location may be appropriate in that case:
>>>>>>>>> /sys/firmware/acpi/
>>>>>>>>
>>>>>>>> Yes, it may. >
>>>>>>>>> We already have FPDT and BGRT being exported from there.
>>>>>>>>
>>>>>>>> In fact, all of the ACPI tables can be retrieved verbatim from
>>>>>>>> /sys/firmware/acpi/tables/ already, so why exactly do you want the
>>>>>>>> kernel to parse PHAT in particular?
>>>>>>>>
>>>>>>>
>>>>>>> It's not to say that /sys/firmware/acpi/PHAT isn't useful, but having
>>>>>>> something internal to the kernel "automatically" parsing it and saving
>>>>>>> information to a place like the kernel log that is already captured by
>>>>>>> existing userspace tools I think is "more" useful.
>>>>>>
>>>>>> What existing user space tools do you mean? Is there anything already
>>>>>> making use of the kernel's PHAT output?
>>>>>>
>>>>>
>>>>> I was meaning things like systemd already capture the kernel long
>>>>> ringbuffer. If you save stuff like this into the kernel log, it's going
>>>>> to be indexed and easier to grep for boots that had it.
>>>>>
>>>>>> And why can't user space simply parse PHAT by itself?
>>>>>> > There are multiple ACPI tables that could be dumped into the kernel
>>>>>> log, but they aren't. Guess why.
>>>>>
>>>>> Right; there's not reason it can't be done by userspace directly.
>>>>>
>>>>> Another way to approach this problem could be to modify tools that
>>>>> excavate records from a reboot to also get PHAT. For example
>>>>> systemd-pstore will get any kernel panics from the previous boot from
>>>>> the EFI pstore and put them into /var/lib/systemd/pstore.
>>>>>
>>>>> No reason that couldn't be done automatically for PHAT too.
>>>>
>>>> I'm not sure about the connection between the PHAT dump in the kernel
>>>> log and pstore.
>>>>
>>>> The PHAT dump would be from the time before the failure, so it is
>>>> unclear to me how useful it can be for diagnosing it. However, after
>>>> a reboot one should be able to retrieve PHAT data from the table
>>>> directly and that may include some information regarding the failure.
>>>
>>> Right so the thought is that at bootup you get the last entry from PHAT
>>> and save that into the log.
>>>
>>> Let's say you have 3 boots:
>>> X - Triggered a random reboot
>>> Y - Cleanly shut down
>>> Z - Boot after a clean shut down
>>>
>>> So on boot Y you would have in your logs the reason that boot X rebooted.
>>
>> Yes, and the same can be retrieved from the PHAT directly from user
>> space at that time, can't it?
>>
>>> On boot Z you would see something about how boot Y's reason.
>>>
>>>>
>>>> With pstore, the assumption is that there will be some information
>>>> relevant for diagnosing the failure in the kernel buffer, but I'm not
>>>> sure how the PHAT dump from before the failure can help here?
>>>
>>> Alone it's not useful.
>>> I had figured if you can put it together with other data it's useful.
>>> For example if you had some thermal data in the logs showing which
>>> component overheated or if you looked at pstore and found a NULL pointer
>>> dereference.
>>
>> IIUC, the current PHAT content can be useful. The PHAT content from
>> boot X (before the failure) which is what will be there in pstore
>> after the random reboot, is of limited value AFAICS.
>
> To be more precise, I don't see why the kernel needs to be made a
> man-in-the-middle between the firmware which is the source of the
> information and user space that consumes it.

I think that's a fair point.

Is there a preferred set of tools that can be updated?

If not, would it make sense to develop a set of common kernel tools for
this?

In my experience, it seems many folks use tools from their vendors or
custom tools.

Thanks,
Yazen