Re: [PATCH 2/2] acpi, apei: use appropriate pgprot_t to map GHES memory

From: Zhang, Jonathan Zhixiong
Date: Tue Aug 25 2015 - 13:30:16 EST

On 8/25/2015 1:59 AM, Ingo Molnar wrote:

* Zhang, Jonathan Zhixiong <zjzhang@xxxxxxxxxxxxxx> wrote:

On 8/22/2015 2:24 AM, Ingo Molnar wrote:

* Jonathan (Zhixiong) Zhang <zjzhang@xxxxxxxxxxxxxx> wrote:

From: "Jonathan (Zhixiong) Zhang" <zjzhang@xxxxxxxxxxxxxx>

With ACPI APEI firmware first handling, generic hardware error
record is updated by firmware in GHES memory region. On an arm64
platform, firmware updates GHES memory region with uncached
access attribute, and then Linux reads stale data from cache.

This paragraph *still* doesn't parse for me. It's not any English
I can recognize: what is a 'With ACPI APEI firmware first handling'?

APEI is the ACPI Platform Error Interface; it is the part of the ACPI spec
that defines hardware error handling. "Firmware first handling" is a term
used in APEI. It describes a mechanism in which, when a hardware error
happens, the firmware intercepts and handles the error, formulates a
hardware error record and writes the record to the GHES memory region,
then notifies the kernel through an NMI or interrupt. The kernel GHES
driver then grabs the error record from the GHES memory region.

Argh. So how about translating that to English and putting that misnomer into
scare quotes, and saying something like:

If the ACPI APEI firmware handles the error first (called "firmware first
handling"), the generic hardware error record is updated by the firmware in the
GHES memory region.

( Also note all the missing articles I added for readability. The rest of the
changelog is missing articles as well. )

Thank you very much, Ingo. Your input is taken.

... plus what this changelog still doesn't mention is the most important part
of any bug fix description: how does the user notice this in practice and why
does he care?

The changelog mentioned that Linux would read stale data from the cache. When
stale data is read, the kernel reports that there is no new hardware error when
there actually is one.

Note that this is the most valuable sentence so far, in this whole changelog and
discussion. And we needed how many emails to get to this point?

Obviously, saying 'stale data' in itself does not mean much - it could mean a
harmless inconsistency nobody really cares about, or in fact it could mean
something more serious:

Sure, makes sense.

[...] This may lead to further damage in various scenarios, such as data
corruption caused by error propagation.

Please outline this better. How users are affected in practice is far more
important than any other detail.

Yes, will do. I just sent out an update for your review.

Thanks,

Ingo


--
Jonathan (Zhixiong) Zhang
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project