Re: [PATCH v5] ACPI / APEI: fix the regression of synchronous external aborts occur in user-mode

From: James Morse
Date: Fri Jun 04 2021 - 10:19:07 EST


Hi Xiaofei Tan,

Sorry for the delayed response.
This still applies to, and builds on, v5.13-rc4.

On 10/12/2020 12:09, Xiaofei Tan wrote:
> After the commit 8fcc4ae6faf8 ("arm64: acpi: Make apei_claim_sea()
> synchronise with APEI's irq work") applied, do_sea() return directly
> for user-mode if apei_claim_sea() handled any error record. Therefore,
> each error record reported by the user-mode SEA must be effectively
> processed in APEI GHES driver.

If you describe it the other way round, it would be clearer what the problem here is.
Something like:
| Before commit 8fcc4ae6faf8 ("arm64: acpi: Make apei_claim_sea() synchronise
| with APEI's irq work"), do_sea() would unconditionally signal the affected task
| from the arch code. Since that change, the GHES driver sends the signals.
| This exposes a problem as errors the GHES driver doesn't understand are silently
| ignored.


> Currently, GHES driver only processes Memory Error Section.(Ignore PCIe
> Error Section, as it has nothing to do with SEA).

(you're starting to confuse me! - I went and checked before I realised you were talking to
me, not describing the code...)

> It is not enough. Because ARM Processor Error could also be used for SEA in some hardware
> platforms, such as Kunpeng9xx series. We can't ask them to switch to
> use Memory Error Section for two reasons:
> 1)The server was delivered to customers, and it will introduce
> compatibility issue.
> 2)It make sense to use ARM Processor Error Section. Because either
> cache or memory errors could generate SEA when consumed by a processor.

I think you just need to say:
| Existing firmware on Kunpeng9xx systems reports cache errors with the 'ARM Processor
| Error' CPER records.


Could you add something about why the silent-ignore is a problem? Do the errors get taken
again? Does user-space get stuck in this loop?


> Do memory failure handling for ARM Processor Error Section just like
> for Memory Error Section.

> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index fce7ade..0893968 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c

> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
> +{
> + struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
> + struct cper_arm_err_info *err_info;
> + bool queued = false;
> + int sec_sev, i;
> +
> + log_arm_hw_error(err);
> +
> + sec_sev = ghes_severity(gdata->error_severity);
> + if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
> + return false;
> +
> + err_info = (struct cper_arm_err_info *) (err + 1);
> + for (i = 0; i < err->err_info_num; i++, err_info++) {

err_info has a version and a length, so it's expected to be made bigger at some point.
It would be better to use the length instead of 'err_info++', or at least to break out of
the loop if a length > sizeof(*err_info) is seen.
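
To illustrate the length-based walk, here is a minimal userspace sketch. The names
'struct demo_err_info' and 'walk_err_info()' are made up for this example, and the
four-byte layout is a simplified stand-in, not the real struct cper_arm_err_info.
It steps by each entry's own length and gives up on an entry that is shorter than
the structure we know about, or that runs past the end of the buffer:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct cper_arm_err_info. */
struct demo_err_info {
	uint8_t version;
	uint8_t length;		/* size of this entry in bytes, as reported by firmware */
	uint8_t type;
	uint8_t reserved;
};

/*
 * Walk 'num' entries in 'buf', advancing by each entry's length field
 * rather than by sizeof(struct demo_err_info). The type of each entry is
 * stored in 'types_out', which must have room for 'num' values.
 *
 * Returns the number of entries visited, or -1 if an entry is malformed.
 */
static int walk_err_info(const uint8_t *buf, size_t buf_len,
			 unsigned int num, uint8_t *types_out)
{
	const uint8_t *p = buf;
	const uint8_t *end = buf + buf_len;
	unsigned int i;

	for (i = 0; i < num; i++) {
		struct demo_err_info info;

		/* Not enough room left for even the part we understand. */
		if ((size_t)(end - p) < sizeof(info))
			return -1;

		memcpy(&info, p, sizeof(info));

		/* Too short to be valid, or claims to extend past the buffer. */
		if (info.length < sizeof(info) ||
		    info.length > (size_t)(end - p))
			return -1;

		types_out[i] = info.type;

		/* A newer, larger entry is skipped correctly here. */
		p += info.length;
	}

	return (int)i;
}
```

With a fixed 'err_info++' stride, an entry that a newer firmware made larger would
cause every entry after it to be read from the wrong offset.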

With that:
Reviewed-by: James Morse <james.morse@xxxxxxx>


The following nits would make this easier to read:

> + bool is_cache = (err_info->type == CPER_ARM_CACHE_ERROR);
> + bool has_pa = (err_info->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR);

> + /*
> + * The field (err_info->error_info & BIT(26)) is fixed to set to
> + * 1 in some old firmware of HiSilicon Kunpeng920. We assume that
> + * firmware won't mix corrected errors in an uncorrected section,
> + * and don't filter out 'corrected' error here.
> + */
(Nothing reads err_info->error_info, I guess this is a warning to the next person to touch
this)


> + if (!is_cache || !has_pa) {
> + pr_warn_ratelimited(FW_WARN GHES_PFX
> + "Unhandled processor error type %s\n",
> + err_info->type < ARRAY_SIZE(cper_proc_error_type_strs) ?
> + cper_proc_error_type_strs[err_info->type] : "unknown error");
> + continue;

This is hard to read. The convention is to indent the extra lines to the relevant '('.
e.g.:
| pr_warn_ratelimited(FW_WARN GHES_PFX
|                     "Unhandled processor error type %s\n",

You could make it shorter by working out the error_type string earlier
e.g.:
| char *error_type = "unknown_error";
|
| if (err_info->type < ARRAY_SIZE(cper_proc_error_type_strs))
| 	error_type = cper_proc_error_type_strs[err_info->type];


> + }

> + if (ghes_do_memory_failure(err_info->physical_fault_addr, 0))
> + queued = true;

| if (it_returned_true())
| 	queued = true;

Looks funny, and if you moved this earlier, your pr_warn_ratelimited() would have an extra
level of indentation to play with.
i.e.:
| if (is_cache && has_pa) {
| 	queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
| 	continue;
| }


Thanks,

James