Re: [PATCH 2/4] lib: add error_report_notify to collect debugging tools' reports

From: Alexander Potapenko
Date: Fri Jan 15 2021 - 05:19:46 EST


On Thu, Jan 14, 2021 at 10:51 AM Alexander Potapenko <glider@xxxxxxxxxx> wrote:
>
> On Thu, Jan 14, 2021 at 1:06 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Wed, 13 Jan 2021 10:16:55 +0100 Alexander Potapenko <glider@xxxxxxxxxx> wrote:
> >
> > > With the introduction of various production error-detection tools, such as
> > > MTE-based KASAN and KFENCE, the need arises to efficiently notify the
> > > userspace OS components about kernel errors. Currently, no facility exists
> > > to notify userspace about a kernel error from such bug-detection tools.
> > > The problem is obviously not restricted to the above bug detection tools,
> > > and applies to any error reporting mechanism that does not panic the
> > > kernel; this series, however, will only add support for KASAN and KFENCE
> > > reporting.
> > >
> > > All such error reports appear in the kernel log. But, when such errors
> > > occur, userspace would normally need to read the entire kernel log and
> > > parse the relevant errors. This is error prone and inefficient, as
> > > userspace needs to continuously monitor the kernel log for error messages.
> > > On certain devices, this is unfortunately not acceptable. Therefore, we
> > > need to revisit how reports are propagated to userspace.
> > >
> > > The library added, error_report_notify (CONFIG_ERROR_REPORT_NOTIFY),
> > > solves the above by using the error_report_start/error_report_end tracing
> > > events and exposing the last report and the total report count to the
> > > userspace via /sys/kernel/error_report/last_report and
> > > /sys/kernel/error_report/report_count.
> > >
> > > Userspace apps can call poll(POLLPRI) on those files to get notified about
> > > the new reports without having to watch dmesg in a loop.
> >
> > It would be nice to see some user-facing documentation for this, under
> > Documentation/. How to use it, what the shortcomings are, etc.
>
> Good point, will do.
Added in v2.

> > For instance... what happens when userspace is slow reading
> > /sys/kernel/error_report/last_report? Does that file buffer multiple
> > reports? Does the previous one get overwritten? etc. Words on how
> > this obvious issue is handled...
>
> Yes, there can be issues with overwriting, and the recommended way to
> handle them would be to check the value in
> /sys/kernel/error_report/report_count before and after reading the
> report.

After looking closer, it turns out that sysfs retains the buffer
returned by the attribute's show() method, so a reader can finish
reading the whole report even if the file contents change meanwhile.
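
For illustration, a userspace consumer could look roughly like the
sketch below. It combines poll(POLLPRI) with the report_count check
suggested earlier. The helper names (read_file, read_count,
read_report_checked, wait_for_change) are hypothetical and not part of
the patch, and the sketch assumes report_count holds a single decimal
number. Treat it as a starting point, not as a reference implementation.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Paths taken from the patch description. */
#define LAST_REPORT  "/sys/kernel/error_report/last_report"
#define REPORT_COUNT "/sys/kernel/error_report/report_count"

/*
 * Read a file from offset 0 to EOF (or until buf is full) through a
 * fresh file descriptor. Returns the number of bytes stored, or -1 on
 * error. buf is NUL-terminated on success.
 */
ssize_t read_file(const char *path, char *buf, size_t len)
{
	size_t total = 0;
	ssize_t n = 0;
	int fd;

	assert(buf != NULL && len > 0);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	while (total < len - 1) {
		n = read(fd, buf + total, len - 1 - total);
		if (n <= 0)
			break;
		total += (size_t)n;
	}
	close(fd);
	if (n < 0)
		return -1;
	buf[total] = '\0';
	return (ssize_t)total;
}

/* Parse the counter file as a decimal number; returns -1 on error. */
long read_count(const char *path)
{
	char buf[32], *end;
	long val;

	if (read_file(path, buf, sizeof(buf)) <= 0)
		return -1;
	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno || end == buf || val < 0)
		return -1;
	return val;
}

/*
 * Read the report, sampling the counter before and after.
 * Returns 1 if the counter did not move, 0 if it did (more reports
 * arrived while we were reading), -1 on error.
 */
int read_report_checked(const char *count_path, const char *report_path,
			char *buf, size_t len, long *count)
{
	long before = read_count(count_path);
	long after;

	if (before < 0 || read_file(report_path, buf, len) < 0)
		return -1;
	after = read_count(count_path);
	if (after < 0)
		return -1;
	*count = after;
	return before == after;
}

/*
 * Block until the file signals a change or timeout_ms expires.
 * Returns 1 on notification, 0 on timeout, -1 on error.
 */
int wait_for_change(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret <= 0)
		return ret;
	return (pfd.revents & (POLLPRI | POLLERR)) ? 1 : 0;
}
```

A monitoring loop would open LAST_REPORT once, read from that descriptor
so that the next poll() blocks, call wait_for_change() on it, and then
call read_report_checked(REPORT_COUNT, LAST_REPORT, ...). A return value
of 0 only means that further reports arrived in between, so some of them
may have been missed; the report text that was read is still complete.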

> > There's really nothing "memory" specific about this? Any kernel
> > subsystem could use it?
>
> Indeed. Perhaps it's better to emphasize "production" here, because
> users of debugging tools are more or less happy with dmesg output.

Changed to "error reports from debugging tools".