Re: [PATCH 1/2] Generic hardware error reporting mechanism

From: Huang Ying
Date: Fri Nov 19 2010 - 03:45:23 EST


Sorry, I forgot to Cc: Greg for the device model part.

Best Regards,
Huang Ying

On Fri, 2010-11-19 at 16:10 +0800, Huang, Ying wrote:
> There are many hardware error detecting and reporting components in
> the kernel, including x86 Machine Check, PCIe AER, EDAC, APEI GHES,
> etc. Each one has its own error reporting implementation, including
> its user space interface, error record format, in-kernel buffer,
> etc. This patch provides a generic hardware error reporting
> mechanism to reduce the duplicated effort and add more common
> services.
>
>
> A highly extensible generic hardware error record data structure is
> defined to accommodate error information from various hardware
> error sources. The overall structure of an error record is as
> follows:
>
> -----------------------------------------------------------------
> | rcd hdr | sec 1 hdr | sec 1 data | sec 2 hdr | sec2 data | ...
> -----------------------------------------------------------------
>
> Several error sections can be incorporated into one error record to
> accumulate information from multiple hardware components related to
> one error. For example, for an error on a device on the secondary
> side of a PCIe bridge, it is useful to record error information from
> both the PCIe bridge and the PCIe device. Multiple sections can also
> be used to hold both the cooked and the raw error information, so
> that the cooked section provides the abstract information and the
> raw section ensures that no information is lost.
>
> There are "revision" (rev) and "length" fields in the record header
> and "type" and "length" fields in the section header, so the user
> space error daemon can skip unrecognized error records or error
> sections. This lets an old version of the error daemon work with a
> newer kernel.
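To illustrate the skipping described above, here is a minimal user
space sketch. It assumes the record and section layouts from
include/linux/herror_record.h in this patch, and that both "length"
fields include their own header (as herr_next_sec and
herr_record_for_each_section imply). The struct names with the _hdr
suffix and the helper herr_walk_record are hypothetical and only for
illustration; a real daemon would also check "rev" before trusting the
layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* User space mirrors of the headers in herror_record.h (assumed layout). */
struct herr_record_hdr {
	uint16_t length;	/* whole record, header included */
	uint16_t flags;
	uint16_t rev;
	uint8_t  severity;
	uint8_t  pad1;
	uint64_t id;
	uint64_t timestamp;
};

struct herr_section_hdr {
	uint16_t length;	/* whole section, header included */
	uint16_t flags;
	uint32_t type;
};

#define HERR_TYPE_PCIE_AER 0x0200	/* the only type this sketch knows */

/*
 * Hypothetical helper: walk one record in buf and return the number of
 * recognized sections, or -1 if the record is malformed. Sections with
 * an unknown type are stepped over using their length field and
 * counted in *skipped.
 */
int herr_walk_record(const uint8_t *buf, size_t buf_len, int *skipped)
{
	struct herr_record_hdr rcd;
	struct herr_section_hdr sec;
	size_t off;
	int known = 0;

	*skipped = 0;
	if (buf_len < sizeof(rcd))
		return -1;
	memcpy(&rcd, buf, sizeof(rcd));
	if (rcd.length < sizeof(rcd) || rcd.length > buf_len)
		return -1;

	for (off = sizeof(rcd); off < rcd.length; off += sec.length) {
		if (rcd.length - off < sizeof(sec))
			return -1;
		memcpy(&sec, buf + off, sizeof(sec));
		if (sec.length < sizeof(sec) || sec.length > rcd.length - off)
			return -1;
		if (sec.type == HERR_TYPE_PCIE_AER)
			known++;	/* a real daemon would decode it here */
		else
			(*skipped)++;	/* unknown type: skip by length */
	}
	return known;
}
```

The bounds checks are there because the daemon should not trust the
lengths blindly; an old daemon never needs to understand the payload
of a section it skips.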
>
> New error section types can be added to support new error types and
> error sources.
>
>
> The hardware error reporting mechanism designed by this patch
> integrates well with the device model in the kernel. struct
> dev_herr_info is defined and pointed to by the "error" field of
> struct device. It holds error reporting related information for
> each device. One sysfs directory, "error", is created for each
> hardware error reporting device, and some files for error reporting
> statistics and control are created in it. For example, the "error"
> directory for APEI GHES is as follows:
>
> /sys/devices/platform/GHES.0/error/logs
> /sys/devices/platform/GHES.0/error/overflows
> /sys/devices/platform/GHES.0/error/throttles
>
> Where "logs" is the number of error records logged; "throttles" is
> the number of error records not logged because the reporting rate is
> too high; "overflows" is the number of error records not logged
> because there is no space available. Writing to any of these files
> resets the corresponding counter to zero.
>
> Not all devices will report errors, so struct dev_herr_info and the
> sysfs directory/files are only allocated/created for devices that
> explicitly enable error reporting. So to enumerate the error sources
> of the system, you just need to look for an "error" directory in
> each device directory under /sys/devices.
>
>
> One device file (/dev/error/error), which mixes error records from
> all hardware error reporting devices, is created to convey error
> records from kernel space to user space. Because we are dealing with
> hardware devices, a device file is the most natural way to do that.
> Because hardware error reporting should not hurt system performance,
> the throughput of the interface should be kept low (this is done by
> the user space error daemon), so ordinary "read" is sufficient from
> a performance point of view.
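As a rough sketch of the daemon side of this interface: wait with
poll(), then pull a batch of records with an ordinary read(). This
assumes the records arrive packed back to back, each starting with its
16-bit "length" field, as herr_rcd_lists_read in this patch copies
them. The helper names herr_read_batch and herr_count_records and the
HERR_RECORD_HDR_LEN constant are hypothetical, not part of the patch.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Assumed sizeof(struct herr_record) for the layout in this patch. */
#define HERR_RECORD_HDR_LEN 24

/*
 * Wait until fd is readable, then read one batch of records.
 * Returns the number of bytes read, 0 on timeout, -1 on error
 * (errno set by poll() or read()).
 */
ssize_t herr_read_batch(int fd, uint8_t *buf, size_t size, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	ssize_t n;
	int rc;

	do {
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);
	if (rc <= 0)
		return rc;

	do {
		n = read(fd, buf, size);
	} while (n < 0 && errno == EINTR);
	return n;
}

/* Count the records packed in buf[0..len), or return -1 if truncated. */
int herr_count_records(const uint8_t *buf, size_t len)
{
	size_t off = 0;
	uint16_t rcd_len;
	int n = 0;

	while (off < len) {
		if (len - off < sizeof(rcd_len))
			return -1;
		/* "length" is the first field of the record header */
		memcpy(&rcd_len, buf + off, sizeof(rcd_len));
		if (rcd_len < HERR_RECORD_HDR_LEN || rcd_len > len - off)
			return -1;
		off += rcd_len;
		n++;
	}
	return n;
}
```

A daemon would open /dev/error/error and call herr_read_batch in a
loop. Since the record length is a 16-bit field, a 64 KiB buffer is
always large enough for at least one record; with a smaller buffer the
read in this patch can fail with EINVAL, and it can fail with EBUSY
while the kernel has the reader frozen, so the caller should be
prepared to retry.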
>
>
> The patch provides common services for hardware error reporting
> devices too.
>
> A lock-less hardware error record allocator is provided. So for
> hardware errors that can be ignored (such as corrected errors), there
> is no need to pre-allocate the error record or allocate it on the
> stack. Because the probability of two hardware parts failing
> simultaneously is very small, one big unified memory pool for
> hardware errors is better than one memory pool or buffer for each
> device.
>
> After filling in all necessary fields in the hardware error record,
> error reporting is quite straightforward: just call
> herr_record_report with the error record itself and the
> corresponding struct device.
>
> Hardware errors may come in bursts; for example, the same hardware
> error may be reported at a high rate within a short interval. This
> can use up all the pre-allocated memory for error reporting, so that
> other hardware errors from the same or different hardware devices
> cannot be logged. To deal with this issue, a throttle algorithm is
> implemented. The logging rate for errors from one hardware error
> device is throttled based on the available pre-allocated memory for
> error reporting. In this way we can log as many kinds of errors as
> possible from as many devices as possible.
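The arithmetic of the throttle can be tried out in user space. The
sketch below mirrors the calculation in herr_throttle from this patch
(base interval of 1 us, max ratio 21, spare ratio 3); the function
names herr_min_interval_ns and herr_should_log, the THROTTLE_*
constant names and the size == 0 guard are hypothetical additions for
illustration.

```c
#include <stdint.h>

#define THROTTLE_BASE_INTVL_NS	1000ULL	/* NSEC_PER_USEC */
#define THROTTLE_MAX_RATIO	21
#define THROTTLE_SPARE_RATIO	3

/*
 * Minimal interval (ns) required between two logged records from one
 * device, given the pool size and the bytes currently in use.
 * Returns 0 while less than 1/3 of the pool is used (no throttling).
 */
uint64_t herr_min_interval_ns(unsigned int size, unsigned int used)
{
	unsigned int ratio;

	if (size == 0 || THROTTLE_SPARE_RATIO * used < size)
		return 0;
	/* Scale the usage above the spare threshold to 0..MAX_RATIO */
	ratio = (used * THROTTLE_SPARE_RATIO - size) * THROTTLE_MAX_RATIO;
	ratio = ratio / (size * THROTTLE_SPARE_RATIO - size) + 1;
	/* The interval doubles with every step of the ratio */
	return (1ULL << ratio) * THROTTLE_BASE_INTVL_NS;
}

/* Nonzero if a record arriving at now_ns should be logged. */
int herr_should_log(uint64_t now_ns, uint64_t last_ns,
		    unsigned int size, unsigned int used)
{
	uint64_t min_intvl = herr_min_interval_ns(size, used);

	return min_intvl == 0 || now_ns - last_ns > min_intvl;
}
```

With the two pages per CPU used in this patch (8192 bytes, assuming
4 KiB pages), a half-full pool gives a minimal interval of 64 us, and
a completely full pool gives 2^22 us, about 4.2 seconds. So a device
that keeps reporting while the pool fills up is slowed down
exponentially, which leaves room for records from other devices.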
>
>
> This patch was designed by Andi Kleen and Huang Ying.
>
> Signed-off-by: Huang Ying <ying.huang@xxxxxxxxx>
> Reviewed-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
> ---
> drivers/Kconfig | 2
> drivers/Makefile | 1
> drivers/base/Makefile | 1
> drivers/base/herror.c | 98 ++++++++
> drivers/herror/Kconfig | 5
> drivers/herror/Makefile | 1
> drivers/herror/herr-core.c | 488 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/Kbuild | 1
> include/linux/device.h | 14 +
> include/linux/herror.h | 35 +++
> include/linux/herror_record.h | 100 ++++++++
> kernel/Makefile | 1
> 12 files changed, 747 insertions(+)
> create mode 100644 drivers/base/herror.c
> create mode 100644 drivers/herror/Kconfig
> create mode 100644 drivers/herror/Makefile
> create mode 100644 drivers/herror/herr-core.c
> create mode 100644 include/linux/herror.h
> create mode 100644 include/linux/herror_record.h
>
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
> source "drivers/staging/Kconfig"
>
> source "drivers/platform/Kconfig"
> +
> +source "drivers/herror/Kconfig"
> endmenu
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -115,3 +115,4 @@ obj-$(CONFIG_VLYNQ) += vlynq/
> obj-$(CONFIG_STAGING) += staging/
> obj-y += platform/
> obj-y += ieee802154/
> +obj-$(CONFIG_HERR_CORE) += herror/
> --- a/drivers/base/Makefile
> +++ b/drivers/base/Makefile
> @@ -18,6 +18,7 @@ ifeq ($(CONFIG_SYSFS),y)
> obj-$(CONFIG_MODULES) += module.o
> endif
> obj-$(CONFIG_SYS_HYPERVISOR) += hypervisor.o
> +obj-$(CONFIG_HERR_CORE) += herror.o
>
> ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
>
> --- /dev/null
> +++ b/drivers/base/herror.c
> @@ -0,0 +1,98 @@
> +/*
> + * Hardware error reporting related functions
> + *
> + * Copyright 2010 Intel Corp.
> + * Author: Huang Ying <ying.huang@xxxxxxxxx>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation;
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +
> +#define HERR_COUNTER_ATTR(_name) \
> + static ssize_t herr_##_name##_show(struct device *dev, \
> + struct device_attribute *attr, \
> + char *buf) \
> + { \
> + int counter; \
> + \
> + counter = atomic_read(&dev->error->_name); \
> + return sprintf(buf, "%d\n", counter); \
> + } \
> + static ssize_t herr_##_name##_store(struct device *dev, \
> + struct device_attribute *attr, \
> + const char *buf, \
> + size_t count) \
> + { \
> + atomic_set(&dev->error->_name, 0); \
> + return count; \
> + } \
> + static struct device_attribute herr_attr_##_name = \
> + __ATTR(_name, 0600, herr_##_name##_show, \
> + herr_##_name##_store)
> +
> +HERR_COUNTER_ATTR(logs);
> +HERR_COUNTER_ATTR(overflows);
> +HERR_COUNTER_ATTR(throttles);
> +
> +static struct attribute *herr_attrs[] = {
> + &herr_attr_logs.attr,
> + &herr_attr_overflows.attr,
> + &herr_attr_throttles.attr,
> + NULL,
> +};
> +
> +static struct attribute_group herr_attr_group = {
> + .name = "error",
> + .attrs = herr_attrs,
> +};
> +
> +static void device_herr_init(struct device *dev)
> +{
> + atomic_set(&dev->error->logs, 0);
> + atomic_set(&dev->error->overflows, 0);
> + atomic_set(&dev->error->throttles, 0);
> + atomic64_set(&dev->error->timestamp, 0);
> +}
> +
> +int device_enable_error_reporting(struct device *dev)
> +{
> + int rc;
> +
> + BUG_ON(dev->error);
> + dev->error = kzalloc(sizeof(*dev->error), GFP_KERNEL);
> + if (!dev->error)
> + return -ENOMEM;
> + device_herr_init(dev);
> + rc = sysfs_create_group(&dev->kobj, &herr_attr_group);
> + if (rc)
> + goto err;
> + return 0;
> +err:
> + kfree(dev->error);
> + dev->error = NULL;
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(device_enable_error_reporting);
> +
> +void device_disable_error_reporting(struct device *dev)
> +{
> + if (dev->error) {
> + sysfs_remove_group(&dev->kobj, &herr_attr_group);
> + kfree(dev->error);
> + }
> +}
> +EXPORT_SYMBOL_GPL(device_disable_error_reporting);
> --- /dev/null
> +++ b/drivers/herror/Kconfig
> @@ -0,0 +1,5 @@
> +config HERR_CORE
> + bool "Hardware error reporting"
> + depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
> + select LLIST
> + select GENERIC_ALLOCATOR
> --- /dev/null
> +++ b/drivers/herror/Makefile
> @@ -0,0 +1 @@
> +obj-y += herr-core.o
> --- /dev/null
> +++ b/drivers/herror/herr-core.c
> @@ -0,0 +1,488 @@
> +/*
> + * Generic hardware error reporting support
> + *
> + * This file provides some common services for hardware error
> + * reporting, including hardware error record lock-less allocator,
> + * error reporting mechanism, user space interface etc.
> + *
> + * Copyright 2010 Intel Corp.
> + * Author: Huang Ying <ying.huang@xxxxxxxxx>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation;
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/rculist.h>
> +#include <linux/mutex.h>
> +#include <linux/percpu.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/trace_clock.h>
> +#include <linux/uaccess.h>
> +#include <linux/poll.h>
> +#include <linux/ratelimit.h>
> +#include <linux/nmi.h>
> +#include <linux/llist.h>
> +#include <linux/genalloc.h>
> +#include <linux/herror.h>
> +
> +#define HERR_NOTIFY_BIT 0
> +
> +static unsigned long herr_flags;
> +
> +/*
> + * Record list management and error reporting
> + */
> +
> +struct herr_node {
> + struct llist_node llist;
> + struct herr_record ercd __attribute__((aligned(HERR_MIN_ALIGN)));
> +};
> +
> +#define HERR_NODE_LEN(rcd_len) \
> + ((rcd_len) + sizeof(struct herr_node) - sizeof(struct herr_record))
> +
> +#define HERR_MIN_ALLOC_ORDER HERR_MIN_ALIGN_ORDER
> +#define HERR_CHUNKS_PER_CPU 2
> +#define HERR_RCD_LIST_NUM 2
> +
> +struct herr_rcd_lists {
> + struct llist_head *write;
> + struct llist_head *read;
> + struct llist_head heads[HERR_RCD_LIST_NUM];
> +};
> +
> +static DEFINE_PER_CPU(struct herr_rcd_lists, herr_rcd_lists);
> +
> +static DEFINE_PER_CPU(struct gen_pool *, herr_gen_pool);
> +
> +static void herr_rcd_lists_init(void)
> +{
> + int cpu, i;
> + struct herr_rcd_lists *lists;
> +
> + for_each_possible_cpu(cpu) {
> + lists = per_cpu_ptr(&herr_rcd_lists, cpu);
> + for (i = 0; i < HERR_RCD_LIST_NUM; i++)
> + init_llist_head(&lists->heads[i]);
> + lists->write = &lists->heads[0];
> + lists->read = &lists->heads[1];
> + }
> +}
> +
> +static void herr_pool_fini(void)
> +{
> + struct gen_pool *pool;
> + struct gen_pool_chunk *chunk;
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + pool = per_cpu(herr_gen_pool, cpu);
> + gen_pool_for_each_chunk(chunk, pool)
> + free_page(chunk->start_addr);
> + gen_pool_destroy(pool);
> + }
> +}
> +
> +static int herr_pool_init(void)
> +{
> + struct gen_pool **pool;
> + int cpu, rc, nid, i;
> + unsigned long addr;
> +
> + for_each_possible_cpu(cpu) {
> + pool = per_cpu_ptr(&herr_gen_pool, cpu);
> + rc = -ENOMEM;
> + nid = cpu_to_node(cpu);
> + *pool = gen_pool_create(HERR_MIN_ALLOC_ORDER, nid);
> + if (!*pool)
> + goto err_pool_fini;
> + for (i = 0; i < HERR_CHUNKS_PER_CPU; i++) {
> + rc = -ENOMEM;
> + addr = __get_free_page(GFP_KERNEL);
> + if (!addr)
> + goto err_pool_fini;
> + rc = gen_pool_add(*pool, addr, PAGE_SIZE, nid);
> + if (rc)
> + goto err_pool_fini;
> + }
> + }
> +
> + return 0;
> +err_pool_fini:
> + herr_pool_fini();
> + return rc;
> +}
> +
> +/* Max interval: about 2 second */
> +#define HERR_THROTTLE_BASE_INTVL NSEC_PER_USEC
> +#define HERR_THROTTLE_MAX_RATIO 21
> +#define HERR_THROTTLE_MAX_INTVL \
> + ((1ULL << HERR_THROTTLE_MAX_RATIO) * HERR_THROTTLE_BASE_INTVL)
> +/*
> + * Pool size/used ratio considered spare, before this, interval
> + * between error reporting is ignored. After this, minimal interval
> + * needed is increased exponentially to max interval.
> + */
> +#define HERR_THROTTLE_SPARE_RATIO 3
> +
> +static int herr_throttle(struct device *dev)
> +{
> + struct gen_pool *pool;
> + unsigned long long last, now, min_intvl;
> + unsigned int size, used, ratio;
> +
> + pool = __get_cpu_var(herr_gen_pool);
> + size = gen_pool_size(pool);
> + used = size - gen_pool_avail(pool);
> + if (HERR_THROTTLE_SPARE_RATIO * used < size)
> + goto pass;
> + now = trace_clock_local();
> + last = atomic64_read(&dev->error->timestamp);
> + ratio = (used * HERR_THROTTLE_SPARE_RATIO - size) * \
> + HERR_THROTTLE_MAX_RATIO;
> + ratio = ratio / (size * HERR_THROTTLE_SPARE_RATIO - size) + 1;
> + min_intvl = (1ULL << ratio) * HERR_THROTTLE_BASE_INTVL;
> + if ((long long)(now - last) > min_intvl)
> + goto pass;
> + atomic_inc(&dev->error->throttles);
> + return 0;
> +pass:
> + return 1;
> +}
> +
> +static u64 herr_record_next_id(void)
> +{
> + static atomic64_t seq = ATOMIC64_INIT(0);
> +
> + if (!atomic64_read(&seq))
> + atomic64_set(&seq, (u64)get_seconds() << 32);
> +
> + return atomic64_inc_return(&seq);
> +}
> +
> +void herr_record_init(struct herr_record *ercd)
> +{
> + ercd->flags = 0;
> + ercd->rev = HERR_RCD_REV1_0;
> + ercd->id = herr_record_next_id();
> + ercd->timestamp = trace_clock_local();
> +}
> +EXPORT_SYMBOL_GPL(herr_record_init);
> +
> +struct herr_record *herr_record_alloc(unsigned int len, struct device *dev,
> + unsigned int flags)
> +{
> + struct gen_pool *pool;
> + struct herr_node *enode;
> + struct herr_record *ercd = NULL;
> +
> + BUG_ON(!dev->error);
> + preempt_disable();
> + if (!(flags & HERR_ALLOC_NO_THROTTLE)) {
> + if (!herr_throttle(dev)) {
> + preempt_enable_no_resched();
> + return NULL;
> + }
> + }
> +
> + pool = __get_cpu_var(herr_gen_pool);
> + enode = (struct herr_node *)gen_pool_alloc(pool, HERR_NODE_LEN(len));
> + if (enode) {
> + ercd = &enode->ercd;
> + herr_record_init(ercd);
> + ercd->length = len;
> +
> + atomic64_set(&dev->error->timestamp, trace_clock_local());
> + atomic_inc(&dev->error->logs);
> + } else
> + atomic_inc(&dev->error->overflows);
> + preempt_enable_no_resched();
> +
> + return ercd;
> +}
> +EXPORT_SYMBOL_GPL(herr_record_alloc);
> +
> +int herr_record_report(struct herr_record *ercd, struct device *dev)
> +{
> + struct herr_rcd_lists *lists;
> + struct herr_node *enode;
> +
> + preempt_disable();
> + lists = this_cpu_ptr(&herr_rcd_lists);
> + enode = container_of(ercd, struct herr_node, ercd);
> + llist_add(&enode->llist, lists->write);
> + preempt_enable_no_resched();
> +
> + set_bit(HERR_NOTIFY_BIT, &herr_flags);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(herr_record_report);
> +
> +void herr_record_free(struct herr_record *ercd)
> +{
> + struct herr_node *enode;
> + struct gen_pool *pool;
> +
> + enode = container_of(ercd, struct herr_node, ercd);
> + pool = get_cpu_var(herr_gen_pool);
> + gen_pool_free(pool, (unsigned long)enode,
> + HERR_NODE_LEN(enode->ercd.length));
> + put_cpu_var(herr_gen_pool);
> +}
> +EXPORT_SYMBOL_GPL(herr_record_free);
> +
> +/*
> + * The low 16 bit is freeze count, high 16 bit is thaw count. If they
> + * are not equal, someone is freezing the reader
> + */
> +static u32 herr_freeze_thaw;
> +
> +/*
> + * Stop the reader from consuming error records, so that the error
> + * records can be checked in kernel space safely.
> + */
> +static void herr_freeze_reader(void)
> +{
> + u32 old, new;
> +
> + do {
> + new = old = herr_freeze_thaw;
> + new = ((new + 1) & 0xffff) | (old & 0xffff0000);
> + } while (cmpxchg(&herr_freeze_thaw, old, new) != old);
> +}
> +
> +static void herr_thaw_reader(void)
> +{
> + u32 old, new;
> +
> + do {
> + old = herr_freeze_thaw;
> + new = old + 0x10000;
> + } while (cmpxchg(&herr_freeze_thaw, old, new) != old);
> +}
> +
> +static int herr_reader_is_frozen(void)
> +{
> + u32 freeze_thaw = herr_freeze_thaw;
> + return (freeze_thaw & 0xffff) != (freeze_thaw >> 16);
> +}
> +
> +int herr_for_each_record(herr_traverse_func_t func, void *data)
> +{
> + int i, cpu, rc = 0;
> + struct herr_rcd_lists *lists;
> + struct herr_node *enode;
> +
> + preempt_disable();
> + herr_freeze_reader();
> + for_each_possible_cpu(cpu) {
> + lists = per_cpu_ptr(&herr_rcd_lists, cpu);
> + for (i = 0; i < HERR_RCD_LIST_NUM; i++) {
> + struct llist_head *head = &lists->heads[i];
> + llist_for_each_entry(enode, head->first, llist) {
> + rc = func(&enode->ercd, data);
> + if (rc)
> + goto out;
> + }
> + }
> + }
> +out:
> + herr_thaw_reader();
> + preempt_enable_no_resched();
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(herr_for_each_record);
> +
> +static ssize_t herr_rcd_lists_read(char __user *ubuf, size_t usize,
> + struct mutex *read_mutex)
> +{
> + int cpu, rc = 0, read;
> + struct herr_rcd_lists *lists;
> + struct gen_pool *pool;
> + ssize_t len, rsize = 0;
> + struct herr_node *enode;
> + struct llist_head *old_read;
> + struct llist_node *to_read;
> +
> + do {
> + read = 0;
> + for_each_possible_cpu(cpu) {
> + lists = per_cpu_ptr(&herr_rcd_lists, cpu);
> + pool = per_cpu(herr_gen_pool, cpu);
> + if (llist_empty(lists->read)) {
> + if (llist_empty(lists->write))
> + continue;
> + /*
> + * Error records are output in batch, so old
> + * error records can be output before new ones.
> + */
> + old_read = lists->read;
> + lists->read = lists->write;
> + lists->write = old_read;
> + }
> + rc = rsize ? 0 : -EBUSY;
> + if (herr_reader_is_frozen())
> + goto out;
> + to_read = llist_del_first(lists->read);
> + if (herr_reader_is_frozen())
> + goto out_readd;
> + enode = llist_entry(to_read, struct herr_node, llist);
> + len = enode->ercd.length;
> + rc = rsize ? 0 : -EINVAL;
> + if (len > usize - rsize)
> + goto out_readd;
> + rc = -EFAULT;
> + if (copy_to_user(ubuf + rsize, &enode->ercd, len))
> + goto out_readd;
> + gen_pool_free(pool, (unsigned long)enode,
> + HERR_NODE_LEN(len));
> + rsize += len;
> + read = 1;
> + }
> + if (need_resched()) {
> + mutex_unlock(read_mutex);
> + cond_resched();
> + mutex_lock(read_mutex);
> + }
> + } while (read);
> + rc = 0;
> +out:
> + return rc ? rc : rsize;
> +out_readd:
> + llist_add(to_read, lists->read);
> + goto out;
> +}
> +
> +static int herr_rcd_lists_is_empty(void)
> +{
> + int cpu, i;
> + struct herr_rcd_lists *lists;
> +
> + for_each_possible_cpu(cpu) {
> + lists = per_cpu_ptr(&herr_rcd_lists, cpu);
> + for (i = 0; i < HERR_RCD_LIST_NUM; i++) {
> + if (!llist_empty(&lists->heads[i]))
> + return 0;
> + }
> + }
> + return 1;
> +}
> +
> +
> +/*
> + * Hardware Error Mix Reporting Device
> + */
> +
> +static int herr_major;
> +static DECLARE_WAIT_QUEUE_HEAD(herr_mix_wait);
> +
> +static char *herr_devnode(struct device *dev, mode_t *mode)
> +{
> + return kasprintf(GFP_KERNEL, "error/%s", dev_name(dev));
> +}
> +
> +struct class herr_class = {
> + .name = "error",
> + .devnode = herr_devnode,
> +};
> +EXPORT_SYMBOL_GPL(herr_class);
> +
> +void herr_notify(void)
> +{
> + if (test_and_clear_bit(HERR_NOTIFY_BIT, &herr_flags))
> + wake_up_interruptible(&herr_mix_wait);
> +}
> +EXPORT_SYMBOL_GPL(herr_notify);
> +
> +static ssize_t herr_mix_read(struct file *filp, char __user *ubuf,
> + size_t usize, loff_t *off)
> +{
> + int rc;
> + static DEFINE_MUTEX(read_mutex);
> +
> + if (*off != 0)
> + return -EINVAL;
> +
> + rc = mutex_lock_interruptible(&read_mutex);
> + if (rc)
> + return rc;
> + rc = herr_rcd_lists_read(ubuf, usize, &read_mutex);
> + mutex_unlock(&read_mutex);
> +
> + return rc;
> +}
> +
> +static unsigned int herr_mix_poll(struct file *file, poll_table *wait)
> +{
> + poll_wait(file, &herr_mix_wait, wait);
> + if (!herr_rcd_lists_is_empty())
> + return POLLIN | POLLRDNORM;
> + return 0;
> +}
> +
> +static const struct file_operations herr_mix_dev_fops = {
> + .owner = THIS_MODULE,
> + .read = herr_mix_read,
> + .poll = herr_mix_poll,
> +};
> +
> +static int __init herr_mix_dev_init(void)
> +{
> + struct device *dev;
> + dev_t devt;
> +
> + devt = MKDEV(herr_major, 0);
> + dev = device_create(&herr_class, NULL, devt, NULL, "error");
> + if (IS_ERR(dev))
> + return PTR_ERR(dev);
> +
> + return 0;
> +}
> +device_initcall(herr_mix_dev_init);
> +
> +static int __init herr_core_init(void)
> +{
> + int rc;
> +
> + BUILD_BUG_ON(sizeof(struct herr_node) % HERR_MIN_ALIGN);
> + BUILD_BUG_ON(sizeof(struct herr_record) % HERR_MIN_ALIGN);
> + BUILD_BUG_ON(sizeof(struct herr_section) % HERR_MIN_ALIGN);
> +
> + herr_rcd_lists_init();
> +
> + rc = herr_pool_init();
> + if (rc)
> + goto err;
> +
> + rc = class_register(&herr_class);
> + if (rc)
> + goto err_free_pool;
> +
> + rc = herr_major = register_chrdev(0, "error", &herr_mix_dev_fops);
> + if (rc < 0)
> + goto err_free_class;
> +
> + return 0;
> +err_free_class:
> + class_unregister(&herr_class);
> +err_free_pool:
> + herr_pool_fini();
> +err:
> + return rc;
> +}
> +/* Initialize data structure used by device driver, so subsys_initcall */
> +subsys_initcall(herr_core_init);
> --- a/include/linux/Kbuild
> +++ b/include/linux/Kbuild
> @@ -141,6 +141,7 @@ header-y += gigaset_dev.h
> header-y += hdlc.h
> header-y += hdlcdrv.h
> header-y += hdreg.h
> +header-y += herror_record.h
> header-y += hid.h
> header-y += hiddev.h
> header-y += hidraw.h
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -394,6 +394,14 @@ extern int devres_release_group(struct d
> extern void *devm_kzalloc(struct device *dev, size_t size, gfp_t gfp);
> extern void devm_kfree(struct device *dev, void *p);
>
> +/* Device hardware error reporting related information */
> +struct dev_herr_info {
> + atomic_t logs;
> + atomic_t overflows;
> + atomic_t throttles;
> + atomic64_t timestamp;
> +};
> +
> struct device_dma_parameters {
> /*
> * a low level driver may set these to teach IOMMU code about
> @@ -422,6 +430,9 @@ struct device {
> void *platform_data; /* Platform specific data, device
> core doesn't touch it */
> struct dev_pm_info power;
> +#ifdef CONFIG_HERR_CORE
> + struct dev_herr_info *error; /* Hardware error reporting info */
> +#endif
>
> #ifdef CONFIG_NUMA
> int numa_node; /* NUMA node this device is close to */
> @@ -523,6 +534,9 @@ static inline bool device_async_suspend_
> return !!dev->power.async_suspend;
> }
>
> +extern int device_enable_error_reporting(struct device *dev);
> +extern void device_disable_error_reporting(struct device *dev);
> +
> static inline void device_lock(struct device *dev)
> {
> mutex_lock(&dev->mutex);
> --- /dev/null
> +++ b/include/linux/herror.h
> @@ -0,0 +1,35 @@
> +#ifndef LINUX_HERROR_H
> +#define LINUX_HERROR_H
> +
> +#include <linux/types.h>
> +#include <linux/list.h>
> +#include <linux/device.h>
> +#include <linux/herror_record.h>
> +
> +/*
> + * Hardware error reporting
> + */
> +
> +#define HERR_ALLOC_NO_THROTTLE 0x0001
> +
> +struct herr_dev;
> +
> +/* allocate a herr_record lock-lessly */
> +struct herr_record *herr_record_alloc(unsigned int len,
> + struct device *dev,
> + unsigned int flags);
> +void herr_record_init(struct herr_record *ercd);
> +/* report error */
> +int herr_record_report(struct herr_record *ercd, struct device *dev);
> +/* free the herr_record allocated before */
> +void herr_record_free(struct herr_record *ercd);
> +/*
> + * Notify the waiting user space hardware error daemon of the new
> + * error record; cannot be used in NMI context
> + */
> +void herr_notify(void);
> +
> +/* Traverse all error records not consumed by user space */
> +typedef int (*herr_traverse_func_t)(struct herr_record *ercd, void *data);
> +int herr_for_each_record(herr_traverse_func_t func, void *data);
> +#endif
> --- /dev/null
> +++ b/include/linux/herror_record.h
> @@ -0,0 +1,100 @@
> +#ifndef LINUX_HERROR_RECORD_H
> +#define LINUX_HERROR_RECORD_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * Hardware Error Record Definition
> + */
> +enum herr_severity {
> + HERR_SEV_NONE,
> + HERR_SEV_CORRECTED,
> + HERR_SEV_RECOVERABLE,
> + HERR_SEV_FATAL,
> +};
> +
> +#define HERR_RCD_REV1_0 0x0100
> +#define HERR_MIN_ALIGN_ORDER 3
> +#define HERR_MIN_ALIGN (1 << HERR_MIN_ALIGN_ORDER)
> +
> +enum herr_record_flags {
> + HERR_RCD_PREV = 0x0001, /* record is for previous boot */
> + HERR_RCD_PERSIST = 0x0002, /* record is from flash, need to be
> + * cleared after writing to disk */
> +};
> +
> +/*
> + * sizeof(struct herr_record) and sizeof(struct herr_section) should
> + * be multiple of HERR_MIN_ALIGN to make error record packing easier.
> + */
> +struct herr_record {
> + __u16 length;
> + __u16 flags;
> + __u16 rev;
> + __u8 severity;
> + __u8 pad1;
> + __u64 id;
> + __u64 timestamp;
> + __u8 data[0];
> +};
> +
> +/* Section type ID are allocated here */
> +enum herr_section_type_id {
> + /* 0x0 - 0xff are reserved by core */
> + /* 0x100 - 0x1ff are allocated to CPER */
> + HERR_TYPE_CPER = 0x0100,
> + HERR_TYPE_GESR = 0x0110, /* acpi_hest_generic_status */
> + /* 0x200 - 0x2ff are allocated to PCI/PCIe subsystem */
> + HERR_TYPE_PCIE_AER = 0x0200,
> +};
> +
> +struct herr_section {
> + __u16 length;
> + __u16 flags;
> + __u32 type;
> + __u8 data[0];
> +};
> +
> +#define herr_record_for_each_section(ercd, esec) \
> + for ((esec) = (struct herr_section *)(ercd)->data; \
> + (void *)(esec) - (void *)(ercd) < (ercd)->length; \
> + (esec) = (void *)(esec) + (esec)->length)
> +
> +#define HERR_SEC_LEN_ROUND(len) \
> + (((len) + HERR_MIN_ALIGN - 1) & ~(HERR_MIN_ALIGN - 1))
> +#define HERR_SEC_LEN(type) \
> + (sizeof(struct herr_section) + HERR_SEC_LEN_ROUND(sizeof(type)))
> +
> +#define HERR_RECORD_LEN_ROUND1(sec_len1) \
> + (sizeof(struct herr_record) + HERR_SEC_LEN_ROUND(sec_len1))
> +#define HERR_RECORD_LEN_ROUND2(sec_len1, sec_len2) \
> + (sizeof(struct herr_record) + HERR_SEC_LEN_ROUND(sec_len1) + \
> + HERR_SEC_LEN_ROUND(sec_len2))
> +#define HERR_RECORD_LEN_ROUND3(sec_len1, sec_len2, sec_len3) \
> + (sizeof(struct herr_record) + HERR_SEC_LEN_ROUND(sec_len1) + \
> + HERR_SEC_LEN_ROUND(sec_len2) + HERR_SEC_LEN_ROUND(sec_len3))
> +
> +#define HERR_RECORD_LEN1(sec_type1) \
> + (sizeof(struct herr_record) + HERR_SEC_LEN(sec_type1))
> +#define HERR_RECORD_LEN2(sec_type1, sec_type2) \
> + (sizeof(struct herr_record) + HERR_SEC_LEN(sec_type1) + \
> + HERR_SEC_LEN(sec_type2))
> +#define HERR_RECORD_LEN3(sec_type1, sec_type2, sec_type3) \
> + (sizeof(struct herr_record) + HERR_SEC_LEN(sec_type1) + \
> + HERR_SEC_LEN(sec_type2) + HERR_SEC_LEN(sec_type3))
> +
> +static inline struct herr_section *herr_first_sec(struct herr_record *ercd)
> +{
> + return (struct herr_section *)(ercd + 1);
> +}
> +
> +static inline struct herr_section *herr_next_sec(struct herr_section *esrc)
> +{
> + return (void *)esrc + esrc->length;
> +}
> +
> +static inline void *herr_sec_data(struct herr_section *esec)
> +{
> + return (void *)(esec + 1);
> +}
> +#endif
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -100,6 +100,7 @@ obj-$(CONFIG_FUNCTION_TRACER) += trace/
> obj-$(CONFIG_TRACING) += trace/
> obj-$(CONFIG_X86_DS) += trace/
> obj-$(CONFIG_RING_BUFFER) += trace/
> +obj-$(CONFIG_HERR_CORE) += trace/
> obj-$(CONFIG_SMP) += sched_cpupri.o
> obj-$(CONFIG_IRQ_WORK) += irq_work.o
> obj-$(CONFIG_PERF_EVENTS) += perf_event.o

