Re: [RFC PATCH] dax: add badblocks check to Device DAX

From: Kani, Toshimitsu
Date: Wed May 03 2017 - 18:41:41 EST


On Wed, 2017-05-03 at 14:48 -0700, Dan Williams wrote:
> On Wed, May 3, 2017 at 11:46 AM, Kani, Toshimitsu <toshi.kani@xxxxxxx
> > wrote:
> > On Wed, 2017-05-03 at 09:30 -0700, Dan Williams wrote:
> > > On Wed, May 3, 2017 at 9:09 AM, Kani, Toshimitsu <toshi.kani@hpe.
> > > com>
> > > wrote:
> > > > On Wed, 2017-05-03 at 08:52 -0700, Dan Williams wrote:
> > > > > On Wed, May 3, 2017 at 8:31 AM, Toshi Kani <toshi.kani@xxxxxx
> > > > > m>
> > > > > wrote:
> > > > > > This is a RFC patch for seeking suggestions.ÂÂIt adds
> > > > > > support of badblocks check in Device DAX by using region-
> > > > > > level badblocks list.ÂÂThis patch is only briefly tested.
> > > > > >
> > > > > > device_dax is a well-isolated self-contained module as it
> > > > > > calls alloc_dax() with dev_dax, which is private to
> > > > > > device_dax.ÂÂFor checking badblocks, it needs to call
> > > > > > dax_pmem to check with region-level badblocks.
> > > > > >
> > > > > > This patch attempts to keep device_dax self-contained.ÂÂIt
> > > > > > adds check_error() to dax_operations, and dax_check_error()
> > > > > > as a stub with *dev_dax and *dev pointers to convey it to
> > > > > > dax_pmem.ÂÂI am wondering if this is the right direction,
> > > > > > or we should change the modularity to let dax_pmem call
> > > > > > alloc_dax() with its dax_pmem (or I completely missed
> > > > > > something).
> > > > >
> > > > > The problem is that device-dax guarantees a given fault
> > > > > granularity. To make that guarantee we can't fallback from 1G
> > > > > or 2M mappings due to an error. We also can't reasonably go
> > > > > the other way and fail mappings that contain a badblock
> > > > > because that would change the blast radius of a media error
> > > > > to the fault size.
> > > >
> > > > Does it mean we expect users to have CPUs with MCE recovery for
> > > > Device DAX?ÂÂCan we add an attributes like allow error-check &
> > > > fall-back?
> > >
> > > Yes, without MCE recovery device-dax mappings that consume errors
> > > will reboot. If an application needs the kernel protection it
> > > should be using filesystem-dax.
> >
> > Understood.ÂÂAre we going to provide sysfs "badblocks" for Device
> > DAX as it is also needed for ndctl clear-error?
>
> No, I had started that way, but badblocks really needs write(2) or
> fallocate(PUNCH_HOLE) support for clearing errors. Since we don't
> want to support write(2) and were NAKd from supporting fallocate()
> the only interface that was left was sending clear-error-DSM ioctls
> directly to Âthe nvdimm bus. Since that is a very libnvdimm specific
> interface it made sense to then add badblocks at the libnvdimm-region
> level. The "ndctl clear-error" command is there to do the translation
> of error offsets in user space and supersedes the need for the kernel
> to carry a badblocks file for device-dax.

I am fine with using ndctl to clear errors. What I need is to allow an
application to avoid accessing to bad blocks by reading a sysfs file
and managing the bad blocks list by itself since the kernel does not
protect it at page faults. At least, data offset of Device DAX should
be provided for such application to do the translation by itself.

Thanks,
-Toshi