Re: Question about the "EXPERIMENTAL" tag for dax in XFS

From: Dave Chinner
Date: Sun Feb 28 2021 - 17:40:36 EST


On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > it points to, check if it points to the PMEM that is being removed,
> > grab the page it points to, map that to the relevant struct page,
> > run collect_procs() on that page, then kill the user processes that
> > map that page.
> >
> > So why can't we walk the ptescheck the physical pages that they
> > map to and if they map to a pmem page we go poison that
> > page and that kills any user process that maps it.
> >
> > i.e. I can't see how unexpected pmem device unplug is any different
> > to an MCE delivering a hwpoison event to a DAX mapped page.
>
> I guess the tradeoff is walking a long list of inodes vs walking a
> large array of pages.

Not really. You're assuming all a filesystem has to do is invalidate
everything if a device goes away, and that's not true. Finding if an
inode has a mapping that spans a specific device in a multi-device
filesystem can be a lot more complex than that. Just walking inodes
is easy - determining whihc inodes need invalidation is the hard
part.

That's where ->corrupt_range() comes in - the filesystem is already
set up to do reverse mapping from physical range to inode(s)
offsets...

> There's likely always more pages than inodes, but perhaps it's more
> efficient to walk the 'struct page' array than sb->s_inodes?

I really don't see you seem to be telling us that invalidation is an
either/or choice. There's more ways to convert physical block
address -> inode file offset and mapping index than brute force
inode cache walks....

.....

> > IOWs, what needs to happen at this point is very filesystem
> > specific. Assuming that "device unplug == filesystem dead" is not
> > correct, nor is specifying a generic action that assumes the
> > filesystem is dead because a device it is using went away.
>
> Ok, I think I set this discussion in the wrong direction implying any
> mapping of this action to a "filesystem dead" event. It's just a "zap
> all ptes" event and upper layers recover from there.

Yes, that's exactly what ->corrupt_range() is intended for. It
allows the filesystem to lock out access to the bad range
and then recover the data. Or metadata, if that's where the bad
range lands. If that recovery fails, it can then report a data
loss/filesystem shutdown event to userspace and kill user procs that
span the bad range...

FWIW, is this notification going to occur before or after the device
has been physically unplugged? i.e. what do we do about the
time-of-unplug-to-time-of-invalidation window where userspace can
still attempt to access the missing pmem though the
not-yet-invalidated ptes? It may not be likely that people just yank
pmem nvdimms out of machines, but with NVMe persistent memory
spaces, there's every chance that someone pulls the wrong device...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx